# Introduction

The machine learning life cycle is the cyclical process that data science projects follow. It defines each step that an organization should follow to take advantage of machine learning and artificial intelligence (AI) to derive practical business value.

## Why Is the Machine Learning Life Cycle Important?

The machine learning life cycle is important because it delineates the role of every person in a company in data science initiatives, ranging from business to engineering personnel. It takes each and every project from inception to completion and gives a high-level perspective of how an entire data science project should be structured in order to result in real, practical business value. Failing to accurately execute on any one of these steps will result in misleading insights or models with no practical value.


The machine learning life cycle has five major steps. All of them are equally important, and they are carried out in a specific order:

1. [Acquire and explore data](#Data)
2. [Prepare and explore data](#analysis)
3. Build and train the model
4. Interpret and communicate
5. Implement, document, and maintain


![Machine Learning Lifecycle](images/ml-lifecycle.png)


## Setup

In [None]:
import os
import glob
import pandas as pd
import numpy as np
import filenames
import missingno as msno
import matplotlib.pyplot as plt

pd.set_option('display.max_rows', 500, 'display.max_columns', 500,
              'display.width', 1000)

## Data

1. Data collection: Collect as much raw data as possible, regardless of quality. In the end, only a small subset of it will be annotated anyway, and annotation is where most of the cost comes from. It is useful to have a lot of data available to add as needed when problems arise with model performance.

2. Define your annotation schema: This is one of the most important parts of the data phase of the life cycle, and it often gets overlooked. A poorly constructed annotation schema will result in ambiguous classes and edge cases that make it much more difficult to train a model.

3. Data annotation: Annotation is a tedious process of performing the same task over and over for hours at a time, which is why annotation services are a booming business. The result is that annotators will likely make numerous mistakes. While most annotation firms guarantee a maximum error percentage (e.g., 2% max error), a larger problem is a poorly defined annotation schema that leads annotators to label the same kind of sample differently. This is harder for the QA team of an annotation firm to spot, and it is something that you need to check yourself.

4. Improve dataset and annotations: You will likely spend the majority of your time here when trying to improve model performance. If your model is learning but not performing well, the culprit is almost always a training dataset containing biases and mistakes that create a performance ceiling for your model. Improving your model generally involves things like hard sample mining (adding new training data similar to samples the model failed on), rebalancing your dataset based on biases your model has learned, and updating your annotations and schema to add new labels and refine existing ones.
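One way to check annotation consistency yourself is to have two annotators label the same samples and compare the results. The following is a minimal sketch under that assumption; the `labels` table, its column names, and the class names are hypothetical and not part of this notebook's data.

```python
import pandas as pd

# Hypothetical labels from two annotators for the same six samples
labels = pd.DataFrame({
    "annotator_a": ["cat", "dog", "dog", "cat", "bird", "dog"],
    "annotator_b": ["cat", "dog", "cat", "cat", "bird", "bird"],
})

# Fraction of samples on which both annotators chose the same label
agreement_rate = (labels["annotator_a"] == labels["annotator_b"]).mean()

# Cross-tabulation: off-diagonal cells show which classes get confused
confusion = pd.crosstab(labels["annotator_a"], labels["annotator_b"])

print(f"Agreement rate: {agreement_rate:.2f}")
print(confusion)
```

Classes that often land off the diagonal are good candidates for a clearer definition in the annotation schema.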

### Use glob to collect all the CSV files in the raw data folder

In [None]:
# profile_folder_path must be a pathlib.Path-like object, since .glob() is called on it
profile_files = filenames.profile_folder_path.glob("*.csv")

# Read every CSV file into its own DataFrame
profile_appended_data = [pd.read_csv(f) for f in profile_files]

# Stack the frames and renumber the rows from 0
df = pd.concat(profile_appended_data, ignore_index=True)

#### Drop duplicate userid

In [None]:
df = df.drop_duplicates(subset=['userid'], keep='last').reset_index(drop=True)

#### Create Label for Followers

In [None]:
import csv
fpath = filenames.followers_path
follower = []
with open(fpath, newline='') as f:
    for i in csv.reader(f):
        follower.append(i[0])

In [None]:
len(follower)

In [None]:
df['is_follower'] = df['username'].isin(follower).astype(int)

In [None]:
df

#### Username of extracted profiles to date

In [None]:
#df['username'].to_csv(filenames.todate_path, index = False)

#### Extract Usernames of similar content for traveltrackie

In [None]:
# stripping and splitting the list within a string without inverted commas
# similar_accounts = (
#     df.iloc[0]['similar_accounts']).strip("'['").rstrip("']'").split(",")

# similar_accounts_list = []
# for user in similar_accounts:
#     similar_accounts_list.append(user.split(" ")[2])

# print("Storing similar accounts into file.")
# with open(similar_accounts_path, 'w') as f:
#     for similar_account in similar_accounts_list:
#         print(similar_account, file=f)

## Analysis

After getting the data, data scientists have to prepare the raw data, perform data exploration, visualize data, transform data and possibly repeat the steps until it’s ready to use for modeling. Data preparation is cleansing and processing raw data before analysis. Before building any machine learning model, data scientists need to understand the available data. 

Raw data can be messy, duplicated or inaccurate. Data scientists explore the data available to them, then cleanse the data by identifying corrupt, inaccurate and incomplete data and replacing or deleting it.
In addition, data scientists need to determine if the data has labels or not. For example, if you
have a series of images and you want to develop a detection model to determine whether there is a car in the image, you need to have a set of images labeled whether there is a car in them and most likely need bounding boxes around the cars in the images. If the images lack labels, data scientists will have to label them. There are open source tools and commercial vendors that provide platforms for data labeling, as well as human labelers for hire. 

After the data is cleansed, data scientists explore the features (or variables) in their dataset, identify any relationships between the features, and decide which transformations are needed. Various tools for exploratory data analysis are available in open source libraries and analytics/data science platforms. A tool that performs statistical analysis of the dataset and generates plots of the features is useful in this step.

It is important to see what types of features are in the dataset. Numerical features can be floating point or integer values. Categorical features have a finite number of possible values, typically assigning data into groups. For example, if you have a dataset from a customer survey, the respondent’s gender (male or female) is a categorical feature. Ordinal features are categorical features with a set order or scale. For example, a customer satisfaction response (very satisfied, satisfied, indifferent, dissatisfied, and very dissatisfied) has a set order to it. You can convert that ordering into an integer scale (1 to 5).

After determining what kinds of features there are, the next step is to obtain the distribution of values for each feature and compute summary statistics for each one. Doing so helps answer the following questions about the dataset:

* Is the dataset skewed towards a range of values or a subset of categories?
* What are the minimum, maximum, mean, median and mode values of the feature?
* Are there missing values or invalid values such as null? If so, how many are there?
* Are there outliers in the dataset?
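The questions above can be answered with a few pandas calls. This is a minimal sketch on a hypothetical `survey` table (the column names and values are made up for illustration). The direction of the 1 to 5 scale, with 5 as the most satisfied, is also an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; None/NaN mark missing answers
survey = pd.DataFrame({
    "satisfaction": ["very satisfied", "satisfied", "indifferent",
                     "dissatisfied", None, "satisfied"],
    "spend": [120.0, 80.0, np.nan, 40.0, 60.0, 900.0],
})

# Ordinal feature: map the ordered categories onto an integer scale
scale = {"very dissatisfied": 1, "dissatisfied": 2, "indifferent": 3,
         "satisfied": 4, "very satisfied": 5}
survey["satisfaction_score"] = survey["satisfaction"].map(scale)

# Minimum, maximum, mean and median of a numerical feature (NaN is skipped)
summary = survey["spend"].agg(["min", "max", "mean", "median"])

# Mode (most frequent value) of the categorical feature
most_common = survey["satisfaction"].mode().iloc[0]

# Number of missing values per column
missing_counts = survey.isnull().sum()

# Skewness: a clearly positive value indicates a long right tail
skewness = survey["spend"].skew()

print(summary)
print("Mode:", most_common)
print(missing_counts)
print("Skewness:", skewness)
```

A mean far above the median, as with `spend` here, is a quick hint that a few large values are pulling the distribution to the right.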

During the data exploration step, it is helpful to plot the features and also plot the features
against each other to identify patterns in the dataset. This helps to determine the need for data
transformation. Some of the questions you need to answer are:

* How do you handle missing values? Do you want to fill in the values and if so, what approach do you plan to take to fill in for the missing value? Some approaches include taking the mean value, the median, the mode, nearby entry’s value and average of nearby entries’ values.
* How will you handle outliers?
* Are some of your features correlated with each other?
* Do you need to normalize the dataset or perform some other transformation to rescale the
data (e.g. log transformation)?
* What is your approach to a long tail of categorical values? Do you use them as-is, group them
in some meaningful way or ignore a subset of them altogether? 
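Each of these questions maps to a short pandas or NumPy operation. The sketch below shows one possible choice per question on a hypothetical `profiles` table. The median fill, the `log1p` transform, and the "fewer than 2 occurrences" cutoff are illustrative assumptions, not recommendations for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing value, an extreme value and rare categories
profiles = pd.DataFrame({
    "followers": [10.0, 250.0, np.nan, 40.0, 100000.0, 75.0],
    "followees": [15.0, 300.0, 120.0, 35.0, 900.0, 80.0],
    "category": ["Travel", "Travel", "Food", "Art", "Travel", "Music"],
})

# Missing values: fill with the median, which is less sensitive to outliers
median_followers = profiles["followers"].median()
profiles["followers_filled"] = profiles["followers"].fillna(median_followers)

# Rescaling: log1p compresses a long right tail and is defined at zero
profiles["followers_log"] = np.log1p(profiles["followers_filled"])

# Correlation between numerical features
correlation = profiles[["followers_filled", "followees"]].corr()

# Long tail of categories: group values seen fewer than 2 times into "Other"
counts = profiles["category"].value_counts()
rare = counts[counts < 2].index
profiles["category_grouped"] = profiles["category"].where(
    ~profiles["category"].isin(rare), "Other")

print(profiles)
print(correlation)
```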

During the data exploration step, you can identify patterns in your dataset for ideas about how to 
develop new features that would better represent the dataset. This is known as feature engineering. 
For example, if you have a traffic dataset for the number of vehicles passing through a major 
intersection at every hour, you might want to create a new feature categorizing the hour into 
different parts of the day, such as early morning, mid-morning, early afternoon, late afternoon, 
and nighttime.
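The traffic example could look like the following minimal sketch. The `traffic` table, the helper name `part_of_day`, and the hour boundaries are all assumptions chosen for illustration. The text does not define where one part of the day ends and the next begins.

```python
import pandas as pd

# Hypothetical hourly vehicle counts at one intersection
traffic = pd.DataFrame({
    "hour": [0, 5, 8, 11, 13, 17, 22],
    "vehicles": [40, 120, 900, 650, 700, 1100, 150],
})


def part_of_day(hour):
    """Map an hour (0-23) to a coarse part of the day (boundaries are assumed)."""
    if 5 <= hour < 9:
        return "early morning"
    if 9 <= hour < 12:
        return "mid-morning"
    if 12 <= hour < 15:
        return "early afternoon"
    if 15 <= hour < 19:
        return "late afternoon"
    return "nighttime"


# New engineered feature derived from the raw hour
traffic["part_of_day"] = traffic["hour"].apply(part_of_day)

# Average traffic per part of the day
print(traffic.groupby("part_of_day")["vehicles"].mean())
```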

For categorical features, it is often necessary to one-hot encode the feature. One-hot encoding means turning a categorical feature into binary features, one for each of the categories. For example, suppose you have a dataset of customers with a feature recording which state each customer comes from: Washington, Oregon, and California. One-hot encoding would produce two binary features, where one feature is whether a customer is from Washington or not, and the second feature is whether a customer is from Oregon or not. It is assumed that if the customer is not from Washington or Oregon, he or she is from California, so there is no need for a third feature.

Figure: Heatmap of how correlated the features are to each other, from a dataset with three types of wine and features of each wine.
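The state example can be sketched with `pandas.get_dummies`. The `customers` table is hypothetical. Passing `drop_first=True` drops the first category, so the category order is fixed explicitly here to make California the implied baseline.

```python
import pandas as pd

# Hypothetical customer data
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "state": ["Washington", "Oregon", "California", "Washington"],
})

# Fix the category order so that California comes first and is dropped
customers["state"] = pd.Categorical(
    customers["state"], categories=["California", "Washington", "Oregon"])

# One binary column per remaining category; California rows are all zeros
encoded = pd.get_dummies(customers, columns=["state"],
                         drop_first=True, dtype=int)

print(encoded)
```

Without `drop_first=True`, `get_dummies` would return three columns, one per state. Dropping one avoids a redundant column, because its value is fully determined by the other two.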


In [None]:
def describe_dataframe(df=pd.DataFrame()):
    """This function generates descriptive stats of a dataframe
    Args:
        df (dataframe): the dataframe to be analyzed
    Returns:
        None

    """
    print("\n\n")
    print("*" * 30)
    print("About the Data")
    print("*" * 30)

    print("Number of rows::", df.shape[0])
    print("Number of columns::", df.shape[1])
    print("\n")

    print("Column Names::", df.columns.values.tolist())
    print("\n")

    print("Column Data Types::\n", df.dtypes)
    print("\n")

    print("Columns with Missing Values::",
          df.columns[df.isnull().any()].tolist())
    print("\n")

    print("General Stats::")
    df.info()  # info() prints its report itself and returns None
    print("\n")

    print("Summary Stats::")
    print(df.describe())
    print("\n")

    print("Dataframe Sample Rows::")
    display(df.head(3))
    
describe_dataframe(df)

#### Checking Missing Values

In [None]:
msno.matrix(df);

`external_url` and `business_category_name` have the most missing values, followed by `biography` and `full_name`.

In [None]:
# The heatmap shows how strongly the missingness of one column is correlated with that of another.
# Positive correlation is drawn in blue: the darker the shade, the stronger the correlation.
msno.heatmap(df);

In [None]:
# Each bar is proportional to the number of non-missing values in that column; the counts are printed on top.
msno.bar(df)

#### Qualitative Variables

##### Counts for multiple columns

In [None]:
df[[
    'is_private', 'followed_by_viewer', 'is_business_account',
    'blocked_by_viewer', 'follows_viewer', 'has_blocked_viewer',
    'has_public_story', 'has_requested_viewer', 'is_verified',
    'requested_by_viewer', 'is_follower'
]].apply(pd.Series.value_counts)

#### Value Count for normalized values

In [None]:
# Relative frequency of each business category (append .plot(kind='pie') to draw a pie chart)
df['business_category_name'].value_counts(normalize=True)


##### Bar graph

In [None]:

df['business_category_name'].value_counts(normalize=True).plot(kind='bar')
plt.show()

#### Quantitative Variables

##### Optimum Number of Data Bins
With histograms, there are rules for determining the optimum number of bins (classes or intervals) into which a distribution of observations should be grouped. For example, Sturges' rule (1926) takes the optimum number of bins to be:

$$k = \left[\,1 + \log_2 n\,\right]$$

where $n$ is the sample size and the square brackets denote the integer part.

* https://stackoverflow.com/questions/3719631/log-to-the-base-2-in-python/28033134
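As a quick check of the rule, the sketch below computes the bin count directly from the formula, reading the brackets as the integer part, and compares it with the exponent returned by `math.frexp`, the trick described in the link above. `sturges_bins` is a hypothetical helper name used only for this illustration.

```python
import math


def sturges_bins(n):
    """Number of bins as the integer part of 1 + log2(n), for n >= 1."""
    return 1 + math.floor(math.log2(n))


for n in (1, 10, 100, 1000, 1024):
    # frexp(n) returns (mantissa, exponent) with n == mantissa * 2**exponent
    # and 0.5 <= mantissa < 1, so the exponent equals floor(log2(n)) + 1
    assert sturges_bins(n) == math.frexp(n)[1]
    print(n, sturges_bins(n))
```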

In [None]:
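import math


def optimum_bins(df, col):
    # Sturges' rule: for n >= 1, frexp(n)[1] equals floor(log2(n)) + 1
    return math.frexp(df[col].count())[1]


def optimum_bins_2(df, col, num):
    # Same rule, applied only to the rows where df[col] <= num
    return math.frexp(df[df[col] <= num][col].count())[1]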

##### Histogram of Media Count

In [None]:
df['mediacount'].hist(bins=optimum_bins(df, 'mediacount'))
plt.show()

In [None]:
# Prettier Histogram
df[df.mediacount <= 1000]["mediacount"].hist(density=True,
                                             bins=optimum_bins_2(
                                                 df, 'mediacount', 1000))
plt.show()

##### Histogram of IGTV Count

In [None]:
# Histogram
df['igtvcount'].hist(bins=50)
plt.show()

##### Histogram of Followees

In [None]:
# Histogram
df['followees'].hist(bins=optimum_bins(df, 'followees'))
plt.show()

In [None]:
# Prettier Histogram
df[df.followees <= 4000]["followees"].hist(density=True,
                                           bins=optimum_bins_2(
                                               df, 'followees', 4000))
plt.show()

##### Histogram of Followers

In [None]:
# Histogram
df['followers'].hist(bins=optimum_bins(df, 'followers'))
plt.show()

In [None]:
# Prettier Histogram
df[df.followers <= 6000]["followers"].hist(density=True,
                                           bins=optimum_bins_2(
                                               df, 'followers', 6000))
plt.show()

#### Detection of Outliers

Outliers in the dataset can be detected with the following methods:

* Z-score
* Scatter plots
* Interquartile range (IQR)
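Of these methods, the z-score is the shortest to sketch: a value is flagged when it lies more than a chosen number of standard deviations from the mean. The data below is hypothetical, and the threshold of 3 is a common convention rather than a fixed rule.

```python
import pandas as pd

# Hypothetical follower counts: 19 ordinary values and one extreme value
followers = pd.Series(list(range(100, 290, 10)) + [50000], name="followers")

# Z-score: distance from the mean in units of the sample standard deviation
z_scores = (followers - followers.mean()) / followers.std()

# Flag values more than 3 standard deviations away from the mean
outliers = followers[z_scores.abs() > 3]

print(outliers)
```

Because the mean and standard deviation are themselves pulled by extreme values, the z-score tends to be less reliable on strongly skewed data and on very small samples.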

##### Boxplot

In [None]:
df[['mediacount']].boxplot();

In [None]:
df.columns

In [None]:
df[['mediacount', 'is_follower']].boxplot(by='is_follower');

In [None]:
df[['followers']].boxplot();

In [None]:
# https://www.analyticsvidhya.com/blog/2021/05/feature-engineering-how-to-detect-and-remove-outliers-with-python-code/

In [None]:
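import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
plt.figure(figsize=(16,5))
plt.subplot(1,2,1)
# histplot with kde=True replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(df['followers'], kde=True, stat='density', bins=50)
plt.subplot(1,2,2)
sns.histplot(df['mediacount'], kde=True, stat='density', bins=50)
plt.show()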

In [None]:
df[['followers', 'is_follower']].boxplot(by='is_follower');

In [None]:
df[['followees']].boxplot();

In [None]:
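# histplot with kde=True replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(df['followees'], kde=True, stat='density', bins=50)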

In [None]:
df[['followees', 'is_follower']].boxplot(by='is_follower');

`mediacount`, `followers`, and `followees` are all right-skewed. We are going to use the IQR method to remove the outliers in the data.
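IQR-based removal could look like the minimal sketch below. The helper name `remove_iqr_outliers`, the `sample` table, and the multiplier `k=1.5` (the usual convention for box-plot whiskers) are assumptions made for illustration. This is not the notebook's own implementation.

```python
import pandas as pd


def remove_iqr_outliers(frame, columns, k=1.5):
    """Keep rows whose values lie within [Q1 - k*IQR, Q3 + k*IQR] for every column."""
    mask = pd.Series(True, index=frame.index)
    for col in columns:
        q1 = frame[col].quantile(0.25)
        q3 = frame[col].quantile(0.75)
        iqr = q3 - q1
        # between() is inclusive on both ends; rows with NaN in col are dropped
        mask &= frame[col].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask]


# Hypothetical data with one extreme value in each column
sample = pd.DataFrame({
    "mediacount": [10, 12, 15, 11, 14, 13, 900],
    "followers": [100, 120, 110, 5000, 130, 105, 115],
})

cleaned = remove_iqr_outliers(sample, ["mediacount", "followers"])
print(cleaned)
```

The fences are computed per column on the full table, so a row is removed as soon as any one of the listed columns falls outside its range.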

### Other columns

In [None]:
list(df['profile_pic_url'][0:3])

In [None]:
df.columns

In [None]:
list(df['external_url'].dropna()[0:6])

### Dropping Variables

We do not need `followed_by_viewer`, `blocked_by_viewer`, `follows_viewer`, `has_blocked_viewer`, `has_requested_viewer`, `is_verified`, and `requested_by_viewer`, as these columns have no variability. We also do not need `userid`, `igtvcount`, and `similar_accounts`, and `profile_pic_url` carries no useful information.

In [None]:
df.drop(columns=[
    'userid', 'followed_by_viewer', 'igtvcount', 'blocked_by_viewer',
    'follows_viewer', 'has_blocked_viewer', 'has_requested_viewer',
    'is_verified', 'requested_by_viewer', 'profile_pic_url', 'similar_accounts'
], inplace=True)
