# Machine Learning Workshop - Introduction

## What is data science?

["Data science is the discipline of making data useful. [...] the idea of usefulness is tightly coupled with influencing real-world actions."](https://hackernoon.com/what-on-earth-is-data-science-eb1237d8cb37) (Cassie Kozyrkov - Chief Decision Scientist at Google)

<p align="center">
  <img src="https://dss-www-production.s3.amazonaws.com/uploads/2018/12/Data-Science-fields-600x544.png" width="450" height="400"/>
</p>

<p align="center">
  <img src="https://hackernoon.com/hn-images/1*8Wz6lQ8GFEAvnSS5uqMQ5g.png"/>
</p>

## CRISP-DM (Cross Industry Standard Process for Data Mining)

<p align="center">
  <a href="https://www.datascience-pm.com/crisp-dm-2/">
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b9/CRISP-DM_Process_Diagram.png/440px-CRISP-DM_Process_Diagram.png">
  </a>
</p>

1. Business understanding – What does the business need?
2. Data understanding – What data do we have / need? Is it clean?
3. Data preparation – How do we organize the data for modeling?
4. Modeling – What modeling techniques should we apply?
5. Evaluation – Which model best meets the business objectives?
6. Deployment – How do stakeholders access the results?

## Techniques

<p align="center">
  <img src="https://www.researchgate.net/profile/T-Suryakanthi-2/publication/339639928/figure/fig1/AS:864826486161408@1583202110233/Broad-Classification-of-Machine-Learning-Techniques.png" width="600" height="400">
</p>

<p align="center">
  <img src="https://slidetodoc.com/presentation_image_h/14ad450855ff729f2ab4d8ef873b6ba3/image-8.jpg" width="600" height="400">
</p>

## What is an "algorithm" in machine learning?

An "algorithm" in machine learning is a procedure that is run on data to create a machine learning "model."

Machine learning algorithms perform "pattern recognition." Algorithms "learn" from data, or are "fit" on a dataset.

There are many machine learning algorithms: for classification, such as decision trees; for regression, such as linear regression; and for clustering, such as k-means.

## What is a "model" in machine learning?

A "model" in machine learning is the output of a machine learning algorithm run on data.

A model represents what was learned by a machine learning algorithm.

The model is the "thing" that is saved after running a machine learning algorithm on training data and represents the rules, numbers, and any other algorithm-specific data structures required to make predictions.

Some examples:

- The linear regression algorithm results in a model comprised of a vector of coefficients with specific values.
- The decision tree algorithm results in a model comprised of a tree of if-then statements with specific values.
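
To make the distinction concrete, the minimal sketch below fits two algorithms and prints what each resulting model stores. The tiny dataset is a made-up assumption for illustration (it follows y = 2x + 1) and is not part of the workshop data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical toy data following y = 2x + 1
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

# Algorithm: LinearRegression. Model: the fitted coefficients.
linear_model = LinearRegression().fit(X, y)
print("Coefficients:", linear_model.coef_, "Intercept:", linear_model.intercept_)

# Algorithm: DecisionTreeRegressor. Model: a tree of if-then rules.
tree_model = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
rules = export_text(tree_model, feature_names=["x"])
print(rules)
```

In both cases the algorithm is the procedure behind `fit`, while the model is the fitted object that is kept afterwards and used to predict.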

## Python library

<p align="center">
  <a href="https://scikit-learn.org/stable/">
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/0/05/Scikit_learn_logo_small.svg/500px-Scikit_learn_logo_small.svg.png">
  </a>
</p>

## A simple example

The problem we will solve is converting Celsius to Fahrenheit, where the formula is:

$$ f = c \times 1.8 + 32 $$

Of course, it would be simple enough to create a conventional Python function that directly performs this calculation, but that wouldn't be machine learning.

Instead, we will train a model that figures out the above formula through the training process.

Linear regression is a regression model that finds a linear function expressing the relationship between a dependent variable and one or more independent variables.

<p align="center">
  <img src="https://miro.medium.com/max/1276/0*CoAF7U14zw5hRgvu.png">
</p>

In [None]:
# Import the libraries
from sklearn.linear_model import LinearRegression
import numpy as np

In [None]:
# Define variables
celsius_X = np.array([-40, -10,  0,  8, 15, 22,  38],
                     dtype=float).reshape(-1, 1)
fahrenheit_Y = np.array([-40, 14, 32, 46, 59, 72, 100], dtype=float)


In [None]:
# Fit the algorithm on the data to obtain the model
reg = LinearRegression().fit(celsius_X, fahrenheit_Y)

In [None]:
# Predict
reg.predict(np.array([[100]]))

The correct answer is $100 \times 1.8 + 32 = 212$

In [None]:
print(reg.coef_)
print(reg.intercept_)
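
As a cross-check, the same slope and intercept can be recovered without scikit-learn by solving the ordinary least squares problem directly with NumPy. This is a minimal sketch, with variable names chosen here for illustration. Because the training values are rounded, the result should be close to, but not exactly, 1.8 and 32.

```python
import numpy as np

celsius = np.array([-40, -10, 0, 8, 15, 22, 38], dtype=float)
fahrenheit = np.array([-40, 14, 32, 46, 59, 72, 100], dtype=float)

# Design matrix with a column of ones so the intercept is estimated too
A = np.column_stack([celsius, np.ones_like(celsius)])

# Least-squares solution of A @ [slope, intercept] ~ fahrenheit
(slope, intercept), *_ = np.linalg.lstsq(A, fahrenheit, rcond=None)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
print(f"prediction for 100 C: {slope * 100 + intercept:.1f}")
```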

## Performance Measure

The performance measure is how you evaluate a solution to the problem. It is the measurement made on the predictions of a trained model on the test dataset, and it is typically specific to the class of problem (classification, regression, clustering, etc.).

For example, for regression, **_MAE_** (Mean Absolute Error):

$$ MAE = \frac{1}{n}\sum_{i=1}^{n}|y^{real}_i-y^{pred}_i| $$
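
The formula translates almost literally into NumPy. The sketch below uses made-up real and predicted values (an assumption for illustration) and compares the manual result with scikit-learn's `mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical real and predicted values
y_real = np.array([32.0, 59.0, 100.0, 212.0])
y_pred = np.array([31.0, 60.5, 100.0, 210.5])

# Mean of the absolute differences, as in the formula
mae_manual = np.mean(np.abs(y_real - y_pred))
mae_sklearn = mean_absolute_error(y_real, y_pred)
print(mae_manual, mae_sklearn)
```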

For classification, **_accuracy_**:

<p align="center">
  <img src="https://static.wixstatic.com/media/02a1ae_32cad84eaf3348059a8996d1b0f88627~mv2.jpg/v1/fill/w_597,h_416,al_c,q_90/02a1ae_32cad84eaf3348059a8996d1b0f88627~mv2.jpg" width="300" height="200">
</p>



$$ Accuracy = \frac{TP+TN}{TP+TN+FP+FN} $$
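
The four counts in the formula can be read from a confusion matrix. The sketch below uses made-up binary labels (an assumption for illustration) and checks the manual calculation against scikit-learn's `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical real and predicted labels
y_real = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_real, y_pred).ravel()
accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
accuracy_sklearn = accuracy_score(y_real, y_pred)
print(accuracy_manual, accuracy_sklearn)
```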

## A more realistic example

### But first... what is a Decision Tree?

<p align="center">
  <img src="https://miro.medium.com/max/410/1*3L1-LxytBXu_26s4Bk4dFg.png" width="300" height="200">
</p>

A decision tree is a learning algorithm used for both classification and regression tasks. It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes.
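
The structure can be inspected on a fitted tree. This minimal sketch assumes the iris dataset bundled with scikit-learn (not part of the workshop data) and a shallow tree so the printed rules stay readable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Total nodes (root + internal + leaves) and leaves only
n_nodes = tree.tree_.node_count
n_leaves = tree.get_n_leaves()
print(f"depth={tree.get_depth()}, nodes={n_nodes}, leaves={n_leaves}")

# Each indented line is a branch; lines with "class:" are leaf nodes
print(export_text(tree, feature_names=iris.feature_names))
```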

### and... what is boosting?

Traditionally, building a Machine Learning application consisted of taking a single learner, like a Logistic Regressor or a Decision Tree, feeding it data, and teaching it to perform a certain task through this data.

Then **ensemble** methods were born, which combine many learners to achieve better performance than any single one of them individually. These methods can be described as techniques that use a group of weak learners (those that on average achieve only slightly better results than a random model) together, in order to create a stronger, aggregated one.

Boosting models fall inside this family of ensemble methods.

Boosting, initially named _Hypothesis Boosting_, consists of filtering or weighting the data used to train our team of weak learners, so that each new learner gives more weight to, or is trained only on, observations that were poorly classified by the previous learners.

By doing this, our team of models learns to make accurate predictions on all kinds of data, not just on the most common or easy observations. Also, if one of the individual models is very bad at predicting some kind of observation, it matters little, as the other N-1 models will most likely make up for it.

Boosting should not be confused with Bagging, the other main family of ensemble methods: in bagging the weak learners are trained independently (and can therefore be trained in parallel) on random resamples of the data, while in boosting the learners are trained sequentially, so that each one can apply the data weighting/filtering described in the previous paragraphs.
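
The reweighting idea can be sketched by hand for a single boosting round. This is a minimal AdaBoost-style illustration on synthetic data (both are assumptions for illustration). It is not the algorithm used later in this workshop: `HistGradientBoostingClassifier` is a gradient boosting method, which fits each new tree to the gradients of the loss of the current ensemble instead of reweighting observations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Round 1: every observation starts with the same weight
weights = np.full(len(y), 1 / len(y))
stump_1 = DecisionTreeClassifier(max_depth=1, random_state=0)
stump_1.fit(X, y, sample_weight=weights)
missed = stump_1.predict(X) != y

# AdaBoost-style update: raise the weight of misclassified observations
error = np.clip(weights[missed].sum(), 1e-10, 1 - 1e-10)
alpha = 0.5 * np.log((1 - error) / error)
weights = weights * np.exp(np.where(missed, alpha, -alpha))
weights /= weights.sum()

# Round 2: the next weak learner is trained with the updated weights
stump_2 = DecisionTreeClassifier(max_depth=1, random_state=0)
stump_2.fit(X, y, sample_weight=weights)

print(f"weighted error of stump 1: {error:.3f}")
print(f"total weight now on its mistakes: {weights[missed].sum():.3f}")
```

With this particular update rule, the observations the first stump got wrong end up carrying half of the total weight (provided it made at least one mistake), so the second stump cannot ignore them.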

### The problem

Given the anonymized browsing flow of customers on a bank's website, we have to predict new conversions for a future period of time.

This was a [Kaggle](https://www.kaggle.com/competitions/banco-galicia-dataton-2019/overview/description) competition held in 2019.

Import the libraries/modules

In [None]:
import pandas as pd
# HistGradientBoostingClassifier is an implementation of a decision tree
# boosting algorithm
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

Load data

In [None]:
dataset = pd.concat([
    pd.read_csv("../data/pageviews.csv", parse_dates=["FEC_EVENT"]),
    pd.read_csv("../data/pageviews_complemento.csv",
                parse_dates=["FEC_EVENT"]),
])

Inspect data

In [None]:
dataset.head(10)

In [None]:
pd.options.display.float_format = "{:,.2f}".format
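# The datetime_is_numeric argument was removed in pandas 2.0, where datetime
# columns are always summarized as numeric (on pandas < 2.0, pass
# datetime_is_numeric=True to get the same behavior).
dataset.describe(include="all")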

1. The first thing we have to do is define how to structure the data and how to separate training from testing.

2. Then, as the prediction has to be made at the user level, we group each user's navigation so that we end up with one row per user.

3. Finally, for each of the explanatory variables (PAGE, CONTENT_CATEGORY, CONTENT_CATEGORY_TOP, CONTENT_CATEGORY_BOTTOM, SITE_ID, ON_SITE_SEARCH_TERM) we will:

    - count how often each value of the variable occurs for each user,
    - calculate the frequency ratio of each value relative to all the values the variable takes for that user (i.e., for PAGE = 1, we count the number of times the user visited PAGE 1 and divide it by that user's total visits to all PAGE values).
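
The counting and ratio steps above can be sketched on a tiny navigation log. The rows below are made up for illustration, and only the USER_ID and PAGE column names come from the dataset.

```python
import pandas as pd

# Hypothetical navigation log: one row per page view
views = pd.DataFrame({
    "USER_ID": [1, 1, 1, 2, 2],
    "PAGE": [1, 1, 3, 2, 3],
})

# Count of visits per user (rows) and page value (columns)
counts = pd.crosstab(views["USER_ID"], views["PAGE"])
counts.columns = [f"PAGE_{v}" for v in counts.columns]

# Ratio of each page value over the user's total visits (rows sum to 1)
ratios = counts.div(counts.sum(axis=1), axis=0)
print(ratios)
```

User 1 visited PAGE 1 twice out of three views, so `PAGE_1` is 2/3 for that row. Each user becomes a single row, whatever the number of page views.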

In [None]:
data = dataset[dataset["FEC_EVENT"].dt.month < 6]
print(f"The minimum date is {data['FEC_EVENT'].min()} and the maximum date is \
{data['FEC_EVENT'].max()}. \n")
train_data = []
for c in data.drop(["USER_ID", "FEC_EVENT"], axis=1).columns:
    print("Making", c)
    temp = pd.crosstab(data.USER_ID, data[c])
    temp.columns = [c + "_" + str(v) for v in temp.columns]
    train_data.append(temp.apply(lambda x: x / x.sum(), axis=1))
train_data = pd.concat(train_data, axis=1)
print(f"\nTrain shape is {train_data.shape}.")

In [None]:
data = dataset[dataset["FEC_EVENT"].dt.month.between(6, 9)]
print(f"The minimum date is {data['FEC_EVENT'].min()} and the maximum date is \
{data['FEC_EVENT'].max()}. \n")
test_data = []
for c in data.drop(["USER_ID", "FEC_EVENT"], axis=1).columns:
    print("Making", c)
    temp = pd.crosstab(data.USER_ID, data[c])
    temp.columns = [c + "_" + str(v) for v in temp.columns]
    test_data.append(temp.apply(lambda x: x / x.sum(), axis=1))
test_data = pd.concat(test_data, axis=1)
print(f"\nTest shape is {test_data.shape}.")

In [None]:
train_data.head()

In [None]:
filter_col = [col for col in train_data if col.startswith("PAGE")]
train_data[filter_col].iloc[0].sum()

Now that we have both datasets built, we are going to filter them, keeping the columns that exist in both, in order to train and predict on the same attributes.

In [None]:
# Sort the shared columns so their order is the same on every run
features = sorted(set(train_data.columns).intersection(test_data.columns))
train_data = train_data[features]
test_data = test_data[features]
print(f"Train shape is {train_data.shape}.")
print(f"Test shape is {test_data.shape}.")
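
The effect of this filtering can be seen on two toy frames. The column names and values below are hypothetical and only mimic the naming pattern of the generated features.

```python
import pandas as pd

# Hypothetical feature frames with partially overlapping columns
train_toy = pd.DataFrame({"PAGE_1": [0.5], "PAGE_2": [0.5], "SITE_ID_7": [1.0]})
test_toy = pd.DataFrame({"PAGE_2": [1.0], "PAGE_9": [0.0], "SITE_ID_7": [1.0]})

# Keep only the columns present in both frames, in a fixed order
shared = sorted(set(train_toy.columns) & set(test_toy.columns))
train_toy, test_toy = train_toy[shared], test_toy[shared]
print(shared)
```

Values seen in only one period (here `PAGE_1` and `PAGE_9`) are dropped, since a model cannot use a column at prediction time that it never saw during training.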

Now we load the **conversiones.csv** file, which contains the target variable and corresponds to the conversions made during 2018.

In [None]:
target = pd.read_csv("../data/conversiones.csv")
target.head()

We split the dataset again but looking 3 months ahead to align the prediction with the desired time window.

* train_data = 2018-01-01/2018-05-31, train_target = 2018-06-01/2018-09-30.
* test_data = 2018-06-01/2018-09-30, test_target = 2018-10-01/2018-12-31.

In [None]:
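# Users who converted between June and September are the positives in train
train_target = pd.Series(0, index=train_data.index)
train_converted = target.loc[target["mes"].between(6, 9), "USER_ID"].unique()
train_idx = set(train_converted).intersection(train_data.index)
train_target.loc[list(train_idx)] = 1

# Users who converted after September are the positives in test
test_target = pd.Series(0, index=test_data.index)
test_converted = target.loc[target["mes"] > 9, "USER_ID"].unique()
test_idx = set(test_converted).intersection(test_data.index)
test_target.loc[list(test_idx)] = 1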
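
The same labeling logic can be followed on a toy example. The user IDs and months below are made up for illustration, and only the USER_ID and mes column names come from the workshop data.

```python
import pandas as pd

# Hypothetical features indexed by USER_ID, and a toy conversions table
features_toy = pd.DataFrame({"PAGE_1": [0.2, 0.9, 0.4]}, index=[10, 11, 12])
conversions_toy = pd.DataFrame({"USER_ID": [11, 12, 13], "mes": [7, 11, 8]})

# Users converting in months 6-9 are positives; user 13 has no features
# and is ignored, while user 12 converted outside the window
in_window = conversions_toy["mes"].between(6, 9)
converted = conversions_toy.loc[in_window, "USER_ID"]
target_toy = pd.Series(features_toy.index.isin(converted).astype(int),
                       index=features_toy.index)
print(target_toy)
```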

In [None]:
print("Class distribution in train")
print(train_target.value_counts())

print("\nClass distribution in test")
print(test_target.value_counts())

Train model and predict

In [None]:
learner = HistGradientBoostingClassifier(random_state=0)
learner.fit(train_data, train_target)


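Classifiers in scikit-learn can return a confidence score for each class with `predict_proba` (when the algorithm supports it) or predict the target directly with `predict`. For a binary problem like this one, `predict` on this classifier is equivalent to applying a cutoff point of 0.5 to the predicted probability.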

In [None]:
learner.predict_proba(test_data)[0:10]

In [None]:
learner.predict(test_data)[0:10]

Performance

In [None]:
accuracy = accuracy_score(test_target, learner.predict(test_data))
print(f"Accuracy Score: {accuracy}")

In [None]:
print("Confusion Matrix (rows=real, columns=pred)")
print(
    pd.DataFrame(
        confusion_matrix(test_target, learner.predict(test_data)),
        columns=["NO", "YES"],
        index=["NO", "YES"],
    )
)