# Machine Learning Workshop - Binary Classification

## The problem

Given the anonymized navigation of customers on a bank's website, the goal is to predict new conversions over a given period of time.

This was a [Kaggle](https://www.kaggle.com/competitions/banco-galicia-dataton-2019/overview/description) competition held in 2019.

Import the libraries/modules

In [1]:
import pandas as pd
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

Load data

In [2]:
dataset = pd.concat([
    pd.read_csv("../data/pageviews.csv", parse_dates=["FEC_EVENT"]),
    pd.read_csv("../data/pageviews_complemento.csv",
    parse_dates=["FEC_EVENT"])
])

Inspect data

In [3]:
dataset.head(10)

Unnamed: 0,FEC_EVENT,PAGE,CONTENT_CATEGORY,CONTENT_CATEGORY_TOP,CONTENT_CATEGORY_BOTTOM,SITE_ID,ON_SITE_SEARCH_TERM,USER_ID
0,2018-03-30 07:35:48,1,1,1,1,1,1,0
1,2018-03-30 07:35:52,2,2,2,2,2,1,0
2,2018-03-30 07:36:11,3,2,2,2,3,1,0
3,2018-03-30 07:36:16,4,2,2,2,3,1,0
4,2018-03-30 07:41:38,5,2,2,2,2,1,0
5,2018-03-30 07:41:42,2,2,2,2,2,1,0
6,2018-03-30 07:42:01,3,2,2,2,3,1,0
7,2018-03-30 07:42:05,4,2,2,2,3,1,0
8,2018-03-30 07:43:43,3,2,2,2,3,1,0
9,2018-03-30 07:44:14,6,2,2,2,3,1,0


In [4]:
pd.options.display.float_format = "{:,.2f}".format
# Note: datetime_is_numeric is only accepted by pandas 1.x; it was removed in pandas 2.0.
dataset.describe(include="all", datetime_is_numeric=True)

Unnamed: 0,FEC_EVENT,PAGE,CONTENT_CATEGORY,CONTENT_CATEGORY_TOP,CONTENT_CATEGORY_BOTTOM,SITE_ID,ON_SITE_SEARCH_TERM,USER_ID
count,22870354,22870354.0,22870354.0,22870354.0,22870354.0,22870354.0,22870354.0,22870354.0
mean,2018-07-09 15:04:17.147753216,68.48,2.3,1.99,2.3,2.55,1.0,5654.89
min,2018-01-01 00:09:17,1.0,1.0,1.0,1.0,1.0,1.0,0.0
25%,2018-04-12 12:51:29,3.0,2.0,2.0,2.0,2.0,1.0,2730.0
50%,2018-07-11 14:21:00,21.0,2.0,2.0,2.0,3.0,1.0,5931.0
75%,2018-10-04 16:45:36,59.0,2.0,2.0,2.0,3.0,1.0,8468.0
max,2018-12-31 23:59:59,1835.0,68.0,13.0,68.0,4.0,295.0,11675.0
std,,170.64,1.95,0.45,1.95,0.64,0.71,3272.65


1. First, we define how to structure the data and how to split it into training and testing sets.

2. Then, since the prediction is made at the user level, we group each user's navigation so that the dataset has one row per user.

3. Finally, for each of the explanatory variables (PAGE, CONTENT_CATEGORY, CONTENT_CATEGORY_TOP, CONTENT_CATEGORY_BOTTOM, SITE_ID, ON_SITE_SEARCH_TERM) we will:

    - count how many times each value of the variable occurs for each user,
    - compute the ratio of each value's count to the user's total count for that variable (e.g., for PAGE = 1, we count the number of times the user visited PAGE 1 and divide it by that user's total visits across all PAGE values).
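
The per-user frequency ratio described in the last step can be sketched on a toy table. This is only an illustration with made-up values: the `toy` frame is hypothetical, and `normalize="index"` is used here as a shortcut for dividing each row by its own sum.

```python
import pandas as pd

# Hypothetical page views: user 0 visits page 1 twice and page 2 once,
# user 1 visits page 2 only.
toy = pd.DataFrame({
    "USER_ID": [0, 0, 0, 1],
    "PAGE": [1, 1, 2, 2],
})

# Count visits per (user, page), then divide each row by its total.
ratios = pd.crosstab(toy["USER_ID"], toy["PAGE"], normalize="index")
ratios.columns = [f"PAGE_{v}" for v in ratios.columns]
print(ratios)
```

Each row adds up to 1, which is the same property checked later for the PAGE columns of the real training set.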

In [5]:
data = dataset[dataset["FEC_EVENT"].dt.month < 6]
print(f"The minimum date is {data['FEC_EVENT'].min()} and the maximum date is \
{data['FEC_EVENT'].max()}. \n")
train_data = []
for c in data.drop(["USER_ID", "FEC_EVENT"], axis=1).columns:
    print("Making", c)
    temp = pd.crosstab(data.USER_ID, data[c])
    temp.columns = [c + "_" + str(v) for v in temp.columns]
    train_data.append(temp.apply(lambda x: x / x.sum(), axis=1))
train_data = pd.concat(train_data, axis=1)
print(f"\nTrain shape is {train_data.shape}.")

The minimum date is 2018-01-01 00:09:17 and the maximum date is 2018-05-31 23:59:58. 

Making PAGE
Making CONTENT_CATEGORY
Making CONTENT_CATEGORY_TOP
Making CONTENT_CATEGORY_BOTTOM
Making SITE_ID
Making ON_SITE_SEARCH_TERM

Train shape is (11201, 1824).


In [6]:
data = dataset[dataset["FEC_EVENT"].dt.month.between(6, 9)]
print(f"The minimum date is {data['FEC_EVENT'].min()} and the maximum date is \
{data['FEC_EVENT'].max()}. \n")
test_data = []
for c in data.drop(["USER_ID", "FEC_EVENT"], axis=1).columns:
    print("Making", c)
    temp = pd.crosstab(data.USER_ID, data[c])
    temp.columns = [c + "_" + str(v) for v in temp.columns]
    test_data.append(temp.apply(lambda x: x / x.sum(), axis=1))
test_data = pd.concat(test_data, axis=1)
print(f"\nTest shape is {test_data.shape}.")

The minimum date is 2018-06-01 00:00:02 and the maximum date is 2018-09-30 23:59:55. 

Making PAGE
Making CONTENT_CATEGORY
Making CONTENT_CATEGORY_TOP
Making CONTENT_CATEGORY_BOTTOM
Making SITE_ID
Making ON_SITE_SEARCH_TERM

Test shape is (11419, 1489).


In [7]:
train_data.head()

USER_ID,PAGE_1,PAGE_2,PAGE_3,PAGE_4,PAGE_5,PAGE_6,PAGE_7,PAGE_8,PAGE_9,PAGE_10,...,ON_SITE_SEARCH_TERM_284,ON_SITE_SEARCH_TERM_285,ON_SITE_SEARCH_TERM_286,ON_SITE_SEARCH_TERM_287,ON_SITE_SEARCH_TERM_288,ON_SITE_SEARCH_TERM_289,ON_SITE_SEARCH_TERM_290,ON_SITE_SEARCH_TERM_291,ON_SITE_SEARCH_TERM_292,ON_SITE_SEARCH_TERM_293
0,0.05,0.09,0.02,0.01,0.0,0.0,0.0,0.0,0.06,0.18,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.13,0.12,0.02,0.01,0.0,0.0,0.0,0.0,0.07,0.18,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.03,0.18,0.07,0.02,0.02,0.0,0.0,0.0,0.06,0.13,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.16,0.1,0.04,0.03,0.01,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.02,0.09,0.08,0.0,0.03,0.03,0.0,0.0,0.0,0.24,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
filter_col = [col for col in train_data if col.startswith("PAGE")]
train_data[filter_col].iloc[0].sum()

1.0

Now that we have both datasets built, we are going to filter them, keeping the columns that exist in both, in order to train and predict on the same attributes.
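
One detail worth keeping in mind is that a Python `set` has no guaranteed order, so the column order produced by a set intersection may differ between runs. A possible variant, sketched here on two hypothetical one-row frames, keeps the shared columns in the order in which they appear in the training set:

```python
import pandas as pd

# Hypothetical frames that share only PAGE_1 and PAGE_2.
train = pd.DataFrame({"PAGE_1": [0.5], "PAGE_2": [0.5], "PAGE_9": [0.0]})
test = pd.DataFrame({"PAGE_2": [1.0], "PAGE_1": [0.0], "PAGE_7": [0.0]})

# Keep the columns present in both frames, following the order in train.
shared = [c for c in train.columns if c in test.columns]
train, test = train[shared], test[shared]
print(shared)
```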

In [9]:
features = list(set(train_data.columns).intersection(set(test_data.columns)))
train_data = train_data[features]
test_data = test_data[features]
print(f"Train shape is {train_data.shape}.")
print(f"Test shape is {test_data.shape}.")

Train shape is (11201, 1287).
Test shape is (11419, 1287).


Now we load the **conversiones.csv** file that has the target variable and that corresponds to the conversions made during 2018.

In [10]:
target = pd.read_csv("../data/conversiones.csv")
target.head()

Unnamed: 0,mes,anio,USER_ID
0,7.0,2018.0,1410.0
1,8.0,2018.0,10755.0
2,8.0,2018.0,8270.0
3,10.0,2018.0,7558.0
4,9.0,2018.0,10731.0


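We split the dataset again, this time looking ahead: each feature window is paired with the conversions of the months that follow it (4 months for training, 3 months for testing), so that the prediction is aligned with the desired time window.

* train_data = 2018-01-01/2018-05-31, train_target = 2018-06-01/2018-09-30.
* test_data = 2018-06-01/2018-09-30, test_target = 2018-10-01/2018-12-31.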
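
The labeling rule behind these windows can be sketched on toy data: a user seen in the feature window gets label 1 if they appear among the conversions of the target window, and 0 otherwise. The `conv` and `users` objects below are hypothetical stand-ins for the conversions file and for the index of the feature table.

```python
import pandas as pd

# Hypothetical conversions table with the same columns as conversiones.csv.
conv = pd.DataFrame({"mes": [7, 8, 10], "USER_ID": [1, 2, 3]})
# Hypothetical users seen in the feature window.
users = pd.Index([1, 2, 3, 4], name="USER_ID")

# A user is positive if they converted in the target window (here months 6 to 9).
converted = conv.loc[conv["mes"].between(6, 9), "USER_ID"]
labels = pd.Series(users.isin(converted).astype(int), index=users)
print(labels)
```

User 3 converts in month 10, outside the window, so it is labeled 0 just like user 4, who never converts.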

In [11]:
train_target = pd.Series(0, index=train_data.index)
train_idx = set(target[target["mes"].between(
    6, 9)].USER_ID.unique()).intersection(set(train_data.index))
train_target.loc[list(train_idx)] = 1

test_target = pd.Series(0, index=test_data.index)
test_idx = set(target[target["mes"] > 9].USER_ID.unique()
               ).intersection(set(test_data.index))
test_target.loc[list(test_idx)] = 1

In [12]:
print("Class distribution in train")
print(train_target.value_counts())

print("\nClass distribution in test")
print(test_target.value_counts())

Class distribution in train
0    10704
1      497
dtype: int64

Class distribution in test
0    11033
1      386
dtype: int64


Train model and predict

In [13]:
learner = LGBMClassifier(
    random_state=0).fit(train_data, train_target)


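Estimators that follow the scikit-learn API can return a confidence score with `predict_proba` (when the estimator supports it) or predict the target directly with `predict`, which for a binary problem amounts to a cutoff point of 0.5 on the positive-class probability.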

In [14]:
learner.predict_proba(test_data)[0:10]

array([[9.96906420e-01, 3.09358022e-03],
       [9.99279814e-01, 7.20186469e-04],
       [9.99514361e-01, 4.85639499e-04],
       [9.83401599e-01, 1.65984007e-02],
       [9.96897116e-01, 3.10288367e-03],
       [9.97671708e-01, 2.32829226e-03],
       [9.99233649e-01, 7.66351400e-04],
       [9.92498929e-01, 7.50107099e-03],
       [9.96851419e-01, 3.14858143e-03],
       [9.99306174e-01, 6.93826348e-04]])

In [15]:
learner.predict(test_data)[0:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
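
The link between the two outputs can be sketched by hand. The probabilities below are the positive-class values copied from the first four rows of the `predict_proba` output above; `to_label` is a hypothetical helper, and 0.01 is an arbitrary cutoff chosen only to show that the label depends on it.

```python
# Positive-class probabilities from the first rows of the predict_proba output.
pos_probs = [3.09358022e-03, 7.20186469e-04, 4.85639499e-04, 1.65984007e-02]


def to_label(prob, cutoff=0.5):
    """Map a positive-class probability to a 0/1 label."""
    return int(prob >= cutoff)


default_labels = [to_label(p) for p in pos_probs]      # cutoff of 0.5
lenient_labels = [to_label(p, 0.01) for p in pos_probs]  # arbitrary lower cutoff
print(default_labels)
print(lenient_labels)
```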

## Performance

In [16]:
print("Accuracy Score: {}".format(accuracy_score(test_target,
learner.predict(test_data))))

Accuracy Score: 0.9635694894474122


In [17]:
print("Confusion Matrix(rows=real, columns=pred)")
print(
    pd.DataFrame(
        confusion_matrix(
            test_target,
            learner.predict(test_data)
            ), columns=['NO', 'YES'], index=['NO', 'YES']
            )
)

Confusion Matrix(rows=real, columns=pred)
        NO  YES
NO   10994   39
YES    377    9


**_Accuracy_**:

<p align="center">
  <img src="https://static.wixstatic.com/media/02a1ae_32cad84eaf3348059a8996d1b0f88627~mv2.jpg/v1/fill/w_597,h_416,al_c,q_90/02a1ae_32cad84eaf3348059a8996d1b0f88627~mv2.jpg" width="300" height="200">
</p>



$$ Accuracy = \frac{TP+TN}{TP+TN+FP+FN} $$

**_Precision_**:

What proportion of positive identifications was actually correct?

Let's calculate precision for our model:

$$ Precision = \frac{TP}{TP+FP} = \frac{9}{9+39} = 0.19 $$

Our model has a precision of 0.19 or, in other words, when it predicts "YES", it is correct 19% of the time.

**_Recall_**:

What proportion of actual positives was identified correctly?

$$ Recall = \frac{TP}{TP+FN} = \frac{9}{9+377} = 0.02 $$

Our model has a recall of 0.02 or, in other words, it correctly identifies 2% of all "YES".

**_F1 Score_**:

The F1 score is defined as the harmonic mean of precision and recall.

$$ F1\ Score = 2 \times \frac{Precision \times Recall}{Precision+Recall} = 2 \times \frac{0.19 \times 0.02}{0.19+0.02} = 0.04 $$
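
As a quick check, the four metrics can be recomputed directly from the counts in the confusion matrix above, taking "YES" as the positive class. This is plain arithmetic rather than a library call; using the unrounded precision and recall gives an F1 of about 0.041, which also rounds to 0.04.

```python
# Counts taken from the confusion matrix above (positive class = "YES").
tn, fp, fn, tp = 10994, 39, 377, 9

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```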

In [18]:
from sklearn.metrics import classification_report

In [19]:
print(classification_report(
    test_target,
    learner.predict(test_data),
    target_names=['NO', 'YES'])
)

              precision    recall  f1-score   support

          NO       0.97      1.00      0.98     11033
         YES       0.19      0.02      0.04       386

    accuracy                           0.96     11419
   macro avg       0.58      0.51      0.51     11419
weighted avg       0.94      0.96      0.95     11419



The problem here is that an accuracy of 96% sounds like a great result, even though the model performs very poorly on the "YES" class. In conclusion: accuracy is not a good metric to use when you have class imbalance.
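
To make the point concrete, compare the model with a hypothetical baseline that always answers "NO". Using the test class counts printed earlier, that baseline reaches a slightly higher accuracy than the trained model while never finding a single conversion.

```python
# Test-set class counts from the class distribution printed earlier.
n_no, n_yes = 11033, 386

# A hypothetical "model" that always answers NO is right on every NO row.
baseline_accuracy = n_no / (n_no + n_yes)
model_accuracy = 0.9635694894474122  # accuracy score reported above

print(f"always-NO accuracy: {baseline_accuracy:.4f}")
print(f"LightGBM accuracy:  {model_accuracy:.4f}")
```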

## Balancing

Oversampling increases the number of minority class members in the training set.

To perform the oversampling we use the [imbalanced-learn](https://imbalanced-learn.org/stable/) library.
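
The idea behind random oversampling can be sketched with NumPy alone: minority rows are drawn with replacement and appended until both classes have the same size. This is an illustration of the concept on made-up arrays, not imbalanced-learn's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])  # hypothetical imbalanced labels
X = np.arange(len(y)).reshape(-1, 1)     # one dummy feature per row

minority_idx = np.flatnonzero(y == 1)
n_missing = int(np.sum(y == 0)) - len(minority_idx)

# Draw minority rows with replacement until both classes have the same size.
extra = rng.choice(minority_idx, size=n_missing, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
print(np.bincount(y_bal))
```

Only the training set should be resampled this way; the test set keeps its original class distribution.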

In [20]:
from imblearn.over_sampling import RandomOverSampler

In [21]:
balance = RandomOverSampler(random_state=0)
train_data, train_target = balance.fit_resample(train_data, train_target)

In [22]:
print("Class distribution in train after balance")
print(train_target.value_counts())

Class distribution in train after balance
0    10704
1    10704
dtype: int64


In [23]:
learner = LGBMClassifier(
    random_state=0).fit(train_data, train_target)

In [24]:
print("Confusion Matrix(rows=real, columns=pred)")
print(
    pd.DataFrame(
        confusion_matrix(
            test_target,
            learner.predict(test_data)
            ), columns=['NO', 'YES'], index=['NO', 'YES']
            )
)

Confusion Matrix(rows=real, columns=pred)
        NO  YES
NO   10433  600
YES    268  118


In [25]:
print(classification_report(
    test_target,
    learner.predict(test_data),
    target_names=['NO', 'YES'])
)

              precision    recall  f1-score   support

          NO       0.97      0.95      0.96     11033
         YES       0.16      0.31      0.21       386

    accuracy                           0.92     11419
   macro avg       0.57      0.63      0.59     11419
weighted avg       0.95      0.92      0.93     11419



## Feature Selection

Feature selection is the process of reducing the number of input variables.

It is desirable to reduce the number of input variables to both reduce the computational cost of modeling and, in some cases, to improve the performance of the model.

Statistical-based feature selection methods involve evaluating the relationship between each input variable and the target variable using statistics and selecting those input variables that have the strongest relationship with the target variable.

Here we use [ANOVA F-value](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html).
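
For intuition, the one-way ANOVA F statistic for a single feature compares how far the class means are from each other with how spread out the values are inside each class. The sketch below computes it by hand on made-up values; `anova_f` is a hypothetical helper written for illustration, not the scikit-learn function.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numbers."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the values around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical values of one feature, split by class label.
feature_when_no = [1.0, 2.0, 3.0]
feature_when_yes = [4.0, 5.0, 6.0]
f_value = anova_f([feature_when_no, feature_when_yes])
print(f_value)
```

A larger F value means the feature separates the classes more clearly, so keeping the `k` features with the highest scores favors the most discriminative ones.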

In [26]:
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif

In [27]:
pipeline = Pipeline([
    ('fs', SelectKBest(score_func=f_classif, k=300)),
    ('clf', LGBMClassifier(random_state=0))
])

In [28]:
learner = pipeline.fit(train_data, train_target)

In [29]:
print("Confusion Matrix(rows=real, columns=pred)")
print(
    pd.DataFrame(
        confusion_matrix(
            test_target,
            learner.predict(test_data)
            ), columns=['NO', 'YES'], index=['NO', 'YES']
            )
)

Confusion Matrix(rows=real, columns=pred)
        NO  YES
NO   10412  621
YES    265  121


In [30]:
print(classification_report(
    test_target,
    learner.predict(test_data),
    target_names=['NO', 'YES'])
)

              precision    recall  f1-score   support

          NO       0.98      0.94      0.96     11033
         YES       0.16      0.31      0.21       386

    accuracy                           0.92     11419
   macro avg       0.57      0.63      0.59     11419
weighted avg       0.95      0.92      0.93     11419



## Feature Engineering

Feature engineering is the process of selecting, manipulating, and transforming raw data into features. To make machine learning work well on new tasks, it might be necessary to design better features.

In [31]:
learner = LGBMClassifier(
    random_state=0).fit(train_data, train_target)

In [32]:
fi = pd.Series(
    learner.feature_importances_ / learner.feature_importances_.sum(),
    index=train_data.columns
)
print(fi.sort_values(ascending=False).head(10))

PAGE_65               0.02
PAGE_41               0.02
PAGE_39               0.01
PAGE_87               0.01
PAGE_5                0.01
PAGE_2                0.01
CONTENT_CATEGORY_16   0.01
PAGE_20               0.01
PAGE_69               0.01
PAGE_110              0.01
dtype: float64


In [33]:
train_data["FI"] = train_data['PAGE_65'] + train_data['PAGE_41']
test_data["FI"] = test_data['PAGE_65'] + test_data['PAGE_41']

In [34]:
learner = LGBMClassifier(
    random_state=0).fit(train_data, train_target)

In [35]:
print("Confusion Matrix(rows=real, columns=pred)")
print(
    pd.DataFrame(
        confusion_matrix(
            test_target,
            learner.predict(test_data)
            ), columns=['NO', 'YES'], index=['NO', 'YES']
            )
)

Confusion Matrix(rows=real, columns=pred)
        NO  YES
NO   10394  639
YES    245  141


In [36]:
print(classification_report(
    test_target,
    learner.predict(test_data),
    target_names=['NO', 'YES'])
)

              precision    recall  f1-score   support

          NO       0.98      0.94      0.96     11033
         YES       0.18      0.37      0.24       386

    accuracy                           0.92     11419
   macro avg       0.58      0.65      0.60     11419
weighted avg       0.95      0.92      0.93     11419



In [37]:
# Flag users who had already converted before the training target window (months 1-5).
# The merges below join on the index, so both frames are assumed to be indexed by USER_ID.
train_conversions = target[target["mes"] < 6].groupby(
    "USER_ID")["USER_ID"].count().reset_index(name="CANT_CONV").set_index(
        'USER_ID')
train_data = train_data.merge(
    train_conversions, how="left", left_index=True, right_index=True).fillna(0)
train_data["CANT_CONV"] = np.where(train_data["CANT_CONV"] > 0, 1, 0)

# Same flag for the test set, using conversions before its target window (months 1-9).
test_conversions = target[target["mes"] < 10].groupby(
    "USER_ID")["USER_ID"].count().reset_index(name="CANT_CONV").set_index(
        'USER_ID')
test_data = test_data.merge(
    test_conversions, how="left", left_index=True, right_index=True).fillna(0)
test_data["CANT_CONV"] = np.where(test_data["CANT_CONV"] > 0, 1, 0)

In [38]:
learner = LGBMClassifier(
    random_state=0).fit(train_data, train_target)

In [39]:
print("Confusion Matrix(rows=real, columns=pred)")
print(
    pd.DataFrame(
        confusion_matrix(
            test_target,
            learner.predict(test_data)
            ), columns=['NO', 'YES'], index=['NO', 'YES']
            )
)

Confusion Matrix(rows=real, columns=pred)
        NO  YES
NO   10682  351
YES    331   55


In [40]:
print(classification_report(
    test_target,
    learner.predict(test_data),
    target_names=['NO', 'YES'])
)

              precision    recall  f1-score   support

          NO       0.97      0.97      0.97     11033
         YES       0.14      0.14      0.14       386

    accuracy                           0.94     11419
   macro avg       0.55      0.56      0.55     11419
weighted avg       0.94      0.94      0.94     11419



In [41]:
train_data.drop("CANT_CONV", inplace=True, axis=1)
test_data.drop("CANT_CONV", inplace=True, axis=1)

We could try other things, like calculating the time spent on each page, but we leave that as homework :P
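
As a starting point for that homework, one possible approach is to treat the gap until the same user's next event as the time spent on a page. This is only a sketch under that assumption (long gaps between sessions would inflate the totals), and the `views` frame holds made-up rows with the same column names as the page view data.

```python
import pandas as pd

# Hypothetical page views for two users.
views = pd.DataFrame({
    "USER_ID": [0, 0, 0, 1, 1],
    "PAGE": [1, 2, 1, 3, 3],
    "FEC_EVENT": pd.to_datetime([
        "2018-03-30 07:35:48", "2018-03-30 07:35:52", "2018-03-30 07:36:11",
        "2018-03-30 08:00:00", "2018-03-30 08:00:30",
    ]),
}).sort_values(["USER_ID", "FEC_EVENT"])

# Time on a page = gap until the same user's next event (missing for the last one).
next_event = views.groupby("USER_ID")["FEC_EVENT"].shift(-1)
views["SECONDS"] = (next_event - views["FEC_EVENT"]).dt.total_seconds()

# Total seconds per user and page; pages a user never left or visited count as 0.
time_on_page = (
    views.groupby(["USER_ID", "PAGE"])["SECONDS"].sum().unstack(fill_value=0)
)
print(time_on_page)
```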

## Threshold optimization

Many machine learning algorithms can predict a probability or score of class membership, which must be interpreted before it can be mapped to a crisp class label. This is done with a threshold, such as 0.5: all values equal to or greater than the threshold are mapped to one class and all other values are mapped to the other class.

For those classification problems that have a severe class imbalance, the default threshold can result in poor performance. As such, a simple and straightforward approach to improving the performance of a classifier that predicts probabilities on an imbalanced classification problem is to tune the threshold used to map probabilities to class labels.
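
The tuning loop can be sketched on a handful of made-up probabilities: for each candidate threshold, convert the probabilities to labels, compute the F1 score, and keep the threshold with the best score. The data and the `f1` helper below are hypothetical and only meant to show the mechanics.

```python
def f1(y_true, y_pred):
    """F1 score for binary labels, with 1 as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Hypothetical true labels and predicted positive-class probabilities.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
pos_probs = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.80]

scores = {}
for t in [0.3, 0.4, 0.5, 0.6]:
    y_pred = [int(p >= t) for p in pos_probs]
    scores[t] = f1(y_true, y_pred)

best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

In this toy case the default cutoff of 0.5 misses the positive example scored at 0.40, so a lower threshold gives a better balance between precision and recall.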

In [42]:
from numpy import arange
from numpy import argmax
from sklearn.metrics import f1_score

In [43]:
def to_labels(pos_probs, threshold):
    return (pos_probs >= threshold).astype('int')

In [44]:
learner = LGBMClassifier(
    random_state=0).fit(train_data, train_target)

In [45]:
thresholds = arange(0, 1, 0.001)
# Compute the positive-class probabilities once and reuse them for every threshold.
pos_probs = learner.predict_proba(test_data)[:, -1]
scores = [f1_score(test_target, to_labels(pos_probs, t)) for t in thresholds]
ix = argmax(scores)
print('Threshold=%.3f, F-Score=%.5f' % (thresholds[ix], scores[ix]))

Threshold=0.496, F-Score=0.24253

Note that the cells below apply a lower cutoff of 0.40 rather than the F1-optimal 0.496, which raises recall for the "YES" class at the cost of some precision.


In [46]:
print("Confusion Matrix(rows=real, columns=pred)")
print(
    pd.DataFrame(
        confusion_matrix(
            test_target,
            np.where(learner.predict_proba(test_data)[:, -1] >= 0.40, 1, 0)
            ), columns=['NO', 'YES'], index=['NO', 'YES']
            )
)

Confusion Matrix(rows=real, columns=pred)
        NO  YES
NO   10165  868
YES    227  159


In [47]:
print(classification_report(
    test_target,
    np.where(learner.predict_proba(test_data)[:, -1] >= 0.40, 1, 0),
    target_names=['NO', 'YES'])
)

              precision    recall  f1-score   support

          NO       0.98      0.92      0.95     11033
         YES       0.15      0.41      0.23       386

    accuracy                           0.90     11419
   macro avg       0.57      0.67      0.59     11419
weighted avg       0.95      0.90      0.92     11419



## Coming soon...

* Overfitting,
* bias–variance tradeoff,
* cross validation,
* hyperparameter optimization,
* model selection,
* stacking,
* and much more.