# stage3_build_train_ml
This notebook takes the labelled data generated by `stage3_build_labelling.ipynb` and trains a classification model.

I will use good old RandomForest here, since it's powerful, intuitive to tune, and doesn't require feature normalisation/standardisation, which saves me some time.

But RandomForest does tend to overfit, so I'll do some GridSearch hyperparameter tuning with 5-fold cross-validation.

Normally we would test the model against a hold-out set. But we don't have any ground truth here (everything is labelled by myself in a weakly supervised way), so I'll just use the entire data set for training and validation and skip the test set.

Finally I'll create a batch inference pipeline, so that we can try the model on new, unseen data. Since we don't have a ground truth test set, I'll just make up a few examples and see how it works.

# Imports

In [40]:
import pandas as pd
import numpy as np
import joblib

from sklearn.model_selection import KFold, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, classification_report

In [2]:
from pathlib import Path
import sys

# a little path manipulation to load src/feature_engineering.py
root_path = str(Path('.').resolve().parent.absolute())

if root_path not in sys.path:
    sys.path.append(root_path)

from src.feature_engineering import fe_customer

In [3]:
RANDOM_SEED = 273

# Load data

In [4]:
clean_data = pd.read_parquet("../data/processed/clean_data.parquet")

In [5]:
features = fe_customer(clean_data)
# customer_id is potentially useful in data integration in the real world, 
# but we don't really need it here
features = features.drop(columns=['customer_id'])

In [6]:
labels = pd.read_parquet("../data/processed/training_set.parquet")

In [7]:
label_gender = labels.gender.replace({'MALE':0, 'FEMALE': 1, 'UNKNOWN': 2})

In [8]:
label_gender.value_counts()

1    34143
0    11765
2      371
Name: gender, dtype: int64
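The counts above are quite imbalanced, which matters when reading the scores later on. A minimal sketch that turns them into class shares; the numbers are copied by hand from the output above, and `label_counts` / `label_share` are names made up for this illustration:

```python
import pandas as pd

# Counts copied from the value_counts() output above (0=MALE, 1=FEMALE, 2=UNKNOWN)
label_counts = pd.Series({1: 34143, 0: 11765, 2: 371}, name="gender")

# Share of each class in the labelled set
label_share = (label_counts / label_counts.sum()).round(4)
print(label_share)
```

In the notebook itself, `label_gender.value_counts(normalize=True)` should give the same shares directly.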

In [9]:
# Make sure labels and features line up perfectly
assert (label_gender.index == features.index).all()

# Train

I had to go pretty aggressive with the hyperparameters to reduce overfitting.

In [45]:
parameters_grid = {
    'n_estimators':[10, 30, 100],
    'max_depth': [3, 6, 10],
    'min_samples_split': [6, 20, 60],
    'min_samples_leaf': [3, 10, 30],
}

# instantiate classifier and cross validation explicitly to set random_state
rf_clf = RandomForestClassifier(n_jobs=4, random_state=RANDOM_SEED)
kfolds = KFold(n_splits=5, random_state=RANDOM_SEED, shuffle=True)

grid_cv = GridSearchCV(rf_clf, parameters_grid, cv=kfolds, scoring='f1_weighted', n_jobs=1, verbose=10)
grid_cv.fit(features, label_gender)

Fitting 5 folds for each of 81 candidates, totalling 405 fits
[CV 1/5; 1/81] START max_depth=3, min_samples_leaf=3, min_samples_split=6, n_estimators=10
[CV 1/5; 1/81] END max_depth=3, min_samples_leaf=3, min_samples_split=6, n_estimators=10;, score=0.951 total time=   2.5s
[CV 2/5; 1/81] START max_depth=3, min_samples_leaf=3, min_samples_split=6, n_estimators=10
[CV 2/5; 1/81] END max_depth=3, min_samples_leaf=3, min_samples_split=6, n_estimators=10;, score=0.945 total time=   0.2s
[CV 3/5; 1/81] START max_depth=3, min_samples_leaf=3, min_samples_split=6, n_estimators=10
[CV 3/5; 1/81] END max_depth=3, min_samples_leaf=3, min_samples_split=6, n_estimators=10;, score=0.951 total time=   0.2s
[CV 4/5; 1/81] START max_depth=3, min_samples_leaf=3, min_samples_split=6, n_estimators=10
[CV 4/5; 1/81] END max_depth=3, min_samples_leaf=3, min_samples_split=6, n_estimators=10;, score=0.957 total time=   0.2s
[CV 5/5; 1/81] START max_depth=3, min_samples_leaf=3, min_samples_split=6, n_estimator

GridSearchCV(cv=KFold(n_splits=5, random_state=273, shuffle=True),
             estimator=RandomForestClassifier(n_jobs=4, random_state=273),
             n_jobs=1,
             param_grid={'max_depth': [3, 6, 10],
                         'min_samples_leaf': [3, 10, 30],
                         'min_samples_split': [6, 20, 60],
                         'n_estimators': [10, 30, 100]},
             scoring='f1_weighted', verbose=10)

In [46]:
print(grid_cv.best_params_)
print(grid_cv.best_score_)

{'max_depth': 10, 'min_samples_leaf': 3, 'min_samples_split': 6, 'n_estimators': 100}
0.9828576657833457
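The best cross-validation score alone doesn't say how much the model overfits. One possible check, sketched here on synthetic data rather than the real features, is to ask `GridSearchCV` for training scores as well and compare them to the validation scores. All `demo_*` names and the small grid are assumptions made for this sketch, not the settings used above:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data; the notebook would use `features` and `label_gender`
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

demo_grid = {"max_depth": [3, None], "min_samples_leaf": [1, 10]}
demo_cv = KFold(n_splits=5, shuffle=True, random_state=0)

demo_search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    demo_grid,
    cv=demo_cv,
    scoring="f1_weighted",
    return_train_score=True,  # also record the score on the training folds
)
demo_search.fit(X_demo, y_demo)

# A large gap between train and validation score hints at overfitting
cv_table = pd.DataFrame(demo_search.cv_results_)[
    ["params", "mean_train_score", "mean_test_score"]
]
cv_table["train_minus_validation"] = (
    cv_table["mean_train_score"] - cv_table["mean_test_score"]
)
print(cv_table.sort_values("mean_test_score", ascending=False))
```

Note that `mean_test_score` in `cv_results_` is the validation-fold score, not a score on a separate test set.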


Quickly check the feature importances to see if they make sense.

In [47]:
pd.Series(grid_cv.best_estimator_.feature_importances_, index=features.columns).sort_values(ascending=False)

female_items                0.317795
male_items                  0.261112
mapp_items                  0.103576
mftw_items                  0.071007
wftw_items                  0.067796
wapp_items                  0.063777
mspt_items                  0.016947
unisex_items                0.016254
macc_items                  0.012961
wacc_items                  0.012312
items                       0.008033
wspt_items                  0.007378
revenue                     0.004896
orders                      0.003806
tenure_months               0.003445
redpen_discount_used        0.003303
days_since_last_order       0.003088
returns                     0.002522
days_since_first_order      0.002336
desktop_orders              0.001940
average_discount_onoffer    0.001936
average_discount_used       0.001773
other_collection_orders     0.001764
sacc_items                  0.001593
msite_orders                0.001377
coupon_discount_applied     0.001319
devices                     0.001313
h

In [48]:
all_preds = grid_cv.best_estimator_.predict(features)

print(
    classification_report(all_preds, label_gender, target_names=['MALE','FEMALE','UNKNOWN'])
)

              precision    recall  f1-score   support

        MALE       0.99      1.00      0.99     11751
      FEMALE       1.00      0.99      0.99     34516
     UNKNOWN       0.03      1.00      0.06        12

    accuracy                           0.99     46279
   macro avg       0.68      0.99      0.68     46279
weighted avg       1.00      0.99      0.99     46279



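Just noticed UNKNOWN only has 12 support, even though 371 rows are labelled UNKNOWN. That is because `classification_report` expects `(y_true, y_pred)` and the call above passes the predictions first, so support counts predictions and the precision and recall columns are swapped. In other words, the model predicts UNKNOWN only 12 times and recovers roughly 3% of the rows labelled UNKNOWN. That could be suspicious labelling from me, or just what the data is like. I won't spend more time investigating here since it's just a coding test.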
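The effect of the argument order in `classification_report` is easy to see on a toy example. The labels below are made up purely for illustration; only the `classification_report(y_true, y_pred)` signature comes from scikit-learn:

```python
from sklearn.metrics import classification_report

# Toy labels: class 2 appears 3 times in the truth but is predicted only once
y_true_demo = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred_demo = [0, 0, 1, 1, 1, 1, 1, 2]

# Documented order: true labels first, predictions second
report_ok = classification_report(
    y_true_demo, y_pred_demo, output_dict=True, zero_division=0
)
# Swapped order: support now counts predictions, precision and recall trade places
report_swapped = classification_report(
    y_pred_demo, y_true_demo, output_dict=True, zero_division=0
)

print("support for class 2:", report_ok["2"]["support"], "vs", report_swapped["2"]["support"])
print("recall for class 2:", report_ok["2"]["recall"], "vs", report_swapped["2"]["recall"])
```

With the documented order, the UNKNOWN row of the real report would show 371 support and a low recall instead of a low precision.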

Save the best model

In [50]:
joblib.dump(grid_cv.best_estimator_, "../models/best_model.joblib")

['../models/best_model.joblib']

# Inference pipeline and manual sanity check
This should load clean data from a CSV file, run it through feature engineering, then use the RandomForest model to make a prediction.

I will test it against some made-up examples I manually created in `/response/data/manual_test/made_up_test_data.csv`.

In [51]:
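def batch_inference(input_data, model):
    """Run feature engineering and the model on input_data, return it with an inferred_gender column."""
    features = fe_customer(input_data)
    features = features.drop(columns=['customer_id'])

    preds = model.predict(features)

    # map integer classes back to labels; reuse the input index so that
    # pd.concat lines each prediction up with its row
    # (assumes fe_customer keeps one row per input row, in the same order)
    label_dict = {0:'MALE', 1:'FEMALE', 2:'UNKNOWN'}
    gender = pd.Series(preds, index=input_data.index).map(label_dict).rename('inferred_gender')

    output_df = pd.concat([gender, input_data], axis=1)
    return output_df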
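`pd.concat(..., axis=1)` aligns on index labels, so the prediction series needs to share the index of the input data. A small sketch of what happens when it doesn't, using a made-up three-row input (`demo_input`, `demo_preds` and the index values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical input whose index is not 0..n-1 (e.g. after filtering rows)
demo_input = pd.DataFrame({"customer_id": ["a", "b", "c"]}, index=[10, 11, 12])
demo_preds = np.array([0, 1, 0])
label_dict = {0: "MALE", 1: "FEMALE", 2: "UNKNOWN"}

# Default RangeIndex on the predictions: concat does an outer join on the
# index labels, so predictions and input rows end up on different rows
misaligned = pd.concat(
    [pd.Series(demo_preds).map(label_dict).rename("inferred_gender"), demo_input],
    axis=1,
)

# Reusing the input index keeps each prediction next to its row
aligned = pd.concat(
    [
        pd.Series(demo_preds, index=demo_input.index)
        .map(label_dict)
        .rename("inferred_gender"),
        demo_input,
    ],
    axis=1,
)

print(len(misaligned), len(aligned))
```

A CSV freshly loaded with `pd.read_csv` gets a default 0..n-1 index, so the problem only shows up once the input has been filtered, sorted, or re-indexed.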

In [71]:
test_data = pd.read_csv("../data/manual_test/made_up_test_data.csv")

with pd.option_context('display.max_columns', 999):
    display(test_data)

Unnamed: 0,customer_id,days_since_first_order,days_since_last_order,is_newsletter_subscriber,orders,items,cancels,returns,different_addresses,shipping_addresses,devices,vouchers,cc_payments,paypal_payments,afterpay_payments,apple_payments,female_items,male_items,unisex_items,wapp_items,wftw_items,mapp_items,wacc_items,macc_items,mftw_items,wspt_items,mspt_items,curvy_items,sacc_items,msite_orders,desktop_orders,android_orders,ios_orders,other_device_orders,work_orders,home_orders,parcelpoint_orders,other_collection_orders,redpen_discount_used,coupon_discount_applied,average_discount_onoffer,average_discount_used,revenue
0,aaaa,1,0,N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0
1,bbbb,30,10,N,3,6,0,0,0,0,0,0,0,0,0,1,1,5,1,1,0,1,0,3,1,0,0,0,0,0,0,3,0,0,0,0,0,0,100,200,0.1,0.1,50
2,ccc,3650,365,N,100,300,0,0,0,0,0,0,1,1,1,1,200,0,0,100,60,0,30,0,2,5,0,5,0,10,50,20,30,0,0,0,0,0,1000,2000,0.2,0.3,100000
3,dddd,60,30,Y,10,20,0,0,0,0,0,0,0,0,0,1,15,5,0,10,5,5,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0.0,0.0,1000
4,eee,500,5,Y,50,100,2,5,1,2,3,1,1,0,1,1,40,60,10,30,5,40,3,15,3,0,0,0,0,0,20,30,0,0,0,0,0,0,0,0,0.0,0.0,25000
5,fffff,500,5,Y,50,100,2,5,1,2,3,1,1,0,1,1,10,60,10,5,5,40,0,15,3,0,0,0,0,0,20,30,0,0,0,0,0,0,0,0,0.0,0.0,25000
6,ggggg,500,5,Y,50,43,2,5,1,2,3,1,1,0,1,1,10,33,0,10,0,15,0,15,3,0,0,0,0,0,20,10,0,0,0,0,0,0,0,0,0.0,0.0,100


In [72]:
model = joblib.load("../models/best_model.joblib")

In [73]:
test_results = batch_inference(test_data, model)

with pd.option_context('display.max_columns', 999):
    display(test_results)

Unnamed: 0,inferred_gender,customer_id,days_since_first_order,days_since_last_order,is_newsletter_subscriber,orders,items,cancels,returns,different_addresses,shipping_addresses,devices,vouchers,cc_payments,paypal_payments,afterpay_payments,apple_payments,female_items,male_items,unisex_items,wapp_items,wftw_items,mapp_items,wacc_items,macc_items,mftw_items,wspt_items,mspt_items,curvy_items,sacc_items,msite_orders,desktop_orders,android_orders,ios_orders,other_device_orders,work_orders,home_orders,parcelpoint_orders,other_collection_orders,redpen_discount_used,coupon_discount_applied,average_discount_onoffer,average_discount_used,revenue
0,FEMALE,aaaa,1,0,N,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,0.0,0
1,MALE,bbbb,30,10,N,3,6,0,0,0,0,0,0,0,0,0,1,1,5,1,1,0,1,0,3,1,0,0,0,0,0,0,3,0,0,0,0,0,0,100,200,0.1,0.1,50
2,FEMALE,ccc,3650,365,N,100,300,0,0,0,0,0,0,1,1,1,1,200,0,0,100,60,0,30,0,2,5,0,5,0,10,50,20,30,0,0,0,0,0,1000,2000,0.2,0.3,100000
3,FEMALE,dddd,60,30,Y,10,20,0,0,0,0,0,0,0,0,0,1,15,5,0,10,5,5,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0.0,0.0,1000
4,FEMALE,eee,500,5,Y,50,100,2,5,1,2,3,1,1,0,1,1,40,60,10,30,5,40,3,15,3,0,0,0,0,0,20,30,0,0,0,0,0,0,0,0,0.0,0.0,25000
5,FEMALE,fffff,500,5,Y,50,100,2,5,1,2,3,1,1,0,1,1,10,60,10,5,5,40,0,15,3,0,0,0,0,0,20,30,0,0,0,0,0,0,0,0,0.0,0.0,25000
6,MALE,ggggg,500,5,Y,50,43,2,5,1,2,3,1,1,0,1,1,10,33,0,10,0,15,0,15,3,0,0,0,0,0,20,10,0,0,0,0,0,0,0,0,0.0,0.0,100


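The model is favouring FEMALE pretty heavily (5 of the 7 made-up customers, including `aaaa`, who has no orders at all). This is probably my bias in the labelling process, and the labels themselves are about 74% FEMALE.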
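One way to see how strongly the model leans towards a class is to look at `predict_proba` instead of only the hard prediction. The sketch below uses a tiny synthetic model, not the saved one; the two columns only loosely mimic `female_items` and `male_items`, and the labelling rule is invented for the demo:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic stand-in data with two item-count features
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(
    {
        "female_items": rng.integers(0, 20, 200),
        "male_items": rng.integers(0, 20, 200),
    }
)
# Invented rule for the demo: 1=FEMALE when female_items >= male_items, else 0=MALE
y_toy = (X_toy["female_items"] >= X_toy["male_items"]).astype(int)

toy_model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_toy, y_toy)

# Three made-up customers: no purchases, mostly female items, mostly male items
new_rows = pd.DataFrame({"female_items": [0, 15, 5], "male_items": [0, 5, 15]})

# One column per class, one row per customer; each row sums to 1
proba = pd.DataFrame(toy_model.predict_proba(new_rows), columns=toy_model.classes_)
print(proba)
```

In the notebook, `model.predict_proba(features)` inside `batch_inference` would give the same kind of table for the real model, which would show whether a row like `aaaa` is a confident FEMALE or a near coin flip.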

# Conclusion

This model is far from perfect. For example, the labelling is heavily biased since my domain knowledge and intuition in online shopping are questionable. Some ground truth data would be super helpful. I could also have tried other models, like SVMs, that may be less susceptible to overfitting.

__However, I believe this workflow and methodology are robust enough to scale to real-world use cases. With SME input and ground truth data, the Weak Supervision labelling should improve drastically. In-depth tuning of the classification model should improve the learning results further. Let's also not forget that this data set is still relatively small and the features used are relatively simple.__