# Training a simple classifier

> This notebook produces the same result as `train-model.py`. The only difference is that the notebook needs additional dependencies to run. You can run either file to produce the machine learning model we'll deploy to AWS Lambda.

This notebook fits a gradient-boosted ensemble of trees to the well-known 1995 _breast cancer_ dataset. The goal is to produce a model that can predict, with reasonable accuracy, whether a tumor is malignant or benign.

We will then export the fitted model and deploy it using AWS Lambda and API Gateway to enable it for online consumption.

## Important note about your environment

Run this notebook with the package versions specified in `requirements.txt`!

This is extremely important because scikit-learn is too large to fit in the Lambda's deployment package. Instead, we will use a specific Lambda Layer that is only compatible with these versions. If you use another configuration (e.g., scikit-learn version 1.0.0), your Lambda may not be able to load the final model.
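
One way to catch a mismatch before training is to compare the installed package versions against the pins. The sketch below assumes `requirements.txt` uses simple `name==version` lines; the helpers `parse_pins` and `find_mismatches` are hypothetical and not part of this project.

```python
from importlib import metadata


def parse_pins(lines):
    """Return {package: version} for every 'name==version' line."""
    pins = {}
    for line in lines:
        line = line.split('#', 1)[0].strip()  # Drop comments and whitespace
        if '==' in line:
            name, version = line.split('==', 1)
            pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {package: (pinned, installed)} for packages that differ."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # Package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# Usage (assumes requirements.txt sits in the working directory):
# with open('requirements.txt') as f:
#     print(find_mismatches(parse_pins(f)))
```

An empty dictionary means every pinned package is installed at the pinned version.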

## Imports

In [None]:
import pickle
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, train_test_split, KFold
from sklearn.metrics import f1_score

## Load data
The dataset has 569 observations, 30 features and one target.

By default, the target is encoded as:
- `0` for malignant tumors; and
- `1` for benign tumors.

We will flip the labels so that `1` implies malignant (line 12 in the following cell).

In [None]:
# Load dataset
all_data = load_breast_cancer()

# Features to pandas
X = pd.DataFrame(
    data=all_data['data'],
    columns=all_data['feature_names']
)

# Target to pandas
y = pd.DataFrame(
    data=(1 - all_data['target']), # Flip target labels
    columns=['malignant']
)

# Split data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
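
As a quick sanity check on the cell above, the following standalone sketch reloads the dataset and confirms its shape and the effect of the label flip. The variable names are illustrative and do not touch `X`, `y`, or the train/test split.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Original encoding: 0 = malignant, 1 = benign
original = data['target']

# Flipped encoding: 1 = malignant, 0 = benign
flipped = 1 - original

print(data['data'].shape)          # (observations, features)
print(list(data['target_names']))  # Class names in original label order
print(np.bincount(flipped))        # [benign count, malignant count]
```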

## Feature selection
We will now instantiate a classifier with some generic hyperparameters. We will then recursively fit this model on the data, and in each step, we will drop the least important feature. We do this to limit the number of features needed to make a good prediction.

In [None]:
# Declare a splitter (used in cross validation)
cv_splitter = KFold(
    n_splits=10,
    shuffle=True,
    random_state=42
)

# Instantiate model
clf = GradientBoostingClassifier(
    learning_rate=0.01,
    n_estimators=1000,
    max_depth=3,
    max_features=9,
    random_state=42
)

# Instantiate feature eliminator
rfe = RFECV(
    estimator=clf,
    step=1,
    min_features_to_select=5,
    cv=cv_splitter,
    scoring='f1',
    n_jobs=-1
)

# Fit many models, each with fewer features
rfe = rfe.fit(
    X=X_train,
    y=y_train['malignant']
)

# Store the selected features
cols = rfe.get_feature_names_out()

# Print results
print(f'The optimal model uses {rfe.n_features_} features:\n{rfe.get_feature_names_out()}')
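
To make the elimination loop concrete, here is a minimal sketch of a single recursive elimination pass, without the cross-validation that `RFECV` adds. It runs on synthetic data from `make_classification` with a deliberately small ensemble so it finishes quickly; the dataset, the hyperparameters, and the stopping point of three features are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data: 8 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=200,
    n_features=8,
    n_informative=3,
    n_redundant=0,
    random_state=42
)

remaining = list(range(X_demo.shape[1]))  # Column indices still in play
min_features = 3

while len(remaining) > min_features:
    model = GradientBoostingClassifier(n_estimators=50, random_state=42)
    model.fit(X_demo[:, remaining], y_demo)

    # Drop the feature with the lowest importance in this round
    weakest = int(np.argmin(model.feature_importances_))
    remaining.pop(weakest)

print(f'Surviving column indices: {remaining}')
```

`RFECV` goes one step further: it scores each candidate feature count across the cross-validation folds and keeps the count with the best mean score.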

## Model selection
Now that we have found the best subset of features for the basic model, we will play around with its hyperparameters to find the best overall model.

In [None]:
# Hyperparameter candidates
grid = {
    'n_estimators': [500, 1000, 1500, 2000],
    'max_depth': [1, 2, 3]
}

# Instantiate search
search = GridSearchCV(
    estimator=clf,
    param_grid=grid,
    scoring='f1',
    n_jobs=-1,
    refit=True,
    cv=cv_splitter
)

# Find best combination
search = search.fit(
    X_train[cols],
    y_train['malignant']
)
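
To get a feel for the cost of this search, the sketch below enumerates the candidate combinations with `itertools.product` and multiplies by the number of folds. It mirrors the grid and the 10-fold splitter defined above but is otherwise standalone; `demo_grid` and `n_folds` are illustrative names.

```python
from itertools import product

demo_grid = {
    'n_estimators': [500, 1000, 1500, 2000],
    'max_depth': [1, 2, 3]
}
n_folds = 10  # Matches n_splits in the KFold splitter

# Every combination of the candidate values
combinations = [
    dict(zip(demo_grid.keys(), values))
    for values in product(*demo_grid.values())
]

n_cv_fits = len(combinations) * n_folds

print(f'{len(combinations)} combinations x {n_folds} folds = {n_cv_fits} fits')
print('Plus one final refit on the full training set (refit=True)')
```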

Let's look at the CV results to determine the best set of hyperparameters:

In [None]:
# Store results in pandas df, keeping only the columns of interest
search_res = pd.DataFrame(data=search.cv_results_)[
    [
        'params',
        'mean_test_score',
        'std_test_score',
        'rank_test_score'
    ]
]

# View top-five models
search_res.sort_values('rank_test_score').head(5)

## Final model
Now that we know both the best subset of features and hyperparameters, we can persist the model.

In [None]:
# Instantiate model
clf = GradientBoostingClassifier()

# Set parameters
clf = clf.set_params(
    **search.best_estimator_.get_params()
)

# Fit on whole training dataset
clf.fit(
    X_train[cols],
    y_train['malignant']
)

# Make predictions
pred_train = clf.predict(X_train[cols])
pred_test = clf.predict(X_test[cols])

# Score predictions
f1_train = f1_score(
    y_true=y_train,
    y_pred=pred_train
)
f1_test = f1_score(
    y_true=y_test,
    y_pred=pred_test
)

# Summary
print(f'F1-score on training data: {round(f1_train, 2)}')
print(f'F1-score on testing data: {round(f1_test, 2)}')
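
For reference, the F1-score is the harmonic mean of precision and recall. The sketch below computes it by hand from two short, made-up label lists (not results from this notebook); the helper `f1_from_labels` is hypothetical.

```python
def f1_from_labels(y_true, y_pred):
    """Compute the binary F1-score, treating label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    if tp == 0:
        return 0.0  # No true positives, so the score is zero

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 1 = malignant, 0 = benign
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 1, 0, 0, 0, 1, 1, 0]

print(round(f1_from_labels(y_true_demo, y_pred_demo), 2))
```

Here there are three true positives, one false positive, and one false negative, so precision and recall are both 0.75 and so is the F1-score.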

The model is slightly overfitting the training data: the F1-score is four percentage points lower on the testing dataset.

Regardless, this is a reasonable result, so we'll proceed to train the model on all the available data and export it using `pickle`.

In [None]:
# Fit on all data (training and testing sets combined)
clf = clf.fit(
    X=X[cols].values, # Train without feature names
    y=y['malignant']
)

# Export model (the context manager closes the file handle)
with open('../results/clf.sav', 'wb') as model_file:
    pickle.dump(
        obj=clf,
        file=model_file
    )
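
On the consuming side, the model is restored with `pickle.load`. The sketch below round-trips a small stand-in classifier through a temporary file and checks that its predictions survive. The synthetic data and the tiny ensemble are assumptions for illustration; the real Lambda would load `clf.sav` instead.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in model trained on synthetic data with five features
X_demo, y_demo = make_classification(
    n_samples=100,
    n_features=5,
    random_state=42
)
model = GradientBoostingClassifier(n_estimators=20, random_state=42)
model.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'clf.sav')

    # Export the fitted model
    with open(path, 'wb') as f:
        pickle.dump(model, f)

    # Import it again; this should run under the same scikit-learn
    # version that was used for the export
    with open(path, 'rb') as f:
        restored = pickle.load(f)

same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(f'Predictions identical after reload: {same}')
```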

To make a prediction, the final model (`clf`) expects five inputs, in this order:
1. Mean concave points;
2. Worst radius;
3. Worst texture;
4. Worst area; and
5. Worst concave points.

For example:

In [None]:
clf.predict(
    X=[
        [
            0.19,
            33.13,
            23.58,
            3234.0,
            0.28
        ]
    ]
)
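
Once deployed, a prediction request boils down to the same call. Below is a rough sketch of how a Lambda handler behind an API Gateway proxy integration might wrap it. Everything here is hypothetical: the handler name, the JSON field `features`, and the `StubModel` class that stands in for the unpickled `clf` so the example runs without scikit-learn.

```python
import json


class StubModel:
    """Hypothetical stand-in for the unpickled classifier."""

    def predict(self, X):
        # Toy rule for illustration only: flag rows whose second value is large
        return [1 if row[1] > 20 else 0 for row in X]


MODEL = StubModel()  # A real function would unpickle clf.sav here instead
N_FEATURES = 5


def handler(event, context):
    """Parse the request body, validate it, and return a prediction."""
    try:
        features = json.loads(event['body'])['features']
    except (KeyError, TypeError, ValueError):
        return {'statusCode': 400, 'body': json.dumps({'error': 'bad request'})}

    if not isinstance(features, list) or len(features) != N_FEATURES:
        return {'statusCode': 400, 'body': json.dumps({'error': 'expected 5 values'})}

    prediction = MODEL.predict([features])[0]
    return {'statusCode': 200, 'body': json.dumps({'malignant': int(prediction)})}


event = {'body': json.dumps({'features': [0.19, 33.13, 23.58, 3234.0, 0.28]})}
print(handler(event, None))
```

Loading the model once at module level, rather than inside the handler, lets warm invocations reuse it.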