# Training a simple classifier
This notebook fits a gradient-boosted ensemble of trees to the well-known 1995 _breast cancer_ dataset (the Breast Cancer Wisconsin Diagnostic dataset bundled with scikit-learn). The goal is to produce a model that predicts, with reasonable accuracy, whether a tumor is malignant or benign.

We will then export the fitted model and deploy it using AWS Lambda and API Gateway so that it can serve predictions online.

## Imports

In [None]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, train_test_split

## Load data
The dataset has 569 observations, 30 features and one target.

By default, the target is encoded as:
- `0` for malignant tumors; and
- `1` for benign tumors.

We will flip the labels so that `1` denotes a malignant tumor (line 12 in the following cell).

In [None]:
# Load dataset
all_data = load_breast_cancer()

# Features to pandas
X = pd.DataFrame(
    data=all_data['data'],
    columns=all_data['feature_names']
)

# Target to pandas
y = pd.DataFrame(
    data=(1 - all_data['target']), # Flip target labels
    columns=['malignant']
)

# Split data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42
)
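As a quick sanity check on the label flip, the sketch below reloads the dataset and counts the malignant class before and after flipping. The names `dataset`, `raw_target` and `flipped` are illustrative and do not appear in the notebook's own cells.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
raw_target = dataset['target']   # original encoding: 0 = malignant, 1 = benign
flipped = 1 - raw_target         # flipped encoding:  1 = malignant, 0 = benign

# Class names are listed in the order of the original encoding
print(dataset['target_names'])

# The malignant count must be the same under both encodings
print('malignant before flip:', int(np.sum(raw_target == 0)))
print('malignant after flip: ', int(np.sum(flipped == 1)))
```

If the two counts agree, the flip only renamed the classes and did not change which observations belong to which class.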

## Feature selection
We will now instantiate a classifier with some generic hyperparameters. We will then fit this model repeatedly, dropping the least important feature at each step, and use 10-fold cross-validation (scored by F1) to pick the number of features that performs best, keeping at least five. This limits the number of features needed to make a good prediction.

In [None]:
# Instantiate model
clf = GradientBoostingClassifier(
    learning_rate=0.01,
    n_estimators=1000,
    max_depth=3,
    max_features=9,
    random_state=42
)

# Instantiate feature eliminator
rfe = RFECV(
    estimator=clf,
    step=1,
    min_features_to_select=5,
    cv=10,
    scoring='f1',
    n_jobs=-1
)

# Fit many models, each with fewer features
rfe = rfe.fit(
    X=X_train,
    y=y_train['malignant']
)

# Store the names of the selected features (not necessarily five)
cols = list(rfe.get_feature_names_out())

# Print results
print(f'The optimal model uses {rfe.n_features_} features:\n{rfe.get_feature_names_out()}')
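The elimination loop that `RFECV` automates can be sketched by hand. The example below is a simplified illustration on a small synthetic dataset, not the notebook's actual procedure: it stops at a fixed number of features, whereas `RFECV` also cross-validates every subset size to choose how many features to keep. The names `X_demo`, `y_demo`, `remaining` and `dropped` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Small synthetic problem (a stand-in for the real data, chosen for speed)
X_demo, y_demo = make_classification(
    n_samples=200,
    n_features=8,
    n_informative=3,
    n_redundant=0,
    random_state=0
)

remaining = list(range(X_demo.shape[1]))  # column indices still in play
dropped = []                              # column indices in elimination order

# Refit on the surviving columns and drop the least important one each round
while len(remaining) > 3:
    model = GradientBoostingClassifier(n_estimators=50, random_state=0)
    model.fit(X_demo[:, remaining], y_demo)
    weakest = int(np.argmin(model.feature_importances_))
    dropped.append(remaining.pop(weakest))

print('kept:', remaining)
print('dropped (first to last):', dropped)
```

On the fitted `rfe` object, the attributes `support_` and `ranking_` expose the same information: which features survived and the order in which the others were eliminated.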

## Model selection
Now that we have found the best subset of features for the basic model, we will tune its hyperparameters with a cross-validated grid search to find the best overall model.

In [None]:
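# Hyperparameter values to search over
grid = {
    'n_estimators': [500, 1000, 1500, 2000],
    'max_depth': [1, 2, 3]
}

# Instantiate the grid search
# (cv, scoring and n_jobs are assumed to mirror the RFECV settings above)
search = GridSearchCV(
    estimator=clf,
    param_grid=grid,
    cv=10,
    scoring='f1',
    n_jobs=-1
)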
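A grid search is typically completed by fitting it on the training data and then scoring the refitted best model on the held-out test set. The following is a minimal, self-contained sketch under several assumptions: the grid is deliberately tiny so that it runs quickly, `cv=3` and `learning_rate=0.1` are chosen for speed rather than taken from the notebook, and `demo_cols` (the first five columns) is a hypothetical stand-in for the features selected by `RFECV`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Reload the data with flipped labels (1 = malignant), as in the notebook
data = load_breast_cancer(as_frame=True)
X_demo = data.data
y_demo = 1 - data.target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Hypothetical stand-in for the RFECV-selected columns
demo_cols = list(X_demo.columns[:5])

# Deliberately small grid so the sketch finishes in seconds
small_grid = {
    'n_estimators': [50, 100],
    'max_depth': [1, 2]
}

demo_search = GridSearchCV(
    estimator=GradientBoostingClassifier(learning_rate=0.1, random_state=42),
    param_grid=small_grid,
    cv=3,
    scoring='f1'
)

# Fit every combination, then refit the best one on the full training set
demo_search.fit(X_tr[demo_cols], y_tr)

# Score the refitted best model on data it has never seen
test_f1 = f1_score(y_te, demo_search.predict(X_te[demo_cols]))

print('best hyperparameters:', demo_search.best_params_)
print('test F1:', test_f1)
```

Because `GridSearchCV` refits the best combination by default (`refit=True`), `demo_search.predict` uses the winning model directly, and `demo_search.best_estimator_` is the object one would export for deployment.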