<h1 style="font-size:30px">Model Training - Classification</h1>
<hr>

1. Split the dataset
2. Build model pipelines
3. Declare hyperparameters o tune
4. Fit and tune models with cross-validation
5. Evaluate metrics and select winner

<span style="font-size:18px">**Import libraries**</span>

In [1]:
# Numpy for numerical computing
import numpy as np

# Pandas for Dataframes
import pandas as pd
pd.set_option('display.max_columns',100)

# Matplolib for visualization
from matplotlib import pyplot as plt
# display plots in the notebook
%matplotlib inline

# Seaborn for easier visualization
import seaborn as sns

# Scikit-Learn for modeling
import sklearn

# Pickle for saving model files
import pickle

In [2]:
# Function for splitting training and test set
from sklearn.model_selection import train_test_split

# Function for creating model pipelines
from sklearn.pipeline import make_pipeline

# Function for standardization
from sklearn.preprocessing import StandardScaler

# Helper for cross-validation
from sklearn.model_selection import GridSearchCV

**Classification Algorithms**

In [3]:
# Import Logistic Regression
from sklearn.linear_model import LogisticRegression

# Import RandomForestClassifier and GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Classification metrics
from sklearn.metrics import roc_curve, auc, confusion_matrix

<span style="font-size:18px">**Load analytical base table**</span>

In [4]:
# Load analytical base table from Feature Engineering
df = pd.read_csv('analytical_base_table.csv')

<span style="font-size:18px">**1. Split the dataset**</span><br>

Separate the dataframe into separate objects for the target variable (y) and the input features (X)

In [5]:
# Create separate object for target variable
y = df.class_p

# Create separate object for input features
X = df.drop('class_p', axis = 1)

**Training sets** are used to fit and tune the models<br>
**Test sets** are put aside as unseen data to evaluate your models
<br>
* Split the train and test set, passing in the argument **test_size = 0.2** to set aside 20% of the observations for the test set
* The **random_state = 1234** is set for replicable results
* **Important**: For classification model also pass in the argument **stratify = df.target** in order to make sure the target variable's classes are balanced in each subset of data. This is **stratified random sampling**

In [6]:
# Split X and y into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1234)

# Print number of observations in X_train, X_test, y_train, and y_test
print(len(X_train), len(X_test), len(y_train), len(y_test))

6499 1625 6499 1625


<span style="font-size:18px">**2. Build model pipelines**</span><br>
The pipeline will standardize the data first, then apply the model algorithm to it

**Preprocessing**: should be performed inside the cross-validation loop
* Transform or scale the features
* Perform automatic feature reduction (e.g. PCA)
* Remove correlated features<br>
<br>
**Standartization**: transforms all features to the same **scale** by substracting means and dividing by standard deviations.
* Feature's distribution **centered around zero, with unit variance**

* The **random_state = 123** is set for replicable results

**Classification pipelines**

In [7]:
# Create pipelines dictionary
pipelines = {'l1': make_pipeline(StandardScaler(), LogisticRegression(penalty = 'l1', random_state = 123)),
             'l2': make_pipeline(StandardScaler(), LogisticRegression(penalty = 'l2', random_state = 123)),
             'rf': make_pipeline(StandardScaler(), RandomForestClassifier(random_state = 123)),
             'gb': make_pipeline(StandardScaler(), GradientBoostingClassifier(random_state = 123))}

<span style="font-size:18px">**3. Declare hyperparameters to tune**</span><br>
**Hyperparameters** express "higher-level" structural settings for modeling algorithms<br>
* e.g. strength of the penalty used in regularized regression
* e.g. the number of trees to include in a random forest
* They are **decided** before training the model because they cannot be learned from the data

**Classification hyperparameters**

* **C** is the **strength of the penalty**, inverse of alpha from regularization strength
* Higher values of C mean weacker penalties
* C is a positive value, typically between 0 and 1000
* The default is 1.0

In [8]:
# Logistic Regression hyperparameters
l1_hyperparameters = {'logisticregression__C': np.linspace(1e-3, 1e3, 10)}
l2_hyperparameters = {'logisticregression__C': np.linspace(1e-3, 1e3, 10)}

In [9]:
# Random Forest hyperparameters
rf_hyperparameters = {'randomforestclassifier__n_estimators': [100, 200],
                      'randomforestclassifier__max_features': ['auto', 'sqrt', 0.33]}

In [10]:
# Boosted tree hyperparameters
gb_hyperparameters = {'gradientboostingclassifier__n_estimators': [100, 200],
                      'gradientboostingclassifier__learning_rate': [0.05, 0.1, 0.2],
                      'gradientboostingclassifier__max_depth': [1, 3, 5]}

In [11]:
# Create hyperparameters
hyperparameters = {'l1': l1_hyperparameters,
                   'l2': l2_hyperparameters,
                   'rf': rf_hyperparameters,
                   'gb': gb_hyperparameters}

<span style="font-size:18px">**4. Fit and tune models with cross-validation**</span><br>

The GridSearchCV function performs cross-validation on the **hyperparameter grid**, through each **combination of values**. It then calculates **cross-validated scores** (using performance metrics) for each combination of hyperparameter values and picks the combination that has the best score
* **cv** is the number of cross-validation folds
* **n_jobs = -1** trains in parallel across the maximum number of cores of the computer, speeding it up

In [12]:
# Create empty dictionary called fitted_models
fitted_models = {}

# Loop through model pipelines, tuning each one and saving it to fitted_models
for name, pipeline in pipelines.items():
    
    # Create cross-validation object from pipeline and hyperparameters
    model = GridSearchCV(pipeline, hyperparameters[name], cv = 10, n_jobs = -1)
    
    # Fit model on X_train, y_train
    model.fit(X_train, y_train)
    
    # Store model in fitted_models[name]
    fitted_models[name] = model
    
    # Print when the model is fitted
    print(name, 'has been fitted.')

l1 has been fitted.
l2 has been fitted.
rf has been fitted.
gb has been fitted.


<span style="font-size:18px">**5. Evaluate models and select winner**</span><br>

In [13]:
# Display best_score_ for each fitted_model
for name, model in fitted_models.items():
    print(name, model.best_score_)

l1 0.9998461301738729
l2 0.9998461301738729
rf 1.0
gb 0.9998461301738729


**Classification metrics**

First evaluate the models by looking at their **cross-validated performance** on the training set, through the **holdout accuracy** scores
* Accuracy is simply the percent of observations correctly classified by the model
* Because is the average accuracy from the **holdout folds**, higher is almost always better

Straight accuracy is not always the best way to evaluate a classification model, specifically when evaluating **imbalanced classes** in the target variable<br>
<br>
**Area under ROC curve** is the most reliable metric for classification tasks. It is equivalent to the probability that a randomly chosen **positive** observation ranks higher (has a higher predicted probability) than a randomly chosen **negative** observation
* ROC curve is a way to visualize the **relationship between TPR (true positive rate) and FPR (false positive rate)** for classification models
* Plot the TPR and FPR at different **thresholds**

In [14]:
# Loop through the fitted_models to predict probabilities
for name, model in fitted_models.items():
    pred = model.predict_proba(X_test)
    
    # Get just the prediction for the positive class (1)
    pred = [p[1] for p in pred]
    
    # Calculate ROC curve from y_test and pred
    fpr, tpr, thresholds = roc_curve(y_test, pred)
    
    # Calculate and print AUROC
    print(name, auc(fpr, tpr))

l1 1.0
l2 1.0
rf 1.0
gb 1.0


In [15]:
# Selected winning hyperparameters
print(fitted_models['l1'].best_estimator_)
print(fitted_models['l2'].best_estimator_)
print(fitted_models['rf'].best_estimator_)
print(fitted_models['gb'].best_estimator_)

Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=111.11200000000001, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l1', random_state=123,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False))])
Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=111.11200000000001, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=123,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False))])
Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('randomforestclassifier', RandomForestClassifier(boot