# ECB Data Academy - Week 5 - Ensembles
[Krisolis](http://www.krisolis.ie)

## Different Model Types in sklearn

This notebook demonstrates how different model types can be trained in **sklearn**.

### Package Imports

To build predictive models in Python we use a set of libraries that are imported here. **pandas** and **sklearn** are particularly important.

In [None]:
# Saving Python objects
import os
import pickle

# General data handling
import pandas as pd
import numpy as np

# Drawing plots
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

import pandas_profiling

# Machine learning
import sklearn
import sklearn.impute
import sklearn.model_selection
import sklearn.metrics
import sklearn.tree
import sklearn.svm
import sklearn.ensemble
import sklearn.linear_model
import sklearn.neighbors
import sklearn.preprocessing  # needed for OrdinalEncoder and MinMaxScaler below

### Load & Partition Data

#### Load Data

To support data exploration and manipulation it is easiest to load datasets as **pandas DataFrames**.

In [None]:
dataset = pd.read_csv('../Data/ACME_ABT.csv')
display(dataset.head(10))

#### Explore the Dataset

Examine the distribution of the two classes

In [None]:
dataset["churn"].value_counts()

Generate a suite of summary statistics for the numeric and categorical features in the data, and count missing values. 

In [None]:
# Print descriptive statistics for each column
print("Summary Stats")
if dataset.select_dtypes(include=[np.number]).shape[1] > 0: 
    display(dataset.describe(include="number").transpose())
if dataset.select_dtypes(include=[object]).shape[1] > 0: 
    display(dataset.describe(include="object").transpose())

# Check for presence of missing values
print("Missing Values")
print(dataset.isnull().sum())

A **ProfileReport** from **pandas_profiling** gives a very useful summary of the dataset and highlights potential issues.

In [None]:
pandas_profiling.ProfileReport(dataset, 
                               minimal = True)

In [None]:
# A bug in pandas_profiling means plots don't appear after calling it. This re-import of matplotlib fixes the bug.
import matplotlib.pyplot as plt
%matplotlib inline

#### Prepare Data for Modelling

We select features, impute missing values, replace spurious values and convert categorical features to numeric ones.

In [None]:
# Select the descriptive features (each column listed once) and take a copy
# so that later assignments do not trigger SettingWithCopyWarning
X = dataset[['age',
 'income',
 'numHandsets',
 'handsetAge',
 'smartPhone',
 'currentHandsetPrice',
 'avgBill',
 'avgOverBundleMins',
 'avgRoamCalls',
 'callMinutesChangePct',
 'billAmountChangePct',
 'avgReceivedMins',
 'avgOutCalls',
 'avgInCalls',
 'peakOffPeakRatio',
 'peakOffPeakRatioChangePct',
 'avgDroppedCalls',
 'lifeTime',
 'newFrequentNumbers',
 'regionType',
 'marriageStatus',
 'creditRating']].copy()
y = dataset["churn"]

In [None]:
X.loc[X['regionType'] == 's','regionType'] = "suburban"
X.loc[X['regionType'] == 't','regionType'] = "town"
X.loc[X['regionType'] == 'r','regionType'] = "rural"

In [None]:
# Fill missing region types with the most frequent value
regionType_imputer = sklearn.impute.SimpleImputer(strategy="most_frequent")
regionType_imputer.fit(X['regionType'].values.reshape(-1, 1))
X['regionType'] = regionType_imputer.transform(X['regionType'].values.reshape(-1, 1)).ravel()

# Treat an age of 0 as missing and replace it with the mean age
age_imputer = sklearn.impute.SimpleImputer(missing_values = 0, strategy="mean")
age_imputer.fit(X['age'].values.reshape(-1, 1))
X['age'] = age_imputer.transform(X['age'].values.reshape(-1, 1)).ravel()
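
The effect of `missing_values = 0` is easiest to see on a tiny example. The sketch below uses a made-up age column (not the ACME data) in which 0 marks an unknown age. It only illustrates how **SimpleImputer** behaves under that assumption.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy "age" column in which 0 marks an unknown age (made-up values for illustration)
ages = np.array([20.0, 0.0, 40.0, 0.0, 60.0]).reshape(-1, 1)

# Treat 0 as the missing-value marker and fill it with the mean of the other ages
toy_imputer = SimpleImputer(missing_values=0, strategy="mean")
imputed_ages = toy_imputer.fit_transform(ages).ravel()

print(imputed_ages)             # zeros replaced by the mean of 20, 40 and 60
print(toy_imputer.statistics_)  # the fitted fill value for the column
```

Note that the zeros are excluded when the mean is computed, so they do not drag the fill value down.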

In [None]:
creditRating_oe = sklearn.preprocessing.OrdinalEncoder()
creditRating_oe.fit(X['creditRating'].values.reshape(-1, 1))
X['creditRating'] = creditRating_oe.transform(X['creditRating'].values.reshape(-1, 1))

In [None]:
X = pd.get_dummies(X)
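
The two encoding steps above behave differently. **OrdinalEncoder** maps each category to a single number, while **pd.get_dummies** creates one indicator column per category for any remaining text columns. The sketch below shows both on a small made-up frame. The rating codes and values are hypothetical and not taken from the ACME data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Small made-up frame with two categorical columns and one numeric column
toy = pd.DataFrame({
    "creditRating": ["b", "a", "c", "a"],   # hypothetical rating codes
    "regionType": ["town", "rural", "town", "suburban"],
    "age": [25, 40, 31, 58],
})

# Ordinal encoding: categories are sorted and replaced by 0, 1, 2, ...
toy_oe = OrdinalEncoder()
toy["creditRating"] = toy_oe.fit_transform(toy[["creditRating"]]).ravel()
print(toy_oe.categories_)

# One-hot encoding: remaining text columns become indicator columns
toy_encoded = pd.get_dummies(toy)
print(list(toy_encoded.columns))
```

By default **OrdinalEncoder** orders categories lexicographically. If that does not match the real ordering of the ratings, the order can be given explicitly through its `categories` parameter.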

#### Examine Transformed Data

Examine the transformed dataset before modelling.

In [None]:
print(X.shape)
display(X.head())

In [None]:
X.columns

In [None]:
pandas_profiling.ProfileReport(X, 
                               minimal = True)

In [None]:
# A bug in pandas_profiling means plots don't appear after calling it. This re-import of matplotlib fixes the bug.
import matplotlib.pyplot as plt
%matplotlib inline

#### Partition Data

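Split the data into a **training set** (50%), a **validation set** (20%), and a **test set** (30%). The split is done in two steps: first 70% of the data is separated from the test set, then that 70% is divided into training and validation sets.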

In [None]:
X_train_plus_valid, X_test, y_train_plus_valid, y_test \
    = sklearn.model_selection.train_test_split(X, y, random_state=0,
                                               train_size = 0.7, 
                                               stratify = y)

X_train, X_valid, y_train, y_valid \
    = sklearn.model_selection.train_test_split(X_train_plus_valid, 
                                               y_train_plus_valid, 
                                               random_state=0, 
                                               train_size = 0.5/0.7,
                                               stratify=y_train_plus_valid)

Examine the partitions created. 

In [None]:
print(X_train.shape)
display(X_train.head())

In [None]:
print(X_valid.shape)
display(X_valid.head())

In [None]:
print(y_train.shape)
display(y_train.head())
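
To check that a two-step split with `train_size = 0.7` followed by `train_size = 0.5/0.7` gives roughly 50/20/30 partitions with the same class balance in each, the sketch below repeats the split on synthetic data. The array names and the 20% positive rate are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 20% positive class
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([0] * 800 + [1] * 200)

# Step 1: hold out 30% as the test set
X_tv, X_te, y_tv, y_te = train_test_split(
    X_toy, y_toy, train_size=0.7, stratify=y_toy, random_state=0)

# Step 2: split the remaining 70% so that training is 50% of the original data
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tv, y_tv, train_size=0.5 / 0.7, stratify=y_tv, random_state=0)

# Report the share of rows and the positive rate in each partition
for name, part in [("train", y_tr), ("valid", y_va), ("test", y_te)]:
    print(name, len(part) / len(y_toy), round(part.mean(), 3))
```

Because both splits use `stratify`, the proportion of positive cases stays close to 20% in every partition.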

### Different Model Types

One of the great things about using **sklearn** is that all of the model types follow the same pattern (create the model, **fit**, **predict**), so changing to another model type is very straightforward.

#### Bagging

Bagging (bootstrap aggregating) is a basic ensemble approach to machine learning: each base model is trained on a random sample of the training data and the predictions of the base models are combined. The **BaggingClassifier** model object in **sklearn** implements bagging. The key parameters when creating a **BaggingClassifier** model are:

- **base_estimator** = None: The base model to use in the ensemble (a decision tree if None). In scikit-learn 1.2 and later this parameter is named **estimator**.
- **n_estimators** = 10: The number of base models in the ensemble.
- **max_samples** = 1.0: The number of samples to draw from X to train each base estimator (either a fraction or a number of samples).
- **max_features** = 1.0: The number of features to draw from X to train each base estimator (either a fraction or a number of features).
- **bootstrap** = True: True for sampling with replacement, without replacement otherwise.
- **n_jobs** = None: Number of jobs to run in parallel. None means 1; -1 uses all available processors.
- **verbose** = 0: Controls how much output will be produced when methods are called - can be 0 (no output), 1, or 2 (maximum output).
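
To make the idea behind the parameters above concrete, here is a minimal hand-rolled sketch of bagging on synthetic data. It is not how **BaggingClassifier** is implemented internally (for example, sklearn averages predicted probabilities when the base model provides them, while this sketch uses a simple majority vote). All names and the synthetic dataset are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for a real training set
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.default_rng(0)

# Train each tree on a bootstrap sample: same size as the data, drawn with replacement
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_demo[idx], y_demo[idx]))

# Combine the trees by majority vote (labels are 0/1, so the mean is the vote share)
votes = np.stack([tree.predict(X_demo) for tree in trees])   # shape (25, 500)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)

train_accuracy = (bagged_pred == y_demo).mean()
print("Training accuracy of hand-rolled bagging:", train_accuracy)
```

In this sketch the loop length plays the role of **n_estimators**, and the size of each bootstrap sample plays the role of **max_samples**.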

In [None]:
# Train a bagging ensemble of 50 decision trees
my_model = sklearn.ensemble.BaggingClassifier(
    base_estimator = sklearn.tree.DecisionTreeClassifier(criterion="entropy",
                                                         min_samples_leaf = 0.05),
    n_estimators=50)
my_model.fit(X_train,y_train)

In [None]:
# Make a set of predictions for the validation data
y_pred = my_model.predict(X_valid)

# Print performance details
print(sklearn.metrics.classification_report(y_valid, y_pred))

# Print confusion matrix
print("Confusion Matrix")
pd.crosstab(y_valid, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

In [None]:
# Set up the parameter grid to search
param_grid = [
 {'n_estimators': list(range(5, 25, 1)),
 'base_estimator': [sklearn.tree.DecisionTreeClassifier(), sklearn.linear_model.LogisticRegression(max_iter = 1000)]}
]

# Perform the search
my_tuned_model = sklearn.model_selection.GridSearchCV(sklearn.ensemble.BaggingClassifier(), param_grid, cv=5)
my_tuned_model.fit(X_train_plus_valid, y_train_plus_valid)

# Print details
print("Best parameters set found on development set:")
print(my_tuned_model.best_params_)
print(my_tuned_model.best_score_)
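
The value printed as `best_score_` is the mean cross-validated score of the best parameter combination, not a score on the test set. The sketch below, on synthetic data and with a deliberately small hypothetical grid over a decision tree, shows how the full set of results behind that number can be inspected through `cv_results_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a small illustrative grid (3 x 2 = 6 combinations)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
demo_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 10]}

# Each combination is evaluated with 5-fold cross validation
demo_search = GridSearchCV(DecisionTreeClassifier(random_state=0), demo_grid, cv=5)
demo_search.fit(X_demo, y_demo)

# cv_results_ holds one row per parameter combination
results = pd.DataFrame(demo_search.cv_results_)
print(results[["params", "mean_test_score", "std_test_score", "rank_test_score"]]
      .sort_values("rank_test_score"))

# best_score_ is the highest mean cross-validated score in that table
print(demo_search.best_params_, demo_search.best_score_)
```

After the search, `best_estimator_` holds a model refitted on all of the data passed to `fit` using the best parameters, so it can be used directly for prediction.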

#### Random Forests

Random forests extend bagging of decision trees: each tree is trained on a bootstrap sample of the data, and only a random subset of the features is considered at each split. The **RandomForestClassifier** model object in **sklearn** implements random forests. The key parameters when creating a **RandomForestClassifier** model are:

- **criterion** = "gini": The criterion used for selecting partitions during training. One of either "entropy" or "gini".
- **max_depth** = None: The maximum depth that each tree is allowed to grow to.
- **min_samples_split** = 2: The minimum number of samples required to split an internal node.
- **min_samples_leaf** = 1: The minimum number of samples required to be at a leaf node.
- **n_estimators** = 100: The number of trees in the forest.
- **max_samples** = None: When bootstrap is True, the number of samples to draw from X to train each tree (either a fraction or a number of samples). None draws as many samples as there are rows in X.
- **max_features** = "sqrt": The number of features to consider when looking for the best split (a number of features, a fraction, "sqrt" or "log2").
- **bootstrap** = True: True for sampling with replacement. If False, the whole dataset is used to build each tree.
- **n_jobs** = None: Number of jobs to run in parallel. None means 1; -1 uses all available processors.
- **verbose** = 0: Controls how much output will be produced when methods are called - can be 0 (no output), 1, or 2 (maximum output).
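
As a rough illustration of these parameters, the sketch below fits a small forest on synthetic data and inspects the fitted object. The dataset is made up, and `oob_score=True` is an extra option not used elsewhere in this notebook. It asks sklearn to estimate accuracy from the rows left out of each tree's bootstrap sample.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 10 features, of which 4 carry signal
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=4, random_state=0)

# 50 trees, each split chosen from a random subset of 3 features
demo_forest = RandomForestClassifier(n_estimators=50, max_features=3,
                                     oob_score=True, random_state=0)
demo_forest.fit(X_demo, y_demo)

print(len(demo_forest.estimators_))               # one fitted tree per estimator
print(demo_forest.oob_score_)                     # out-of-bag accuracy estimate
print(demo_forest.feature_importances_.round(3))  # importances sum to 1
```

The feature importances give a quick, approximate view of which features the trees relied on most.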

In [None]:
# Do the same job with random forests
my_model = sklearn.ensemble.RandomForestClassifier(n_estimators=300,
                                                   max_features = 3,
                                                   min_samples_split=200)
my_model.fit(X_train,y_train)

In [None]:
# Make a set of predictions for the validation data
y_pred = my_model.predict(X_valid)

# Print performance details
print(sklearn.metrics.classification_report(y_valid, y_pred))

# Print confusion matrix
print("Confusion Matrix")
pd.crosstab(y_valid, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

In [None]:
# Set up the parameter grid to search
param_grid = [
 {'n_estimators': list(range(100, 501, 50)), 'max_features': list(range(1, 10, 2)), 'min_samples_split': list(range(20, 200, 50)) }
]

# Perform the search
my_tuned_model = sklearn.model_selection.GridSearchCV(sklearn.ensemble.RandomForestClassifier(), param_grid, cv=5)
my_tuned_model.fit(X_train_plus_valid, y_train_plus_valid)

# Print details
print("Best parameters set found on development set:")
print(my_tuned_model.best_params_)
print(my_tuned_model.best_score_)

#### Logistic Regression

**Logistic regression** models are a simple approach to binary classification. The **LogisticRegression** model object in **sklearn** implements logistic regression. The key parameters when creating a **LogisticRegression** model are:

- **penalty** = 'l2': Used to specify the type of regularisation used (one of 'l1', 'l2', 'elasticnet', or 'none').
- **C** = 1.0: Inverse of regularisation strength. Must be a positive number; smaller values specify stronger regularisation.
- **warm_start** = False: When set to True, reuse the solution of the previous call to fit as initialisation; otherwise, just erase the previous solution.
- **max_iter** = 100: Maximum number of iterations taken for the solvers to converge.
- **n_jobs** = None: Number of jobs to run in parallel. None means 1; -1 uses all available processors.
- **verbose** = 0: Controls how much output will be produced when methods are called - can be 0 (no output), 1, or 2 (maximum output).
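
Because **C** is the inverse of the regularisation strength, it is easy to read it the wrong way round. The sketch below, on synthetic data with illustrative values of C, fits the same model three times and prints the size of the coefficient vector, which shrinks as C gets smaller.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)

# Fit with strong (C=0.01), default (C=1) and weak (C=100) L2 regularisation
coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C}: coefficient norm = {coef_norms[C]:.3f}")
```

Smaller coefficients mean a smoother, more conservative model, which is why small values of C can help when a model over-fits.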


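Logistic regression solvers can struggle to converge unless the features are scaled to a small range (e.g. -1 to 1), so we rescale the data before fitting. The scaler is fitted on the training set only and then applied to the validation and test sets.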

In [None]:
cols = X_train.columns     # Save column names to avoid losing them when changing from pandas dataframe to numpy array
min_max_scaler = sklearn.preprocessing.MinMaxScaler(feature_range=(-1,1))
min_max_scaler.fit(X_train)
a = min_max_scaler.transform(X_train)
X_train_trans = pd.DataFrame(a, columns = cols) # Watch out for putting back in columns here

a = min_max_scaler.transform(X_valid)
X_valid_trans = pd.DataFrame(a, columns = cols) # Watch out for putting back in columns here

a = min_max_scaler.transform(X_test)
X_test_trans = pd.DataFrame(a, columns = cols) # Watch out for putting back in columns here

In [None]:
my_model = sklearn.linear_model.LogisticRegression(max_iter = 1000)
my_model.fit(X_train_trans,y_train)

In [None]:
# Make a set of predictions for the validation data
y_pred = my_model.predict(X_valid_trans)

# Print performance details
print(sklearn.metrics.classification_report(y_valid, y_pred))

# Print confusion matrix
print("Confusion Matrix")
pd.crosstab(y_valid, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)
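
The scaling and model-fitting steps above can also be chained into a single object, so that the scaler is always fitted on the same data as the model. The sketch below shows one way to do this with **sklearn.pipeline.make_pipeline** on synthetic data. It is an alternative arrangement, not the approach this notebook uses, and the variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data split into training and validation parts
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo,
                                          random_state=0, stratify=y_demo)

# The pipeline fits the scaler on X_tr, then fits the model on the scaled data
scaled_lr = make_pipeline(MinMaxScaler(feature_range=(-1, 1)),
                          LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)

# At prediction time the fitted scaler is applied automatically
valid_accuracy = scaled_lr.score(X_va, y_va)
print("Validation accuracy:", valid_accuracy)
```

A pipeline like this can be passed to **GridSearchCV** in place of a bare model, in which case the scaler is refitted inside each cross validation fold.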

#### Nearest Neighbour

**Nearest neighbour** models are a lazy approach to machine learning: no model is built during training, and each prediction is made by finding the training examples closest to the query. The **KNeighborsClassifier** model object in **sklearn** implements the k-nearest-neighbour algorithm. The key parameters when creating a **KNeighborsClassifier** model are:

- **n_neighbors** = 5: Number of neighbours to use.
- **weights** = 'uniform': Allows distance-weighted nearest neighbour by setting to 'distance'.
- **metric** = 'minkowski': The distance metric used to compare neighbours. With the default **p** = 2 this is Euclidean distance.
- **n_jobs** = None: Number of jobs to run in parallel. None means 1; -1 uses all available processors.
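
To show what a nearest neighbour prediction involves, the sketch below computes 5-nearest-neighbour predictions by hand with NumPy on synthetic data and compares them with **KNeighborsClassifier** using its default uniform weights and Euclidean distance. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: 200 reference (training) rows and 100 new rows to classify
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_ref, y_ref = X_demo[:200], y_demo[:200]
X_new = X_demo[200:]

# By hand: Euclidean distance from every new row to every reference row
k = 5
dists = np.linalg.norm(X_new[:, None, :] - X_ref[None, :, :], axis=2)  # (100, 200)

# Take the k closest reference rows and use a majority vote over their 0/1 labels
nearest = np.argsort(dists, axis=1)[:, :k]
manual_pred = (y_ref[nearest].mean(axis=1) > 0.5).astype(int)

# Same task with sklearn
sk_pred = KNeighborsClassifier(n_neighbors=k).fit(X_ref, y_ref).predict(X_new)
agreement = (manual_pred == sk_pred).mean()
print("Agreement between manual and sklearn predictions:", agreement)
```

Since predictions depend directly on distances, features measured on large scales can dominate the result, so rescaling features is often worth trying with this model type.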

In [None]:
# Train a nearest neighbour model
my_model = sklearn.neighbors.KNeighborsClassifier()
my_model = my_model.fit(X_train,y_train)

Assess the performance of the nearest neighbour model on the **validation set**.

In [None]:
# Make a set of predictions for the validation data
y_pred = my_model.predict(X_valid)

# Print performance details
print(sklearn.metrics.classification_report(y_valid, y_pred))

# Print confusion matrix
print("Confusion Matrix")
pd.crosstab(y_valid, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

In [None]:
# Set up the parameter grid to search
param_grid = [
    {'n_neighbors':[1, 5, 15],
     'weights':['uniform', 'distance']}
]

# Perform the search
my_tuned_model = sklearn.model_selection.GridSearchCV(sklearn.neighbors.KNeighborsClassifier(), param_grid, cv=5)
my_tuned_model.fit(X_train_plus_valid, y_train_plus_valid)

# Print details
print("Best parameters set found on development set:")
print(my_tuned_model.best_params_)
print(my_tuned_model.best_score_)

#### Gradient Boosting

Gradient boosting models are especially effective for problems based on tabular data. The **GradientBoostingClassifier** model object in **sklearn** implements the gradient boosting algorithm. The key parameters when creating a **GradientBoostingClassifier** model are:

- **n_estimators** = 100: The number of boosting stages to perform. Large numbers usually perform very well.
- **learning_rate** = 0.1: Learning rate shrinks the contribution of each tree. There is a trade-off between learning_rate and n_estimators.
- **subsample** = 1.0: The fraction of samples to be used for fitting the individual base learners.
- **max_features** = None: The number of features to consider when looking for the best split. Can be a number of features, 'sqrt' for the square root of the total number of features, 'log2' for log base 2 of the total number of features, or None for the total number of features.
- **min_samples_leaf** = 1: The minimum number of samples required to be at a leaf node in the trees in the ensemble.
- **validation_fraction** = 0.1: The proportion of training data to set aside as a validation set for early stopping. Only used when **n_iter_no_change** is set.
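
One way to see the effect of **n_estimators** is to track validation accuracy after each boosting stage. The sketch below does this on synthetic data using `staged_predict`, which yields the ensemble's predictions after 1, 2, ..., n_estimators stages. The dataset and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data split into training and validation parts
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo,
                                          random_state=0, stratify=y_demo)

# Fit 100 boosting stages
demo_gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                     random_state=0)
demo_gb.fit(X_tr, y_tr)

# Validation accuracy after each stage (one entry per boosting stage)
staged_acc = [accuracy_score(y_va, pred) for pred in demo_gb.staged_predict(X_va)]

# Stage (1-based) at which validation accuracy first peaks
best_stage = max(range(len(staged_acc)), key=staged_acc.__getitem__) + 1
print("Best stage:", best_stage, "accuracy:", staged_acc[best_stage - 1])
```

If accuracy peaks well before the final stage, a smaller **n_estimators** (or a smaller **learning_rate**) may be worth including in the parameter grid.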

In [None]:
# Do the same job with gradient boosting
my_model = sklearn.ensemble.GradientBoostingClassifier(n_estimators=300,
                                                       min_samples_split=200)
my_model.fit(X_train,y_train)

Assess the performance of the model on the **validation set**

In [None]:
# Make a set of predictions for the validation data
y_pred = my_model.predict(X_valid)

# Print performance details
print(sklearn.metrics.classification_report(y_valid, y_pred))

# Print confusion matrix
print("Confusion Matrix")
pd.crosstab(y_valid, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Choose parameters using a grid search

In [None]:
# Set up the parameter grid to search
param_grid = [
 {'n_estimators': list(range(100, 501, 50)), 
  'max_features': list(range(1, 10, 2)), 
  'min_samples_split': list(range(20, 200, 50)) }
]

# Perform the search
my_tuned_model = sklearn.model_selection.GridSearchCV(sklearn.ensemble.GradientBoostingClassifier(), param_grid, cv=5)
my_tuned_model.fit(X_train_plus_valid, y_train_plus_valid)

# Print details
print("Best parameters set found on development set:")
print(my_tuned_model.best_params_)
print(my_tuned_model.best_score_)