### Introduction
This churn-modeling notebook supports targeted customer-retention programs: it predicts customer behavior so that at-risk customers can be retained. It uses a bank's churn dataset, obtained from Kaggle [here](https://www.kaggle.com/shrutimechlearn/churn-modelling), to build a classifier. The goal is to predict, from customer details, whether a bank customer will churn (leave the bank and close the account) or remain a customer.

For this task, different ensemble methods are compared for binary classification. An ensemble is a composite model that combines several weaker classifiers to create an improved classifier.
In particular, the models employed are:
1. Voting (Hard and Soft)
2. Bagging
3. Boosting (AdaBoost, XGBoost, LightGBM)
4. Stacking

The performance metric used is accuracy.
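
As a minimal sketch of what accuracy measures, the function below computes the fraction of predictions that match the true labels. The function name and the label values are illustrative assumptions, not part of the notebook's data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels: 1 = churned, 0 = stayed
y_true = [1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0]
print(accuracy(y_true, y_pred))  # 4 of 5 correct -> 0.8
```

This is the same quantity the later cells obtain with `np.mean(predictions == y_test)`.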

#### Let's load the dataset

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split


data = pd.read_csv('bank_churn.csv')

In [2]:
data.head()

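| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |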


First, some preprocessing: remove the columns RowNumber, CustomerId, and Surname; convert the categorical columns into dummy variables (dropping the first level of each); and split the data into 80%-20% train/test sets.
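
To illustrate what dummy encoding "with dropping" does, here is a small pure-Python sketch. The helper `dummy_encode` and the sample rows are hypothetical; the notebook itself relies on `pd.get_dummies(..., drop_first=True)`, and this sketch only mimics the idea of dropping the first (alphabetically sorted) level so that it becomes the implicit baseline.

```python
def dummy_encode(values, prefix, drop_first=True):
    """One-hot encode a list of category labels into 0/1 columns."""
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]  # the dropped level becomes the baseline
    return {
        f"{prefix}_{cat}": [int(v == cat) for v in values]
        for cat in categories
    }


geography = ["France", "Spain", "France", "Germany"]  # illustrative rows
encoded = dummy_encode(geography, "Geography")
print(encoded)
# {'Geography_Germany': [0, 0, 0, 1], 'Geography_Spain': [0, 1, 0, 0]}
```

A row with zeros in both columns corresponds to the dropped level, France. Dropping one level avoids a redundant column, since its value is implied by the others.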

In [3]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

In [4]:
# Remove columns
data.drop(['RowNumber', 'CustomerId', 'Surname'], axis = 1, inplace = True)

# Create dummy variables
data = pd.get_dummies(data = data, columns = ['Geography','Gender'], drop_first = True)
data

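| | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | Geography_Germany | Geography_Spain | Gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 | 0 | 0 | 0 |
| 1 | 608 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 0 | 1 | 0 |
| 2 | 502 | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 | 0 | 0 | 0 |
| 3 | 699 | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 | 0 | 0 | 0 |
| 4 | 850 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9995 | 771 | 39 | 5 | 0.00 | 2 | 1 | 0 | 96270.64 | 0 | 0 | 0 | 1 |
| 9996 | 516 | 35 | 10 | 57369.61 | 1 | 1 | 1 | 101699.77 | 0 | 0 | 0 | 1 |
| 9997 | 709 | 36 | 7 | 0.00 | 1 | 0 | 1 | 42085.58 | 1 | 0 | 0 | 0 |
| 9998 | 772 | 42 | 3 | 75075.31 | 2 | 1 | 0 | 92888.52 | 1 | 1 | 0 | 1 |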


### Extract X and y, split into train and test sets, and scale the data

In [5]:
# Extract X and y
X = data.drop('Exited', axis = 1)
y = data.Exited

In [6]:
# Train test split to 80-20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 862)

In [7]:
# Scale the data
scaler = StandardScaler()
X_trainS = scaler.fit_transform(X_train)
X_testS = scaler.transform(X_test)
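
The scaling cell fits the scaler on the training set only and reuses those statistics for the test set. A minimal pure-Python sketch of that idea follows; the helper names and the numbers are assumptions for illustration. It uses the population standard deviation, which is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev


def fit_scaler(column):
    """Learn the mean and population standard deviation from training values."""
    return fmean(column), pstdev(column)


def transform(column, mean, std):
    """Standardize values using previously learned statistics."""
    return [(v - mean) / std for v in column]


train_age = [40, 50, 60]  # hypothetical training values
test_age = [50, 70]       # hypothetical test values

mean, std = fit_scaler(train_age)               # statistics come from train only
train_scaled = transform(train_age, mean, std)  # mean 0, unit variance
test_scaled = transform(test_age, mean, std)    # scaled with the train statistics
```

Reusing the training statistics keeps information about the test set out of the preprocessing step, so the test accuracy remains an honest estimate.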

### Data Modeling with Ensemble Methods
Let's build an ensemble classifier using each method and evaluate its accuracy on the test set. We will also apply preprocessing where needed and tune hyperparameters for some of the models.

### Voting (Soft) Classifier
Let's build a soft voting classifier.
It uses 5 classifiers, two of which are the same model (logistic regression) with different hyperparameters.
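
As a rough illustration of how soft voting differs from hard voting, the sketch below aggregates hypothetical class-1 probabilities from three classifiers for a single customer. The function names, the 0.5 threshold, and the probabilities are assumptions made for this example; scikit-learn's `VotingClassifier` handles this internally.

```python
from collections import Counter


def hard_vote(probabilities, threshold=0.5):
    """Majority vote over each classifier's predicted label."""
    labels = [int(p >= threshold) for p in probabilities]
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probabilities, threshold=0.5):
    """Average the predicted probabilities, then threshold the mean."""
    mean_p = sum(probabilities) / len(probabilities)
    return int(mean_p >= threshold)


# Hypothetical P(churn) from three classifiers for one customer
probs = [0.45, 0.45, 0.90]
print(hard_vote(probs))  # two of three vote "stay" -> 0
print(soft_vote(probs))  # mean is about 0.60 -> 1
```

Soft voting lets one confident classifier outweigh two lukewarm ones, which is why the two schemes can disagree on the same inputs.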

In [8]:
# Using logistic regression, random forest, gradient boosting and k-nearest neighbors classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier 
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier

from sklearn.metrics import accuracy_score

In [9]:
# Define the individual models

clf1 = LogisticRegression(penalty = 'none',random_state=1) # Unregularized; scikit-learn 1.2+ uses penalty=None instead of 'none'
clf2 = LogisticRegression(penalty = 'l1', solver = 'liblinear', random_state=1) # Regularized
clf3 = RandomForestClassifier(random_state=1)
clf4 = GradientBoostingClassifier(random_state=1)
clf5 = KNeighborsClassifier()

eclf_soft = VotingClassifier(estimators=[('lr1', clf1),('lr2',clf2),('rf',clf3),('gb',clf4),('knn',clf5)], voting='soft')

In [10]:
# Set up the parameter grid for the different base learners
param = [{'lr1__solver':['lbfgs','newton-cg'],
          
          'lr2__C':np.logspace(-10, 10, 5),
          
          'rf__n_estimators':np.linspace(100, 1000, 5, dtype = int),
                    
          'gb__learning_rate':[0.01,0.1],
          
          'knn__n_neighbors': [5,10],
          'knn__p': [1,2] }]


grid = GridSearchCV(estimator = eclf_soft, param_grid = param, cv=2, n_jobs=-2, scoring='accuracy')

# Fit the voting classifier
grid.fit(X_trainS, y_train)

# Predict on the test set with the best model found by the grid search
y_pred = grid.predict(X_testS)
print('Soft Voting:',np.mean(y_pred == y_test))

# https://stats.stackexchange.com/questions/320156/hard-voting-versus-soft-voting-in-ensemble-based-methods

Soft Voting: 0.85
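
To get a feel for the cost of the grid search above, this sketch counts the hyperparameter combinations in `param` and the resulting number of model fits with `cv=2`. The dictionary simply restates the number of candidate values per parameter from the grid.

```python
from itertools import product

# Number of candidate values per hyperparameter in the grid above
grid_sizes = {
    "lr1__solver": 2,
    "lr2__C": 5,
    "rf__n_estimators": 5,
    "gb__learning_rate": 2,
    "knn__n_neighbors": 2,
    "knn__p": 2,
}

# Enumerate every combination of candidate indices and count them
combinations = sum(1 for _ in product(*(range(n) for n in grid_sizes.values())))
cv_folds = 2
total_fits = combinations * cv_folds  # excludes the final refit on all training data

print(combinations, total_fits)  # 400 800
```

Each of those fits trains all five base models inside the voting classifier, which explains why this cell is slow and why only two folds are used.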


### Voting (Hard) Classifier
Now let's do the same with a hard voting classifier, and later compare the result with the soft voting classifier.

In [11]:
eclf_hard = VotingClassifier(estimators=[('lr1', clf1),('lr2',clf2),('rf',clf3),('gb',clf4),('knn',clf5)], voting='hard')

grid_h = GridSearchCV(estimator = eclf_hard, param_grid = param, cv=2, n_jobs=-2, scoring='accuracy')

# Fit the voting classifier
grid_h.fit(X_trainS, y_train)

# Predict on the test set with the best model found by the grid search
y_pred = grid_h.predict(X_testS)
print('Hard Voting:',np.mean(y_pred == y_test))

Hard Voting: 0.84


**Observation:** Soft voting scored slightly higher (0.85 accuracy) than hard voting (0.84).

As a comparison, let's also fit the individual models and see if there's really an improvement with the voting classifier.

In [12]:
# 1. Set up the names and estimators selected in a list

names = ['LR1_unreg','LR2_reg','RF','GB','KNN']

classifiers = [LogisticRegression(penalty = 'none',random_state=1),
               LogisticRegression(penalty = 'l1', solver = 'liblinear', random_state=1),
               RandomForestClassifier(random_state=1),
               GradientBoostingClassifier(random_state=1),
               KNeighborsClassifier()]

parameters = [ {'solver':['lbfgs','newton-cg']},
              {'C':np.logspace(-10, 10, 5)},
              {'n_estimators':np.linspace(100, 1000, 5, dtype = int)},
              {'learning_rate':[0.01,0.1]}, 
              {'n_neighbors': [5,10],'p': [1,2]} ]


result =[] # empty list to store accuracy score
best_params = [] #empty list to store best parameters

# 2. For each model, run a cross-validated grid search, fit it, and score it on the test set
for name, classifier, params in zip(names, classifiers, parameters):
    
    grid = GridSearchCV(estimator = classifier, param_grid = params, cv=2, n_jobs=-2, scoring='accuracy')
    grid.fit(X_trainS, y_train)

    result.append(accuracy_score(y_test,grid.predict(X_testS)))
    best_params.append(grid.best_params_)
    
    
# 3. Ranked order of classifiers based on accuracy
data_features = sorted(list(zip(names,best_params,result)), key = lambda x: x[2], reverse = True)
pd.DataFrame(data_features, columns = ['classifier','params','score'])

| | classifier | params | score |
|---|---|---|---|
| 0 | RF | {'n_estimators': 775} | 0.8535 |
| 1 | GB | {'learning_rate': 0.1} | 0.8525 |
| 2 | KNN | {'n_neighbors': 10, 'p': 1} | 0.828 |
| 3 | LR2_reg | {'C': 1.0} | 0.8035 |
| 4 | LR1_unreg | {'solver': 'lbfgs'} | 0.803 |


**Observation:** The Random Forest (0.8535) and Gradient Boosting (0.8525) classifiers scored slightly higher than both the soft and hard voting classifiers. Interestingly, voting did not outperform every individual model even after tuning; further tuning might improve the results.

### Bagged Logistic Regression
Bagging, short for Bootstrap Aggregating, mainly helps reduce variance. Each classifier is trained on a bootstrap sample of the training data, and the ensemble then predicts by aggregating the results from all classifiers, usually through a majority vote. The base learners can be trained in parallel, which makes the method efficient.
Random Forest is an example of bagging.
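
The two ingredients of bagging, bootstrap sampling and aggregation, can be sketched in a few lines of plain Python. The helper names, the row indices, and the predictions below are hypothetical; `BaggingClassifier` performs both steps internally.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(labels):
    """Return the most common predicted label."""
    return Counter(labels).most_common(1)[0][0]


rng = random.Random(1)
rows = list(range(10))  # stand-in for training row indices
samples = [bootstrap_sample(rows, rng) for _ in range(3)]  # one per base learner

# Each base learner would be fitted on its own sample; here we only
# aggregate hypothetical predictions from three fitted learners.
print(majority_vote([1, 0, 1]))  # 1
```

Because each sample is drawn with replacement, some rows repeat and others are left out, so every base learner sees a slightly different view of the training data.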

In [13]:
from sklearn.ensemble import BaggingClassifier

# Instantiate the model
bag_clf = BaggingClassifier(LogisticRegression(penalty = 'l1', solver = 'liblinear', random_state=1), 
                           n_estimators = 100)


# Fit the model
bag_clf.fit(X_trainS, y_train)

# Prediction
print('Bagged:',np.mean(bag_clf.predict(X_testS) == y_test))

Bagged: 0.8035


**Observation:** Bagged logistic regression scores 0.8035, lower than the voting classifiers and the same as the single regularized logistic regression (0.8035). Hyperparameter tuning could result in a better score.

## Boosting
Boosting trains models sequentially, with each model trying to correct the errors of its predecessor.

### AdaBoost
AdaBoost is short for Adaptive Boosting. Instead of modeling the residual, it models the data at each round but assigns weights to the observations. Misclassified observations receive more weight in the next round so that later learners focus on them. Each learner is also assigned a weight, based on the errors it makes and scaled by the learning rate.
AdaBoost is sensitive to noisy data and outliers because it keeps increasing the weight of points that are hard to classify.

If the base estimator is None, the base estimator is a DecisionTreeClassifier initialized with max_depth=1.
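
The reweighting idea can be made concrete with one boosting round written in plain Python. This is a sketch of the binary case under stated assumptions (weights sum to 1, weighted error strictly between 0 and 1); the function name and the sample data are hypothetical, and it is not a reproduction of scikit-learn's implementation.

```python
import math


def adaboost_round(weights, misclassified, learning_rate=1.0):
    """One reweighting step for binary AdaBoost (a sketch).

    weights: current sample weights (assumed to sum to 1)
    misclassified: booleans, True where the weak learner was wrong
    """
    error = sum(w for w, bad in zip(weights, misclassified) if bad)
    alpha = learning_rate * math.log((1 - error) / error)  # learner weight
    updated = [w * math.exp(alpha) if bad else w
               for w, bad in zip(weights, misclassified)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


weights = [0.2] * 5  # five samples, equal weight
misclassified = [False, False, True, False, False]
alpha, new_weights = adaboost_round(weights, misclassified)
print(new_weights)  # the misclassified sample now carries about half the weight
```

With a learning rate below 1, `alpha` shrinks and the weights shift more gently, which is the trade-off explored by the `learning_rate` values in the grid search.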

In [14]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# With parameter tuning

# Instantiate the model
adabc = AdaBoostClassifier(random_state = 1)

adabc_param = {'n_estimators': [10,50,100],
    'base_estimator': [None,LogisticRegression(),SVC(probability=True, kernel='linear')], #DecisionTreeClassifier(max_depth=5)
    'learning_rate': [0.1,1]}

adabc_grid = GridSearchCV(estimator=adabc, param_grid=adabc_param, cv = 2, scoring = 'accuracy', n_jobs = -2)

# Fit the model
adabc_grid.fit(X_trainS, y_train)

# Prediction
print("Best params:",adabc_grid.best_params_)
print("Accuracy tuned AdaBoost",np.mean(adabc_grid.predict(X_testS) == y_test))

# https://www.kaggle.com/prashant111/adaboost-classifier-tutorial

Best params: {'base_estimator': None, 'learning_rate': 0.1, 'n_estimators': 100}
Accuracy tuned AdaBoost 0.8455


### XGBoost
XGBoost is similar to Gradient Boosting but more efficient, with improvements in regularization, parallel processing, handling of missing values, and early stopping. It uses decision trees as base learners.

In [15]:
# Split the training data to get a validation set
X_t, X_valid, y_t, y_valid = train_test_split(X_train, y_train, test_size = 0.25, random_state = 1) 

# Scale the data
scaler = StandardScaler()
X_tS = scaler.fit_transform(X_t)
X_vS = scaler.transform(X_valid)
X_teS = scaler.transform(X_test)

In [16]:
# pip install xgboost
from xgboost import XGBClassifier

# Instantiate the model
xgbc = XGBClassifier(objective= 'binary:logistic', n_estimators = 100,
                        max_depth = 3, learning_rate = 0.1, n_jobs = -2, random_state = 1)

# Fit the model
xgbc.fit(X_tS, y_t, eval_set = [(X_vS, y_valid)], 
            early_stopping_rounds = 5) # Implementing early stopping

# Prediction
print("Accuracy XGB",np.mean(xgbc.predict(X_teS) == y_test))

[0]	validation_0-error:0.16150
Will train until validation_0-error hasn't improved in 5 rounds.
[1]	validation_0-error:0.15700
[2]	validation_0-error:0.15900
[3]	validation_0-error:0.15950
[4]	validation_0-error:0.14600
[5]	validation_0-error:0.14600
[6]	validation_0-error:0.14600
[7]	validation_0-error:0.14500
[8]	validation_0-error:0.14500
[9]	validation_0-error:0.14500
[10]	validation_0-error:0.14500
[11]	validation_0-error:0.14450
[12]	validation_0-error:0.14050
[13]	validation_0-error:0.14050
[14]	validation_0-error:0.14050
[15]	validation_0-error:0.14000
[16]	validation_0-error:0.13750
[17]	validation_0-error:0.14000
[18]	validation_0-error:0.14000
[19]	validation_0-error:0.13900
[20]	validation_0-error:0.14000
[21]	validation_0-error:0.13900
Stopping. Best iteration:
[16]	validation_0-error:0.13750

Accuracy XGB 0.852
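
The early-stopping behavior visible in the log above can be sketched as a small function. The name `early_stopping`, the `patience` parameter, and the strict-improvement rule are assumptions made for this illustration, not XGBoost's actual code; the error values are copied from the log.

```python
def early_stopping(errors, patience):
    """Return (best_iteration, stopped_at) for a list of validation errors."""
    best_iter, best_err = 0, errors[0]
    for i, err in enumerate(errors[1:], start=1):
        if err < best_err:
            best_iter, best_err = i, err  # new best score, reset the wait
        elif i - best_iter >= patience:
            return best_iter, i           # no improvement for `patience` rounds
    return best_iter, len(errors) - 1     # ran out of rounds without stopping


# Validation errors for iterations 0..21, taken from the training log
validation_errors = [
    0.16150, 0.15700, 0.15900, 0.15950, 0.14600, 0.14600, 0.14600, 0.14500,
    0.14500, 0.14500, 0.14500, 0.14450, 0.14050, 0.14050, 0.14050, 0.14000,
    0.13750, 0.14000, 0.14000, 0.13900, 0.14000, 0.13900,
]

print(early_stopping(validation_errors, patience=5))  # (16, 21)
```

The result matches the log: the best iteration is 16 and training stops at iteration 21, after five rounds without a better validation error.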


In [17]:
# With parameter tuning

# Instantiate the model
xgbc = XGBClassifier(objective= 'binary:logistic',random_state = 1)

xgbc_param = {'n_estimators': [100,200,500,1000],
    'max_depth': [2,3,5,10],
    'learning_rate': [0.1, 0.01, 0.05]}

xgbc_grid = GridSearchCV(estimator=xgbc, param_grid=xgbc_param, cv = 2, scoring = 'accuracy', n_jobs = -2)

# Fit the model
xgbc_grid.fit(X_trainS, y_train, eval_metric = 'logloss')
# Typical eval metrics: rmse for regression, logloss for classification, mean average precision for ranking

# Prediction
print("Best params:",xgbc_grid.best_params_)
print("Accuracy tuned XGB",np.mean(xgbc_grid.predict(X_testS) == y_test))

# References
# https://www.mikulskibartosz.name/xgboost-hyperparameter-tuning-in-python-using-grid-search/
# https://github.com/dmlc/xgboost/blob/master/doc/parameter.rst
# https://xgboost.readthedocs.io/en/latest/python/python_api.html

Best params: {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 500}
Accuracy tuned XGB 0.853


**Observation:** Tuned XGBoost (0.853) is comparable to the Random Forest classifier (0.8535) and better than both voting classifiers.

### LightGBM
LightGBM works well with large data and uses decision trees as base learners. Other algorithms typically grow trees level by level, while LightGBM grows them leaf by leaf, which makes it much faster. Because it produces more complex trees, it can give higher accuracy but may overfit, so the maximum depth needs to be controlled.
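
A short arithmetic sketch shows why depth matters for leaf-wise growth: with the same number of leaves, a tree grown along a single branch can become far deeper than a balanced one. The function names are hypothetical, and the worst case is a theoretical bound rather than typical LightGBM behavior.

```python
def balanced_depth(num_leaves):
    """Smallest depth a binary tree needs to hold this many leaves."""
    return (num_leaves - 1).bit_length()  # integer ceil(log2(num_leaves))


def worst_case_depth(num_leaves):
    """Depth if every split keeps extending the same branch."""
    return num_leaves - 1


print(balanced_depth(31), worst_case_depth(31))  # 5 30
```

With `num_leaves = 31`, as in the cell that follows, a balanced tree needs only 5 levels, while an unconstrained leaf-wise tree could in principle reach 30, which is one reason to cap the depth.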

In [18]:
# pip install lightgbm
from lightgbm import LGBMClassifier

# Instantiate the model
lgbmc = LGBMClassifier(objective= 'binary', n_estimators = 100, metric = 'auc', num_leaves = 31,
                        learning_rate = 0.1, n_jobs = -2, random_state = 1)

# Fit the model
lgbmc.fit(X_tS, y_t, eval_set = [(X_vS, y_valid)], eval_metric='auc',
            early_stopping_rounds = 5) # Implementing early stopping

# Prediction
print("Accuracy LGBM",np.mean(lgbmc.predict(X_teS) == y_test))

[1]	valid_0's auc: 0.854731
Training until validation scores don't improve for 5 rounds
[2]	valid_0's auc: 0.857237
[3]	valid_0's auc: 0.860349
[4]	valid_0's auc: 0.860698
[5]	valid_0's auc: 0.860753
[6]	valid_0's auc: 0.864412
[7]	valid_0's auc: 0.866179
[8]	valid_0's auc: 0.867591
[9]	valid_0's auc: 0.867086
[10]	valid_0's auc: 0.866502
[11]	valid_0's auc: 0.867104
[12]	valid_0's auc: 0.866083
[13]	valid_0's auc: 0.865924
Early stopping, best iteration is:
[8]	valid_0's auc: 0.867591
Accuracy LGBM 0.845


In [19]:
# With parameter tuning

# Instantiate the model
lgbmc = LGBMClassifier(objective='binary', random_state = 1)

lgbmc_param = {'n_estimators':[100,200,500,1000],
               'metric': ['auc', 'binary_logloss'],
               'num_leaves': [10,31,50],
               'learning_rate':[0.01, 0.1],
               'min_data_in_leaf': [20, 30, 50, 100],}

lgbmc_grid = GridSearchCV(estimator=lgbmc, param_grid=lgbmc_param, cv = 2, scoring = 'accuracy', n_jobs = -2)

# Fit the model
lgbmc_grid.fit(X_trainS, y_train)
# rmse for regression, and logloss for classification, mean average precision for ranking

# Prediction
print("\nBest params:",lgbmc_grid.best_params_)
print("\nAccuracy tuned LGBM",np.mean(lgbmc_grid.predict(X_testS) == y_test))


# https://www.kaggle.com/garethjns/microsoft-lightgbm-with-parameter-tuning-0-823


Best params: {'learning_rate': 0.01, 'metric': 'auc', 'min_data_in_leaf': 20, 'n_estimators': 500, 'num_leaves': 10}

Accuracy tuned LGBM 0.852


**Observation:** LightGBM gives a good result of 0.852 with hyperparameter tuning, but XGBoost gave the best boosting result with 0.853 accuracy; the other boosting scores are comparable. More hyperparameter tuning might improve the results.

### Stacking
Lastly, let's build a stacking ensemble.  
It uses the same models as the voting classifiers.  
A Random Forest serves as the blender.
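
In stacking, the blender is trained on the base models' predictions rather than on the original features. The sketch below shows only that meta-feature construction; the toy "models" are hypothetical stand-ins for fitted classifiers, each mapping a sample to a class-1 probability.

```python
def build_meta_features(models, X):
    """One row per sample, one column per base model's P(class = 1)."""
    return [[predict_proba(x) for predict_proba in models.values()] for x in X]


# Hypothetical stand-ins for fitted base models: each maps a sample to P(churn)
models = {
    "model_a": lambda x: 0.9 if x["age"] > 45 else 0.2,
    "model_b": lambda x: 0.7 if x["active"] == 0 else 0.3,
}

# Hold-out samples that the base models were NOT trained on
holdout = [{"age": 50, "active": 0}, {"age": 30, "active": 1}]
meta_X = build_meta_features(models, holdout)
print(meta_X)  # [[0.9, 0.7], [0.2, 0.3]]
```

The blender is then fitted on `meta_X` together with the hold-out labels. Using samples the base models have not seen keeps the blender from learning from overconfident, in-sample predictions.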

In [20]:
models = {'lr1_unreg': clf1, 'lr2_reg': clf2, 'rf': clf3, 'gb': clf4, 'knn': clf5}

# Also define the blender
blender= RandomForestClassifier(random_state=1)

In [21]:
# Split the training data into two parts, one to train the weak learners, another to train the blender
X_trainS1, X_trainS2, y_train1, y_train2 = train_test_split(X_trainS, y_train, test_size = 0.5, random_state = 1)

In [22]:
# Train the weak learners
for name, model in models.items():
    model.fit(X_trainS1, y_train1)

In [23]:
# Train the blender
# Get the prediction
predictions = pd.DataFrame() # Set up a dataframe to store the predictions
for name, model in models.items():
    predictions[name] = model.predict_proba(X_trainS2)[:,1] # Taking probability of positive class in binary classification

# Get the blender
scaler_blend = StandardScaler() # Scale the predictions 
predictions_scale = scaler_blend.fit_transform(predictions)
blender.fit(predictions_scale, y_train2)

RandomForestClassifier(random_state=1)

In [24]:
# Perform evaluation
# First send the data through the weak learners
predictions = pd.DataFrame() # Set up a dataframe to store the predictions
for name, model in models.items():
    predictions[name] = model.predict_proba(X_testS)[:,1] # Taking probability of positive class in binary classification
    
# Prediction through the blender, and evaluate
predictions_scale = scaler_blend.transform(predictions)
np.mean(blender.predict(predictions_scale) == y_test)

0.851

**Observation:** Stacking accuracy (0.851) is close to the tuned LightGBM result (0.852).
Among the ensemble approaches compared, boosting (tuned XGBoost, 0.853) gave the highest accuracy and bagged logistic regression (0.8035) the lowest.
The single best score overall came from the individually tuned Random Forest at 0.8535.