# Week05 -

In this week we look at using ensembles of models to improve the performance of our models. We will look at the following:

* RandomForest
* AdaBoost
* Gradiant Boosting
* XG Boosting


## Introduction and Overview


In this notebook, we will reuse the Universal Bank dataset.

This time, we are developing a model to predict whether a customer will accept a personal loan offer. The dataset contains 5000 observations and 14 variables. The data is available on one of my GitHub repos.

## Install and import necessary packages

In [None]:
# You may need to install xgboost (it's not part of the sklearn package)
import sys
!conda install --yes --prefix {sys.prefix} xgboost

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: C:\Users\prath\anaconda3

  added / updated specs:
    - xgboost


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    _py-xgboost-mutex-2.0      |            cpu_0          12 KB
    conda-23.1.0               |   py39haa95532_0         946 KB
    libxgboost-1.7.3           |       hd77b12b_0         1.5 MB
    py-xgboost-1.7.3           |   py39haa95532_0         197 KB
    ruamel.yaml-0.17.21        |   py39h2bbff1b_0         174 KB
    ruamel.yaml.clib-0.2.6     |   py39h2bbff1b_1         101 KB
    xgboost-1.7.3              |   py39haa95532_0          12 KB
    ------------------------------------------------------------
                                           Total:         2.9 MB

The following NEW packages will be INSTALLED:


In [3]:
# import packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from xgboost import XGBClassifier

np.random.seed(1)

## Load data 

In [4]:
df = pd.read_csv('https://github.com/timcsmith/MIS536-Public/raw/master/Data/UniversalBank.csv')
df.head(5)

Unnamed: 0,ID,Age,Experience,Income,ZIP Code,Family,CCAvg,Education,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard
0,1,25,1,49,91107,4,1.6,1,0,0,1,0,0,0
1,2,45,19,34,90089,3,1.5,1,0,0,1,0,0,0
2,3,39,15,11,94720,1,1.0,1,0,0,0,0,0,0
3,4,35,9,100,94112,1,2.7,2,0,0,0,0,0,0
4,5,35,8,45,91330,4,1.0,2,0,0,0,0,0,1


## Explore the dataset

In [5]:
# Explore the dataset
# read the first row of the dataset 
print(df.head())
print(df.columns)
print(df.describe())
print(df.info())

   ID  Age  Experience  Income  ZIP Code  Family  CCAvg  Education  Mortgage  \
0   1   25           1      49     91107       4    1.6          1         0   
1   2   45          19      34     90089       3    1.5          1         0   
2   3   39          15      11     94720       1    1.0          1         0   
3   4   35           9     100     94112       1    2.7          2         0   
4   5   35           8      45     91330       4    1.0          2         0   

   Personal Loan  Securities Account  CD Account  Online  CreditCard  
0              0                   1           0       0           0  
1              0                   1           0       0           0  
2              0                   0           0       0           0  
3              0                   0           0       0           0  
4              0                   0           0       0           1  
Index(['ID', 'Age', 'Experience', 'Income', 'ZIP Code', 'Family', 'CCAvg',
       'Education'

## Clean/transform data (where necessary)

In [6]:
# based on findings from data exploration, we need to clean up colum names, as there are some leading whitespace characters
df.columns = [s.strip() for s in df.columns] 
df.columns

Index(['ID', 'Age', 'Experience', 'Income', 'ZIP Code', 'Family', 'CCAvg',
       'Education', 'Mortgage', 'Personal Loan', 'Securities Account',
       'CD Account', 'Online', 'CreditCard'],
      dtype='object')

Drop the columns we are not using as predictors (see previous notebooks -- we are given a subset of input variables to consider)

In [7]:
df = df.drop(columns=['ID', 'ZIP Code'])

In [8]:
# translation education categories into dummy vars
df = df.join(pd.get_dummies(df['Education'], prefix='Edu', drop_first=True))
df.drop('Education', axis=1, inplace = True)

df.head(3)

Unnamed: 0,Age,Experience,Income,Family,CCAvg,Mortgage,Personal Loan,Securities Account,CD Account,Online,CreditCard,Edu_2,Edu_3
0,25,1,49,4,1.6,0,0,1,0,0,0,0,0
1,45,19,34,3,1.5,0,0,1,0,0,0,0,0
2,39,15,11,1,1.0,0,0,0,0,0,0,0,0


## Split data intro training and validation sets

In [9]:
# construct datasets for analysis
target = 'Personal Loan'
predictors = list(df.columns)
predictors.remove(target)
X = df[predictors]
y = df[target]

In [10]:
# create the training set and the test set 
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=1)

## Prediction with Decision Tree (using default parameters)



You can find details about SKLearm's DecisionTree classifier [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

Create a decision tree using all of the default parameters

In [11]:
dtree=DecisionTreeClassifier()

Fit the model to the training data

In [12]:
_ = dtree.fit(X_train, y_train)

Review of the performance of the model on the validation/test data

In [13]:
y_pred = dtree.predict(X_test)

In [14]:
print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

      Model             Score       
************************************
>> Recall Score:  0.9060402684563759
Accuracy Score:   0.9873333333333333
Precision Score:  0.9642857142857143
F1 Score:         0.9342560553633219


Save the recall result from this model

In [15]:
dtree_recall = recall_score(y_test, y_pred)

## Prediction with RandomForest (using default parameters)

Like all our classifiers, RandomeForestClassifier has a number of parameters that can be adjusted/tuned. In this example below, we simply accept the defaults. You may want to experiment with changing the defaul values and also use GridSearchCV to explore ranges of values.

* n_estimators: The number of trees in the forsest
    - A deeper tree might increase the performance, but also the complexity and chances to overfit.
    - The value must be an integer greater than 0. Default is 100.  
* max_depth: The maximum depth per tree. 
    - Deeper trees might increase the performance, but also the complexity and chances to overfit.
    - The value must be an integer greater than 0. Default is None, which allows the tree to grow without constraint.
* See the SciKit Learn documentation for more details. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html


In [16]:
rforest = RandomForestClassifier()

In [17]:
_ = rforest.fit(X_train, y_train)

In [18]:
y_pred = rforest.predict(X_test)

In [19]:
print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

      Model             Score       
************************************
>> Recall Score:  0.8456375838926175
Accuracy Score:   0.9833333333333333
Precision Score:  0.984375
F1 Score:         0.9097472924187726


Save the recall result from this model

In [20]:
rforest_recall = recall_score(y_test, y_pred)

RandomForestClassifier using RandomizedSearchCV

In [22]:
score_measure = "recall"
kfolds = 5

param_grid = {
    'n_estimators':np.arange(100,250),
    'min_samples_split': np.arange(1,25),  
    'min_samples_leaf': np.arange(1,20),
    'min_weight_fraction_leaf':np.arange(0.0,1.5),
    'min_impurity_decrease': np.arange(0.0001, 0.01, 0.0005),
    'max_leaf_nodes': np.arange(5,50), 
    'max_depth': np.arange(1,50), 
    'criterion': ['entropy', 'gini'],
}


rand_search = RandomizedSearchCV(estimator = rforest,param_distributions=param_grid, cv=kfolds, n_iter=400,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

bestRecallTree = rand_search.best_estimator_
import warnings
warnings.filterwarnings("ignore")

Fitting 5 folds for each of 400 candidates, totalling 2000 fits
The best recall score is 0.8609226594301221
... with parameters: {'n_estimators': 225, 'min_weight_fraction_leaf': 0.0, 'min_samples_split': 13, 'min_samples_leaf': 3, 'min_impurity_decrease': 0.0006000000000000001, 'max_leaf_nodes': 43, 'max_depth': 40, 'criterion': 'entropy'}


In [23]:
y_pred = rand_search.predict(X_test)

In [24]:
print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

      Model             Score       
************************************
>> Recall Score:  0.7718120805369127
Accuracy Score:   0.9746666666666667
Precision Score:  0.9663865546218487
F1 Score:         0.8582089552238806


In [25]:
rforest_rsearch_recall = recall_score(y_test, y_pred)

## Prediction with ADABoost (using default parameters)

Like all our classifiers, ADABoostClassifier has a number of parameters that can be adjusted/tuned. In this example below, we simply accept the defaults. You may want to experiment with changing the defaul values and also use GridSearchCV to explore ranges of values.

* max_depth: The maximum depth per tree. 
    - A deeper tree might increase the performance, but also the complexity and chances to overfit.
    - The value must be an integer greater than 0. Default is None (meaning, the tree can grow to a point where all leaves have 1 observation).
* learning_rate: The learning rate determines the step size at each iteration while your model optimizes toward its objective. 
    - A low learning rate makes computation slower, and requires more rounds to achieve the same reduction in residual error as a model with a high learning rate. But it optimizes the chances to reach the best optimum.
    - Larger learning rates may not converge on a solution.
    - The value must be between 0 and 1. Default is 0.3.
* n_estimators: The number of trees in our ensemble. 
    - Equivalent to the number of boosting rounds.
    - The value must be an integer greater than 0. Default is 100.
* See the SciKit Learn documentation for more details. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html

In [26]:
aboost = AdaBoostClassifier()

In [27]:
_ = aboost.fit(X_train, y_train)

In [28]:
y_pred = aboost.predict(X_test)

In [29]:
print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

      Model             Score       
************************************
>> Recall Score:  0.7248322147651006
Accuracy Score:   0.9626666666666667
Precision Score:  0.8780487804878049
F1 Score:         0.7941176470588235


Save the recall result from this model

In [30]:
aboost_recall = recall_score(y_test, y_pred)

## Prediction with GradientBoostingClassifier

Like all our classifiers, GradientBoostingClassifier has a number of parameters that can be adjusted/tuned. In this example below, we simply accept the defaults. You may want to experiment with changing the defaul values and also use GridSearchCV to explore ranges of values.

* max_depth: The maximum depth per tree. 
    - A deeper tree might increase the performance, but also the complexity and chances to overfit.
    - The value must be an integer greater than 0. Default is None (meaning, the tree can grow to a point where all leaves have 1 observation).
* learning_rate: The learning rate determines the step size at each iteration while your model optimizes toward its objective. 
    - A low learning rate makes computation slower, and requires more rounds to achieve the same reduction in residual error as a model with a high learning rate. But it optimizes the chances to reach the best optimum.
    - Larger learning rates may not converge on a solution.
    - The value must be between 0 and 1. Default is 0.3.
* n_estimators: The number of trees in our ensemble. 
    - Equivalent to the number of boosting rounds.
    - The value must be an integer greater than 0. Default is 100.
* See the SciKit Learn documentation for more details. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html

In [31]:
gboost = GradientBoostingClassifier()

In [32]:
_ = gboost.fit(X_train, y_train)

In [33]:
y_pred = gboost.predict(X_test)

In [34]:
print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

      Model             Score       
************************************
>> Recall Score:  0.8657718120805369
Accuracy Score:   0.9826666666666667
Precision Score:  0.9555555555555556
F1 Score:         0.9084507042253522


Save the recall result from this model

In [35]:
gboost_recall = recall_score(y_test, y_pred)

## Prediction with XGBoost

Like all our classifiers, XGBoost has a number of parameters that can be adjusted/tuned. In this example below, we simply accept the defaults. You may want to experiment with changing the defaul values and also use GridSearchCV to explore ranges of values.

* max_depth: The maximum depth per tree. 
    - A deeper tree might increase the performance, but also the complexity and chances to overfit.
    - The value must be an integer greater than 0. Default is 6.
* learning_rate: The learning rate determines the step size at each iteration while your model optimizes toward its objective. 
    - A low learning rate makes computation slower, and requires more rounds to achieve the same reduction in residual error as a model with a high learning rate. But it optimizes the chances to reach the best optimum.
    - The value must be between 0 and 1. Default is 0.3.
* n_estimators: The number of trees in our ensemble. 
    - Equivalent to the number of boosting rounds.
    - The value must be an integer greater than 0. Default is 100.
* colsample_bytree: Represents the fraction of columns to be randomly sampled for each tree. 
    - It might improve overfitting.
    - The value must be between 0 and 1. Default is 1.
* subsample: Represents the fraction of observations to be sampled for each tree. 
    - A lower values prevent overfitting but might lead to under-fitting.
    - The value must be between 0 and 1. Default is 1.
* See the XGBoost documentation for more details. https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn 

In [36]:
xgboost = XGBClassifier()

In [37]:
_ = xgboost.fit(X_train, y_train)

In [38]:
y_pred = xgboost.predict(X_test)

In [39]:
print(f"{'Model':^18}{'Score':^18}")
print("************************************")
print(f"{'>> Recall Score:':18}{recall_score(y_test, y_pred)}")
print(f"{'Accuracy Score: ':18}{accuracy_score(y_test, y_pred)}")
print(f"{'Precision Score: ':18}{precision_score(y_test, y_pred)}")
print(f"{'F1 Score: ':18}{f1_score(y_test, y_pred)}")

      Model             Score       
************************************
>> Recall Score:  0.8926174496644296
Accuracy Score:   0.9853333333333333
Precision Score:  0.9568345323741008
F1 Score:         0.9236111111111113


Save the recall result from this model

In [40]:
xgboost = recall_score(y_test, y_pred)

## Step 6: Summarize results    

As usual -- in this section you provide a recap your approach, results, and discussion of findings. 


In [42]:
print("Recall scores...")
print(f"{'Decision Tree:':33}{dtree_recall}")
print(f"{'Random Forest:':33}{rforest_recall}")
print(f"{'Random Forest using RandomSearch:':18}{rforest_rsearch_recall}")
print(f"{'Ada Boosted Tree:':33}{aboost_recall}")
print(f"{'Gradient Tree:':33}{gboost_recall}")
print(f"{'XGBoost Tree:':33}{xgboost}")

Recall scores...
Decision Tree:                   0.9060402684563759
Random Forest:                   0.8456375838926175
Random Forest using RandomSearch:0.7718120805369127
Ada Boosted Tree:                0.7248322147651006
Gradient Tree:                   0.8657718120805369
XGBoost Tree:                    0.8926174496644296


The Recall value of the Random Forest is more than the Random Forest using Random search. We are considering Recall as the metric as False Positives gives more loss. So minimal no. of False Positives are preferred. It might perform better if the parameters and values are changed in Random Search. In this case, Random Forest with default values fit the data better than the range of values and parameters considered in Random search.