### Gradient boosting assignment. This assignment reuses the course example and modifies it to find out how changing hyperparameters affects model performance. The first part reuses code from the course material.

In [0]:
import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import ensemble
from sklearn import datasets
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error

In [0]:
df = pd.read_csv((
    "https://raw.githubusercontent.com/Thinkful-Ed/data-201-resources/"
    "master/ESS_practice_data/ESSdata_Thinkful.csv")).dropna()

# Define outcome and predictors.
# Set our outcome to 0 and 1.
y = df['partner'] - 1
X = df.loc[:, ~df.columns.isin(['partner', 'cntry', 'idno'])]

# Make the categorical variable 'country' into dummies.
X = pd.concat([X, pd.get_dummies(df['cntry'])], axis=1)

# Create training and test sets.
offset = int(X.shape[0] * 0.9)

# Put 90% of the data in the training set.
X_train, y_train = X[:offset], y[:offset]

# And put 10% in the test set.
X_test, y_test = X[offset:], y[offset:]
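
The split above takes the first 90% of rows in file order. If the file happens to be ordered (for example by country or year), the last 10% may not be representative. A minimal sketch of a shuffled, stratified alternative follows; it runs on synthetic stand-in data, and `X_demo`, `y_demo`, and the `random_state` values are assumptions made for illustration, not part of the course material.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the survey data (hypothetical values).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 5))
y_demo = (rng.uniform(size=1000) < 0.4).astype(int)

# Shuffle before splitting and keep the class balance equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=42)

print(X_tr.shape, X_te.shape)
print('share of class 1 in train / test:', y_tr.mean(), y_te.mean())
```

Applied to this notebook, the same call would take `X` and `y` in place of the synthetic arrays; the reported error rates would then change, because the test rows would differ.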

In [3]:
# We'll make 500 iterations, use 2-deep trees, and set our loss function.
# Note: scikit-learn 1.1 renamed 'deviance' to 'log_loss', and 1.3 removed the old name.
params = {'n_estimators': 500,
          'max_depth': 2,
          'loss': 'deviance'}

# Initialize and fit the model.
clf = ensemble.GradientBoostingClassifier(**params)
clf.fit(X_train, y_train)

predict_train = clf.predict(X_train)
predict_test = clf.predict(X_test)

# Accuracy tables.
table_train = pd.crosstab(y_train, predict_train, margins=True)
table_test = pd.crosstab(y_test, predict_test, margins=True)

train_tI_errors = table_train.loc[0.0,1.0] / table_train.loc['All','All']
train_tII_errors = table_train.loc[1.0,0.0] / table_train.loc['All','All']

test_tI_errors = table_test.loc[0.0,1.0]/table_test.loc['All','All']
test_tII_errors = table_test.loc[1.0,0.0]/table_test.loc['All','All']

print((
    'Training set accuracy:\n'
    'Percent Type I errors: {}\n'
    'Percent Type II errors: {}\n\n'
    'Test set accuracy:\n'
    'Percent Type I errors: {}\n'
    'Percent Type II errors: {}'
).format(train_tI_errors, train_tII_errors, test_tI_errors, test_tII_errors))


Training set accuracy:
Percent Type I errors: 0.04650845608292417
Percent Type II errors: 0.17607746863066012

Test set accuracy:
Percent Type I errors: 0.06257668711656442
Percent Type II errors: 0.18527607361963191
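
The crosstab arithmetic above is repeated for every model in this notebook. As a sketch, it could be wrapped in one small helper; the name `error_rates` is hypothetical, and it assumes the same 0/1 coding as `y` above, with 1 treated as the positive class.

```python
def error_rates(y_true, y_pred):
    """Return (Type I rate, Type II rate) as fractions of all observations.

    Type I: true 0 predicted as 1. Type II: true 1 predicted as 0.
    """
    pairs = list(zip(y_true, y_pred))
    n = len(pairs)
    type_i = sum(1 for t, p in pairs if t == 0 and p == 1) / n
    type_ii = sum(1 for t, p in pairs if t == 1 and p == 0) / n
    return type_i, type_ii


# Toy example: one false positive and two false negatives out of five rows.
demo_rates = error_rates([0, 0, 1, 1, 1], [0, 1, 1, 0, 0])
print(demo_rates)
```

With the objects defined above, `error_rates(y_test, clf.predict(X_test))` should give the same two numbers as the crosstab version, since both divide by the total row count.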


### The code above is from the course material. Below, we change n_estimators, max_depth, and the loss function to see whether they improve model performance.

In [4]:
# We'll make 1000 iterations, use 2-deep trees, and set our loss function.
params1 = {'n_estimators': 1000,
          'max_depth': 2,
          'loss': 'deviance'}

# Initialize and fit the model.
clf_EstChanged = ensemble.GradientBoostingClassifier(**params1)
clf_EstChanged.fit(X_train, y_train)

predict_train_1 = clf_EstChanged.predict(X_train)
predict_test_1 = clf_EstChanged.predict(X_test)

# Accuracy tables.
table_train_1 = pd.crosstab(y_train, predict_train_1, margins=True)
table_test_1 = pd.crosstab(y_test, predict_test_1, margins=True)

train_tI_errors_1 = table_train_1.loc[0.0,1.0] / table_train_1.loc['All','All']
train_tII_errors_1 = table_train_1.loc[1.0,0.0] / table_train_1.loc['All','All']

test_tI_errors_1 = table_test_1.loc[0.0,1.0]/table_test_1.loc['All','All']
test_tII_errors_1 = table_test_1.loc[1.0,0.0]/table_test_1.loc['All','All']

print((
    'Training set accuracy with Changed n_estimators:\n'
    'Percent Type I errors: {}\n'
    'Percent Type II errors: {}\n\n'
    'Test set accuracy with Changed n_estimators:\n'
    'Percent Type I errors: {}\n'
    'Percent Type II errors: {}\n\n'
).format(train_tI_errors_1, train_tII_errors_1, test_tI_errors_1, test_tII_errors_1))

# We'll make 500 iterations, use 4-deep trees, and set our loss function.
params2 = {'n_estimators': 500,
          'max_depth': 4,
          'loss': 'deviance'}

# Initialize and fit the model.
clf_depth = ensemble.GradientBoostingClassifier(**params2)
clf_depth.fit(X_train, y_train)

predict_train_2 = clf_depth.predict(X_train)
predict_test_2 = clf_depth.predict(X_test)

# Accuracy tables.
table_train_2 = pd.crosstab(y_train, predict_train_2, margins=True)
table_test_2 = pd.crosstab(y_test, predict_test_2, margins=True)

train_tI_errors_2 = table_train_2.loc[0.0,1.0] / table_train_2.loc['All','All']
train_tII_errors_2 = table_train_2.loc[1.0,0.0] / table_train_2.loc['All','All']

test_tI_errors_2 = table_test_2.loc[0.0,1.0]/table_test_2.loc['All','All']
test_tII_errors_2 = table_test_2.loc[1.0,0.0]/table_test_2.loc['All','All']

print((
    'Training set accuracy with Change of Max Depth:\n'
    'Percent Type I errors: {}\n'
    'Percent Type II errors: {}\n\n'
    'Test set accuracy with Changed of Max Depth:\n'
    'Percent Type I errors: {}\n'
    'Percent Type II errors: {}\n\n'
).format(train_tI_errors_2, train_tII_errors_2, test_tI_errors_2, test_tII_errors_2))

# We'll make 500 iterations, use 2-deep trees, and set our loss function.
params3 = {'n_estimators': 500,
          'max_depth': 2,
          'loss': 'exponential'}

# Initialize and fit the model.
clf_loss = ensemble.GradientBoostingClassifier(**params3)
clf_loss.fit(X_train, y_train)

predict_train_3 = clf_loss.predict(X_train)
predict_test_3 = clf_loss.predict(X_test)

# Accuracy tables.
table_train_3 = pd.crosstab(y_train, predict_train_3, margins=True)
table_test_3 = pd.crosstab(y_test, predict_test_3, margins=True)

train_tI_errors_3 = table_train_3.loc[0.0,1.0] / table_train_3.loc['All','All']
train_tII_errors_3 = table_train_3.loc[1.0,0.0] / table_train_3.loc['All','All']

test_tI_errors_3 = table_test_3.loc[0.0,1.0]/table_test_3.loc['All','All']
test_tII_errors_3 = table_test_3.loc[1.0,0.0]/table_test_3.loc['All','All']

print((
    'Training set accuracy with changed of loss function:\n'
    'Percent Type I errors: {}\n'
    'Percent Type II errors: {}\n\n'
    'Test set accuracy with changed of loss function:\n'
    'Percent Type I errors: {}\n'
    'Percent Type II errors: {}'
).format(train_tI_errors_3, train_tII_errors_3, test_tI_errors_3, test_tII_errors_3))

Training set accuracy with Changed n_estimators:
Percent Type I errors: 0.044189852700491
Percent Type II errors: 0.1692580469176214

Test set accuracy with Changed n_estimators:
Percent Type I errors: 0.07116564417177915
Percent Type II errors: 0.18036809815950922


Training set accuracy with Change of Max Depth:
Percent Type I errors: 0.01950354609929078
Percent Type II errors: 0.11824877250409166

Test set accuracy with Changed of Max Depth:
Percent Type I errors: 0.08588957055214724
Percent Type II errors: 0.18036809815950922


Training set accuracy with changed of loss function:
Percent Type I errors: 0.04841789416257501
Percent Type II errors: 0.1778505182760502

Test set accuracy with changed of loss function:
Percent Type I errors: 0.0638036809815951
Percent Type II errors: 0.18773006134969325


### Compared with the baseline, doubling n_estimators and deepening the trees reduce the training error, most clearly for max_depth=4, while switching to the exponential loss leaves the training error slightly higher. On the test set, total error (Type I plus Type II) rises in all three cases, from about 0.248 to between 0.252 and 0.266. Lower training error combined with higher test error is a sign of overfitting. Next, we will try GridSearchCV to tune the hyperparameters.
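
To compare the four models on a single number, the Type I and Type II rates can be added into a total test error. The sketch below only does arithmetic on the test-set values printed above; the dictionary labels are chosen here for readability.

```python
# Test-set error rates copied from the outputs above: (Type I, Type II).
test_errors = {
    'baseline (500 trees, depth 2, deviance)': (0.06257668711656442, 0.18527607361963191),
    'n_estimators=1000': (0.07116564417177915, 0.18036809815950922),
    'max_depth=4': (0.08588957055214724, 0.18036809815950922),
    "loss='exponential'": (0.0638036809815951, 0.18773006134969325),
}

# Total error is the share of all test rows that were misclassified.
total_error = {name: t1 + t2 for name, (t1, t2) in test_errors.items()}

for name, total in sorted(total_error.items(), key=lambda kv: kv[1]):
    print(f'{name}: total test error {total:.4f}, accuracy {1 - total:.4f}')

best = min(total_error, key=total_error.get)
```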

In [5]:
from sklearn.model_selection import GridSearchCV
clf_n = {'n_estimators':np.arange(100,1000,100)}

clf_n_grid_search = GridSearchCV(clf, clf_n, cv=5, scoring="balanced_accuracy")
clf_n_grid_search.fit(X_train, y_train)

print(('Best number of n_estimator is: {}').format(clf_n_grid_search.best_params_))


Best number of n_estimator is: {'n_estimators': 800}
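
Because boosting adds trees one at a time, the effect of n_estimators can also be read from a single fitted model: `staged_predict` yields the predictions after each stage. The sketch below uses synthetic data from `make_classification` and a held-out validation split; the data, the sizes, and the `random_state` values are assumptions for illustration, and this is a cheaper complement to the cross-validated search, not a replacement for it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# Fit once with the largest number of trees we want to consider.
model = GradientBoostingClassifier(n_estimators=200, max_depth=1, random_state=0)
model.fit(X_tr, y_tr)

# Score the validation set after every boosting stage.
val_scores = [balanced_accuracy_score(y_val, pred)
              for pred in model.staged_predict(X_val)]
best_n = int(np.argmax(val_scores)) + 1  # stages are counted from 1

print('best number of trees on the validation split:', best_n)
```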


In [6]:
params4 = {'n_estimators': 800,
          'max_depth': 1,
          'loss': 'deviance'}

clf_4 = ensemble.GradientBoostingClassifier(**params4)
clf_4.fit(X_train, y_train)

# Named depth_grid so it does not overwrite the clf_depth classifier fitted earlier.
depth_grid = {'max_depth':np.arange(1,10,1)}

clf_depth_grid_search = GridSearchCV(clf_4, depth_grid, cv=5, scoring="balanced_accuracy")
clf_depth_grid_search.fit(X_train, y_train)

print(('Best number of max depth is: {}').format(clf_depth_grid_search.best_params_))

Best number of max depth is: {'max_depth': 1}
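
The selected depth of 1 sits at the lower edge of the grid, so it helps to see how close the other candidates were. `GridSearchCV` keeps the per-candidate scores in `cv_results_`. The sketch below shows how to tabulate them; it runs on synthetic data with a deliberately small grid so that it finishes quickly, and the data and grid values are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (hypothetical).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# GridSearchCV clones the estimator, so it does not need to be fitted first.
search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    {'max_depth': [1, 2, 3]},
    cv=3, scoring='balanced_accuracy')
search.fit(X_demo, y_demo)

# One row per candidate: mean and spread of the cross-validated score.
results = pd.DataFrame(search.cv_results_)[
    ['param_max_depth', 'mean_test_score', 'std_test_score']]
print(results.sort_values('mean_test_score', ascending=False))
```

If the mean scores differ by less than their standard deviations, the ranking between depths is weak evidence on its own.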


In [7]:
params5 = {'n_estimators': 800,
          'max_depth': 1,
          'subsample': 0.2,
          'loss': 'deviance'}

clf_5 = ensemble.GradientBoostingClassifier(**params5)
clf_5.fit(X_train, y_train)

clf_subsample = {'subsample':np.arange(0.1,1,0.1)}

clf_subsample_grid_search = GridSearchCV(clf_5, clf_subsample, cv=5, scoring="balanced_accuracy")
clf_subsample_grid_search.fit(X_train, y_train)

print(('Best number of subsample is: {}').format(clf_subsample_grid_search.best_params_))

Best number of subsample is: {'subsample': 0.1}


In [8]:
# Note: the search above selected subsample=0.1, but this cell uses 0.3.
params6 = {'n_estimators': 800,
          'max_depth': 1,
          'loss': 'deviance',
          'subsample': 0.3}

# Initialize and fit the model.
clf_CV = ensemble.GradientBoostingClassifier(**params6)
clf_CV.fit(X_train, y_train)

predict_train_6 = clf_CV.predict(X_train)
predict_test_6 = clf_CV.predict(X_test)

# Accuracy tables.
table_train_6 = pd.crosstab(y_train, predict_train_6, margins=True)
table_test_6 = pd.crosstab(y_test, predict_test_6, margins=True)

train_tI_errors_6 = table_train_6.loc[0.0,1.0] / table_train_6.loc['All','All']
train_tII_errors_6 = table_train_6.loc[1.0,0.0] / table_train_6.loc['All','All']

test_tI_errors_6 = table_test_6.loc[0.0,1.0]/table_test_6.loc['All','All']
test_tII_errors_6 = table_test_6.loc[1.0,0.0]/table_test_6.loc['All','All']

print((
    'Training set accuracy with tuned parameters:\n'
    'Percent Type I errors: {}\n'
    'Percent Type II errors: {}\n\n'
    'Test set accuracy with tuned parameters:\n'
    'Percent Type I errors: {}\n'
    'Percent Type II errors: {}'
).format(train_tI_errors_6, train_tII_errors_6, test_tI_errors_6, test_tII_errors_6))

Training set accuracy with tuned parameters:
Percent Type I errors: 0.05360065466448445
Percent Type II errors: 0.19148936170212766

Test set accuracy with tuned parameters:
Percent Type I errors: 0.05889570552147239
Percent Type II errors: 0.20122699386503068


### After tuning, the training error is higher than the baseline (Type I 0.054 vs. 0.047, Type II 0.191 vs. 0.176), and the gap between training and test error is smaller, which suggests less overfitting. On the test set, the Type I error fell (0.059 vs. 0.063), but the Type II error rose (0.201 vs. 0.185), so the total test error is slightly higher than the baseline's. Next, we search over all three hyperparameters jointly.
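
The grid searches in this notebook rank candidates by balanced accuracy rather than by Type I or Type II rates. Balanced accuracy is the mean of the recall on each class, so it weights both classes equally even when one is rarer. A minimal pure-Python sketch of the calculation follows; the function name is hypothetical, and it assumes 0/1 labels with both classes present.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the recall on class 1 and the recall on class 0."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1  # Type II error
        elif t == 0 and p == 0:
            tn += 1
        else:
            fp += 1  # Type I error
    recall_pos = tp / (tp + fn)
    recall_neg = tn / (tn + fp)
    return (recall_pos + recall_neg) / 2


# Toy example: plain accuracy is 4/6, balanced accuracy is (1/2 + 3/4) / 2.
demo_score = balanced_accuracy([0, 0, 0, 0, 1, 1], [0, 0, 0, 1, 1, 0])
print(demo_score)
```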

In [10]:
clf_para = {'n_estimators':np.arange(100,1000,100),
        'max_depth':np.arange(1,10,1),
        'subsample':np.arange(0.1,1,0.1)}

clf_grid_search = GridSearchCV(clf, clf_para, cv=5, scoring="balanced_accuracy", verbose=10)
clf_grid_search.fit(X_train, y_train)

print(('Best parameter is: {}').format(clf_grid_search.best_params_))

Fitting 5 folds for each of 729 candidates, totalling 3645 fits
[CV] max_depth=1, n_estimators=100, subsample=0.1 ....................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV]  max_depth=1, n_estimators=100, subsample=0.1, score=0.6989786708003846, total=   0.2s
[CV] max_depth=1, n_estimators=100, subsample=0.1 ....................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s


[CV]  max_depth=1, n_estimators=100, subsample=0.1, score=0.6621087848046623, total=   0.2s
[CV] max_depth=1, n_estimators=100, subsample=0.1 ....................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.4s remaining:    0.0s


[CV]  max_depth=1, n_estimators=100, subsample=0.1, score=0.710861801699272, total=   0.2s
[CV] max_depth=1, n_estimators=100, subsample=0.1 ....................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.7s remaining:    0.0s


[CV]  max_depth=1, n_estimators=100, subsample=0.1, score=0.6854684568768232, total=   0.2s
[CV] max_depth=1, n_estimators=100, subsample=0.1 ....................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.9s remaining:    0.0s


[CV]  max_depth=1, n_estimators=100, subsample=0.1, score=0.698052203619304, total=   0.2s
[CV] max_depth=1, n_estimators=100, subsample=0.2 ....................


[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    1.1s remaining:    0.0s


[CV]  max_depth=1, n_estimators=100, subsample=0.2, score=0.7138963561799737, total=   0.2s
[CV] max_depth=1, n_estimators=100, subsample=0.2 ....................


[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:    1.4s remaining:    0.0s


[CV]  max_depth=1, n_estimators=100, subsample=0.2, score=0.6408296214901007, total=   0.2s
[CV] max_depth=1, n_estimators=100, subsample=0.2 ....................


[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:    1.6s remaining:    0.0s


[CV]  max_depth=1, n_estimators=100, subsample=0.2, score=0.7003787061201263, total=   0.2s
[CV] max_depth=1, n_estimators=100, subsample=0.2 ....................


[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:    1.9s remaining:    0.0s


[CV]  max_depth=1, n_estimators=100, subsample=0.2, score=0.6794986887725536, total=   0.2s
[CV] max_depth=1, n_estimators=100, subsample=0.2 ....................


[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:    2.1s remaining:    0.0s


[CV]  max_depth=1, n_estimators=100, subsample=0.2, score=0.6998459158854229, total=   0.2s
[CV] max_depth=1, n_estimators=100, subsample=0.30000000000000004 ....
[CV]  max_depth=1, n_estimators=100, subsample=0.30000000000000004, score=0.7136726644820752, total=   0.3s
[CV] max_depth=1, n_estimators=100, subsample=0.30000000000000004 ....
[CV]  max_depth=1, n_estimators=100, subsample=0.30000000000000004, score=0.6068206345780272, total=   0.3s
[CV] max_depth=1, n_estimators=100, subsample=0.30000000000000004 ....
[CV]  max_depth=1, n_estimators=100, subsample=0.30000000000000004, score=0.7101132193944627, total=   0.2s
[CV] max_depth=1, n_estimators=100, subsample=0.30000000000000004 ....
[CV]  max_depth=1, n_estimators=100, subsample=0.30000000000000004, score=0.6750886429041478, total=   0.3s
[CV] max_depth=1, n_estimators=100, subsample=0.30000000000000004 ....
[CV]  max_depth=1, n_estimators=100, subsample=0.30000000000000004, score=0.6934100014955802, total=   0.3s
[CV] max_dept

[Parallel(n_jobs=1)]: Done 3645 out of 3645 | elapsed: 323.9min finished


Best parameter is: {'max_depth': 2, 'n_estimators': 100, 'subsample': 0.1}
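
The log above reports 729 candidates, 3645 fits, and 323.9 minutes in total. The sketch below reproduces those counts from the three grids and derives the average time per fit. It also shows where the `0.30000000000000004` in the log comes from: the grid values are built by repeated floating-point steps of 0.1. Plain Python lists stand in for the `np.arange` calls here.

```python
from itertools import product

# The same three grids as in the search above, written as plain lists.
n_estimators = list(range(100, 1000, 100))               # 100, 200, ..., 900
max_depth = list(range(1, 10))                            # 1, 2, ..., 9
subsample = [round(0.1 * k, 1) for k in range(1, 10)]     # 0.1, 0.2, ..., 0.9

candidates = list(product(n_estimators, max_depth, subsample))
n_folds = 5
n_fits = len(candidates) * n_folds

elapsed_minutes = 323.9  # from the log above
seconds_per_fit = elapsed_minutes * 60 / n_fits

print(len(candidates), 'candidates,', n_fits, 'fits')
print(f'average time per fit: {seconds_per_fit:.1f} s')

# Stepping by 0.1 in floating point is not exact; rounding gives clean grid values.
print(0.1 + 2 * 0.1, 'vs', round(0.1 + 2 * 0.1, 1))
```

Since the cost grows with the product of the grid sizes, coarser grids or a randomized search over the same ranges are common ways to shorten a run like this one.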
