# About this notebook 

#### Objective: Petfinder Machine Learning - Part 2b


<div class="span5 alert alert-success">
<p> <I> Petfinder Machine Learning: </I> This report shares a second round of machine learning results for a model that predicts how quickly a pet is adopted. The second round applies the following changes to try to improve the model:  
<br>
1. Outliers are removed (pets older than 12 years)   
2. Scaling is applied to the features, which may help some algorithms produce a better model  
3. Hyper-parameter tuning is applied to the algorithms  
4. Word2Vec is applied to the pet description for use in NLP  
5. An ensemble combines the best performing non-NLP algorithm with the best performing NLP algorithm  
    <br>
The classification labels are:   
    <br>
0 - Pet was adopted on the same day it was listed  
1 - Pet was adopted between 1 and 7 days (1st week) after being listed  
2 - Pet was adopted between 8 and 30 days (1st month) after being listed  
3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed  
4 - No adoption after 100 days of being listed (there are no pets in this dataset that waited between 90 and 100 days)   
 </p>
</div>
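
As an illustration of the label definitions above, the sketch below maps a number of days to an adoption-speed label. The function name `adoption_speed_class` and the convention that `None` means "not adopted" are assumptions made for this example; they are not part of the Petfinder data, which already ships with the label computed.

```python
from typing import Optional


def adoption_speed_class(days_to_adoption: Optional[int]) -> int:
    """Map days between listing and adoption to the 0-4 label described above.

    `None` stands for "not adopted" (a convention invented for this sketch).
    """
    if days_to_adoption is None or days_to_adoption > 90:
        return 4  # no adoption within the observed window
    if days_to_adoption < 0:
        raise ValueError("days_to_adoption cannot be negative")
    if days_to_adoption == 0:
        return 0  # adopted on the listing day
    if days_to_adoption <= 7:
        return 1  # first week
    if days_to_adoption <= 30:
        return 2  # first month
    return 3      # second or third month


labels = [adoption_speed_class(d) for d in (0, 3, 7, 8, 30, 31, 90, None)]
print(labels)
```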

<div class="span5 alert alert-success">
<p> <I> Data fields: </I> For a list of the features available to predict the adoption speed, see the source data at: <br>   https://www.kaggle.com/c/petfinder-adoption-prediction/data 
 </p>
</div>

<div class="span5 alert alert-info">
<p> <B>  Imports and Data Loading: </B>  </p>
</div>

In [1]:
import warnings
warnings.filterwarnings('ignore')

%cd C:\Users\Ken\Documents\KenP\Applications-DataScience\SpringboardCourseWork\CapstoneProject2Repository\09 PetfindersData\TrainingData

C:\Users\Ken\Documents\KenP\Applications-DataScience\SpringboardCourseWork\CapstoneProject2Repository\09 PetfindersData\TrainingData


In [2]:
#Imports
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import scale

from sklearn.neighbors import KNeighborsClassifier 

from sklearn.naive_bayes import GaussianNB

from sklearn.ensemble import RandomForestClassifier

from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot

from sklearn import svm

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

from sklearn.model_selection import validation_curve

In [4]:
#Import the csv file
dfi = pd.read_csv('train.csv')

In [5]:
#Remove outliers: Age is in months, so keep only pets younger than 12 years (144 months)
dfi2 = dfi[dfi.Age < (12 * 12)]

print('Total pets with outliers included:' + str(dfi.Age.count()))
print()
print('Total pets with outliers deleted:' + str(dfi2.Age.count()))
print()
print('Total pets removed: ' + str(dfi.Age.count() - dfi2.Age.count()))

Total pets with outliers included:14993

Total pets with outliers deleted:14978

Total pets removed: 15
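
The filter above uses a strict comparison, so a pet whose age is exactly 144 months is also dropped. The small sketch below, on made-up ages, shows how the strict and inclusive versions differ; the name `MAX_AGE_MONTHS` and the toy data are assumptions for illustration only.

```python
import pandas as pd

MAX_AGE_MONTHS = 12 * 12  # 12 years expressed in months

# Made-up ages in months, including one pet at exactly 12 years
toy = pd.DataFrame({"Age": [1, 36, 143, 144, 180]})

strict = toy[toy.Age < MAX_AGE_MONTHS]      # same comparison as the notebook's filter
inclusive = toy[toy.Age <= MAX_AGE_MONTHS]  # also keeps pets that are exactly 12 years old

print(len(strict), len(inclusive))
```

Which boundary is appropriate depends on whether "older than 12 years" is meant to include pets that are exactly 12.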


In [6]:
#Write the dataframe with outliers removed to a csv file for future use
out_csv = 'milestonereport2_petdata_withnooutliers.csv'
dfi2.to_csv(out_csv)
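
Note that `to_csv` writes the DataFrame index by default, so reading the file back produces an extra `Unnamed: 0` column. The sketch below shows the difference on a toy frame written to an in-memory buffer; the toy data is made up for illustration.

```python
import io

import pandas as pd

toy = pd.DataFrame({"Age": [2, 14], "AdoptionSpeed": [3, 1]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```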

In [32]:
#Drop the columns that are not needed for machine learning
dfm = dfi2.drop(['Name','RescuerID','Description','PetID'],axis=1)
dfm.head(1)

Unnamed: 0,Type,Age,Breed1,Breed2,Gender,Color1,Color2,Color3,MaturitySize,FurLength,Vaccinated,Dewormed,Sterilized,Health,Quantity,Fee,State,VideoAmt,PhotoAmt,AdoptionSpeed
0,1,2,0,26,2,2,0,0,2,1,1,1,2,1,1,0,41326,0,3,3


<div class="span5 alert alert-info">
<p> <B>  Prepare to run the algorithms </B> 

</p>
</div>

In [33]:
#Split the values into features X (first 19 columns) and target Y (AdoptionSpeed, column 19)
array = dfm.values
X = array[:,0:19]
Y = array[:,19]

In [34]:
#Apply scaling
X_scaled = scale(X)

In [35]:
#Create a training and test data set
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, Y, test_size=test_size,
random_state=seed)
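
In the cells above, `scale(X)` standardizes the full dataset before the split, so the test rows contribute to the mean and standard deviation. A common alternative, sketched below on synthetic data, is to fit a `StandardScaler` on the training rows only and reuse it for the test rows. The toy arrays and variable names are assumptions for this example; this is not the pipeline that produced the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic features
y_toy = rng.integers(0, 5, size=200)                      # synthetic labels 0-4

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=7)

scaler = StandardScaler().fit(X_tr)   # statistics come from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # test rows reuse the training mean and std

print(X_tr_scaled.mean(axis=0).round(6))  # approximately zero for every column
print(X_te_scaled.mean(axis=0).round(3))  # close to zero, but not exactly
```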

<div class="span5 alert alert-info">
<p> <B>  Hyper-Parameter Tuning for XGBoost </B>  <br> <br>

<I>Approach: </I>  
1. Use Validation Curves to identify the best value for several parameters<br>  
2. Use GridSearchCV to combine the best parameters and identify the parameters for XGBoost that produce the best model

</p>
</div>

<div class="span5 alert alert-success">
<p> XGBoost: The parameters below were chosen for tuning based on a talk by Owen Zhang at ODSC Boston 2015 titled Open Source Tools and Data Science Competitions, in which he summarized the parameters he commonly tunes as follows:  
    <br>
 Column Sampling (colsample_bytree and maybe colsample_bylevel): grid search over the values [0.3, 0.5, 1].<br>  

 Tree Size (max_depth): grid search over the values [3, 6, 8, 10].<br>  

 Learning Rate (learning_rate): simplified to the ratio [2 to 10] / trees, depending on the number of trees.<br>  

 Min Leaf Weight (min_child_weight): simplified to the ratio 3 / rare events, where rare events is the percentage of rare-event observations in the dataset. Try [0 or 1].<br>  

 Row Sampling (subsample): grid search over the values [0.5, 0.75, 1.0].

 </p>
</div>
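
To make the learning-rate heuristic concrete, the sketch below reads it as the ratio (2 to 10) / number of trees, which is an interpretation of the summary above. The helper name `learning_rate_range` is hypothetical.

```python
def learning_rate_range(n_trees: int) -> tuple:
    """Return the (low, high) learning rates implied by the (2 to 10) / trees rule."""
    if n_trees <= 0:
        raise ValueError("n_trees must be positive")
    return 2 / n_trees, 10 / n_trees


for n in (100, 500, 1000):
    low, high = learning_rate_range(n)
    print(f"{n} trees: learning rate between {low} and {high}")
```

Under this reading, 100 trees correspond to a learning rate between 0.02 and 0.1.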

<div class="span5 alert alert-success">
<p> Run XGBoost using the default values 
 </p>
</div>

In [51]:
#XGBoost: Fit the model using default values
model = XGBClassifier()

model.fit(X_train,Y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='multi:softprob', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [52]:
#XGBoost: Predict the labels of the test set based on default values
Y_pred = model.predict(X_test)

In [53]:
print('CLASSIFICATION REPORT FOR DEFAULT VALUES')
print(classification_report(Y_test, Y_pred))

print()

print('ACCURACY SCORE FOR DEFAULT VALUES')
print(accuracy_score(Y_test,Y_pred))

CLASSIFICATION REPORT FOR DEFAULT VALUES
             precision    recall  f1-score   support

          0       1.00      0.01      0.02       130
          1       0.35      0.31      0.33      1001
          2       0.34      0.43      0.38      1339
          3       0.41      0.16      0.23      1096
          4       0.47      0.66      0.55      1377

avg / total       0.41      0.40      0.37      4943


ACCURACY SCORE FOR DEFAULT VALUES
0.39995953874165485
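
As a reading aid for the report above, the sketch below recomputes the `avg / total` precision as a support-weighted average of the per-class values, and compares the accuracy with a majority-class baseline (always predicting label 4). The numbers are copied from the report; the variable names are chosen for this example.

```python
# Per-class precision and support copied from the classification report above
precision = [1.00, 0.35, 0.34, 0.41, 0.47]
support = [130, 1001, 1339, 1096, 1377]

total = sum(support)

# Support-weighted average of the per-class precision
weighted_precision = sum(p * s for p, s in zip(precision, support)) / total

# Accuracy obtained by always predicting the most frequent class
majority_baseline = max(support) / total

print(round(weighted_precision, 2))
print(round(majority_baseline, 3))
```

The perfect precision for label 0 comes with a recall of 0.01, so it rests on a very small number of predictions for that class.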


<div class="span5 alert alert-success">
<p> <I> Validation Curves: </I> Use validation curves to identify the best value for some of the parameters 
 </p>
</div>

In [36]:
#Column Sampling - bytree
param_range = [0.3,0.5,1]
train_scores, test_scores = validation_curve(
    XGBClassifier(), X_train, Y_train, param_name="colsample_bytree", param_range=param_range,
    cv=5, scoring="accuracy", n_jobs=1)

print('TRAIN SCORES')
print(train_scores)

TRAIN SCORES
[[0.43882382 0.44101159 0.43739878 0.44047323 0.43823163]
 [0.44580115 0.4421328  0.44736514 0.4486924  0.44171856]
 [0.44804386 0.44599477 0.44611935 0.45280199 0.44856787]]


In [37]:
#Column Sampling - bylevel
param_range = [0.3, 0.5,1]
train_scores, test_scores = validation_curve(
    XGBClassifier(), X_train, Y_train, param_name="colsample_bylevel", param_range=param_range,
    cv=5, scoring="accuracy", n_jobs=1)

print('TRAIN SCORES')
print(train_scores)

TRAIN SCORES
[[0.43982058 0.43876915 0.44462439 0.44171856 0.44259029]
 [0.44405682 0.44175906 0.44611935 0.4486924  0.44595268]
 [0.44804386 0.44599477 0.44611935 0.45280199 0.44856787]]


In [38]:
#Max_depth
param_range = [3,6,8,10]
train_scores, test_scores = validation_curve(
    XGBClassifier(), X_train, Y_train, param_name="max_depth", param_range=param_range,
    cv=5, scoring="accuracy", n_jobs=1)

print('TRAIN SCORES')
print(train_scores)

TRAIN SCORES
[[0.44804386 0.44599477 0.44611935 0.45280199 0.44856787]
 [0.59506604 0.58789087 0.60059798 0.61594022 0.60323786]
 [0.73361575 0.7277937  0.7436153  0.74122042 0.7237858 ]
 [0.86294543 0.8458951  0.86159213 0.84533001 0.84657534]]


In [39]:
#learning Rate
param_range = [0.1,0.5,1,3,5,10]
train_scores, test_scores = validation_curve(
    XGBClassifier(), X_train, Y_train, param_name="learning_rate", param_range=param_range,
    cv=5, scoring="accuracy", n_jobs=1)

print('TRAIN SCORES')
print(train_scores)

TRAIN SCORES
[[0.44804386 0.44599477 0.44611935 0.45280199 0.44856787]
 [0.52417144 0.5292139  0.5292139  0.53088418 0.53138232]
 [0.59182656 0.58577302 0.58751713 0.59825654 0.59115816]
 [0.30201844 0.27258004 0.25077862 0.28032379 0.28381071]
 [0.26339397 0.28030397 0.21502429 0.26861768 0.23138232]
 [0.21505108 0.28030397 0.20804784 0.20809465 0.22216687]]


In [40]:
#Min Leaf Weight (min_child_weight)
param_range = [0,1]
train_scores, test_scores = validation_curve(
    XGBClassifier(), X_train, Y_train, param_name="min_child_weight", param_range=param_range,
    cv=5, scoring="accuracy", n_jobs=1)

print('TRAIN SCORES')
print(train_scores)

TRAIN SCORES
[[0.44754548 0.44512271 0.44512271 0.45367372 0.44819427]
 [0.44804386 0.44599477 0.44611935 0.45280199 0.44856787]]


In [41]:
#Row Sampling (subsample)
param_range = [0.5,0.75,1]
train_scores, test_scores = validation_curve(
    XGBClassifier(), X_train, Y_train, param_name="subsample", param_range=param_range,
    cv=5, scoring="accuracy", n_jobs=1)

print('TRAIN SCORES')
print(train_scores)

TRAIN SCORES
[[0.45028657 0.45596113 0.45222374 0.45815691 0.45180573]
 [0.45041116 0.45621029 0.45010589 0.45703611 0.45603985]
 [0.44804386 0.44599477 0.44611935 0.45280199 0.44856787]]
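
The cells above print only `train_scores`. Training accuracy generally rises as the model becomes more flexible, so it is usually the mean of `test_scores` that is compared when choosing a value. The sketch below summarizes the recorded `max_depth` training scores per parameter value; the helper `summarize` is a hypothetical name, and the same call would apply to `test_scores`, which were not printed in this notebook.

```python
import numpy as np

max_depth_values = [3, 6, 8, 10]

# Training-fold accuracies copied from the max_depth output above
train_scores = np.array([
    [0.44804386, 0.44599477, 0.44611935, 0.45280199, 0.44856787],
    [0.59506604, 0.58789087, 0.60059798, 0.61594022, 0.60323786],
    [0.73361575, 0.7277937, 0.7436153, 0.74122042, 0.7237858],
    [0.86294543, 0.8458951, 0.86159213, 0.84533001, 0.84657534],
])


def summarize(values, scores):
    """Return (value, mean, std) per parameter value; rows are values, columns are folds."""
    return [(v, float(row.mean()), float(row.std())) for v, row in zip(values, scores)]


summary = summarize(max_depth_values, train_scores)
for value, mean, std in summary:
    print(f"max_depth={value}: mean train accuracy={mean:.3f} (std {std:.3f})")
```

The training accuracy climbs steadily with depth, which by itself says little about accuracy on unseen data.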


<div class="span5 alert alert-success">
<p> Run XGBoost using the best value from the validation curve results 
 </p>
</div>

In [46]:
#XGBoost: Fit the model using the best-performing parameter values from the validation curve results above
model = XGBClassifier(max_depth=10,learning_rate=1,subsample=0.5)

model.fit(X_train,Y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=1, max_delta_step=0,
       max_depth=10, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='multi:softprob', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=0.5)

In [47]:
#XGBoost: Predict the labels of the test set
Y_pred = model.predict(X_test)

In [48]:
print('CLASSIFICATION REPORT')
print(classification_report(Y_test, Y_pred))

print()

print('ACCURACY SCORE')
print(accuracy_score(Y_test,Y_pred))

CLASSIFICATION REPORT
             precision    recall  f1-score   support

          0       0.18      0.09      0.12       130
          1       0.33      0.32      0.33      1001
          2       0.34      0.34      0.34      1339
          3       0.31      0.29      0.30      1096
          4       0.46      0.51      0.48      1377

avg / total       0.36      0.37      0.36      4943


ACCURACY SCORE
0.3671859194820959


<div class="span5 alert alert-success">
<p> Run XGBoost using GridSearchCV with the same values for each parameter used in the validation curves 
 </p>
</div>

In [50]:
#Using GridSearchCV on parameters to tune
model = XGBClassifier()
colsample_bylevel = [0.3,0.5,1]
colsample_bytree = [0.3,0.5,1]
max_depth = [3,6,8,10]
learning_rate = [0.1,0.5,1,3,5,10]
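#NOTE: 'min_child_rate' is not an XGBoost parameter name (the XGBoost parameter is 'min_child_weight'),
#so this entry probably did not affect the fitted models; the name is kept to match the recorded output below
min_child_rate = [0,1]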
subsample = [0.5,0.75,1]

param_grid = dict(colsample_bylevel = colsample_bylevel, colsample_bytree=colsample_bytree,
                  max_depth=max_depth, learning_rate=learning_rate, min_child_rate=min_child_rate, subsample=subsample)

kfold = KFold(n_splits=5, shuffle=True, random_state=7)

grid_search = GridSearchCV(model, param_grid, scoring="accuracy", n_jobs=-1, cv=kfold,
verbose=1)

grid_result = grid_search.fit(X_train,Y_train)

print('Grid result best score:')
print(grid_result.best_score_)
print()
print('Grid result best parameters:')
print(grid_result.best_params_)

Fitting 5 folds for each of 1296 candidates, totalling 6480 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  5.5min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 12.7min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed: 23.6min
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed: 40.8min
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed: 58.3min
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed: 74.6min
[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed: 95.0min
[Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed: 124.4min
[Parallel(n_jobs=-1)]: Done 4992 tasks      | elapsed: 152.5min
[Parallel(n_jobs=-1)]: Done 6042 tasks      | elapsed: 201.1min
[Parallel(n_jobs=-1)]: Done 6480 out of 6480 | elapsed: 219.1min finished


Grid result best score:
0.40408570004982564

Grid result best parameters:
{'colsample_bylevel': 0.5, 'colsample_bytree': 0.5, 'learning_rate': 0.1, 'max_depth': 6, 'min_child_rate': 0, 'subsample': 1}
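
Two quick checks on the grid above, as a sketch. The first reproduces the "1296 candidates, 6480 fits" figure from the list lengths. The second looks for grid keys that an estimator does not list in `get_params()`, which would flag a name such as `min_child_rate`. The helper names `grid_size` and `unknown_params` are hypothetical, and a scikit-learn `RandomForestClassifier` is used as a stand-in estimator so that the example runs without XGBoost installed.

```python
from math import prod

from sklearn.ensemble import RandomForestClassifier


def grid_size(param_grid, n_splits):
    """Return (number of candidates, total fits) for an exhaustive grid search."""
    candidates = prod(len(values) for values in param_grid.values())
    return candidates, candidates * n_splits


def unknown_params(estimator, param_grid):
    """Return the grid keys that the estimator does not list in get_params()."""
    known = estimator.get_params()
    return [name for name in param_grid if name not in known]


# The grid used in the notebook cell above
xgb_grid = {
    "colsample_bylevel": [0.3, 0.5, 1],
    "colsample_bytree": [0.3, 0.5, 1],
    "max_depth": [3, 6, 8, 10],
    "learning_rate": [0.1, 0.5, 1, 3, 5, 10],
    "min_child_rate": [0, 1],
    "subsample": [0.5, 0.75, 1],
}
size = grid_size(xgb_grid, n_splits=5)
print(size)

# Stand-in estimator and a small made-up grid to show the name check
stand_in_grid = {"max_depth": [3, 6], "min_child_rate": [0, 1]}
flagged = unknown_params(RandomForestClassifier(), stand_in_grid)
print(flagged)
```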


<div class="span5 alert alert-info">
<p> <B>  Hyper-Parameter Tuning for Random Forest </B>  <br> <br>

<I>Approach: </I>  
1. Use GridSearchCV to search combinations of parameter values and identify the Random Forest parameters that produce the best model

</p>
</div>

<div class="span5 alert alert-success">
<p> Run Random Forest using the default values
 </p>
</div>

In [54]:
#Random Forest: Fit the model
model = RandomForestClassifier()

model.fit(X_train,Y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [55]:
#Random Forest: Predict the labels of the test set
Y_pred = model.predict(X_test)

In [56]:
print('CLASSIFICATION REPORT')
print(classification_report(Y_test, Y_pred))

print()

print('ACCURACY SCORE')
print(accuracy_score(Y_test,Y_pred))

CLASSIFICATION REPORT
             precision    recall  f1-score   support

          0       0.14      0.05      0.08       130
          1       0.33      0.38      0.35      1001
          2       0.33      0.33      0.33      1339
          3       0.32      0.27      0.29      1096
          4       0.48      0.51      0.50      1377

avg / total       0.36      0.37      0.37      4943


ACCURACY SCORE
0.370220513857981


<div class="span5 alert alert-success">
<p> Run Random Forest using GridSearchCV
 </p>
</div>

In [60]:
#Using GridSearchCV on Random Forest parameters to tune
model = RandomForestClassifier()

n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]

max_depth = [10,20,30,40,50,60,70,80,90,100,110]

min_samples_split = [2,5,10]

min_samples_leaf = [25,50,100]

param_grid = dict(n_estimators=n_estimators, max_depth = max_depth, min_samples_split=min_samples_split,
                  min_samples_leaf=min_samples_leaf)

kfold = KFold(n_splits=5, shuffle=True, random_state=7)

grid_search = GridSearchCV(model, param_grid, scoring="accuracy", n_jobs=-1, cv=kfold,
verbose=1)

grid_result = grid_search.fit(X_train,Y_train)

print('Grid result best score:')
print(grid_result.best_score_)
print()
print('Grid result best parameters:')
print(grid_result.best_params_)

Fitting 5 folds for each of 990 candidates, totalling 4950 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  9.0min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 19.5min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed: 35.6min
[Parallel(n_jobs=-1)]: Done 1242 tasks      | elapsed: 55.8min
[Parallel(n_jobs=-1)]: Done 1792 tasks      | elapsed: 80.0min
[Parallel(n_jobs=-1)]: Done 2442 tasks      | elapsed: 109.7min
[Parallel(n_jobs=-1)]: Done 3192 tasks      | elapsed: 142.8min
[Parallel(n_jobs=-1)]: Done 4042 tasks      | elapsed: 180.6min
[Parallel(n_jobs=-1)]: Done 4950 out of 4950 | elapsed: 221.4min finished


Grid result best score:
0.3985052316890882

Grid result best parameters:
{'max_depth': 10, 'min_samples_leaf': 25, 'min_samples_split': 2, 'n_estimators': 1000}
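
The sketch below reconstructs the Random Forest grid from the cell above, confirms the "990 candidates, 4950 fits" figure, and lists which of the best parameters sit at the smallest or largest value searched. A best value on the edge of a grid can be a hint that the range is worth extending in that direction. The helper name `values_on_grid_edge` is hypothetical.

```python
from math import prod

import numpy as np

# Grid copied from the notebook cell above
param_grid = {
    "n_estimators": [int(x) for x in np.linspace(start=200, stop=2000, num=10)],
    "max_depth": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [25, 50, 100],
}

# Best parameters copied from the recorded output above
best_params = {"max_depth": 10, "min_samples_leaf": 25,
               "min_samples_split": 2, "n_estimators": 1000}


def values_on_grid_edge(best, grid):
    """Return parameter names whose best value is the smallest or largest value searched."""
    return sorted(
        name for name, value in best.items()
        if value in (min(grid[name]), max(grid[name]))
    )


candidates = prod(len(values) for values in param_grid.values())
on_edge = values_on_grid_edge(best_params, param_grid)

print(param_grid["n_estimators"])
print(candidates, candidates * 5)
print(on_edge)
```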
