# About this notebook 

#### Feature: Description

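This notebook vectorizes the pet description field with CountVectorizer (bag-of-words counts; the code does not fit a Word2Vec embedding) and splits the result into training and test datasets. The datasets are then used with XGBoost to build an adoption speed model, and its predictions are stacked with those of a model trained on the non-text features.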

<div class="span5 alert alert-success">
<p> <I> Feature Description: </I> The "Description" data is a profile write-up for each pet.
     <br>
    <I> Source: </I> https://www.kaggle.com/c/petfinder-adoption-prediction/data  </p>
</div>

<div class="span5 alert alert-success">
<p> <I> Target (Adoption Speed) Description: </I> 

Contestants are required to predict this value. The value is determined by how quickly, if at all, a pet is adopted.   
<br> 
The values are determined in the following way:   
0 - Pet was adopted on the same day as it was listed.    
1 - Pet was adopted between 1 and 7 days (1st week) after being listed.    
2 - Pet was adopted between 8 and 30 days (1st month) after being listed.    
3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.    
4 - No adoption after 100 days of being listed.    

</p>
</div>
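
The class boundaries listed above can be written as a small lookup. The sketch below is only an illustration: the function name `adoption_speed` is hypothetical, it treats `None` as "not adopted", and it raises an error for 91 to 100 days because the list above does not define a class for that range.

```python
def adoption_speed(days_to_adoption):
    """Map days between listing and adoption to the AdoptionSpeed class.

    Hypothetical helper: None means the pet was not adopted.
    """
    if days_to_adoption is None or days_to_adoption > 100:
        return 4  # no adoption after 100 days of being listed
    if days_to_adoption < 0:
        raise ValueError("days_to_adoption must be non-negative")
    if days_to_adoption == 0:
        return 0  # adopted on the day of listing
    if days_to_adoption <= 7:
        return 1  # first week
    if days_to_adoption <= 30:
        return 2  # first month
    if days_to_adoption <= 90:
        return 3  # second and third month
    raise ValueError("91-100 days is not covered by the class list")


print([adoption_speed(d) for d in (0, 3, 14, 60, None)])  # [0, 1, 2, 3, 4]
```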

<div class="span5 alert alert-info">
<p> <B>  Imports and Data Loading: </B>  </p>
</div>

In [99]:
#Imports
import pandas as pd
import numpy as np
import nltk

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

from xgboost import XGBClassifier

from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

from sklearn.ensemble import VotingClassifier

In [100]:
import warnings
warnings.filterwarnings('ignore')

%cd C:\Users\Ken\Documents\KenP\Applications-DataScience\SpringboardCourseWork\CapstoneProject2Repository\09 PetfindersData\TrainingData

C:\Users\Ken\Documents\KenP\Applications-DataScience\SpringboardCourseWork\CapstoneProject2Repository\09 PetfindersData\TrainingData


In [101]:
#Import the csv file

dfi = pd.read_csv('milestonereport2_petdata_withnooutliers.csv',index_col=0)

#remove the rows with no description 
dfi = dfi[pd.notnull(dfi.Description)]

<div class="span5 alert alert-info">
<p> <B>  NLP using CountVectorizer and XGBoost: </B>  </p>
</div>

In [102]:
#Create dataframe of pet description feature
#(copy the slice so that new columns can be added without a SettingWithCopyWarning)
dfd = dfi[['Description','AdoptionSpeed']].copy()
dfd.columns = ['description', 'adoptionspeed']

In [103]:
#Tokenize the description data (no lemmatization is applied)
mylist = []
for index, row in dfd.iterrows():

    #split sentence into words
    tokens = nltk.word_tokenize(str(row['description']))

    #remove all tokens that are not alphabetic
    words = [word for word in tokens if word.isalpha()]

    #convert the tokens to lowercase
    wordslc = [word.lower() for word in words]

    mylist.append(wordslc)

dfd['tokenized_desc'] = mylist
dfd.head(1)

Unnamed: 0,description,adoptionspeed,tokenized_desc
0,Milo went missing after a week with her new ad...,3,"[milo, went, missing, after, a, week, with, he..."
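
The two filtering steps in the loop above can be checked on a hand-made token list. This sketch skips `nltk.word_tokenize` and starts from tokens that are assumed, for illustration, to be what a tokenizer would produce for a short sentence; the helper name `clean_tokens` is hypothetical.

```python
def clean_tokens(tokens):
    """Keep purely alphabetic tokens and lowercase them."""
    return [tok.lower() for tok in tokens if tok.isalpha()]


# Assumed tokenizer output for "Milo is 2 months old, very playful!"
tokens = ["Milo", "is", "2", "months", "old", ",", "very", "playful", "!"]
cleaned = clean_tokens(tokens)
print(cleaned)
```

Numbers and punctuation are dropped, so an age such as "2" never reaches the tokenized column.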


In [104]:
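#Convert each row's token list to a string (applied per row, so each row keeps only its own tokens)
dfd['tokenized_desc_string'] = dfd['tokenized_desc'].apply(lambda tokens: str(tokens).strip('[]'))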
dfd.head(1)

Unnamed: 0,description,adoptionspeed,tokenized_desc,tokenized_desc_string
0,Milo went missing after a week with her new ad...,3,"[milo, went, missing, after, a, week, with, he...","'milo', 'went', 'missing', 'after', 'a', 'week..."


In [105]:
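#Build bag-of-words count features from the raw description text
#(the tokenized columns created above are not used here; CountVectorizer applies its own tokenization)
count_vect = CountVectorizer()
count_vect.fit(dfd['description'])

train_description = count_vect.transform(dfd['description'].values)

#X is the sparse document-term matrix for all rows; Y holds the adoption speed labels
X = train_description
Y = dfd['adoptionspeed'].values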
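
To make the count features concrete, here is a plain-Python sketch of a bag-of-words matrix. It is not CountVectorizer's implementation: it only mimics the documented default behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary) and returns dense lists rather than a sparse matrix. The two example descriptions and the function name `bag_of_words` are made up for illustration.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # tokens of 2+ word characters


def bag_of_words(docs):
    """Return (vocabulary, count rows) for a list of strings."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        # One count per vocabulary word; words absent from the doc count as 0
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# Made-up descriptions, for illustration only
docs = ["Milo is a friendly dog", "Friendly cat, friendly home"]
vocabulary, rows = bag_of_words(docs)
print(vocabulary)
print(rows)
```

Each description becomes one row and each vocabulary word one column, which is why the feature matrix for real descriptions is very wide and mostly zeros.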

In [106]:
#Create a training and test data set
test_size = 0.33
seed = 7
X_train_nlp, X_test_nlp, Y_train_nlp, Y_test_nlp = train_test_split(X, Y, test_size=test_size,
random_state=seed)

In [107]:
#XGBoost: Fit the model using default values
model = XGBClassifier()

model.fit(X_train_nlp,Y_train_nlp)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='multi:softprob', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [108]:
#XGBoost: Predict the labels of the test set based on default values
Y_pred_nlp = model.predict(X_test_nlp)

Y_pred_proba_nlp = model.predict_proba(X_test_nlp)


In [109]:
Y_pred_proba_nlp

array([[0.02026377, 0.1886045 , 0.24847971, 0.20768732, 0.3349647 ],
       [0.01931206, 0.15161437, 0.24960448, 0.19871081, 0.38075832],
       [0.01838196, 0.17959955, 0.2514307 , 0.21956421, 0.33102354],
       ...,
       [0.03260574, 0.17825058, 0.37323868, 0.28345072, 0.13245428],
       [0.02515974, 0.20423631, 0.24104653, 0.18865317, 0.3409042 ],
       [0.02815031, 0.19837828, 0.2570151 , 0.18094812, 0.33550814]],
      dtype=float32)
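
Each row of the `predict_proba` output holds one probability per AdoptionSpeed class (0 to 4) and sums to roughly 1. For a multi-class classifier like this one, the predicted label is expected to be the index of the largest probability. The sketch below checks that on three rows copied from the array above, in plain Python; the helper name `argmax` is just for illustration.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


# Three rows copied from the Y_pred_proba_nlp output above
rows = [
    [0.02026377, 0.1886045, 0.24847971, 0.20768732, 0.3349647],
    [0.01931206, 0.15161437, 0.24960448, 0.19871081, 0.38075832],
    [0.03260574, 0.17825058, 0.37323868, 0.28345072, 0.13245428],
]

labels = [argmax(row) for row in rows]
row_sums = [round(sum(row), 4) for row in rows]
print(labels)    # [4, 4, 2]
print(row_sums)  # [1.0, 1.0, 1.0]
```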

In [110]:
print('CLASSIFICATION REPORT FOR DEFAULT VALUES')
print(classification_report(Y_test_nlp, Y_pred_nlp))

print()

print('ACCURACY SCORE FOR DEFAULT VALUES')
print(accuracy_score(Y_test_nlp,Y_pred_nlp))

CLASSIFICATION REPORT FOR DEFAULT VALUES
             precision    recall  f1-score   support

          0       0.67      0.01      0.03       149
          1       0.38      0.08      0.14      1022
          2       0.32      0.38      0.35      1291
          3       0.40      0.11      0.17      1092
          4       0.35      0.72      0.47      1385

avg / total       0.37      0.35      0.29      4939


ACCURACY SCORE FOR DEFAULT VALUES
0.34561652156306943
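
For reference, the per-class columns in the report follow the usual definitions: precision is the share of predictions for a class that are right, recall is the share of true members of a class that are found, and F1 is their harmonic mean. The sketch below computes them by hand on made-up labels, then uses the support column above to work out the accuracy of always predicting class 4, a baseline worth keeping in mind when reading 0.346. The function name and toy labels are illustrative assumptions.

```python
def class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 for one class label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)
    actual = sum(t == label for t in y_true)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels, for illustration only
y_true = [0, 1, 2, 2, 4, 4, 4, 3]
y_pred = [4, 1, 2, 4, 4, 4, 2, 4]
precision, recall, f1 = class_metrics(y_true, y_pred, label=4)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Support column from the report above: class -> number of test rows
support = {0: 149, 1: 1022, 2: 1291, 3: 1092, 4: 1385}
majority_baseline = max(support.values()) / sum(support.values())
print(round(precision, 2), round(recall, 2), round(f1, 2), accuracy)
print(round(majority_baseline, 4))
```

Always predicting class 4 would score about 0.28 on this test set, so the NLP model's 0.346 is a modest gain; the 0.72 recall on class 4 suggests the model leans heavily on that class.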


<div class="span5 alert alert-info">
<p> <B>  Run the non-NLP features (19 predictor columns plus the AdoptionSpeed target) using XGBoost with the best parameters identified in Milestone Report Part 2b: </B>  </p>
</div>

In [111]:
#Drop the columns that are not needed for machine learning
dfm = dfi.drop(['Name','RescuerID','Description','PetID'],axis=1)
dfm.head(1)

Unnamed: 0,Type,Age,Breed1,Breed2,Gender,Color1,Color2,Color3,MaturitySize,FurLength,Vaccinated,Dewormed,Sterilized,Health,Quantity,Fee,State,VideoAmt,PhotoAmt,AdoptionSpeed
0,1,2,0,26,2,2,0,0,2,1,1,1,2,1,1,0,41326,0,3,3


In [112]:
#Create the feature array (first 19 columns) and the target vector (AdoptionSpeed, the last column)
array = dfm.values
X = array[:,0:19]
Y = array[:,19]

In [113]:
#Create a training and test data set
test_size = 0.33
seed = 7
X_train_nonnlp, X_test_nonnlp, Y_train_nonnlp, Y_test_nonnlp = train_test_split(X, Y, test_size=test_size,
random_state=seed)

In [114]:
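#XGBoost: Fit the model using best hyper parameter tuning results from Part 2b
#Note: min_child_rate is not a recognized XGBoost parameter (min_child_weight was probably intended);
#min_child_weight stays at its default of 1, as the fitted model's repr below shows
model = XGBClassifier(colsample_bylevel = 0.5, colsample_bytree = 0.5, learning_rate = 0.1, max_depth = 6,
                      min_child_rate = 0, subsample = 1)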
                      
model.fit(X_train_nonnlp,Y_train_nonnlp)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=0.5,
       colsample_bytree=0.5, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=6, min_child_rate=0, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='multi:softprob', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
       subsample=1)

In [115]:
#XGBoost: Predict the labels of the test set
Y_pred_nonnlp = model.predict(X_test_nonnlp)

Y_pred_proba_nonnlp = model.predict_proba(X_test_nonnlp)

In [116]:
Y_pred_proba_nonnlp

array([[0.01284225, 0.08572476, 0.11942077, 0.19549891, 0.58651334],
       [0.01596327, 0.15398237, 0.14249599, 0.17416093, 0.51339746],
       [0.0130569 , 0.11326262, 0.23594636, 0.26745942, 0.37027466],
       ...,
       [0.04327947, 0.26435426, 0.2020391 , 0.43081114, 0.05951603],
       [0.01125755, 0.09420679, 0.08018898, 0.22931793, 0.5850287 ],
       [0.01940725, 0.3715242 , 0.3618822 , 0.14317116, 0.10401516]],
      dtype=float32)

In [117]:
print('CLASSIFICATION REPORT')
print(classification_report(Y_test_nonnlp, Y_pred_nonnlp))

print()

print('ACCURACY SCORE')
print(accuracy_score(Y_test_nonnlp,Y_pred_nonnlp))

CLASSIFICATION REPORT
             precision    recall  f1-score   support

          0       0.50      0.01      0.03       149
          1       0.38      0.33      0.35      1022
          2       0.34      0.44      0.38      1291
          3       0.39      0.18      0.24      1092
          4       0.48      0.65      0.55      1385

avg / total       0.40      0.40      0.38      4939


ACCURACY SCORE
0.4041303907673618


<div class="span5 alert alert-info">
<p> <B>  Stacking Model: Create an ensemble model that combines the NLP model with the non-NLP model </B>  </p>
</div>

<div class="span5 alert alert-success">
<p> <I> Run the stacked model using the first-level predicted labels (Y_pred values) as input </I> </p>
</div>

In [118]:
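#Create a data frame of the first-level NLP and non-NLP predicted labels and the true test labels, then transpose it
#(both first-level splits use the same rows, test_size and seed, so Y_test_nlp lines up with both sets of predictions)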
df_smf = pd.DataFrame([Y_pred_nlp, Y_pred_nonnlp, Y_test_nlp]).transpose()

df_smf.columns = ['firstmodel_nlp_predicted','firstmodel_nonnlp_predicted', 'firstmodel_Y_values']

In [119]:
#Create the array for the stacked model using the predicted values
array = df_smf.values
X = array[:,0:2]
Y = array[:,2]

In [120]:
#Create a train and test dataset using the predicted values to use on the stacked model
test_size = 0.33
seed = 7
X_train_stacked, X_test_stacked, Y_train_stacked, Y_test_stacked = train_test_split(X, Y, test_size=test_size,
random_state=seed)
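
Because the stacked model is trained on the first-level test-set predictions, it sees far fewer rows than the first-level models did. The arithmetic below assumes `train_test_split` rounds a fractional `test_size` up to a whole number of rows and gives the remainder to training; the 4939 comes from the support totals in the reports above.

```python
import math

n_rows = 4939       # first-level test rows (support total reported above)
test_size = 0.33    # same fraction used in the split above

n_test = math.ceil(test_size * n_rows)  # assumed rounding rule
n_train = n_rows - n_test
print(n_train, n_test)
```

Under that assumption the stacked model is fitted on 3309 rows and evaluated on 1630, which matches the support total in the stacked classification reports.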

In [121]:
#XGBoost: Fit the model that uses the predicted values
model = XGBClassifier()
                      
model.fit(X_train_stacked,Y_train_stacked)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='multi:softprob', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [122]:
#XGBoost: Predict the labels of the test set for the predicted values
Y_pred_stacked = model.predict(X_test_stacked)

In [123]:
print('CLASSIFICATION REPORT')
print(classification_report(Y_test_stacked, Y_pred_stacked))

print()

print('ACCURACY SCORE')
print(accuracy_score(Y_test_stacked,Y_pred_stacked))

CLASSIFICATION REPORT
             precision    recall  f1-score   support

          0       0.00      0.00      0.00        52
          1       0.33      0.33      0.33       328
          2       0.32      0.47      0.38       425
          3       0.47      0.09      0.16       371
          4       0.47      0.63      0.54       454

avg / total       0.39      0.39      0.35      1630


ACCURACY SCORE
0.3852760736196319


<div class="span5 alert alert-success">
<p> <I> Run the stacked model using the first-level predicted probabilities (Y_pred_proba values) as input </I> </p>
</div>

In [124]:
#Collect the non-NLP model's predicted class probabilities from Y_pred_proba_nonnlp
#(one column per AdoptionSpeed class 0-4)
df_stacked = pd.DataFrame()

for k in range(5):
    df_stacked['firstmodel_nonnlp_predicted_proba' + str(k + 1)] = Y_pred_proba_nonnlp[:, k]

df_stacked.head(2)

Unnamed: 0,firstmodel_nonnlp_predicted_proba1,firstmodel_nonnlp_predicted_proba2,firstmodel_nonnlp_predicted_proba3,firstmodel_nonnlp_predicted_proba4,firstmodel_nonnlp_predicted_proba5
0,0.012842,0.085725,0.119421,0.195499,0.586513
1,0.015963,0.153982,0.142496,0.174161,0.513397


In [125]:
#Collect the NLP model's predicted class probabilities from Y_pred_proba_nlp
#(one column per AdoptionSpeed class 0-4)
for k in range(5):
    df_stacked['firstmodel_nlp_predicted_proba' + str(k + 1)] = Y_pred_proba_nlp[:, k]

df_stacked.head(2)

Unnamed: 0,firstmodel_nonnlp_predicted_proba1,firstmodel_nonnlp_predicted_proba2,firstmodel_nonnlp_predicted_proba3,firstmodel_nonnlp_predicted_proba4,firstmodel_nonnlp_predicted_proba5,firstmodel_nlp_predicted_proba1,firstmodel_nlp_predicted_proba2,firstmodel_nlp_predicted_proba3,firstmodel_nlp_predicted_proba4,firstmodel_nlp_predicted_proba5
0,0.012842,0.085725,0.119421,0.195499,0.586513,0.020264,0.188605,0.24848,0.207687,0.334965
1,0.015963,0.153982,0.142496,0.174161,0.513397,0.019312,0.151614,0.249604,0.198711,0.380758


In [126]:
#Add the true test-set labels of the first-level models as the target column
df_stacked['firstmodel_Y_values'] = df_smf['firstmodel_Y_values']

df_stacked.head(2)

Unnamed: 0,firstmodel_nonnlp_predicted_proba1,firstmodel_nonnlp_predicted_proba2,firstmodel_nonnlp_predicted_proba3,firstmodel_nonnlp_predicted_proba4,firstmodel_nonnlp_predicted_proba5,firstmodel_nlp_predicted_proba1,firstmodel_nlp_predicted_proba2,firstmodel_nlp_predicted_proba3,firstmodel_nlp_predicted_proba4,firstmodel_nlp_predicted_proba5,firstmodel_Y_values
0,0.012842,0.085725,0.119421,0.195499,0.586513,0.020264,0.188605,0.24848,0.207687,0.334965,4
1,0.015963,0.153982,0.142496,0.174161,0.513397,0.019312,0.151614,0.249604,0.198711,0.380758,2


In [127]:
#Create the array for the stacked model using the predict_proba values
array = df_stacked.values
X = array[:,0:10]
Y = array[:,10]
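
Each row fed to the stacked model is simply the five non-NLP class probabilities followed by the five NLP class probabilities, with the true label as the target. The sketch below rebuilds the first row of the table above in plain Python; the helper name `stack_features` is hypothetical.

```python
def stack_features(*probability_rows):
    """Concatenate per-model class-probability lists into one feature row."""
    features = []
    for row in probability_rows:
        features.extend(row)
    return features


# First row of the table above (values as displayed, rounded)
nonnlp_proba = [0.012842, 0.085725, 0.119421, 0.195499, 0.586513]
nlp_proba = [0.020264, 0.188605, 0.24848, 0.207687, 0.334965]
true_label = 4

x_row = stack_features(nonnlp_proba, nlp_proba)
print(len(x_row), true_label)  # 10 4
```

In this row both first-level models put the most weight on class 4, which is also the true label; the stacked model can learn how much to trust each model when they disagree.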

In [128]:
#Create a train and test dataset using the predict_proba values to use on the stacked model
test_size = 0.33
seed = 7
X_train_stacked, X_test_stacked, Y_train_stacked, Y_test_stacked = train_test_split(X, Y, test_size=test_size,
random_state=seed)

In [129]:
#XGBoost: Fit the model that uses the predict_proba values
model = XGBClassifier()
                      
model.fit(X_train_stacked,Y_train_stacked)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='multi:softprob', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [130]:
#XGBoost: Predict the labels of the test set from the predict_proba features
#(despite its name, Y_pred_proba_stacked holds class labels, not probabilities)
Y_pred_proba_stacked = model.predict(X_test_stacked)

In [131]:
print('CLASSIFICATION REPORT')
print(classification_report(Y_test_stacked, Y_pred_proba_stacked))

print()

print('ACCURACY SCORE')
print(accuracy_score(Y_test_stacked,Y_pred_proba_stacked))

CLASSIFICATION REPORT
             precision    recall  f1-score   support

        0.0       0.25      0.02      0.04        52
        1.0       0.35      0.42      0.38       328
        2.0       0.37      0.39      0.38       425
        3.0       0.49      0.17      0.25       371
        4.0       0.48      0.69      0.56       454

avg / total       0.42      0.42      0.39      1630


ACCURACY SCORE
0.4171779141104294
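
When comparing the four accuracy scores in this notebook, it helps to keep the test-set sizes in mind. The sketch below recovers the number of correct predictions behind each reported score and adds a rough binomial standard error, sqrt(p(1 - p) / n). This is a back-of-the-envelope check under a normal approximation, not part of the original analysis, and the dictionary keys are illustrative names.

```python
import math

# (reported accuracy, number of test rows) taken from the outputs above
results = {
    "nlp": (0.34561652156306943, 4939),
    "nonnlp": (0.4041303907673618, 4939),
    "stacked_labels": (0.3852760736196319, 1630),
    "stacked_proba": (0.4171779141104294, 1630),
}

summary = {}
for name, (accuracy, n_rows) in results.items():
    correct = round(accuracy * n_rows)
    std_error = math.sqrt(accuracy * (1 - accuracy) / n_rows)
    summary[name] = (correct, std_error)
    print(f"{name}: {correct}/{n_rows} correct, "
          f"accuracy {accuracy:.3f} +/- {std_error:.3f}")
```

The stacked scores are measured on a 1630-row subset, so their standard error is a little over one percentage point. The gap between 0.404 (non-NLP alone) and 0.417 (stacked on probabilities) is of about that size and should be read with caution.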
