### Decision Tree Prototype
This file uses the insights gathered in the capstone_ml_prototype pipeline to fit a hyperparameter-tuned decision tree to the training data and compare accuracy scores across the training and validation sets. Because the provided test set is not labelled, a 20% hold-out from the training data serves as the validation set; some cells below label this hold-out "test".

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
#from catboost import CatBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

In [2]:
#Load and split the data
training_data = pd.read_csv("train.csv")
testing_data = pd.read_csv("test.csv")
costa_rica_data = training_data.drop(['Target'], axis=1)
costa_rica_target = training_data['Target']

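#Inspect the non-numeric (string) columns, then keep only the numeric columns and fill missing values with 0
costa_rica_data.select_dtypes(exclude=[np.number]).head()
#np.object was removed in NumPy 1.24; include=[np.number] alone already drops the string columns
costa_rica_data = costa_rica_data.select_dtypes(include=[np.number]).fillna(0)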

#Split data into 80% train, 20% validation split
X_train, X_test, y_train, y_test = train_test_split(costa_rica_data.values, costa_rica_target.values, test_size= 0.2, random_state=42)

In [3]:
#Fit a Decision Tree with fixed hyperparameters on all numeric features to get a baseline idea of performance
clf = DecisionTreeClassifier(criterion='gini', max_depth=75, random_state=42)
model = clf.fit(X_train, y_train)
y_test_pred = model.predict(X_test)
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print('Decision Tree Train Accuracy: '+str(round(train_score*100,2))+'%')
print('Decision Tree Test Accuracy: '+str(round(test_score*100,2))+'%')

Decision Tree Train Accuracy: 100.0%
Decision Tree Test Accuracy: 90.74%
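
The baseline tree scores 100.0% on the training split but 90.74% on the hold-out, which suggests that a depth limit of 75 lets the tree memorize the training rows. A minimal sketch of how the depth could instead be chosen by cross-validated grid search is shown below. It runs on synthetic data from `make_classification` rather than the Costa Rica data, and the names `X_demo`, `y_demo`, and the candidate depths in `param_grid` are illustrative assumptions, not values taken from the pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class stand-in for the real features and target (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=6, n_classes=4, random_state=42
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Candidate depths are illustrative; None lets the tree grow until leaves are pure
param_grid = {"max_depth": [3, 5, 10, 20, None]}

# 5-fold cross-validation on the training split only; the hold-out stays untouched
search = GridSearchCV(
    DecisionTreeClassifier(criterion="gini", random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

best_depth = search.best_params_["max_depth"]
val_accuracy = search.score(X_val, y_val)  # accuracy of the refit best tree
print("Best max_depth:", best_depth)
print("Hold-out accuracy: " + str(round(val_accuracy * 100, 2)) + "%")
```

Because the search only sees the training split, the hold-out score remains an unbiased check on the selected depth.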


In [4]:
#Run Feature Importance to extract relevant features
features = model.feature_importances_
features_dict = dict(zip(costa_rica_data.columns.values, features))
ranked_features_df = pd.DataFrame(features_dict, 
                                  index=['Feature Importance']).T.sort_values('Feature Importance', ascending=False)[:10]
ranked_features_df

Feature,Feature Importance
meaneduc,0.095388
SQBdependency,0.049903
SQBedjefe,0.048171
SQBmeaned,0.039866
SQBhogar_nin,0.033865
qmobilephone,0.030314
r4m2,0.028967
r4h3,0.028142
overcrowding,0.025013
rooms,0.023156
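
The impurity-based importances of a fitted scikit-learn tree are normalized to sum to 1, so the ten values above (which add up to roughly 0.40) account for about 40% of the total importance. The ranking step can also be written with a `pandas.Series`, as in the sketch below. It uses synthetic data, and the column names `feat_0` to `feat_7` are hypothetical placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with hypothetical column names (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
demo_df = pd.DataFrame(X_demo, columns=[f"feat_{i}" for i in range(8)])

tree_demo = DecisionTreeClassifier(random_state=0).fit(demo_df.values, y_demo)

# Pair each importance with its column name, then keep the top three
importances = pd.Series(
    tree_demo.feature_importances_,
    index=demo_df.columns,
    name="Feature Importance",
)
top_features = importances.sort_values(ascending=False).head(3)

print(top_features)
print("Share of total importance:", round(top_features.sum(), 3))
```

Indexing the original DataFrame with `top_features.index` then selects the reduced feature set, in the same way the next cell uses `ranked_features_df.index`.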


In [5]:
#Fit Decision Tree with only the ten most important features
#Note: no random_state is set on this tree, so the scores can vary slightly between runs
X_train, X_test, y_train, y_test = train_test_split(costa_rica_data[ranked_features_df.index].values, costa_rica_target.values, test_size= 0.2, random_state=42)
clf = DecisionTreeClassifier(criterion='gini', max_depth=60)
model = clf.fit(X_train, y_train)
y_test_pred = model.predict(X_test)
train_score = model.score(X_train, y_train)
val_score = model.score(X_test, y_test)
print('Decision Tree Train Accuracy: '+str(round(train_score*100,2))+'%')
print('Decision Tree Validation Accuracy: '+str(round(val_score*100,2))+'%')

Decision Tree Train Accuracy: 97.7%
Decision Tree Validation Accuracy: 93.51%


### Model Results
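The tree trained on the ten most important features reaches 97.7% training accuracy and 93.51% validation accuracy, a gap of about 4.2 percentage points. The baseline tree fitted on all numeric features shows a gap of about 9.3 points (100.0% vs. 90.74%), so the reduced model overfits less and also scores higher on the validation set. Since our test set does not have labels, we cannot use it to further corroborate these results.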
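
One way to get a further estimate from labelled data alone is k-fold cross-validation, which averages accuracy over several different hold-out splits instead of relying on a single one. The sketch below shows the idea on synthetic data; the data, the depth of 10, and the five folds are all illustrative assumptions rather than settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the labelled training data (assumption)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# Stratified folds keep the class proportions similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=10, random_state=42),
    X_demo,
    y_demo,
    cv=cv,
    scoring="accuracy",
)

print("Fold accuracies:", cv_scores.round(3))
print("Mean accuracy: " + str(round(cv_scores.mean() * 100, 2)) + "%")
print("Std across folds: " + str(round(cv_scores.std() * 100, 2)) + " points")
```

A small spread across folds would indicate that a single validation score is not an artifact of one particular split.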