<h1>Building a predictive model using a Decision Tree</h1>
This Jupyter notebook builds the decision tree model.
It is needed to answer the research question of my thesis:
To what extent can support vector machine, random forest tree, or Gradient Boosting Machine contribute to predicting the demand for the specialist youth care segments in Amsterdam?
It is also needed to answer my sub-questions:
• Are there neighborhood socio-demographic characteristics which are predictive of the use of youth care segments?
• Which of the tested models has the highest F1 score in predicting the youth care segment use?


<h3>First, the required libraries are imported</h3>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from numpy import mean
from sklearn import decomposition, datasets
from sklearn.tree import DecisionTreeClassifier

from collections import Counter

from sklearn.metrics import plot_confusion_matrix, classification_report
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold

from imblearn.pipeline import Pipeline
from imblearn.under_sampling import RandomUnderSampler


To answer the sub-question "Which of the tested models has the highest F1 score in predicting the youth care segment use?", we first define a helper that calculates the metrics. Why this score was chosen is explained in the "Model evaluation" section.

We also define a number of variables here to keep the code as structured as possible. Why they are needed is explained in the "Model making" section.

In [2]:
# Helper that prints precision, recall and F1 for a fitted model.
# It relies on the global X_test and y_test that are created further down.
def calculateMetrics(model):
    y_predicted = model.predict(X_test)
    print(model)
    print("precision_score")
    print(precision_score(y_test, y_predicted, average='micro'))
    print("recall score")
    print(recall_score(y_test, y_predicted, average='micro'))
    print("F1 score")
    print(f1_score(y_test, y_predicted, average='micro'))
        
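# Hyperparameter grid for the grid search
param_dict = { 
    'criterion': ['gini', 'entropy'],
    'max_depth': range(4,26,4),
    # an integer min_samples_split must be at least 2 in scikit-learn, so the grid starts at 3
    'min_samples_split': range(3,10,2),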
    'min_samples_leaf': range(1,5)
}

cv_method = RepeatedStratifiedKFold(n_splits=5, 
                                    n_repeats=3, 
                                    random_state=42)
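
Because `calculateMetrics` uses `average='micro'`, it is worth keeping in mind that for single-label multiclass predictions the micro-averaged precision, recall and F1 all equal plain accuracy. The sketch below illustrates this on made-up labels (`y_true` and `y_pred` are assumptions, not thesis data) and prints the macro-averaged F1 for comparison, which weights every class equally.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels for three classes (illustration only)
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]

accuracy = accuracy_score(y_true, y_pred)
micro_precision = precision_score(y_true, y_pred, average='micro')
micro_recall = recall_score(y_true, y_pred, average='micro')
micro_f1 = f1_score(y_true, y_pred, average='micro')
macro_f1 = f1_score(y_true, y_pred, average='macro')

print("accuracy:       ", accuracy)
print("micro precision:", micro_precision)
print("micro recall:   ", micro_recall)
print("micro F1:       ", micro_f1)
print("macro F1:       ", macro_f1)
```

On imbalanced data the macro average can differ noticeably from the micro average, because small classes count as much as large ones.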



<h3>Load the data that was created in the other Jupyter notebook</h3>

In [3]:
## load data
df = pd.read_pickle("C:\\VERTROUWELIJK\\final_dataSet.pkl")

Split the data into independent variables and the dependent variable. Also create dummy variables for the binary column in the data set.

In [4]:
X = df.drop(['Voorziening'], axis=1)
X_encoded = pd.get_dummies(X, columns=['Geslacht'])
y = df['Voorziening'].copy()
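
As a small illustration of what `pd.get_dummies` does to the column list, the sketch below encodes a toy frame. The `Leeftijd` column and the `M`/`V` values are hypothetical; only the `Geslacht` column name comes from the notebook.

```python
import pandas as pd

# Toy data: "Leeftijd" and the M/V codes are assumptions for illustration
toy = pd.DataFrame({
    "Geslacht": ["M", "V", "M"],
    "Leeftijd": [12, 15, 9],
})

encoded = pd.get_dummies(toy, columns=["Geslacht"])

print(list(toy.columns))
print(list(encoded.columns))
```

The original `Geslacht` column is removed and the new dummy columns are appended at the end, so the encoded frame has a different column order and length than the unencoded one.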

Create the training and test sets. Why this is needed is explained in the "Making model" section.

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X_encoded,y,random_state=42)
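
The call above uses the defaults of `train_test_split`, which hold out 25% of the rows for testing. With imbalanced classes, one option (not used in this notebook) is to pass `stratify=` so that both parts keep the class proportions of the full data. A minimal sketch on made-up data, where `X_toy` and `y_toy` are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 rows of class 0 and 20 rows of class 1
X_toy = np.arange(100).reshape(100, 1)
y_toy = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=42
)

print("rows in train/test:", len(y_tr), len(y_te))
print("share of class 1 in train:", y_tr.mean())
print("share of class 1 in test: ", y_te.mean())
```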

<h4>Make the first DecisionTree</h4>
Fit it to get the scores. This is needed to answer all research questions.

In [6]:
# Make the first DecisionTree, and fit this to get the scores.
clf_dt = DecisionTreeClassifier(random_state = 42)
print(clf_dt)
scores = cross_val_score(clf_dt, X_train, y_train, cv=5, scoring='f1_micro')
score = mean(scores)
print("f1_score: %.2f%%" % (score * 100.0))

DecisionTreeClassifier(random_state=42)
f1_score: 66.42%


As mentioned, the data is highly imbalanced. We therefore also build a decision tree with random undersampling.

In [None]:
steps = [('under', RandomUnderSampler()), ('model', DecisionTreeClassifier(random_state=42))]
pipeline = Pipeline(steps=steps)
# evaluate pipeline
cv = RepeatedStratifiedKFold(n_splits=100, n_repeats=5, random_state=42)
scores = cross_val_score(pipeline, X_encoded, y, scoring='f1_micro', cv=cv, n_jobs=-1)
# average the F1 scores over all folds and repeats
score = mean(scores)
print('F1 Score: %.3f' % score)
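
To make the idea behind random undersampling concrete, the sketch below reduces every class to the size of the smallest class using NumPy only. It is an illustration of the principle, not imblearn's implementation; the `undersample` helper and the toy arrays are hypothetical.

```python
from collections import Counter

import numpy as np


def undersample(X, y, seed=42):
    """Randomly keep as many rows per class as the smallest class has."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    counts = Counter(y.tolist())
    n_min = min(counts.values())
    # Draw n_min row indices without replacement for every class
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == label), size=n_min, replace=False)
        for label in counts
    ])
    keep.sort()
    return X[keep], y[keep]


# Made-up data: 7 rows of class 0 and 3 rows of class 1
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0] * 7 + [1] * 3)

X_res, y_res = undersample(X_toy, y_toy)
print("before:", Counter(y_toy.tolist()))
print("after: ", Counter(y_res.tolist()))
```

Placing the sampler inside an imblearn `Pipeline`, as in the cell above, means the resampling is applied only when fitting on the training folds, so the validation folds keep their original class distribution.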

<h4>Prune the model</h4>
After comparing the two models, the model trained on the original data set has the highest F1 score, so we go on to prune this model. See "Making Model" in the thesis.

In [None]:
grid = GridSearchCV(clf_dt,
                    param_grid=param_dict,
                    cv=cv_method,
                    scoring='f1_micro',
                    verbose=1,
                    n_jobs=-1)
grid.fit(X_train, y_train)
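
A grid search fits one model per parameter combination per cross-validation split, so the run time grows quickly. The sketch below shows one way to count the fits up front. The `toy_grid` is a deliberately small, hypothetical grid; the cross-validation settings mirror `cv_method`.

```python
from sklearn.model_selection import ParameterGrid, RepeatedStratifiedKFold

# Hypothetical small grid, for illustration only
toy_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [4, 8, 12],
}
toy_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)

n_candidates = len(ParameterGrid(toy_grid))  # number of parameter combinations
n_splits = toy_cv.get_n_splits()             # folds times repeats
n_fits = n_candidates * n_splits

print("candidates:", n_candidates)
print("splits per candidate:", n_splits)
print("total fits:", n_fits)
```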

In [None]:
# Print the best parameters after doing the gridsearch
grid.best_params_

In [None]:
# Build the final DecisionTree with the best parameters found by the grid search.
clf_dt = DecisionTreeClassifier(max_depth=16, criterion='entropy',
                                min_samples_leaf=1, min_samples_split=3)
clf_dt = clf_dt.fit(X_train, y_train)
calculateMetrics(clf_dt)
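
The cell above types the best parameters in by hand. As an alternative sketch, a fitted `GridSearchCV` also exposes the refitted winner as `best_estimator_`, which avoids copying values manually. The example uses synthetic data from `make_classification` and a tiny hypothetical grid, so its numbers say nothing about the thesis data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, for illustration only
X_toy, y_toy = make_classification(n_samples=200, n_features=6,
                                   n_informative=3, random_state=42)

toy_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={'max_depth': [2, 4], 'criterion': ['gini', 'entropy']},
    cv=3,
    scoring='f1_micro',
)
toy_search.fit(X_toy, y_toy)

# With the default refit=True, the best model is refitted on all training data
best_tree = toy_search.best_estimator_
print(toy_search.best_params_)
print(best_tree)
```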

In [None]:
# Create a confusion matrix to gain insight into how well the model performs.
fig, ax = plt.subplots(figsize=(20,20))
test = plot_confusion_matrix(clf_dt, X_test, y_test, ax=ax)

To visualize the grid search results, we created a figure. It is shown as Figure X in the thesis.

In [None]:
results_DT = pd.DataFrame(grid.cv_results_['params'])
results_DT['test_score'] = grid.cv_results_['mean_test_score']
for i in ['gini', 'entropy']:
    temp = results_DT[results_DT['criterion'] == i]
    temp_average = temp.groupby('max_depth').agg({'test_score': 'mean'})
    plt.plot(temp_average, marker = '.', label = i)
    
    
plt.legend()
plt.xlabel('Max Depth')
plt.ylabel("Mean CV Score")
plt.title("DT Performance Comparison")
plt.show()
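
Each point in the plot above is an average over all other hyperparameters at a given depth. The sketch below shows that aggregation step on a made-up results table; the scores in `toy_results` are invented for illustration.

```python
import pandas as pd

# Invented scores that mimic the layout of results_DT
toy_results = pd.DataFrame({
    'criterion': ['gini', 'gini', 'gini', 'gini'],
    'max_depth': [4, 4, 8, 8],
    'min_samples_leaf': [1, 2, 1, 2],
    'test_score': [0.60, 0.62, 0.66, 0.64],
})

# One mean score per depth, averaged over min_samples_leaf
per_depth = toy_results.groupby('max_depth').agg({'test_score': 'mean'})
print(per_depth)
```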

To answer the research question "Are there neighborhood socio-demographic characteristics which are predictive of the use of youth care segments?", the code below is needed.

In [None]:
# Get numerical feature importances
importances = list(clf_dt.feature_importances_)
# The model was trained on X_encoded, so the names must come from its columns
feature_list = list(X_encoded.columns)
# List of tuples with variable and importance
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
# Sort the feature importances by most important first
feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)
# Print the features and their importances
for pair in feature_importances:
    print('Variable: {:20} Importance: {}'.format(*pair))
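
The importances only make sense when each value is paired with the column the model was actually trained on. One compact way to keep names and values aligned is a pandas Series indexed by the training columns, sketched below on synthetic data. The column names `feat_a` to `feat_d` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with hypothetical column names, for illustration only
X_arr, y_toy = make_classification(n_samples=200, n_features=4,
                                   n_informative=2, n_redundant=0,
                                   random_state=42)
X_toy = pd.DataFrame(X_arr, columns=['feat_a', 'feat_b', 'feat_c', 'feat_d'])

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_toy, y_toy)

# Index the importances by the columns used for fitting, then sort
ranking = pd.Series(tree.feature_importances_, index=X_toy.columns)
ranking = ranking.sort_values(ascending=False)
print(ranking)
```

The importances of a fitted tree sum to 1, so each value can be read as a share of the total impurity reduction.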