## Introduction

This exercise compares the performance of the following classifiers using K-Fold Cross Validation:
1. Decision Tree (previously modelled)
2. Support Vector Machine (SVC)
3. Random Forest

We will also check whether different Decision Tree parameters, e.g. *max_leaf_nodes* and *class_weight*, help improve the model accuracy.

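K-Fold cross validation is a technique to *evaluate model performance*: the data is split into K folds, and each fold in turn serves as the test set while the model is trained on the remaining folds. We will use this technique to decide which type of model should be applied to the **patient_no_show** data to get maximum accuracy.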
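
To make the mechanics concrete, the procedure can be sketched by hand as below. This is only an illustration: the data is synthetic (generated with `make_classification` at roughly an 80/20 class split), and the fold count of 5 and the `random_state` values are assumptions, not settings taken from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the patient data (assumption: roughly 80/20 classes)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Stratified folds keep the class ratio similar in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_demo[train_idx], y_demo[train_idx])  # train on K-1 folds
    fold_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))  # test on the held-out fold

# The mean of the per-fold accuracies is the cross-validated score
mean_accuracy = float(np.mean(fold_scores))
print(fold_scores, mean_accuracy)
```

Every row is used for testing exactly once, which is why the averaged score is usually a steadier estimate than a single train/test split.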

*Reminder*: We are creating a model to predict **if patients fail to show up to their doctor appointments**.

In [6]:
#importing required libraries

import pandas as pd
import numpy as np

#importing sklearn libraries

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

from sklearn.preprocessing import LabelEncoder

from sklearn import tree  #Decision Tree
from sklearn.svm import SVC  #Support Vector Machine
from sklearn.ensemble import RandomForestClassifier  #Random Forest


In [3]:
#importing the data into a df

df = pd.read_csv('no_show_data_clean.csv')

In [5]:
df.head()

Unnamed: 0,patientid,gender,scheduledday,appointmentday,age,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,no_show
0,29872500000000.0,F,2016-04-29 18:38:08+00:00,2016-04-29 00:00:00+00:00,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,M,2016-04-29 16:08:27+00:00,2016-04-29 00:00:00+00:00,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,F,2016-04-29 16:19:04+00:00,2016-04-29 00:00:00+00:00,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,F,2016-04-29 17:29:31+00:00,2016-04-29 00:00:00+00:00,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,F,2016-04-29 16:07:23+00:00,2016-04-29 00:00:00+00:00,56,JARDIM DA PENHA,0,1,1,0,0,0,No


### Cleaning data 

In [7]:
df_clean = df.copy()

In [8]:
#label encoding categorical variables and target variable

le_gender = LabelEncoder()
le_neighbourhood = LabelEncoder()
le_no_show = LabelEncoder()


In [10]:
df_clean['gender_n'] = le_gender.fit_transform(df_clean['gender'])
df_clean['neighbourhood_n'] = le_neighbourhood.fit_transform(df_clean['neighbourhood'])
df_clean['no_show_n'] = le_no_show.fit_transform(df_clean['no_show'])
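
Keeping one fitted encoder per column makes it possible to map the integer codes back to the original labels later. A minimal sketch on a toy frame (the `toy` data and the `_demo` names are illustrative and not part of the notebook):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data standing in for two of the categorical columns
toy = pd.DataFrame({
    'gender': ['F', 'M', 'F'],
    'no_show': ['No', 'Yes', 'No'],
})

# One encoder per column, so each keeps its own classes_
le_gender_demo = LabelEncoder()
le_no_show_demo = LabelEncoder()

toy['gender_n'] = le_gender_demo.fit_transform(toy['gender'])
toy['no_show_n'] = le_no_show_demo.fit_transform(toy['no_show'])

# classes_ are sorted, so 'F' -> 0, 'M' -> 1 and 'No' -> 0, 'Yes' -> 1
print(list(le_gender_demo.classes_), list(le_no_show_demo.classes_))

# Decode the integer codes back to the original labels
gender_labels = list(le_gender_demo.inverse_transform(toy['gender_n']))
print(gender_labels)
```

If a single encoder were refit on each column in turn, only the last column's mapping would survive, and decoding the earlier columns would no longer be possible.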

In [12]:
df_clean.head()

Unnamed: 0,patientid,gender,scheduledday,appointmentday,age,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,no_show,gender_n,neighbourhood_n,no_show_n
0,29872500000000.0,F,2016-04-29 18:38:08+00:00,2016-04-29 00:00:00+00:00,62,JARDIM DA PENHA,0,1,0,0,0,0,No,0,39,0
1,558997800000000.0,M,2016-04-29 16:08:27+00:00,2016-04-29 00:00:00+00:00,56,JARDIM DA PENHA,0,0,0,0,0,0,No,1,39,0
2,4262962000000.0,F,2016-04-29 16:19:04+00:00,2016-04-29 00:00:00+00:00,62,MATA DA PRAIA,0,0,0,0,0,0,No,0,45,0
3,867951200000.0,F,2016-04-29 17:29:31+00:00,2016-04-29 00:00:00+00:00,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No,0,54,0
4,8841186000000.0,F,2016-04-29 16:07:23+00:00,2016-04-29 00:00:00+00:00,56,JARDIM DA PENHA,0,1,1,0,0,0,No,0,39,0


In [15]:
#dropping the extra columns not required in modelling

df_clean.drop(['patientid', 'gender', 'scheduledday', 'appointmentday', 'neighbourhood', 'no_show'], axis = 1, inplace = True)

In [16]:
df_clean.head()

Unnamed: 0,age,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,gender_n,neighbourhood_n,no_show_n
0,62,0,1,0,0,0,0,0,39,0
1,56,0,0,0,0,0,0,1,39,0
2,62,0,0,0,0,0,0,0,45,0
3,8,0,0,0,0,0,0,0,54,0
4,56,0,1,1,0,0,0,0,39,0


In [21]:
#saving the csv as modelling_data file for future modelling needs

df_clean.to_csv('no_show_data_modelling.csv', index = False)

In [20]:
#calculating the Event Rate i.e. no. of 0s and 1s in the target variable

df_clean.groupby('no_show_n').size()

no_show_n
0    88207
1    22319
dtype: int64

Seeing that the **Event Rate** is about **20%** (well below 50%), the classes are imbalanced, so we will try the *class_weight* parameter of the Decision Tree set to *balanced*.
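
As a rough illustration of what *balanced* means here, the sketch below derives the event rate and the per-class weights from the two counts printed above. The label array `y_counts_demo` is rebuilt from those counts purely for demonstration; it is not the notebook's `y_var`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts copied from the groupby output above
counts = np.array([88207, 22319])
y_counts_demo = np.repeat([0, 1], counts)

# Share of positive (no-show) rows
event_rate = counts[1] / counts.sum()

# 'balanced' weights follow n_samples / (n_classes * count_per_class)
weights = compute_class_weight(
    class_weight='balanced', classes=np.array([0, 1]), y=y_counts_demo
)
manual_weights = counts.sum() / (2 * counts)

print(round(float(event_rate), 3))
print(weights, manual_weights)
```

The rarer class receives the larger weight, so mistakes on no-show rows cost more during training than mistakes on the majority class.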

First, we repeat the **Decision Tree Classification** exercise *without* the class_weight parameter to get a benchmark score.

In [23]:
#creating x_var and y_var

x_var = df_clean.drop('no_show_n', axis = 1)
y_var = df_clean['no_show_n']

In [24]:
#creating a train_test split of the data

X_train, X_test, y_train, y_test = train_test_split(x_var, y_var, test_size = 0.2, random_state = 123)
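
With an imbalanced target, one optional refinement is to pass `stratify` to `train_test_split` so that both splits keep the same class ratio. The sketch below shows the idea on made-up labels with an exact 80/20 split; it is a variation for illustration, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 800 zeros and 200 ones (assumption: mirrors the ~20% event rate)
y_split_demo = np.repeat([0, 1], [800, 200])
X_split_demo = np.arange(1000).reshape(-1, 1)

# stratify=y keeps the class proportions equal in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_split_demo, y_split_demo, test_size=0.2, random_state=123, stratify=y_split_demo
)

print(y_tr.mean(), y_te.mean())
```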

#### Decision Tree (without class_weight parameter)

In [22]:
#creating the model class object WITHOUT the class_weight parameter

dt_model = tree.DecisionTreeClassifier()

In [25]:
#training the DT model using the X_train y_train data

dt_model.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [27]:
#calculating the accuracy score

dt_model.score(X_test, y_test)

0.759974667511083

We have a **76%** accuracy score benchmark.

The benchmark is for the *Decision Tree model WITHOUT the class_weight parameter*.
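
For context when reading an accuracy figure on imbalanced data, it can help to know what a trivial model that always predicts the majority class would score. The sketch below estimates this from the class counts reported earlier, using scikit-learn's `DummyClassifier`; the feature matrix is a placeholder of zeros, and the exact figure on the held-out split may differ slightly.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels rebuilt from the class counts reported earlier
y_base_demo = np.repeat([0, 1], [88207, 22319])
X_base_demo = np.zeros((len(y_base_demo), 1))  # placeholder features, ignored by the dummy model

# Always predicts the most frequent class (0 = patient showed up)
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_base_demo, y_base_demo)

baseline_accuracy = baseline.score(X_base_demo, y_base_demo)
print(round(baseline_accuracy, 3))
```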

### Using K-Fold Cross Validation test to evaluate the best Classifier for modelling the Patients No_Show

In [35]:
#creating model class objects for different Classifiers

dt = tree.DecisionTreeClassifier()  #decision tree without class_weight parameter
dt_class_weight = tree.DecisionTreeClassifier(class_weight = 'balanced')  #decision tree with class_weight parameter
svc = SVC()  #support vector classifier
rand_forest = RandomForestClassifier()

In [38]:
#calling cross_val_score to calculate the scores for every Classifier type
#we take the mean of the fold scores to get an avg accuracy score for every Classifier
#note: no cv argument is passed, so the default applies (5 folds in scikit-learn 0.22+, not 10)

dt_score = cross_val_score(dt, x_var, y_var).mean()


In [39]:
dt_class_weight_score = cross_val_score(dt_class_weight, x_var, y_var).mean()


In [41]:
rand_forest_score = cross_val_score(rand_forest, x_var, y_var).mean()
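
If a specific number of folds is wanted, a splitter such as `StratifiedKFold` can be passed through the `cv` argument of `cross_val_score`. The sketch below shows this on synthetic data; the 10 folds, the shuffling, and the `random_state` values are illustrative choices, not the settings that produced the scores reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for x_var / y_var (assumption: roughly 80/20 classes)
X_cv_demo, y_cv_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Explicit 10-fold stratified splitter with reproducible shuffling
cv_10 = StratifiedKFold(n_splits=10, shuffle=True, random_state=123)

scores = cross_val_score(
    DecisionTreeClassifier(random_state=123), X_cv_demo, y_cv_demo, cv=cv_10
)

# One accuracy per fold, then the average
print(len(scores), scores.mean())
```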

In [57]:
print('Accuracy of Decision Tree without Class_weight is : {:.3f}'.format(dt_score))
print('Accuracy of Decision Tree with Class_weight is : {:.3f}'.format(dt_class_weight_score))
print('Accuracy of Random Forest Classifier is : {:.3f}'.format(rand_forest_score))

Accuracy of Decision Tree without Class_weight is : 0.754
Accuracy of Decision Tree with Class_weight is : 0.611
Accuracy of Random Forest Classifier is : 0.754


The **Decision Tree** *without class weight* and the **Random Forest Classifier** performed similarly, each with a mean cross-validated accuracy of **75.4%**, while the **Decision Tree** *with class weight* scored lower at **61.1%**.