## Introduction

This is an exercise in building a **Decision Tree** to predict whether a patient will show up to their appointment.

The data has been downloaded from Kaggle: https://www.kaggle.com/joniarroba/noshowappointments

It was then cleaned (in a separate Kernel).

In the first revision of this Kernel we are only going to create a Decision Tree model.

In a later version we will try a logistic regression model.

In [7]:
#importing required libraries

import pandas as pd
import numpy as np

#libraries required for decision tree modelling

from sklearn.model_selection import train_test_split  #for dividing the dataset into train and test segments
from sklearn.preprocessing import LabelEncoder  #for encoding categorical variables as integer codes
from sklearn import tree


In [38]:
#loading the cleaned data into a pd DataFrame

df = pd.read_csv('no_show_data_clean.csv')

In [39]:
df.head()

Unnamed: 0,patientid,gender,scheduledday,appointmentday,age,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,no_show
0,29872500000000.0,F,2016-04-29 18:38:08+00:00,2016-04-29 00:00:00+00:00,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,M,2016-04-29 16:08:27+00:00,2016-04-29 00:00:00+00:00,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,F,2016-04-29 16:19:04+00:00,2016-04-29 00:00:00+00:00,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,F,2016-04-29 17:29:31+00:00,2016-04-29 00:00:00+00:00,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,F,2016-04-29 16:07:23+00:00,2016-04-29 00:00:00+00:00,56,JARDIM DA PENHA,0,1,1,0,0,0,No


In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110526 entries, 0 to 110525
Data columns (total 13 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   patientid       110526 non-null  float64
 1   gender          110526 non-null  object 
 2   scheduledday    110526 non-null  object 
 3   appointmentday  110526 non-null  object 
 4   age             110526 non-null  int64  
 5   neighbourhood   110526 non-null  object 
 6   scholarship     110526 non-null  int64  
 7   hipertension    110526 non-null  int64  
 8   diabetes        110526 non-null  int64  
 9   alcoholism      110526 non-null  int64  
 10  handcap         110526 non-null  int64  
 11  sms_received    110526 non-null  int64  
 12  no_show         110526 non-null  object 
dtypes: float64(1), int64(7), object(5)
memory usage: 11.0+ MB


We are going to create **v.1** of the Decision Tree using the following **9** variables:
1. gender (categorical)
2. age
3. neighbourhood (categorical)
4. scholarship
5. hipertension
6. diabetes
7. alcoholism
8. handcap
9. sms_received

The **no_show** variable is our target variable.

We are going to remove the extra columns from the table so that we only have the *9 independent variables* and the *target variable*.

In [41]:
#dropping extra columns

columns = ['patientid', 'scheduledday', 'appointmentday']
df.drop(columns, axis = 1, inplace = True)

In [42]:
df.head(3)

Unnamed: 0,gender,age,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,no_show
0,F,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,M,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,F,62,MATA DA PRAIA,0,0,0,0,0,0,No


Before we proceed to the *Analysis* part, we need to encode the **no_show** column as integers, since it is our target variable.

In [51]:
#label encoding the no_show column

le_show = LabelEncoder()  #creating object of LabelEncoder class

df['no_show_n'] = le_show.fit_transform(df['no_show'])  #adding new column with the integer codes

df.drop('no_show', axis = 1, inplace = True)

**Note**: a *no_show_n* value of 0 corresponds to *no_show* = 'No', meaning that the patient showed up for their appointment; a value of 1 means that they did not.

In [52]:
df.head(3)

Unnamed: 0,gender,age,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,no_show_n
0,F,62,JARDIM DA PENHA,0,1,0,0,0,0,0
1,M,56,JARDIM DA PENHA,0,0,0,0,0,0,0
2,F,62,MATA DA PRAIA,0,0,0,0,0,0,0


In [53]:
#creating a copy of the df for analysis

df_ana = df.copy()

## Analysis - Creating the Decision Tree Model

1. Dividing the data into independent (*x_var*) and target (*y_var*) variables

In [54]:
x_var = df_ana.drop('no_show_n', axis = 1)

In [55]:
y_var = df_ana['no_show_n']

2. Applying label encoding to the categorical variables in *x_var*

The following variables need to be encoded:
- gender
- neighbourhood

In [56]:
#creating objects of LabelEncoder class for every categorical var

le_gender = LabelEncoder()
le_neigh = LabelEncoder()

In [57]:
#creating columns with the data encoded

x_var['gender_n'] = le_gender.fit_transform(x_var['gender'])
x_var['neighbourhood_n'] = le_neigh.fit_transform(x_var['neighbourhood'])


**Note**: The encodings for the variables are given below:
- *Gender* - Male : 1 , Female : 0
- *Neighbourhood* - too many values to list here (each neighbourhood gets its own integer code)
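
The mapping behind these codes can be read back from a fitted encoder. The snippet below is a minimal sketch on toy values rather than the real columns; `encoding_map` and the `*_demo` names are hypothetical helpers introduced only for illustration. It relies on `LabelEncoder` storing its labels in sorted order in `classes_`, so the position of a label is its integer code.

```python
from sklearn.preprocessing import LabelEncoder


def encoding_map(encoder):
    """Return {original label: integer code} for a fitted LabelEncoder."""
    # classes_ is sorted, and the index of each label is its code
    return {str(label): code for code, label in enumerate(encoder.classes_)}


# Toy values standing in for the real columns (illustrative only)
le_gender_demo = LabelEncoder().fit(["F", "M", "F"])
le_show_demo = LabelEncoder().fit(["No", "Yes", "No"])

gender_map = encoding_map(le_gender_demo)
show_map = encoding_map(le_show_demo)

print(gender_map)
print(show_map)
```

Applied to the notebook's own encoders, `encoding_map(le_neigh)` would list the code assigned to every neighbourhood.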

In [59]:
x_var.head(3)

Unnamed: 0,gender,age,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,gender_n,neighbourhood_n
0,F,62,JARDIM DA PENHA,0,1,0,0,0,0,0,39
1,M,56,JARDIM DA PENHA,0,0,0,0,0,0,1,39
2,F,62,MATA DA PRAIA,0,0,0,0,0,0,0,45


3. Dropping excess columns from *x_var*

In [62]:
x_var.drop(['gender', 'neighbourhood'], axis = 1, inplace = True)

4. Making a Train/Test split

We will make a Train/Test split of *80/20*: 80% of the data will be used to train the model and the remaining 20% will be held out to test its accuracy.

In [68]:
#splitting the data into training and testing datasets

X_train, X_test, y_train, y_test = train_test_split(x_var, y_var, test_size=0.2, random_state=12)
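
The split above is purely random. If one outcome is much rarer than the other, passing `stratify` to `train_test_split` keeps the class proportions the same in both segments. The sketch below uses synthetic data, not the appointment dataset; the 20% positive rate and all `demo_*` names are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 20% of them labelled 1 (assumed rate)
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({"age": rng.integers(0, 90, size=1000)})
demo_y = pd.Series([1] * 200 + [0] * 800, name="no_show_n")

# stratify=demo_y asks for the same share of each class in train and test
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=12, stratify=demo_y
)

train_rate = demo_y_train.mean()
test_rate = demo_y_test.mean()
print(train_rate, test_rate)
```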

5. Making the **Decision Tree** model

In [69]:
d_tree = tree.DecisionTreeClassifier()

In [70]:
#fitting the model on the training dataset

d_tree.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [71]:
#checking the accuracy score of the decision tree model

d_tree.score(X_test, y_test)

0.7621912602913237

The accuracy of this Decision Tree model on the test set is about **76%**.
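
Accuracy on its own can be hard to interpret when one class is much more common than the other, so it may help to compare it with a trivial baseline and to look at the confusion matrix. The sketch below runs on synthetic data with an assumed 20% positive rate; it does not reproduce the notebook's results, and the `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 3 numeric features, about 20% positives (assumed rate)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 3))
y_demo = (rng.random(2000) < 0.2).astype(int)

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=12
)

# Baseline that always predicts the most frequent class seen in training
baseline_demo = DummyClassifier(strategy="most_frequent").fit(Xtr, ytr)
tree_demo = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)

baseline_acc = baseline_demo.score(Xte, yte)
tree_acc = tree_demo.score(Xte, yte)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(yte, tree_demo.predict(Xte))

print("baseline accuracy:", baseline_acc)
print("tree accuracy:", tree_acc)
print(cm)
```

On the notebook's data the same comparison would use `X_train`, `X_test`, `y_train`, `y_test` and `d_tree` in place of the synthetic objects.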

## Conclusion

The decision tree takes in **9** features and predicts *whether a person would show up to their appointment or not* with **76%** accuracy.

- The decision tree could be improved with additional performance tuning, e.g. **feature engineering**. This will be explored further in version 2 of this decision tree exercise.
- We can also try to fit a logistic regression model to this dataset and see how well it performs.
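
As a rough idea of what the logistic regression follow-up could look like, the sketch below one-hot encodes the categorical columns and fits `LogisticRegression` inside a scikit-learn pipeline. It is an assumption about one possible approach, not the planned implementation: the data is randomly generated, only four of the nine columns are included, and the `toy*` and `logit_demo` names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Randomly generated stand-in data using a few of the notebook's column names
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "gender": rng.choice(["F", "M"], size=n),
    "neighbourhood": rng.choice(
        ["JARDIM DA PENHA", "MATA DA PRAIA", "PONTAL DE CAMBURI"], size=n
    ),
    "age": rng.integers(0, 90, size=n),
    "sms_received": rng.integers(0, 2, size=n),
})
toy_target = rng.integers(0, 2, size=n)  # random labels, illustration only

# One-hot encode the categorical columns, scale age, pass the rest through
preprocess = ColumnTransformer(
    [
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["gender", "neighbourhood"]),
        ("scale", StandardScaler(), ["age"]),
    ],
    remainder="passthrough",
)

logit_demo = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
logit_demo.fit(toy, toy_target)

# Predicted probability of each class for the first five rows
probs = logit_demo.predict_proba(toy.head(5))
print(probs)
```

One-hot encoding is used here instead of integer codes because a linear model would otherwise treat the neighbourhood codes as an ordered numeric scale.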