## Introduction

This is the best exercise of all. In this exercise we will choose the **best classification model** with the **best performing parameters** for our Patient_No_Show data.

We will use **GridSearchCV** and **RandomizedSearchCV** to select the best parameters.

We will create a **model-parameter grid** that gives us the best-scoring parameters for every model, helping us choose the best model for **Patient No_Show** prediction.
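As a rough illustration of the randomized alternative, the sketch below runs `RandomizedSearchCV` on synthetic data. The data, the sampling ranges, and the names `X_demo`, `y_demo` and `param_distributions` are assumptions made for the example and are not part of this notebook's dataset. Instead of trying every combination, it samples a fixed number of parameter settings (`n_iter`).

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption); the notebook itself uses the no-show dataframe.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Distributions (or lists) to sample from, instead of an exhaustive grid.
param_distributions = {
    "max_leaf_nodes": randint(10, 200),   # integers from 10 to 199
    "criterion": ["gini", "entropy"],
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_distributions=param_distributions,
    n_iter=5,        # number of parameter settings sampled
    cv=3,
    random_state=0,
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
print(round(random_search.best_score_, 3))
```

Randomized search tends to be useful when the grid would be too large to evaluate exhaustively; for the small grids used later in this exercise, `GridSearchCV` is affordable.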

As a *recap*, we have the following **Supervised Learning Classifiers/Models**:
1. LogisticRegression
2. DecisionTree / RandomForestClassifier
3. Naive Bayes
    - GaussianNB
    - MultinomialNB
    - BernoulliNB

*(We will not be using a Support Vector Machine (SVC), as it is computationally expensive to train.)*

We will be testing all these models with multiple parameters to identify the best performer.

**Note**: Previously we had identified **DecisionTree/RandomForestClassifier** as the better performing model, with an *accuracy score* of **75.4%**. We will use this as a benchmark.


In [20]:
#importing the required libraries

import pandas as pd
import numpy as np

#importing sklearn's libraries

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

from sklearn.preprocessing import MinMaxScaler

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier



In [2]:
#loading the data into the df

df = pd.read_csv('no_show_data_modelling.csv')

In [3]:
df.head()

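|   | age | scholarship | hipertension | diabetes | alcoholism | handcap | sms_received | gender_n | neighbourhood_n | no_show_n |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 62 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 39 | 0 |
| 1 | 56 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 39 | 0 |
| 2 | 62 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 45 | 0 |
| 3 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 54 | 0 |
| 4 | 56 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 39 | 0 |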


**Note**: There are two issues with this dataset in its current form:
1. The age feature is not normalized. We should normalize this feature.
2. The neighbourhood feature is encoded incorrectly. It has been label-encoded, which implies an ordering between neighbourhoods that does not exist. It should have been one-hot encoded, i.e. every neighbourhood should have its own column.

The first issue can be resolved in this dataframe easily. 

For the second issue, there are two options:
- We can go back to the original dataset and redo some steps of data cleaning/preparation again, OR
- We convert the dtype of the neighbourhood column to *str* and then one-hot encode it. This will make a separate column for each neighbourhood.

For the purpose of checking if these two steps will improve the accuracy of our model in any way, we will proceed with the latter option, i.e. convert the neighbourhood labels column to string and then one-hot encode it.
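The second option can be sketched on a toy dataframe as follows. The values and the names `toy` and `toy_encoded` are made up for illustration; only the column name `neighbourhood_n` comes from this notebook. Passing `columns=` to `pd.get_dummies` is an optional variation that limits the encoding to the named column.

```python
import pandas as pd

# Toy frame with a label-encoded neighbourhood column (values are made up).
toy = pd.DataFrame({"age": [62, 8, 56], "neighbourhood_n": [39, 54, 39]})

# Cast the integer codes to strings so they are treated as categories, not numbers.
toy["neighbourhood_n"] = toy["neighbourhood_n"].astype(str)

# One indicator column is created per distinct neighbourhood code.
toy_encoded = pd.get_dummies(toy, columns=["neighbourhood_n"])

print(toy_encoded.columns.tolist())
```

Because the codes are now strings, the generated columns are ordered as strings, which is why `neighbourhood_n_8` appears between `neighbourhood_n_79` and `neighbourhood_n_80` in the output further down.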

In [6]:
df_test = df.copy()

In [7]:
#normalizing the age column using MinMax Scaler

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

df_test['age'] = scaler.fit_transform(df_test[['age']])
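For reference, with its default settings `MinMaxScaler` rescales a feature to the range 0 to 1 using `(x - min) / (max - min)`. The short check below compares the scaler against that formula on a few made-up ages; the `ages` array is an assumption for the example, not data from the notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up ages as a single-column 2D array, the shape the scaler expects.
ages = np.array([[0.0], [8.0], [62.0], [115.0]])

scaled = MinMaxScaler().fit_transform(ages)

# The same transformation written out by hand.
manual = (ages - ages.min()) / (ages.max() - ages.min())

print(np.allclose(scaled, manual))
```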

In [10]:
#converting the neighbourhood Labels to str and applying one-hot encoding on them

df_test['neighbourhood_n'] = df_test['neighbourhood_n'].astype(str)

In [13]:
#applying get_dummies on the df to convert all str columns

df_test = pd.get_dummies(df_test)

In [14]:
df_test.head()

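|   | age | scholarship | hipertension | diabetes | alcoholism | handcap | sms_received | gender_n | no_show_n | neighbourhood_n_0 | ... | neighbourhood_n_73 | neighbourhood_n_74 | neighbourhood_n_75 | neighbourhood_n_76 | neighbourhood_n_77 | neighbourhood_n_78 | neighbourhood_n_79 | neighbourhood_n_8 | neighbourhood_n_80 | neighbourhood_n_9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.53913 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0.486957 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0.53913 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0.069565 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0.486957 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |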


The two changes
- Normalizing the age column
- One-hot encoding the neighbourhood column

have been applied. 

Now we will proceed with creating the prediction model using:
1. RandomForestClassifier
2. AdaBoostClassifier

In [17]:
#defining the x_var (independent/inputs) and the y_var (target)

x_var = df_test.drop('no_show_n', axis = 1)
y_var = df_test['no_show_n']

In [18]:
x_var.head()

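|   | age | scholarship | hipertension | diabetes | alcoholism | handcap | sms_received | gender_n | neighbourhood_n_0 | neighbourhood_n_1 | ... | neighbourhood_n_73 | neighbourhood_n_74 | neighbourhood_n_75 | neighbourhood_n_76 | neighbourhood_n_77 | neighbourhood_n_78 | neighbourhood_n_79 | neighbourhood_n_8 | neighbourhood_n_80 | neighbourhood_n_9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.53913 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0.486957 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0.53913 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0.069565 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0.486957 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |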


In [19]:
y_var.head()

0    0
1    0
2    0
3    0
4    0
Name: no_show_n, dtype: int64

In [22]:
#defining model object classes

rf_mod = RandomForestClassifier() #params max_leaf_nodes, criterion
ab_mod = AdaBoostClassifier() #params n_estimators, learning_rate

In [23]:
#Defining model-parameter grid

model_params = {
    'rand_forest' : {
        'model' : rf_mod,
        'params' : {
            'max_leaf_nodes' : [50, 100, 150],
            'criterion' : ['gini', 'entropy']
        }
    },
    'AdaBoost' : {
        'model' : ab_mod,
        'params' : {
            'n_estimators' : [50, 100, 150],
            'learning_rate' : [0.5, 1, 1.5]
        }
    }
}
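To get a feel for how much work a grid like this implies, the number of parameter combinations can be counted with `ParameterGrid`. The sketch below repeats the two grids as plain dictionaries and assumes five-fold cross-validation (`cv_folds = 5`, a name introduced only for this example).

```python
from sklearn.model_selection import ParameterGrid

rf_params = {"max_leaf_nodes": [50, 100, 150], "criterion": ["gini", "entropy"]}
ab_params = {"n_estimators": [50, 100, 150], "learning_rate": [0.5, 1, 1.5]}

cv_folds = 5  # assumed number of cross-validation folds

for name, grid in [("rand_forest", rf_params), ("AdaBoost", ab_params)]:
    n_candidates = len(ParameterGrid(grid))  # every combination of the listed values
    print(name, n_candidates, "candidates,", n_candidates * cv_folds, "cross-validation fits")
```

Under these assumptions the random forest grid has 6 candidates (30 fits) and the AdaBoost grid has 9 candidates (45 fits), not counting the final refit on the full data.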

In [24]:
#running a for-loop to fit the models on the dataset and calculating the best scores and identifying the best_params

scores = []

for model_name, mp in model_params.items():
    clf = GridSearchCV(mp['model'], mp['params'], cv = 5, return_train_score=False)
    clf.fit(x_var, y_var)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_,
    })
    



In [25]:
#creating a DataFrame to see the results

df_model_score = pd.DataFrame(scores, columns=['model', 'best_score', 'best_params'])
df_model_score.sort_values('best_score', ascending = False)

|   | model | best_score | best_params |
|---|---|---|---|
| 1 | AdaBoost | 0.798075 | {'learning_rate': 1.5, 'n_estimators': 50} |
| 0 | rand_forest | 0.798066 | {'criterion': 'gini', 'max_leaf_nodes': 50} |
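
The two best scores differ by less than 0.00001, so it may help to look at the spread across folds before declaring a winner. A fitted `GridSearchCV` object exposes this in `cv_results_`. The sketch below uses synthetic data and a deliberately small grid, both assumptions chosen so that it runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption), not the no-show dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(
    AdaBoostClassifier(random_state=0),
    {"n_estimators": [10, 20], "learning_rate": [0.5, 1.0]},
    cv=3,
)
search.fit(X_demo, y_demo)

# One row per parameter combination, with mean and standard deviation over the folds.
cv_table = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(cv_table)
```

If `std_test_score` is much larger than the gap between two candidates, the ranking between them is probably not meaningful.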


## Conclusion

The **AdaBoost Classifier** performed the best with a **learning rate** of 1.5 and 50 **n_estimators**, giving us a mean cross-validated accuracy of **79.8%**.

It was closely followed by **RandomForestClassifier**, which scored 0.798066 against AdaBoost's 0.798075.

It seems that the **maximum average accuracy score** on this dataset, for the models and parameters tested here, is about **79.8%**. This means that our original **RandomForestClassifier** model, at 75.4%, was already performing reasonably well.
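One way to put an accuracy figure like this into context is to compare it against a majority-class baseline, i.e. a model that always predicts the most frequent label. The sketch below does this with `DummyClassifier` on imbalanced synthetic data; the roughly 80/20 class split and all variable names are assumptions for illustration and say nothing about the actual no-show data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Imbalanced synthetic data (roughly 80/20), purely illustrative.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0
)

baseline = DummyClassifier(strategy="most_frequent")  # always predicts the majority class
forest = RandomForestClassifier(n_estimators=50, max_leaf_nodes=50, random_state=0)

baseline_acc = cross_val_score(baseline, X_demo, y_demo, cv=5, scoring="accuracy").mean()
forest_acc = cross_val_score(forest, X_demo, y_demo, cv=5, scoring="accuracy").mean()

print(f"majority-class baseline: {baseline_acc:.3f}")
print(f"random forest:           {forest_acc:.3f}")
```

Running the same comparison on `x_var` and `y_var` would show how much of the 79.8% is due to the models and how much is due to class imbalance.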

The **ideal/best parameters** for both AdaBoost and RandomForestClassifier are also given in the table above.