## Introduction

In this exercise we will choose the **best classification model**, together with its **best performing parameters**, for our Patient No_Show data.

We will use **GridSearchCV** and **RandomizedSearchCV** to select the best parameters.
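The difference between the two searches can be sketched as follows: `GridSearchCV` tries every combination in a grid, while `RandomizedSearchCV` samples a fixed number of combinations (`n_iter`) from lists or distributions. The snippet below is only an illustration on synthetic data; `X_demo`, `y_demo`, the sampling ranges and `n_estimators=20` are assumptions and are not taken from this notebook.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption): swap in x_var / y_var for the real dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)

# Values are sampled from these distributions/lists instead of being fully enumerated.
param_distributions = {
    'max_leaf_nodes': randint(20, 200),   # integers in [20, 200)
    'criterion': ['gini', 'entropy'],
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_distributions,
    n_iter=5,          # only 5 sampled combinations are evaluated
    cv=3,
    random_state=0,
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
print(round(random_search.best_score_, 3))
```

A randomized search is mainly useful when the full grid would be too large to enumerate.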

We will create a **model-parameter-grid** that gives us the best-scoring parameters for every model, helping us choose the best model for **Patient No_Show** prediction.

As a *recap*, we have the following **Supervised Learning Classifiers/Models**:
1. LogisticRegression
2. DecisionTree / RandomForestClassifier
3. Naive Bayes
    - GaussianNB
    - MultinomialNB
    - BernoulliNB

*(We will not be using the Support Vector Machine classifier (SVC), as it takes up a lot of computational resources.)*

We will be testing all these models with multiple parameters to identify the best performer.

**Note**: Previously we identified **DecisionTree/RandomForestClassifier** as the better-performing model, with an *accuracy score* of **75.4%**. We will use this as a benchmark.


In [14]:
#importing the required libraries

import pandas as pd
import numpy as np

#importing sklearn's libraries

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB



In [9]:
#loading the data into the df

df = pd.read_csv('no_show_data_modelling.csv')

In [10]:
df.head()

Unnamed: 0,age,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,gender_n,neighbourhood_n,no_show_n
0,62,0,1,0,0,0,0,0,39,0
1,56,0,0,0,0,0,0,1,39,0
2,62,0,0,0,0,0,0,0,45,0
3,8,0,0,0,0,0,0,0,54,0
4,56,0,1,1,0,0,0,0,39,0


In [11]:
#defining the x_var (independent/inputs) and the y_var (target)

x_var = df.drop('no_show_n', axis = 1)
y_var = df['no_show_n']

In [12]:
x_var.head()

Unnamed: 0,age,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,gender_n,neighbourhood_n
0,62,0,1,0,0,0,0,0,39
1,56,0,0,0,0,0,0,1,39
2,62,0,0,0,0,0,0,0,45
3,8,0,0,0,0,0,0,0,54
4,56,0,1,1,0,0,0,0,39


In [13]:
y_var.head()

0    0
1    0
2    0
3    0
4    0
Name: no_show_n, dtype: int64

In [26]:
#defining model object classes

log_mod = LogisticRegression() #params C, class_weight, solver, max_iter
rf_mod = RandomForestClassifier() #params max_leaf_nodes, criterion
nb_gaussian = GaussianNB()
nb_multinomial = MultinomialNB()
nb_bernoulli = BernoulliNB() # params fit_prior

In [27]:
#Defining model-parameter grid

model_params = {
    'log_regression' : {
        'model' : log_mod, 
        'params' : {
            'C' : [1,5,10],
            'class_weight' : [None, 'balanced'],
            'solver' : ['lbfgs', 'liblinear', 'sag'],
            'max_iter' : [100, 350, 500]
                        }
    },
    'rand_forest' : {
        'model' : rf_mod,
        'params' : {
            'max_leaf_nodes' : [50, 100, 150],
            'criterion' : ['gini', 'entropy']
        }
    },
    'gaussian_NB' : {
        'model' : nb_gaussian,
        'params' : {}
    },
    'multinomial_NB' : {
        'model' : nb_multinomial,
        'params' : {}
    },
    'bernoulli_NB' : {
        'model' : nb_bernoulli,
        'params' : {
            'fit_prior' : [True, False]
        }
    }
}
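To get a feel for how much work this grid implies, the number of parameter combinations per model can be counted with `ParameterGrid`. This is a small sketch: the `grids` dictionary simply repeats the parameter values defined above without the estimator objects, and `n_folds = 5` is assumed to match the `cv = 5` used in the search.

```python
from sklearn.model_selection import ParameterGrid

# Same parameter values as in model_params, without the estimator objects.
grids = {
    'log_regression': {
        'C': [1, 5, 10],
        'class_weight': [None, 'balanced'],
        'solver': ['lbfgs', 'liblinear', 'sag'],
        'max_iter': [100, 350, 500],
    },
    'rand_forest': {
        'max_leaf_nodes': [50, 100, 150],
        'criterion': ['gini', 'entropy'],
    },
    'gaussian_NB': {},      # an empty grid still counts as one candidate (the defaults)
    'multinomial_NB': {},
    'bernoulli_NB': {'fit_prior': [True, False]},
}

n_folds = 5  # assumed to match cv = 5 in the grid search

candidates = {name: len(ParameterGrid(grid)) for name, grid in grids.items()}
total_fits = sum(candidates.values()) * n_folds

for name, n in candidates.items():
    print(f'{name}: {n} candidates, {n * n_folds} cross-validation fits')
print('total cross-validation fits:', total_fits)
```

Logistic regression dominates the cost here, with 3 x 2 x 3 x 3 = 54 candidates out of 64 in total.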

In [28]:
#running a for-loop to fit each model on the dataset with GridSearchCV
#and recording the best score and best parameters for every model

scores = []

for model_name, mp in model_params.items():
    clf = GridSearchCV(mp['model'], mp['params'], cv = 5, return_train_score=False)
    clf.fit(x_var, y_var)
    scores.append(
        {'model': model_name,
         'best_score': clf.best_score_,
         'best_params': clf.best_params_})
    



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
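
The warning itself suggests two remedies: raising `max_iter` or scaling the data. One possible way to apply scaling inside the grid search is to wrap the model in a pipeline, so the scaler is refit on each training fold. The sketch below uses synthetic data (`X_demo`, `y_demo` are assumptions) and is not the approach used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); one column is put on a much larger scale,
# similar to mixing an age-like column with 0/1 flag columns.
X_demo, y_demo = make_classification(n_samples=400, n_features=9, random_state=0)
X_demo[:, 0] = X_demo[:, 0] * 100

# The scaler is fit on each training fold only, then applied to the validation fold.
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    'logisticregression__C': [1, 5, 10],
    'logisticregression__solver': ['lbfgs', 'liblinear', 'sag'],
}

scaled_search = GridSearchCV(scaled_log_reg, param_grid, cv=5)
scaled_search.fit(X_demo, y_demo)

print(scaled_search.best_params_)
print(round(scaled_search.best_score_, 3))
```

Note that standardised features can be negative, so this kind of scaling would not suit `MultinomialNB`, which expects non-negative inputs.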

In [29]:
#creating a DataFrame to see the results

df_model_score = pd.DataFrame(scores, columns=['model', 'best_score', 'best_params'])
df_model_score.sort_values('best_score', ascending = False)

Unnamed: 0,model,best_score,best_params
1,rand_forest,0.798075,"{'criterion': 'entropy', 'max_leaf_nodes': 100}"
0,log_regression,0.798066,"{'C': 1, 'class_weight': None, 'max_iter': 100..."
4,bernoulli_NB,0.798066,{'fit_prior': True}
2,gaussian_NB,0.786141,{}
3,multinomial_NB,0.704332,{}
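
Because the top three scores differ only in the sixth decimal place, it may help to look at the spread of the scores across folds as well as the mean. `GridSearchCV` exposes this through `cv_results_`. The sketch below runs on synthetic data with two of the models only; `X_demo`, `y_demo`, `demo_models` and `df_spread` are illustrative names, not part of the notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB, GaussianNB

# Synthetic stand-in data (assumption): swap in x_var / y_var for the real dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)

demo_models = {
    'gaussian_NB': (GaussianNB(), {}),
    'bernoulli_NB': (BernoulliNB(), {'fit_prior': [True, False]}),
}

rows = []
for name, (model, params) in demo_models.items():
    search = GridSearchCV(model, params, cv=5)
    search.fit(X_demo, y_demo)
    best = search.best_index_  # row of cv_results_ belonging to the best candidate
    rows.append({
        'model': name,
        'mean_score': search.cv_results_['mean_test_score'][best],
        'std_score': search.cv_results_['std_test_score'][best],
    })

df_spread = pd.DataFrame(rows).sort_values('mean_score', ascending=False)
print(df_spread)
```

If the standard deviation across folds is much larger than the gap between two models, their ranking should not be read as meaningful.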


## Conclusion

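The **Random Forest Classifier** scored highest, but only marginally: its best score of 0.798075 is about 0.00001 above the 0.798066 shared by the **Logistic Regression** and **Bernoulli Naive Bayes** classifiers, so the three are effectively tied.

The **highest mean cross-validated accuracy** for this dataset is **79.8%**, about 4.4 percentage points above the **75.4%** benchmark. This suggests that our original **RandomForestClassifier** was already performing reasonably well.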

The **ideal/best parameters** are also given in the table above.
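
When several very different models land on almost the same accuracy, one possible follow-up check is to compare that score with a majority-class baseline, that is, a model that always predicts the most frequent label. The sketch below uses synthetic labels with an assumed 80/20 class split; the real class balance of `no_show_n` is not shown in this notebook and would need to be checked on the actual data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in data (assumption): 80% class 0, 20% class 1.
rng = np.random.default_rng(0)
y_demo = np.array([0] * 800 + [1] * 200)
X_demo = rng.normal(size=(1000, 9))  # features are ignored by the dummy model

# Always predicts the most frequent class seen during training.
baseline = DummyClassifier(strategy='most_frequent')
baseline_scores = cross_val_score(baseline, X_demo, y_demo, cv=5)

print(round(baseline_scores.mean(), 3))
```

If the tuned models barely exceed this baseline on the real data, accuracy alone may not be the most informative metric for choosing between them.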