# Model-building phase: supervised approaches
- [Preprocessing](#Data-Preprocessing)
    - [Scaling Data](#Scaling-Data)
    - [Splitting Data](#Splitting-Data)
- [Supervised Models](#Supervised-Models)
    - [Random Forest](#Random-Forest-with-Water-100)
    - [Gradient Boosting](#Gradient-Boosting-with-Water-100)
    - [Perceptron](#Perceptron)
    - [SVM](#Support-Vector-Machines-(SVM))


In [40]:
import pandas as pd
import numpy as np
import random
from prep import *
from pprint import pprint
import pickle

from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.linear_model import Perceptron
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier


## Data Preprocessing 

In preprocessing the data we use the *prep* function, which handles missing values and scaling in a single step. For missing values, we can either remove all the affected observations or remove only a percentage of them and replace the remaining missing values with the mean or median of the corresponding variable. For scaling, any scikit-learn scaler can be chosen; MinMaxScaler is the default. The function takes a pandas DataFrame as input and returns a numpy ndarray containing the cleaned data.
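The source of *prep* is not shown in this notebook, so the block below is only a minimal sketch of the behaviour described above, under stated assumptions. The name `prep_sketch`, its `seed` argument, and the choice to drop rows at random are hypothetical; the real function may differ.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def prep_sketch(data, target, perc=100, fill_method='mean', scaler=None, seed=0):
    """Hypothetical stand-in for prep: drop `perc`% of the rows that contain
    missing values, impute what is left, then scale the features."""
    scaler = MinMaxScaler() if scaler is None else scaler

    # Rows with at least one missing value
    rows_with_nan = data.index[data.isna().any(axis=1)].to_numpy()
    n_drop = int(round(len(rows_with_nan) * perc / 100))

    # Assumption: the rows to drop are picked at random
    rng = np.random.default_rng(seed)
    to_drop = rng.choice(rows_with_nan, size=n_drop, replace=False)
    kept = data.drop(index=to_drop)

    # Impute the remaining missing values column by column
    features = kept.drop(columns=target)
    fill_values = features.mean() if fill_method == 'mean' else features.median()
    features = features.fillna(fill_values)

    # Scale the features and append the untouched target as the last column
    X = scaler.fit_transform(features)
    y = kept[target].to_numpy().reshape(-1, 1)
    return np.hstack([X, y])


# Toy data, used only to make the sketch runnable
toy = pd.DataFrame({
    'ph': [7.0, np.nan, 6.5, 8.1, np.nan, 7.4],
    'Hardness': [200.0, 180.0, np.nan, 210.0, 190.0, 205.0],
    'Potability': [0, 1, 0, 1, 0, 1],
})
toy50 = prep_sketch(toy, target='Potability', perc=50)
toy100 = prep_sketch(toy, target='Potability', perc=0)
print(toy50.shape, toy100.shape)
```

With `perc=0` no row is removed and every missing value is imputed, which corresponds to the Water 100 setting used in this notebook.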

Our idea is to generate two datasets: the first by keeping every observation and imputing all the missing values, the second by removing 50 percent of the observations that have at least one missing value and imputing the rest. We will then train each model on both datasets and collect their metrics, in order to assess whether, on average, removing part of the incomplete observations instead of imputing all of them brings any benefit.

### Scaling Data
To see whether models trained on a dataset with all the NaN values imputed perform differently from models trained on a dataset with only part of the NaN values imputed, we created two datasets:
- one where we imputed all the NaN values (Water 100 dataset)
- one where we imputed only half of the NaN values and deleted the remaining incomplete observations (Water 50 dataset)

In [41]:
# Load the raw data and build the two cleaned datasets
water = pd.read_csv('dataset/drinking_water_potability.csv')
water50 = prep(
    data = water,
    target='Potability',
    axis='obs',
    perc=50,
    fill_method='mean',
    scaler= StandardScaler()
)
water100 = prep(
    data = water,
    target='Potability',
    axis='obs',
    perc=0,
    fill_method='mean',
    scaler= StandardScaler()
)
print('original dataset size: ', water.shape, '- type: ', type(water))
print('cleaned dataset with 50% of missing values removed: ', np.shape(water50), '- type: ', type(water50))
print('cleaned dataset with no missing values removed: ', np.shape(water100), '- type: ', type(water100))

original dataset size:  (3276, 10) - type:  <class 'pandas.core.frame.DataFrame'>
cleaned dataset with 50% of missing values removed:  (2644, 10) - type:  <class 'numpy.ndarray'>
cleaned dataset with no missing values removed:  (3276, 10) - type:  <class 'numpy.ndarray'>


### Splitting Data
At this point we divide each dataset into a training set, a validation set and a test set. To do this, we use the *split()* helper imported from *prep*, which relies on the *train_test_split()* function of scikit-learn.

In [42]:
# Fixed seed to make the experiments repeatable
random_seed = 42

In [43]:
#Water 50
X_train50, X_val50, X_test50, y_train50, y_val50, y_test50=split(df = water50,
                                                    target_index = 9,
                                                    validation = True,
                                                    perc_train = 0.7,
                                                    random_seed = random_seed,
                                                    verbose=True
                                                    )

BEFORE SPLITTING: 

X shape:  (2644, 9)
y shape:  (2644,)

AFTER SPLITTING: 
X_train shape:  (1850, 9)
y_train shape:  (1850,)
X_test shape:  (794, 9)
y_test shape:  (794,)
X_val shape:  (397, 9)
y_val shape:  (397,)


In [44]:
#Water 100
X_train100, X_val100, X_test100, y_train100, y_val100, y_test100=split(df = water100,
                                                    target_index = 9,
                                                    perc_train = 0.7,
                                                    random_seed = random_seed,
                                                    verbose=True)

BEFORE SPLITTING: 

X shape:  (3276, 9)
y shape:  (3276,)

AFTER SPLITTING: 
X_train shape:  (2293, 9)
y_train shape:  (2293,)
X_test shape:  (983, 9)
y_test shape:  (983,)
X_val shape:  (491, 9)
y_val shape:  (491,)
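The *split* helper used above is defined in *prep* and its code is not shown here. As a rough sketch, a three-way split can be obtained by calling scikit-learn's `train_test_split` twice. The proportions below (70% training, the remainder divided equally between validation and test) and the synthetic data are assumptions and may not match what *split* does internally.

```python
import numpy as np
from sklearn.model_selection import train_test_split

random_seed = 42

# Synthetic stand-in for a cleaned dataset: 9 features plus a binary target in column 9
rng = np.random.default_rng(random_seed)
data = rng.normal(size=(1000, 10))
data[:, 9] = rng.integers(0, 2, size=1000)

X, y = data[:, :9], data[:, 9]

# First cut: 70% for training, 30% held out
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.7, random_state=random_seed)

# Second cut: divide the held-out part equally into validation and test
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=random_seed)

print(X_train.shape, X_val.shape, X_test.shape)
```

Splitting in two steps keeps the three sets disjoint, so the validation set can drive hyperparameter choices while the test set stays untouched until the final evaluation.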


## Supervised Models
The supervised models we chose are the following:
- **Random Forest**
- **Gradient Boosting**


### **Random Forest** with Water 100
In this part of the report we train the Random Forest on Water100, the dataset with all the NaN values imputed.

#### Random Forest Base Model

In [78]:
# Training base model
rf_base = RandomForestClassifier(random_state = random_seed)
rf_base.fit(X_train100, y_train100)
#Computing the base model accuracy
rf_base_accuracy = evaluate(rf_base, X_val100, y_val100)

Model Performance
Accuracy = 80.04%.
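The *evaluate* helper comes from *prep* and is not listed in this notebook. Judging only from the printed output, it could look roughly like the sketch below. The name `evaluate_sketch`, the return value (accuracy as a fraction), and the synthetic data with a logistic regression model are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def evaluate_sketch(model, X, y):
    """Hypothetical stand-in for evaluate: print and return the accuracy."""
    accuracy = accuracy_score(y, model.predict(X))
    print('Model Performance')
    print('Accuracy = {:0.2f}%.'.format(100 * accuracy))
    return accuracy


# Synthetic data, used only to make the sketch runnable
X, y = make_classification(n_samples=200, n_features=9, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
val_accuracy = evaluate_sketch(model, X_val, y_val)
```

Because the later cells report a relative improvement, it does not matter whether the helper returns a fraction or a percentage, as long as it is consistent across models.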


#### Tuning Random Forest with Random Search and Cross Validation
First we search for the best hyperparameter configuration with a random search, sampling 100 random combinations out of the thousands available. After this first step we refine the result with a grid search around the best combination found by the random search.

In [73]:
#Defining Random Forest Classifier
rf = RandomForestClassifier(random_state = random_seed)
# Look at parameters used by our current forest
print('Parameters currently in use:\n')
pprint(rf.get_params())

Parameters currently in use:

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}



As we can see, there are many hyperparameters that we can tune, but for the moment we will focus only on the most important ones.

[Info about RandomForest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)

In [74]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
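# Number of features to consider at every split
# ('auto' is equivalent to 'sqrt' for classifiers and is deprecated in recent scikit-learn releases)
max_features = ['auto', 'sqrt','log2',None]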
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Function used to measure the quality of a split
criterion =['gini', 'entropy', 'log_loss']
# Create the random grid
params_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap,
               'criterion':criterion}
params_grid


{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000],
 'max_features': ['auto', 'sqrt', 'log2', None],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
 'min_samples_split': [2, 5, 10],
 'min_samples_leaf': [1, 2, 4],
 'bootstrap': [True, False],
 'criterion': ['gini', 'entropy', 'log_loss']}

If we tried all the possible combinations, we would have to train the random forest 25920 times, not counting the cross-validation folds for each combination, which would require too much computational time. For this reason we start with a random search that uses only 100 random combinations among those available.
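The size of the search space can be checked directly. The sketch below rebuilds the grid defined above and counts its combinations with scikit-learn's `ParameterGrid`; the variable names `n_combinations` and `n_fits_with_cv` are introduced here only for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above
params_grid = {
    'n_estimators': [int(x) for x in np.linspace(start=200, stop=2000, num=10)],
    'max_features': ['auto', 'sqrt', 'log2', None],
    'max_depth': [int(x) for x in np.linspace(10, 110, num=11)] + [None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
    'criterion': ['gini', 'entropy', 'log_loss'],
}

# ParameterGrid enumerates the Cartesian product of all the value lists
n_combinations = len(ParameterGrid(params_grid))
n_fits_with_cv = n_combinations * 3  # 3-fold cross-validation

print(n_combinations)
print(n_fits_with_cv)
```

Counting the grid this way only builds the lists of values, so it does not train any model.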

In [75]:
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
#
rf_random_search = RandomizedSearchCV(estimator = rf, param_distributions = params_grid,
                                n_iter = 100, cv = 3, verbose=3, random_state=random_seed, n_jobs = -1)
# Fit the model
#
#rf_random_search.fit(X_train100, y_train100)

#### Store and Load Models
To avoid retraining the models every time, after each training run we save the results inside the *Models* folder using the pickle library.


In [76]:
#Saving
#pickle.dump(rf_random_search, open('Models/Random_Forest_rs_w100.pkl', 'wb'))
#Loading
rf_random_search=pickle.load(open('Models/Random_Forest_rs_w100.pkl', 'rb'))
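As a small aside, the save and load steps can be written with context managers so that the file handles are always closed. This is only a sketch: a temporary folder stands in for the *Models* folder, and a plain dictionary stands in for the fitted search object.

```python
import os
import pickle
import tempfile

# Any picklable object works; a fitted search object would be handled the same way
best_params = {'n_estimators': 1400, 'max_depth': 10}

# A temporary folder is used here in place of the Models folder
with tempfile.TemporaryDirectory() as models_dir:
    path = os.path.join(models_dir, 'example_rs.pkl')

    # Saving
    with open(path, 'wb') as f:
        pickle.dump(best_params, f)

    # Loading
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == best_params)
```

Pickled scikit-learn objects are generally expected to be loaded with the same library version that produced them, so it may be worth recording the version next to the saved files.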

In [77]:
rf_random_search.best_params_

{'n_estimators': 1400,
 'min_samples_split': 10,
 'min_samples_leaf': 1,
 'max_features': 'auto',
 'max_depth': 10,
 'criterion': 'log_loss',
 'bootstrap': True}
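Instead of typing the best hyperparameters by hand, they can be unpacked from `best_params_`, or the refitted model can be taken from `best_estimator_` (available when `refit=True`, which is the default). The sketch below uses synthetic data and a deliberately small search space, both chosen only to keep it quick to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data, used only to make the sketch runnable
X, y = make_classification(n_samples=300, n_features=9, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions={
        'n_estimators': [20, 50],
        'max_depth': [3, 5, None],
        'min_samples_split': [2, 5, 10],
    },
    n_iter=5, cv=3, random_state=42)
search.fit(X_train, y_train)

# Option 1: rebuild the model by unpacking the best parameters
tuned = RandomForestClassifier(random_state=42, **search.best_params_)
tuned.fit(X_train, y_train)

# Option 2: reuse the model that the search already refitted on the training set
same_predictions = (tuned.predict(X_val) == search.best_estimator_.predict(X_val)).all()
print(search.best_params_)
print(same_predictions)
```

Unpacking the dictionary avoids transcription mistakes when the best values change between runs.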

#### Comparing base model with tuned model with Random Search
To check whether we are going in the right direction, we compare the base model, which has no hyperparameter tuning, with the tuned model, using the accuracy of both models on the validation set.

In [79]:
# Training tuned model
rf_random =  RandomForestClassifier(bootstrap = True,criterion='log_loss',max_depth=10,max_features='auto',
                                      min_samples_leaf=1,min_samples_split=10,n_estimators=1400,
                                      random_state=random_seed)
rf_random.fit(X_train100,y_train100)
#Computing the tuned model accuracy                                 
rf_random_accuracy = evaluate(rf_random, X_val100, y_val100)

print('The tuned model had an improvement of {:0.2f}%.'.format( 100 * (rf_random_accuracy - rf_base_accuracy) / rf_base_accuracy))




Model Performance
Accuracy = 80.45%.
The tuned model had an improvement of 0.51%.


#### Tuning Random Forest with Grid Search and Cross Validation
Random search allowed us to narrow down the range for each hyperparameter. Now that we know where to concentrate our search, we can explicitly specify every combination of settings to try. We do this with GridSearchCV, a method that, instead of sampling randomly from a distribution, evaluates all combinations we define. To use Grid Search, we make another grid based on the best values provided by random search:

In [80]:
#Defining Random Forest Classifier
rf = RandomForestClassifier(random_state = random_seed)

In [81]:
rf_random_search.best_params_

{'n_estimators': 1400,
 'min_samples_split': 10,
 'min_samples_leaf': 1,
 'max_features': 'auto',
 'max_depth': 10,
 'criterion': 'log_loss',
 'bootstrap': True}

In [83]:

# Create the parameter grid based on the results of random search 
params_grid = {
    'bootstrap': [True],
    'criterion':['log_loss'],
    'max_depth': [5,10,15],
    'max_features': ['auto'],
    'min_samples_leaf': [1,2,3],
    'min_samples_split': [5,10,15],
    'n_estimators': [1300,1400,1500]
}
# Instantiate the grid search model
rf_grid_search = GridSearchCV(estimator = rf, param_grid = params_grid, 
                          cv = 5, n_jobs = -1, verbose = 1)
# Fit the grid search to the data
#rf_grid_search.fit(X_train100, y_train100)

In [84]:
#Saving
#pickle.dump(rf_grid_search, open('Models/Random_Forest_gs_w100.pkl', 'wb'))
#Loading
rf_grid_search = pickle.load(open('Models/Random_Forest_gs_w100.pkl', 'rb'))

In [85]:
rf_grid_search.best_params_

{'bootstrap': True,
 'criterion': 'log_loss',
 'max_depth': 10,
 'max_features': 'auto',
 'min_samples_leaf': 2,
 'min_samples_split': 15,
 'n_estimators': 1500}

#### Comparing tuned models 
To check whether the grid search brought a further gain, we compare the model tuned with grid search against both the base model and the model tuned with random search, using the accuracy on the validation set.

In [86]:
# Training tuned model
rf_final =  RandomForestClassifier(bootstrap = True,criterion='log_loss',max_depth=10,max_features='auto',
                                      min_samples_leaf=2,min_samples_split=15,n_estimators=1500,
                                      random_state=random_seed)
rf_final.fit(X_train100,y_train100)
#Computing the tuned model accuracy                                 
rf_grid_accuracy = evaluate(rf_final, X_val100, y_val100)

print('The model tuned with grid search improved by {:0.2f}% relative to the base model'
                        .format( 100 * (rf_grid_accuracy - rf_base_accuracy) / rf_base_accuracy))

print('The model tuned with grid search improved by {:0.2f}% relative to the model tuned with random search'
                        .format( 100 * (rf_grid_accuracy - rf_random_accuracy) / rf_random_accuracy))


Model Performance
Accuracy = 80.86%.
The model tuned with grid search improved by 1.02% relative to the base model
The model tuned with grid search improved by 0.51% relative to the model tuned with random search


### Gradient Boosting with Water 100
In this part of the report we train the Gradient Boosting on Water100, the dataset with all the NaN values imputed.

In boosting, the individual models are not built on completely random subsets of data and features but sequentially, putting more weight on instances with wrong predictions and high errors. The general idea is that instances which are hard to predict correctly (“difficult” cases) are focused on during learning, so that the model learns from past mistakes. When each tree of the ensemble is trained on a subset of the training set, the method is also called Stochastic Gradient Boosting, which can help improve the generalizability of the model. The gradient is used to minimize a loss function, similar to how neural networks use gradient descent to optimize (“learn”) their weights.

Source: [link](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)
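To make the stochastic variant mentioned above concrete, the sketch below fits two `GradientBoostingClassifier` models that differ only in `subsample`: with a value below 1.0 each tree is fitted on a random fraction of the training set. The synthetic data and the value 0.8 are assumptions chosen for illustration, not settings used in this report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data, used only to make the sketch runnable
X, y = make_classification(n_samples=500, n_features=9, n_informative=5,
                           random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42)

# subsample=1.0: every tree sees the full training set (plain gradient boosting)
gb_full = GradientBoostingClassifier(subsample=1.0, random_state=42)
gb_full.fit(X_train, y_train)

# subsample=0.8: every tree sees a random 80% of the training set (stochastic variant)
gb_stochastic = GradientBoostingClassifier(subsample=0.8, random_state=42)
gb_stochastic.fit(X_train, y_train)

acc_full = accuracy_score(y_val, gb_full.predict(X_val))
acc_stochastic = accuracy_score(y_val, gb_stochastic.predict(X_val))
print('full: {:0.3f} - stochastic: {:0.3f}'.format(acc_full, acc_stochastic))
```

Which of the two generalizes better depends on the data, so `subsample` is usually treated as one more hyperparameter to tune.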

#### Gradient Boosting Base Model

In [87]:
# Training base model
gb_base = GradientBoostingClassifier( random_state = random_seed)
gb_base.fit(X_train100, y_train100)
#Computing the base model accuracy
gb_base_accuracy = evaluate(gb_base, X_val100, y_val100)

Model Performance
Accuracy = 78.82%.


#### Tuning Gradient Boosting with Random Search and Cross Validation

First we search for the best hyperparameter configuration with a random search, sampling 100 random combinations out of the thousands available. After this first step we refine the result with a grid search around the best combination found by the random search.

In [88]:
#Defining Gradient Boosting Classifier
gb = GradientBoostingClassifier(random_state = random_seed)
# Look at parameters used by our current model
print('Parameters currently in use:\n')
pprint(gb.get_params())

Parameters currently in use:

{'ccp_alpha': 0.0,
 'criterion': 'friedman_mse',
 'init': None,
 'learning_rate': 0.1,
 'loss': 'log_loss',
 'max_depth': 3,
 'max_features': None,
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_iter_no_change': None,
 'random_state': 42,
 'subsample': 1.0,
 'tol': 0.0001,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}


In [90]:
# The loss function to be optimized.
loss = ['log_loss','exponential']
# Learning rate shrinks the contribution of each tree by learning_rate
learning_rate=[0.001,0.01,0.1]
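# Number of boosting stages (trees) to fit
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
# ('auto' is equivalent to 'sqrt' for classifiers and is deprecated in recent scikit-learn releases)
max_features = ['auto', 'sqrt','log2',None]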
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Function used to measure the quality of a split
criterion =['friedman_mse','squared_error']
# Create the random grid
params_grid = {'loss': loss,
               'learning_rate':learning_rate,
               'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'criterion':criterion}
pprint(params_grid)


{'criterion': ['friedman_mse', 'squared_error'],
 'learning_rate': [0.001, 0.01, 0.1],
 'loss': ['log_loss', 'exponential'],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
 'max_features': ['auto', 'sqrt', 'log2', None],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}


If we tried all the possible combinations, we would have to train the Gradient Boosting 51840 times (2 × 3 × 10 × 4 × 12 × 3 × 3 × 2), not counting the cross-validation folds for each combination, which would require too much computational time. For this reason we start with a random search that uses only 100 random combinations among those available.

In [91]:
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
#
gb_random_search = RandomizedSearchCV(estimator = gb, param_distributions = params_grid,
                                n_iter = 100, cv = 3, verbose=3, random_state=random_seed, n_jobs = -1)
# Fit the  model
#
#gb_random_search.fit(X_train100, y_train100)

#### Store and Load Models
To avoid retraining the models every time, after each training run we save the results inside the *Models* folder using the pickle library.


In [99]:
#Saving
#pickle.dump(gb_random_search.best_params_, open('Models/Gradient_Boosting_rs_w100.pkl', 'wb'))

#Loading
gb_params_best_random=pickle.load(open('Models/Gradient_Boosting_rs_w100.pkl', 'rb'))


In [100]:
gb_params_best_random

{'n_estimators': 1600,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 20,
 'loss': 'exponential',
 'learning_rate': 0.01,
 'criterion': 'friedman_mse'}

#### Comparing base model with tuned model with Random Search
To check whether we are going in the right direction, we compare the base model, which has no hyperparameter tuning, with the tuned model, using the accuracy of both models on the validation set.

In [97]:
# Training tuned model
gb_random =  GradientBoostingClassifier(loss='exponential',max_depth=20,max_features='sqrt',
                                      min_samples_leaf=1,min_samples_split=2,n_estimators=1600,
                                      learning_rate=0.01,criterion='friedman_mse',
                                      random_state=random_seed)
gb_random.fit(X_train100,y_train100)
#Computing the tuned model accuracy                                 
gb_random_accuracy = evaluate(gb_random, X_val100, y_val100)

print('The model tuned with random search improved by {:0.2f}% relative to the base model'
                    .format( 100 * (gb_random_accuracy - gb_base_accuracy) / gb_base_accuracy))

Model Performance
Accuracy = 79.23%.
The model tuned with random search improved by 0.52% relative to the base model


#### Grid Search with Cross Validation
Random search allowed us to narrow down the range for each hyperparameter. Now that we know where to concentrate our search, we can explicitly specify every combination of settings to try. We do this with GridSearchCV, a method that, instead of sampling randomly from a distribution, evaluates all combinations we define. To use Grid Search, we make another grid based on the best values provided by random search:

In [101]:
gb_params_best_random

{'n_estimators': 1600,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'sqrt',
 'max_depth': 20,
 'loss': 'exponential',
 'learning_rate': 0.01,
 'criterion': 'friedman_mse'}

In [102]:

# Create the parameter grid based on the results of random search 
params_grid = {
    'loss': ['exponential'],
    'learning_rate': [0.01],
    'criterion':['friedman_mse'],
    'max_depth': [15,20,25],
    'max_features': ['sqrt'],
    'min_samples_leaf': [1,2,3],
    'min_samples_split': [1,2,3],  # note: scikit-learn documents integer values as >= 2
    'n_estimators': [1500,1600,1700]
}

# Instantiate the grid search model
gb_grid_search = GridSearchCV(estimator = gb, param_grid = params_grid, 
                          cv = 5, n_jobs = -1, verbose = 1)
# Fit the grid search to the data
gb_grid_search.fit(X_train100, y_train100)

Fitting 5 folds for each of 81 candidates, totalling 405 fits


In [107]:
#Saving
#pickle.dump(gb_grid_search.best_params_, open('Models/Gradient_Boosting_gs_w100.pkl', 'wb'))

#Loading
gb_params_best_grid=pickle.load(open('Models/Gradient_Boosting_gs_w100.pkl', 'rb'))


In [108]:
gb_params_best_grid

{'criterion': 'friedman_mse',
 'learning_rate': 0.01,
 'loss': 'exponential',
 'max_depth': 25,
 'max_features': 'sqrt',
 'min_samples_leaf': 3,
 'min_samples_split': 1,
 'n_estimators': 1600}

#### Comparing tuned models 
To check whether the grid search brought a further gain, we compare the model tuned with grid search against both the base model and the model tuned with random search, using the accuracy on the validation set.

In [112]:
# Training tuned model
gb_final =  GradientBoostingClassifier(loss='exponential',max_depth=25,max_features='sqrt',
                                      min_samples_leaf=3,min_samples_split=1,n_estimators=1600,
                                      learning_rate=0.01,criterion='friedman_mse',
                                      random_state=random_seed)
gb_final.fit(X_train100,y_train100)
#Computing the tuned model accuracy                                 
gb_final_accuracy = evaluate(gb_final, X_val100, y_val100)

print('The model tuned with grid search improved by {:0.2f}% relative to the base model'
                    .format( 100 * (gb_final_accuracy - gb_base_accuracy) / gb_base_accuracy))

print('The model tuned with grid search improved by {:0.2f}% relative to the model tuned with random search'
                        .format( 100 * (gb_final_accuracy - gb_random_accuracy) / gb_random_accuracy))

Model Performance
Accuracy = 79.84%.
The model tuned with grid search improved by 1.29% relative to the base model
The model tuned with grid search improved by 0.77% relative to the model tuned with random search


### Perceptron

The perceptron (or McCulloch-Pitts neuron) is an algorithm for supervised learning of binary classifiers. The scikit-learn *Perceptron* is suitable for large-scale learning. By default:
- It does not require a learning rate.
- It is not regularized (penalized).
- It updates its model only on mistakes.

source: [link](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html#sklearn.linear_model.Perceptron)
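The "updates its model only on mistakes" property can be illustrated with a few lines of NumPy. This is a minimal sketch of the classic perceptron rule, not scikit-learn's implementation; the function name `perceptron_fit`, the label encoding in {-1, +1} and the toy data are assumptions made for the example.

```python
import numpy as np


def perceptron_fit(X, y, epochs=10):
    """Classic perceptron rule: labels are -1 or +1 and the weights
    change only when a sample is misclassified."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            # A sample is a mistake when it lies on the wrong side of (or on) the boundary
            if yi * (xi @ w + b) <= 0:
                w += yi * xi
                b += yi
                mistakes += 1
        # Stop as soon as a full pass makes no mistake
        if mistakes == 0:
            break
    return w, b


# Tiny linearly separable toy problem
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w, b = perceptron_fit(X, y)
pred = np.sign(X @ w + b)
print(w, b, pred)
```

Correctly classified samples leave `w` and `b` untouched, which is why no separate learning-rate schedule is needed in this basic form.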

In [13]:
perc_model = Perceptron(
    penalty = 'l2', # The penalty (aka regularization term) to be used.
    alpha = 0.0001, # Constant that multiplies the regularization term if regularization is used.
    fit_intercept=False, # Whether the intercept should be estimated or not. If False, the data is assumed to be already centered.
    shuffle= True, # Whether or not the training data should be shuffled after each epoch.
    verbose = 1, # The verbosity level.
    random_state = random_seed, # Used to shuffle the training data, when shuffle is set to True. Pass an int for reproducible output across multiple function calls.
)

print('Parameters currently in use:\n')
pprint(perc_model.get_params())

Parameters currently in use:

{'alpha': 0.0001,
 'class_weight': None,
 'early_stopping': False,
 'eta0': 1.0,
 'fit_intercept': False,
 'l1_ratio': 0.15,
 'max_iter': 1000,
 'n_iter_no_change': 5,
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': 42,
 'shuffle': True,
 'tol': 0.001,
 'validation_fraction': 0.1,
 'verbose': 1,
 'warm_start': False}


In [15]:
perc_model.fit(X = X_train100, y = y_train100)

-- Epoch 1


AttributeError: 'sklearn.linear_model._sgd_fast._memoryviewslice' object has no attribute 'nonzero'

### Support Vector Machines (SVM)

## Model Evaluation

| Models              | Test Accuracy | Test Recall | Test Precision | F1 Score |
|---------------------|---------------|-------------|----------------|----------|
| Logistic Regression |               |             |                |          |
| Random Forest       |               |             |                |          |
| K-NN                |               |             |                |          |
| Orazio              |               |             |                |          |