# Model-building phase: supervised approaches

In [1]:
import pandas as pd
import numpy as np
import random
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from prep import *

## Data preprocessing

To preprocess the data we use the *prep* function, which handles two tasks at once. First, it deals with missing values: we can either remove every observation that contains them, or remove only part of those observations and fill the missing entries of the remainder with the mean or median of the corresponding variable. Second, it scales the data with any of the scalers available in scikit-learn; MinMaxScaler is the default. The function takes a pandas DataFrame as input and returns a numpy ndarray containing the cleaned data.
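The behaviour described above can be sketched as follows. This is only an illustration under stated assumptions, not the actual *prep* implementation: the name `prep_sketch`, its `seed` argument and the choice of dropping a random subset of the incomplete rows are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def prep_sketch(df, perc=100, fill_method="mean", scaler=None, seed=13):
    """Hypothetical stand-in for prep: drop `perc` percent of the rows that
    contain missing values, impute the rest, then scale every column."""
    scaler = MinMaxScaler() if scaler is None else scaler

    # Index labels of the rows that contain at least one missing value
    incomplete = df.index[df.isna().any(axis=1)].to_numpy()

    # Drop a random subset of those rows (assumption: the subset is random)
    n_drop = int(round(len(incomplete) * perc / 100))
    rng = np.random.default_rng(seed)
    to_drop = rng.permutation(incomplete)[:n_drop]
    kept = df.drop(index=to_drop)

    # Fill the remaining gaps with the column mean or median
    fill_values = kept.mean() if fill_method == "mean" else kept.median()
    kept = kept.fillna(fill_values)

    # Scalers return a numpy ndarray
    return scaler.fit_transform(kept)


toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10.0, 20.0, np.nan, 40.0],
    "y": [0, 1, 0, 1],
})
print(prep_sketch(toy, perc=100).shape)  # only the complete rows remain
print(prep_sketch(toy, perc=0).shape)    # every row kept, gaps imputed
```

Note that this sketch fits the scaler on the whole table, label column included, which mirrors the 10-column arrays shown in this notebook.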

Our idea is to generate datasets that differ in how the missing values are treated: the first drops every observation with at least one missing value, the second drops only 50 percent of those observations and imputes the rest (the code below also builds a third variant with `perc=0`). We will then train each model on these datasets and collect their metrics, in order to assess whether, on average, replacing missing values instead of discarding the observations brings any benefit.

In [6]:
random.seed(13)
water = pd.read_csv('dataset/drinking_water_potability.csv')
water0 = prep(
    df=water,
    axis='obs',
    perc=100,
    fill_method='mean',
    scaler=preprocessing.MinMaxScaler()
)
water50 = prep(
    df=water,
    axis='obs',
    perc=50,
    fill_method='mean',
    scaler=preprocessing.MinMaxScaler()
)
water100 = prep(
    df=water,
    axis='obs',
    perc=0,
    fill_method='mean',
    scaler=preprocessing.MinMaxScaler()
)
print('original dataset size: ', water.shape, '- type: ', type(water))
print('cleaned dataset with all of missing values removed: ', np.shape(water0), '- type: ', type(water0))
print('cleaned dataset with 50% of missing values removed: ', np.shape(water50), '- type: ', type(water50))
print('cleaned dataset with 100% of missing values removed: ', np.shape(water100), '- type: ', type(water100))

original dataset size:  (3276, 10) - type:  <class 'pandas.core.frame.DataFrame'>
cleaned dataset with all of missing values removed:  (2011, 10) - type:  <class 'numpy.ndarray'>
cleaned dataset with 50% of missing values removed:  (2515, 10) - type:  <class 'numpy.ndarray'>
cleaned dataset with 100% of missing values removed:  (2993, 10) - type:  <class 'numpy.ndarray'>


At this point we split each dataset into a training set, a validation set and a test set. To do this we call *splitting_func*, which relies on the *train_test_split()* function of scikit-learn.

In [48]:
X_train, y_train, X_val, X_test, y_val, y_test = splitting_func(water0)

BEFORE SPLITTING: 

X_water0 shape:  (2011, 8)
y_water0 shape:  (2011,)

AFTER SPLITTING: 
X_train0 shape:  (1206, 8)
X_val0 shape:  (402, 8)
X_test0 shape:  (403, 8)
y_train0 shape:  (1206,)
y_val0 shape:  (402,)
y_test0 shape:  (403,)


In [49]:
X_train, y_train, X_val, X_test, y_val, y_test = splitting_func(water50)

BEFORE SPLITTING: 

X_water0 shape:  (2515, 8)
y_water0 shape:  (2515,)

AFTER SPLITTING: 
X_train0 shape:  (1509, 8)
X_val0 shape:  (503, 8)
X_test0 shape:  (503, 8)
y_train0 shape:  (1509,)
y_val0 shape:  (503,)
y_test0 shape:  (503,)


In [50]:
X_train, y_train, X_val, X_test, y_val, y_test = splitting_func(water100)

BEFORE SPLITTING: 

X_water0 shape:  (2993, 8)
y_water0 shape:  (2993,)

AFTER SPLITTING: 
X_train0 shape:  (1795, 8)
X_val0 shape:  (599, 8)
X_test0 shape:  (599, 8)
y_train0 shape:  (1795,)
y_val0 shape:  (599,)
y_test0 shape:  (599,)
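The 60/20/20 shapes printed above are consistent with two chained calls to `train_test_split()`. The sketch below shows one way to obtain them; `split_sketch`, its arguments and the seed are hypothetical and are not taken from *splitting_func*, and the column choice (first 8 columns as features, column 9 as label) is an assumption based on the slicing used later in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_sketch(data, n_features=8, label_col=9, seed=42):
    """Hypothetical 60/20/20 split of a 2-D array into train/validation/test."""
    X, y = data[:, :n_features], data[:, label_col]

    # First cut: 60% for training, 40% set aside
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.4, random_state=seed)

    # Second cut: the held-out 40% is halved into validation and test
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=seed)

    return X_train, X_val, X_test, y_train, y_val, y_test


# Random data with the same shape as the fully cleaned dataset
demo = np.random.default_rng(0).random((2011, 10))
for part in split_sketch(demo):
    print(part.shape)
```

With 2011 rows this yields 1206, 402 and 403 observations, the same sizes reported for the first dataset.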


# Training Supervised Models
The supervised models we chose are the following:

| Models              | Test Accuracy | Test Recall | Test Precision | F1 Score |
|---------------------|---------------|-------------|----------------|----------|
| Logistic Regression |               |             |                |          |
| Random Forest       |               |             |                |          |
| K-NN                |               |             |                |          |
| Orazio              |               |             |                |          |
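The four metric columns of the table can all be computed with functions from `sklearn.metrics`. The helper below is a minimal sketch; `evaluate_all` is a hypothetical name, and it assumes a binary label where 1 marks the positive class.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)


def evaluate_all(y_true, y_pred):
    """Return the four metrics of the table as a dictionary."""
    return {
        'accuracy': accuracy_score(y_true, y_pred),
        'recall': recall_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'f1': f1_score(y_true, y_pred),
    }


# Toy labels: 2 true positives, 1 false negative, 1 false positive, 2 true negatives
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
for name, value in evaluate_all(y_true, y_pred).items():
    print('{}: {:0.3f}'.format(name, value))
```

In this toy example all four metrics equal 2/3, which makes it easy to check the helper by hand.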


## Random Forest

In [2]:
# Random state shared by every model, for reproducibility
random_state = 42

In [3]:
from sklearn.ensemble import RandomForestClassifier
from pprint import pprint

rf = RandomForestClassifier(random_state = random_state)

# Look at parameters used by our current forest
print('Parameters currently in use:\n')
pprint(rf.get_params())


Parameters currently in use:

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}



As we can see, there are many hyperparameters that we could tune, but for the moment we will focus only on the most important ones.
[Info about RandomForest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)

### Random Search with Cross Validation

In [70]:
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
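# Number of features to consider at every split
# ('auto' is an alias of 'sqrt' for classifiers; it was deprecated in scikit-learn 1.1
#  and removed in 1.3, so it has to be dropped from this list on newer releases)
max_features = ['auto', 'sqrt', 'log2', None]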
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Criterion used to measure the quality of a split
# ('entropy' and 'log_loss' both correspond to the Shannon information gain)
criterion = ['gini', 'entropy', 'log_loss']
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap,
               'criterion':criterion}
pprint(random_grid)


{'bootstrap': [True, False],
 'criterion': ['gini', 'entropy', 'log_loss'],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
 'max_features': ['auto', 'sqrt', 'log2', None],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}


Trying every possible combination would mean training the random forest 25920 times, and that is before multiplying by the number of cross-validation folds used for each combination. That is far too expensive, so we start with a random search.
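The figure of 25920 can be checked by counting the points of the grid with `ParameterGrid` from scikit-learn. The snippet below rebuilds the same grid so that it runs on its own; the variable names `grid_to_count`, `n_combinations`, `n_fits_full` and `n_fits_random` are only illustrative.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same values as the random grid defined above
grid_to_count = {
    'n_estimators': [int(x) for x in np.linspace(start=200, stop=2000, num=10)],
    'max_features': ['auto', 'sqrt', 'log2', None],
    'max_depth': [int(x) for x in np.linspace(10, 110, num=11)] + [None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
    'criterion': ['gini', 'entropy', 'log_loss'],
}

# ParameterGrid only enumerates combinations, it does not train anything
n_combinations = len(ParameterGrid(grid_to_count))
n_fits_full = n_combinations * 5   # exhaustive search with 5-fold CV
n_fits_random = 100 * 5            # random search: 100 candidates, 5 folds

print(n_combinations)   # 25920
print(n_fits_full)      # 129600
print(n_fits_random)    # 500
```

The random search therefore needs 500 fits instead of 129600, at the price of exploring only a sample of the grid.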

In [7]:
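# 70% of the rows for training; the remaining 30% is held out and later split into validation and test sets
# Features are columns 0-7 and the label is column 9
X_train, X_test, y_train, y_test = train_test_split(water100[:, 0:8], water100[:, 9], test_size=0.3, random_state=random_state)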

In [72]:
# Use the random grid to search for the best hyperparameters
# First create the base model to tune
rf = RandomForestClassifier(random_state=random_state)
# Random search of parameters, using 5-fold cross validation,
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, n_iter=100, cv=5, verbose=2, random_state=random_state, n_jobs=-1)
# Fit the random search model (commented out to avoid re-running the search;
# the best parameters it found are stored in the next cell)
# rf_random.fit(X_train, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


In [98]:
# Best parameters found by the random search (output of rf_random.best_params_)
bestparameterRandomForest = {'bootstrap': True,
 'criterion': 'gini',
 'max_depth': 80,
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 600}
# pprint(rf_random.best_params_)
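For reference, the snippet below shows how the best parameters, the best cross-validated score and the refitted model are read back from a fitted `RandomizedSearchCV`. It is a small self-contained sketch on synthetic data: the dataset, the reduced grid `demo_grid` and the small `n_iter` and `cv` values are assumptions chosen only to keep the run short, and they are unrelated to the results reported in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary classification problem with 8 features
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Deliberately tiny grid so that the search finishes quickly
demo_grid = {
    'n_estimators': [10, 20, 30],
    'max_depth': [3, 5, None],
    'min_samples_split': [2, 5],
}

demo_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=demo_grid,
    n_iter=4,          # sample 4 of the 18 possible combinations
    cv=3,              # 3-fold cross validation
    random_state=0,
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)       # dictionary with the winning combination
print(demo_search.best_score_)        # mean cross-validated accuracy of that combination
best_model = demo_search.best_estimator_  # model refitted on all of X_demo
```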

In [89]:
from sklearn.metrics import accuracy_score

def evaluate(model, test_features, test_labels):
    """Print and return the accuracy of `model` on the given data."""
    predictions = model.predict(test_features)
    accuracy = accuracy_score(test_labels, predictions)
    print('Model Performance')
    # accuracy_score returns a fraction in [0, 1], so no percent sign is printed
    print('Accuracy = {:0.2f}.'.format(accuracy))

    return accuracy


In [8]:
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=random_state) 

In [96]:
base_model = RandomForestClassifier(n_estimators=10, random_state=random_state)
base_model.fit(X_train, y_train)
base_accuracy = evaluate(base_model, X_val, y_val)

# best_random = rf_random.best_estimator_
# The random search is not re-run here, so the model is rebuilt from the best parameters it found
best_random = RandomForestClassifier(bootstrap=True, criterion='gini', max_depth=80, max_features='sqrt',
                                     min_samples_leaf=1, min_samples_split=2, n_estimators=600,
                                     random_state=random_state)
best_random.fit(X_train, y_train)
random_accuracy = evaluate(best_random, X_val, y_val)

print('Improvement of {:0.2f}%.'.format( 100 * (random_accuracy - base_accuracy) / base_accuracy))


Model Performance
Accuracy = 0.66%.
Model Performance
Accuracy = 0.69%.
Improvement of 4.91%.


### Grid Search with Cross Validation
Random search allowed us to narrow down the range for each hyperparameter. Now that we know where to concentrate our search, we can explicitly specify every combination of settings to try. We do this with GridSearchCV, a method that, instead of sampling randomly from a distribution, evaluates all combinations we define. To use Grid Search, we make another grid based on the best values provided by random search:

In [101]:
pprint(bestparameterRandomForest)


{'bootstrap': True,
 'criterion': 'gini',
 'max_depth': 80,
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 600}
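One way to derive the narrower grid from these values is to take a small neighbourhood around each numeric parameter while respecting the lower bounds that scikit-learn enforces. The helper `around` and the step sizes below are hypothetical choices for illustration, not the procedure used in this notebook.

```python
def around(value, step, minimum):
    """Return [value - step, value, value + step] without entries below `minimum`."""
    return [v for v in (value - step, value, value + step) if v >= minimum]


best = {'bootstrap': True,
        'criterion': 'gini',
        'max_depth': 80,
        'max_features': 'sqrt',
        'min_samples_leaf': 1,
        'min_samples_split': 2,
        'n_estimators': 600}

param_grid_sketch = {
    # Categorical choices are frozen at the value picked by the random search
    'bootstrap': [best['bootstrap']],
    'criterion': [best['criterion']],
    'max_features': [best['max_features']],
    # Numeric choices are explored around the best value
    'max_depth': around(best['max_depth'], step=10, minimum=1),
    'min_samples_leaf': around(best['min_samples_leaf'], step=1, minimum=1),
    # An integer min_samples_split must be at least 2
    'min_samples_split': around(best['min_samples_split'], step=1, minimum=2),
    'n_estimators': around(best['n_estimators'], step=50, minimum=1),
}

for name, values in param_grid_sketch.items():
    print(name, values)
```

Clipping at the lower bound keeps invalid candidates, such as an integer `min_samples_split` of 1, out of the grid.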


In [9]:
from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on the results of random search
param_grid = {
    'bootstrap': [True],
    'criterion': ['gini'],
    'max_depth': [70, 80, 90],
    'max_features': ['sqrt'],
    'min_samples_leaf': [1, 2, 3],
    # min_samples_split must be an integer >= 2 (or a float in (0, 1]), so the grid starts at 2
    'min_samples_split': [2, 3, 4],
    'n_estimators': [550, 600, 650]
}
# Create a base model
rf = RandomForestClassifier(random_state=random_state)
# Instantiate the grid search model
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=5, n_jobs=-1, verbose=1)

In [10]:
# Fit the grid search to the data
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 81 candidates, totalling 405 fits


In [13]:
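# Best parameters recorded from the grid search.
# Caution: scikit-learn only accepts integer min_samples_split values >= 2,
# so the value 1 below needs to be re-checked before this dictionary is reused.
bestgridparameter = {'bootstrap': True,
 'criterion': 'gini',
 'max_depth': 90,
 'max_features': 'sqrt',
 'min_samples_leaf': 3,
 'min_samples_split': 1,
 'n_estimators': 600}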

In [None]:
# Evaluate the best estimator found by the grid search on the validation set
pprint(grid_search.best_params_)
best_grid = grid_search.best_estimator_
grid_accuracy = evaluate(best_grid, X_val, y_val)
print('Improvement of {:0.2f}%.'.format(100 * (grid_accuracy - base_accuracy) / base_accuracy))