# Model-building phase
- [Data Preprocessing](#Data-Preprocessing)
- [Supervised Models](#Supervised-Models)
    - [Random Forest](#Random-Forest)
    - [Perceptron](#Perceptron)
    - [SVM](#Support-Vector-Machines-(SVM))


In [1]:
import pandas as pd
import numpy as np
import random
from prep import *
from pprint import pprint
import pickle

from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.linear_model import Perceptron
from sklearn.ensemble import RandomForestClassifier

## Data Preprocessing 

To preprocess the data we use the *prep* function, which handles two tasks at once. First, it deals with missing values: we can remove all the affected observations, or remove only part of them and fill the remaining gaps with the mean or median of the corresponding variable. Second, it scales the data with any of the scalers in scikit-learn (MinMaxScaler by default). The function takes a pandas DataFrame as input and returns a numpy ndarray containing the cleaned data.
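The behaviour described above can be sketched as follows. This is only an illustration under assumptions: `prep_sketch`, its `seed` argument and the toy DataFrame are hypothetical, and the real *prep* may choose the rows to drop, handle the target column or use its `axis` argument differently.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def prep_sketch(data, target, perc=100, fill_method="mean", scaler=None, seed=0):
    """Hypothetical stand-in for prep: drop `perc`% of the incomplete rows,
    impute the remaining gaps, scale the features, return an ndarray."""
    scaler = MinMaxScaler() if scaler is None else scaler

    # Rows with at least one missing value
    incomplete = data.index[data.isna().any(axis=1)].to_numpy()
    n_drop = int(round(len(incomplete) * perc / 100))

    # Randomly pick which incomplete rows to remove (an assumption)
    rng = np.random.default_rng(seed)
    to_drop = rng.choice(incomplete, size=n_drop, replace=False)
    kept = data.drop(index=to_drop)

    # Impute what is left with the column mean or median
    features = kept.drop(columns=target)
    fill_values = features.mean() if fill_method == "mean" else features.median()
    features = features.fillna(fill_values)

    # Scale the features only and append the target as the last column
    scaled = scaler.fit_transform(features)
    return np.column_stack([scaled, kept[target].to_numpy()])


toy = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5, 8.1, np.nan, 7.4],
    "Hardness": [200.0, 180.0, np.nan, 210.0, 190.0, np.nan],
    "Potability": [0, 1, 0, 1, 1, 0],
})
full = prep_sketch(toy, "Potability", perc=0)    # nothing removed, all imputed
half = prep_sketch(toy, "Potability", perc=50)   # half of the incomplete rows removed
print(full.shape, half.shape)
```

In this toy example four rows are incomplete, so `perc=50` removes two of them and imputes the other two.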

Our idea is to generate two datasets: the first by keeping every observation and imputing all the missing values, the second by removing 50 percent of the observations that have at least one missing value and imputing the rest. We will then train each model on both datasets and collect their metrics, in order to assess whether, on average, removing part of the incomplete observations instead of imputing them brings any benefit.

### Scaling Data
To see whether performance differs between models trained on a dataset where all the NaN values are imputed and models trained on a dataset where only part of them are imputed, we created two datasets:
- one where we imputed all the NaN values (Water 100 dataset)
- one where we imputed only half of the NaN values and deleted the remaining incomplete observations (Water 50 dataset)

In [3]:
# Load the raw data and build the two preprocessed datasets
water = pd.read_csv('dataset/drinking_water_potability.csv')
water50 = prep(
    data = water,
    target='Potability',
    axis='obs',
    perc=50,
    fill_method='mean',
    scaler= StandardScaler()
)
water100 = prep(
    data = water,
    target='Potability',
    axis='obs',
    perc=0,
    fill_method='mean',
    scaler= StandardScaler()
)
print('original dataset size: ', water.shape, '- type: ', type(water))
print('cleaned dataset with 50% of missing values removed: ', np.shape(water50), '- type: ', type(water50))
print('cleaned dataset with no missing values removed: ', np.shape(water100), '- type: ', type(water100))

original dataset size:  (3276, 10) - type:  <class 'pandas.core.frame.DataFrame'>
cleaned dataset with 50% of missing values removed:  (2644, 10) - type:  <class 'numpy.ndarray'>
cleaned dataset with no missing values removed:  (3276, 10) - type:  <class 'numpy.ndarray'>


### Splitting Data
At this point we divide each dataset into a training set, a validation set and a test set. To do this, we call the *split()* helper imported from *prep*, which relies on the *train_test_split()* function of scikit-learn.
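A three-way split can be obtained by calling *train_test_split()* twice, as in the sketch below. The name `split_sketch` and the choice of dividing the held-out part equally between validation and test are assumptions made for illustration; the helper used in this notebook may partition the data differently.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_sketch(data, target_index, perc_train=0.7, random_seed=42):
    """Hypothetical three-way split built on two train_test_split calls."""
    X = np.delete(data, target_index, axis=1)
    y = data[:, target_index]

    # First call: training set versus everything else
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=perc_train, random_state=random_seed
    )
    # Second call: divide the remainder equally into validation and test
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, random_state=random_seed
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic array with the same layout as the notebook data (target in column 9)
toy = np.random.default_rng(0).normal(size=(100, 10))
X_train, X_val, X_test, y_train, y_val, y_test = split_sketch(toy, target_index=9)
print(X_train.shape, X_val.shape, X_test.shape)
```

Reusing the same `random_seed` in both calls keeps the whole partition reproducible.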

In [5]:
# Fixed seed to make the experiments reproducible
random_seed = 42

In [5]:
#Water 50
X_train50, X_val50, X_test50, y_train50, y_val50, y_test50=split(df = water50,
                                                    target_index = 9,
                                                    validation = True,
                                                    perc_train = 0.7,
                                                    random_seed = random_seed,
                                                    verbose=True
                                                    )

BEFORE SPLITTING: 

X shape:  (2644, 9)
y shape:  (2644,)

AFTER SPLITTING: 
X_train shape:  (1850, 9)
y_train shape:  (1850,)
X_test shape:  (794, 9)
y_test shape:  (794,)
X_val shape:  (397, 9)
y_val shape:  (397,)


In [6]:
#Water 100
X_train100, X_val100, X_test100, y_train100, y_val100, y_test100=split(df = water100,
                                                    target_index = 9,
                                                    perc_train = 0.7,
                                                    random_seed = random_seed,
                                                    verbose=True)

BEFORE SPLITTING: 

X shape:  (3276, 9)
y shape:  (3276,)

AFTER SPLITTING: 
X_train shape:  (2293, 9)
y_train shape:  (2293,)
X_test shape:  (983, 9)
y_test shape:  (983,)
X_val shape:  (491, 9)
y_val shape:  (491,)


## Supervised Models
The supervised models that we chose are the following:
- **Random Forest**
- **Perceptron**
- **Support Vector Machines (SVM)**


### Random Forest
In this part of the report we train the Random Forest on Water 100, the dataset with all the NaN values imputed.

In [9]:
# Defining the Random Forest classifier
rf = RandomForestClassifier(random_state = random_seed)
# Look at parameters used by our current forest
print('Parameters currently in use:\n')
pprint(rf.get_params())

Parameters currently in use:

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}



As we can see, there are many hyperparameters that we can tune, but for the moment we will focus only on the most important ones.

[Info about RandomForest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)

#### Random Search with Cross Validation
First, we look for the best hyperparameter configuration with a random search that samples 100 random combinations out of the thousands available. After this first step, we refine the result with a grid search around the best combination found by the random search.

In [10]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
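# Number of features to consider at every split
# ('auto' is deprecated since scikit-learn 1.1 and removed in 1.3; for classifiers it was the same as 'sqrt')
max_features = ['auto', 'sqrt','log2',None]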
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Metric used to measure the quality of a split
criterion =['gini', 'entropy', 'log_loss']
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap,
               'criterion':criterion}
pprint(random_grid)


{'bootstrap': [True, False],
 'criterion': ['gini', 'entropy', 'log_loss'],
 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
 'max_features': ['auto', 'sqrt', 'log2', None],
 'min_samples_leaf': [1, 2, 4],
 'min_samples_split': [2, 5, 10],
 'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}


If we tried every possible combination, we would have to train the random forest 25920 times, and that is before counting the cross-validation folds for each combination. This would take far too much computation time, so we start with a random search over only 100 random combinations among those available.
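The figure of 25920 is simply the product of the number of candidate values for each hyperparameter. A short check, rebuilding the same grid as above:

```python
import math

import numpy as np

random_grid = {
    "n_estimators": [int(x) for x in np.linspace(start=200, stop=2000, num=10)],
    "max_features": ["auto", "sqrt", "log2", None],
    "max_depth": [int(x) for x in np.linspace(10, 110, num=11)] + [None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy", "log_loss"],
}

# 10 * 4 * 12 * 3 * 3 * 2 * 3 combinations in the full grid
n_combinations = math.prod(len(values) for values in random_grid.values())
print(n_combinations)       # 25920
print(n_combinations * 3)   # 77760 fits with 3-fold cross validation
```

By comparison, the random search with 100 candidates and 3 folds needs only 300 fits.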

In [11]:
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
#
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid,
                                n_iter = 100, cv = 3, verbose=3, random_state=random_seed, n_jobs = -1)
# Fit the  model
#
#rf_random.fit(X_train100, y_train100)

Fitting 3 folds for each of 100 candidates, totalling 300 fits




#### Store and Load Models
To avoid retraining the models every time, after each training run we save the results in the *Models* folder using the pickle library.
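A minimal sketch of this save-and-load pattern is shown below. The helper names `save_model` and `load_model` are hypothetical, and the example stores a plain dictionary in a temporary folder instead of a fitted search object. Using `with` blocks makes sure the file is closed once the operation finishes.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialize any picklable object, creating the folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load an object previously written by save_model."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


# Round trip with a stand-in object inside a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "Models" / "example.pkl"
    save_model({"n_estimators": 1400}, target)
    restored = load_model(target)

print(restored)
```

Pickle files should only be loaded from trusted sources, and a pickled estimator is generally expected to be reloaded with the same scikit-learn version that produced it.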


In [19]:
#Saving
#pickle.dump(rf_random, open('Models/Random_Forest_rs_w100.pkl', 'wb'))
#Loading
rf_random=pickle.load(open('Models/Random_Forest_rs_w100.pkl', 'rb'))

In [44]:
rf_random.best_params_

{'n_estimators': 1400,
 'min_samples_split': 10,
 'min_samples_leaf': 1,
 'max_features': 'auto',
 'max_depth': 10,
 'criterion': 'log_loss',
 'bootstrap': True}

#### Comparing base model with tuned model with Random Search
To check whether we are going in the right direction, we compare the base model (no hyperparameter tuning) with the model tuned by random search, using the accuracy of both models on the validation set.

In [13]:

def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    accuracy = accuracy_score(test_labels, predictions)
    print('Model Performance')
    print('Accuracy = {:0.2f}%.'.format(accuracy*100))
    return accuracy


In [41]:
# Training base model
base_model = RandomForestClassifier(n_estimators = 10, random_state = random_seed)
base_model.fit(X_train100, y_train100)
#Computing the base model accuracy
base_accuracy = evaluate(base_model, X_val100, y_val100)

# Training tuned model
best_random =  RandomForestClassifier(bootstrap = True,criterion='log_loss',max_depth=10,max_features='auto',
                                      min_samples_leaf=1,min_samples_split=10,n_estimators=1400,
                                      random_state=random_seed)
best_random.fit(X_train100,y_train100)
#Computing the tuned model accuracy                                 
random_accuracy = evaluate(best_random, X_val100, y_val100)

print('Improvement of {:0.2f}%.'.format( 100 * (random_accuracy - base_accuracy) / base_accuracy))


Model Performance
Accuracy = 79.63%.




Model Performance
Accuracy = 80.45%.
Improvement of 1.02%.


#### Grid Search with Cross Validation
Random search allowed us to narrow down the range for each hyperparameter. Now that we know where to concentrate our search, we can explicitly specify every combination of settings to try. We do this with GridSearchCV, a method that, instead of sampling randomly from a distribution, evaluates all combinations we define. To use Grid Search, we make another grid based on the best values provided by random search:
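The exhaustive behaviour of GridSearchCV can be seen on a small synthetic problem. The sketch below uses made-up data from `make_classification` and a deliberately tiny grid so that it runs in a few seconds; the values are illustrative and unrelated to the water dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=120, n_features=9, random_state=42)

demo_grid = {
    "max_depth": [2, 4],
    "min_samples_leaf": [1, 2],
}

demo_search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=10, random_state=42),
    param_grid=demo_grid,
    cv=3,
)
demo_search.fit(X_demo, y_demo)

# Every one of the 2 * 2 combinations has been evaluated
print(len(demo_search.cv_results_["params"]))
print(demo_search.best_params_)
```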

In [45]:
rf_random.best_params_

{'n_estimators': 1400,
 'min_samples_split': 10,
 'min_samples_leaf': 1,
 'max_features': 'auto',
 'max_depth': 10,
 'criterion': 'log_loss',
 'bootstrap': True}

In [33]:

# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [True],
    'criterion':['log_loss'],
    'max_depth': [5,10,15],
    'max_features': ['auto'],
    'min_samples_leaf': [1,2,3],
    'min_samples_split': [5,10,15],
    'n_estimators': [1300,1400,1500]
}
# Instantiate the grid search model
rf_grid = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 5, n_jobs = -1, verbose = 1)
# Fit the grid search to the data
#rf_grid.fit(X_train100, y_train100)

Fitting 5 folds for each of 81 candidates, totalling 405 fits




In [36]:
#Saving
#pickle.dump(rf_grid, open('Models/Random_Forest_gs_w100.pkl', 'wb'))
#Loading
rf_grid = pickle.load(open('Models/Random_Forest_gs_w100.pkl', 'rb'))

In [38]:
rf_grid.best_params_

{'bootstrap': True,
 'criterion': 'log_loss',
 'max_depth': 10,
 'max_features': 'auto',
 'min_samples_leaf': 2,
 'min_samples_split': 15,
 'n_estimators': 1500}

#### Comparing base model with tuned model with Grid Search
To check whether we are going in the right direction, we compare the model tuned by grid search with both the base model (no hyperparameter tuning) and the model tuned by random search, using the accuracy on the validation set.

In [43]:
# Training base model
base_model = RandomForestClassifier(n_estimators = 10, random_state = random_seed)
base_model.fit(X_train100, y_train100)
#Computing the base model accuracy
base_accuracy = evaluate(base_model, X_val100, y_val100)

# Training tuned model
best_grid =  RandomForestClassifier(bootstrap = True,criterion='log_loss',max_depth=10,max_features='auto',
                                      min_samples_leaf=2,min_samples_split=15,n_estimators=1500,
                                      random_state=random_seed)
best_grid.fit(X_train100,y_train100)
#Computing the tuned model accuracy                                 
grid_accuracy = evaluate(best_grid, X_val100, y_val100)

print('Improvement of {:0.2f}%. tuned model with grid search respect base model'.format( 100 * (grid_accuracy - base_accuracy) / base_accuracy))
print('Improvement of {:0.2f}%. tuned model with grid search respect tuned model with random search'.format( 100 * (grid_accuracy - random_accuracy) / random_accuracy))


Model Performance
Accuracy = 79.63%.




Model Performance
Accuracy = 80.86%.
Improvement of 1.53%. tuned model with grid search respect base model
Improvement of 0.51%. tuned model with grid search respect tuned model with random search


### Perceptron

The perceptron (or McCulloch-Pitts neuron) is an algorithm for supervised learning of binary classifiers. The *perceptron* is suitable for large scale learning. By default:
- It does not require a learning rate.
- It is not regularized (penalized).
- It updates its model only on mistakes.

source: [link](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html#sklearn.linear_model.Perceptron)
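The "updates only on mistakes" behaviour can be illustrated with a small NumPy sketch of the classic perceptron rule. This is not the scikit-learn implementation: the function `train_perceptron`, the toy points and the -1/+1 label encoding are assumptions made for illustration.

```python
import numpy as np


def train_perceptron(X, y, max_epochs=100):
    """Classic perceptron rule with labels in {-1, +1} and unit step size."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            # The weights change only when the sample is misclassified
            if y_i * (x_i @ w + b) <= 0:
                w += y_i * x_i
                b += y_i
                mistakes += 1
        if mistakes == 0:  # a full pass without errors: stop
            break
    return w, b


# Linearly separable toy data
X_toy = np.array([[2.0, 1.0], [1.0, 3.0], [3.0, 3.0],
                  [-1.0, -2.0], [-2.0, -1.0], [-3.0, -3.0]])
y_toy = np.array([1, 1, 1, -1, -1, -1])

w, b = train_perceptron(X_toy, y_toy)
predictions = np.where(X_toy @ w + b > 0, 1, -1)
print(w, b, predictions)
```

On linearly separable data such as this toy set the loop stops as soon as one full pass produces no mistakes.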

In [7]:
perc_model = Perceptron(
    verbose = 1,
    random_state = random_seed,
)

print('Parameters currently in use:\n')
pprint(perc_model.get_params())

Parameters currently in use:

{'alpha': 0.0001,
 'class_weight': None,
 'early_stopping': False,
 'eta0': 1.0,
 'fit_intercept': True,
 'l1_ratio': 0.15,
 'max_iter': 1000,
 'n_iter_no_change': 5,
 'n_jobs': None,
 'penalty': None,
 'random_state': 42,
 'shuffle': True,
 'tol': 0.001,
 'validation_fraction': 0.1,
 'verbose': 1,
 'warm_start': False}


In [8]:
perc_model.fit(X = X_train100, y = y_train100)

-- Epoch 1


AttributeError: 'sklearn.linear_model._sgd_fast._memoryviewslice' object has no attribute 'nonzero'

### Support Vector Machines (SVM)

In [16]:
svmodel = SVC(
    random_state= random_seed
)

print('Parameters currently in use:\n')
pprint(svmodel.get_params())

Parameters currently in use:

{'C': 1.0,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': False,
 'random_state': 42,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}


In [17]:
svmodel.fit(X = X_train100, y = y_train100)
evaluate(svmodel, X_val100, y_val100)

Model Performance
Accuracy = 68.64%.


0.6863543788187373

#### Random Search w/ Cross Validation

In [18]:
param_grid = {
    'C': [0.1, 1, 10, 100, 1000],
    'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
    'kernel': ['linear', 'poly', 'rbf']
    }

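# RandomizedSearchCV (not GridSearchCV) is the class that accepts
# param_distributions, n_iter and random_state. The grid above has only
# 5 * 5 * 3 = 75 combinations, so all of them end up being evaluated.
svc_final = RandomizedSearchCV(
    estimator = SVC(),
    param_distributions = param_grid,
    n_iter = 100,
    cv = 3, 
    verbose = 3, 
    random_state = random_seed, 
    n_jobs = -1
)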

svc_final.fit(X_train100, y_train100)

Fitting 3 folds for each of 75 candidates, totalling 225 fits




[CV 1/3] END .....C=0.1, gamma=1, kernel=linear;, score=0.603 total time=   0.1s
[CV 2/3] END .....C=0.1, gamma=1, kernel=linear;, score=0.602 total time=   0.1s
[CV 3/3] END .....C=0.1, gamma=1, kernel=linear;, score=0.602 total time=   0.1s
[CV 1/3] END ...C=0.1, gamma=0.1, kernel=linear;, score=0.603 total time=   0.2s
[CV 2/3] END ...C=0.1, gamma=0.1, kernel=linear;, score=0.602 total time=   0.1s
[CV 2/3] END ........C=0.1, gamma=1, kernel=rbf;, score=0.602 total time=   0.3s
[CV 1/3] END ........C=0.1, gamma=1, kernel=rbf;, score=0.603 total time=   0.3s
[CV 3/3] END ........C=0.1, gamma=1, kernel=rbf;, score=0.602 total time=   0.3s
[CV 3/3] END ...C=0.1, gamma=0.1, kernel=linear;, score=0.602 total time=   0.1s
[CV 1/3] END .....C=0.1, gamma=0.1, kernel=poly;, score=0.607 total time=   0.2s
[CV 2/3] END .....C=0.1, gamma=0.1, kernel=poly;, score=0.602 total time=   0.1s
[CV 3/3] END .....C=0.1, gamma=0.1, kernel=poly;, score=0.607 total time=   0.1s
[CV 1/3] END ..C=0.1, gamma=

In [19]:
svc_final.best_params_

{'kernel': 'rbf', 'gamma': 0.1, 'C': 1}

In [21]:
# Save model
pickle.dump(svc_final, open('Models/svc_final.pkl', 'wb'))

## Model Evaluation

| Models              | Test Accuracy | Test Recall | Test Precision | F1 Score |
|---------------------|---------------|-------------|----------------|----------|
| Logistic Regression |               |             |                |          |
| Random Forest       |               |             |                |          |
| K-NN                |               |             |                |          |
| Orazio              |               |             |                |          |
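The four columns of the table can be computed with the metric functions of scikit-learn. The sketch below uses a hypothetical helper, `summarize_metrics`, and made-up labels; in the notebook the inputs would be the test labels and the predictions of each fitted model.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def summarize_metrics(y_true, y_pred):
    """Return the four scores reported in the evaluation table."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }


# Made-up labels, for illustration only
y_true_demo = [0, 1, 1, 0, 1]
y_pred_demo = [0, 1, 0, 0, 1]

scores = summarize_metrics(y_true_demo, y_pred_demo)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

With these labels there are two true positives, one false negative and no false positives, so precision is 1.00, recall is 0.67, and accuracy and F1 are both 0.80.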