# Model-building phase: supervised approaches

In [46]:
import pandas as pd
import numpy as np
import random
from sklearn import preprocessing
from prep import *

## Data preprocessing

To preprocess the data we use the *prep* function, which handles missing values and scaling in a single step. For missing values, it can either remove every affected observation or remove only part of them, replacing the remaining missing entries with the mean or median of the corresponding variable. For scaling, any scikit-learn scaler can be passed; the default is MinMaxScaler. The function takes a pandas DataFrame as input and returns a numpy ndarray containing the cleaned data.
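The *prep* function itself lives in the local prep module and is not shown here. The block below is only a rough sketch of the behaviour described above, not the actual implementation. It assumes that `perc` is the percentage of incomplete rows to drop, that those rows are picked at random, and that the remaining gaps are imputed before scaling. The name `prep_sketch` and the `seed` parameter are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing


def prep_sketch(df, perc=100, fill_method="mean", scaler=None, seed=13):
    """Hypothetical stand-in for prep: drop, impute, then scale."""
    if scaler is None:
        scaler = preprocessing.MinMaxScaler()  # default stated in the text
    rng = np.random.default_rng(seed)

    # Row labels of observations with at least one missing value.
    incomplete = df.index[df.isna().any(axis=1)].to_numpy()

    # Assumption: drop a random `perc` percent of the incomplete rows.
    n_drop = int(round(len(incomplete) * perc / 100))
    to_drop = rng.choice(incomplete, size=n_drop, replace=False)
    kept = df.drop(index=to_drop)

    # Impute whatever is still missing with the column mean or median.
    fill_values = kept.mean() if fill_method == "mean" else kept.median()
    filled = kept.fillna(fill_values)

    # scikit-learn scalers return a numpy ndarray.
    return scaler.fit_transform(filled)


toy = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0], "b": [0.0, np.nan, 1.0, 3.0]})
print(prep_sketch(toy, perc=100).shape)
print(prep_sketch(toy, perc=50).shape)
```

With this toy frame, two of the four rows are incomplete, so `perc=100` keeps only the complete rows while `perc=50` keeps one incomplete row and imputes it.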

Our idea is to generate two datasets: the first by eliminating all observations with at least one missing value, the second by eliminating only 50 percent of those observations and imputing the rest. The cell below also builds a third variant, *water100*, with perc=0. We will then train each model on every dataset and collect the resulting metrics, in order to assess whether, on average, removing fewer incomplete observations (and imputing their missing values instead) brings any benefit.
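As an illustration of the comparison described above, the following sketch trains the same model on several datasets and gathers the metrics in one table. Everything in it is an assumption made for the example: the toy data, the choice of LogisticRegression, the two metrics, the helper names `evaluate_on` and `toy_dataset`, and the convention that the target sits in the last column.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split


def evaluate_on(data, seed=13):
    """Fit one model on `data` and return its hold-out metrics."""
    # Assumption: features in all but the last column, target in the last.
    X, y = data[:, :-1], data[:, -1].astype(int)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return {"accuracy": accuracy_score(y_te, pred), "f1": f1_score(y_te, pred)}


rng = np.random.default_rng(13)


def toy_dataset(n):
    """Synthetic stand-in for a cleaned dataset (not the water data)."""
    X = rng.random((n, 3))
    y = (X[:, 0] > 0.5).astype(float)
    return np.column_stack([X, y])


datasets = {"all_removed": toy_dataset(200), "half_removed": toy_dataset(300)}

# One row per dataset, one column per metric.
metrics = pd.DataFrame({name: evaluate_on(d) for name, d in datasets.items()}).T
print(metrics)
```

Keeping the metrics in a single DataFrame makes it straightforward to average over models later and compare the datasets side by side.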

In [47]:
random.seed(13)
water = pd.read_csv('dataset/drinking_water_potability.csv')
water0 = prep(
    df=water,
    axis='obs',
    perc=100,
    fill_method='mean',
    scaler=preprocessing.MinMaxScaler()
)
water50 = prep(
    df=water,
    axis='obs',
    perc=50,
    fill_method='mean',
    scaler=preprocessing.MinMaxScaler()
)
water100 = prep(
    df=water,
    axis='obs',
    perc=0,
    fill_method='mean',
    scaler=preprocessing.MinMaxScaler()
)
print('original dataset size: ', water.shape, '- type: ', type(water))
print('cleaned dataset with all of missing values removed: ', np.shape(water0), '- type: ', type(water0))
print('cleaned dataset with 50% of missing values removed: ', np.shape(water50), '- type: ', type(water50))
print('cleaned dataset with 100% of missing values removed: ', np.shape(water100), '- type: ', type(water100))

original dataset size:  (3276, 10) - type:  <class 'pandas.core.frame.DataFrame'>
cleaned dataset with all of missing values removed:  (2011, 10) - type:  <class 'numpy.ndarray'>
cleaned dataset with 50% of missing values removed:  (2515, 10) - type:  <class 'numpy.ndarray'>
cleaned dataset with 100% of missing values removed:  (2993, 10) - type:  <class 'numpy.ndarray'>


At this point we divide each dataset into a train set, a validation set and a test set. To do this, we call the *splitting_func()* helper, which relies on the *train_test_split()* function of scikit-learn and prints the shapes before and after splitting.
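A three-way split can be obtained by applying *train_test_split()* twice. The sketch below is an assumption about how such a helper could look, not the code of the helper used in this notebook. The name `split_sketch`, its parameters, the return order and the convention that the target is the last column are all hypothetical. The 60/20/20 proportions are chosen because they are consistent with the shapes printed below.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_sketch(data, test_size=0.2, val_size=0.25, seed=13):
    """Split an ndarray into train, validation and test parts."""
    # Assumption: target in the last column, features in the others.
    X, y = data[:, :-1], data[:, -1]

    # First split: hold out the test set (20% of all rows).
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )

    # Second split: 25% of the remaining 80% gives a 20% validation set.
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=val_size, random_state=seed
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Toy data with the same number of rows as the fully cleaned dataset.
toy = np.column_stack(
    [np.random.default_rng(0).random((2011, 9)), np.arange(2011) % 2]
)
X_train, X_val, X_test, y_train, y_val, y_test = split_sketch(toy)
print(X_train.shape, X_val.shape, X_test.shape)
```

Passing a fixed `random_state` to both calls makes the split reproducible, which matters when the same partition has to be reused across models.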

In [48]:
X_train, y_train, X_val, X_test, y_val, y_test = splitting_func(water0)

BEFORE SPLITTING: 

X_water0 shape:  (2011, 8)
y_water0 shape:  (2011,)

AFTER SPLITTING: 
X_train0 shape:  (1206, 8)
X_val0 shape:  (402, 8)
X_test0 shape:  (403, 8)
y_train0 shape:  (1206,)
y_val0 shape:  (402,)
y_test0 shape:  (403,)


In [49]:
X_train, y_train, X_val, X_test, y_val, y_test = splitting_func(water50)

BEFORE SPLITTING: 

X_water0 shape:  (2515, 8)
y_water0 shape:  (2515,)

AFTER SPLITTING: 
X_train0 shape:  (1509, 8)
X_val0 shape:  (503, 8)
X_test0 shape:  (503, 8)
y_train0 shape:  (1509,)
y_val0 shape:  (503,)
y_test0 shape:  (503,)


In [50]:
X_train, y_train, X_val, X_test, y_val, y_test = splitting_func(water100)

BEFORE SPLITTING: 

X_water0 shape:  (2993, 8)
y_water0 shape:  (2993,)

AFTER SPLITTING: 
X_train0 shape:  (1795, 8)
X_val0 shape:  (599, 8)
X_test0 shape:  (599, 8)
y_train0 shape:  (1795,)
y_val0 shape:  (599,)
y_test0 shape:  (599,)
