# CSI4106 - Project 1

Group: P1-80

Roman Koval - 300082555

Noel Khalaf - 300079144

## Classification Empirical Study

### 1. Understand the classification task for your dataset

#### a. Is it a binary/multi-class classification?

This is a binary classification task: the output takes one of two values, 1 for potable water or 0 for non-potable water.


#### b. What is the goal? Is this for a particular application?

The goal is to predict whether a water sample is safe to drink from its measured properties. Such a model could help water suppliers provide safe drinking water, reducing adverse health effects and yielding a net economic benefit. The dataset serves as a training resource for developing water potability prediction methods.

### 2. Analyze your dataset

**Importing libraries**

In [46]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import KFold
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

#### a. Characterize the dataset in terms of number of training examples, number of features, missing data, etc.

In [47]:
# Loading the dataset
dataset = pd.read_csv("water_potability.csv")

In [48]:
dataset.shape

(3276, 10)

In [49]:
# Calculating the amount of empty cells
dataset.isnull().sum()

ph                 491
Hardness             0
Solids               0
Chloramines          0
Sulfate            781
Conductivity         0
Organic_carbon       0
Trihalomethanes    162
Turbidity            0
Potability           0
dtype: int64

Initially we can see that our dataset has 10 columns and 3276 rows.

There are also many empty cells within the data, for example 491 empty `ph` cells. These gaps must be handled in our training and test sets in order to obtain a properly trained model.

Every row that contains at least one empty cell is removed from the data set.

In [50]:
# Remove every row that contains at least one missing value.
# No imputation is needed afterwards, because no empty cells remain.
dataset = dataset.dropna()

dataset.isnull().sum()

ph                 0
Hardness           0
Solids             0
Chloramines        0
Sulfate            0
Conductivity       0
Organic_carbon     0
Trihalomethanes    0
Turbidity          0
Potability         0
dtype: int64

After running `dropna()` on our data set, every row with missing feature data has been removed, leaving 2011 of the original 3276 rows.
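
The difference between removing incomplete rows and imputing them can be sketched in plain Python. The rows below are made-up values, not taken from `water_potability.csv`, and `drop_incomplete` and `fill_with_median` are hypothetical helper names used only for this illustration.

```python
from statistics import median

# Toy rows (made-up values); None marks a missing cell
rows = [
    {"ph": 7.0, "Sulfate": 330.0},
    {"ph": None, "Sulfate": 310.0},
    {"ph": 8.0, "Sulfate": None},
    {"ph": None, "Sulfate": None},
]

def drop_incomplete(rows):
    """Keep only rows without any missing cell (row-wise deletion)."""
    return [r for r in rows if all(v is not None for v in r.values())]

def fill_with_median(rows):
    """Replace each missing cell with the median of the observed values in its column."""
    columns = list(rows[0].keys())
    medians = {c: median(r[c] for r in rows if r[c] is not None) for c in columns}
    return [{c: (r[c] if r[c] is not None else medians[c]) for c in columns} for r in rows]

missing_cells = sum(v is None for r in rows for v in r.values())
dropped = drop_incomplete(rows)
filled = fill_with_median(rows)

print(missing_cells)             # 4 missing cells...
print(len(rows) - len(dropped))  # ...spread over only 3 rows
print(len(filled))               # imputation keeps all 4 rows
```

The same effect explains why the number of missing cells in the real data is larger than the number of removed rows: a row that is missing several features is still only one row.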

In [51]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2011 entries, 3 to 3271
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ph               2011 non-null   float64
 1   Hardness         2011 non-null   float64
 2   Solids           2011 non-null   float64
 3   Chloramines      2011 non-null   float64
 4   Sulfate          2011 non-null   float64
 5   Conductivity     2011 non-null   float64
 6   Organic_carbon   2011 non-null   float64
 7   Trihalomethanes  2011 non-null   float64
 8   Turbidity        2011 non-null   float64
 9   Potability       2011 non-null   int64  
dtypes: float64(9), int64(1)
memory usage: 172.8 KB


In [52]:
dataset.head()

Unnamed: 0,ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,Organic_carbon,Trihalomethanes,Turbidity,Potability
3,8.316766,214.373394,22018.417441,8.059332,356.886136,363.266516,18.436524,100.341674,4.628771,0
4,9.092223,181.101509,17978.986339,6.5466,310.135738,398.410813,11.558279,31.997993,4.075075,0
5,5.584087,188.313324,28748.687739,7.544869,326.678363,280.467916,8.399735,54.917862,2.559708,0
6,10.223862,248.071735,28749.716544,7.513408,393.663396,283.651634,13.789695,84.603556,2.672989,0
7,8.635849,203.361523,13672.091764,4.563009,303.309771,474.607645,12.363817,62.798309,4.401425,0


Our resulting filtered data set now has the following attributes:

- **Features**: ph, Hardness, Solids, Chloramines, Sulfate, Conductivity, Organic_carbon, Trihalomethanes, Turbidity

- **Input Types**: All continuous

- **Classes**: Potable (1), Non-potable (0)

- **Output Type**: Discrete

- **Empty Data** (before filtering): 491 + 781 + 162 = 1434 cells

- **Samples with Missing Data** (removed by filtering): 3276 - 2011 = 1265 rows

- **Total Number of Samples**: 2011

The training and test data are split into 4 folds, so each round trains on roughly 75% of the samples and tests on the remaining 25%.

- **Number of Training Samples**: 2011 * 0.75 ≈ 1508

- **Number of Testing Samples**: 2011 * 0.25 ≈ 503
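
Because 2011 is not divisible by 4, the folds cannot all have the same size. The sketch below applies the rule documented for scikit-learn's `KFold` (the first `n_samples % n_splits` folds receive one extra sample); `kfold_test_sizes` is a hypothetical helper written for this illustration, not part of the library.

```python
def kfold_test_sizes(n_samples, n_splits):
    """Test-fold sizes: the first n_samples % n_splits folds get one extra sample."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]

test_sizes = kfold_test_sizes(2011, 4)
train_sizes = [2011 - t for t in test_sizes]

print(test_sizes)   # [503, 503, 503, 502]
print(train_sizes)  # [1508, 1508, 1508, 1509]
```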

### 3. Brainstorm about the attributes (Feature engineering)

#### a. Think about the features that could be useful for this task, are they all present in the dataset? Anything missing? Any attribute provided that doesn't seem useful to you?

All features provided within the dataset are relevant to determining water potability, with one notable exception: hardness. Water hardness is determined by its dissolved minerals, mainly calcium and magnesium. According to the World Health Organization (WHO), hard water has no known adverse health effect and may even supplement calcium and magnesium intake. If the hardness rises to a very noticeable level, this should also be reflected in the Solids feature of the dataset. An additional feature that may be useful is the colour of the water. Colour can be a clear first indication of potability, assuming that no colouring has been added to the water in question. A colour feature could skew the output on its own, but the other features should balance it enough for the model to reach a fair judgment.

### 4. Encode the features

#### a. As you will use models that need discrete or continuous attributes, think about data encoding and transformation.

A `StandardScaler` is applied to all input features to standardize them (zero mean and unit variance per feature). The conversion of the inputs from continuous to discrete values is done during the train/test/evaluate cycle of the Naïve Bayes classifier.

In [53]:
X_vals = dataset.drop('Potability', axis=1)

scaler = StandardScaler()
scaler.fit(X_vals)

X = pd.DataFrame(scaler.transform(X_vals), columns=X_vals.columns)
X.head()

Unnamed: 0,ph,Hardness,Solids,Chloramines,Sulfate,Conductivity,Organic_carbon,Trihalomethanes,Turbidity
0,0.782466,0.564114,0.011687,0.583804,0.574378,-0.783962,1.227032,2.111652,0.844761
1,1.275463,-0.455653,-0.455835,-0.370947,-0.56048,-0.348429,-0.842154,-2.140399,0.135033
2,-0.954835,-0.234614,0.790645,0.259104,-0.158911,-1.810063,-1.79234,-0.714423,-1.807366
3,1.994902,1.596951,0.790764,0.239248,1.46714,-1.770608,-0.170876,1.132494,-1.662163
4,0.985323,0.226606,-0.954313,-1.622878,-0.726179,0.595858,-0.599824,-0.224135,0.553348


In [54]:
y = dataset.get('Potability').to_frame()
y.head()

Unnamed: 0,Potability
3,0
4,0
5,0
6,0
7,0
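
The standardization applied to the inputs in this section can be illustrated on a single feature. This is a minimal sketch with made-up pH readings; `standardize` is a hypothetical helper that mirrors the z-score formula, using the population standard deviation as `StandardScaler` does.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population std (ddof=0)
    return [(x - mu) / sigma for x in values]

# Made-up pH readings, for illustration only
ph = [6.0, 7.0, 8.0, 9.0]
z = standardize(ph)

print([round(v, 3) for v in z])  # [-1.342, -0.447, 0.447, 1.342]
```

After the transformation the feature has mean 0 and standard deviation 1, so features measured on very different scales (for example Solids and Turbidity) contribute on a comparable scale.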


### 5. Prepare your data for the experiment, using cross-validation

The data is split into 4 folds. Each fold is stored in the `dataset_split` array. The different classifiers will then loop over this array and perform a round of train/test/evaluate.

In [55]:
kf = KFold(n_splits=4, shuffle=True, random_state=1)

dataset_split = []

for train_index, test_index in kf.split(X):
    X_train, X_test = X.values[train_index], X.values[test_index]
    y_train, y_test = y.values[train_index], y.values[test_index]
    dataset_split.append([X_train, y_train.ravel(), X_test, y_test.ravel()])

### 6-9. Train/test/evaluate three models

#### a. Naïve Bayes

##### Setup classifier

Naïve Bayes classifier that distributes the data into bins before training and testing. The number of bins and bin distribution strategy are used as the tunable parameters.

In [56]:
def classify_using_nb(num_bins, strategy):
    results = []

    for X_train, y_train, X_test, y_test in dataset_split:
        est = KBinsDiscretizer(n_bins=num_bins, encode="ordinal", strategy=strategy, random_state=1)
        est.fit(X.values)

        X_train_t = est.transform(X_train)
        X_test_t = est.transform(X_test)

        clf_nb = CategoricalNB(min_categories=num_bins)
        clf_nb.fit(X_train_t, y_train)

        y_pred = clf_nb.predict(X_test_t)

        report = classification_report(y_test, y_pred, output_dict=True, zero_division=0)
        tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
        result = {
            "accuracy": report["accuracy"],
            "precision_0": report['0']['precision'],
            "precision_1": report['1']['precision'],
            "recall_0": report['0']['recall'],
            "recall_1": report['1']['recall'],
            "tn": tn,
            "fp": fp,
            "fn": fn,
            "tp": tp
        }

        results.append(result)

    return pd.DataFrame(results)

##### Classify using default parameters

Classify using a Naïve Bayes with 5 bins for each feature and uniform bin width (all bins in each feature have identical widths).

In [57]:
results_nb_default = classify_using_nb(5, "uniform")
results_nb_default

Unnamed: 0,accuracy,precision_0,precision_1,recall_0,recall_1,tn,fp,fn,tp
0,0.60835,0.62963,0.478873,0.880259,0.175258,272,37,160,34
1,0.592445,0.597727,0.555556,0.90378,0.165094,263,28,177,35
2,0.624254,0.643902,0.537634,0.859935,0.255102,264,43,146,50
3,0.59761,0.604598,0.552239,0.897611,0.177033,263,30,172,37


In [58]:
results_nb_default.describe()

Unnamed: 0,accuracy,precision_0,precision_1,recall_0,recall_1,tn,fp,fn,tp
count,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
mean,0.605665,0.618964,0.531076,0.885396,0.193122,265.5,34.5,163.75,39.0
std,0.014053,0.021549,0.035662,0.019679,0.041654,4.358899,6.855655,13.81726,7.438638
min,0.592445,0.597727,0.478873,0.859935,0.165094,263.0,28.0,146.0,34.0
25%,0.596319,0.60288,0.522944,0.875178,0.172717,263.0,29.5,156.5,34.75
50%,0.60298,0.617114,0.544937,0.888935,0.176146,263.5,33.5,166.0,36.0
75%,0.612326,0.633198,0.553068,0.899153,0.196551,266.0,38.5,173.25,40.25
max,0.624254,0.643902,0.555556,0.90378,0.255102,272.0,43.0,177.0,50.0


##### Classify with a greater number of bins

Running Naïve Bayes with 50 bins for each feature and uniform bin width.

In [59]:
results_nb_1 = classify_using_nb(50, "uniform")
results_nb_1

Unnamed: 0,accuracy,precision_0,precision_1,recall_0,recall_1,tn,fp,fn,tp
0,0.5666,0.626741,0.416667,0.728155,0.309278,225,84,134,60
1,0.572565,0.597938,0.486957,0.797251,0.264151,232,59,156,56
2,0.568588,0.624309,0.425532,0.736156,0.306122,226,81,136,60
3,0.577689,0.609756,0.488722,0.767918,0.311005,225,68,144,65


In [60]:
results_nb_1.describe()

Unnamed: 0,accuracy,precision_0,precision_1,recall_0,recall_1,tn,fp,fn,tp
count,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
mean,0.571361,0.614686,0.454469,0.75737,0.297639,227.0,73.0,142.5,60.25
std,0.004894,0.01345,0.038709,0.03165,0.022417,3.366502,11.633286,9.983319,3.685557
min,0.5666,0.597938,0.416667,0.728155,0.264151,225.0,59.0,134.0,56.0
25%,0.568091,0.606802,0.423316,0.734156,0.29563,225.0,65.75,135.5,59.0
50%,0.570577,0.617033,0.456244,0.752037,0.3077,225.5,74.5,140.0,60.0
75%,0.573846,0.624917,0.487398,0.775251,0.30971,227.5,81.75,147.0,61.25
max,0.577689,0.626741,0.488722,0.797251,0.311005,232.0,84.0,156.0,65.0


##### Classify with a different bin strategy

Running Naïve Bayes with 5 bins for each feature and the quantile binning strategy (all bins in each feature hold the same number of points).

In [61]:
results_nb_2 = classify_using_nb(5, "quantile")
results_nb_2

Unnamed: 0,accuracy,precision_0,precision_1,recall_0,recall_1,tn,fp,fn,tp
0,0.606362,0.639098,0.480769,0.825243,0.257732,255,54,144,50
1,0.596421,0.605769,0.551724,0.865979,0.226415,252,39,164,48
2,0.574553,0.622691,0.427419,0.76873,0.270408,236,71,143,53
3,0.611554,0.621891,0.57,0.853242,0.272727,250,43,152,57


In [62]:
results_nb_2.describe()

Unnamed: 0,accuracy,precision_0,precision_1,recall_0,recall_1,tn,fp,fn,tp
count,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
mean,0.597222,0.622362,0.507478,0.828299,0.256821,248.25,51.75,150.75,52.0
std,0.016365,0.01361,0.065801,0.043204,0.021315,8.421203,14.314911,9.708244,3.91578
min,0.574553,0.605769,0.427419,0.76873,0.226415,236.0,39.0,143.0,48.0
25%,0.590954,0.61786,0.467432,0.811114,0.249903,246.5,42.0,143.75,49.5
50%,0.601392,0.622291,0.516247,0.839243,0.26407,251.0,48.5,148.0,51.5
75%,0.60766,0.626793,0.556293,0.856427,0.270988,252.75,58.25,155.0,54.0
max,0.611554,0.639098,0.57,0.865979,0.272727,255.0,71.0,164.0,57.0
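
The two binning strategies compared above can be sketched in plain Python. The values are made up, and `uniform_edges`, `quantile_edges`, and `assign_bins` are hypothetical helpers that only approximate what `KBinsDiscretizer` does with `strategy="uniform"` and `strategy="quantile"`.

```python
def uniform_edges(values, n_bins):
    """Equal-width bin edges between the minimum and the maximum."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]

def quantile_edges(values, n_bins):
    """Edges at evenly spaced quantiles (linear interpolation between sorted values)."""
    s = sorted(values)
    edges = []
    for i in range(n_bins + 1):
        pos = i * (len(s) - 1) / n_bins
        low = int(pos)
        high = min(low + 1, len(s) - 1)
        edges.append(s[low] + (s[high] - s[low]) * (pos - low))
    return edges

def assign_bins(values, edges):
    """Ordinal bin index of each value; the last bin includes the right edge."""
    n_bins = len(edges) - 1
    return [min(sum(x >= e for e in edges[1:]), n_bins - 1) for x in values]

# Made-up skewed feature with one extreme value
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
uniform_bins = assign_bins(values, uniform_edges(values, 5))
quantile_bins = assign_bins(values, quantile_edges(values, 5))
uniform_counts = [uniform_bins.count(b) for b in range(5)]
quantile_counts = [quantile_bins.count(b) for b in range(5)]

print(uniform_counts)   # [9, 0, 0, 0, 1]
print(quantile_counts)  # [2, 2, 2, 2, 2]
```

With uniform widths, a few extreme values can leave most bins nearly empty, while quantile bins stay evenly populated. This may be part of the reason the bin count and strategy change the results, although the experiments above do not isolate that effect.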


#### b. Logistic Regression

##### Setup classifier

Logistic regression classifier. The solver algorithm used and the penalty type are used as the tunable parameters.

In [63]:
def classify_using_lr(solver, penalty):
    results = []

    for X_train, y_train, X_test, y_test in dataset_split:
        clf_lr = LogisticRegression(solver=solver, penalty=penalty, random_state=1)
        clf_lr.fit(X_train, y_train)

        y_pred = clf_lr.predict(X_test)

        report = classification_report(y_test, y_pred, output_dict=True, zero_division=0)
        tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
        result = {
            "accuracy": report["accuracy"],
            "precision_0": report['0']['precision'],
            "precision_1": report['1']['precision'],
            "recall_0": report['0']['recall'],
            "recall_1": report['1']['recall'],
            "tn": tn,
            "fp": fp,
            "fn": fn,
            "tp": tp
        }

        results.append(result)

    return pd.DataFrame(results)

##### Classify using default parameters

Classify using logistic regression with lbfgs solver and l2 penalty.

In [64]:
results_lr_default = classify_using_lr("lbfgs", "l2")
results_lr_default

Unnamed: 0,accuracy,precision_0,precision_1,recall_0,recall_1,tn,fp,fn,tp
0,0.594433,0.608247,0.222222,0.954693,0.020619,295,14,190,4
1,0.578529,0.578529,0.0,1.0,0.0,291,0,212,0
2,0.616302,0.614919,0.714286,0.993485,0.02551,305,2,191,5
3,0.583665,0.583665,0.0,1.0,0.0,293,0,209,0


In [65]:
results_lr_default.describe()

Unnamed: 0,accuracy,precision_0,precision_1,recall_0,recall_1,tn,fp,fn,tp
count,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
mean,0.593232,0.59634,0.234127,0.987044,0.011532,296.0,4.0,200.5,2.25
std,0.016747,0.017934,0.336811,0.021785,0.013465,6.218253,6.733003,11.61895,2.629956
min,0.578529,0.578529,0.0,0.954693,0.0,291.0,0.0,190.0,0.0
25%,0.582381,0.582381,0.0,0.983787,0.0,292.5,0.0,190.75,0.0
50%,0.589049,0.595956,0.111111,0.996743,0.010309,294.0,1.0,200.0,2.0
75%,0.599901,0.609915,0.345238,1.0,0.021841,297.5,5.0,209.75,4.25
max,0.616302,0.614919,0.714286,1.0,0.02551,305.0,14.0,212.0,5.0


##### Classify with a different solver

Classify using logistic regression with liblinear solver and l2 penalty.

In [66]:
results_lr_1 = classify_using_lr("liblinear", "l2")
results_lr_1

Unnamed: 0,accuracy,precision_0,precision_1,recall_0,recall_1,tn,fp,fn,tp
0,0.592445,0.607438,0.210526,0.951456,0.020619,294,15,190,4
1,0.578529,0.578529,0.0,1.0,0.0,291,0,212,0
2,0.616302,0.614919,0.714286,0.993485,0.02551,305,2,191,5
3,0.583665,0.583665,0.0,1.0,0.0,293,0,209,0


In [67]:
results_lr_1.describe()

Unnamed: 0,accuracy,precision_0,precision_1,recall_0,recall_1,tn,fp,fn,tp
count,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
mean,0.592735,0.596138,0.231203,0.986235,0.011532,295.75,4.25,200.5,2.25
std,0.016729,0.017758,0.337,0.023389,0.013465,6.291529,7.228416,11.61895,2.629956
min,0.578529,0.578529,0.0,0.951456,0.0,291.0,0.0,190.0,0.0
25%,0.582381,0.582381,0.0,0.982978,0.0,292.5,0.0,190.75,0.0
50%,0.588055,0.595552,0.105263,0.996743,0.010309,293.5,1.0,200.0,2.0
75%,0.59841,0.609308,0.336466,1.0,0.021841,296.75,5.25,209.75,4.25
max,0.616302,0.614919,0.714286,1.0,0.02551,305.0,15.0,212.0,5.0


##### Classify with a different penalty

Classify using logistic regression with liblinear solver and l1 penalty.

In [68]:
results_lr_2 = classify_using_lr("liblinear", "l1")
results_lr_2

Unnamed: 0,accuracy,precision_0,precision_1,recall_0,recall_1,tn,fp,fn,tp
0,0.594433,0.607803,0.1875,0.957929,0.015464,296,13,191,3
1,0.578529,0.578529,0.0,1.0,0.0,291,0,212,0
2,0.616302,0.614919,0.714286,0.993485,0.02551,305,2,191,5
3,0.583665,0.583665,0.0,1.0,0.0,293,0,209,0


In [69]:
results_lr_2.describe()

Unnamed: 0,accuracy,precision_0,precision_1,recall_0,recall_1,tn,fp,fn,tp
count,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
mean,0.593232,0.596229,0.225446,0.987854,0.010244,296.25,3.75,200.75,2.0
std,0.016747,0.017837,0.337666,0.020185,0.012519,6.184658,6.238322,11.324752,2.44949
min,0.578529,0.578529,0.0,0.957929,0.0,291.0,0.0,191.0,0.0
25%,0.582381,0.582381,0.0,0.984596,0.0,292.5,0.0,191.0,0.0
50%,0.589049,0.595734,0.09375,0.996743,0.007732,294.5,1.0,200.0,1.5
75%,0.599901,0.609582,0.319196,1.0,0.017975,298.25,4.75,209.75,3.5
max,0.616302,0.614919,0.714286,1.0,0.02551,305.0,13.0,212.0,5.0
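
The two penalty types tried above differ only in the regularization term added to the logistic loss. The sketch below computes both terms for a made-up coefficient vector, following the formulation given in the scikit-learn user guide (sum of absolute weights for `l1`, half the sum of squared weights for `l2`); the function names are hypothetical.

```python
def l1_penalty(weights):
    """Sum of absolute weights; tends to push some weights to exactly zero."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Half the sum of squared weights; shrinks all weights smoothly."""
    return 0.5 * sum(w * w for w in weights)

# Made-up coefficients, for illustration only
weights = [0.5, -0.25, 0.0, 1.0]

print(l1_penalty(weights))  # 1.75
print(l2_penalty(weights))  # 0.65625
```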


#### c. Multi-Layer Perceptron

##### Setup classifier

Multi-layer perceptron classifier. The hidden layer sizes, activation function and learning rate type are used as the tunable parameters.

In [70]:
def classify_using_mlp(hidden_layer_sizes, activation, learning_rate):
    results = []

    for X_train, y_train, X_test, y_test in dataset_split:        
        clf_mlp = MLPClassifier(
            solver="sgd",
            hidden_layer_sizes=hidden_layer_sizes,
            activation=activation,
            learning_rate=learning_rate,
            random_state=1,
            max_iter=9999,
        )
        clf_mlp.fit(X_train, y_train)

        y_pred = clf_mlp.predict(X_test)

        report = classification_report(y_test, y_pred, output_dict=True, zero_division=0)
        tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
        result = {
            "accuracy": report["accuracy"],
            "precision_0": report['0']['precision'],
            "precision_1": report['1']['precision'],
            "recall_0": report['0']['recall'],
            "recall_1": report['1']['recall'],
            "tn": tn,
            "fp": fp,
            "fn": fn,
            "tp": tp
        }

        results.append(result)

    return pd.DataFrame(results)

##### Classify using default parameters

Classify using multi-layer perceptron with a single 100 node hidden layer, rectified linear unit activation function and a constant learning rate.

In [71]:
results_mlp_default = classify_using_mlp((100,), "relu", "constant")
results_mlp_default

Unnamed: 0,accuracy,precision_0,precision_1,recall_0,recall_1,tn,fp,fn,tp
0,0.699801,0.702564,0.690265,0.886731,0.402062,274,35,116,78
1,0.654076,0.654354,0.653226,0.852234,0.382075,248,43,131,81
2,0.673956,0.694823,0.617647,0.830619,0.428571,255,52,112,84
3,0.701195,0.685714,0.752137,0.901024,0.421053,264,29,121,88


In [72]:
results_mlp_default.describe()

Unnamed: 0,accuracy,precision_0,precision_1,recall_0,recall_1,tn,fp,fn,tp
count,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
mean,0.682257,0.684364,0.678319,0.867652,0.40844,260.25,39.75,120.0,82.75
std,0.02258,0.021159,0.057453,0.032077,0.020818,11.26573,9.979145,8.205689,4.272002
min,0.654076,0.654354,0.617647,0.830619,0.382075,248.0,29.0,112.0,78.0
25%,0.668986,0.677874,0.644331,0.84683,0.397065,253.25,33.5,115.0,80.25
50%,0.686879,0.690269,0.671746,0.869483,0.411557,259.5,39.0,118.5,82.5
75%,0.70015,0.696758,0.705733,0.890305,0.422932,266.5,45.25,123.5,85.0
max,0.701195,0.702564,0.752137,0.901024,0.428571,274.0,52.0,131.0,88.0


##### Classify with a different activation function

Classify using multi-layer perceptron with a single 100 node hidden layer, logistic activation function and a constant learning rate.

In [73]:
results_mlp_1 = classify_using_mlp((100,), "logistic", "constant")
results_mlp_1

Unnamed: 0,accuracy,precision_0,precision_1,recall_0,recall_1,tn,fp,fn,tp
0,0.614314,0.614314,0.0,1.0,0.0,309,0,194,0
1,0.578529,0.578529,0.0,1.0,0.0,291,0,212,0
2,0.610338,0.610338,0.0,1.0,0.0,307,0,196,0
3,0.583665,0.583665,0.0,1.0,0.0,293,0,209,0


In [74]:
results_mlp_1.describe()

Unnamed: 0,accuracy,precision_0,precision_1,recall_0,recall_1,tn,fp,fn,tp
count,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
mean,0.596712,0.596712,0.0,1.0,0.0,300.0,0.0,202.75,0.0
std,0.018224,0.018224,0.0,0.0,0.0,9.309493,0.0,9.069179,0.0
min,0.578529,0.578529,0.0,1.0,0.0,291.0,0.0,194.0,0.0
25%,0.582381,0.582381,0.0,1.0,0.0,292.5,0.0,195.5,0.0
50%,0.597002,0.597002,0.0,1.0,0.0,300.0,0.0,202.5,0.0
75%,0.611332,0.611332,0.0,1.0,0.0,307.5,0.0,209.75,0.0
max,0.614314,0.614314,0.0,1.0,0.0,309.0,0.0,212.0,0.0


##### Classify with a different learning rate

Classify using multi-layer perceptron with a single 100 node hidden layer, logistic activation function and an inverse scaling learning rate.

In [75]:
results_mlp_2 = classify_using_mlp((100,), "logistic", "invscaling")
results_mlp_2

Unnamed: 0,accuracy,precision_0,precision_1,recall_0,recall_1,tn,fp,fn,tp
0,0.614314,0.614314,0.0,1.0,0.0,309,0,194,0
1,0.578529,0.578529,0.0,1.0,0.0,291,0,212,0
2,0.610338,0.610338,0.0,1.0,0.0,307,0,196,0
3,0.583665,0.583665,0.0,1.0,0.0,293,0,209,0


In [76]:
results_mlp_2.describe()

Unnamed: 0,accuracy,precision_0,precision_1,recall_0,recall_1,tn,fp,fn,tp
count,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
mean,0.596712,0.596712,0.0,1.0,0.0,300.0,0.0,202.75,0.0
std,0.018224,0.018224,0.0,0.0,0.0,9.309493,0.0,9.069179,0.0
min,0.578529,0.578529,0.0,1.0,0.0,291.0,0.0,194.0,0.0
25%,0.582381,0.582381,0.0,1.0,0.0,292.5,0.0,195.5,0.0
50%,0.597002,0.597002,0.0,1.0,0.0,300.0,0.0,202.5,0.0
75%,0.611332,0.611332,0.0,1.0,0.0,307.5,0.0,209.75,0.0
max,0.614314,0.614314,0.0,1.0,0.0,309.0,0.0,212.0,0.0
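
The `invscaling` schedule divides the initial learning rate by a power of the time step. The sketch below uses the formula from the `MLPClassifier` documentation with its default values (`learning_rate_init=0.001`, `power_t=0.5`), which are assumed here because the notebook does not override them; `invscaling_learning_rate` is a hypothetical helper.

```python
def invscaling_learning_rate(t, learning_rate_init=0.001, power_t=0.5):
    """Effective learning rate at time step t (t >= 1) under inverse scaling."""
    return learning_rate_init / (t ** power_t)

rates = [invscaling_learning_rate(t) for t in (1, 100, 10000)]
print(rates)  # shrinks from 1e-3 towards roughly 1e-5
```

Since the step size only shrinks over time, a network that already fails to leave the majority-class prediction with a constant rate is unlikely to do so with smaller steps. This is one possible reading of the identical results above, not something the experiment proves.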


##### Classify with different hidden layers

Classify using multi-layer perceptron with 3 hidden layers with 19, 16 and 13 nodes, rectified linear unit activation function and a constant learning rate.

In [77]:
results_mlp_3 = classify_using_mlp((19, 16, 13), "relu", "constant")
results_mlp_3

Unnamed: 0,accuracy,precision_0,precision_1,recall_0,recall_1,tn,fp,fn,tp
0,0.683897,0.723214,0.60479,0.786408,0.520619,243,66,93,101
1,0.658052,0.673469,0.625,0.793814,0.471698,231,60,112,100
2,0.664016,0.705357,0.580838,0.771987,0.494898,237,70,99,97
3,0.693227,0.703812,0.670807,0.819113,0.516746,240,53,101,108


In [78]:
results_mlp_3.describe()

Unnamed: 0,accuracy,precision_0,precision_1,recall_0,recall_1,tn,fp,fn,tp
count,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
mean,0.674798,0.701463,0.620359,0.79283,0.50099,237.75,62.25,101.25,101.5
std,0.016524,0.020635,0.03817,0.019727,0.022573,5.123475,7.410578,7.932003,4.654747
min,0.658052,0.673469,0.580838,0.771987,0.471698,231.0,53.0,93.0,97.0
25%,0.662525,0.696227,0.598802,0.782803,0.489098,235.5,58.25,97.5,99.25
50%,0.673956,0.704585,0.614895,0.790111,0.505822,238.5,63.0,100.0,100.5
75%,0.686229,0.709821,0.636452,0.800139,0.517714,240.75,67.0,103.75,102.75
max,0.693227,0.723214,0.670807,0.819113,0.520619,243.0,70.0,112.0,108.0
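
To put the two hidden-layer configurations in perspective, the sketch below counts the trainable weights and biases of each fully connected network. It assumes 9 input features and a single output unit for the binary target; `count_parameters` is a hypothetical helper.

```python
def count_parameters(n_inputs, hidden_layer_sizes, n_outputs=1):
    """Number of weights and biases in a fully connected network."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))

single_layer = count_parameters(9, (100,))
three_layers = count_parameters(9, (19, 16, 13))

print(single_layer)  # 1101
print(three_layers)  # 745
```

Under these assumptions the deeper network has fewer parameters than the single 100-node layer, so its higher recall for potable water does not come from a larger model.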


### 10. Analyze the obtained results

#### a. Comparing the 10 Results

##### Setup

In [79]:
all_results = [
    results_nb_default, results_nb_1, results_nb_2,
    results_lr_default, results_lr_1, results_lr_2,
    results_mlp_default, results_mlp_1, results_mlp_2, results_mlp_3,
]

def analyze_pr_results(all_results, analysis):
    # One column per model (numbered from 1) holding the requested describe() statistic
    metrics = ["precision_0", "precision_1", "recall_0", "recall_1"]
    analyzed_results = {}
    for i, results in enumerate(all_results, start=1):
        analyzed_results[analysis + str(i)] = results.describe().loc[analysis][metrics]
    return pd.DataFrame(analyzed_results)

def analyze_matrix_results(all_results, analysis):
    cells = ["tn", "fp", "fn", "tp"]
    if analysis == "sum":
        # describe() has no "sum" row, so total each confusion-matrix cell over the folds.
        # The resulting rows 0-9 correspond to models 1-10.
        return pd.DataFrame({c: [results[c].sum() for results in all_results] for c in cells})
    analyzed_results = {}
    for i, results in enumerate(all_results, start=1):
        analyzed_results[analysis + str(i)] = results.describe().loc[analysis][cells]
    return pd.DataFrame(analyzed_results)

##### Obtained Values

Below are the obtained mean, standard deviation, minimum, and maximum values for the precisions and recalls from all 10 results obtained in sections 6-9. Numbers 1-3 correspond to the Naïve Bayes results, 4-6 to Logistic Regression, and 7-10 to the Multi-Layer Perceptron.

In [80]:
analyze_pr_results(all_results, "mean")

Unnamed: 0,mean1,mean2,mean3,mean4,mean5,mean6,mean7,mean8,mean9,mean10
precision_0,0.618964,0.614686,0.622362,0.59634,0.596138,0.596229,0.684364,0.596712,0.596712,0.701463
precision_1,0.531076,0.454469,0.507478,0.234127,0.231203,0.225446,0.678319,0.0,0.0,0.620359
recall_0,0.885396,0.75737,0.828299,0.987044,0.986235,0.987854,0.867652,1.0,1.0,0.79283
recall_1,0.193122,0.297639,0.256821,0.011532,0.011532,0.010244,0.40844,0.0,0.0,0.50099


In [81]:
analyze_pr_results(all_results, "std")

Unnamed: 0,std1,std2,std3,std4,std5,std6,std7,std8,std9,std10
precision_0,0.021549,0.01345,0.01361,0.017934,0.017758,0.017837,0.021159,0.018224,0.018224,0.020635
precision_1,0.035662,0.038709,0.065801,0.336811,0.337,0.337666,0.057453,0.0,0.0,0.03817
recall_0,0.019679,0.03165,0.043204,0.021785,0.023389,0.020185,0.032077,0.0,0.0,0.019727
recall_1,0.041654,0.022417,0.021315,0.013465,0.013465,0.012519,0.020818,0.0,0.0,0.022573


In [82]:
analyze_pr_results(all_results, "min")

Unnamed: 0,min1,min2,min3,min4,min5,min6,min7,min8,min9,min10
precision_0,0.597727,0.597938,0.605769,0.578529,0.578529,0.578529,0.654354,0.578529,0.578529,0.673469
precision_1,0.478873,0.416667,0.427419,0.0,0.0,0.0,0.617647,0.0,0.0,0.580838
recall_0,0.859935,0.728155,0.76873,0.954693,0.951456,0.957929,0.830619,1.0,1.0,0.771987
recall_1,0.165094,0.264151,0.226415,0.0,0.0,0.0,0.382075,0.0,0.0,0.471698


In [83]:
analyze_pr_results(all_results, "max")

Unnamed: 0,max1,max2,max3,max4,max5,max6,max7,max8,max9,max10
precision_0,0.643902,0.626741,0.639098,0.614919,0.614919,0.614919,0.702564,0.614314,0.614314,0.723214
precision_1,0.555556,0.488722,0.57,0.714286,0.714286,0.714286,0.752137,0.0,0.0,0.670807
recall_0,0.90378,0.797251,0.865979,1.0,1.0,1.0,0.901024,1.0,1.0,0.819113
recall_1,0.255102,0.311005,0.272727,0.02551,0.02551,0.02551,0.428571,0.0,0.0,0.520619


##### Discussion

The values above let us determine which of the 10 trained model configurations performs best in which scenario. They also help us spot signs of a poor fit and skew towards either of the two possible predictions.

In terms of precision for non-potable water (precision_0), the Multi-Layer Perceptron with added hidden layers (model 10) outperformed the other models: its minimum (0.673), maximum (0.723), and mean (0.701) are all higher than those of the other nine models. Its standard deviation (0.021) is the third highest in the group, but because its minimum is also the highest of all ten models, it still comes out ahead for non-potable precision under our configurations. The worst model for non-potable precision on average is the Logistic Regression with the liblinear solver and L2 penalty (model 5), although all three Logistic Regression configurations and Multi-Layer Perceptron models 8 and 9 sit within 0.001 of each other at about 0.596.

For the precision of potable water (precision_1), the Multi-Layer Perceptron with the default configuration (model 7) obtained the best mean result (0.678). Taking these results into account, the Multi-Layer Perceptron appears to be the best type of model when optimizing for precision. However, models 8 and 9, both Multi-Layer Perceptrons with the logistic activation function, have a precision_1 of zero. Their true positive (tp) and false positive (fp) counts are both zero, meaning these two models never predicted that the water was potable. Precision is then undefined and is reported as 0 because of `zero_division=0`. This could be due to the network failing to train with this configuration, or to the data containing fewer potable than non-potable samples.

In terms of recall, Multi-Layer Perceptron models 8 and 9 have perfect scores for non-potable water (recall_0). This is caused by their fp value being 0: they label every sample as non-potable, so the perfect score is not meaningful. Setting these two aside, the next best model is the Logistic Regression with liblinear solver and L1 penalty (model 6). Changing from the L2 to the L1 penalty with the liblinear solver slightly reduced the mean precision_1 (0.231 to 0.225) and slightly increased the mean recall_0 (0.986 to 0.988), although both differences are well within one standard deviation.

Finally, for potable water recall (recall_1), the Multi-Layer Perceptron with added hidden layers (model 10) is the dominant model, with a mean of 0.501. No other model reaches a recall_1 of 0.5, judging by their maximum values. Furthermore, models 8 and 9 have a recall_1 of 0 in every fold, and the Logistic Regression models reach 0 in two of the four folds, which indicates that no true positives (tp) are present. This is consistent with the corresponding precision_1, which is also 0.

After comparing the precision and recall values of the various models and their configurations, the Multi-Layer Perceptron with added hidden layers (model 10) provides the best overall results for predicting the potability of water.
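
A useful reference point for the skew discussed above is the accuracy of a model that always predicts the majority class. The sketch below compares that baseline with the mean fold accuracies reported by `describe()` in sections 6-9; the class totals come from the confusion-matrix sums in part b, and the dictionary keys are hypothetical labels chosen for this illustration.

```python
# Class totals over all four test folds
n_non_potable = 1200  # tn + fp
n_potable = 811       # fn + tp

majority_baseline = n_non_potable / (n_non_potable + n_potable)

# Mean fold accuracy of each configuration, copied from the describe() outputs
mean_accuracy = {
    "nb_default": 0.605665, "nb_50_bins": 0.571361, "nb_quantile": 0.597222,
    "lr_lbfgs_l2": 0.593232, "lr_liblinear_l2": 0.592735, "lr_liblinear_l1": 0.593232,
    "mlp_default": 0.682257, "mlp_logistic": 0.596712,
    "mlp_logistic_invscaling": 0.596712, "mlp_three_layers": 0.674798,
}

beats_baseline = sorted(name for name, acc in mean_accuracy.items() if acc > majority_baseline)

print(round(majority_baseline, 4))  # 0.5967
print(beats_baseline)
```

Only the two Naïve Bayes configurations with 5 bins and the two ReLU Multi-Layer Perceptrons exceed the baseline. The two logistic-activation perceptrons match it almost exactly, because they always predict non-potable; the tiny gap comes from averaging fold accuracies instead of pooling all samples.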

#### b. Good and Bad Examples

##### Obtained Values

Below are the obtained mean, standard deviation, minimum, maximum, and sum values for the true negative (tn), false positive (fp), false negative (fn), and true positive (tp) counts from all 10 results obtained in sections 6-9. The same numbering applies, with 1-3 corresponding to the Naïve Bayes results, 4-6 to Logistic Regression, and 7-10 to the Multi-Layer Perceptron. In the sum table, rows 0-9 correspond to models 1-10 in the same order.

In [84]:
analyze_matrix_results(all_results, "mean")

Unnamed: 0,mean1,mean2,mean3,mean4,mean5,mean6,mean7,mean8,mean9,mean10
tn,265.5,227.0,248.25,296.0,295.75,296.25,260.25,300.0,300.0,237.75
fp,34.5,73.0,51.75,4.0,4.25,3.75,39.75,0.0,0.0,62.25
fn,163.75,142.5,150.75,200.5,200.5,200.75,120.0,202.75,202.75,101.25
tp,39.0,60.25,52.0,2.25,2.25,2.0,82.75,0.0,0.0,101.5


In [85]:
analyze_matrix_results(all_results, "std")

Unnamed: 0,std1,std2,std3,std4,std5,std6,std7,std8,std9,std10
tn,4.358899,3.366502,8.421203,6.218253,6.291529,6.184658,11.26573,9.309493,9.309493,5.123475
fp,6.855655,11.633286,14.314911,6.733003,7.228416,6.238322,9.979145,0.0,0.0,7.410578
fn,13.81726,9.983319,9.708244,11.61895,11.61895,11.324752,8.205689,9.069179,9.069179,7.932003
tp,7.438638,3.685557,3.91578,2.629956,2.629956,2.44949,4.272002,0.0,0.0,4.654747


In [86]:
analyze_matrix_results(all_results, "min")

Unnamed: 0,min1,min2,min3,min4,min5,min6,min7,min8,min9,min10
tn,263.0,225.0,236.0,291.0,291.0,291.0,248.0,291.0,291.0,231.0
fp,28.0,59.0,39.0,0.0,0.0,0.0,29.0,0.0,0.0,53.0
fn,146.0,134.0,143.0,190.0,190.0,191.0,112.0,194.0,194.0,93.0
tp,34.0,56.0,48.0,0.0,0.0,0.0,78.0,0.0,0.0,97.0


In [87]:
analyze_matrix_results(all_results, "max")

Unnamed: 0,max1,max2,max3,max4,max5,max6,max7,max8,max9,max10
tn,272.0,232.0,255.0,305.0,305.0,305.0,274.0,309.0,309.0,243.0
fp,43.0,84.0,71.0,14.0,15.0,13.0,52.0,0.0,0.0,70.0
fn,177.0,156.0,164.0,212.0,212.0,212.0,131.0,212.0,212.0,112.0
tp,50.0,65.0,57.0,5.0,5.0,5.0,88.0,0.0,0.0,108.0


In [88]:
analyze_matrix_results(all_results, "sum")

Unnamed: 0,tn,fp,fn,tp
0,1062,138,655,156
1,908,292,570,241
2,993,207,603,208
3,1184,16,802,9
4,1183,17,802,9
5,1185,15,803,8
6,1041,159,480,331
7,1200,0,811,0
8,1200,0,811,0
9,951,249,405,406


##### Discussion

The true negative (tn), false positive (fp), false negative (fn), and true positive (tp) counts above are the values from which the precision and recall discussed in part a of this section are derived.

Since around 60% of the samples in our data set are non-potable (1200 of 2011), the models are skewed towards predicting that the water is non-potable. This is evident from the sums of fp and tp, which count the potable predictions: they are noticeably lower than the sums of tn and fn, which count the non-potable predictions. The vast majority of predictions are therefore non-potable, and most of those are correct (tn). In other words, our models are on average most likely to predict non-potable and to be right when they do. Although a substantial number of non-potable predictions are actually potable samples, as shown by fn, this is a much safer error than the reverse, where a model predicts that water is safe to consume when it is not. We can also see that the fp and tp values for Logistic Regression and two of the Multi-Layer Perceptron configurations are extremely low, with some reaching 0. This shows that these models were very hesitant to predict that the water is potable.

In summary, the models with a more even spread of predictions generally outperformed the heavily skewed models. Although the true positive values are still low in models 1-3, 7, and 10, the true negatives make up for it, especially since this is a binary output. With our configurations, the Naïve Bayes is the most balanced model, the Logistic Regression is consistently low performing, and the Multi-Layer Perceptron sits at both extremes, with its custom hidden layers giving the best performing model overall.
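
The link between the confusion-matrix counts and the metrics from part a can be checked with a short sketch. It recomputes precision and recall from the summed counts of model 10 and model 8; `precision_recall` is a hypothetical helper, and its `zero_division` argument imitates the option passed to `classification_report` in the notebook.

```python
def precision_recall(tn, fp, fn, tp, zero_division=0.0):
    """Per-class precision and recall from the four confusion-matrix cells."""
    def ratio(num, den):
        return num / den if den else zero_division
    return {
        "precision_0": ratio(tn, tn + fn),
        "precision_1": ratio(tp, tp + fp),
        "recall_0": ratio(tn, tn + fp),
        "recall_1": ratio(tp, tp + fn),
    }

# Summed counts over the four folds, taken from the "sum" table
model_10 = precision_recall(tn=951, fp=249, fn=405, tp=406)
model_8 = precision_recall(tn=1200, fp=0, fn=811, tp=0)

print({k: round(v, 4) for k, v in model_10.items()})
print(model_8["precision_1"], model_8["recall_0"])  # 0.0 1.0
```

The pooled values for model 10 are close to, but not identical with, the fold means reported earlier, since averaging per-fold ratios is not the same as taking the ratio of summed counts. For model 8, precision_1 has a zero denominator and is only reported as 0 by convention.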

## References

- https://www.kaggle.com/code/hossamgalal68/water-quality-prediction-with-ann-dl-eda/notebook

- https://www2.cambridgema.gov/CityOfCambridge_Content/documents/Drinking%20WaterMy%20edition.pdf

- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3775162/

- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html

- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

- https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.CategoricalNB.html

- https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html

- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

- https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html