# CSC5610 Final - Water Potability Analysis (Continued)

Authors: **Jacob Buysse**, **Andrew Cook**, **Josh Grant**

## Part 2 - Initial Regression Analysis

For this we will be using the following libraries...

In [None]:
import pandas as pd
import scipy.sparse as sp
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

Now let us define a helper function for scoring our classifier models.  It prints the four confusion matrix counts, each with a short explanation, followed by the classification report (per-class precision, recall, and f1-score, plus overall accuracy).

In [None]:
def score_model(test_y, pred_y):
    """Print the confusion matrix counts and the classification report."""
    # With labels sorted as [0, 1], ravel() returns tn, fp, fn, tp,
    # treating class 1 (potable) as the positive class.
    tn, fp, fn, tp = confusion_matrix(test_y, pred_y).ravel()
    print(f"True Positive {tp} (Correctly classified potable)")
    print(f"True Negative: {tn} (Correctly classified non-potable)")
    print(f"False Positive {fp} (Incorrectly classified non-potable as potable - Bad)")
    print(f"False Negative {fn} (Incorrectly classified potable as non-potable - Meh)")
    print(classification_report(test_y, pred_y, digits=5))
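
The unpacking order in `score_model` depends on how scikit-learn lays out the confusion matrix. The following is a minimal sketch on made-up labels (`toy_true` and `toy_pred` are illustrative names, not part of our data) showing which count lands where when `1` means potable.

```python
from sklearn.metrics import confusion_matrix

# Toy labels, made up for illustration: 1 = potable, 0 = non-potable.
toy_true = [0, 0, 0, 1, 1, 1, 1, 1]
toy_pred = [0, 1, 1, 1, 1, 1, 1, 0]

# Rows are true labels and columns are predicted labels, both sorted as [0, 1],
# so ravel() yields tn, fp, fn, tp with class 1 (potable) as the positive class.
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()
print(tn, fp, fn, tp)  # 1 2 1 4
```

Here the two non-potable samples predicted as potable are the false positives, which is the costly kind of error for drinking water.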

Let us load our feather file and prepare the data for regression.

In [None]:
df = pd.read_feather("./potability.feather")
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 334280 entries, 0 to 334279
Columns: 438 entries, station_id to potable
dtypes: float64(434), int64(2), object(2)
memory usage: 1.1+ GB


Let us split our data into train/test datasets stratified over `potable` with a 75/25 split.

In [None]:
df.potable = df.potable.astype("category")
train_df, test_df = train_test_split(df, train_size=0.75, random_state=777, stratify=df.potable)
print(f"Train size: {train_df.shape}, Test size: {test_df.shape}")

Train size: (250710, 438), Test size: (83570, 438)
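
Stratifying matters here because the classes are heavily imbalanced (the reports further down show 6002 non-potable samples out of 83570 in the test set). As a rough sketch on a synthetic frame with a similar imbalance (`toy_df` and `param_example` are assumptions for illustration only), we can check that both splits keep the same class share.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 7% non-potable (0), 93% potable (1).
toy_df = pd.DataFrame({
    "param_example": np.arange(1000, dtype=float),
    "potable": np.array([0] * 70 + [1] * 930),
})

toy_train, toy_test = train_test_split(
    toy_df, train_size=0.75, random_state=777, stratify=toy_df.potable
)

# With stratify=..., each split keeps (almost exactly) the same share of each class.
train_share = toy_train.potable.mean()
test_share = toy_test.potable.mean()
print(f"potable share: train={train_share:.3f}, test={test_share:.3f}")
```

Without `stratify`, a random split could by chance leave the test set with noticeably more or fewer non-potable samples than the training set.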


Now let us scale our numeric features based on the training set.

In [None]:
scaler = StandardScaler()
num_features = df.columns[df.columns.str.contains("param_")].values.tolist()
scaler.fit(train_df[num_features])
train_num = scaler.transform(train_df[num_features])
test_num = scaler.transform(test_df[num_features])
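
Fitting the scaler on the training rows only keeps test-set statistics out of the model. A small sketch on random data (the `toy_` arrays are assumptions for illustration) shows the effect: the training columns come out with mean 0 and standard deviation 1, while the test columns are only approximately standardized because they reuse the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(777)
toy_train = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))
toy_test = rng.normal(loc=50.0, scale=10.0, size=(300, 3))

toy_scaler = StandardScaler()
toy_scaler.fit(toy_train)                          # statistics come from the training rows only
toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)   # reuses the training mean and std

# Training columns are standardized exactly; test columns are close but not exact.
print(toy_train_scaled.mean(axis=0).round(3), toy_train_scaled.std(axis=0).round(3))
print(toy_test_scaled.mean(axis=0).round(3), toy_test_scaled.std(axis=0).round(3))
```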

We only have numeric features for now, but let us get our train/test X/y values.

In [None]:
train_X = train_num
train_y = train_df.potable.values
test_X = test_num
test_y = test_df.potable.values
print(f"Train X {train_X.shape}, y {len(train_y)}")
print(f"Test X {test_X.shape}, y {len(test_y)}")

Train X (250710, 432), y 250710
Test X (83570, 432), y 83570


Let us try Logistic Regression.

In [None]:
model = LogisticRegression(random_state=777, max_iter=10000)
model.fit(train_X, train_y)
pred_y = model.predict(test_X)
score_model(test_y, pred_y)

True Positive 77271 (Correctly classified potable)
True Negative: 4027 (Correctly classified non-potable)
False Positive 1975 (Incorrectly classified non-potable as potable - Bad)
False Negative 297 (Incorrectly classified potable as non-potable - Meh)
              precision    recall  f1-score   support

           0    0.93131   0.67094   0.77997      6002
           1    0.97508   0.99617   0.98551     77568

    accuracy                        0.97281     83570
   macro avg    0.95320   0.83356   0.88274     83570
weighted avg    0.97193   0.97281   0.97075     83570



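The logistic regression produced a non-potable recall of 67% and an f1-score of 78%.  Not too bad, although roughly a third of the non-potable samples (1975 of 6002) were still classified as potable.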
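
The non-potable row of the report can be reproduced by hand from the four counts printed above. This is a short worked calculation, not part of the original pipeline; the only inputs are the counts from the logistic regression output.

```python
# Counts copied from the logistic regression output above.
tp, tn, fp, fn = 77271, 4027, 1975, 297

# For the non-potable class (label 0), a "hit" is a true negative in the
# potable-as-positive convention used by score_model.
recall_0 = tn / (tn + fp)       # share of actual non-potable samples that were caught
precision_0 = tn / (tn + fn)    # share of non-potable predictions that were right
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"recall={recall_0:.5f} precision={precision_0:.5f} f1={f1_0:.5f} accuracy={accuracy:.5f}")
```

The values match the `0` row and the accuracy line of the classification report to five digits.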

Let us try a Decision Tree classifier.

In [None]:
model = DecisionTreeClassifier(random_state=777)
model.fit(train_X, train_y)
pred_y = model.predict(test_X)
score_model(test_y, pred_y)

True Positive 77551 (Correctly classified potable)
True Negative: 5983 (Correctly classified non-potable)
False Positive 19 (Incorrectly classified non-potable as potable - Bad)
False Negative 17 (Incorrectly classified potable as non-potable - Meh)
              precision    recall  f1-score   support

           0    0.99717   0.99683   0.99700      6002
           1    0.99976   0.99978   0.99977     77568

    accuracy                        0.99957     83570
   macro avg    0.99846   0.99831   0.99838     83570
weighted avg    0.99957   0.99957   0.99957     83570



The decision tree model worked very well - misclassifying only 36 of the 83570 test samples.

Let us try an SGD classifier, once with `loss="hinge"` and once with `loss="log_loss"`.

In [None]:
model = SGDClassifier(random_state=777, loss="hinge")
model.fit(train_X, train_y)
pred_y = model.predict(test_X)
score_model(test_y, pred_y)

True Positive 77288 (Correctly classified potable)
True Negative: 3800 (Correctly classified non-potable)
False Positive 2202 (Incorrectly classified non-potable as potable - Bad)
False Negative 280 (Incorrectly classified potable as non-potable - Meh)
              precision    recall  f1-score   support

           0    0.93137   0.63312   0.75382      6002
           1    0.97230   0.99639   0.98420     77568

    accuracy                        0.97030     83570
   macro avg    0.95184   0.81476   0.86901     83570
weighted avg    0.96936   0.97030   0.96765     83570



In [None]:
model = SGDClassifier(random_state=777, loss="log_loss")
model.fit(train_X, train_y)
pred_y = model.predict(test_X)
score_model(test_y, pred_y)

True Positive 77168 (Correctly classified potable)
True Negative: 4014 (Correctly classified non-potable)
False Positive 1988 (Incorrectly classified non-potable as potable - Bad)
False Negative 400 (Incorrectly classified potable as non-potable - Meh)
              precision    recall  f1-score   support

           0    0.90938   0.66878   0.77074      6002
           1    0.97489   0.99484   0.98476     77568

    accuracy                        0.97143     83570
   macro avg    0.94213   0.83181   0.87775     83570
weighted avg    0.97018   0.97143   0.96939     83570



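They both did okay, landing close to the logistic regression (non-potable recall of 63% for hinge and 67% for log_loss), but much worse than the decision tree.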

Let us try a Random Forest classifier.

In [None]:
model = RandomForestClassifier(random_state=777)
model.fit(train_X, train_y)
pred_y = model.predict(test_X)
score_model(test_y, pred_y)

True Positive 77549 (Correctly classified potable)
True Negative: 5896 (Correctly classified non-potable)
False Positive 106 (Incorrectly classified non-potable as potable - Bad)
False Negative 19 (Incorrectly classified potable as non-potable - Meh)
              precision    recall  f1-score   support

           0    0.99679   0.98234   0.98951      6002
           1    0.99863   0.99976   0.99919     77568

    accuracy                        0.99850     83570
   macro avg    0.99771   0.99105   0.99435     83570
weighted avg    0.99850   0.99850   0.99850     83570



This model did well: slightly worse than the Decision Tree (125 misclassified samples versus 36) but much better than the SGD classifiers.

## Excluding Features

These models necessarily suffer from label leakage, since we computed the label with an algorithm that uses the very same features.  At this point the models are really just approximating that algorithm.  So let us exclude some of the features we used and re-train the models.  This will tell us whether there are relationships between different measurements that could be used to test the potability of the water indirectly.
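
To make the leakage concrete, here is a minimal sketch on synthetic data. The threshold rule and all `toy_` names are assumptions for illustration, not our actual potability algorithm. The label is computed from one feature, and a decision tree is trained with and without that feature.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(777)
toy_X = rng.normal(size=(4000, 5))
# Hypothetical rule: a sample "fails" (label 0) whenever feature 0 exceeds a limit.
toy_y = (toy_X[:, 0] <= 1.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, random_state=777, stratify=toy_y)

# With the rule's own feature available, the tree only has to rediscover the threshold.
leaky = DecisionTreeClassifier(random_state=777).fit(X_tr, y_tr)
acc_leaky = leaky.score(X_te, y_te)

# Without it, the remaining features are independent noise, so nothing is left to learn.
blind = DecisionTreeClassifier(random_state=777).fit(X_tr[:, 1:], y_tr)
acc_blind = blind.score(X_te[:, 1:], y_te)

print(f"with rule feature: {acc_leaky:.3f}, without: {acc_blind:.3f}")
```

In this toy setup the remaining features are pure noise by construction. In the real data they may be correlated with the excluded measurements, which is exactly what the experiments in this section probe.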

To start, let us exclude the top 5 features that failed the potability test.

In [None]:
scaler = StandardScaler()
num_features = df.columns[df.columns.str.contains("param_")].values.tolist()
num_features.remove("param_Dissolved Boron")
num_features.remove("param_Dissolved Nitrate")
num_features.remove("param_Total Manganese")
num_features.remove("param_Dissolved Fluoride")
num_features.remove("param_Bromodichloromethane")
scaler.fit(train_df[num_features])
train_X = scaler.transform(train_df[num_features])
test_X = scaler.transform(test_df[num_features])

Let us try a logistic regression.

In [None]:
model = LogisticRegression(random_state=777, max_iter=10000)
model.fit(train_X, train_y)
pred_y = model.predict(test_X)
score_model(test_y, pred_y)

True Positive 77387 (Correctly classified potable)
True Negative: 1516 (Correctly classified non-potable)
False Positive 4486 (Incorrectly classified non-potable as potable - Bad)
False Negative 181 (Incorrectly classified potable as non-potable - Meh)
              precision    recall  f1-score   support

           0    0.89334   0.25258   0.39382      6002
           1    0.94521   0.99767   0.97073     77568

    accuracy                        0.94415     83570
   macro avg    0.91927   0.62512   0.68227     83570
weighted avg    0.94148   0.94415   0.92930     83570



The non-potable recall dropped from 67% to 25% and the f1-score from 78% down to 39%.

Let us try a Decision Tree classifier.

In [None]:
model = DecisionTreeClassifier(random_state=777)
model.fit(train_X, train_y)
pred_y = model.predict(test_X)
score_model(test_y, pred_y)

True Positive 75256 (Correctly classified potable)
True Negative: 3795 (Correctly classified non-potable)
False Positive 2207 (Incorrectly classified non-potable as potable - Bad)
False Negative 2312 (Incorrectly classified potable as non-potable - Meh)
              precision    recall  f1-score   support

           0    0.62142   0.63229   0.62681      6002
           1    0.97151   0.97019   0.97085     77568

    accuracy                        0.94593     83570
   macro avg    0.79646   0.80124   0.79883     83570
weighted avg    0.94637   0.94593   0.94614     83570



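We can see that while we still have an accuracy of about 94.6%, our f1-score for non-potable dropped to 63%, and detecting non-potable water is kind of the point of the model.  For comparison, always predicting potable would already reach about 92.8% accuracy on this test set (77568 of 83570 samples).  So the remaining features still carry some signal, but not enough for us to rely on them for potability.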
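
Because the classes are so imbalanced, accuracy is best read against a trivial baseline. The short calculation below uses only the class supports from the reports above, with an "always predict potable" rule as the assumed baseline.

```python
# Class supports copied from the classification reports above.
n_non_potable, n_potable = 6002, 77568
n_total = n_non_potable + n_potable

# A model that always predicts "potable" is right on every potable sample
# and wrong on every non-potable one.
baseline_accuracy = n_potable / n_total
tree_accuracy = 0.94593   # decision tree with the top 5 features excluded

print(f"always-potable baseline: {baseline_accuracy:.5f}")
print(f"decision tree margin over baseline: {tree_accuracy - baseline_accuracy:.5f}")
```

The margin is under two percentage points, which is why the per-class recall and f1-score are more informative here than accuracy.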

Now let us try removing all features that failed the potability test and see how accurate we can get.

In [None]:
scaler = StandardScaler()
num_features = df.columns[df.columns.str.contains("param_")].values.tolist()
num_features.remove("param_1,2-Dibromo-3-chloropropane (DBCP)")
num_features.remove("param_BHC-gamma (Lindane)")
num_features.remove("param_Bromodichloromethane")
num_features.remove("param_Bromoform")
num_features.remove("param_Carbon tetrachloride")
num_features.remove("param_Chlordane")
num_features.remove("param_Chloroform")
num_features.remove("param_Cyanazine")
num_features.remove("param_DDT (all isomers)")
num_features.remove("param_Dibromochloromethane")
num_features.remove("param_Dissolved Antimony")
num_features.remove("param_Dissolved Arsenic")
num_features.remove("param_Dissolved Barium")
num_features.remove("param_Dissolved Boron")
num_features.remove("param_Dissolved Cadmium")
num_features.remove("param_Dissolved Chromium")
num_features.remove("param_Dissolved Copper")
num_features.remove("param_Dissolved Fluoride")
num_features.remove("param_Dissolved Lead")
num_features.remove("param_Dissolved Manganese")
num_features.remove("param_Dissolved Mercury")
num_features.remove("param_Dissolved Nickel")
num_features.remove("param_Dissolved Nitrate")
num_features.remove("param_Dissolved Nitrate + Nitrite")
num_features.remove("param_Dissolved Selenium")
# num_features.remove("param_Dissolved Uranium") # Excluded due to missing lat/lon information
num_features.remove("param_Endrin")
num_features.remove("param_Molinate")
num_features.remove("param_Pentachlorophenol (PCP)")
num_features.remove("param_Simazine")
num_features.remove("param_Total Antimony")
num_features.remove("param_Total Arsenic")
num_features.remove("param_Total Barium")
num_features.remove("param_Total Cadmium")
num_features.remove("param_Total Chromium")
num_features.remove("param_Total Copper")
num_features.remove("param_Total Lead")
num_features.remove("param_Total Manganese")
num_features.remove("param_Total Mercury")
num_features.remove("param_Total Nickel")
num_features.remove("param_Total Selenium")
scaler.fit(train_df[num_features])
train_X = scaler.transform(train_df[num_features])
test_X = scaler.transform(test_df[num_features])
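
As a possible alternative to the long run of `remove` calls, the exclusions could be kept in one list and filtered in a single pass. The sketch below uses a tiny made-up frame; `toy_df`, `excluded`, and the `param_pH` column are assumptions for illustration only.

```python
import pandas as pd

# Tiny stand-in frame; the column names mimic the "param_" convention used above.
toy_df = pd.DataFrame({
    "station_id": [1, 2],
    "param_Dissolved Boron": [0.1, 0.2],
    "param_Total Lead": [0.01, 0.02],
    "param_pH": [7.1, 6.9],          # hypothetical column, not taken from our data
    "potable": [1, 0],
})

# Hypothetical exclusion list; the real one would hold the 40 names removed above.
excluded = ["param_Dissolved Boron", "param_Total Lead"]

all_params = [c for c in toy_df.columns if "param_" in c]

# Fail loudly on a misspelled name, mirroring list.remove() raising ValueError.
missing = set(excluded) - set(all_params)
if missing:
    raise ValueError(f"Unknown feature names: {sorted(missing)}")

toy_features = [c for c in all_params if c not in excluded]
print(toy_features)  # ['param_pH']
```

Keeping the names in a list also makes it easy to reuse the same exclusions across several experiments.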

Let us try a logistic regression.

In [None]:
model = LogisticRegression(random_state=777, max_iter=10000)
model.fit(train_X, train_y)
pred_y = model.predict(test_X)
score_model(test_y, pred_y)

True Positive 77324 (Correctly classified potable)
True Negative: 489 (Correctly classified non-potable)
False Positive 5513 (Incorrectly classified non-potable as potable - Bad)
False Negative 244 (Incorrectly classified potable as non-potable - Meh)
              precision    recall  f1-score   support

           0    0.66712   0.08147   0.14521      6002
           1    0.93345   0.99685   0.96411     77568

    accuracy                        0.93111     83570
   macro avg    0.80028   0.53916   0.55466     83570
weighted avg    0.91432   0.93111   0.90530     83570



The non-potable recall dropped to 8% and the f1-score all the way down to 15%.

Let us try a Decision Tree classifier.

In [None]:
model = DecisionTreeClassifier(random_state=777)
model.fit(train_X, train_y)
pred_y = model.predict(test_X)
score_model(test_y, pred_y)

True Positive 74576 (Correctly classified potable)
True Negative: 3175 (Correctly classified non-potable)
False Positive 2827 (Incorrectly classified non-potable as potable - Bad)
False Negative 2992 (Incorrectly classified potable as non-potable - Meh)
              precision    recall  f1-score   support

           0    0.51484   0.52899   0.52182      6002
           1    0.96348   0.96143   0.96245     77568

    accuracy                        0.93037     83570
   macro avg    0.73916   0.74521   0.74213     83570
weighted avg    0.93126   0.93037   0.93080     83570



This dropped our non-potable f1-score down to 52%, with a non-potable recall of 53%.  This is still far better than a coin flip, since non-potable samples make up only about 7% of the test set (6002 of 83570).
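
To put the coin-flip comparison in numbers, the expected non-potable scores of a fair coin can be worked out from the class supports alone. This is a back-of-the-envelope calculation under the assumption that the coin labels each sample non-potable with probability 0.5.

```python
# Class supports copied from the classification report above.
n_non_potable, n_potable = 6002, 77568

# A fair coin labels half of each class as non-potable, in expectation.
hits = n_non_potable / 2            # non-potable samples flagged correctly
misses = n_non_potable / 2          # non-potable samples let through
false_alarms = n_potable / 2        # potable samples flagged by mistake

coin_recall = hits / (hits + misses)
coin_precision = hits / (hits + false_alarms)
coin_f1 = 2 * coin_precision * coin_recall / (coin_precision + coin_recall)

print(f"coin flip: recall={coin_recall:.3f} precision={coin_precision:.3f} f1={coin_f1:.3f}")
```

The coin matches the tree on recall only by flagging half of all potable water as well, so its precision, and with it the f1-score, collapses.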

Let us see if a Random Forest can do better.

In [None]:
model = RandomForestClassifier(random_state=777)
model.fit(train_X, train_y)
pred_y = model.predict(test_X)
score_model(test_y, pred_y)

True Positive 77122 (Correctly classified potable)
True Negative: 2532 (Correctly classified non-potable)
False Positive 3470 (Incorrectly classified non-potable as potable - Bad)
False Negative 446 (Incorrectly classified potable as non-potable - Meh)
              precision    recall  f1-score   support

           0    0.85024   0.42186   0.56392      6002
           1    0.95694   0.99425   0.97524     77568

    accuracy                        0.95314     83570
   macro avg    0.90359   0.70805   0.76958     83570
weighted avg    0.94928   0.95314   0.94570     83570



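Well, it takes a while to train, but it improves the non-potable f1-score to 56%.  The gain comes from precision (85%, up from 51% for the Decision Tree), while the non-potable recall actually falls from 53% to 42%.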
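
If training time becomes a problem, one option worth trying is the `n_jobs` parameter of `RandomForestClassifier`, which lets scikit-learn build the trees in parallel. The sketch below runs on made-up data (the `toy_` names are assumptions) and is not something we benchmarked on the real dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(777)
toy_X = rng.normal(size=(2000, 20))
toy_y = (toy_X[:, 0] * toy_X[:, 1] > 0).astype(int)   # made-up nonlinear labels

# n_jobs=-1 asks scikit-learn to build the trees on all available CPU cores.
# The fitted forest should not depend on n_jobs for a fixed random_state.
toy_model = RandomForestClassifier(random_state=777, n_jobs=-1)
toy_model.fit(toy_X, toy_y)
toy_train_score = toy_model.score(toy_X, toy_y)
print(f"trees: {len(toy_model.estimators_)}, training accuracy: {toy_train_score:.3f}")
```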