# CSC5610 Final - Water Potability Analysis (Continued)

Authors: **Jacob Buysse**, **Andrew Cook**, **Josh Grant**

## Part 5 - Complementary Data and Feature Engineering

This part uses the following libraries.

In [1]:
import pandas as pd
import scipy.sparse as sp
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_text
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

Now let us define a helper function for scoring our classifier models. It prints the four confusion matrix counts, each with a short explanation, followed by the classification report, which includes accuracy.

In [2]:
def score_model(test_y, pred_y):
    tn, fp, fn, tp = confusion_matrix(test_y, pred_y).ravel()
    print(f"True Positive {tp} (Correctly classified potable)")
    print(f"True Negative: {tn} (Correctly classified non-potable)")
    print(f"False Positive {fp} (Incorrectly classified non-potable as potable - Bad)")
    print(f"False Negative {fn} (Incorrectly classified potable as non-potable - Meh)")
    print(classification_report(test_y, pred_y, digits=5))
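The unpacking in `score_model` depends on the order in which `confusion_matrix(...).ravel()` flattens a binary matrix. Rows are true labels and columns are predicted labels, so the flattened order is tn, fp, fn, tp. A minimal sketch with made-up labels (assuming 1 means potable, as in the helper) illustrates this:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels: 1 = potable, 0 = non-potable.
true_y = [1, 1, 1, 0, 0, 1]
pred_y = [1, 0, 1, 0, 1, 1]

# Rows are true labels and columns are predicted labels, so ravel()
# yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(true_y, pred_y).ravel()
print(tn, fp, fn, tp)
```

Here the single non-potable sample predicted as potable is the false positive, which is the case the helper labels as "Bad".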

Let us load our sample feather and county feather files and merge the two.

In [3]:
sdf = pd.read_feather("./potability.feather")
sdf.head()

Unnamed: 0,station_id,latitude,longitude,county_name,sample_code,param_(Aminomethyl)phosphonic acid,"param_1,1,1,2-Tetrachloroethane","param_1,1,1-Trichloroethane","param_1,1,2,2-Tetrachloroethane","param_1,1,2-Trichloroethane",...,"param_p,p'-DDE","param_p,p'-DDT",param_p-Xylene,param_pH,"param_s,s,s-Tributyl Phosphorotrithioate (DEF)",param_sec-Butylbenzene,param_tert-Butylbenzene,"param_trans-1,2-Dichloroethene","param_trans-1,3-Dichloropropene",potable
0,1,38.5596,-121.4169,Sacramento,C0114B0005,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
1,1,38.5596,-121.4169,Sacramento,C0115B0005,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
2,1,38.5596,-121.4169,Sacramento,C0116B0005,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
3,1,38.5596,-121.4169,Sacramento,C0117B0081,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
4,1,38.5596,-121.4169,Sacramento,C0118B0005,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1


In [4]:
cdf = pd.read_feather("./counties.feather")
cdf.head()

Unnamed: 0,CountyName,AdminRegion,FireMAR,LawMAR,Shape__Area,Shape__Length,emp,qp1,ap,est,...,n50_99,n100_249,n250_499,n500_999,n1000,n1000_1,n1000_2,n1000_3,n1000_4,Population
0,Alameda,Coastal,2,2,3082165000.0,432059.180712,19808892,334583251,1276862745,1197073,...,32312.0,17613.0,3388.0,808.0,644.0,94.0,135.0,67.0,11.0,1628997
1,Alpine,Inland,4,4,3145871000.0,274621.12106,25372729,358615051,1326420707,1473927,...,42557.0,23000.0,4396.0,1317.0,1079.0,289.0,280.0,166.0,59.0,1190
2,Amador,Inland,4,4,2559998000.0,357482.565247,14140321,185837886,727112583,961824,...,21549.0,11581.0,2214.0,563.0,309.0,83.0,56.0,10.0,0.0,41412
3,Butte,Inland,3,3,7338660000.0,526729.272631,7085896,94172636,348842612,477062,...,10146.0,5091.0,858.0,234.0,174.0,48.0,28.0,0.0,0.0,207303
4,Calaveras,Inland,4,4,4352160000.0,371781.055548,8547832,105647215,421612466,591560,...,12189.0,6625.0,952.0,283.0,201.0,59.0,34.0,4.0,0.0,46563


In [5]:
df = sdf.merge(cdf, how="left", left_on="county_name", right_on="CountyName")
df.head()

Unnamed: 0,station_id,latitude,longitude,county_name,sample_code,param_(Aminomethyl)phosphonic acid,"param_1,1,1,2-Tetrachloroethane","param_1,1,1-Trichloroethane","param_1,1,2,2-Tetrachloroethane","param_1,1,2-Trichloroethane",...,n50_99,n100_249,n250_499,n500_999,n1000,n1000_1,n1000_2,n1000_3,n1000_4,Population
0,1,38.5596,-121.4169,Sacramento,C0114B0005,0.0,0.0,0.0,0.0,0.0,...,19033.0,9932.0,1738.0,478.0,297.0,58.0,33.0,37.0,9.0,1584169
1,1,38.5596,-121.4169,Sacramento,C0115B0005,0.0,0.0,0.0,0.0,0.0,...,19033.0,9932.0,1738.0,478.0,297.0,58.0,33.0,37.0,9.0,1584169
2,1,38.5596,-121.4169,Sacramento,C0116B0005,0.0,0.0,0.0,0.0,0.0,...,19033.0,9932.0,1738.0,478.0,297.0,58.0,33.0,37.0,9.0,1584169
3,1,38.5596,-121.4169,Sacramento,C0117B0081,0.0,0.0,0.0,0.0,0.0,...,19033.0,9932.0,1738.0,478.0,297.0,58.0,33.0,37.0,9.0,1584169
4,1,38.5596,-121.4169,Sacramento,C0118B0005,0.0,0.0,0.0,0.0,0.0,...,19033.0,9932.0,1738.0,478.0,297.0,58.0,33.0,37.0,9.0,1584169
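Because this is a left join on the county name, a sample whose `county_name` has no match in the county table would keep its row but receive NaN in every county column. The sketch below shows one way to check for that using the `indicator` parameter of `merge`. The two small frames are hypothetical stand-ins that reuse the notebook's column names, and the "Unknown" county is invented for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the sample and county frames.
samples = pd.DataFrame({
    "station_id": [1, 2, 3],
    "county_name": ["Sacramento", "Butte", "Unknown"],
})
counties = pd.DataFrame({
    "CountyName": ["Sacramento", "Butte"],
    "Population": [1584169, 207303],
})

# indicator=True adds a "_merge" column recording where each row matched.
merged = samples.merge(
    counties, how="left", left_on="county_name", right_on="CountyName",
    indicator=True,
)

# Rows flagged "left_only" had no matching county and carry NaN county data.
unmatched = merged.loc[merged["_merge"] == "left_only", "county_name"].tolist()
print(unmatched)
```

An empty list would mean every sample found its county.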


Let us do a 75/25 train/test split stratified by potability.

In [6]:
df.potable = df.potable.astype("category")
train_df, test_df = train_test_split(df, train_size=0.75, random_state=777, stratify=df.potable)
print(f"Train size: {train_df.shape}, Test size: {test_df.shape}")

Train size: (250710, 462), Test size: (83570, 462)
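Stratifying matters here because the classes are heavily imbalanced. The following sketch, on made-up labels rather than the notebook's data, shows that `stratify` keeps the non-potable share the same in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90% potable (1), 10% non-potable (0).
toy_y = np.array([1] * 900 + [0] * 100)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)

toy_train_X, toy_test_X, toy_train_y, toy_test_y = train_test_split(
    toy_X, toy_y, train_size=0.75, random_state=777, stratify=toy_y
)

# With stratification both splits keep roughly the same potable share.
print(toy_train_y.mean(), toy_test_y.mean())
```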


Now let us scale our numeric features based on the training set (excluding features used to directly compute potability).

In [7]:
scaler = StandardScaler()
num_features = [
    "Shape__Area", "Shape__Length",
    "emp", "qp1", "ap", "est",
    "n<5", "n5_9", "n10_19", "n20_49", "n50_99", "n100_249", "n250_499", "n500_999",
    "n1000", "n1000_1", "n1000_2", "n1000_3", "n1000_4",
    "Population"
] + df.columns[df.columns.str.contains("param_")].values.tolist()
num_features.remove("param_1,2-Dibromo-3-chloropropane (DBCP)")
num_features.remove("param_BHC-gamma (Lindane)")
num_features.remove("param_Bromodichloromethane")
num_features.remove("param_Bromoform")
num_features.remove("param_Carbon tetrachloride")
num_features.remove("param_Chlordane")
num_features.remove("param_Chloroform")
num_features.remove("param_Cyanazine")
num_features.remove("param_DDT (all isomers)")
num_features.remove("param_Dibromochloromethane")
num_features.remove("param_Dissolved Antimony")
num_features.remove("param_Dissolved Arsenic")
num_features.remove("param_Dissolved Barium")
num_features.remove("param_Dissolved Boron")
num_features.remove("param_Dissolved Cadmium")
num_features.remove("param_Dissolved Chromium")
num_features.remove("param_Dissolved Copper")
num_features.remove("param_Dissolved Fluoride")
num_features.remove("param_Dissolved Lead")
num_features.remove("param_Dissolved Manganese")
num_features.remove("param_Dissolved Mercury")
num_features.remove("param_Dissolved Nickel")
num_features.remove("param_Dissolved Nitrate")
num_features.remove("param_Dissolved Nitrate + Nitrite")
num_features.remove("param_Dissolved Selenium")
# num_features.remove("param_Dissolved Uranium") # Excluded due to missing lat/lon information
num_features.remove("param_Endrin")
num_features.remove("param_Molinate")
num_features.remove("param_Pentachlorophenol (PCP)")
num_features.remove("param_Simazine")
num_features.remove("param_Total Antimony")
num_features.remove("param_Total Arsenic")
num_features.remove("param_Total Barium")
num_features.remove("param_Total Cadmium")
num_features.remove("param_Total Chromium")
num_features.remove("param_Total Copper")
num_features.remove("param_Total Lead")
num_features.remove("param_Total Manganese")
num_features.remove("param_Total Mercury")
num_features.remove("param_Total Nickel")
num_features.remove("param_Total Selenium")
scaler.fit(train_df[num_features])
train_num = scaler.transform(train_df[num_features])
test_num = scaler.transform(test_df[num_features])
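Fitting the scaler on the training split only means the test split is standardized with the training mean and standard deviation, so no information from the test data leaks into the transformation. A small sketch with a single hypothetical feature shows the computation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single numeric feature, split into train and test.
toy_train = np.array([[1.0], [2.0], [3.0]])
toy_test = np.array([[4.0]])

toy_scaler = StandardScaler()
toy_scaler.fit(toy_train)  # learns the mean and standard deviation from train only

toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)  # reuses the training statistics

# Equivalent manual computation: z = (x - train_mean) / train_std
manual = (toy_test - toy_train.mean()) / toy_train.std()
print(toy_test_scaled, manual)
```

This is also why the thresholds printed in the decision tree later on are in scaled units rather than the original measurement units.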

Let us one-hot encode our categorical features.

In [8]:
hot_enc = OneHotEncoder()
cat_features = ["AdminRegion", "FireMAR", "LawMAR"]
hot_enc.fit(train_df[cat_features])
train_hot = hot_enc.transform(train_df[cat_features])
test_hot = hot_enc.transform(test_df[cat_features])

And finally we combine all of the datasets into X, y for train and test.

In [9]:
train_X = sp.hstack((train_hot, train_num))
train_y = train_df.potable.values
test_X = sp.hstack((test_hot, test_num))
test_y = test_df.potable.values
print(f"Train X {train_X.shape}, y {len(train_y)}")
print(f"Test X {test_X.shape}, y {len(test_y)}")

Train X (250710, 428), y 250710
Test X (83570, 428), y 83570
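The 428 columns are the one-hot columns followed by the scaled numeric columns. A minimal sketch of the same stacking on hypothetical blocks, assuming a sparse one-hot block and a dense numeric block as in the cells above:

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical blocks: 2 one-hot columns (sparse) and 3 scaled numeric columns (dense).
hot = sp.csr_matrix(np.array([[1, 0], [0, 1]]))
num = np.array([[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]])

# hstack places the blocks side by side; both must have the same number of rows.
combined = sp.hstack((hot, num))
print(combined.shape)

# Converting to CSR gives a format that supports row slicing.
combined_csr = combined.tocsr()
print(combined_csr[0].toarray())
```

The column order of the stacked matrix (one-hot first, numeric second) is what the later `feature_names` list has to match.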


Let us see how a logistic regression will perform.

In [10]:
model = LogisticRegression(random_state=777, max_iter=10000)
model.fit(train_X, train_y)
pred_y = model.predict(test_X)
score_model(test_y, pred_y)

True Positive 77312 (Correctly classified potable)
True Negative: 523 (Correctly classified non-potable)
False Positive 5479 (Incorrectly classified non-potable as potable - Bad)
False Negative 256 (Incorrectly classified potable as non-potable - Meh)
              precision    recall  f1-score   support

           0    0.67137   0.08714   0.15425      6002
           1    0.93382   0.99670   0.96424     77568

    accuracy                        0.93137     83570
   macro avg    0.80260   0.54192   0.55925     83570
weighted avg    0.91497   0.93137   0.90606     83570



The non-potable recall and f1-score barely changed (8.7% and 15.4%, vs. the original 8% and 14%).
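The 93% accuracy needs to be read against the class imbalance. As a quick check using only the support counts from the report above, a trivial model that always predicts "potable" would already score almost as well:

```python
# Class supports taken from the classification report above.
potable = 77568
non_potable = 6002

# Accuracy of a trivial model that always predicts "potable".
baseline_accuracy = potable / (potable + non_potable)

# Accuracy reported for the logistic regression.
logistic_accuracy = 0.93137

print(f"baseline {baseline_accuracy:.5f}, logistic {logistic_accuracy:.5f}")
```

This is why the non-potable recall and f1-score are the more informative numbers to compare across models.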

Now let us compare how a new Decision Tree classifier will perform.

In [10]:
model = DecisionTreeClassifier(random_state=777)
model.fit(train_X, train_y)
pred_y = model.predict(test_X)
score_model(test_y, pred_y)

True Positive 74930 (Correctly classified potable)
True Negative: 3379 (Correctly classified non-potable)
False Positive 2623 (Incorrectly classified non-potable as potable - Bad)
False Negative 2638 (Incorrectly classified potable as non-potable - Meh)
              precision    recall  f1-score   support

           0    0.56158   0.56298   0.56228      6002
           1    0.96618   0.96599   0.96608     77568

    accuracy                        0.93705     83570
   macro avg    0.76388   0.76449   0.76418     83570
weighted avg    0.93712   0.93705   0.93708     83570



The recall and f1-score for non-potable increased from 52% to 56%.  That moved the needle more than I expected.

Let us take a look at the decision tree to see which features ended up being used.

In [18]:
feature_names = list(hot_enc.get_feature_names_out(cat_features)) + num_features
print(export_text(model, feature_names=feature_names, max_depth=20))
with open("./decision_tree.txt", "w") as out:
    out.write(export_text(model, feature_names=feature_names, max_depth=100))

|--- param_Color <= 0.79
|   |--- param_Dissolved Sodium <= 0.08
|   |   |--- param_Total Iron <= -0.04
|   |   |   |--- param_Conductance <= -0.14
|   |   |   |   |--- param_Dichloroacetic Acid (DCAA) <= 21.58
|   |   |   |   |   |--- param_Dissolved Iron <= -0.04
|   |   |   |   |   |   |--- param_Color <= 0.27
|   |   |   |   |   |   |   |--- param_Dibromoacetic Acid (DBAA) <= 11.15
|   |   |   |   |   |   |   |   |--- param_Dissolved Sodium <= -0.11
|   |   |   |   |   |   |   |   |   |--- param_Total Potassium <= 0.51
|   |   |   |   |   |   |   |   |   |   |--- param_Total Titanium <= 13.28
|   |   |   |   |   |   |   |   |   |   |   |--- param_Total Cobalt <= 57.43
|   |   |   |   |   |   |   |   |   |   |   |   |--- param_Trihalomethane Formation Potential (THMFP) <= 37.67
|   |   |   |   |   |   |   |   |   |   |   |   |   |--- param_Suspended + Volatile Suspended Solids <= 11.42
|   |   |   |   |   |   |   |   |   |   |   |   |   |   |--- param_Total Iron <= -0.04
|   |   |  
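The printed tree is deep, so a ranked list of importances may be easier to scan than the raw text. The sketch below uses the `feature_importances_` attribute of a fitted `DecisionTreeClassifier` on synthetic data. The data and feature names are hypothetical; in the notebook, `model` and `feature_names` would take the place of `tree` and `names`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: only the first of three features determines the label.
rng = np.random.default_rng(777)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
names = ["informative", "noise_a", "noise_b"]  # hypothetical feature names

tree = DecisionTreeClassifier(random_state=777).fit(X, y)

# Pair each name with its impurity-based importance and sort descending.
ranked = sorted(zip(names, tree.feature_importances_), key=lambda p: p[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```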

Now let us compare how a new Random Forest classifier will perform.

In [21]:
model = RandomForestClassifier(random_state=777)
model.fit(train_X, train_y)
pred_y = model.predict(test_X)
score_model(test_y, pred_y)

True Positive 76989 (Correctly classified potable)
True Negative: 3099 (Correctly classified non-potable)
False Positive 2903 (Incorrectly classified non-potable as potable - Bad)
False Negative 579 (Incorrectly classified potable as non-potable - Meh)
              precision    recall  f1-score   support

           0    0.84258   0.51633   0.64029      6002
           1    0.96366   0.99254   0.97789     77568

    accuracy                        0.95833     83570
   macro avg    0.90312   0.75443   0.80909     83570
weighted avg    0.95497   0.95833   0.95364     83570



And our f1-score for non-potable increased to 64% (from 56%)!  That is a large jump in non-potable f1-score for data augmented with county-specific data.  Also note that the recall for non-potable increased from 42% to 51.6% (compared with the random forest on the original subset of features).  The recall for the random forest (51.6%) is still lower than the recall for the decision tree on the same features (56.3%); the random forest's higher f1-score comes from its much better non-potable precision (84.3% vs. 56.2%).
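To make the comparison concrete, the non-potable metrics of all three models can be recomputed from the confusion counts printed above. This is only a cross-check of the reported numbers, treating non-potable (class 0) as the class of interest; the helper name is made up for this sketch.

```python
# Confusion counts reported above, as (tn, fp, fn, tp) per model.
counts = {
    "logistic regression": (523, 5479, 256, 77312),
    "decision tree": (3379, 2623, 2638, 74930),
    "random forest": (3099, 2903, 579, 76989),
}

def non_potable_metrics(tn, fp, fn, tp):
    """Precision, recall and F1 with non-potable (class 0) as the target.

    tp is accepted only so the tuple can be unpacked directly; it is unused.
    """
    precision = tn / (tn + fn)  # of samples predicted non-potable, share truly non-potable
    recall = tn / (tn + fp)     # of truly non-potable samples, share that were caught
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

results = {name: non_potable_metrics(*c) for name, c in counts.items()}
for name, (p, r, f1) in results.items():
    print(f"{name}: precision {p:.5f}, recall {r:.5f}, f1 {f1:.5f}")
```

The decision tree catches more non-potable samples, while the random forest raises far fewer false alarms on potable water.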