# Water Quality Model

The water quality model uses supervised machine learning to classify water as potable (1) or not potable (0). Because the output variable takes one of two categories, this is a binary classification problem. In supervised learning, the model is trained on labeled data: each training example is paired with an output label, and the model learns the mapping from inputs to outputs from those pairs.

## Data

The data is stored in a CSV file located in the `data` folder under the name `water_potability.csv`. The first task involves performing Exploratory Data Analysis (EDA) to identify any discrepancies in the dataset, normalize the data, and visualize it effectively.
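
A first pass at EDA can be as simple as summary statistics plus a class-balance check. The sketch below uses a small made-up frame (`toy`, a hypothetical stand-in for `water_potability.csv`) so that it runs on its own; its values are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real CSV; values are made up for illustration.
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 8.1, 6.5],
    "Hardness": [204.9, 129.4, 224.2, 214.4],
    "Potability": [0, 0, 1, 1],
})

# Per-column summary statistics; "count" excludes NaN values.
summary = toy.describe()

# Share of each class in the target column.
class_share = toy["Potability"].value_counts(normalize=True)

print(summary)
print(class_share)
```

On the real data, the same two calls on `data` would show the value ranges of each feature and whether the two classes are balanced.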

In [30]:

# import modules needed

import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

In [28]:
# read csv file and print the first 10 rows

data = pd.read_csv('data/water_potability.csv')
print(data.head(10))

# Number of rows and columns in entire dataset
print(f'Data has {data.shape[0]} rows and {data.shape[1]} columns')

          ph    Hardness        Solids  Chloramines     Sulfate  Conductivity  \
0        NaN  204.890455  20791.318981     7.300212  368.516441    564.308654   
1   3.716080  129.422921  18630.057858     6.635246         NaN    592.885359   
2   8.099124  224.236259  19909.541732     9.275884         NaN    418.606213   
3   8.316766  214.373394  22018.417441     8.059332  356.886136    363.266516   
4   9.092223  181.101509  17978.986339     6.546600  310.135738    398.410813   
5   5.584087  188.313324  28748.687739     7.544869  326.678363    280.467916   
6  10.223862  248.071735  28749.716544     7.513408  393.663396    283.651634   
7   8.635849  203.361523  13672.091764     4.563009  303.309771    474.607645   
8        NaN  118.988579  14285.583854     7.804174  268.646941    389.375566   
9  11.180284  227.231469  25484.508491     9.077200  404.041635    563.885481   

   Organic_carbon  Trihalomethanes  Turbidity  Potability  
0       10.379783        86.990970   2.963135   

### Check for the availability and number of NaN values in the dataset

In [10]:
# check for NaN values in each column
nan_counts = data.isnull().sum()

print(f'Nan Values in each column: {nan_counts}')

Nan Values in each column: ph                 491
Hardness             0
Solids               0
Chloramines          0
Sulfate            781
Conductivity         0
Organic_carbon       0
Trihalomethanes    162
Turbidity            0
Potability           0
dtype: int64


From the output above, three columns contain NaN values: `ph` (491), `Sulfate` (781), and `Trihalomethanes` (162). All other columns are complete.
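
Raw counts are easier to interpret as a share of the rows. The sketch below shows one way to compute the percentage of missing values per column; `toy` is a made-up frame used only so the example is self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; values are illustrative only.
toy = pd.DataFrame({
    "ph": [np.nan, 3.7, 8.1, 8.3],
    "Sulfate": [368.5, np.nan, np.nan, 356.9],
    "Turbidity": [2.96, 4.50, 3.06, 4.63],
})

# isnull() yields booleans; their column mean is the fraction of missing entries.
missing_pct = toy.isnull().mean() * 100
print(missing_pct.round(1))
```

Applied to `data`, the same expression would show how much of each column the imputation step has to fill in.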

### Replace NaN values with the mean of the column

In [21]:
# Replace NaN values in 'ph', 'Sulfate' and 'Trihalomethanes' with the mean of the respective column
data['ph'] = data['ph'].fillna(data['ph'].mean())
data['Sulfate'] = data['Sulfate'].fillna(data['Sulfate'].mean())
data['Trihalomethanes'] = data['Trihalomethanes'].fillna(data['Trihalomethanes'].mean())

print("NaN values after replacement:")
print(data.isnull().sum())

NaN values after replacement:
ph                 0
Hardness           0
Solids             0
Chloramines        0
Sulfate            0
Conductivity       0
Organic_carbon     0
Trihalomethanes    0
Turbidity          0
Potability         0
dtype: int64
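
The cell above computes each mean over the whole dataset. A common variant, shown here as a sketch rather than as the notebook's method, computes the means on the training rows only and reuses them for held-out rows, so no information from the held-out data enters the imputation. The `train` and `test` frames are hypothetical and their values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical training and held-out rows; values are illustrative only.
train = pd.DataFrame({"ph": [6.0, np.nan, 8.0], "Sulfate": [300.0, 340.0, np.nan]})
test = pd.DataFrame({"ph": [np.nan, 7.5], "Sulfate": [310.0, np.nan]})

# Column means come from the training rows only (NaN values are skipped).
train_means = train.mean()

# fillna with a Series fills each column with the value under the matching label.
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)
print(test_filled)
```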


In [22]:
print(data.head(10))

          ph    Hardness        Solids  Chloramines     Sulfate  Conductivity  \
0   7.080795  204.890455  20791.318981     7.300212  368.516441    564.308654   
1   3.716080  129.422921  18630.057858     6.635246  333.775777    592.885359   
2   8.099124  224.236259  19909.541732     9.275884  333.775777    418.606213   
3   8.316766  214.373394  22018.417441     8.059332  356.886136    363.266516   
4   9.092223  181.101509  17978.986339     6.546600  310.135738    398.410813   
5   5.584087  188.313324  28748.687739     7.544869  326.678363    280.467916   
6  10.223862  248.071735  28749.716544     7.513408  393.663396    283.651634   
7   8.635849  203.361523  13672.091764     4.563009  303.309771    474.607645   
8   7.080795  118.988579  14285.583854     7.804174  268.646941    389.375566   
9  11.180284  227.231469  25484.508491     9.077200  404.041635    563.885481   

   Organic_carbon  Trihalomethanes  Turbidity  Potability  
0       10.379783        86.990970   2.963135   

In [23]:
# Separate the data into X -> feature columns and Y -> output label

X = data.iloc[:, 0:9]
Y = data['Potability']

# Display first few rows of X and Y to verify
print(f'X features: {X.head()}')
print(f'Y target: {Y.head()}')

X features:          ph    Hardness        Solids  Chloramines     Sulfate  Conductivity  \
0  7.080795  204.890455  20791.318981     7.300212  368.516441    564.308654   
1  3.716080  129.422921  18630.057858     6.635246  333.775777    592.885359   
2  8.099124  224.236259  19909.541732     9.275884  333.775777    418.606213   
3  8.316766  214.373394  22018.417441     8.059332  356.886136    363.266516   
4  9.092223  181.101509  17978.986339     6.546600  310.135738    398.410813   

   Organic_carbon  Trihalomethanes  Turbidity  
0       10.379783        86.990970   2.963135  
1       15.180013        56.329076   4.500656  
2       16.868637        66.420093   3.055934  
3       18.436524       100.341674   4.628771  
4       11.558279        31.997993   4.075075  
Y target: 0    0
1    0
2    0
3    0
4    0
Name: Potability, dtype: int64


### Normalize the Data

In [27]:
scaler = MinMaxScaler()

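def normalize_data(X):
    '''
    Scale each feature column to the [0, 1] range using the module-level MinMaxScaler.
    Returns a NumPy array with the same shape as X.
    '''
    normalized_data = scaler.fit_transform(X)

    return normalized_data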

X_normalized = normalize_data(X)
print(type(X_normalized))

X_normalized_df = pd.DataFrame(X_normalized, columns=X.columns)

# Display first few rows of the normalized data
print(X_normalized_df.head())
    

<class 'numpy.ndarray'>
         ph  Hardness    Solids  Chloramines   Sulfate  Conductivity  \
0  0.505771  0.571139  0.336096     0.543891  0.680385      0.669439   
1  0.265434  0.297400  0.300611     0.491839  0.581699      0.719411   
2  0.578509  0.641311  0.321619     0.698543  0.581699      0.414652   
3  0.594055  0.605536  0.356244     0.603314  0.647347      0.317880   
4  0.649445  0.484851  0.289922     0.484900  0.514545      0.379337   

   Organic_carbon  Trihalomethanes  Turbidity  
0        0.313402         0.699753   0.286091  
1        0.497319         0.450999   0.576793  
2        0.562017         0.532866   0.303637  
3        0.622089         0.808065   0.601015  
4        0.358555         0.253606   0.496327  
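
Min-max scaling maps every column to the [0, 1] range via `(x - min) / (max - min)`, computed per column. The sketch below checks that formula against `MinMaxScaler` on a made-up array (`X_toy` is hypothetical).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical two-feature array; values are illustrative only.
X_toy = np.array([[3.7, 129.4], [8.1, 224.2], [9.1, 181.1]])

# Manual min-max scaling, column by column.
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
manual = (X_toy - col_min) / (col_max - col_min)

# The same transformation using scikit-learn.
scaled = MinMaxScaler().fit_transform(X_toy)
print(np.allclose(manual, scaled))
```

This explains why every value in the normalized table lies between 0 and 1: the smallest entry of each column maps to 0 and the largest to 1.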


## Building a Neural Network for Binary Classification using TensorFlow

In [40]:
# Small fully connected network: 9 input features -> 6 -> 4 -> 2 class probabilities
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape = (9,)),
    tf.keras.layers.Dense(6, activation="tanh"),
    tf.keras.layers.Dense(4, activation="tanh"),
    tf.keras.layers.Dense(2, activation="softmax")
])

# Integer labels (0/1) with a 2-unit softmax output use sparse categorical cross-entropy
loss_function = tf.keras.losses.SparseCategoricalCrossentropy()
model.compile(optimizer = 'adam',
              loss = loss_function,
              metrics = ['accuracy'])
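
The output layer uses two softmax units. For two classes, the softmax probability of class 1 equals a sigmoid applied to the difference of the two logits, which is why a single sigmoid unit with binary cross-entropy is a common alternative head for this kind of problem. The sketch below checks the identity in NumPy; the `logits` values are made up.

```python
import numpy as np

def softmax(z):
    # Subtract the row maximum for numerical stability.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical pre-activation outputs of the final layer (one row per sample).
logits = np.array([[0.2, 1.5], [-1.0, 0.3], [2.0, -0.5]])

# Probability of class 1 computed both ways.
p_class1_softmax = softmax(logits)[:, 1]
p_class1_sigmoid = sigmoid(logits[:, 1] - logits[:, 0])
print(np.allclose(p_class1_softmax, p_class1_sigmoid))
```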

In [41]:
X = X_normalized

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

print(X_train.shape)

(2293, 9)
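
The split above is purely random. When the classes are imbalanced, passing `stratify` to `train_test_split` keeps the class proportions the same in both parts. The sketch below illustrates this on made-up data (`X_toy` and `y_toy` are hypothetical); it is an optional variant, not what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 60 of class 0 and 40 of class 1.
X_toy = np.arange(200, dtype=float).reshape(100, 2)
y_toy = np.array([0] * 60 + [1] * 40)

# stratify=y_toy preserves the 60/40 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```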


In [42]:
model.fit(X_train, Y_train, epochs = 1000)

Epoch 1/1000
72/72 ━━━━━━━━━━━━━━━━━━━━ 1s 1ms/step - accuracy: 0.5521 - loss: 0.6956
Epoch 2/1000
72/72 ━━━━━━━━━━━━━━━━━━━━ 0s 814us/step - accuracy: 0.6004 - loss: 0.6764
Epoch 3/1000
72/72 ━━━━━━━━━━━━━━━━━━━━ 0s 798us/step - accuracy: 0.6088 - loss: 0.6778
Epoch 4/1000
72/72 ━━━━━━━━━━━━━━━━━━━━ 0s 775us/step - accuracy: 0.5871 - loss: 0.6801
Epoch 5/1000
72/72 ━━━━━━━━━━━━━━━━━━━━ 0s 748us/step - accuracy: 0.5970 - loss: 0.6819
Epoch 6/1000
72/72 ━━━━━━━━━━━━━━━━━━━━ 0s 759us/step - accuracy: 0.6043 - loss: 0.6761
Epoch 7/1000
72/72 ━━━━━━━━━━━━━━━━━━━━ 0s 748us/step - accuracy: 0.5872 - loss: 0.6848
Epoch 8/1000
72/72 ━━━━━━━━━━━━━━━━━━━━ 0s 804us/step - accuracy: 0.5921 - loss: 0.6804
Epoch 9/1000
72/72

<keras.src.callbacks.history.History at 0x143fce690>

In [43]:
model.evaluate(X_test, Y_test)

31/31 ━━━━━━━━━━━━━━━━━━━━ 0s 790us/step - accuracy: 0.6895 - loss: 0.5986


[0.5977222323417664, 0.6795523762702942]
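
`model.evaluate` returns the loss followed by the compiled metrics, so the list above reads as a test loss of about 0.598 and a test accuracy of about 0.680. An accuracy figure is easier to judge next to a trivial baseline that always predicts the most frequent class. The sketch below computes that baseline on made-up labels (`y_toy` is hypothetical; in the notebook the same computation would use `Y_test`).

```python
import numpy as np

# Hypothetical test labels; values are illustrative only.
y_toy = np.array([0, 0, 0, 1, 0, 1, 0, 1, 0, 0])

# Count each class, then score a predictor that always outputs the majority class.
counts = np.bincount(y_toy)
majority_class = counts.argmax()
baseline_accuracy = counts.max() / y_toy.size
print(majority_class, baseline_accuracy)
```

If the model's test accuracy is only slightly above this baseline, the network has learned little beyond the class frequencies.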