<a href="https://www.kaggle.com/code/gokaysirin/water-quality-analysis-prediction?scriptVersionId=200136215" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Welcome to the Water Quality Dataset!
* In this dataset, we will analyze and predict water quality (specifically, whether it is drinkable or non-drinkable) using various metrics.

* It must be said, it is a great shame for humanity that in the 21st century, there are still people without access to clean water. Let’s hope this issue is resolved as soon as possible!

## A Quick Overview of the Data

#### Water Quality Parameters Summary

1. **pH Value**:  
   pH indicates the acid-base balance of water. WHO recommends a pH range of 6.5 to 8.5. Current investigation shows values between 6.52 and 6.83, which are within this range.

2. **Hardness**:  
   Hardness is caused by dissolved calcium and magnesium salts and affects how well water lathers with soap. How hard the water becomes depends on how long it stays in contact with geological deposits.

3. **Total Dissolved Solids (TDS)**:  
   TDS measures dissolved inorganic and organic minerals. A high TDS value indicates highly mineralized water. The desirable limit is 500 mg/L, with a maximum of 1000 mg/L for drinking purposes.

4. **Chloramines**:  
   Formed when ammonia is added to chlorine, chloramines disinfect water. Levels up to 4 mg/L are safe for drinking.

5. **Sulfate**:  
   Sulfates are naturally occurring and prevalent in soil and rocks. Concentrations in freshwater range from 3 to 30 mg/L, with some areas having higher levels up to 1000 mg/L.

6. **Conductivity**:  
   Electrical conductivity increases with the concentration of dissolved ions. WHO recommends a maximum conductivity of 400 μS/cm.

7. **Total Organic Carbon (TOC)**:  
   TOC measures carbon from organic compounds. For drinking water, TOC should be less than 2 mg/L, with source water having up to 4 mg/L.

8. **Trihalomethanes (THMs)**:  
   Formed during chlorine treatment, THMs should not exceed 80 µg/L (80 ppb, the US EPA limit for total THMs) in drinking water.

9. **Turbidity**:  
   A measure of water clarity, affected by suspended solids. WHO recommends a turbidity value below 5 NTU, with 0.98 NTU observed in the study.

10. **Potability**:  
    Indicates if water is safe for consumption, where 1 means potable and 0 means not potable.

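The limits quoted above can be turned into a quick rule-based check. The sketch below is only illustrative: it uses a tiny hand-made DataFrame rather than the real file, and the column names `Solids`, `Chloramines`, and `Turbidity` are assumptions (only `ph`, `Sulfate`, `Trihalomethanes`, and `Potability` are confirmed by the notebook's own code).

```python
import pandas as pd

# Limits taken from the parameter summary above: (lower, upper), None = no bound.
LIMITS = {
    "ph": (6.5, 8.5),            # WHO recommended range
    "Solids": (None, 1000.0),    # TDS maximum in mg/L (column name assumed)
    "Chloramines": (None, 4.0),  # mg/L (column name assumed)
    "Turbidity": (None, 5.0),    # NTU (column name assumed)
}

def within_limits(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a boolean frame: True where a value respects its limit."""
    checks = {}
    for col, (low, high) in LIMITS.items():
        ok = pd.Series(True, index=frame.index)
        if low is not None:
            ok &= frame[col] >= low
        if high is not None:
            ok &= frame[col] <= high
        checks[col] = ok
    return pd.DataFrame(checks)

# Made-up samples, purely for illustration.
toy = pd.DataFrame({
    "ph": [7.0, 5.9, 8.1],
    "Solids": [450.0, 800.0, 1500.0],
    "Chloramines": [3.2, 4.5, 2.0],
    "Turbidity": [3.9, 4.1, 6.0],
})

flags = within_limits(toy)
all_ok = flags.all(axis=1)  # True only if every checked parameter is in range
print(flags)
print(all_ok.tolist())
```

A check like this is not a substitute for the `Potability` label; it is just a convenient way to see which samples break which guideline.
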

## What are We Going to Do?
* Data Preprocessing
* Missing Data Analysis
* Data Visualization
* Model Building with sklearn Library

# Data Preprocessing

## Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score,classification_report,ConfusionMatrixDisplay,precision_score,confusion_matrix
import plotly.express as px
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV,RepeatedStratifiedKFold,cross_val_score
import missingno as msno
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier

## Import the Dataset and Take a Look

In [None]:
df = pd.read_csv("/kaggle/input/water-potability/water_potability.csv")
df.head()

* It seems that we have some missing data, so we’ve got some work ahead of us! :)


In [None]:
df.describe()

* The columns have different ranges. For example, the Trihalomethanes column ranges from 0.73 to 124, while the pH column has values between 0 and 14. In the upcoming steps, we will scale these values so that all features are on a comparable range.

## Missing Value Analysis

In [None]:
df.info()

* We have missing data in the pH, Sulfate, and Trihalomethanes columns. We will fill these gaps with each column's mean. This keeps the column means unchanged, although it does reduce each column's variance and can slightly weaken its correlations with the other columns.

In [None]:
df["ph"] = df["ph"].fillna(value=df["ph"].mean())
df["Sulfate"] = df["Sulfate"].fillna(value=df["Sulfate"].mean())
df["Trihalomethanes"] = df["Trihalomethanes"].fillna(value=df["Trihalomethanes"].mean())

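A small sketch of what mean imputation does to a column, using a made-up Series rather than the real data: the mean is preserved, but the spread shrinks because every filled value sits exactly at the centre.

```python
import numpy as np
import pandas as pd

# Made-up column with two missing entries (not taken from the dataset).
s = pd.Series([6.0, 7.0, np.nan, 8.0, np.nan, 9.0])

# Same pattern as in the cell above: fill gaps with the column mean.
filled = s.fillna(s.mean())

print("missing before:", s.isna().sum(), "after:", filled.isna().sum())
print("mean before:", s.mean(), "after:", filled.mean())
print("std before:", round(s.std(), 3), "after:", round(filled.std(), 3))
```
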
* Just to be safe, let's double-check!

In [None]:
df.info()

* Everything is ready, let’s move forward!

# Data Visualization

In [None]:
plt.figure(figsize=(8, 6))
ax = sns.countplot(data=df, x="Potability", palette="deep")

# Calculate percentages
total = len(df)
for p in ax.patches:
    percentage = f'{100 * p.get_height() / total:.2f}%'
    ax.annotate(percentage, 
                (p.get_x() + p.get_width() / 2., p.get_height()), 
                ha = 'center', va = 'baseline', 
                fontsize = 12, color = 'black', xytext = (0, 5), 
                textcoords = 'offset points')

# Show plot
plt.title("Potability Count with Percentages")
plt.show()

* The majority of our data consists of non-drinkable water samples.

In [None]:
sns.clustermap(df.corr(), cmap="vlag",dendrogram_ratio=(0.1,0.2),annot=True,linewidth=.8,figsize=(9,10))

> The correlations between the columns are low, and there are several possible reasons for this. For instance, the water samples in our dataset might have been collected from a wide variety of different sources. A lack of variation in our dataset could also lead to the same issue. Additionally, if the analyzed data was collected at different times or locations, it could contribute to the low correlations. For example, some parameters might vary seasonally.

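It is also worth remembering that `df.corr()` reports Pearson correlation by default, which only measures linear association. A feature can be fully informative and still show a coefficient near zero. A toy illustration with synthetic numbers (not the water data):

```python
import numpy as np
import pandas as pd

# Synthetic example: y is completely determined by x, but not linearly.
x = np.linspace(-1.0, 1.0, 201)
toy = pd.DataFrame({"x": x, "y": x ** 2})

# Pearson correlation is close to zero despite the exact relationship.
pearson = toy["x"].corr(toy["y"])
print(f"Pearson correlation between x and x^2: {abs(pearson):.3f}")
```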
In [None]:
potability_zero = df.query("Potability == 0")
potable = df.query("Potability == 1")

plt.figure(figsize = (15,15))
for i, col in enumerate(df.columns[:9]):  # the first nine columns are the features
    plt.subplot(3,3,i+1)
    plt.title(col)
    sns.kdeplot(x=potability_zero[col],label = "Non Potable")
    sns.kdeplot(x = potable[col],label = "Potable")
    plt.legend()
plt.tight_layout()

> For several columns, the distributions for drinkable and non-drinkable water are almost identical. Fortunately, a few features, such as pH and Sulfate, look like they might help our model.

In [None]:
sns.scatterplot(x="ph",y="Potability",data=df)

# Modelling

## Train Test Split

In [None]:
X = df.drop("Potability",axis=1)
y = df["Potability"]

In [None]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=13)

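Because the two classes are not balanced, it can help to keep their proportions the same in both splits. `train_test_split` supports this through its `stratify` argument. A sketch on synthetic labels (the 60/40 ratio is made up for the example):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 600 negatives and 400 positives (illustrative ratio).
rng = np.random.default_rng(13)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([0] * 600 + [1] * 400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=13, stratify=y_demo
)

# With stratification, both splits keep the same share of positives.
print("train positive rate:", y_tr.mean())
print("test positive rate: ", y_te.mean())
```
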
## Scaling

In [None]:
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the min/max learned from the training set

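When scaling, the minimum and maximum should be learned from the training data only and then reused for the test data; otherwise the two sets end up on different scales. A toy sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data.
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.0], [4.0]])

scaler_demo = MinMaxScaler()
train_scaled = scaler_demo.fit_transform(train)  # learns min=0, max=10
test_scaled = scaler_demo.transform(test)        # reuses the training min/max

# Refitting on the test set would instead map 2 -> 0 and 4 -> 1,
# which no longer matches the scale the model was trained on.
refit_scaled = MinMaxScaler().fit_transform(test)

print(test_scaled.ravel())
print(refit_scaled.ravel())
```
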
## Model Building

In [None]:
models =[
    ("Decision Tree Classifier", DecisionTreeClassifier(max_depth=3)),
    ("Random Forest", RandomForestClassifier())
]

In [None]:
finalResults = []

cmList = []

for name, model in models:
    model.fit(X_train_scaled,y_train)
    model_result = model.predict(X_test_scaled)
    score = precision_score(y_test,model_result)
    cm = confusion_matrix(y_test, model_result)

    finalResults.append((name,score))
    cmList.append((name,cm))
finalResults

* Although these are not the worst results in the world, they are definitely not great either. Let's first take a look at the Confusion Matrix, and then we’ll work on improving our model.

In [None]:
for name, i in cmList:
    plt.figure()
    sns.heatmap(i,annot=True,linewidths=0.7,fmt="d")  # counts are integers
    plt.title(name)
    plt.show()

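For reading these matrices: scikit-learn puts true labels on the rows and predicted labels on the columns, so for a binary problem the cells are TN, FP, FN, TP in row-major order. Precision, the score used above, is TP / (TP + FP). A small sketch with hand-written labels:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-written labels, purely illustrative.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision_by_hand = tp / (tp + fp)
recall_by_hand = tp / (tp + fn)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("precision:", precision_by_hand, precision_score(y_true, y_pred))
print("recall:   ", recall_by_hand, recall_score(y_true, y_pred))
```
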
* Let's try to build something better with Random Forest

## Hyperparameter Search

In [None]:
model_params = {
    "Random Forest" :
    {
        "model": RandomForestClassifier(),
        "params":
        {
            "n_estimators":[10,50,100,200,500],
            # "auto" was removed in scikit-learn 1.3 (for classifiers it meant "sqrt")
            "max_features":["sqrt","log2"],
            "max_depth":list(range(1,15,3))
        }
    }
}
model_params

In [None]:
cv = RepeatedStratifiedKFold(n_splits=5,n_repeats=2)
scores=[]
for model_name,params in model_params.items():
    rs = RandomizedSearchCV(params["model"],params["params"],cv=cv,n_iter=10)
    rs.fit(X,y)
    scores.append([model_name,dict(rs.best_params_),rs.best_score_])
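scores

> A slight improvement, but still not a result we can call very good. Keep in mind that `best_score_` here is the mean cross-validated accuracy (the default scoring for classifiers), whereas the earlier scores were test-set precision, so the two numbers are not directly comparable. Although trying different models could take us further, we can stop here for now.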

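To put a tuned forest on the same footing as the earlier test-set precision, one option is to run the search on the training split only and then score the best estimator on the held-out split. The sketch below does this on synthetic data from `make_classification` (an assumption made to keep the example self-contained, not the water data), with a deliberately small search space so it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the real features and labels.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=9, n_informative=4, random_state=13
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=13, stratify=y_demo
)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=13),
    param_distributions={
        "n_estimators": [10, 50],
        "max_features": ["sqrt", "log2"],
        "max_depth": [1, 4, 7],
    },
    n_iter=4,
    cv=3,
    scoring="precision",  # same metric as the earlier comparison
    random_state=13,
)
search.fit(X_tr, y_tr)  # the test split is never seen during the search

# best_estimator_ is refit on the whole training split by default.
test_precision = precision_score(y_te, search.best_estimator_.predict(X_te))
print(search.best_params_)
print(f"CV precision: {search.best_score_:.3f}, test precision: {test_precision:.3f}")
```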
In [None]:
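# `cm` still holds the last matrix computed in the training loop,
# i.e. the untuned Random Forest evaluated on the test set.
plt.figure(figsize=(6,5))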
sns.heatmap(cm, annot=True, fmt='g', cmap='Blues')
plt.title("Confusion Matrix - Random Forest")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

# Conclusion

We can’t say we wrote the best code in the world, but we’ve definitely made progress. Much better results can be achieved with different models and approaches. If you have any experiments or feedback, I’m here and happy to assist. Stay healthy!