# Phase 3 – Final Report: Water Potability (IT326)

This notebook combines the work of **Phase 1, Phase 2, and Phase 3** in a single, clean file:

- Introduces the **problem**, **data mining task**, and **dataset**.
- Shows the main **data preprocessing** pipeline (once, without duplicated code).
- Applies **classification** (Decision Trees).
- Applies **clustering** (K-Means) as required.
- Summarizes **evaluation, comparison, and findings**.

> **Note:** You can adjust the text (problem description, references, etc.) to match your report and research paper exactly.


## [1] Problem

Access to **safe drinking water** is a critical global challenge. Contaminated water can cause serious health problems such as gastrointestinal diseases, neurological disorders, and other long-term complications.

In this project, we work with a water quality dataset to decide whether a given water sample is **potable (safe to drink)** or **not potable** based on measured chemical and physical attributes. Our goal is to support decision-makers and water treatment facilities by:

- **Predicting** if a water sample is potable or not.
- **Discovering patterns / groups** of water samples using clustering.


## [2] Data Mining Task

We consider two main data mining tasks:

1. **Classification**  
   - Task: Predict whether a water sample is **potable** (`1`) or **not potable** (`0`).  
   - Technique: **Decision Tree classifier** using two splitting criteria:
     - Gini index
     - Information Gain (entropy)

2. **Clustering**  
   - Task: Group water samples into clusters to understand hidden structure in the data.  
   - Technique: **K-Means** algorithm with several values of K, using:
     - **Elbow method** (based on within-cluster sum of squares / inertia)
     - **Silhouette coefficient** for cluster quality


## [3] Data

The dataset contains measurements for multiple water samples. Each row represents **one sample**, and each column is a measured attribute.

Typical attributes include (depending on the original dataset version):

- **pH** – acidity/basicity of the water.
- **Hardness** – amount of dissolved calcium and magnesium.
- **Solids** – total dissolved solids.
- **Chloramines**
- **Sulfate**
- **Conductivity**
- **Organic carbon**
- **Trihalomethanes**
- **Turbidity**
- **Potability** – **target label** (1 = potable, 0 = not potable).
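
Because the attribute list above depends on the dataset version, a quick schema check can catch renamed or missing columns before any preprocessing. The sketch below is only an illustration: the column names in `EXPECTED_COLUMNS` are an assumption (the common Kaggle naming), and `check_schema` and the tiny `demo` frame are hypothetical helpers, not part of the project code.

```python
import pandas as pd

# Assumed column names (common Kaggle naming); adjust to match your file.
EXPECTED_COLUMNS = [
    "ph", "Hardness", "Solids", "Chloramines", "Sulfate",
    "Conductivity", "Organic_carbon", "Trihalomethanes",
    "Turbidity", "Potability",
]

def check_schema(frame, expected=EXPECTED_COLUMNS, target="Potability"):
    """Report missing columns, unexpected columns, and non-binary target values."""
    missing = [c for c in expected if c not in frame.columns]
    extra = [c for c in frame.columns if c not in expected]
    bad_labels = []
    if target in frame.columns:
        # Any label other than 0 or 1 is flagged (NaN is ignored here).
        bad_labels = sorted(set(frame[target].dropna().unique()) - {0, 1})
    return {"missing": missing, "extra": extra, "bad_labels": bad_labels}

# Tiny hypothetical frame, only to show the call.
demo = pd.DataFrame({
    "ph": [7.1, None],
    "Hardness": [180.0, 210.5],
    "Potability": [0, 1],
})
report = check_schema(demo)
print(report)
```

With the real data, the call would be `check_schema(df_raw)` once the dataset has been loaded.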

In the next cell, we load the dataset from the project repository and briefly inspect its structure.


In [None]:
# === Imports ===
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    ConfusionMatrixDisplay,
    silhouette_score
)
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

plt.rcParams['figure.figsize'] = (8, 5)

# === Load raw dataset from GitHub ===
url = "https://raw.githubusercontent.com/lin-010/IT326-Water-Potability/refs/heads/main/Dataset/Raw_dataset.csv"
df_raw = pd.read_csv(url)

print("Raw dataset shape:", df_raw.shape)
df_raw.head()


In [None]:
# Quick info and missing values

print("\nDataset info:")
df_raw.info()

print("\nMissing values per column:")
print(df_raw.isnull().sum())


In [None]:
# Basic statistical summary
df_raw.describe()


## [4] Data Preprocessing

In this section, we prepare the data for both **classification** and **clustering**.

Main steps:

1. **Handle missing values**  
   - For numeric attributes, we replace missing values with the **median** of each column.
   - The median is less sensitive to extreme values than the mean, so outliers have little influence on the imputed value.

2. **Ensure correct data types**  
   - The target column `Potability` is kept as an integer/binary label.

3. **Create two versions of the dataset**  
   - `df_clean`: cleaned dataset (after handling missing values).  
   - `X_clust`: feature matrix (only input attributes, without the target) to be used for clustering.  
   - For clustering, we will also apply **standardization** using `StandardScaler`.

> If you already used a different preprocessing strategy in Phase 2 (e.g., other imputation, removing outliers), you can replace the code below with your exact steps. The important point is that the preprocessing is written **once** in this notebook and then reused for both classification and clustering.
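
The notebook computes the medians on the full dataset before splitting. One possible alternative, shown here only as a sketch and not as the method used in this project, is to learn the medians from the training part and apply them to the test part, so that no test-set statistics reach the model. The toy frame, its column names, and the 20% missing rate are all made-up assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the water-quality features.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "ph": rng.normal(7.0, 1.0, 200),
    "Sulfate": rng.normal(330.0, 40.0, 200),
})
# Blank out 20% of the ph values to simulate missing measurements.
toy.loc[toy.sample(frac=0.2, random_state=0).index, "ph"] = np.nan
labels = pd.Series(rng.integers(0, 2, 200), name="Potability")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy, labels, test_size=0.2, stratify=labels, random_state=42
)

# Fit the medians on the training part only, then reuse them on the test part.
imputer = SimpleImputer(strategy="median")
X_tr_imp = pd.DataFrame(imputer.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_imp = pd.DataFrame(imputer.transform(X_te), columns=X_te.columns, index=X_te.index)

print("Medians learned from the training part:",
      dict(zip(toy.columns, imputer.statistics_)))
print("Missing values left (train, test):",
      int(X_tr_imp.isna().sum().sum()), int(X_te_imp.isna().sum().sum()))
```

Whether this matters in practice depends on how many values are missing; for a small share of missing entries the two approaches usually give very similar medians.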


In [None]:
# Copy the raw dataset for preprocessing
df = df_raw.copy()

# Name of the target column (adjust if your column name is different)
target_col = "Potability"

# Check that the target column exists
if target_col not in df.columns:
    raise ValueError(f"Target column '{target_col}' not found. Please update 'target_col' to match your dataset.")

# Handle missing values: fill numeric columns with their median
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()

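for col in numeric_cols:
    median_value = df[col].median()
    # Assign the result back: calling fillna(inplace=True) on a column selection
    # is deprecated in recent pandas and does not modify df under Copy-on-Write.
    df[col] = df[col].fillna(median_value)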

# (Optional) If there are non-numeric columns you decided to drop or encode,
# you can handle them here. For this dataset we expect mostly numeric columns.

# Ensure target is integer (0/1)
df[target_col] = df[target_col].astype(int)

print("Cleaned dataset shape:", df.shape)
print("Any remaining missing values?", df.isnull().sum().any())
df.head()


In [None]:
# Separate features (X) and target (y) for classification
X = df.drop(columns=[target_col])
y = df[target_col]

print("Features shape:", X.shape)
print("Target shape:", y.shape)
print("\nClass distribution:")
print(y.value_counts())


## [5] Data Mining Technique

### 5.1 Classification – Decision Trees

For the classification task, we use the **DecisionTreeClassifier** from the `sklearn.tree` package.

We compare two attribute selection measures:

- **Gini index** (default in scikit-learn)
- **Information Gain (entropy)**

For each criterion, we train and evaluate models using three train-test partitions:

- 90% train – 10% test
- 80% train – 20% test
- 70% train – 30% test

For each model we compute:

- Accuracy
- Confusion matrix

At the end, we identify the **best model** and visualize its decision tree.
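
To make the two attribute selection measures concrete, the sketch below computes Gini impurity and entropy by hand for one candidate split. The functions `gini`, `entropy`, and `impurity_decrease` and the eight-sample example are illustrative assumptions; scikit-learn performs the equivalent computation internally when it searches for the best split.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k * log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def impurity_decrease(parent, left, right, measure):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * measure(left) + len(right) / n * measure(right)
    return measure(parent) - weighted

# Hypothetical node with 4 non-potable (0) and 4 potable (1) samples,
# split into two children of 4 samples each.
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left = np.array([0, 0, 0, 1])
right = np.array([0, 1, 1, 1])

gini_gain = impurity_decrease(parent, left, right, gini)
info_gain = impurity_decrease(parent, left, right, entropy)
print(f"Gini decrease:    {gini_gain:.4f}")   # 0.1250
print(f"Information gain: {info_gain:.4f}")   # 0.1887
```

Both measures reward the same kind of split (purer children), which is why the two criteria often produce similar trees and similar accuracy.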


In [None]:
# Train and evaluate Decision Trees with different criteria and train-test splits

splits = [
    (0.9, 0.1),
    (0.8, 0.2),
    (0.7, 0.3),
]

criteria = ["gini", "entropy"]

results_clf = []

for train_size, test_size in splits:
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        train_size=train_size,
        stratify=y,
        random_state=42
    )
    
    for crit in criteria:
        clf = DecisionTreeClassifier(
            criterion=crit,
            random_state=42
        )
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)

        acc = accuracy_score(y_test, y_pred)
        cm = confusion_matrix(y_test, y_pred)

        results_clf.append({
            "criterion": crit,
            "train_size": f"{int(train_size*100)}%",
            "test_size": f"{int(test_size*100)}%",
            "accuracy": acc,
            "confusion_matrix": cm,
            "model": clf,
            "X_test": X_test,
            "y_test": y_test
        })

# Summary table with accuracies
summary_rows = [
    {
        "Criterion": r["criterion"],
        "Train %": r["train_size"],
        "Test %": r["test_size"],
        "Accuracy": round(r["accuracy"], 4)
    }
    for r in results_clf
]

clf_results_df = pd.DataFrame(summary_rows)
clf_results_df


In [None]:
# Plot confusion matrices for each trained model

for r in results_clf:
    cm = r["confusion_matrix"]
    clf = r["model"]

    disp = ConfusionMatrixDisplay(
        confusion_matrix=cm,
        display_labels=clf.classes_
    )
    disp.plot()
    plt.title(
        f"Confusion Matrix – criterion={r['criterion']}, "
        f"train={r['train_size']}, test={r['test_size']}"
    )
    plt.show()


In [None]:
# Find and visualize the best performing model

best_model = max(results_clf, key=lambda r: r["accuracy"])
best_clf = best_model["model"]

print("Best model parameters:")
print("Criterion:", best_model["criterion"])
print("Train size:", best_model["train_size"])
print("Test size:", best_model["test_size"])
print("Accuracy:", round(best_model["accuracy"], 4))

plt.figure(figsize=(16, 8))
plot_tree(
    best_clf,
    feature_names=list(X.columns),  # plain list: some scikit-learn versions reject a pandas Index
    class_names=["Not potable", "Potable"],
    filled=True,
    fontsize=8
)
plt.title("Decision Tree – Best Model")
plt.show()


### 5.2 Clustering – K-Means

For the clustering task, we use the **K-Means** algorithm from the `sklearn.cluster` package.

Steps:

1. Use all **feature columns** (without the target label) as input.
2. Apply **standardization** using `StandardScaler` so that all attributes have similar scale.
3. Try several values of **K** (number of clusters), for example: 2, 3, 4, 5.
4. For each K, compute:
   - **Total within-cluster sum of squares (inertia)** – used in the **Elbow method**.
   - **Average silhouette coefficient** – measures cluster quality.
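5. Choose the best K: the code below selects the K with the highest average silhouette score, and the Elbow plot serves as a visual check.
6. Visualize the clusters in 2D using the first two feature columns. Because clustering uses all standardized features, the groups can overlap in this 2D view.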
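
The two quality measures from step 4 can be reproduced by hand, which helps when explaining them in the report. The sketch below uses synthetic blobs rather than the water data (an assumption made so the example is self-contained); `mean_silhouette` is a hypothetical helper written for illustration, and in practice `silhouette_score` should be preferred.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized feature matrix.
points, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
points_scaled = StandardScaler().fit_transform(points)

model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(points_scaled)
assigned = model.labels_

# Inertia: sum of squared distances from each point to its own centroid.
manual_inertia = float(np.sum((points_scaled - model.cluster_centers_[assigned]) ** 2))

def mean_silhouette(distances, cluster_ids):
    """Average of s(i) = (b - a) / max(a, b) over all samples."""
    scores = []
    for i, own in enumerate(cluster_ids):
        same = cluster_ids == own
        same[i] = False                      # exclude the sample itself
        if not same.any():                   # singleton cluster: s(i) = 0 by convention
            scores.append(0.0)
            continue
        a = distances[i, same].mean()        # mean distance within own cluster
        b = min(distances[i, cluster_ids == other].mean()   # nearest other cluster
                for other in np.unique(cluster_ids) if other != own)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

manual_silhouette = mean_silhouette(pairwise_distances(points_scaled), assigned)

print("Inertia    (manual vs sklearn):", round(manual_inertia, 4), round(model.inertia_, 4))
print("Silhouette (manual vs sklearn):", round(manual_silhouette, 4),
      round(silhouette_score(points_scaled, assigned), 4))
```

Inertia always decreases as K grows, so it is only useful through the shape of the Elbow curve, while the silhouette coefficient lies between -1 and 1 and can be compared directly across K values.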


In [None]:
# Prepare data for clustering

# X_clust = feature matrix for clustering (without the target label)
X_clust = df.drop(columns=[target_col])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_clust)

print("Shape of X_scaled:", X_scaled.shape)


In [None]:
# Try multiple values of K

ks = [2, 3, 4, 5]
silhouette_scores = []
inertias = []
kmodels = {}

for k in ks:
    kmeans = KMeans(
        n_clusters=k,
        n_init=10,
        random_state=42
    )
    labels = kmeans.fit_predict(X_scaled)

    inertia = kmeans.inertia_
    inertias.append(inertia)

    sil_score = silhouette_score(X_scaled, labels)
    silhouette_scores.append(sil_score)

    kmodels[k] = {
        "model": kmeans,
        "labels": labels,
        "silhouette": sil_score,
        "inertia": inertia
    }

# Summary table for clustering results
clust_summary = pd.DataFrame({
    "K": ks,
    "Average Silhouette": np.round(silhouette_scores, 4),
    "Total Within-Cluster SS (Inertia)": np.round(inertias, 2)
})
clust_summary


In [None]:
# Elbow plot (Inertia vs K)

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Total within-cluster sum of squares (Inertia)")
plt.title("Elbow Method for K-Means")
plt.grid(True)
plt.show()


In [None]:
# Silhouette coefficient vs K

plt.plot(ks, silhouette_scores, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Average Silhouette Coefficient")
plt.title("Silhouette Coefficient for different K")
plt.grid(True)
plt.show()


In [None]:
# Choose best K based on silhouette score

best_k_index = int(np.argmax(silhouette_scores))
best_k = ks[best_k_index]

print("Best K based on silhouette score:", best_k)
print("\nClustering summary:")
print(clust_summary)


In [None]:
# Visualize clusters in 2D using two selected features

feature_names = X_clust.columns
feat_x = feature_names[0]
feat_y = feature_names[1]

for k in ks:
    labels = kmodels[k]["labels"]
    
    plt.scatter(
        X_clust[feat_x],
        X_clust[feat_y],
        c=labels,
        s=10
    )
    plt.xlabel(feat_x)
    plt.ylabel(feat_y)
    plt.title(f"K-Means Clusters (K={k}) using {feat_x} vs {feat_y}")
    plt.show()


## [6] Evaluation and Comparison

### 6.1 Classification Results and Discussion

Using Decision Trees, we compared:

- **Criteria**: Gini vs Entropy.
- **Train–Test splits**: 90–10, 80–20, 70–30.

From the results table:

- We can identify which **criterion** gives higher accuracy overall.
- We can check whether a larger test set (80–20 or 70–30) gives a more stable accuracy estimate than 90–10, where the test set is small and the estimate varies more from one random split to another.
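
One way to examine stability, sketched here under assumptions and not part of the required experiments, is to repeat each split ratio with several random seeds and compare the spread of the accuracies. The data comes from `make_classification` purely as a placeholder; with the real notebook, `X` and `y` would be used instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder data with 9 features, mimicking the shape of the water dataset.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=9, n_informative=4, random_state=0
)

stability = {}
for test_size in (0.1, 0.2, 0.3):
    # 20 different stratified random splits for the same train/test ratio.
    splitter = StratifiedShuffleSplit(n_splits=20, test_size=test_size, random_state=42)
    scores = cross_val_score(
        DecisionTreeClassifier(random_state=42),
        X_demo, y_demo, cv=splitter, scoring="accuracy",
    )
    stability[test_size] = (scores.mean(), scores.std())

for test_size, (mean_acc, std_acc) in stability.items():
    print(f"test={int(round(test_size * 100))}%  "
          f"mean accuracy={mean_acc:.3f}  std={std_acc:.3f}")
```

A smaller standard deviation indicates that the reported accuracy depends less on which samples happened to land in the test set.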

When discussing the results in the PDF report, it is important to:

- Highlight the **best model** (criterion, split, accuracy).
- Comment on the **confusion matrices** (e.g., more false positives vs false negatives).
- Compare the obtained accuracy with the results reported in the **research paper** (even if it is only an approximate comparison).
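
When commenting on the confusion matrices, it can help to derive precision and recall for the potable class, since accuracy alone hides which kind of error dominates. The following sketch uses ten made-up labels (an assumption for illustration only) and relies on scikit-learn's layout, where rows are true classes and columns are predicted classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 6 non-potable (0) and 4 potable (1) samples.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

# For labels [0, 1], ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the samples predicted potable, how many really are
recall = tp / (tp + fn)      # of the truly potable samples, how many were found

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# accuracy=0.60 precision=0.50 recall=0.25
```

In this toy case the accuracy looks moderate while recall is low. For water safety, a false positive (unsafe water predicted as potable) is arguably the more costly error, so precision on the potable class deserves attention in the discussion.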

### 6.2 Clustering Results and Discussion

From the clustering summary:

- We compare the **Average Silhouette** scores across different K values.
- We inspect the **Elbow plot** to see where the decrease in inertia starts to slow down.
- We select the **best K** according to the silhouette and Elbow method.

In the discussion, we can mention:

- Why we chose a specific K as the final number of clusters.
- How the clusters differ in terms of some attributes (for example, one cluster may correspond to samples with higher solids or different pH values).
- Any relation between the clusters and the `Potability` label (even if the label is not used during clustering).
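
The last two points can be supported with a per-cluster profile and a cross-tabulation against the label. The sketch below is a minimal illustration on made-up data (two artificial groups with different pH and solids values, and random labels); the frame `demo` and its values are assumptions, and with the real notebook the cluster labels would come from `kmodels[best_k]["labels"]` and the data from `df`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up data: two groups that differ in pH and total dissolved solids.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "ph": np.concatenate([rng.normal(6.0, 0.3, 50), rng.normal(8.0, 0.3, 50)]),
    "Solids": np.concatenate([rng.normal(15000, 1500, 50), rng.normal(28000, 1500, 50)]),
    "Potability": rng.integers(0, 2, 100),   # random labels, unrelated to the groups
})

# Cluster on the standardized features only; the label is not used.
features = demo.drop(columns=["Potability"])
scaled = StandardScaler().fit_transform(features)
demo["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(scaled)

# Mean attribute values per cluster (the cluster "profile").
profiles = demo.groupby("cluster")[["ph", "Solids"]].mean()

# Share of non-potable / potable samples inside each cluster (rows sum to 1).
label_share = pd.crosstab(demo["cluster"], demo["Potability"], normalize="index")

print(profiles.round(2))
print(label_share.round(2))
```

If the label shares are nearly identical across clusters, the clusters describe structure that is largely unrelated to potability, which is itself a finding worth reporting.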


## [7] Findings and Discussion

In this section you summarize the main insights from both classification and clustering. For example:

- **Classification**
  - Decision Trees achieved an accuracy of around *X%* using the best configuration.
  - The most important attributes (according to the tree structure) may include features such as **pH**, **Solids**, **Sulfate**, etc.
  - Compared to the research paper, our results are (similar / slightly lower / higher), possibly due to differences in preprocessing, parameter settings, or dataset size.

- **Clustering**
  - K-Means discovered *K* meaningful clusters in the data.
  - Clusters show different profiles of water quality attributes (e.g., one cluster with lower turbidity and higher pH).
  - Clustering can help understand natural groupings of water samples and may support decision-making about which samples require treatment.

Finally, you can state which technique (classification vs clustering) is more useful for the **end user** (e.g., a water treatment engineer):
- Classification is useful for **predicting potability** of new samples.
- Clustering is useful for **exploratory analysis** and understanding structure in the data.
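
To back up a statement about the most important attributes, the fitted tree's impurity-based importances can be listed. This is only a sketch on placeholder data from `make_classification`, with hypothetical feature names; in the notebook the equivalent would be `best_clf.feature_importances_` paired with `X.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: 5 features, of which 2 carry signal.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
names = [f"feature_{i}" for i in range(5)]   # hypothetical feature names

tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=42)
tree.fit(X_demo, y_demo)

# Impurity-based importances: normalized so that they sum to 1.
importances = pd.Series(tree.feature_importances_, index=names).sort_values(ascending=False)
print(importances.round(3))
```

Impurity-based importances describe how the tree used each attribute on the training data; they do not by themselves show that an attribute causes water to be unsafe.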


## [8] References

Add your references here in the required format (e.g., IEEE). Typical references:

1. The original **water potability dataset** (Kaggle link).
2. The **research paper** you used for comparison.
3. Any additional articles or documentation you used for the methods (e.g., scikit-learn documentation).

Make sure the same references are used consistently across the **notebook**, the **PDF final report**, and the **presentation**.
