# Phase 3 – Integrated Data Mining Notebook  
## Water Potability Prediction (IT326)

This notebook combines the work of **Phase 1 (EDA)**, **Phase 2 (Preprocessing)**, and **Phase 3 (Modeling)** in a single clean pipeline.

---

### Structure

1. **Problem & Data Overview (Phase 1)**
2. **Exploratory Data Analysis (Phase 1)**
3. **Preprocessing Pipeline (Phase 2)**
4. **Decision Tree Classification (Phase 3 – Classification)**
5. **K-means Clustering (Phase 3 – Clustering)**
6. **Summary of Results (for PDF Report & Presentation)**


In [None]:
# Phase 3 – Setup

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA

sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (8, 5)

print("Libraries imported successfully.")

## 1. Problem & Data Overview (Phase 1)

- **Goal:** Predict whether a water sample is **potable (safe to drink)** based on its chemical characteristics.  
- **Task type:** Supervised classification (for prediction) + unsupervised clustering (to discover natural groups in the data).  
- **Dataset:** Water Potability dataset (Kaggle).  
- **Target attribute:** `Potability` (0 = not potable, 1 = potable).

### 1.1 Load Raw Dataset

In [None]:
# Load RAW dataset from GitHub
# Update the URL if the file name/path changes in the repository.

url_raw = "https://raw.githubusercontent.com/lin-010/IT326-Water-Potability/refs/heads/main/Dataset/Raw_dataset.csv"
df_raw = pd.read_csv(url_raw)

print("Raw dataset shape:", df_raw.shape)
df_raw.head()

## 2. Exploratory Data Analysis (Phase 1)

Quick inspection of:

- Data types  
- Summary statistics  
- Class distribution  
- Missing values
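
Class distribution and missing values are often easier to compare as proportions than as raw counts. The following is a minimal sketch on a small made-up frame: `toy` and its values are hypothetical, and the `ph` and `Sulfate` column names are only illustrative stand-ins for the real feature columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df_raw (all values are made up).
toy = pd.DataFrame({
    "ph": [7.1, np.nan, 6.5, 8.0, np.nan],
    "Sulfate": [330.0, 310.0, np.nan, 350.0, 340.0],
    "Potability": [0, 1, 0, 0, 1],
})

# Share of each class instead of raw counts.
class_share = toy["Potability"].value_counts(normalize=True)

# Fraction of missing entries per column.
missing_share = toy.isnull().mean()

print(class_share)
print(missing_share)
```

The same two calls can be applied to `df_raw` directly once the dataset is loaded.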

In [None]:
# Data info
df_raw.info()

In [None]:
# Summary statistics for numeric columns
df_raw.describe()

In [None]:
# Class distribution for Potability (target)
if "Potability" in df_raw.columns:
    print("Class distribution (Potability):")
    print(df_raw["Potability"].value_counts())
else:
    print("Warning: 'Potability' column not found – check dataset.")

In [None]:
# Missing values per column
df_raw.isnull().sum()

## 3. Preprocessing Pipeline (Phase 2)

We apply a single preprocessing pipeline that will be reused in both classification and clustering:

1. Work on a copy of the raw data.  
2. Handle missing numeric values using **median imputation**.  
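3. Keep the original feature scales for Decision Trees: each split compares a single feature against a threshold, so standardizing would not change which splits are chosen.  
4. Later, apply **StandardScaler** for K-means only, because K-means relies on Euclidean distances and large-scale features would otherwise dominate.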
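
Step 2 computes each median over the full dataset. A stricter variant, sketched here on made-up data, learns the median from the training rows only and reuses it for the test rows, so the test set never influences the imputed values. The `toy` frame and the hand-made split are assumptions for illustration, not part of the notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the split is fixed by hand to keep the example readable.
toy = pd.DataFrame({"ph": [6.0, 7.0, np.nan, 9.0, np.nan, 20.0]})
train, test = toy.iloc[:4].copy(), toy.iloc[4:].copy()

# Learn the median from the training rows only ...
train_median = train["ph"].median()

# ... and reuse it for both parts.
train["ph"] = train["ph"].fillna(train_median)
test["ph"] = test["ph"].fillna(train_median)

print("training median:", train_median)
print(test)
```

Whether this matters in practice depends on how many values are missing; it is mainly a safeguard against optimistic test scores.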

In [None]:
# Start preprocessing from the raw dataset
df = df_raw.copy()

# Median imputation for numeric columns
numeric_cols = df.select_dtypes(include=["float64", "int64"]).columns
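for col in numeric_cols:
    median_value = df[col].median()
    # Assign the result back: calling fillna(..., inplace=True) on a column
    # selection triggers a warning in recent pandas and may leave df unchanged.
    df[col] = df[col].fillna(median_value)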

# Check missing values after imputation
print("Missing values after median imputation:")
print(df.isnull().sum())

print("\nPreprocessed dataset shape:", df.shape)
df.head()

### 3.1 Features and Target

We now separate:

- `X` = all predictor attributes  
- `y` = class label `Potability`

In [None]:
# Separate features (X) and target (y)

target_column = "Potability"

if target_column not in df.columns:
    raise ValueError(f"Target column '{target_column}' not found in dataframe.")

X = df.drop(target_column, axis=1)
y = df[target_column]

print("Shape of features (X):", X.shape)
print("Shape of target (y):", y.shape)
print("\nClass distribution:")
print(y.value_counts())

## 4. Decision Tree Classification (Phase 3 – Classification)

We train Decision Tree models using:

- Splitting criteria: **Gini** and **Entropy**  
- Train/test splits: **90/10**, **80/20**, **70/30**  

For each configuration we compute:

- Accuracy on the test set  
- Confusion matrix  

We then identify the best model and visualize its tree structure.
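
As a reminder of what the two splitting criteria measure, here is a small sketch using the standard textbook definitions of Gini impurity and Shannon entropy. The helper names `gini` and `entropy` and the class counts are hypothetical and only serve as an illustration; they are not taken from the dataset.

```python
import math

def gini(counts):
    """Gini impurity: 1 - sum(p_i ** 2) over the class proportions p_i."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy in bits: sum(-p_i * log2(p_i)), skipping empty classes."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

# Hypothetical class counts in a tree node: [not potable, potable].
for counts in ([50, 50], [90, 10], [100, 0]):
    print(counts, "gini:", round(gini(counts), 3), "entropy:", round(entropy(counts), 3))
```

Both measures are zero for a pure node and largest for an even class mix, which is why the two criteria often lead to similar trees.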

In [None]:
# Helper function for training and evaluating a Decision Tree

def run_decision_tree(X, y, train_size, criterion, random_state=42):
    """Train a DecisionTreeClassifier and return model, accuracy, confusion matrix, and data splits."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        train_size=train_size,
        random_state=random_state,
        stratify=y
    )

    clf = DecisionTreeClassifier(criterion=criterion, random_state=random_state)
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)

    acc = accuracy_score(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)

    return clf, acc, cm, (X_train, X_test, y_train, y_test, y_pred)


train_sizes = [0.9, 0.8, 0.7]
criteria = ["gini", "entropy"]

results_clf = []       # summary table
models_clf = {}        # store models and confusion matrices

for crit in criteria:
    for ts in train_sizes:
        clf, acc, cm, data_split = run_decision_tree(X, y, train_size=ts, criterion=crit)

        results_clf.append({
            "criterion": crit,
            "train_size": ts,
            "test_size": round(1 - ts, 2),
            "accuracy": acc
        })

        key = f"{crit}_train{int(ts * 100)}"
        models_clf[key] = {
            "model": clf,
            "confusion_matrix": cm,
            "data_split": data_split
        }

results_clf_df = pd.DataFrame(results_clf)
results_clf_df.sort_values(by=["train_size", "criterion"], ascending=[False, True])
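
The `stratify=y` argument in `run_decision_tree` keeps the class ratio the same in the training and test parts. A minimal sketch with made-up labels (`X_toy` and `y_toy` are hypothetical placeholders, not the water data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 70 samples of class 0 and 30 samples of class 1.
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=42, stratify=y_toy
)

# Both parts keep the 70/30 class ratio of the full label vector.
print("train shares:", np.bincount(y_tr) / len(y_tr))
print("test shares: ", np.bincount(y_te) / len(y_te))
```

Without stratification, a small test set (such as the 10% split) can end up with a noticeably different class mix, which makes accuracies harder to compare across splits.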

In [None]:
# Confusion matrices for all configurations

for key, info in models_clf.items():
    cm = info["confusion_matrix"]

    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[0, 1])
    disp.plot(values_format='d')
    plt.title(f"Confusion Matrix - {key.replace('_', ', ')}")
    plt.show()
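
To judge which class is harder to predict, per-class recall and precision can be read directly off a confusion matrix. The sketch below uses a made-up 2x2 matrix (`cm_toy` is hypothetical, not a result from this notebook) in the layout returned by `confusion_matrix`: rows are true classes, columns are predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm_toy = np.array([[160, 40],
                   [70, 58]])

# Recall: correct predictions divided by the number of actual samples per class.
recall_per_class = cm_toy.diagonal() / cm_toy.sum(axis=1)

# Precision: correct predictions divided by the number of predictions per class.
precision_per_class = cm_toy.diagonal() / cm_toy.sum(axis=0)

# Accuracy: all correct predictions divided by all samples.
accuracy = cm_toy.trace() / cm_toy.sum()

print("recall per class:   ", recall_per_class.round(3))
print("precision per class:", precision_per_class.round(3))
print("accuracy:           ", round(float(accuracy), 3))
```

The same three lines can be applied to any `cm` stored in `models_clf`; a class with clearly lower recall is the one the tree misses more often.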

In [None]:
# Best model (highest accuracy) and tree visualization

best_idx = results_clf_df["accuracy"].idxmax()
best_row = results_clf_df.loc[best_idx]
best_key = f"{best_row['criterion']}_train{int(best_row['train_size'] * 100)}"

print("Best model settings:")
print(best_row)

best_model = models_clf[best_key]["model"]

plt.figure(figsize=(16, 8))
plot_tree(
    best_model,
    feature_names=list(X.columns),  # plain list; some scikit-learn releases reject a pandas Index
    class_names=["Not potable (0)", "Potable (1)"],
    filled=True,
    rounded=True,
    fontsize=8
)
plt.title(f"Decision Tree - Best model ({best_key})")
plt.show()
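
A full tree plot can be hard to read, so text rules and feature importances are a useful complement when describing the model. The sketch below uses `export_text` and `feature_importances_` from scikit-learn on synthetic data; `X_toy`, `y_toy`, the feature names `f0` to `f2`, and `max_depth=2` are assumptions chosen for the example, not settings from this notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: the label depends only on the first feature.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)
names = ["f0", "f1", "f2"]  # placeholder feature names

tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_toy, y_toy)

# Text rules: one line per split or leaf.
rules = export_text(tree, feature_names=names)
print(rules)

# Impurity-based importances; they sum to 1 when the tree has at least one split.
for name, imp in zip(names, tree.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

The same calls should work on `best_model` with `list(X.columns)` as the feature names.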

## 5. K-means Clustering (Phase 3 – Clustering)

We perform unsupervised clustering using **K-means** on the feature space (without the label):

1. Drop the `Potability` column.  
2. Standardize features using `StandardScaler`.  
3. Try multiple values of **K** (2, 3, 4, 5).  
4. For each K, compute:
   - Total within-cluster sum of squares (**inertia**)  
   - Average **silhouette score**  
5. Choose K using:
   - **Elbow method** (K vs inertia)  
   - **Silhouette scores** (K vs silhouette); the final choice takes the K with the highest average silhouette  
6. Visualize clusters in 2D using **PCA**. 
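
Step 2 matters because K-means compares samples by Euclidean distance. The sketch below uses three made-up samples with two features on very different scales (the values and the `dist` helper are hypothetical) and applies the z-score `(x - mean) / std` per column, which is the transformation `StandardScaler` performs.

```python
import numpy as np

# Hypothetical samples: a small-scale feature and a large-scale feature.
X_toy = np.array([
    [6.0, 20000.0],
    [9.0, 20100.0],
    [6.1, 26000.0],
])

def dist(a, b):
    """Euclidean distance between two samples."""
    return float(np.linalg.norm(a - b))

# Raw distances: the large-scale feature dominates.
raw_01, raw_02 = dist(X_toy[0], X_toy[1]), dist(X_toy[0], X_toy[2])
print("raw:         ", round(raw_01, 2), round(raw_02, 2))

# Z-score each column (population standard deviation, as in StandardScaler).
X_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
std_01, std_02 = dist(X_std[0], X_std[1]), dist(X_std[0], X_std[2])
print("standardized:", round(std_01, 2), round(std_02, 2))
```

On the raw values the second pair looks far more distant only because of the large-scale column; after standardization both features contribute on a comparable scale.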

In [None]:
# Feature space for K-means (without target)

X_features = df.drop(target_column, axis=1)

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_features)

# K-means for different K values
k_values = [2, 3, 4, 5]

kmeans_results = []
cluster_labels_dict = {}

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_scaled)

    inertia = kmeans.inertia_
    sil_score = silhouette_score(X_scaled, labels)

    kmeans_results.append({
        "K": k,
        "inertia": inertia,
        "silhouette": sil_score
    })

    cluster_labels_dict[k] = {
        "model": kmeans,
        "labels": labels
    }

kmeans_results_df = pd.DataFrame(kmeans_results)
kmeans_results_df.sort_values("K")
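
The `inertia` column is the total within-cluster sum of squares: the sum of squared distances from each point to the centroid of its own cluster. A small worked sketch with made-up points and a fixed assignment (`points` and `labels` are hypothetical):

```python
import numpy as np

# Hypothetical 2-D points with a fixed cluster assignment.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]])
labels = np.array([0, 0, 1, 1])

# Centroid of each cluster = mean of its members.
centroids = np.array([points[labels == k].mean(axis=0) for k in np.unique(labels)])

# Inertia = sum of squared distances from each point to its own centroid.
inertia = float(((points - centroids[labels]) ** 2).sum())

print("centroids:\n", centroids)
print("inertia:", inertia)
```

Inertia can only decrease as K grows, which is why the elbow plot looks for the point where the decrease flattens rather than for a minimum.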

In [None]:
# Elbow plot (K vs inertia)

plt.plot(kmeans_results_df["K"], kmeans_results_df["inertia"], marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Total within-cluster sum of squares (Inertia)")
plt.title("Elbow Method for K-means")
plt.xticks(k_values)
plt.show()

# Silhouette plot (K vs silhouette)

plt.plot(kmeans_results_df["K"], kmeans_results_df["silhouette"], marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Average Silhouette Coefficient")
plt.title("Silhouette Scores for Different K")
plt.xticks(k_values)
plt.show()

In [None]:
# Best K based on silhouette score and PCA visualization

best_k_idx = kmeans_results_df["silhouette"].idxmax()
best_k = int(kmeans_results_df.loc[best_k_idx, "K"])
print("Best K based on silhouette:", best_k)

best_labels = cluster_labels_dict[best_k]["labels"]

# PCA to 2D
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=best_labels, alpha=0.6)
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.title(f"K-means Clusters (K = {best_k}) in PCA space")
plt.show()

# Attach cluster labels to original data and summarize

df_clusters = df.copy()
df_clusters["cluster"] = best_labels

cluster_summary = df_clusters.groupby("cluster").mean(numeric_only=True)
cluster_summary
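
When interpreting the 2D PCA plot, it helps to know how much of the total variance the two components retain. The sketch below reads this from the `explained_variance_ratio_` attribute of a fitted `PCA`; the random `X_toy` matrix is a made-up stand-in for `X_scaled`, and its shape is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical standardized data: 300 samples, 9 features (random values).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 9))

pca_toy = PCA(n_components=2).fit(X_toy)

# Fraction of the total variance captured by each of the two components.
ratio = pca_toy.explained_variance_ratio_
print("per component:", ratio.round(3))
print("total:        ", round(float(ratio.sum()), 3))
```

If the total is low, clusters that overlap in the plot may still be separated in the full feature space, so the picture should be read as a rough view only.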

## 6. Summary of Results (For PDF Report & Presentation)

From this notebook you will use the following in the **final PDF report** and **presentation**:

- `results_clf_df`: Table of Decision Tree configurations (criterion, train/test split, accuracy).  
- Confusion matrix plots: To comment on which class is harder to predict.  
- Best Decision Tree visualization: To highlight important features and example rules.  
- `kmeans_results_df`: Table of K vs inertia vs silhouette to justify best K.  
- Elbow and silhouette plots: Visual justification for chosen K.  
- PCA cluster plot and `cluster_summary`: To describe cluster profiles and relate them to water quality.

In the PDF report you will:

- Explain **why** you used each method and from which **package** (e.g., `DecisionTreeClassifier` and `KMeans` from `sklearn`).  
- Compare your results (even roughly) with the **research paper** results.  
- Provide a clear **discussion and conclusion** based on these outputs.

This notebook is the **technical backbone** for Phase 3 and will be linked in the GitHub submission.
