# Phase 3 – Integrated Data Mining Notebook  
## Water Potability Prediction (IT326)

This notebook presents the three project phases in a single workflow:

1. **Phase 1 – Exploratory Data Analysis (EDA)**  
2. **Phase 2 – Data Preprocessing**  
3. **Phase 3 – Classification (Decision Trees) and Clustering (K-means)**  



## 1. Environment Setup

In [None]:
# Import required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA

sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (8, 5)

print("Libraries imported successfully.")

## 2. Problem & Data Overview (Phase 1)

- **Problem:** Predict whether a water sample is potable (safe to drink) based on its chemical attributes.  
- **Dataset:** Water Potability dataset (Kaggle).  
- **Target attribute:** `Potability` (0 = not potable, 1 = potable).  

> Goal: build data mining models that predict whether water is potable, and discover natural groups (clusters) within the data.


### 2.1 Load Raw Dataset

In [None]:
# Load RAW dataset from GitHub
# Update the URL if the path or file name changes in the repository.

url_raw = "https://raw.githubusercontent.com/lin-010/IT326-Water-Potability/refs/heads/main/Dataset/Raw_dataset.csv"
df_raw = pd.read_csv(url_raw)

print("Raw dataset shape:", df_raw.shape)
df_raw.head()

## 3. Exploratory Data Analysis (EDA) (Phase 1)

Quick inspection of:

- Data types  
- Summary statistics  
- Class distribution  
- Missing values  

> In this part we form an overall picture of the data before starting preprocessing and model building.

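The class distribution and missing-value checks listed above can also be expressed as proportions, which are easier to compare than raw counts. The sketch below is only an illustration: `toy` and its values are made up, and only the `Potability` column name comes from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the raw dataset (values are made up).
toy = pd.DataFrame({
    "ph": [7.0, np.nan, 6.5, 8.1],
    "Hardness": [200.0, 180.0, np.nan, np.nan],
    "Potability": [0, 1, 0, 0],
})

# Share of each class instead of raw counts.
class_share = toy["Potability"].value_counts(normalize=True)

# Fraction of missing values per column.
missing_share = toy.isnull().mean()

print(class_share)
print(missing_share)
```

Assuming the real frame is loaded as `df_raw`, the same two calls would apply to it unchanged.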

In [None]:
# Dataset info
df_raw.info()

In [None]:
# Summary statistics
df_raw.describe()

In [None]:
# Class distribution for Potability
if "Potability" in df_raw.columns:
    print("Class distribution (Potability):")
    print(df_raw["Potability"].value_counts())
else:
    print("Warning: 'Potability' column not found – check dataset.")

In [None]:
# Missing values per column
df_raw.isnull().sum()

## 4. Data Preprocessing (Phase 2)

The preprocessing pipeline used in this notebook:

1. Work on a copy of the raw dataset.  
2. Handle missing numeric values using **median imputation**.  
3. Leave the features unscaled for Decision Trees, whose threshold-based splits do not depend on feature scale.  
4. Later, apply **StandardScaler** separately for K-means, which is distance-based and therefore sensitive to feature scale.  

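To make step 2 concrete, here is a minimal sketch of median imputation on a single column. The frame `toy` and its values are hypothetical; the point is only to show what the fill does to one missing entry.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value (values are made up).
toy = pd.DataFrame({"ph": [6.0, 7.0, np.nan, 9.0]})

# The median skips NaN by default: the median of [6, 7, 9] is 7.
median_ph = toy["ph"].median()

# Assign the filled column back to the frame.
toy["ph"] = toy["ph"].fillna(median_ph)

print(toy["ph"].tolist())
```

The median is often preferred over the mean here because it is less affected by extreme values in a skewed column.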

In [None]:
# Start preprocessing from the raw dataset
df = df_raw.copy()

# Median imputation for numeric columns
numeric_cols = df.select_dtypes(include=["float64", "int64"]).columns
for col in numeric_cols:
    median_value = df[col].median()
    # Assign back: in-place fillna on a column selection may not update df in recent pandas versions
    df[col] = df[col].fillna(median_value)

# Check missing values after imputation
print("Missing values after median imputation:")
print(df.isnull().sum())

print("\nPreprocessed dataset shape:", df.shape)
df.head()

### 4.1 Features and Target

- `X` = input features (attributes)  
- `y` = target class label (`Potability`)

In [None]:
# Separate features (X) and target (y)

target_column = "Potability"

if target_column not in df.columns:
    raise ValueError(f"Target column '{target_column}' not found in dataframe.")

X = df.drop(target_column, axis=1)
y = df[target_column]

print("Shape of features (X):", X.shape)
print("Shape of target (y):", y.shape)
print("\nClass distribution:")
print(y.value_counts())

## 5. Decision Tree Classification (Phase 3)

We build Decision Tree models using:

- **Criteria:** `gini` and `entropy`  
- **Train/Test splits:** 90/10, 80/20, 70/30  

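The splits in this notebook are stratified on the target, so each train and test part keeps roughly the same class ratio as the full data. The sketch below illustrates that effect on made-up labels; `X_toy` and `y_toy` are hypothetical and not part of the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 60 zeros and 40 ones (made-up data).
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=42, stratify=y_toy
)

# With stratify, both parts keep the 60/40 class ratio of the full label vector.
print(len(y_tr), y_tr.mean())
print(len(y_te), y_te.mean())
```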
For each configuration we compute:

- Accuracy on the test set  
- Confusion matrix  


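The two metrics are directly related: accuracy is the sum of the confusion matrix diagonal divided by the total number of predictions. A small sketch with made-up labels (`y_true` and `y_hat` are hypothetical) shows the relationship:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions (made-up values).
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_hat = np.array([0, 1, 0, 1, 0, 1, 1, 0])

# Rows are true classes, columns are predicted classes.
cm_toy = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm_toy.ravel()

# Accuracy is the number of correct predictions divided by all predictions.
acc_from_cm = (tn + tp) / cm_toy.sum()

print(cm_toy)
print(acc_from_cm, accuracy_score(y_true, y_hat))
```

Because accuracy hides which class the errors fall on, the off-diagonal cells (`fp` and `fn`) are worth reading alongside it when the classes are imbalanced.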

In [None]:
# Helper function to train and evaluate a Decision Tree

def run_decision_tree(X, y, train_size, criterion, random_state=42):
    """Train a DecisionTreeClassifier and return model, accuracy, confusion matrix, and data splits."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        train_size=train_size,
        random_state=random_state,
        stratify=y
    )

    clf = DecisionTreeClassifier(criterion=criterion, random_state=random_state)
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_test)

    acc = accuracy_score(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)

    return clf, acc, cm, (X_train, X_test, y_train, y_test, y_pred)


train_sizes = [0.9, 0.8, 0.7]
criteria = ["gini", "entropy"]

results_clf = []       # summary table
models_clf = {}        # store models and confusion matrices

for crit in criteria:
    for ts in train_sizes:
        clf, acc, cm, data_split = run_decision_tree(X, y, train_size=ts, criterion=crit)

        results_clf.append({
            "criterion": crit,
            "train_size": ts,
            "test_size": round(1 - ts, 2),
            "accuracy": acc
        })

        key = f"{crit}_train{round(ts * 100)}"  # round() avoids float truncation, e.g. int(0.29 * 100) == 28
        models_clf[key] = {
            "model": clf,
            "confusion_matrix": cm,
            "data_split": data_split
        }

results_clf_df = pd.DataFrame(results_clf)
results_clf_df.sort_values(by=["train_size", "criterion"], ascending=[False, True])

In [None]:
# Confusion matrices for all Decision Tree configurations

for key, info in models_clf.items():
    cm = info["confusion_matrix"]

    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[0, 1])
    disp.plot(values_format='d')
    plt.title(f"Confusion Matrix - {key.replace('_', ', ')}")
    plt.show()

In [None]:
# Best model (highest accuracy) and tree visualization

best_idx = results_clf_df["accuracy"].idxmax()
best_row = results_clf_df.loc[best_idx]
best_key = f"{best_row['criterion']}_train{round(best_row['train_size'] * 100)}"

print("Best model settings:")
print(best_row)

best_model = models_clf[best_key]["model"]

plt.figure(figsize=(16, 8))
plot_tree(
    best_model,
    feature_names=list(X.columns),  # some scikit-learn versions reject a pandas Index here
    class_names=["Not potable (0)", "Potable (1)"],
    filled=True,
    rounded=True,
    fontsize=8
)
plt.title(f"Decision Tree - Best model ({best_key})")
plt.show()

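A large tree plot can be hard to read, so a numeric companion is the fitted classifier's `feature_importances_` attribute, which reports each feature's share of the total impurity reduction. The sketch below uses synthetic data in which the label depends only on the first feature; `X_toy`, `y_toy`, and the feature names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data: the label depends only on the first feature.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

tree = DecisionTreeClassifier(criterion="gini", random_state=42)
tree.fit(X_toy, y_toy)

# Impurity-based importances are non-negative and sum to 1 across features.
feature_names = ["f0", "f1", "f2"]  # hypothetical names
for name, importance in zip(feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

In this notebook the same attribute would be read from `best_model` and paired with `X.columns`.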
## 6. K-means Clustering (Phase 3)

We apply K-means clustering on the feature space (without the label):

1. Drop `Potability` from the feature set.  
2. Standardize features using `StandardScaler`.  
3. Try K ∈ {2, 3, 4, 5}.  
4. For each K, compute:
   - Total within-cluster sum of squares (**inertia**)  
   - Average **silhouette score**  
5. Use Elbow and Silhouette plots to choose the most suitable K.  


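To show what the two metrics in step 4 measure, the sketch below recomputes inertia by hand and compares it with the value reported by scikit-learn, then computes the silhouette score. The data are two made-up, well-separated groups of points, not the water dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two well-separated groups of 2-D points (made up).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
group_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
X_toy = StandardScaler().fit_transform(np.vstack([group_a, group_b]))

km = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X_toy)

# Inertia: squared distance of every point to its assigned cluster centre, summed.
assigned_centres = km.cluster_centers_[km.labels_]
manual_inertia = ((X_toy - assigned_centres) ** 2).sum()

# Silhouette: mean over points of (b - a) / max(a, b), in the range [-1, 1],
# where a is the mean distance within the point's own cluster and b is the
# mean distance to the nearest other cluster.
sil = silhouette_score(X_toy, km.labels_)

print(manual_inertia, km.inertia_, sil)
```

Inertia always decreases as K grows, which is why the elbow is judged by shape, while the silhouette score can be compared directly across K values.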

In [None]:
# Feature space for K-means (without target)

X_features = df.drop(target_column, axis=1)

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_features)

# K-means for different K values
k_values = [2, 3, 4, 5]

kmeans_results = []
cluster_labels_dict = {}

for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_scaled)

    inertia = kmeans.inertia_
    sil_score = silhouette_score(X_scaled, labels)

    kmeans_results.append({
        "K": k,
        "inertia": inertia,
        "silhouette": sil_score
    })

    cluster_labels_dict[k] = {
        "model": kmeans,
        "labels": labels
    }

kmeans_results_df = pd.DataFrame(kmeans_results)
kmeans_results_df.sort_values("K")

In [None]:
# Elbow plot (K vs inertia)

plt.plot(kmeans_results_df["K"], kmeans_results_df["inertia"], marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Total within-cluster sum of squares (Inertia)")
plt.title("Elbow Method for K-means")
plt.xticks(k_values)
plt.show()

# Silhouette plot (K vs silhouette)

plt.plot(kmeans_results_df["K"], kmeans_results_df["silhouette"], marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Average Silhouette Coefficient")
plt.title("Silhouette Scores for Different K")
plt.xticks(k_values)
plt.show()

In [None]:
# Best K based on silhouette score and PCA visualization

best_k_idx = kmeans_results_df["silhouette"].idxmax()
best_k = int(kmeans_results_df.loc[best_k_idx, "K"])
print("Best K based on silhouette:", best_k)

best_labels = cluster_labels_dict[best_k]["labels"]

# PCA to 2D
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=best_labels, alpha=0.6)
plt.xlabel("PCA Component 1")
plt.ylabel("PCA Component 2")
plt.title(f"K-means Clusters (K = {best_k}) in PCA space")
plt.show()

# Attach cluster labels to original data and summarize

df_clusters = df.copy()
df_clusters["cluster"] = best_labels

cluster_summary = df_clusters.groupby("cluster").mean(numeric_only=True)
cluster_summary

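One way to relate clusters to water quality is a row-normalised cross-tabulation of cluster against `Potability`. The sketch below uses made-up assignments; the frame `toy` is hypothetical, and only the column names match this notebook.

```python
import pandas as pd

# Hypothetical cluster assignments and labels (made-up values).
toy = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 1, 1, 0],
    "Potability": [0, 0, 1, 1, 1, 0, 1, 0],
})

# Row-normalised table: share of each Potability class within each cluster.
share = pd.crosstab(toy["cluster"], toy["Potability"], normalize="index")
print(share)
```

Applied here, the same call would take `df_clusters["cluster"]` and `df_clusters["Potability"]`. Rows with similar class shares would suggest the clusters do not separate potable from non-potable samples.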
## 7. Summary of Results

This section summarizes the main outputs obtained from the analysis:

- `results_clf_df`: a table showing the accuracy of the Decision Tree models across splitting criteria and train/test split ratios.  
- Confusion matrices: show how well each model distinguishes potable samples (1) from non-potable samples (0).  
- Best Decision Tree plot: shows the attributes that most influence the potability prediction, along with some of the classification rules extracted from the tree.  
- `kmeans_results_df`: a table of inertia and silhouette values for the tested values of K in K-means.  
- Elbow & Silhouette plots: help choose the most suitable K based on the shape of the curves.  
- `cluster_summary` with the PCA plot: describes each cluster by the mean value of its attributes and relates those profiles to water quality.

