# Phase 3 – Final Report: Water Potability Dataset

## [1] Problem

Access to clean and safe drinking water is essential for human health. However, water sources may contain chemical and physical contaminants that make the water unsafe for human consumption.

In this project, we work with a *water potability* dataset to determine whether a given water sample is **potable (safe to drink)** or **not potable** based on several measured attributes (such as pH, hardness, solids, sulfate, and others). Our main goal is to support decision-making about water quality by applying data mining techniques to:
- Predict if a water sample is potable.
- Discover natural groups (clusters) of water samples.



Throughout the report, the class label is encoded as `Potability = 1` for samples that are safe for human consumption and `Potability = 0` for samples that are not.

## [2] Data Mining Task

We formalize the problem as two main data mining tasks:

1. **Classification task**  
   - Goal: predict the class label `Potability` for each water sample, where:
     - `0` = Not potable (not safe to drink)
     - `1` = Potable (safe to drink)
   - Technique: **Decision Tree classifier** using two attribute selection measures:
     - Gini index (default in scikit-learn)
     - Information Gain (entropy)

2. **Clustering task**  
   - Goal: group water samples into clusters based on their attribute values to discover hidden structure.
   - Technique: **K-Means clustering** with several values of K. We evaluate K using:
     - **Silhouette coefficient**
     - **Elbow method** (total within-cluster sum of squares / inertia)

These tasks will be applied to the **preprocessed dataset** prepared in Phase 2.
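
To make the two attribute selection measures concrete, the sketch below computes Gini impurity, entropy, and the information gain of one candidate split by hand. It is an illustration only: the class counts are made up, and the helper names (`gini_impurity`, `entropy`, `information_gain`) are hypothetical and not part of scikit-learn.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Hypothetical node: 6 non-potable (0) and 4 potable (1) samples
parent = [0] * 6 + [1] * 4
# Hypothetical split of that node into two children
left = [0] * 5 + [1] * 1
right = [0] * 1 + [1] * 3

print(f"Gini(parent)    = {gini_impurity(parent):.3f}")                  # 0.480
print(f"Entropy(parent) = {entropy(parent):.3f}")                        # 0.971
print(f"Info gain       = {information_gain(parent, left, right):.3f}")  # 0.256
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is why a Decision Tree prefers splits that reduce them the most.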

## [3] Data

In this section, we summarize the dataset information and basic exploration steps carried out in **Phase 1**.

### 3.1 Dataset Source and Description


- Dataset: Water Potability (Aditya Kadiwal)  
- Source: Kaggle  
- Kaggle link: https://www.kaggle.com/datasets/adityakadiwal/water-potability  
- Required local filename in repo: `Dataset/Raw_dataset.csv`


In [None]:

import pandas as pd

url = "https://raw.githubusercontent.com/lin-010/IT326-Water-Potability/main/Dataset/Raw_dataset.csv"
data = pd.read_csv(url)

print("Dataset loaded successfully!")
print("Shape (rows, cols):", data.shape)

# quick check that Potability column exists
if 'Potability' not in data.columns:
    raise ValueError("The dataset does not contain a 'Potability' column. Check the CSV headers.")


In [None]:
# Sample rows and dataset basics
from IPython.display import display

print("Shape (rows, cols):", data.shape)
print("\nFirst 10 rows:")
display(data.head(10))

print("\nColumn names:")
print(list(data.columns))

print("\nData types:")
display(data.dtypes)

print("\nStatistical summary (numeric columns):")
display(data.describe().T)


In [None]:
# Number of attributes and feature list
num_rows, num_cols = data.shape
feature_columns = [c for c in data.columns if c != 'Potability']
num_features = len(feature_columns)

print(f"Number of instances (rows): {num_rows}")
print(f"Number of attributes (columns): {num_cols}  -> features: {num_features}, class label: 1 (Potability)")
print("\nFeature names:")
print(feature_columns)



In [None]:
# Class distribution
class_counts = data['Potability'].value_counts().sort_index()
class_percent = data['Potability'].value_counts(normalize=True).sort_index() * 100

print("Class counts (Potability):")
print(class_counts)
print("\nClass percentages:")
print(class_percent.round(2).astype(str) + " %")



In [None]:
# Missing value analysis
missing = data.isnull().sum()
print("Missing values per column:")
print(missing)
print("\nTotal missing values in dataset:", missing.sum())


## [4] Data Preprocessing

In **Phase 2**, we analyzed and preprocessed the raw dataset to prepare it for data mining. The main preprocessing steps applied include:

1. **Handling missing values** using median imputation for numeric attributes.
2. **Handling outliers** using IQR-based winsorization (capping extreme values).
3. **Normalization** of features using Min-Max scaling to the range [0, 1].
4. **Feature evaluation** using correlation analysis with a heatmap.
5. **Saving the final preprocessed dataset** as `Preprocessed_dataset.csv`.
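
As a rough illustration of steps 1 to 3, the following dependency-free sketch applies median imputation, IQR-based capping, and Min-Max scaling to a single made-up column. The values and the helper names (`impute_median`, `cap_iqr`, `min_max`) are assumptions for this example; the notebook itself performs the same steps with pandas and `MinMaxScaler`.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def cap_iqr(values, k=1.5):
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR] (winsorization)."""
    # method="inclusive" uses linear interpolation between data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]


def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Made-up pH readings with one missing value and one extreme value
raw = [6.5, 7.0, None, 7.2, 6.8, 14.0, 7.1]

filled = impute_median(raw)   # None is replaced by the median
capped = cap_iqr(filled)      # 14.0 is pulled back to the upper bound
scaled = min_max(capped)      # all values now lie in [0, 1]

print("filled:", filled)
print("capped:", [round(v, 3) for v in capped])
print("scaled:", [round(v, 3) for v in scaled])
```

The order matters: capping before scaling keeps a single extreme reading from compressing all other values into a narrow band near zero.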

The following code cell is the original Phase 2 notebook code, which performs both data analysis (plots, summaries) and preprocessing in one place. It defines the final preprocessed dataframe `df_scaled` that will be used in the next sections for classification and clustering.

In [None]:

# Water Potability Dataset - Data Analysis


# 1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (10,6)


# 2. Load Dataset

url = "https://raw.githubusercontent.com/lin-010/IT326-Water-Potability/refs/heads/main/Dataset/Raw_dataset.csv"
df = pd.read_csv(url)   


print(df.head())


# 3. Check Missing Values

print("\nMissing Values per Column:\n")
print(df.isnull().sum())


plt.figure(figsize=(8,5))
sns.heatmap(df.isnull(), cbar=False, cmap="viridis")
plt.title("Missing Values Heatmap")
plt.show()


# 4. Statistical Summary (Five-number summary)

print("\nStatistical Summary:\n")
print(df.describe())


# 5. Plot 1: Histogram - Variable Distributions

df.hist(figsize=(12,10), bins=20, color='skyblue')
plt.suptitle("Histograms of Numeric Attributes", fontsize=16)
plt.show()


# 6. Plot 2: Boxplot - Outliers Detection

plt.figure(figsize=(12,6))
sns.boxplot(data=df, orient="h")
plt.title("Boxplot for Detecting Outliers")
plt.show()


# 7. Plot 3: Countplot - Class Label Distribution

plt.figure(figsize=(6,4))
sns.countplot(x='Potability', data=df, palette='pastel')
plt.title("Class Label Distribution (Potability)")
plt.xlabel("Potability (0 = Not Drinkable, 1 = Drinkable)")
plt.ylabel("Count")
plt.show()


# 8. Plot 4: Scatter Plot - Relationship Example

plt.figure(figsize=(7,5))
sns.scatterplot(x='ph', y='Hardness', hue='Potability', data=df, alpha=0.7)
plt.title("Scatter Plot of pH vs Hardness by Potability")
plt.show()

# -----------------------------------------------------------
# 9. Brief Observations :

# - Missing values exist in several columns, so preprocessing (imputation) is required.
# - Histograms show that some features like 'ph' and 'Sulfate' are skewed.
# - Boxplots reveal outliers, especially in 'Sulfate' and 'Turbidity'.
# - Class label distribution is imbalanced (more non-drinkable samples).
# - Scatter plot shows weak correlation between pH and Hardness.

# Water Potability Dataset - Data Preprocessing
# 10. Data Preprocessing

# 10.1 Handle Missing Values (Imputation)


df_filled = df.fillna(df.median())
print("\n Missing values handled using median imputation.\n")
print(df_filled.isnull().sum())


# 10.2 Handle Outliers 
Q1 = df_filled.quantile(0.25)
Q3 = df_filled.quantile(0.75)
IQR = Q3 - Q1

# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Cap outliers 
df_no_outliers = df_filled.clip(lower=lower_bound, upper=upper_bound, axis=1)

print("\n Outliers handled using IQR method.\n")


# 10.3 Normalization 
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
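feature_cols = df_no_outliers.columns.drop('Potability')
scaled_data = scaler.fit_transform(df_no_outliers[feature_cols])

# Rebuild the frame with explicit feature names and the original index,
# so the result does not depend on 'Potability' being the last column
df_scaled = pd.DataFrame(scaled_data, columns=feature_cols, index=df_no_outliers.index)
df_scaled['Potability'] = df_no_outliers['Potability'].astype(int)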

print("\n Data normalized using Min-Max scaling.\n")
print(df_scaled.head())


# 10.4 Feature Selection (Correlation Analysis)
plt.figure(figsize=(10,6))
corr_matrix = df_scaled.corr()
sns.heatmap(corr_matrix, annot=False, cmap="coolwarm")
plt.title("Correlation Matrix Heatmap")
plt.show()

# Identify low-correlation features (reported only; nothing is dropped here)
low_corr_features = corr_matrix['Potability'][abs(corr_matrix['Potability']) < 0.05].index
print("Low correlation features (optional to drop):", list(low_corr_features))


# 10.5 Save Preprocessed Data
df_scaled.to_csv("Preprocessed_dataset.csv", index=False)
print("\n Preprocessed dataset saved as 'Preprocessed_dataset.csv'\n")


# -----------------------------------------------------------
# 11. Summary of Preprocessing Steps

print("""
Preprocessing Summary:
1. Missing values handled with median imputation.
2. Outliers capped using IQR-based winsorization.
3. Data normalized using Min-Max Scaling (0–1 range).
4. Features evaluated using correlation analysis.
5. Final dataset saved for Phase 3 (classification & clustering).
""")

# Data snapshots:


print("Snapshot of Raw Dataset (before preprocessing):")
print(df.head())


print("\nSnapshot of Preprocessed Dataset (after preprocessing):")
print(df_scaled.head())


## [5] Data Mining Technique

In this section, we apply the data mining techniques to the **preprocessed dataset** (`df_scaled` from Section [4]).

### 5.1 Classification – Decision Trees

We use the `DecisionTreeClassifier` from the `sklearn.tree` package with two different splitting criteria:

- `criterion='gini'`  → Gini index (default)
- `criterion='entropy'`  → Information Gain (entropy)

For each criterion, we evaluate three different train–test splits:

- 90% training – 10% testing
- 80% training – 20% testing
- 70% training – 30% testing

For every configuration, we compute the **accuracy** and **confusion matrix**, and then identify the best-performing model.
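
Because the classes are imbalanced, accuracy alone can hide weak performance on the potable class. The sketch below shows how accuracy and per-class recall can be read off a 2x2 confusion matrix in scikit-learn's layout (rows are true classes, columns are predicted classes). The counts are invented for illustration, and `summarize_confusion` is a hypothetical helper, not a scikit-learn function.

```python
def summarize_confusion(cm):
    """Derive accuracy and per-class recall from a 2x2 confusion matrix.

    cm[i][j] is the number of samples with true class i predicted as class j,
    with class 0 = not potable and class 1 = potable.
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "recall_not_potable": tn / (tn + fp),
        "recall_potable": tp / (tp + fn),
    }


# Invented counts for a test split of 328 samples (not real results)
example_cm = [[150, 50],
              [80, 48]]

for name, value in summarize_confusion(example_cm).items():
    print(f"{name}: {value:.3f}")
```

In this invented case the overall accuracy is about 0.60, yet only 37.5% of the potable samples are recognized, which is the kind of gap the confusion matrices are meant to expose.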

In [None]:
# === Classification using Decision Trees ===
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

# Ensure df_scaled exists (from preprocessing section)
try:
    df_scaled
except NameError:
    raise NameError("df_scaled is not defined. Please run the preprocessing cell in Section [4] first.")

target_col = "Potability"
if target_col not in df_scaled.columns:
    raise ValueError(f"Target column '{target_col}' not found in df_scaled.")

# Features and target
X = df_scaled.drop(columns=[target_col])
y = df_scaled[target_col]

print("Preprocessed features shape:", X.shape)
print("Target shape:", y.shape)
print("\nClass distribution in preprocessed data:")
print(y.value_counts())

# Train-test splits and criteria
splits = [
    (0.9, 0.1),
    (0.8, 0.2),
    (0.7, 0.3),
]
criteria = ["gini", "entropy"]

results_clf = []

for train_size, test_size in splits:
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        train_size=train_size,
        stratify=y,
        random_state=42
    )
    
    for crit in criteria:
        clf = DecisionTreeClassifier(
            criterion=crit,
            random_state=42
        )
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)

        acc = accuracy_score(y_test, y_pred)
        cm = confusion_matrix(y_test, y_pred)

        results_clf.append({
            "criterion": crit,
            # round() avoids truncation from floating-point products such as 0.29 * 100
            "train_size": f"{round(train_size * 100)}%",
            "test_size": f"{round(test_size * 100)}%",
            "accuracy": acc,
            "confusion_matrix": cm,
            "model": clf,
            "X_test": X_test,
            "y_test": y_test
        })

# Build summary table of accuracies
summary_rows = [
    {
        "Criterion": r["criterion"],
        "Train %": r["train_size"],
        "Test %": r["test_size"],
        "Accuracy": round(r["accuracy"], 4)
    }
    for r in results_clf
]

clf_results_df = pd.DataFrame(summary_rows)
print("\nDecision Tree classification results:")
display(clf_results_df)

# Plot confusion matrices
for r in results_clf:
    cm = r["confusion_matrix"]
    clf = r["model"]
    disp = ConfusionMatrixDisplay(
        confusion_matrix=cm,
        display_labels=clf.classes_
    )
    disp.plot()
    plt.title(
        f"Confusion Matrix – criterion={r['criterion']}, "
        f"train={r['train_size']}, test={r['test_size']}"
    )
    plt.show()

# Find best model by accuracy
best_model = max(results_clf, key=lambda r: r["accuracy"])
best_clf = best_model["model"]

print("\nBest Decision Tree model:")
print("  Criterion:", best_model["criterion"])
print("  Train size:", best_model["train_size"])
print("  Test size:", best_model["test_size"])
print("  Accuracy:", round(best_model["accuracy"], 4))

# Visualize the best tree
plt.figure(figsize=(16, 8))
plot_tree(
    best_clf,
    feature_names=list(X.columns),  # some scikit-learn versions require a list here
    class_names=["Not potable", "Potable"],
    filled=True,
    fontsize=8
)
plt.title("Decision Tree – Best Model")
plt.show()


### 5.2 Clustering – K-Means

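For clustering, we use the `KMeans` algorithm from the `sklearn.cluster` package. We cluster on the **feature space only** (without the `Potability` label). Although the features were already Min-Max scaled in Section [4], we standardize them again (zero mean, unit variance) before applying K-Means so that every attribute contributes on a comparable scale to the Euclidean distances.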

We evaluate several values of K (number of clusters), for example:
- K = 2
- K = 3
- K = 4
- K = 5

For each K, we compute:
- **Average Silhouette coefficient**
- **Total within-cluster sum of squares (inertia)**

These metrics are then used together to select K: the code picks the K with the highest average silhouette coefficient, and the elbow plot serves as a visual cross-check. We also visualize the clusters in 2D using the first two features.
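
To clarify what the two metrics measure, here is a minimal pure-Python sketch that computes inertia and the average silhouette coefficient for a tiny made-up set of 2D points. The functions `inertia` and `mean_silhouette` are hypothetical helpers written for this illustration; in the notebook the values come from `KMeans.inertia_` and `silhouette_score`.

```python
from math import dist


def inertia(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        centroid = [sum(coords) / len(members) for coords in zip(*members)]
        total += sum(dist(p, centroid) ** 2 for p in members)
    return total


def mean_silhouette(points, labels):
    """Average of (b - a) / max(a, b); assumes at least two distinct labels."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # a: mean distance to the other points in the same cluster
        same = [dist(p, q) for j, (q, lab) in enumerate(zip(points, labels))
                if lab == own and j != i]
        if not same:  # singleton cluster: silhouette defined as 0
            scores.append(0.0)
            continue
        a = sum(same) / len(same)
        # b: smallest mean distance to the points of any other cluster
        b = min(
            sum(dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels) if other != own
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Two well-separated made-up groups of points
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]   # matches the visible grouping
bad_labels = [0, 1, 0, 1, 0, 1]    # deliberately mixes the two groups

for name, labels in [("good", good_labels), ("bad", bad_labels)]:
    print(f"{name}: inertia={inertia(points, labels):.2f}, "
          f"silhouette={mean_silhouette(points, labels):.2f}")
```

A good assignment gives low inertia and a silhouette close to 1. Inertia always decreases as K grows, so it is read through the elbow, while the silhouette can be compared directly across values of K.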

In [None]:
# === Clustering using K-Means ===
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Use feature columns (drop target)
X_clust = df_scaled.drop(columns=[target_col])

# Standardize features for K-Means
# (separate name so the MinMaxScaler `scaler` from Section [4] is not overwritten)
clust_scaler = StandardScaler()
X_scaled_clust = clust_scaler.fit_transform(X_clust)

print("Clustering feature matrix shape:", X_scaled_clust.shape)

ks = [2, 3, 4, 5]
silhouette_scores = []
inertias = []
kmodels = {}

for k in ks:
    kmeans = KMeans(
        n_clusters=k,
        n_init=10,
        random_state=42
    )
    labels = kmeans.fit_predict(X_scaled_clust)

    inertia = kmeans.inertia_
    inertias.append(inertia)

    sil_score = silhouette_score(X_scaled_clust, labels)
    silhouette_scores.append(sil_score)

    kmodels[k] = {
        "model": kmeans,
        "labels": labels,
        "silhouette": sil_score,
        "inertia": inertia
    }

# Summary table for clustering
clust_summary = pd.DataFrame({
    "K": ks,
    "Average Silhouette": np.round(silhouette_scores, 4),
    "Total Within-Cluster SS (Inertia)": np.round(inertias, 2)
})

print("\nK-Means clustering summary:")
display(clust_summary)

# Elbow plot
plt.figure(figsize=(6,4))
plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Total within-cluster sum of squares (Inertia)")
plt.title("Elbow Method for K-Means")
plt.grid(True)
plt.show()

# Silhouette plot
plt.figure(figsize=(6,4))
plt.plot(ks, silhouette_scores, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Average Silhouette Coefficient")
plt.title("Silhouette Coefficient for different K")
plt.grid(True)
plt.show()

# Choose best K based on silhouette
best_k_index = int(np.argmax(silhouette_scores))
best_k = ks[best_k_index]
print("Best K based on silhouette score:", best_k)

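# Simple 2D visualization using the first two features (Min-Max scaled values from df_scaled);
# clusters were fitted on all features, so this is only a 2D projection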
feature_names = X_clust.columns
feat_x = feature_names[0]
feat_y = feature_names[1]

for k in ks:
    labels = kmodels[k]["labels"]
    plt.figure(figsize=(6,4))
    plt.scatter(
        X_clust[feat_x],
        X_clust[feat_y],
        c=labels,
        s=10
    )
    plt.xlabel(feat_x)
    plt.ylabel(feat_y)
    plt.title(f"K-Means Clusters (K={k}) using {feat_x} vs {feat_y}")
    plt.show()


## [6] Evaluation and Comparison

In this section, we summarize and compare the results of the data mining techniques.

### 6.1 Classification

The table `clf_results_df` (printed above) summarizes the accuracy for each combination of:
- Splitting criterion (Gini vs Entropy)
- Train–test partition (90–10, 80–20, 70–30)

From this table, we can identify:
- The best-performing **criterion**.
- The best **train–test split**.
- The overall best model (also visualized by the final Decision Tree).

### 6.2 Clustering

The table `clust_summary` contains, for each K:
- Average silhouette coefficient
- Total within-cluster sum of squares (inertia)

Using the silhouette and elbow plots, we can select the best K and interpret how the clusters separate water samples with different attribute values.
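
One way to interpret the clusters is to check, after clustering, how the potable and non-potable samples are distributed across them. The sketch below does this for made-up cluster and class labels; `cluster_composition` is a hypothetical helper introduced only for this example.

```python
from collections import Counter, defaultdict


def cluster_composition(cluster_labels, class_labels):
    """Return {cluster: {class: share of that class within the cluster}}."""
    counts = defaultdict(Counter)
    for cluster, cls in zip(cluster_labels, class_labels):
        counts[cluster][cls] += 1
    return {
        cluster: {cls: n / sum(cnt.values()) for cls, n in cnt.items()}
        for cluster, cnt in counts.items()
    }


# Made-up assignments for eight samples (not real results)
example_clusters = [0, 0, 0, 1, 1, 1, 1, 1]
example_classes = [0, 0, 1, 1, 1, 1, 0, 1]

composition = cluster_composition(example_clusters, example_classes)
for cluster, shares in sorted(composition.items()):
    print(f"cluster {cluster}:",
          {cls: round(share, 2) for cls, share in sorted(shares.items())})
```

In the notebook, a call such as `pd.crosstab(kmodels[best_k]["labels"], df_scaled["Potability"], normalize="index")` should produce the same kind of table. If every cluster shows roughly the overall class ratio, the clusters do not separate potable from non-potable water.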

In [None]:
# Re-display the main evaluation tables (optional)

print("=== Classification Accuracy Summary ===")
try:
    display(clf_results_df)
except NameError:
    print("clf_results_df is not defined. Please run the classification cell in Section 5.1.")

print("\n=== Clustering Summary ===")
try:
    display(clust_summary)
except NameError:
    print("clust_summary is not defined. Please run the clustering cell in Section 5.2.")


## [7] Findings and Discussion

In this section, you should discuss the most important findings from both classification and clustering. For example:

- **Classification**
  - Which combination of criterion and train–test split achieved the best accuracy?
  - Are there signs of overfitting (for example, accuracy that changes noticeably from one split to another, or an unusually high accuracy on the small 10% test split)?
  - Which attributes appear to be most important according to the Decision Tree structure?
  - How do your results compare (even roughly) with the results reported in the selected research paper that used a similar dataset or problem?

- **Clustering**
  - What is the best K according to the silhouette and elbow methods?
  - How can you describe the clusters in terms of the water quality attributes?
  - Do some clusters tend to contain more potable or more non-potable samples (even though the label was not used during training)?

Finally, summarize which technique (classification or clustering) is more useful for the stakeholders (e.g., water quality engineers) and what practical insights your project provides about water potability.


## [8] References

Below is an example list of references in IEEE style. You should update it to match the exact research paper(s) you used in your project.

1. A. Kadiwal, "Water Potability," *Kaggle Datasets*, 2020. [Online]. Available: <https://www.kaggle.com/datasets/adityakadiwal/water-potability>.

2. **[Replace with your main research paper]** Author(s), "Title of the paper," *Conference or Journal Name*, vol. X, no. Y, pp. Z–AA, Year.

3. F. Pedregosa *et al.*, "Scikit-learn: Machine Learning in Python," *Journal of Machine Learning Research*, vol. 12, pp. 2825–2830, 2011.
