# The Project: Semi-Supervised Clustering with Hybrid Feature Pipeline (XGBoost Leaf Embeddings + Catch22 Features), Dimensionality Reduction (UMAP), and Agglomerative Clustering

## Overview

This project performs **semi-supervised clustering** on a phishing-website dataset stored in **ARFF** format. It builds a **hybrid feature space** by combining:

* **XGBoost leaf embeddings** (learned structure from gradient-boosted trees), and
* **Catch22** handcrafted descriptors (22 classic time-series/statistical features),

then reduces dimensionality with **UMAP** and clusters with **Agglomerative Clustering**. Model quality is reported using **5-fold cross-validation** with both supervised-style metrics (via post-hoc label mapping) and external clustering indices (ARI, NMI) that compare the raw cluster assignments against the ground-truth labels.

---

## What is a "Hybrid Pipeline"?

A **hybrid pipeline** means the feature processing chain integrates **two fundamentally different feature generation approaches** before feeding them into dimensionality reduction and clustering:

1. **XGBoost Leaf Embeddings**

   * Trains an XGBoost classifier on the balanced training set.
   * Uses `.apply()` to get the leaf index of each sample in every tree.
   * One-Hot encodes these leaf indices to produce learned, high-dimensional embeddings that capture **complex, nonlinear relationships**.

2. **Catch22 Statistical Features**

   * Extracts 22 pre-defined time-series descriptors from each row, treating the row's 30 feature values as an ordered sequence.
   * These features are **model-agnostic**, hand-engineered, and robust to noise.

**Why combine them?**

* The statistical features contribute **general, domain-independent signals**.
* The learned leaf embeddings capture **problem-specific patterns** from supervised learning.
* The combined representation is intended to be richer and more discriminative than either source alone, giving **UMAP** more structure to separate and **Agglomerative Clustering** better-defined groups to merge.
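
The fusion described above can be illustrated without any of the libraries involved. The following is a minimal, dependency-free sketch, not the project's code: the leaf indices and statistical values are made up, and `fit_leaf_vocab`, `one_hot_leaves`, and `fuse` are hypothetical helper names. It mimics what `OneHotEncoder(handle_unknown='ignore')` does to the output of `.apply()`: one block of columns per tree, and an all-zero block when a test row lands in a leaf that never appeared in training.

```python
def fit_leaf_vocab(train_leaves):
    """For each tree (column), record the sorted leaf indices seen in training."""
    n_trees = len(train_leaves[0])
    return [sorted({row[t] for row in train_leaves}) for t in range(n_trees)]


def one_hot_leaves(leaves, vocab):
    """One-hot encode each tree's leaf index; unseen leaves become all zeros."""
    encoded = []
    for row in leaves:
        vec = []
        for tree_id, leaf in enumerate(row):
            vec.extend(1.0 if leaf == known else 0.0 for known in vocab[tree_id])
        encoded.append(vec)
    return encoded


def fuse(leaf_onehot, stat_features):
    """Concatenate the learned and handcrafted feature blocks row by row."""
    return [a + list(b) for a, b in zip(leaf_onehot, stat_features)]


# Toy leaf indices from 2 trees (made-up values, not real model output).
train_leaves = [[3, 5], [4, 5], [3, 6]]
test_leaves = [[4, 7]]  # leaf 7 of the second tree was never seen in training

vocab = fit_leaf_vocab(train_leaves)  # [[3, 4], [5, 6]]
train_onehot = one_hot_leaves(train_leaves, vocab)
test_onehot = one_hot_leaves(test_leaves, vocab)

# Two made-up descriptors per row stand in for the 22 Catch22 values.
train_stats = [[0.1, -1.2], [0.4, 0.3], [-0.7, 0.9]]
test_stats = [[0.2, 0.0]]

train_fused = fuse(train_onehot, train_stats)
test_fused = fuse(test_onehot, test_stats)

print(train_fused[0])  # [1.0, 0.0, 1.0, 0.0, 0.1, -1.2]
print(test_fused[0])   # [0.0, 1.0, 0.0, 0.0, 0.2, 0.0]
```

In the real pipeline the one-hot block is far wider than the 22 Catch22 columns (one group of columns per tree), so the learned features dominate the fused width.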

---

## Step-by-Step

1. **Load & Clean (ARFF)**

   * Reads `Training Dataset.arff` with `scipy.io.arff`.
   * Converts byte columns to strings; casts the target `Result` to integers and remaps labels `{-1 → 0, 1 → 1}`.

2. **Cross-Validation**

   * Uses `KFold(n_splits=5, shuffle=True, random_state=42)` to create train/test splits and aggregate metrics across folds.

3. **Balance & Scale (Train-Only)**

   * Applies `RandomOverSampler` to the **training split only** to balance classes.
   * Fits `StandardScaler` on the balanced train set; transforms both train and test.

4. **Feature Engineering (Two Streams)**

   * **Catch22**: Computes 22 statistical features per sample (`pycatch22`) from the unscaled feature rows (oversampled train rows and original test rows).
   * **XGBoost Leaf Embeddings**: Trains an `XGBClassifier` on balanced+scaled train data, calls `.apply()` to get **leaf indices** for train/test, then **One-Hot encodes** these indices with an encoder fitted on the train leaves.

5. **Fuse Features**

   * Concatenates `[One-Hot leaves ⨁ Catch22]` for train and test.

6. **Dimensionality Reduction**

   * Fits **UMAP** on the **train** fused features to 20 dimensions; transforms the **test** fused features with the same reducer.

7. **Clustering (Test Split)**

   * Runs **AgglomerativeClustering (Ward, `n_clusters=5`)** on test UMAP features to produce cluster IDs.

8. **Evaluation**

   * **Post-hoc label mapping**: maps clusters → labels using **majority vote** against the test ground truth (for reporting accuracy/F1).
   * Computes **classification report** (accuracy, precision, recall, F1), plus **Adjusted Rand Index (ARI)** and **Normalized Mutual Information (NMI)** on raw clusters.
   * Stores per-fold metrics and prints a final summary table and fold-wise means.
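
Step 3 hinges on fitting the resampler and the scaler on the training split only. The sketch below shows that idea using only the standard library. It is an illustration under stated assumptions, not the notebook's code: `random_oversample` and `fit_standardizer` are hypothetical stand-ins for `RandomOverSampler` and `StandardScaler`, and the data is made up.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def fit_standardizer(rows):
    """Learn per-column mean and population std from `rows`; return a transform function."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # guard against constant columns

    def transform(data):
        return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in data]

    return transform


# Made-up split: 4 training rows (3 of class 1, 1 of class 0) and 1 test row.
X_train = [[1.0, -1.0], [1.0, 0.0], [-1.0, 1.0], [0.0, 1.0]]
y_train = [1, 1, 1, 0]
X_test = [[1.0, 1.0]]

X_bal, y_bal = random_oversample(X_train, y_train)
scale = fit_standardizer(X_bal)   # statistics come from the balanced train set
X_train_scaled = scale(X_bal)
X_test_scaled = scale(X_test)     # test rows are transformed, never fitted on

print(Counter(y_bal))  # both classes now have 3 rows
```

Keeping the test rows out of both the resampling and the fitted statistics is what prevents information from the held-out fold leaking into the features.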

> **Note on interpretation:** The majority-vote mapping uses **test labels** to align cluster IDs with classes. This is standard for evaluating clustering but should be viewed as an external scoring step rather than a train-time supervision signal.
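
To make the mapping concrete, here is a small dependency-free sketch of majority-vote label mapping. It is a simplified stand-in for the notebook's `map_clusters_to_labels`, using `collections.Counter` instead of `scipy.stats.mode`, and the labels are made up. Ties are broken by first occurrence here, which is an assumption of this sketch and may differ from `scipy.stats.mode`.

```python
from collections import Counter


def map_clusters_to_labels(true_labels, pred_clusters):
    """Assign each cluster the most common true label among its members."""
    members = {}
    for label, cluster in zip(true_labels, pred_clusters):
        members.setdefault(cluster, []).append(label)
    mapping = {c: Counter(labels).most_common(1)[0][0] for c, labels in members.items()}
    return [mapping[c] for c in pred_clusters], mapping


# Made-up example: 2 classes, 3 clusters.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
clusters = [0, 0, 1, 1, 1, 2, 2, 2]

y_mapped, mapping = map_clusters_to_labels(y_true, clusters)
accuracy = sum(t == p for t, p in zip(y_true, y_mapped)) / len(y_true)

print(mapping)   # {0: 0, 1: 1, 2: 1}
print(accuracy)  # 0.75
```

Because every cluster is scored against its own best label, mapped accuracy cannot decrease when a cluster is split further. With `n_clusters=5` and two classes, the mapped accuracy is therefore best read alongside ARI and NMI rather than on its own.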

---

## Outputs You’ll See

* Per-fold **classification report** for mapped cluster labels.
* **ARI** and **NMI** per fold.
* A final **DataFrame** summarizing accuracy, macro-F1, ARI, and NMI for each fold, and the **mean** over folds.
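
For readers unfamiliar with ARI, the following sketch computes it from pair counts using only the standard library. It is meant as an illustration of the standard formula, not a replacement for `sklearn.metrics.adjusted_rand_score`; the function name `adjusted_rand_index` and the toy labels are assumptions of this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from the class/cluster contingency counts."""
    n = len(labels_true)
    cell_counts = Counter(zip(labels_true, labels_pred))
    class_counts = Counter(labels_true)
    cluster_counts = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cell_counts.values())
    sum_classes = sum(comb(c, 2) for c in class_counts.values())
    sum_clusters = sum(comb(c, 2) for c in cluster_counts.values())

    expected = sum_classes * sum_clusters / comb(n, 2)
    max_index = 0.5 * (sum_classes + sum_clusters)
    if max_index == expected:  # degenerate case, e.g. a single class and a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Same partition with renamed cluster IDs: ARI ignores the naming.
ari_same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])

# Every cluster is pure, but each class is split across two clusters.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
clusters = [0, 0, 1, 1, 2, 2, 3, 3]
ari_split = adjusted_rand_index(y_true, clusters)

print(round(ari_same, 3))   # 1.0
print(round(ari_split, 3))  # 0.364
```

In the second case a majority-vote mapping would score 100% accuracy, yet ARI is only about 0.36 because ARI penalizes splitting a class across clusters. This is one plausible reason a fold can show high mapped accuracy together with a much lower ARI when the number of clusters exceeds the number of classes.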

# The Dataset: Phishing Websites Dataset (UCI Machine Learning Repository)

The Phishing Websites dataset, contributed by Rami Mohammad and Lee McCluskey in 2015, is designed for classifying websites as phishing or legitimate. It comprises 11,055 instances and 30 features, all integer-coded with values in {-1, 0, 1}. The target variable, `Result`, is binary: `1` for legitimate and `-1` for phishing. The value `0` (suspicious) appears only in some feature columns, not in the target. The dataset was sourced from PhishTank, MillerSmiles, and Google's search operators, and it contains no missing values.

Each feature in the dataset corresponds to a specific attribute of a website's URL or domain, such as:

- Presence of an IP address in the URL
- URL length
- Use of shortening services
- Presence of the "@" symbol
- Double slash redirection
- Prefix or suffix in the domain
- Subdomain presence
- SSL certificate status
- Domain registration length
- Favicon presence

These features are instrumental in distinguishing phishing websites from legitimate ones. For a detailed description of each feature, refer to the [Phishing Websites Features document](https://archive.ics.uci.edu/ml/machine-learning-databases/00327/Phishing%20Websites%20Features.docx).

This dataset is widely used in cybersecurity research and machine learning applications aimed at enhancing web security.

For more information or to download the dataset, visit the [UCI Machine Learning Repository's Phishing Websites page](https://archive.ics.uci.edu/dataset/327/phishing+websites).


In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install liac-arff --quiet

import arff
import pandas as pd
from sklearn.model_selection import train_test_split

# === Path to the ARFF file inside the extracted ZIP
arff_path = "/content/phishing_dataset/Training Dataset.arff"

# === Load ARFF
with open(arff_path, 'r') as f:
    dataset = arff.load(f)

# === Convert to DataFrame
df = pd.DataFrame(dataset['data'], columns=[attr[0] for attr in dataset['attributes']])

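# === Convert target column ('Result' is the last column)
# Note: this cell encodes phishing as 1; the cross-validation cell below uses the opposite coding ({-1: 0, 1: 1}).
df['Result'] = df['Result'].astype(int).map({1: 0, -1: 1})  # 0 = Legitimate, 1 = Phishing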

# === Split to Train / Validation / Test
train_df, temp_df = train_test_split(df, test_size=0.4, stratify=df['Result'], random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, stratify=temp_df['Result'], random_state=42)

print(f"Train size: {train_df.shape}")
print(f"Validation size: {val_df.shape}")
print(f"Test size: {test_df.shape}")


Train size: (6633, 31)
Validation size: (2211, 31)
Test size: (2211, 31)


In [None]:
# Display first 5 rows
print("Columns:", df.columns.tolist())
print("\nValue counts for 'Result':\n", df['Result'].value_counts())
df.head()

Columns: ['having_IP_Address', 'URL_Length', 'Shortining_Service', 'having_At_Symbol', 'double_slash_redirecting', 'Prefix_Suffix', 'having_Sub_Domain', 'SSLfinal_State', 'Domain_registeration_length', 'Favicon', 'port', 'HTTPS_token', 'Request_URL', 'URL_of_Anchor', 'Links_in_tags', 'SFH', 'Submitting_to_email', 'Abnormal_URL', 'Redirect', 'on_mouseover', 'RightClick', 'popUpWidnow', 'Iframe', 'age_of_domain', 'DNSRecord', 'web_traffic', 'Page_Rank', 'Google_Index', 'Links_pointing_to_page', 'Statistical_report', 'Result']

Value counts for 'Result':
 Result
 1    6157
-1    4898
Name: count, dtype: int64


Unnamed: 0,having_IP_Address,URL_Length,Shortining_Service,having_At_Symbol,double_slash_redirecting,Prefix_Suffix,having_Sub_Domain,SSLfinal_State,Domain_registeration_length,Favicon,...,popUpWidnow,Iframe,age_of_domain,DNSRecord,web_traffic,Page_Rank,Google_Index,Links_pointing_to_page,Statistical_report,Result
0,-1,1,1,1,-1,-1,-1,-1,-1,1,...,1,1,-1,-1,-1,-1,1,1,-1,-1
1,1,1,1,1,1,-1,0,1,-1,1,...,1,1,-1,-1,0,-1,1,1,1,-1
2,1,0,1,1,1,-1,-1,-1,-1,1,...,1,1,1,-1,1,-1,1,0,-1,-1
3,1,0,1,1,1,-1,-1,-1,1,1,...,1,1,-1,-1,1,-1,1,-1,1,-1
4,1,0,-1,1,1,-1,1,1,-1,1,...,-1,1,-1,-1,0,-1,1,1,1,1


In [None]:
!pip install -U scikit-learn imbalanced-learn umap-learn pycatch22 xgboost

In [None]:
import os
import numpy as np
import pandas as pd
from scipy.io import arff
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import classification_report, adjusted_rand_score, normalized_mutual_info_score
from sklearn.cluster import AgglomerativeClustering
from umap import UMAP
from scipy.stats import mode
from imblearn.over_sampling import RandomOverSampler
from xgboost import XGBClassifier
from pycatch22 import catch22_all
from sklearn.model_selection import KFold

# === Load dataset ===
data_path = "/content/phishing_dataset/Training Dataset.arff"
data, meta = arff.loadarff(data_path)
df = pd.DataFrame(data)
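# Decode byte strings column by column (DataFrame.applymap is deprecated in recent pandas)
df = df.apply(lambda col: col.map(lambda x: x.decode() if isinstance(x, bytes) else x))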
df["Result"] = df["Result"].astype(int)

# === Prepare X and y ===
X = df.drop(columns=["Result"]).astype(float)
y = df["Result"].astype(int)
y_mapped = y.replace({-1:0, 1:1}).values

# === Initialize KFold ===
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# === Function to map clusters to labels ===
def map_clusters_to_labels(true_labels, pred_clusters):
    mapping = {}
    for cluster_id in np.unique(pred_clusters):
        majority = mode(true_labels[pred_clusters == cluster_id], keepdims=True).mode[0]
        mapping[cluster_id] = majority
    return np.vectorize(mapping.get)(pred_clusters)

# === Store metrics per fold ===
fold_metrics = []

for fold, (train_idx, test_idx) in enumerate(kf.split(X), 1):
    print(f"\n=== Fold {fold} ===")

    # Split train/test
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y_mapped[train_idx], y_mapped[test_idx]

    # Oversample training only
    ros = RandomOverSampler(random_state=42)
    X_train_bal, y_train_bal = ros.fit_resample(X_train, y_train)
    X_train_bal = pd.DataFrame(X_train_bal, columns=X_train.columns)

    # Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train_bal)
    X_test_scaled = scaler.transform(X_test)

    # Catch22 features
    catch22_train = np.array([catch22_all(row.values)["values"] for _, row in X_train_bal.iterrows()])
    catch22_test = np.array([catch22_all(row.values)["values"] for _, row in X_test.iterrows()])

    # XGBoost embedding - train classifier on balanced train data
    xgb = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1, eval_metric='logloss', random_state=42)
    xgb.fit(X_train_scaled, y_train_bal)
    train_leaves = xgb.apply(X_train_scaled)
    test_leaves = xgb.apply(X_test_scaled)

    encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
    train_leaf_onehot = encoder.fit_transform(train_leaves)
    test_leaf_onehot = encoder.transform(test_leaves)

    # Combine features
    X_train_combined = np.hstack([train_leaf_onehot, catch22_train])
    X_test_combined = np.hstack([test_leaf_onehot, catch22_test])

    # UMAP reduction (fit on train, transform test)
    umap_model = UMAP(n_components=20, n_neighbors=30, min_dist=0.0, random_state=42)
    X_train_reduced = umap_model.fit_transform(X_train_combined)
    X_test_reduced = umap_model.transform(X_test_combined)

    # Agglomerative clustering on test set
    agglo = AgglomerativeClustering(n_clusters=5, linkage='ward')
    clusters_test = agglo.fit_predict(X_test_reduced)

    # Map clusters to labels using test true labels
    mapped_test_preds = map_clusters_to_labels(y_test, clusters_test)

    # Evaluation
    report = classification_report(y_test, mapped_test_preds, output_dict=True, zero_division=0)
    ari = adjusted_rand_score(y_test, clusters_test)
    nmi = normalized_mutual_info_score(y_test, clusters_test)

    print(classification_report(y_test, mapped_test_preds, zero_division=0))
    print(f"ARI: {ari:.3f}")
    print(f"NMI: {nmi:.3f}")

    fold_metrics.append({
        "fold": fold,
        "accuracy": report["accuracy"],
        "macro_f1": report["macro avg"]["f1-score"],
        "ARI": ari,
        "NMI": nmi
    })

# === Summary ===
df_metrics = pd.DataFrame(fold_metrics)
print("\n=== K-Fold Cross Validation Summary ===")
print(df_metrics)
print("Mean metrics:")
print(df_metrics.mean())





=== Fold 1 ===


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
  warn(


              precision    recall  f1-score   support

           0       0.92      0.90      0.91       956
           1       0.93      0.94      0.93      1255

    accuracy                           0.92      2211
   macro avg       0.92      0.92      0.92      2211
weighted avg       0.92      0.92      0.92      2211

ARI: 0.389
NMI: 0.436

=== Fold 2 ===




              precision    recall  f1-score   support

           0       0.89      0.86      0.88       950
           1       0.90      0.92      0.91      1261

    accuracy                           0.89      2211
   macro avg       0.89      0.89      0.89      2211
weighted avg       0.89      0.89      0.89      2211

ARI: 0.325
NMI: 0.394

=== Fold 3 ===




              precision    recall  f1-score   support

           0       0.92      0.85      0.88       997
           1       0.89      0.94      0.91      1214

    accuracy                           0.90      2211
   macro avg       0.90      0.90      0.90      2211
weighted avg       0.90      0.90      0.90      2211

ARI: 0.379
NMI: 0.413

=== Fold 4 ===




              precision    recall  f1-score   support

           0       0.93      0.90      0.92       985
           1       0.92      0.95      0.93      1226

    accuracy                           0.93      2211
   macro avg       0.93      0.92      0.93      2211
weighted avg       0.93      0.93      0.93      2211

ARI: 0.431
NMI: 0.463

=== Fold 5 ===




              precision    recall  f1-score   support

           0       0.91      0.86      0.89      1010
           1       0.89      0.93      0.91      1201

    accuracy                           0.90      2211
   macro avg       0.90      0.90      0.90      2211
weighted avg       0.90      0.90      0.90      2211

ARI: 0.357
NMI: 0.406

=== K-Fold Cross Validation Summary ===
   fold  accuracy  macro_f1       ARI       NMI
0     1  0.922659  0.921047  0.389057  0.436229
1     2  0.894618  0.891972  0.324911  0.394130
2     3  0.899593  0.897823  0.379238  0.413330
3     4  0.926278  0.925128  0.430683  0.463483
4     5  0.898688  0.897450  0.356836  0.405711
Mean metrics:
fold        3.000000
accuracy    0.908367
macro_f1    0.906684
ARI         0.376145
NMI         0.422577
dtype: float64
