# Unsupervised Template – Clustering & Dimensionality Reduction

This notebook is a reusable template for **unsupervised tabular problems**, focused on:

- **Clustering** (KMeans, hierarchical, DBSCAN)
- **Dimensionality Reduction** (PCA, optional t-SNE/UMAP)
- Understanding structure **without labels**

You can copy this notebook into any project where you want to:
- Discover natural groups (segments, player archetypes, customer cohorts)
- Visualize high-dimensional data in 2D
- Build features for downstream supervised models

---

## 🔁 High-Level Workflow (Unsupervised)

1. **Imports & config**
2. **Load data**
3. **Column typing & selection of features for unsupervised work**
4. **Basic EDA (without target)**
5. **Scaling & Dimensionality Reduction**
   - PCA (core)
   - t-SNE / UMAP (optional; slower, but often better at revealing nonlinear structure in 2D plots)
6. **Clustering**
   - KMeans (baseline)
   - Optional: hierarchical / DBSCAN
7. **Cluster Evaluation & Interpretation**
   - Elbow / silhouette
   - Cluster profiles (feature means per cluster)
8. **Save cluster assignments & embeddings** for downstream use


In [None]:
# ========== 1. Imports & Config (Unsupervised: Clustering + DimRed) ==========

from pathlib import Path
from typing import Optional, List

from IPython.display import display  # explicit import so the helpers also work outside a notebook cell

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score

pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)
sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)
plt.rcParams["figure.dpi"] = 100

# ---- Config (edit per dataset) ----
DATA_DIR = Path("../input")
DATA_FILE = "data.csv"       # change to your dataset
ID_COL = "id"                # optional, set to None if not applicable

RANDOM_STATE = 42

# Which columns to use for unsupervised analysis
# - You can leave as None to auto-detect numeric columns
UNSUPERVISED_FEATURES: Optional[List[str]] = None


In [None]:
# ========== 2. Load Data & Helper Functions ==========

def load_data(data_dir: Path = DATA_DIR, data_file: str = DATA_FILE) -> pd.DataFrame:
    path = data_dir / data_file
    if not path.exists():
        raise FileNotFoundError(f"Data file not found: {path}")
    df = pd.read_csv(path)
    print("Data shape:", df.shape)
    display(df.head())
    return df


def get_numeric_features(df: pd.DataFrame, exclude: Optional[List[str]] = None) -> List[str]:
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    if exclude:
        num_cols = [c for c in num_cols if c not in exclude]
    return num_cols


def summarize_dataframe(df: pd.DataFrame, name: str = "df"):
    print(f"===== {name} summary =====")
    print("Shape:", df.shape)
    display(df.head())
    print("\nDtypes:")
    display(df.dtypes)
    print("\nMissing (%):")
    display((df.isna().mean() * 100).sort_values(ascending=False))


df = load_data()
summarize_dataframe(df, "df")


### 3️⃣ Choose Features for Unsupervised Analysis

For clustering and dimensionality reduction, we usually:

- Focus on **numeric features** (or encoded versions of categoricals)
- Exclude identifiers (`ID_COL`) and any leakage-like columns (targets, obvious labels)
- Optionally, apply some domain-based filtering (e.g., only skill metrics for players)

You can either:
- Let the notebook auto-pick numeric columns, or
- Manually set `UNSUPERVISED_FEATURES` in the config block.
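
If part of the signal lives in categorical columns, one option is to one-hot encode them before selecting features. The sketch below uses a made-up `toy_df` with hypothetical columns (`plan`, `sessions`, `spend`); none of these names come from the template.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only
toy_df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "plan": ["free", "pro", "free", "team"],
    "sessions": [3, 40, 5, 22],
    "spend": [0.0, 49.0, 0.0, 120.0],
})

# Drop the identifier, then one-hot encode the categorical column.
# Numeric columns pass through unchanged.
encoded = pd.get_dummies(toy_df.drop(columns=["id"]), columns=["plan"], dtype=float)

# Every remaining column is numeric and could serve as a clustering feature
encoded_feature_cols = encoded.columns.tolist()
print(encoded_feature_cols)
```

To use something like this in the notebook, the encoded frame would need to replace (or be joined onto) `df` before the feature-selection cell runs. Note that one-hot columns behave differently from continuous ones under distance-based clustering, so treat this as a starting point rather than a recommendation.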


In [None]:
# Determine which columns to use
exclude_cols = []
if ID_COL is not None and ID_COL in df.columns:
    exclude_cols.append(ID_COL)

if UNSUPERVISED_FEATURES is None:
    feature_cols = get_numeric_features(df, exclude=exclude_cols)
    print("Auto-selected numeric features:", feature_cols)
else:
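    # Report configured features that are missing instead of dropping them silently
    missing_cols = [c for c in UNSUPERVISED_FEATURES if c not in df.columns]
    if missing_cols:
        print("Warning: configured features not found in df:", missing_cols)
    feature_cols = [c for c in UNSUPERVISED_FEATURES if c in df.columns]
    print("Using configured features:", feature_cols)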

X_raw = df[feature_cols].copy()


### 4️⃣ Basic EDA Without Labels

Even without a target, we can:

- Look at distributions of key features
- Check correlations between features
- Spot obvious scaling differences (some features 0–1, others 0–10,000)

This informs scaling decisions and whether PCA is likely to be meaningful.
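
One quick way to spot scaling differences numerically is to compare per-feature ranges. This is a minimal sketch on a made-up frame `toy` with hypothetical columns; the "range ratio" is an ad hoc indicator chosen for illustration, not a standard statistic.

```python
import pandas as pd

# Hypothetical data: one feature on a 0-1 scale, one in the thousands
toy = pd.DataFrame({
    "ratio": [0.1, 0.5, 0.9, 0.3],
    "revenue": [120.0, 9800.0, 450.0, 3000.0],
})

# min, max and standard deviation per feature (one row per feature after transposing)
scale_summary = toy.agg(["min", "max", "std"]).T
scale_summary["range"] = scale_summary["max"] - scale_summary["min"]

# Ratio between the widest and narrowest feature ranges;
# a large value suggests unscaled distances would be dominated by one feature
range_ratio = scale_summary["range"].max() / scale_summary["range"].min()
print(scale_summary)
print("Range ratio:", range_ratio)
```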


In [None]:
# Histograms for a sample of features
sample_feats = feature_cols[:10]  # adjust or slice more
X_raw[sample_feats].hist(bins=30, figsize=(14, 8))
plt.suptitle("Feature distributions (subset)", y=1.02)
plt.show()

# Correlation heatmap (subset if many features)
corr_sample_cols = feature_cols[:20]
corr = X_raw[corr_sample_cols].corr()
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation heatmap (subset of features)")
plt.show()


### 5️⃣ Scaling & PCA (Core Dimensionality Reduction)

Most clustering algorithms (especially KMeans) are **distance-based**, so scaling matters.

We will:

1. Apply **StandardScaler** to numeric features
2. Fit **PCA** to capture main variance directions
3. Inspect explained variance to choose number of components
4. Create a 2D PCA embedding for visualization & clustering

You can later replace/augment this with t-SNE/UMAP for nicer visuals.
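
For step 3, a common convention is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. The sketch below assumes a 90% target and uses made-up variance ratios standing in for `pca.explained_variance_ratio_`.

```python
import numpy as np

# Hypothetical explained-variance ratios, sorted in descending order as PCA returns them
ratios = np.array([0.45, 0.25, 0.15, 0.08, 0.04, 0.03])
target = 0.90  # assumed threshold; adjust to taste

cumulative = np.cumsum(ratios)

# Index of the first cumulative value >= target, converted to a component count
n_components = int(np.searchsorted(cumulative, target) + 1)
print(n_components)
```

As far as I know, scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components` (for example `PCA(n_components=0.90)`) and selects the count in a similar way, which may be simpler when no inspection plot is needed.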


In [None]:
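# Scale features
# PCA and KMeans reject NaN, so fill missing values first.
# Median imputation is only a simple default; adjust per dataset.
X_filled = X_raw.fillna(X_raw.median())
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_filled)
print("Scaled shape:", X_scaled.shape)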

# PCA
pca = PCA(n_components=min(20, X_scaled.shape[1]))
X_pca = pca.fit_transform(X_scaled)
explained_var = pca.explained_variance_ratio_
print("Explained variance ratio (first 10):", explained_var[:10])
print("Cumulative explained variance (first 10):", np.cumsum(explained_var[:10]))

plt.plot(np.cumsum(explained_var))
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.title("PCA – Cumulative explained variance")
plt.grid(True)
plt.show()

# 2D PCA for visualization
X_pca2 = X_pca[:, :2]
pca_df = pd.DataFrame(X_pca2, columns=["PC1", "PC2"])
pca_df.head()


### (Optional) t-SNE / UMAP for Nonlinear Structure

PCA is linear. For more complex manifolds, you can try:

- **t-SNE** (good for local structure, small/medium datasets)
- **UMAP** (often faster than t-SNE and tends to preserve more global structure; requires the separate `umap-learn` package)

Both are mainly for **visualization**, not for modeling directly.
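
Because t-SNE gets slow as the row count grows, one workaround is to embed a random subsample and keep the sampled indices so the embedding can be mapped back to the original rows. This is a sketch on synthetic data; `MAX_TSNE_ROWS` and `X_demo` are hypothetical names, and the cap of 5,000 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50_000, 8))  # stand-in for a large scaled feature matrix

MAX_TSNE_ROWS = 5_000  # hypothetical cap on rows passed to t-SNE

n_rows = X_demo.shape[0]
if n_rows > MAX_TSNE_ROWS:
    # Sample without replacement; sorting keeps the original row order
    sample_idx = np.sort(rng.choice(n_rows, size=MAX_TSNE_ROWS, replace=False))
else:
    sample_idx = np.arange(n_rows)

X_demo_sample = X_demo[sample_idx]
print(X_demo_sample.shape)
```

The subsample would then be passed to `TSNE.fit_transform` in place of the full matrix, with `sample_idx` used to look up the matching rows of the source data.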

Below is an optional t-SNE block (can be slow on large data).

In [None]:
RUN_TSNE = False  # set to True if you want to run t-SNE (may be slow)

tsne_df = None
if RUN_TSNE:
    tsne = TSNE(
        n_components=2,
        perplexity=30,
        learning_rate="auto",
        init="pca",
        random_state=RANDOM_STATE,
    )
    X_tsne = tsne.fit_transform(X_scaled)
    tsne_df = pd.DataFrame(X_tsne, columns=["TSNE1", "TSNE2"])
    plt.scatter(tsne_df["TSNE1"], tsne_df["TSNE2"], s=5, alpha=0.7)
    plt.title("t-SNE embedding (no clusters yet)")
    plt.show()


### 6️⃣ KMeans Clustering on PCA Embedding

We start with **KMeans** as a baseline clustering algorithm.

Steps:

1. Choose a range for `k` (number of clusters)
2. Fit KMeans for each `k` on PCA-reduced data
3. Inspect:
   - Inertia (elbow method)
   - Silhouette score (cluster separation)
4. Pick a reasonable `k` and refit

We cluster on the leading principal components (the first 10 by default) rather than on the raw features, to reduce noise and speed up clustering. The 2D PCA projection is used only for plotting.
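
If you prefer a programmatic starting point for `k` over reading the plot by eye, one simple rule is to take the `k` with the highest silhouette score. The sketch below runs on synthetic blobs with three well-separated centers (an assumption made so the answer is unambiguous); real data rarely behaves this cleanly, so treat the result as a suggestion to check against the elbow plot and domain knowledge.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: three tight, well-separated blobs
X_blobs, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Silhouette score for each candidate k
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_blobs)
    scores[k] = silhouette_score(X_blobs, labels)

# Pick the k with the highest mean silhouette
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```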


In [None]:
# We'll cluster on the first N PCA components (not just 2D)
N_PCS_FOR_CLUSTERING = min(10, X_pca.shape[1])
X_pca_for_clustering = X_pca[:, :N_PCS_FOR_CLUSTERING]

k_range = range(2, 11)
inertias = []
sil_scores = []

for k in k_range:
    km = KMeans(n_clusters=k, random_state=RANDOM_STATE, n_init="auto")
    labels = km.fit_predict(X_pca_for_clustering)
    inertias.append(km.inertia_)
    sil = silhouette_score(X_pca_for_clustering, labels)
    sil_scores.append(sil)

fig, ax1 = plt.subplots()
ax1.plot(list(k_range), inertias, marker="o")
ax1.set_xlabel("k (number of clusters)")
ax1.set_ylabel("Inertia", color="tab:blue")
ax1.tick_params(axis="y", labelcolor="tab:blue")
ax1.set_title("KMeans: Inertia & Silhouette vs k")

ax2 = ax1.twinx()
ax2.plot(list(k_range), sil_scores, marker="s", color="tab:red")
ax2.set_ylabel("Silhouette score", color="tab:red")
ax2.tick_params(axis="y", labelcolor="tab:red")

plt.show()

print("k values:", list(k_range))
print("Inertias:", inertias)
print("Silhouette scores:", sil_scores)


In [None]:
# Choose k based on elbow/silhouette (edit this)
BEST_K = 4

kmeans_final = KMeans(n_clusters=BEST_K, random_state=RANDOM_STATE, n_init="auto")
cluster_labels = kmeans_final.fit_predict(X_pca_for_clustering)

pca_df["cluster"] = cluster_labels

plt.scatter(pca_df["PC1"], pca_df["PC2"], c=pca_df["cluster"], cmap="tab10", s=10, alpha=0.8)
plt.title(f"KMeans clustering (k={BEST_K}) shown on 2D PCA")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.colorbar(label="Cluster")
plt.show()

# Attach clusters back to original df
df_with_clusters = df.copy()
df_with_clusters["cluster"] = cluster_labels
display(df_with_clusters.head())


### 7️⃣ Cluster Profiles & Interpretation

To understand what each cluster represents, we:

- Compute **mean/median** of each feature per cluster
- Look for patterns (e.g., cluster 0 = high value customers, cluster 1 = low activity)
- Optionally, visualize distributions per cluster
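
Means can be pulled around by outliers, so it often helps to look at cluster sizes and medians alongside them. A minimal sketch on a made-up frame `toy_clusters` (hypothetical columns `spend` and `visits`):

```python
import pandas as pd

toy_clusters = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1],
    "spend": [100.0, 120.0, 5.0, 8.0, 500.0],  # 500.0 is an outlier in cluster 1
    "visits": [10, 12, 1, 2, 3],
})

# How many rows fall in each cluster
sizes = toy_clusters["cluster"].value_counts().sort_index()

# Mean and median side by side; a large gap hints at skew or outliers
profile = toy_clusters.groupby("cluster")[["spend", "visits"]].agg(["mean", "median"])
print(sizes)
print(profile)
```

In this toy example the mean spend of cluster 1 is far above its median, which suggests a single large value is driving the average.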


In [None]:
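# Cluster profiles in original units (easy to read, but not comparable across features)
cluster_summary = df_with_clusters.groupby("cluster")[feature_cols].mean().T
display(cluster_summary)

# For the heatmap, use the standardized features so that center=0 means "overall average"
scaled_df = pd.DataFrame(X_scaled, columns=feature_cols, index=df_with_clusters.index)
cluster_summary_scaled = scaled_df.groupby(df_with_clusters["cluster"]).mean().T

plt.figure(figsize=(12, 6))
sns.heatmap(cluster_summary_scaled, cmap="coolwarm", center=0)
plt.title("Cluster feature means (standard scale)")
plt.show()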


### (Optional) Other Clustering Methods

Once KMeans is working, you can try:

- **AgglomerativeClustering** (hierarchical):
  - `AgglomerativeClustering(n_clusters=BEST_K, linkage="ward")`
- **DBSCAN** for density-based clusters (no `k` to choose, but `eps` and `min_samples` need tuning):
  - `DBSCAN(eps=0.5, min_samples=10)`

These can capture shapes and densities that KMeans misses.
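
DBSCAN's `eps` depends on the scale and dimensionality of the data, so a fixed value rarely transfers between datasets. A commonly used heuristic is to sort each point's distance to its k-th nearest neighbor and look for a bend in that curve. The sketch below computes the sorted distances on synthetic data; picking the 90th percentile as a candidate is my own crude stand-in for eyeballing the bend, not an established rule.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Synthetic data: two compact groups in 2D
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(100, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(100, 2)),
])

min_samples = 10

# The query points are the training points, so each point's first neighbor is itself.
# This matches DBSCAN, where min_samples counts the point itself.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)

# Sorted distance to the farthest of those neighbors; plot this to look for a bend
k_distances = np.sort(distances[:, -1])

# Assumption: use a high percentile as a rough starting value for eps
eps_candidate = float(np.quantile(k_distances, 0.90))
print(round(eps_candidate, 3))
```

Plotting `k_distances` (for example with `plt.plot`) and choosing `eps` near the point where the curve turns sharply upward is the usual manual version of this step.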


In [None]:
RUN_EXTRA_CLUSTERING = False

if RUN_EXTRA_CLUSTERING:
    # Hierarchical
    agg = AgglomerativeClustering(n_clusters=BEST_K)
    labels_agg = agg.fit_predict(X_pca_for_clustering)
    plt.scatter(pca_df["PC1"], pca_df["PC2"], c=labels_agg, cmap="tab10", s=10, alpha=0.8)
    plt.title("Agglomerative clustering shown on 2D PCA")
    plt.show()

    # DBSCAN (eps is scale-dependent; 0.5 is only a starting point and usually needs tuning)
    db = DBSCAN(eps=0.5, min_samples=10)
    labels_db = db.fit_predict(X_pca_for_clustering)
    plt.scatter(pca_df["PC1"], pca_df["PC2"], c=labels_db, cmap="tab20", s=10, alpha=0.8)
    plt.title("DBSCAN clustering shown on 2D PCA (noise = -1)")
    plt.show()


### 8️⃣ Saving Embeddings & Cluster Assignments

You typically want to **reuse**:

- `df_with_clusters` (original data + cluster labels) as features in supervised models
- `pca_df` (2D embedding) for visualization
- `X_pca` (higher-dimensional embedding) as an alternative feature space
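
To reuse a higher-dimensional embedding downstream, it helps to store it with named columns and an identifier so it can be joined back to the source rows. A minimal sketch with made-up values; `ids`, `X_emb` and the file name `pca_embedding.csv` are hypothetical, and a temporary directory is used so nothing is written into the project.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical identifiers and a 3 x 4 embedding standing in for X_pca
ids = pd.Series([101, 102, 103], name="id")
X_emb = np.arange(12, dtype=float).reshape(3, 4)

# Name the components PC1..PCn and put the identifier first
emb_df = pd.DataFrame(X_emb, columns=[f"PC{i + 1}" for i in range(X_emb.shape[1])])
emb_df.insert(0, "id", ids.to_numpy())

# Write to a temporary directory and read back to confirm the round trip
out_path = Path(tempfile.mkdtemp()) / "pca_embedding.csv"
emb_df.to_csv(out_path, index=False)

reloaded = pd.read_csv(out_path)
print(reloaded.columns.tolist())
```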


In [None]:
OUTPUT_DIR = Path("./unsupervised_outputs")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

df_with_clusters.to_csv(OUTPUT_DIR / "data_with_clusters.csv", index=False)
pca_out = pca_df.copy()
pca_out.to_csv(OUTPUT_DIR / "pca_2d_with_clusters.csv", index=False)

print("Saved:")
print(" - data_with_clusters.csv")
print(" - pca_2d_with_clusters.csv")
