# Dimensionality Reduction & Clustering Playground

This is a **hands-on learning notebook** for exploring high-dimensional data using:

- **PCA** (Principal Component Analysis)
- **t-SNE** (t-distributed Stochastic Neighbor Embedding)
- **UMAP** (Uniform Manifold Approximation and Projection)
- **Clustering** (KMeans, DBSCAN)

Goals:
- Build intuition for what PCA, t-SNE, and UMAP are *doing*.
- Understand when to use each method.
- See how clustering interacts with these embeddings.
- Provide a reusable playground for your own datasets (baseball, weather, Kaggle, etc.).


## 0. High-Level Workflow

1. **Load data**
   - Either from a CSV (your dataset) or from a built-in demo dataset.
2. **Basic EDA**
   - Shape, distributions, correlations.
3. **Preprocessing**
   - Select numeric features.
   - Standardize features.
4. **PCA**
   - Explained variance (scree plot).
   - 2D PCA scatter.
5. **t-SNE**
   - Nonlinear 2D embedding for visualization.
   - Parameters: `perplexity`, `learning_rate`.
6. **UMAP** (if available)
   - Nonlinear 2D embedding.
   - Parameters: `n_neighbors`, `min_dist`.
7. **Clustering**
   - KMeans on PCA and/or UMAP.
   - Elbow & silhouette heuristics.
   - DBSCAN for density-based clusters.
8. **Use in ML**
   - How to fold embeddings & cluster labels back into your ML pipeline.


In [None]:
# ========== 1. Imports & Config ==========

from pathlib import Path
from typing import Optional, Tuple

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.datasets import load_iris, make_blobs, make_circles

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
sns.set(style='whitegrid')
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['figure.dpi'] = 100

# UMAP is optional; we check availability.
try:
    import umap
    UMAP_AVAILABLE = True
except ImportError:
    UMAP_AVAILABLE = False
    print('UMAP is not installed. Install with `pip install umap-learn` to enable UMAP sections.')

# Random state for reproducibility
RANDOM_STATE = 42


## 2. Choose a Data Source

Two options:

1. **Your dataset (CSV)**
   - Set `DATA_MODE = 'csv'` and provide path + feature columns.
2. **Built-in demo dataset** (easier for quick experiments):
   - `iris` – classic 4D flower dataset (3 classes).
   - `blobs` – synthetic Gaussian clusters.
   - `circles` – concentric circles (nonlinear structure).

Advice:
- Start with a **demo** dataset to get a feel.
- Then plug in your **baseball stats, weather features, or Kaggle data** using CSV mode.


In [None]:
# ========== 2.1 Config: pick data mode ==========

DATA_MODE = 'iris'  # options: 'iris', 'blobs', 'circles', 'csv'

# If using CSV mode, set these:
DATA_DIR = Path('../input')
CSV_FILE = 'your_data.csv'  # change this

# If your CSV has a label column (e.g., position, cluster, etc.), set it here.
CSV_LABEL_COL: Optional[str] = None  # e.g. 'position', 'class', or None

# If None, we'll use all numeric columns. Otherwise, specify a list.
CSV_FEATURE_COLS: Optional[list] = None


In [None]:
# ========== 2.2 Helper: load data depending on DATA_MODE ==========

def load_demo_data(mode: str) -> Tuple[pd.DataFrame, Optional[pd.Series]]:
    """Return (X_df, y_series or None) for demo datasets."""
    if mode == 'iris':
        iris = load_iris()
        X = pd.DataFrame(iris.data, columns=iris.feature_names)
        y = pd.Series(iris.target, name='class')
        return X, y
    elif mode == 'blobs':
        X_array, y_array = make_blobs(n_samples=800, centers=4, n_features=5,
                                     random_state=RANDOM_STATE, cluster_std=1.2)
        X = pd.DataFrame(X_array, columns=[f'feat_{i}' for i in range(X_array.shape[1])])
        y = pd.Series(y_array, name='cluster')
        return X, y
    elif mode == 'circles':
        X_array, y_array = make_circles(n_samples=800, factor=0.5, noise=0.05,
                                       random_state=RANDOM_STATE)
        X = pd.DataFrame(X_array, columns=['x1', 'x2'])
        y = pd.Series(y_array, name='circle')
        return X, y
    else:
        raise ValueError(f'Unknown demo mode: {mode}')


def load_csv_data(
    data_dir: Path = DATA_DIR,
    csv_file: str = CSV_FILE,
    label_col: Optional[str] = CSV_LABEL_COL,
    feature_cols: Optional[list] = CSV_FEATURE_COLS,
) -> Tuple[pd.DataFrame, Optional[pd.Series]]:
    path = data_dir / csv_file
    if not path.exists():
        raise FileNotFoundError(f'CSV file not found: {path}')
    df = pd.read_csv(path)
    print('Loaded CSV shape:', df.shape)
    display(df.head())

    y = None
    if label_col is not None and label_col in df.columns:
        y = df[label_col]
    
    if feature_cols is None:
        X = df.select_dtypes(include=[np.number])
        if label_col is not None and label_col in X.columns:
            X = X.drop(columns=[label_col])
    else:
        missing = [c for c in feature_cols if c not in df.columns]
        if missing:
            raise ValueError(f'Feature columns not in CSV: {missing}')
        X = df[feature_cols]

    return X, y


# ---- Load data depending on mode ----
if DATA_MODE in ['iris', 'blobs', 'circles']:
    X, y = load_demo_data(DATA_MODE)
    print(f'Using demo dataset: {DATA_MODE}')
else:
    X, y = load_csv_data()
    print('Using CSV dataset.')

print('\nFeature matrix shape:', X.shape)
if y is not None:
    print('Labels shape:', y.shape, '| n classes (approx):', y.nunique())


## 3. Quick EDA on Features

We'll:
- Inspect basic statistics.
- Look at distributions of a few features.
- (Optionally) check pairwise correlations.

This step is about **building intuition** about the space we're going to compress.


In [None]:
# ========== 3.1 Basic stats ==========

display(X.head())
display(X.describe())


In [None]:
# ========== 3.2 Histograms for a subset of features ==========

num_cols = list(X.columns)
max_cols = min(6, len(num_cols))

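# Up to 3 plots per row; round the row count up so every selected feature gets an axis.
n_plot_cols = min(3, max_cols)
n_plot_rows = int(np.ceil(max_cols / n_plot_cols))
fig, axes = plt.subplots(n_plot_rows, n_plot_cols, figsize=(16, 3 * n_plot_rows), squeeze=False)
axes = axes.ravel()
for ax, col in zip(axes, num_cols[:max_cols]):
    ax.hist(X[col], bins=30)
    ax.set_title(col)
# Hide any unused axes in the last row.
for ax in axes[max_cols:]:
    ax.set_visible(False)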
plt.suptitle('Feature histograms (subset)')
plt.tight_layout()
plt.show()


In [None]:
# ========== 3.3 Correlation heatmap (if not too many features) ==========

if X.shape[1] <= 20:
    corr = X.corr()
    plt.figure(figsize=(8, 6))
    sns.heatmap(corr, cmap='coolwarm', center=0)
    plt.title('Feature correlation heatmap')
    plt.show()
else:
    print('More than 20 features; skipping full correlation heatmap for readability.')


## 4. Standardize Features

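Most dimensionality reduction and clustering methods assume features are on a comparable scale.
- **PCA** maximizes variance, so a feature with a large numeric range can dominate the leading components.
- **KMeans** uses Euclidean distance, so large-scale features dominate the distances as well.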

We'll standardize each feature to have roughly:
- Mean 0
- Standard deviation 1
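
To make the transformation concrete, here is a minimal sketch that standardizes a small array by hand and compares it with `StandardScaler`. The toy values are made up for illustration. Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): two features on very different scales.
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 2000.0],
                  [3.0, 6000.0]])

# Manual z-score: subtract the column mean, divide by the column std (ddof=0).
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# The same operation through scikit-learn.
X_sklearn = StandardScaler().fit_transform(X_toy)

print('Manual and StandardScaler agree:', np.allclose(X_manual, X_sklearn))
```

After scaling, both columns contribute on equal footing to any distance or variance computation.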


In [None]:
# ========== 4. Standardize features ==========

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print('Scaled feature matrix shape:', X_scaled.shape)


## 5. PCA – Linear Dimensionality Reduction

PCA finds **orthogonal directions** (principal components) that explain the maximum variance.

Use PCA when:
- You want a **fast, linear** dimensionality reduction.
- You care about variance explained and interpretability.
- You want a **baseline embedding** before t-SNE/UMAP.
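
For intuition about what PCA computes, the sketch below reproduces the explained-variance ratios with a plain SVD of the centered data and compares them with scikit-learn. It uses the iris demo data as a stand-in; the `_demo` and `_manual` variable names are illustrative and not part of the notebook's own cells.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_demo = StandardScaler().fit_transform(load_iris().data)

# PCA operates on centered data (scaling already centers it; repeated for clarity).
X_centered = X_demo - X_demo.mean(axis=0)

# SVD: rows of Vt are the principal directions; squared singular values
# are proportional to the variance captured along each direction.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_variance = S**2 / (X_centered.shape[0] - 1)
explained_ratio_manual = explained_variance / explained_variance.sum()

# Projecting the data onto the directions gives the component scores.
scores_manual = X_centered @ Vt.T

pca_demo = PCA(n_components=X_demo.shape[1]).fit(X_demo)
print('Manual :', np.round(explained_ratio_manual, 3))
print('sklearn:', np.round(pca_demo.explained_variance_ratio_, 3))
```

The scores can differ from scikit-learn's by a sign per component, because the direction of each principal axis is arbitrary.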


In [None]:
# ========== 5.1 Fit PCA and inspect explained variance ==========

pca = PCA(n_components=min(10, X_scaled.shape[1]))
X_pca = pca.fit_transform(X_scaled)

explained = pca.explained_variance_ratio_
print('Explained variance ratio by component:')
for i, v in enumerate(explained):
    print(f'PC{i+1}: {v:.3f}')

plt.plot(np.arange(1, len(explained) + 1), np.cumsum(explained), marker='o')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
plt.title('PCA cumulative explained variance')
plt.grid(True)
plt.show()


In [None]:
# ========== 5.2 2D PCA scatter ==========

pc1, pc2 = X_pca[:, 0], X_pca[:, 1]

plt.figure(figsize=(8, 6))
if y is not None:
    # Encode labels as integer codes so string labels also work as colors.
    scatter = plt.scatter(pc1, pc2, c=pd.Categorical(y).codes, cmap='tab10', alpha=0.7)
    plt.legend(*scatter.legend_elements(), title='Classes', loc='best')
else:
    plt.scatter(pc1, pc2, alpha=0.7)

plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('PCA (2D)')
plt.show()


**Interpretation tips:**

- Separated blobs in PC1/PC2 space suggest linear structure you can exploit.
- Strong overlap may hint at more complex / nonlinear structure.
- For interpretability, you can inspect PCA loadings (how each feature contributes to each component).
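
A minimal sketch of inspecting loadings, using the iris demo data as a stand-in. Here "loadings" means the raw entries of `pca.components_`; some texts instead scale them by the square root of the explained variance, so treat the naming as an assumption.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris_demo = load_iris()
X_demo = StandardScaler().fit_transform(iris_demo.data)
pca_demo = PCA(n_components=2).fit(X_demo)

# Rows of `components_` are unit-length directions in feature space;
# transpose so each row is a feature and each column a component.
loadings = pd.DataFrame(
    pca_demo.components_.T,
    index=iris_demo.feature_names,
    columns=['PC1', 'PC2'],
)
print(loadings.round(3))

# Features with the largest absolute weight contribute most to a component.
top_pc1 = loadings['PC1'].abs().sort_values(ascending=False)
print(top_pc1.round(3))
```

On your own data, replace `iris_demo.feature_names` with `X.columns` and reuse the fitted `pca` object from the cell above.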


## 6. t-SNE – Nonlinear Local Structure Visualization

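t-SNE is best used **only for visualization**, not as a feature generator for downstream models: scikit-learn's `TSNE` has no `transform` method, so new samples cannot be mapped into an existing embedding.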

Key properties:
- Preserves **local neighborhood** structure.
- Can reveal clusters that are not linearly separable.
- Sensitive to **hyperparameters** and randomness.

Main hyperparameters:
- `perplexity` (how many neighbors each point sees; often 5–50).
- `learning_rate` (often 50–1000; start with 'auto').
- `max_iter` (named `n_iter` before scikit-learn 1.5; start around 1000–2000).
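
One way to compare hyperparameter settings beyond eyeballing the plots is scikit-learn's `trustworthiness` score. The sketch below runs t-SNE on the iris demo data at two perplexities; the choice of perplexities and `n_neighbors=5` is illustrative, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE, trustworthiness
from sklearn.preprocessing import StandardScaler

X_demo = StandardScaler().fit_transform(load_iris().data)

trust_by_perplexity = {}
for perplexity in (5, 30):
    embedding = TSNE(
        n_components=2,
        perplexity=perplexity,
        init='pca',
        random_state=0,
    ).fit_transform(X_demo)
    # Score in [0, 1]; it penalizes points that are close in the embedding
    # but were not close in the original feature space.
    trust_by_perplexity[perplexity] = trustworthiness(X_demo, embedding, n_neighbors=5)

for perplexity, score in trust_by_perplexity.items():
    print(f'perplexity={perplexity:2d}  trustworthiness={score:.3f}')
```

A higher score suggests local neighborhoods are better preserved, but it says nothing about global layout, so it complements visual inspection rather than replacing it.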


In [None]:
# ========== 6.1 t-SNE 2D embedding ==========

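# scikit-learn 1.5 renamed TSNE's `n_iter` to `max_iter`; use whichever this version accepts.
import inspect
iter_kwarg = 'max_iter' if 'max_iter' in inspect.signature(TSNE).parameters else 'n_iter'

tsne = TSNE(
    n_components=2,
    perplexity=min(30, X_scaled.shape[0] - 1),  # perplexity must be < n_samples
    learning_rate='auto',
    init='pca',
    random_state=RANDOM_STATE,
    **{iter_kwarg: 1500},
)
X_tsne = tsne.fit_transform(X_scaled)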

plt.figure(figsize=(8, 6))
if y is not None:
    # Encode labels as integer codes so string labels also work as colors.
    scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=pd.Categorical(y).codes, cmap='tab10', alpha=0.7)
    plt.legend(*scatter.legend_elements(), title='Classes', loc='best')
else:
    plt.scatter(X_tsne[:, 0], X_tsne[:, 1], alpha=0.7)

plt.title('t-SNE (2D)')
plt.xlabel('Dim 1')
plt.ylabel('Dim 2')
plt.show()


**t-SNE notes:**

- Different runs with different seeds/perplexity can look quite different.
- Distances between **well-separated clusters**, and the apparent sizes of clusters, are not very meaningful.
- Great for:
  - Visualizing embeddings from neural nets (NLP, CV).
  - Inspecting cluster structure in high-dimensional tabular data.


## 7. UMAP – Nonlinear Local + Global Structure

UMAP is similar in spirit to t-SNE but often:
- Faster and more scalable.
- Preserves more of the **global** structure.
- Works well as a **preprocessing step** for clustering or downstream models.

Key hyperparameters:
- `n_neighbors`: balance local vs global (5–50; lower → more local, higher → more global).
- `min_dist`: how tightly points are packed (smaller → tighter clusters).


In [None]:
# ========== 7.1 UMAP 2D embedding (if available) ==========

if UMAP_AVAILABLE:
    reducer = umap.UMAP(
        n_neighbors=15,
        min_dist=0.1,
        n_components=2,
        random_state=RANDOM_STATE,
    )
    X_umap = reducer.fit_transform(X_scaled)

    plt.figure(figsize=(8, 6))
    if y is not None:
        # Encode labels as integer codes so string labels also work as colors.
        scatter = plt.scatter(X_umap[:, 0], X_umap[:, 1], c=pd.Categorical(y).codes, cmap='tab10', alpha=0.7)
        plt.legend(*scatter.legend_elements(), title='Classes', loc='best')
    else:
        plt.scatter(X_umap[:, 0], X_umap[:, 1], alpha=0.7)

    plt.title('UMAP (2D)')
    plt.xlabel('UMAP1')
    plt.ylabel('UMAP2')
    plt.show()
else:
    print('UMAP not available; skipping UMAP embedding.')


**PCA vs t-SNE vs UMAP (mental model):**

- PCA: linear, fast, interpretable → **baseline** / feature engineering.
- t-SNE: nonlinear, local neighborhood focus → **pretty 2D cluster visualizations**, not for features.
- UMAP: nonlinear, can be used as **features or visualization**, better global+local balance.


## 8. Clustering on Embeddings

We’ll explore:

- **KMeans**: partitions data into k clusters using distances.
  - Works best with spherical-ish clusters.
- **DBSCAN**: density-based, can find arbitrarily shaped clusters and label outliers.

Common patterns:
- Run KMeans on **PCA** or **UMAP** embeddings instead of raw features.
- Use **elbow plot** and **silhouette score** to pick k.


In [None]:
# ========== 8.1 Helper: run KMeans for multiple k on PCA ==========

k_values = [2, 3, 4, 5, 6]
inertias = []
silhouettes = []

for k in k_values:
    # Set n_init explicitly: its default changed across scikit-learn versions.
    km = KMeans(n_clusters=k, n_init=10, random_state=RANDOM_STATE)
    labels_k = km.fit_predict(X_pca[:, :2])  # use first 2 PCs for simplicity
    inertias.append(km.inertia_)
    # silhouette needs at least 2 clusters and more samples than clusters
    if len(np.unique(labels_k)) > 1 and X_pca.shape[0] > len(np.unique(labels_k)):
        sil = silhouette_score(X_pca[:, :2], labels_k)
    else:
        sil = np.nan
    silhouettes.append(sil)

print('k | inertia | silhouette')
for k, inn, sil in zip(k_values, inertias, silhouettes):
    print(f'{k:2d} | {inn:8.1f} | {sil:9.3f}')

fig, ax1 = plt.subplots()
ax1.plot(k_values, inertias, marker='o', color='b')
ax1.set_xlabel('k')
ax1.set_ylabel('Inertia', color='b')
ax1.tick_params(axis='y', labelcolor='b')

ax2 = ax1.twinx()
ax2.plot(k_values, silhouettes, marker='s', color='g')
ax2.set_ylabel('Silhouette', color='g')
ax2.tick_params(axis='y', labelcolor='g')

plt.title('KMeans on PCA(2D): inertia & silhouette vs k')
plt.show()


**Choosing k:**

- Look for an **elbow** in inertia (where improvement slows).
- Prefer k where **silhouette** is relatively high.
- Also consider domain meaning (e.g., player archetypes, weather regimes).
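
As a starting point, the silhouette criterion can be automated. The sketch below uses synthetic blobs with four deliberately well-separated centers (an illustrative assumption) and picks the k with the highest silhouette score. Real data is rarely this clean, so treat the result as a suggestion to check against the elbow plot and domain knowledge.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated centers (made up for illustration).
X_demo, _ = make_blobs(
    n_samples=400,
    centers=[[-6, -6], [-6, 6], [6, -6], [6, 6]],
    cluster_std=1.0,
    random_state=0,
)

candidate_ks = list(range(2, 8))
scores = []
for k in candidate_ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores.append(silhouette_score(X_demo, labels))

# Pick the candidate with the highest mean silhouette.
best_k_by_silhouette = candidate_ks[int(np.argmax(scores))]
print('Suggested k:', best_k_by_silhouette)
```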


In [None]:
# ========== 8.2 Fit KMeans with chosen k and visualize clusters ==========

BEST_K = 3  # adjust after inspecting previous outputs

# Set n_init explicitly: its default changed across scikit-learn versions.
kmeans_final = KMeans(n_clusters=BEST_K, n_init=10, random_state=RANDOM_STATE)
cluster_labels_pca = kmeans_final.fit_predict(X_pca[:, :2])

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_labels_pca, cmap='tab10', alpha=0.7)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title(f'KMeans (k={BEST_K}) clusters in PCA space')
plt.colorbar(scatter, label='Cluster')
plt.show()

# Color the t-SNE embedding (and UMAP, if available) by the same KMeans clusters
plt.figure(figsize=(8, 6))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=cluster_labels_pca, cmap='tab10', alpha=0.7)
plt.title('Same KMeans clusters visualized in t-SNE space')
plt.xlabel('t-SNE1')
plt.ylabel('t-SNE2')
plt.show()

if UMAP_AVAILABLE:
    plt.figure(figsize=(8, 6))
    plt.scatter(X_umap[:, 0], X_umap[:, 1], c=cluster_labels_pca, cmap='tab10', alpha=0.7)
    plt.title('Same KMeans clusters visualized in UMAP space')
    plt.xlabel('UMAP1')
    plt.ylabel('UMAP2')
    plt.show()


In [None]:
# ========== 8.3 DBSCAN for density-based clustering ==========

# DBSCAN finds clusters of high density and marks sparse points as noise (-1)

dbscan = DBSCAN(eps=0.8, min_samples=10)
db_labels = dbscan.fit_predict(X_pca[:, :2])  # using PCA(2D) for simplicity

unique_labels = np.unique(db_labels)
print('DBSCAN unique labels:', unique_labels)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=db_labels, cmap='tab20', alpha=0.7)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.title('DBSCAN clusters in PCA space (label -1 = noise)')
plt.colorbar(scatter, label='DBSCAN label')
plt.show()


**KMeans vs DBSCAN:**

- **KMeans**:
  - You choose k.
  - Good for roughly spherical clusters of similar size.
  - Fast and widely used.

- **DBSCAN**:
  - Finds "natural" clusters based on density.
  - Can discover arbitrarily shaped clusters.
  - Marks outliers explicitly.
  - Sensitive to `eps` and `min_samples`.
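
The contrast is easiest to see on concentric circles. The sketch below clusters a `make_circles` dataset with both methods and scores each against the true ring labels using the adjusted Rand index (ARI). The `eps=0.15` value is hand-picked for this synthetic data and would need retuning elsewhere.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

X_rings, y_rings = make_circles(n_samples=800, factor=0.5, noise=0.05, random_state=0)

# KMeans splits the plane with a straight boundary, which cuts through both rings.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_rings)

# eps is smaller than the gap between the rings, so density chains stay within a ring.
db_labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X_rings)

km_ari = adjusted_rand_score(y_rings, km_labels)
db_ari = adjusted_rand_score(y_rings, db_labels)
print(f'KMeans ARI: {km_ari:.3f} | DBSCAN ARI: {db_ari:.3f}')
```

An ARI near 1 means the clustering matches the true rings; an ARI near 0 means it is no better than chance.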


## 9. Using Embeddings & Clusters in ML Pipelines

Once you have embeddings (mostly PCA/UMAP; t-SNE is mainly for plots) and cluster labels, you can:

- **Add them as features** to your supervised ML models.
  - PCA components: `pc1, pc2, pc3, ...`.
  - UMAP components: `umap1, umap2, ...`.
  - Cluster labels: one-hot encode `cluster_id`.
- Use clusters to:
  - Create **player archetypes** (e.g., power hitter vs speedster).
  - Segment customers or weather regimes.
  - Train separate models per cluster if behavior is very different.
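
When embeddings feed a supervised model, one reasonable pattern is to place the scaler and PCA inside a scikit-learn pipeline, so they are fit only on the training portion of each split. The sketch below uses iris and logistic regression as stand-ins; the model choice and `n_components=2` are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)

# Scaler and PCA are refit on each training fold, so no information
# from the held-out fold leaks into the embedding.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression(max_iter=1000),
)
cv_scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(f'Mean CV accuracy: {cv_scores.mean():.3f}')
```

Fitting the embedding on the full dataset before splitting, as the exploratory cells in this notebook do, is fine for visualization but can make downstream validation scores optimistic.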


In [None]:
# ========== 9.1 Build a combined DataFrame with embeddings & cluster labels ==========

df_features = X.copy()

df_features['pc1'] = X_pca[:, 0]
df_features['pc2'] = X_pca[:, 1]
df_features['kmeans_cluster'] = cluster_labels_pca

df_features['tsne1'] = X_tsne[:, 0]
df_features['tsne2'] = X_tsne[:, 1]

if UMAP_AVAILABLE:
    df_features['umap1'] = X_umap[:, 0]
    df_features['umap2'] = X_umap[:, 1]

if y is not None:
    df_features['label'] = y.values

display(df_features.head())


At this point you can:

- Export `df_features` to CSV and feed it to your **regression/classification templates**.
- Use PCA/UMAP components as **compressed representations** of your original features.
- Treat `kmeans_cluster` as a categorical feature in downstream models.
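
Cluster IDs are nominal, so most models should not see them as ordered integers. A minimal sketch of one-hot encoding them with pandas follows; `df_demo` is a small hypothetical frame standing in for `df_features`.

```python
import pandas as pd

# Hypothetical stand-in for `df_features` (values made up for illustration).
df_demo = pd.DataFrame({
    'pc1': [0.1, -1.2, 2.3, 0.4],
    'pc2': [1.0, 0.3, -0.7, -0.2],
    'kmeans_cluster': [0, 2, 1, 0],
})

# Expand the cluster ID into indicator columns: cluster_0, cluster_1, cluster_2.
df_encoded = pd.get_dummies(df_demo, columns=['kmeans_cluster'], prefix='cluster', dtype=int)
print(df_encoded.columns.tolist())
```

The encoded frame can then be written out with `df_encoded.to_csv(...)` for use in another notebook.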


## 10. Practical Decision Rules (Quick Reference)

### 10.1 Which dimensionality reduction method?

- **Start with PCA** when:
  - You want a quick, linear baseline.
  - You care about explained variance and feature contributions.
  - You want input to KMeans or other models.

- **Use t-SNE** when:
  - You specifically want a **2D plot** for exploration.
  - You don’t care about using the embedding as input features.
  - You’re exploring complex embeddings (e.g., from deep models).

- **Use UMAP** when:
  - You want a **nonlinear embedding** that you *can* use as features.
  - You care about both local and some global structure.
  - You need scalability to larger datasets.

### 10.2 Which clustering method?

- **KMeans**:
  - You have a rough idea of the number of clusters.
  - Clusters are somewhat spherical/convex in embedding space.
  - You want something fast and simple.

- **DBSCAN**:
  - You expect irregular cluster shapes.
  - You want outlier detection built in.
  - You don’t know how many clusters to expect.

Always cross-check clusters with **domain knowledge**:
- Do the discovered clusters make sense for players, customers, or weather regimes?
- Are they stable across random seeds and reasonable parameter changes?
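
One way to check stability across seeds is to cluster repeatedly and compare the resulting partitions pairwise with the adjusted Rand index. The sketch below does this for KMeans on the iris demo data; the number of seeds and `n_clusters=3` are illustrative choices.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

X_demo = StandardScaler().fit_transform(load_iris().data)

# n_init=1 so each run reflects a single random initialization.
runs = [
    KMeans(n_clusters=3, n_init=1, random_state=seed).fit_predict(X_demo)
    for seed in range(5)
]

# ARI ignores how clusters are numbered; 1.0 means identical partitions.
pairwise_ari = [adjusted_rand_score(a, b) for a, b in combinations(runs, 2)]
print(f'Mean pairwise ARI over seeds: {np.mean(pairwise_ari):.3f}')
```

A mean ARI well below 1 suggests the clusters depend heavily on initialization and deserve a closer look before you interpret them.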
