
# Dimensionality Reduction — Hands‑On

**Course:** Data Analysis & Machine Learning for Physics  
**Focus:** PCA, SVD, t‑SNE, UMAP, Autoencoders (same dataset for all methods)  
**Dataset:** `sklearn.datasets.load_digits()` (1797 samples, 8×8 images → 64D feature vectors)

**What you'll do in ~60 minutes**
- Load and standardize the dataset (digits) — quick, reproducible, small but rich.
- Apply **PCA**: visualize 2‑D embedding, **explained variance ratio**, **cumulative variance**.
- Apply **SVD**: inspect **singular values (log‑scale)** and **cumulative energy**; connect to PCA.
- Apply **t‑SNE** and **UMAP**: study the effect of **hyperparameters** on neighborhood structure.
- Build a tiny **Autoencoder** with a 2‑D latent space and compare to PCA/t‑SNE/UMAP embeddings.

> This lab complements the slide deck *“Dimensionality Reduction: PCA, SVD, t‑SNE, UMAP, and Autoencoders.”*



## 0) Setup

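A minimal setup sketch, assuming a standard scientific Python stack (NumPy, Matplotlib, scikit-learn). The names `RANDOM_STATE`, `HAS_UMAP`, and `HAS_TF` are illustrative and not part of any library. The optional packages are only detected here, not imported, because importing TensorFlow can be slow.

```python
import importlib.util

import matplotlib.pyplot as plt
import numpy as np
import sklearn

# Assumed seed, chosen to match the random_state=42 used in the t-SNE exercise
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Optional dependencies: the umap-learn package is imported as `umap`
HAS_UMAP = importlib.util.find_spec("umap") is not None
HAS_TF = importlib.util.find_spec("tensorflow") is not None

print("numpy", np.__version__, "| scikit-learn", sklearn.__version__)
print("umap-learn available:", HAS_UMAP, "| tensorflow available:", HAS_TF)
```

If either optional package is reported as missing, the corresponding section (UMAP or Autoencoder) can be skipped without affecting the others.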



## 1) Load & Standardize the Dataset

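We will use `sklearn.datasets.load_digits()` (loaded as `digits = load_digits()`). It provides 1797 images of handwritten digits, each of size 8×8, flattened into 64-dimensional vectors. Standardize every feature to zero mean and unit variance; the standardized matrix is called `X_std` in the exercises that follow.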

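The loading and standardization step can be sketched as follows. The variable names `X`, `y`, `scaler`, and `X_std` are conventions assumed here so that later exercises can refer to them.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X = digits.data    # shape (1797, 64), pixel intensities in [0, 16]
y = digits.target  # shape (1797,), labels 0..9

# Standardize each pixel to zero mean and unit variance.
# StandardScaler leaves constant (zero-variance) pixels at 0 instead of dividing by 0.
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print("X_std shape:", X_std.shape)
print("max |column mean|:", np.abs(X_std.mean(axis=0)).max())
print("number of constant pixels:", int(np.sum(X.std(axis=0) == 0)))
```

Keeping the fitted `scaler` around is useful later: `scaler.inverse_transform` maps reconstructions back to the original pixel scale.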


---

## 2) Principal Component Analysis (PCA)

**Exercise 2.1.**  
Fit PCA on `X_std` and:
1. Produce a **2‑D scatter** of the first two principal components (PC1 vs PC2) colored by digit label.
2. Plot the **scree plot** of explained variance ratio and the **cumulative explained variance**.  
3. Report the **minimum number of components** to explain at least **95%** of the variance.

<details><summary><strong>Why this matters (physics intuition)</strong></summary>
PCA diagonalizes the covariance matrix — analogous to finding **eigenmodes** (normal modes) of a system. The eigenvalues represent variances captured by each mode.
</details>

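One possible way to approach the three tasks of this exercise, given as a sketch rather than a reference solution. It reloads and standardizes the data so that it runs on its own; `k95` and the plot layout are choices made here.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_std = StandardScaler().fit_transform(digits.data)
y = digits.target

# Fit PCA with all components so the full variance spectrum is available
pca = PCA().fit(X_std)
X_pca = pca.transform(X_std)

evr = pca.explained_variance_ratio_
cum_evr = np.cumsum(evr)
# Smallest k whose cumulative explained variance reaches 95%
k95 = int(np.argmax(cum_evr >= 0.95)) + 1
print(f"Components needed for >= 95% variance: {k95}")

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1) PC1 vs PC2, colored by digit label
sc = axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap="tab10", s=8)
axes[0].set(xlabel="PC1", ylabel="PC2", title="PCA embedding")
fig.colorbar(sc, ax=axes[0], label="digit")

# 2) Scree plot: explained variance ratio per component
idx = np.arange(1, len(evr) + 1)
axes[1].plot(idx, evr, marker="o", markersize=3)
axes[1].set(xlabel="component", ylabel="explained variance ratio", title="Scree plot")

# 3) Cumulative explained variance with the 95% threshold marked
axes[2].plot(idx, cum_evr)
axes[2].axhline(0.95, linestyle="--", color="gray")
axes[2].axvline(k95, linestyle=":", color="gray")
axes[2].set(xlabel="number of components", ylabel="cumulative variance",
            title="Cumulative explained variance")

fig.tight_layout()
```

In a notebook the figure is displayed automatically; in a script, add `plt.show()` at the end.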


---

## 3) Singular Value Decomposition (SVD)

**Exercise 3.1.**  
Compute the compact (thin) SVD of the **centered** data using `np.linalg.svd` with `full_matrices=False`. Because `X_std` is already zero-mean, take `X_centered = X_std`:
1. Plot the **singular values** \(\sigma_i\) on a **log scale**.
2. Plot the **cumulative energy** defined as  
\[ E(k) = \frac{\sum_{i=1}^k \sigma_i^2}{\sum_{i=1}^r \sigma_i^2} \]  
where \(r\) is the number of singular values, and report the smallest **k** for which **E(k) ≥ 0.95**.  
3. Verify: the PCA eigenvalues (explained variances) equal \(\sigma_i^2/(n-1)\), where \(n\) is the number of samples.

<details><summary><strong>Connection to PCA</strong></summary>
For centered data, PCA directions equal right singular vectors of **X**, and PCA variances equal \(\sigma_i^2/(n-1)\). Many libraries implement PCA via SVD for numerical stability.
</details>

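The SVD exercise and its link to PCA can be checked numerically along these lines. This is a sketch under the assumption that scikit-learn's `PCA` normalizes the covariance by \(n-1\), which matches its documented `explained_variance_`. The principal directions are compared only up to sign, since the sign of a singular vector is arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_digits().data)
n = X_std.shape[0]

# Thin SVD: U is (n, 64), s is (64,), Vt is (64, 64)
U, s, Vt = np.linalg.svd(X_std, full_matrices=False)

# Cumulative energy E(k) and the smallest k with E(k) >= 0.95
energy = np.cumsum(s**2) / np.sum(s**2)
k95 = int(np.argmax(energy >= 0.95)) + 1
print(f"k with E(k) >= 0.95: {k95}")

# PCA variances should equal sigma_i^2 / (n - 1)
pca = PCA(svd_solver="full").fit(X_std)
variances_from_svd = s**2 / (n - 1)
variances_match = np.allclose(pca.explained_variance_, variances_from_svd)
print("PCA variances match sigma^2/(n-1):", variances_match)

# Leading PCA directions equal right singular vectors up to sign
alignment = np.abs(np.sum(Vt[:2] * pca.components_[:2], axis=1))
print("|cos| between first two directions:", alignment)

fig, axes = plt.subplots(1, 2, figsize=(11, 4))
idx = np.arange(1, len(s) + 1)
# Singular values near machine precision (from constant pixels) show up as a sharp drop
axes[0].semilogy(idx, s, marker="o", markersize=3)
axes[0].set(xlabel="index i", ylabel="singular value (log scale)", title="Singular values")
axes[1].plot(idx, energy)
axes[1].axhline(0.95, linestyle="--", color="gray")
axes[1].set(xlabel="k", ylabel="E(k)", title="Cumulative energy")
fig.tight_layout()
```

Because the data are centered, `E(k)` coincides with the cumulative explained variance ratio from PCA, so the two values of `k` at the 95% threshold should agree.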


---

## 4) t‑SNE (t‑distributed Stochastic Neighbor Embedding)

**Exercise 4.1.**  
Run t‑SNE on `X_std` with different **perplexities** and compare the 2‑D embeddings:

- Perplexities: `[5, 30, 50]` (keep other params default except `random_state=42`)
- For each run, make a scatter plot colored by labels and briefly note cluster compactness and separation.

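> **Tip:** t‑SNE focuses on *local* neighborhood preservation and is sensitive to hyperparameters. Expect the layout to change with the perplexity and with the random seed; with `random_state=42` fixed, repeated runs at the same perplexity should be reproducible.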

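A sketch of the perplexity sweep. To keep the three runs fast, it uses a random subset of 500 samples; that subsetting is an assumption of this example, not part of the exercise, and can be removed to run on all 1797 samples.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_std = StandardScaler().fit_transform(digits.data)
y = digits.target

# Assumption: a 500-sample random subset, only to shorten the runtime
rng = np.random.default_rng(42)
subset = rng.choice(len(X_std), size=500, replace=False)
X_sub, y_sub = X_std[subset], y[subset]

perplexities = [5, 30, 50]
tsne_embeddings = {}
for perp in perplexities:
    tsne = TSNE(n_components=2, perplexity=perp, random_state=42)
    tsne_embeddings[perp] = tsne.fit_transform(X_sub)
    # The KL value is not directly comparable across perplexities
    print(f"perplexity={perp:>2}  final KL divergence={tsne.kl_divergence_:.3f}")

fig, axes = plt.subplots(1, len(perplexities), figsize=(15, 4.5))
for ax, perp in zip(axes, perplexities):
    emb = tsne_embeddings[perp]
    ax.scatter(emb[:, 0], emb[:, 1], c=y_sub, cmap="tab10", s=8)
    ax.set(title=f"t-SNE, perplexity={perp}", xticks=[], yticks=[])
fig.tight_layout()
```

Axis ticks are hidden on purpose: distances and axis scales in a t-SNE plot carry no direct meaning, so only the neighborhood structure should be compared.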


---

## 5) UMAP (Uniform Manifold Approximation and Projection)

**Exercise 5.1.**  
If the `umap-learn` package is installed, run UMAP with different **n_neighbors** and **min_dist**, and compare the embeddings:
- Try pairs `(n_neighbors, min_dist)` in `[(5, 0.1), (15, 0.1), (50, 0.5)]`.
- Discuss local cluster separation vs global structure preservation.

> **Note:** UMAP is often faster than t‑SNE and tends to preserve more of the global structure, but hyperparameters still matter.

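A sketch of the UMAP sweep, assuming the optional `umap-learn` package (imported as `umap`). If the import fails, the block prints a message and leaves `umap_embeddings` empty instead of raising.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

try:
    import umap  # provided by the umap-learn package
except ImportError:
    umap = None

digits = load_digits()
X_std = StandardScaler().fit_transform(digits.data)
y = digits.target

settings = [(5, 0.1), (15, 0.1), (50, 0.5)]  # (n_neighbors, min_dist)
umap_embeddings = {}

if umap is None:
    print("umap-learn is not installed; skipping the UMAP exercise.")
else:
    for n_neighbors, min_dist in settings:
        reducer = umap.UMAP(n_neighbors=n_neighbors, min_dist=min_dist,
                            random_state=42)
        umap_embeddings[(n_neighbors, min_dist)] = reducer.fit_transform(X_std)

    fig, axes = plt.subplots(1, len(settings), figsize=(15, 4.5))
    for ax, key in zip(axes, settings):
        emb = umap_embeddings[key]
        ax.scatter(emb[:, 0], emb[:, 1], c=y, cmap="tab10", s=6)
        ax.set(title=f"UMAP, n_neighbors={key[0]}, min_dist={key[1]}",
               xticks=[], yticks=[])
    fig.tight_layout()
```

Small `n_neighbors` emphasizes fine local structure, while larger values pull in more of the global layout; `min_dist` mainly controls how tightly points are packed inside each cluster.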


---

## 6) Autoencoder (Neural Dimensionality Reduction)

We will build a tiny fully connected autoencoder with a **2‑D latent space**.

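**Exercise 6.1.**
1. Define an encoder–decoder with architecture 64→32→**2** (latent)→32→64 using ReLU in the hidden layers. Use a sigmoid output only if the inputs are rescaled to [0, 1]; standardized inputs contain negative values that a sigmoid cannot reproduce, so use a linear output for `X_std`.
2. Train it to reconstruct the standardized inputs (`X_std`) with an MSE loss.
3. Extract the 2‑D latent embedding and plot it colored by labels.
4. Compare the autoencoder's **reconstruction MSE** with that of a **2‑D PCA** reconstruction.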

> If `tensorflow` is not available, install it or skip to the comparative discussion below.

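A minimal Keras sketch of the 64→32→2→32→64 autoencoder together with the 2-D PCA baseline. Several choices here are assumptions of this example rather than requirements: a linear output layer (standardized features take negative values that a sigmoid cannot reach; with inputs rescaled to [0, 1] a sigmoid would be the natural choice), a linear 2-D latent layer (a ReLU there would confine the embedding to the positive quadrant), the Adam optimizer, 50 epochs, and a batch size of 64. If TensorFlow is not installed, only the PCA baseline is computed.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

try:
    import tensorflow as tf
except ImportError:
    tf = None

digits = load_digits()
X_std = StandardScaler().fit_transform(digits.data)
y = digits.target

# Baseline: reconstruction error of a 2-D PCA projection
pca2 = PCA(n_components=2).fit(X_std)
X_pca_rec = pca2.inverse_transform(pca2.transform(X_std))
mse_pca = float(np.mean((X_std - X_pca_rec) ** 2))
print(f"PCA(2D) reconstruction MSE: {mse_pca:.4f}")

Z, mse_ae = None, None
if tf is None:
    print("tensorflow is not installed; skipping the autoencoder.")
else:
    tf.keras.utils.set_random_seed(42)

    inputs = tf.keras.Input(shape=(64,))
    h = tf.keras.layers.Dense(32, activation="relu")(inputs)
    latent = tf.keras.layers.Dense(2, activation="linear", name="latent")(h)
    h = tf.keras.layers.Dense(32, activation="relu")(latent)
    outputs = tf.keras.layers.Dense(64, activation="linear")(h)

    autoencoder = tf.keras.Model(inputs, outputs)
    encoder = tf.keras.Model(inputs, latent)  # shares weights with autoencoder

    autoencoder.compile(optimizer="adam", loss="mse")
    autoencoder.fit(X_std, X_std, epochs=50, batch_size=64, verbose=0)

    Z = encoder.predict(X_std, verbose=0)          # 2-D latent embedding
    X_ae_rec = autoencoder.predict(X_std, verbose=0)
    mse_ae = float(np.mean((X_std - X_ae_rec) ** 2))
    print(f"AE(2D)  reconstruction MSE: {mse_ae:.4f}")
```

The latent embedding `Z` can then be plotted with `plt.scatter(Z[:, 0], Z[:, 1], c=y, cmap="tab10")`. Both errors are measured on the training data, which is adequate for comparing the two compressions here but says nothing about generalization.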


---

## 7) Comparative Discussion 

**Pros/Cons quick matrix (for this dataset):**
- **PCA** — fast, interpretable; linear; good for variance capture and denoising.
- **SVD** — general factorization; connects cleanly to PCA; singular values show energy spectrum.
- **t‑SNE** — excellent local cluster separation; stochastic layouts; sensitive to perplexity.
- **UMAP** — often faster, balances local & global; sensitive to `n_neighbors`/`min_dist`.
- **Autoencoders** — flexible nonlinear reduction; needs training; architecture & optimization matter.

**Prompts:**
- Does your PCA scree plot show a “knee”? Where would you truncate?
- Which method best separates classes? Which preserves global distances better?
- How do t‑SNE/UMAP hyperparameters affect cluster compactness vs. continuity?
- Compare reconstruction errors: AE(2D) vs PCA(2D). What might improve the AE (depth, nonlinearities, epochs)?

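To put numbers behind the prompts, one option is to score each 2-D embedding with two simple diagnostics. Both metrics are choices made for this sketch and are not prescribed by the lab: `sklearn.manifold.trustworthiness` measures how well local neighborhoods are preserved, and the cross-validated accuracy of a k-nearest-neighbor classifier in the embedding serves as a rough proxy for class separation. Neither measures global distance preservation. The 500-sample subset is again only there to keep the runtime short.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_std = StandardScaler().fit_transform(digits.data)
y = digits.target

# Assumption: a 500-sample random subset, only to shorten the runtime
rng = np.random.default_rng(42)
subset = rng.choice(len(X_std), size=500, replace=False)
X_sub, y_sub = X_std[subset], y[subset]

embeddings = {
    "PCA (2-D)": PCA(n_components=2).fit_transform(X_sub),
    "t-SNE (perplexity=30)": TSNE(n_components=2, perplexity=30,
                                  random_state=42).fit_transform(X_sub),
}

scores = {}
for name, emb in embeddings.items():
    # Fraction-like score in [0, 1]; higher means local neighborhoods are better kept
    trust = trustworthiness(X_sub, emb, n_neighbors=10)
    # The embedding was fitted on all points, so this is a descriptive
    # separation score, not an estimate of out-of-sample accuracy
    knn_acc = cross_val_score(KNeighborsClassifier(n_neighbors=5),
                              emb, y_sub, cv=5).mean()
    scores[name] = (float(trust), float(knn_acc))
    print(f"{name:<24} trustworthiness={trust:.3f}  kNN accuracy={knn_acc:.3f}")
```

UMAP or autoencoder embeddings can be added to the `embeddings` dictionary in the same way, as long as they are computed on the same subset.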


### (Optional) Visualize Reconstructions

Pick a few samples and compare original vs. reconstructions from PCA‑k and Autoencoder.

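A sketch of the PCA part of this comparison. The values of `k` and the displayed sample indices are arbitrary choices for illustration. Reconstructions are mapped back to pixel units with the fitted scaler before plotting. An autoencoder row is left out so the block does not depend on TensorFlow; reconstructions from a trained model can be added as another row in the same way.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

digits = load_digits()
scaler = StandardScaler()
X_std = scaler.fit_transform(digits.data)

# Fit once with all components, then truncate to the first k
pca = PCA(svd_solver="full").fit(X_std)
pc_scores = pca.transform(X_std)


def pca_reconstruct(k):
    """Reconstruct the standardized data from the first k principal components."""
    return pc_scores[:, :k] @ pca.components_[:k] + pca.mean_


ks = [2, 10, 30]               # arbitrary choices of k for illustration
sample_ids = [0, 1, 2, 3, 4]   # arbitrary samples to display
recon = {k: pca_reconstruct(k) for k in ks}
mse = {k: float(np.mean((X_std - recon[k]) ** 2)) for k in ks}
print("PCA-k reconstruction MSE (standardized units):", mse)

row_labels = ["original"] + [f"PCA k={k}" for k in ks]
fig, axes = plt.subplots(len(row_labels), len(sample_ids), figsize=(8, 7))
for col, i in enumerate(sample_ids):
    axes[0, col].imshow(digits.images[i], cmap="gray_r")
    for row, k in enumerate(ks, start=1):
        # Undo the standardization so the image is in original pixel units
        img = scaler.inverse_transform(recon[k][i:i + 1]).reshape(8, 8)
        axes[row, col].imshow(img, cmap="gray_r")
for ax in axes.ravel():
    ax.set(xticks=[], yticks=[])
for row, label in enumerate(row_labels):
    axes[row, 0].set_ylabel(label)
fig.tight_layout()
```

Because the truncated subspaces are nested, the reconstruction error can only decrease as `k` grows, which gives a quick sanity check on the output.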