## Module 10: Dimensionality Reduction & Selection

In this module, we’ll tackle methods for reducing and selecting features to combat high dimensionality, improve visualization, and reduce the risk of overfitting.

### Key Concepts
- **Principal Component Analysis (PCA)**  
  Linear projection that captures maximum variance in orthogonal directions.
- **Nonlinear Embeddings: t-SNE & UMAP**  
  Manifold learning techniques for visualizing high-dimensional data in 2D/3D.
- **Feature Selection**  
  - **Mutual Information**: measure dependency between features and labels  
  - **Variance Thresholding**: remove near-constant features  
- **Overfitting & Curse of Dimensionality**  
  Why too many features can degrade model performance and generalization.

---

### 📓 Notebook Demos

1. **3D PCA Scatter of Instrument Embeddings**  
   - Compute feature vectors (e.g., MFCC means) for various instruments  
   - Perform PCA to reduce to 3 components  
   - Plot interactive 3D scatter, colored by instrument class, with rotation controls  

2. **Comparing UMAP vs. t-SNE on MFCC Vectors**  
   - Collect MFCC feature vectors for a dataset of audio clips  
   - Compute 2D projections using UMAP and t-SNE  
   - Display side-by-side scatter plots to compare clustering and neighborhood structure  

---

### 🛠 Exercise: 2D Embedding & Clustering

- **Task:**  
  Take a 40-dimensional feature representation of environmental sounds and reduce it to 2 dimensions using your choice of PCA, t-SNE, or UMAP.  

- **Steps:**  
  1. Load or compute a 40-dim feature matrix (e.g., MFCC mean/variance, spectral contrast).  
  2. Apply dimensionality reduction to 2D.  
  3. Cluster the 2D points (e.g., K-Means) and visualize cluster assignments.  

- **Deliverables:**  
  - Plots of the 2D embedding with clusters labeled  
  - Quantitative evaluation (e.g., silhouette score) comparing methods  
  - Brief discussion of which technique best preserves class structure  


### Key Concepts

- **Principal Component Analysis (PCA)**  
  A linear dimensionality-reduction technique that finds orthogonal directions (principal components) capturing the maximum variance in your data. Useful for compressing high-dimensional features while preserving as much “energy” (variance) as possible.

- **Nonlinear Embeddings: t-SNE & UMAP**  
  Manifold learning methods that map high-dimensional data into 2D or 3D for visualization:  
  - **t-SNE** focuses on preserving local neighborhood structure (clusters).  
  - **UMAP** preserves both local and some global structure and tends to run faster on large datasets.

- **Feature Selection**  
  Techniques to pick the most informative features and discard the rest:  
  - **Mutual Information**  
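    Quantifies the statistical dependency between each feature and the target labels; a higher MI means the feature, taken on its own, tells you more about the label. MI scores features one at a time, so it does not detect redundancy between features.  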
  - **Variance Thresholding**  
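    Removes features whose variance falls below a chosen threshold (near-constant features carry little information). Variance depends on a feature's units, so apply the threshold before standardizing features to unit variance.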

- **Overfitting & Curse of Dimensionality**  
  - **Overfitting:** Too many features relative to data points can cause models to “memorize” noise instead of learning generalizable patterns.  
  - **Curse of Dimensionality:** As dimensionality grows, the volume of feature space expands exponentially, making data sparse and distance metrics less meaningful.
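The claim that distances become less meaningful can be checked numerically. The sketch below is an illustration on uniform random points, not audio features; the helper name `distance_contrast` and the point counts are assumptions chosen for the example. It measures how much farther the farthest neighbor is than the nearest one, relative to the nearest distance.

```python
import numpy as np

def distance_contrast(n_dims, n_points=500, seed=0):
    """Relative spread (max - min) / min of distances from one query point to all others."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(size=(n_points, n_dims))   # uniform points in the unit hypercube
    query = points[0]
    dists = np.linalg.norm(points[1:] - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Compare a low-, medium-, and high-dimensional space
contrasts = {d: distance_contrast(d) for d in (2, 40, 1000)}
for d, c in contrasts.items():
    print(f"{d:>5} dims: relative contrast = {c:.2f}")
```

As the dimension grows, the contrast shrinks: the nearest and farthest points end up at nearly the same distance, which is why nearest-neighbor reasoning degrades in high-dimensional feature spaces.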

Understanding and mitigating these issues is critical for building robust, high-performing audio-analysis pipelines.  
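To make the two feature-selection ideas from the list above concrete, here is a minimal sketch using scikit-learn's `VarianceThreshold` and `mutual_info_classif`. The data is synthetic and the three columns (one informative, one noise, one constant) are assumptions made up for the illustration, not features from the course dataset.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold, mutual_info_classif

rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 2, size=n)               # synthetic binary labels

X = np.column_stack([
    y + 0.3 * rng.normal(size=n),            # column 0: depends on the label
    rng.normal(size=n),                      # column 1: pure noise
    np.full(n, 5.0),                         # column 2: constant
])

# Step 1: drop near-constant features
selector = VarianceThreshold(threshold=1e-3)
X_kept = selector.fit_transform(X)
kept_idx = selector.get_support(indices=True)
print("columns kept after variance threshold:", kept_idx)

# Step 2: score the remaining features by mutual information with the labels
mi = mutual_info_classif(X_kept, y, random_state=0)
for idx, score in zip(kept_idx, mi):
    print(f"column {idx}: MI = {score:.3f}")
```

The constant column is removed in the first step, and the informative column should receive a clearly higher MI score than the noise column. The threshold `1e-3` is an arbitrary choice for this toy data; a sensible value depends on the scale of your features.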


## Demo: 3D PCA Scatter of Instrument Embeddings

In this demo you’ll see how simple timbral features (mean MFCCs) can be embedded in a low-dimensional space and visualized in 3D to reveal clustering by instrument.

**What the code does:**
1. **Loads** each audio clip from the `sounds/` folder, using the `FILES` list of `(Label, Filename)` pairs.
2. **Extracts** Mel-frequency cepstral coefficients (MFCCs) from each clip:
   - Computes `N_MFCC` coefficients per frame with an `N_FFT`-point FFT and `HOP_LENGTH` hop.
   - Averages each coefficient across time to form a single feature vector per clip.
3. **Stacks** these feature vectors into a matrix `X` of shape `(n_clips, N_MFCC)`.
4. **Performs PCA** to reduce `X` to 3 principal components.
5. **Plots** an interactive 3D scatter (via Plotly) of the resulting coordinates, coloring and symbolizing points by instrument label.

**USER SETTINGS** (edit these at the top of the code cell):
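- `FILES`: list of `(Instrument Label, Filename)` tuples. Provide at least 3 clips, because PCA to 3 components requires `n_components ≤ min(n_clips, N_MFCC)`.  
  - Filenames must exist in `sounds/` (WAV files are read with `soundfile`; other formats such as MP3 go through `librosa.load`).  
- `N_MFCC`: integer ≥ 3, number of MFCC coefficients to extract (e.g. 13).  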
- `N_FFT`: FFT window size (power of two, e.g. 1024, 2048).  
- `HOP_LENGTH`: hop size in samples (≤ `N_FFT`, e.g. `N_FFT // 4`).

**What to observe:**
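- **Clusters**: Clips with similar timbral characteristics (MFCC means) tend to appear close together. With one clip per instrument you get one point per label, so add several clips per instrument to see actual clusters.  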
- **Separation**: Distinct timbres form well‐separated groups in 3D space.  
- **Axes**: PC1–PC3 capture the directions of greatest variance in the feature set.

**How to interpret:**
- Points that cluster by label indicate that mean-MFCC features differentiate those instruments.  
- Interactive rotation helps you explore which PCs best separate particular instruments.  
- Consider adding or removing instruments or adjusting `N_MFCC` to see how the embedding changes.
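When judging how much the three axes actually tell you, it helps to look at the explained variance ratio. The sketch below uses a random stand-in matrix (`X_demo`, an assumption with the same layout as the demo's `X`) and adds a standardization step that is not part of the demo code. Standardizing is a common choice because low-order MFCCs usually have much larger raw variance than the rest and can otherwise dominate PC1.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-in for the demo's X: 12 clips x 13 mean-MFCC values,
# with deliberately unequal variance per coefficient (synthetic, not real audio)
X_demo = rng.normal(size=(12, 13)) * np.linspace(20.0, 1.0, 13)

# Optional step (an addition, not in the demo): zero mean, unit variance per coefficient
X_scaled = StandardScaler().fit_transform(X_demo)

pca_demo = PCA(n_components=3).fit(X_scaled)
ratios = pca_demo.explained_variance_ratio_
for i, r in enumerate(ratios, start=1):
    print(f"PC{i}: {r:.1%} of variance")
print(f"Total kept in 3D: {ratios.sum():.1%}")
```

If the three components together keep only a small share of the variance, apparent separation (or overlap) in the 3D plot should be read with caution, since most of the structure lies in directions you are not seeing.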


In [None]:
# ── USER SETTINGS ────────────────────────────────────────────────────────────────
# List your instrument clips and labels here. Place files in the `sounds/` folder.
FILES = [
    ('Violin',   'violin-loop-154193.mp3'),
    ('Flute',    'flute-rain-flute-loop-ambient-short-loop-340800.mp3'),
    ('Piano',    'soft-piano-100-bpm-121529.mp3'),
    ('Trumpet',  'trumpet-75426.mp3'),
    # add more as desired...
]

N_MFCC     = 13        # ← number of MFCC coefficients to extract
N_FFT      = 2048      # ← FFT window size (power of 2)
HOP_LENGTH = N_FFT // 4  # ← hop length between frames
# ────────────────────────────────────────────────────────────────────────────────

import numpy as np
import pandas as pd
import librosa
import soundfile as sf
from sklearn.decomposition import PCA
import plotly.express as px
from pathlib import Path

# ── CONFIG (don’t edit below here) ───────────────────────────────────────────────
SOUNDS_DIR = Path('sounds')

# 1) Load each clip, extract mean MFCC feature vector
feature_list = []
labels       = []

for label, fname in FILES:
    path = SOUNDS_DIR / fname
    ext  = path.suffix.lower()
    if ext == '.wav':
        y, sr = sf.read(str(path), dtype='float32')
    else:
        y, sr = librosa.load(str(path), sr=None)
    # if stereo, take left channel
    if y.ndim > 1:
        y = y[:,0]
    
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr,
        n_mfcc=N_MFCC,
        n_fft=N_FFT,
        hop_length=HOP_LENGTH
    )
    # mean over time frames
    feat = mfcc.mean(axis=1)
    feature_list.append(feat)
    labels.append(label)

X = np.stack(feature_list)  # shape: (n_clips, N_MFCC)

# 2) PCA to 3 components
pca = PCA(n_components=3)
coords = pca.fit_transform(X)

# 3) Build a DataFrame for plotting
df = pd.DataFrame({
    'PC1': coords[:,0],
    'PC2': coords[:,1],
    'PC3': coords[:,2],
    'Instrument': labels
})

# 4) Interactive 3D scatter plot
fig = px.scatter_3d(
    df,
    x='PC1', y='PC2', z='PC3',
    color='Instrument',
    symbol='Instrument',
    title='3D PCA of Instrument MFCC Embeddings',
    width=800, height=600
)
fig.update_layout(scene=dict(
    xaxis_title='PC1',
    yaxis_title='PC2',
    zaxis_title='PC3'
))
fig.show()


# 📊 Demo: Comparing t-SNE vs. UMAP on MFCC Embeddings

In this demo, you’ll **project high-dimensional MFCC feature vectors** of audio clips down to 2D using two popular manifold learning techniques—**t-SNE** and **UMAP**—and compare how each preserves **neighborhood structure**.

---

## 🔍 What the Code Does

### 🎼 Load & Feature Extraction
- Iterates through your list of `(label, filename)` pairs in `FILES`
- Loads each clip:
  - `.wav` via `soundfile`
  - Other formats via `librosa`
  - Takes **first channel** if stereo
- Computes **MFCCs** with:
  - `N_MFCC` coefficients
  - `N_FFT` window size
  - `HOP_LENGTH` frame spacing
- Averages MFCCs **over time** to get one **feature vector per clip**

---

### 📉 t-SNE Projection
- **Caps `perplexity`** to be at most `n_samples - 1`
- Runs t-SNE with:
  - `perplexity = perp_capped`
  - `max_iter = TSNE_N_ITER`
  - Random initialization
- Produces **2D coordinates**

---

### 🌐 UMAP Projection
- Runs UMAP with:
  - `n_neighbors = UMAP_N_NEIGHBORS`
  - `min_dist = UMAP_MIN_DIST`
  - `metric = UMAP_METRIC`
- Produces another **2D projection**

---

### 🖼 Visualization
- Creates **side-by-side scatter plots**:
  - One for **t-SNE**
  - One for **UMAP**
- Points are **colored and marked** by **instrument label**
- Lets you compare how well each algorithm **clusters similar sounds**

---

## ⚙️ Inputs (Edit at Top of the Code Cell)

- **`FILES`**: List of `(label, filename)` tuples (audio files in `sounds/`)
- **`N_MFCC`**: Number of MFCC coefficients (e.g., `13–40`)
- **`N_FFT`**: FFT window size (e.g., `1024`, `2048`)
- **`HOP_LENGTH`**: Hop size between frames (e.g., `N_FFT // 4`)
- **`TSNE_PERPLEXITY`**: t-SNE perplexity (must be < number of clips, e.g., `5–50`)
- **`TSNE_N_ITER`**: t-SNE iterations (e.g., `250–2000`)
- **`UMAP_N_NEIGHBORS`**: UMAP neighborhood size (e.g., `5–50`)
- **`UMAP_MIN_DIST`**: UMAP minimum distance (`0.0–0.5`)
- **`UMAP_METRIC`**: Distance metric for UMAP (e.g., `'euclidean'`, `'cosine'`)
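The code cell for this demo reads the t-SNE and UMAP settings as plain variables, so they need to be defined before it runs. A possible settings block is sketched below; the specific values are assumptions chosen to match the library defaults of scikit-learn's `TSNE` and umap-learn's `UMAP`, not values prescribed by this module. `FILES`, `N_MFCC`, `N_FFT`, and `HOP_LENGTH` are assumed to be defined as in the PCA demo.

```python
# Illustrative values only (library defaults); tune them for your dataset
TSNE_PERPLEXITY  = 30           # capped by the demo code to n_samples - 1
TSNE_N_ITER      = 1000         # optimization iterations
UMAP_N_NEIGHBORS = 15           # size of the local neighborhood
UMAP_MIN_DIST    = 0.1          # how tightly points may pack in the embedding
UMAP_METRIC      = 'euclidean'  # distance metric in feature space
```

With only a handful of clips, the perplexity cap will take effect, so the t-SNE plot title may show a much smaller value than the one set here.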

---

## 📤 Outputs to Observe

### 🖼 Scatter Plots: t-SNE vs. UMAP
- Look for **tight clusters** by instrument label
- **t-SNE** may form **local “islands”**
- **UMAP** often preserves **global layout and continuity**

### 🏷 Plot Titles Show:
- **t-SNE**: Final capped perplexity value
- **UMAP**: Neighborhood size and minimum distance used

---

## 🧠 How to Interpret

### 🎯 Cluster Cohesion
- Are **similar instruments grouped closely**?

### ✂️ Separation
- Can you **visually separate different classes**?

---

### 🌍 Global vs. Local Behavior

- **t-SNE**:
  - Excels at **local cluster detail**
  - May distort **global relationships**

- **UMAP**:
  - Balances **local and global structure**
  - Often clearer global layout

---

### 🔧 Parameter Effects

- **Higher perplexity** → t-SNE considers more neighbors → smoother, more global structure
- **Larger n_neighbors** → UMAP includes broader context → more global cohesion
- **Smaller min_dist** → UMAP clusters points more tightly
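Parameter effects are easier to compare with a number next to the picture. One option is scikit-learn's `trustworthiness` score, which measures how well the local neighborhoods of the original data survive in the embedding (1.0 is best). The sketch below sweeps two perplexity values on synthetic blobs; the data, the perplexity values, and the variable names are assumptions for illustration, and the same loop could be applied to your MFCC matrix.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE, trustworthiness

# Synthetic stand-in for a (n_clips, N_MFCC) feature matrix
X_syn, _ = make_blobs(n_samples=120, n_features=13, centers=4, random_state=0)

scores = {}
for perplexity in (5, 30):
    emb = TSNE(n_components=2, perplexity=perplexity,
               init='random', random_state=0).fit_transform(X_syn)
    # Fraction-like score of how well 5-nearest-neighbor sets are preserved
    scores[perplexity] = trustworthiness(X_syn, emb, n_neighbors=5)

for perplexity, score in scores.items():
    print(f"perplexity={perplexity:>2}: trustworthiness={score:.3f}")
```

The same function accepts a UMAP embedding, so you can run one sweep over `n_neighbors` and `min_dist` and compare the two methods on the same scale.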

---

💡 **Experiment**:  
Try tweaking `TSNE_PERPLEXITY`, `UMAP_N_NEIGHBORS`, and `UMAP_MIN_DIST`, then re-run the cell to observe how **projection quality and cluster layout** change!


In [None]:
# … your USER SETTINGS above …

import numpy as np
import librosa
import soundfile as sf
from pathlib import Path
from sklearn.manifold import TSNE
from umap import UMAP
import matplotlib.pyplot as plt

# Helper to load audio …
def load_audio(path, sr=None):
    ext = path.suffix.lower()
    if ext == '.wav':
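        y, sr_native = sf.read(str(path), dtype='float32')
        if sr is None or sr == sr_native:
            return y, sr_native
        # soundfile returns (frames, channels) and librosa resamples along the last
        # axis, so transpose around the call; newer librosa needs keyword arguments
        return librosa.resample(y.T, orig_sr=sr_native, target_sr=sr).T, sr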
    else:
        return librosa.load(str(path), sr=sr)

# 1) Extract mean‐MFCCs
feature_list = []
labels       = []
SOUNDS_DIR   = Path('sounds')
for label, fname in FILES:
    y, sr = load_audio(SOUNDS_DIR / fname, sr=None)
    if y.ndim > 1: y = y[:,0]
    mfcc = librosa.feature.mfcc(y=y, sr=sr,
                                n_mfcc=N_MFCC,
                                n_fft=N_FFT,
                                hop_length=HOP_LENGTH)
    feature_list.append(mfcc.mean(axis=1))
    labels.append(label)
X = np.stack(feature_list)

# 2) t-SNE projection (cap perplexity < n_samples)
n_samples    = X.shape[0]
perp_capped  = min(TSNE_PERPLEXITY, n_samples - 1)
tsne = TSNE(
    n_components=2,
    perplexity=perp_capped,
    max_iter=TSNE_N_ITER,    # scikit-learn >= 1.5; older versions call this n_iter
    init='random',
    random_state=0
)
X_tsne = tsne.fit_transform(X)

# 3) UMAP projection
umap_model = UMAP(
    n_components=2,
    n_neighbors=UMAP_N_NEIGHBORS,
    min_dist=UMAP_MIN_DIST,
    metric=UMAP_METRIC,
    random_state=0
)
X_umap = umap_model.fit_transform(X)

# 4) Plot side-by-side
fig, axes = plt.subplots(1, 2, figsize=(14,6), constrained_layout=True)
for ax, data, title in zip(
    axes,
    [X_tsne, X_umap],
    [
      f"t-SNE (perplexity={perp_capped})",
      f"UMAP (n_neighbors={UMAP_N_NEIGHBORS}, min_dist={UMAP_MIN_DIST})"
    ]
):
    for lbl in sorted(set(labels)):   # sorted: same colors and legend order on every run
        mask = np.array(labels) == lbl
        ax.scatter(data[mask,0], data[mask,1], label=lbl, alpha=0.8)
    ax.set_title(title)
    ax.set_xlabel('Dim 1')
    ax.set_ylabel('Dim 2')
    ax.grid(True)
    ax.legend(loc='best')
plt.show()


---

## 🛠 Exercise: 2D Embedding & Clustering

### 🎯 Task  
Take a **40-dimensional feature representation** of **environmental sounds** and reduce it to **2 dimensions** using your choice of **PCA**, **t-SNE**, or **UMAP**.

---

### 🧾 Steps

1. **Prepare Features**  
   Load or compute a **40-D feature matrix** for a collection of environmental audio clips.  
   Example features:
   - MFCC means/variances
   - Spectral contrast
   - Zero-crossing rate
   - RMS energy

2. **Dimensionality Reduction**  
   Apply one of the following methods to project the 40-D data down to 2D:
   - **PCA**
   - **t-SNE**
   - **UMAP**

3. **Clustering**  
   Run a clustering algorithm (e.g., **K-Means**) on the 2D embeddings.

4. **Visualization**  
   Plot the 2D points:
   - **Color-coded by cluster label**
   - Optionally **overlay true class labels** if available

---

### 📦 Deliverables

- 📊 **2D Embedding Plots**  
  - Clear cluster visualization for each method you try

- 🔍 **Quantitative Evaluation**  
  - Use metrics like:
    - **Silhouette score**
    - **Davies–Bouldin index**
  - Compare performance across PCA, t-SNE, and UMAP

- ✍️ **Brief Discussion**  
  - Which dimensionality reduction method **best preserves class structure**, and **why**?
  - Reflect on the **strengths/limitations** of each approach

---

💡 *Tip:* Run all three techniques side-by-side to compare not only **visual separability**, but also **quantitative clustering quality**.
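As a starting point for the exercise, the sketch below runs the reduce, cluster, and evaluate loop on synthetic 40-D data. Everything specific here is an assumption for illustration: the blobs stand in for your environmental-sound features, `N_CLASSES` is set by hand, and only PCA and t-SNE are included (a `UMAP(n_components=2)` entry can be added to the `reducers` dict in the same way if umap-learn is installed).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import adjusted_rand_score, davies_bouldin_score, silhouette_score
from sklearn.preprocessing import StandardScaler

N_CLASSES = 4
# Synthetic stand-in for a (n_clips, 40) feature matrix with known class labels
X40, y_true = make_blobs(n_samples=200, n_features=40, centers=N_CLASSES,
                         cluster_std=3.0, random_state=0)
X40 = StandardScaler().fit_transform(X40)

reducers = {
    'PCA':   PCA(n_components=2, random_state=0),
    't-SNE': TSNE(n_components=2, perplexity=30, init='pca', random_state=0),
}

results = {}
for name, reducer in reducers.items():
    emb = reducer.fit_transform(X40)                      # 40-D -> 2-D
    clusters = KMeans(n_clusters=N_CLASSES, n_init=10,
                      random_state=0).fit_predict(emb)    # cluster in 2-D
    results[name] = {
        'silhouette':     silhouette_score(emb, clusters),      # higher is better
        'davies_bouldin': davies_bouldin_score(emb, clusters),  # lower is better
        'ari':            adjusted_rand_score(y_true, clusters) # agreement with true classes
    }

for name, metrics in results.items():
    print(name, {k: round(float(v), 3) for k, v in metrics.items()})
```

Silhouette and Davies–Bouldin are computed inside each 2D embedding, so they reward compact, well-separated layouts rather than faithfulness to the original classes. When true labels are available, a label-based score such as the adjusted Rand index is a more direct answer to the question of which method preserves class structure.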
