## Module 11: Learned Representations & Deep Features

In this module, we’ll move beyond hand-crafted descriptors and explore how deep neural networks learn powerful audio representations.

### Key Concepts
- **Autoencoders & CNNs on Spectrograms**  
  - Use convolutional architectures to compress and reconstruct spectrograms (autoencoders)  
  - Learn hierarchical time–frequency features directly from data  
- **Pre-trained Models**  
  - VGGish, Wav2Vec2, YAMNet, etc. provide off-the-shelf embeddings  
  - Leverage large-scale, pre-trained audio networks for downstream tasks  
- **Transfer Learning Strategies**  
  - Feature extraction: freeze a base model and train a shallow classifier on top  
  - Fine-tuning: unfreeze some layers to adapt representations to your domain  

---

### 📓 Notebook Demos

1. **Wav2Vec2 Embedding Visualization**  
   - Load a pre-trained **Facebook Wav2Vec2** audio transformer  
   - Compute 768-dim frame-level embeddings for each clip, then average to clip-level  
   - Project to 2D with **PCA** or **t-SNE** and display an interactive scatter colored by genre/speaker  

2. **CNN Fine-Tuning on Commands vs. Noise**  
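   - Convert each clip to a log-Mel spectrogram and classify it with a CNN in TensorFlow/Keras  
   - Start from a frozen pre-trained backbone (an ImageNet-trained MobileNetV2 in the demo) and train a small classification head  
   - Optionally unfreeze the last few backbone layers to fine-tune on a small “voice commands vs. background noise” dataset  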
   - Monitor training/validation accuracy live in the notebook  

---

### 🛠 Exercise: Deep Feature Comparison  
- **Task:**  
  Extract deep embeddings (e.g. from Wav2Vec2, VGGish, or YAMNet) and train a classifier (random forest or small MLP) on your own labeled dataset.  
- **Analysis:**  
  Compare classification accuracy and confusion matrices against a baseline using hand-crafted features (MFCCs + spectral features).  
- **Deliverables:**  
  - Code to load the pre-trained model and extract embeddings  
  - Classification scripts for both deep and hand-crafted feature sets  
  - A performance report discussing which representation performs better—and why  


### Key Concepts

- **Autoencoders & CNNs on Spectrograms**  
  - Use convolutional architectures to compress and reconstruct spectrograms (autoencoders).  
  - Learn hierarchical time–frequency features directly from spectrogram data rather than designing them by hand.

- **Pre-trained Models**  
  - VGGish, OpenL3, YAMNet, etc., provide off-the-shelf audio embeddings.  
  - Leverage representations learned on large-scale datasets for your own tasks.

- **Transfer Learning Strategies**  
  - **Feature Extraction:** Freeze the base model’s weights and train a shallow classifier on the extracted embeddings.  
  - **Fine-Tuning:** Unfreeze one or more layers of the pre-trained network to adapt its representations to your specific domain.  


## Demo 1: Deep Audio Embedding Visualization

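In this demo we use a **pre-trained Wav2Vec2** model (from Facebook AI) to extract embeddings from raw audio, then project them to 2D for visualization. The default checkpoint, `facebook/wav2vec2-base-960h`, was trained on read English speech (LibriSpeech), so treat any structure you see among music clips as exploratory rather than as evidence that the model encodes genre.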

**What you’ll learn:**  
- How to load and run a pre-trained audio transformer  
- How to average frame‐level embeddings into clip‐level vectors  
- How to compare PCA vs. t-SNE for visualizing high-dim representations  
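
The averaging step can be sketched in plain NumPy. This is a minimal illustration, not the notebook's code: `mean_pool` and its optional `mask` argument are hypothetical names. The demo processes one clip at a time, so a plain mean is enough there; the mask only matters if you batch clips of different lengths and pad them.

```python
import numpy as np

def mean_pool(frames, mask=None):
    """Average frame-level embeddings (T, D) into one clip-level vector (D,).

    `mask` is an optional length-T array of 1s (real frames) and 0s (padding).
    """
    frames = np.asarray(frames, dtype=np.float64)
    if mask is None:
        return frames.mean(axis=0)
    mask = np.asarray(mask, dtype=np.float64)[:, None]   # (T, 1) for broadcasting
    return (frames * mask).sum(axis=0) / max(mask.sum(), 1.0)

# Toy example: 4 frames of 3-dim embeddings, where the last frame is padding
frames = np.array([[1.0, 2.0, 3.0],
                   [3.0, 2.0, 1.0],
                   [2.0, 2.0, 2.0],
                   [0.0, 0.0, 0.0]])
clip_all    = mean_pool(frames)                      # padding drags the mean towards zero
clip_masked = mean_pool(frames, mask=[1, 1, 1, 0])   # padding is ignored
print(clip_all, clip_masked)
```

Averaging discards temporal order, which is usually acceptable for clip-level labels such as genre or speaker.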

---

### How to use

1. **Edit the USER SETTINGS** at the top of the code cell:  
   - `FILES`: list of `(Label, Filename)` pairs in `sounds/`  
   - `MODEL_NAME`: Hugging Face model name (e.g. `'facebook/wav2vec2-base-960h'`)  
   - `USE_TSNE`: `False` for PCA, `True` for t-SNE  
   - `TSNE_PERP`, `TSNE_ITER`: t-SNE perplexity and number of iterations  

2. **Run the cell** to:  
   - Load each clip (resampling to 16 kHz as required)  
   - Compute 768-dim Wav2Vec2 embeddings, averaged over time  
   - Reduce to 2D with either **PCA** or **t-SNE**  
   - Plot an interactive scatter (Plotly) colored by label  

---

### What to observe

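- **Cluster formation:** With several clips per label, points that group together suggest the embeddings capture shared characteristics (genre, speaker, instrument, etc.). One clip per label, as in the starter `FILES` list, cannot show clusters, so add more clips first.  
- **Separation vs. overlap:** PCA is linear and often gives only a rough separation; t-SNE can reveal finer local structure, but distances between clusters in a t-SNE plot are not meaningful.  
- **Model choice:** Other Wav2Vec2-style checkpoints may work by changing `MODEL_NAME` alone. YAMNet, TRILL, and OpenL3 come from other libraries, so swapping them in also means replacing the loading code and the forward pass.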
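
To make the PCA side of this comparison concrete, here is a small NumPy sketch of a 2D PCA projection via the SVD, run on synthetic vectors rather than real embeddings. The helper name `pca_2d` and the synthetic data are assumptions for illustration; the demo itself uses scikit-learn's `PCA`.

```python
import numpy as np

def pca_2d(X):
    """Project rows of X (n_samples, n_features) onto the top two principal components."""
    Xc = X - X.mean(axis=0)                        # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    coords = Xc @ Vt[:2].T                         # (n_samples, 2) coordinates
    var_ratio = (S ** 2) / np.sum(S ** 2)          # fraction of variance per component
    return coords, var_ratio[:2]

rng = np.random.default_rng(0)
# Two synthetic "classes" of 16-dim vectors, offset from each other on every axis
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(20, 16)),
                    rng.normal(4.0, 1.0, size=(20, 16))])
coords, ratio = pca_2d(X_demo)
print(coords.shape, ratio)
```

The explained-variance ratio is worth printing next to any PCA plot: if the first two components carry little of the variance, the 2D picture hides most of the structure.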



In [None]:
# ── USER SETTINGS ────────────────────────────────────────────────────────────────
# List your clips and labels here; place files in `sounds/`
FILES = [
    ('Classical', 'beethoven-symphony-no-7-2nd-movement-246122.mp3'),
    ('Jazz',      'sax-jazz-77053.mp3'),
    ('Rock',      'rock-music-6211.mp3'),
    ('Speech',    'speech.WAV'),
    # … add more (label, filename) pairs as desired …
]
MODEL_NAME = 'facebook/wav2vec2-base-960h'   # ← Hugging Face Wav2Vec2 model
USE_TSNE   = False                           # ← False→PCA, True→t-SNE
TSNE_PERP  = 5                               # ← t-SNE perplexity (must be < n_samples; capped at n_samples−1 below)
TSNE_ITER  = 1000                            # ← t-SNE iterations
# ────────────────────────────────────────────────────────────────────────────────

import numpy as np
import soundfile as sf
import librosa
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2Model
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import plotly.express as px
from pathlib import Path

# Load processor & model
processor = Wav2Vec2Processor.from_pretrained(MODEL_NAME)
model     = Wav2Vec2Model.from_pretrained(MODEL_NAME)

SOUNDS_DIR = Path('sounds')

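# Assumed cap, not part of the original settings: self-attention memory grows
# quickly with clip length, so multi-minute files can exhaust RAM.
MAX_SECONDS = 30

def load_and_prepare(path):
    """Load a clip as mono float32 at 16 kHz, keeping at most MAX_SECONDS."""
    if path.suffix.lower() == '.wav':
        y, sr = sf.read(str(path), dtype='float32')   # (frames,) or (frames, channels)
        if y.ndim > 1:                                # stereo → mono
            y = y.mean(axis=1)
    else:
        y, sr = librosa.load(str(path), sr=None)      # librosa returns mono by default
    if sr != 16000:
        y = librosa.resample(y, orig_sr=sr, target_sr=16000)
        sr = 16000
    return y[:MAX_SECONDS * sr], sr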

# 1) Compute embeddings
embeddings, labels = [], []
for label, fname in FILES:
    path = SOUNDS_DIR / fname
    y, sr = load_and_prepare(path)
    inputs = processor(y, sampling_rate=sr, return_tensors='pt', padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # mean over time → clip-level vector
    emb = outputs.last_hidden_state.mean(dim=1).squeeze().cpu().numpy()
    embeddings.append(emb)
    labels.append(label)

X = np.stack(embeddings)  # (n_clips, 768)

# 2) Reduce to 2D
if USE_TSNE:
    perp = min(TSNE_PERP, X.shape[0]-1)
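    tsne_kwargs = dict(n_components=2, perplexity=perp, init='random', random_state=0)
    try:
        dr = TSNE(max_iter=TSNE_ITER, **tsne_kwargs)   # newer scikit-learn (n_iter was renamed)
    except TypeError:
        dr = TSNE(n_iter=TSNE_ITER, **tsne_kwargs)     # older scikit-learn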
    title = f"t-SNE (perp={perp})"
else:
    dr = PCA(n_components=2, random_state=0)
    title = "PCA (2D)"
X2 = dr.fit_transform(X)

# 3) Interactive scatter
import pandas as pd
df = pd.DataFrame({
    'Dim1': X2[:,0],
    'Dim2': X2[:,1],
    'Label': labels
})
fig = px.scatter(
    df, x='Dim1', y='Dim2',
    color='Label', symbol='Label',
    title=f"{title} of Wav2Vec2 Embeddings",
    width=700, height=500
)
fig.update_layout(xaxis_title='Component 1', yaxis_title='Component 2')
fig.show()


# 🧠 Demo 2: CNN Fine-Tuning on Voice Commands vs. Background Noise

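In this demo, you’ll build a **spectrogram-based classifier** by repurposing an **ImageNet-pretrained MobileNetV2** backbone. The backbone starts **frozen**, so only the small classification head is trained (feature extraction); unfreezing its top layers afterwards turns this into fine-tuning proper.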

---

## 📥 Data Loading & Preprocessing

### 🔧 Inputs (Edit at Top)

- **`COMMANDS_DIR` / `NOISE_DIR`**: Folders containing your **voice commands** and **background noise** files (`.wav` or `.mp3`)
- **`SAMPLE_RATE`** (e.g., `16000 Hz`): Target resampling rate for all clips
- **`DURATION`** (seconds): Clips are padded or truncated to this exact length
- **`N_MELS`**, **`FFT_WINDOW`**, **`HOP_LENGTH`**: Mel-spectrogram extraction parameters
- **`BATCH_SIZE`**, **`EPOCHS`**, **`LEARNING_RATE`**: Training hyperparameters

### 🧪 Process

1. **Read each file** and **resample** to `SAMPLE_RATE`
2. **Pad or truncate** to match `DURATION`
3. **Compute log-Mel spectrogram** with `N_MELS` bands
4. **Normalize** to zero-mean / unit-variance
5. **Add a channel axis** for CNN input
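
The shape bookkeeping behind steps 2 to 5 can be checked without any audio files. The sketch below uses only NumPy; the helper names (`pad_or_truncate`, `num_frames`, `standardize`) are hypothetical, and the frame-count formula assumes the STFT is computed without centring (no padding at the clip edges).

```python
import numpy as np

def pad_or_truncate(wav, target_len):
    """Zero-pad at the end, or cut, so the clip has exactly target_len samples."""
    if len(wav) < target_len:
        return np.pad(wav, (0, target_len - len(wav)))
    return wav[:target_len]

def num_frames(n_samples, n_fft, hop_length):
    """Frame count of an STFT without centring (no padding at the edges)."""
    return 1 + (n_samples - n_fft) // hop_length

def standardize(spec, eps=1e-6):
    """Shift and scale one spectrogram to zero mean and unit variance."""
    return (spec - spec.mean()) / (spec.std() + eps)

sr, duration, n_fft, hop, n_mels = 16000, 1.0, 512, 256, 64
target_len = int(sr * duration)

fixed_short = pad_or_truncate(np.ones(10000, dtype=np.float32), target_len)
fixed_long  = pad_or_truncate(np.ones(20000, dtype=np.float32), target_len)

n_frames = num_frames(target_len, n_fft, hop)
fake_spec = np.random.default_rng(0).normal(5.0, 2.0, size=(n_mels, n_frames))
spec = standardize(fake_spec)[..., np.newaxis]        # add the channel axis
print(fixed_short.shape, fixed_long.shape, spec.shape)
```

With a 1-second clip at 16 kHz, a 512-sample window, and a 256-sample hop, this gives 61 frames, so every example has shape `(64, 61, 1)`.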

---

## 🧱 Dataset Construction

- Builds a single `tf.data.Dataset` of **(spectrogram, label)** pairs
- **Splits 80/20** into **training vs. validation** sets
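
An 80/20 split is only trustworthy if no clip can ever appear in both sets. One way to guarantee that, sketched here with NumPy and a hypothetical helper `split_indices`, is to shuffle the item indices once with a fixed seed and then cut the shuffled order in two.

```python
import numpy as np

def split_indices(n_items, train_frac=0.8, seed=0):
    """Shuffle indices once with a fixed seed, then split into train and validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    n_train = int(train_frac * n_items)
    return order[:n_train], order[n_train:]

train_idx, val_idx = split_indices(50)
overlap = set(train_idx.tolist()) & set(val_idx.tolist())
print(len(train_idx), len(val_idx), len(overlap))
```

Because the order is fixed by the seed, rerunning the split reproduces the same validation set, which keeps accuracy numbers comparable between runs.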

---

## 🧠 Model Architecture

- **Input**: (`N_MELS × time_frames × 1`) Mel-spectrogram
- **Channel Replication**: Stack to 3 channels for ImageNet CNN input
- **Resizing**: `layers.Resizing(224, 224)` to match **MobileNetV2** input size
- **Backbone**: `MobileNetV2(include_top=False, weights='imagenet')` — **frozen**
- **Head**:
  - Global Average Pooling
  - `Dropout(0.3)`
  - `Dense(1, activation='sigmoid')`
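
The first two steps of this architecture are pure shape manipulation, which the following NumPy sketch traces for a `(64, 61, 1)` spectrogram. It is an illustration under assumptions: the helper names are hypothetical, and nearest-neighbour resizing stands in for the Keras `Resizing` layer, which uses bilinear interpolation by default.

```python
import numpy as np

def to_three_channels(spec):
    """(H, W, 1) -> (H, W, 3) by repeating the single channel."""
    return np.repeat(spec, 3, axis=-1)

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of an (H, W, C) array."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h     # source row for each output row
    cols = np.arange(out_w) * w // out_w     # source column for each output column
    return img[rows][:, cols]

spec = np.random.default_rng(0).normal(size=(64, 61, 1)).astype(np.float32)
rgb = to_three_channels(spec)
resized = resize_nearest(rgb, 224, 224)
print(spec.shape, rgb.shape, resized.shape)
```

Note how unevenly the two axes are stretched: 64 mel bands and 61 frames both become 224 pixels, so the image the backbone sees no longer has the original time-frequency aspect ratio.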

---

## 🏋️‍♂️ Training

- **Loss**: Binary cross-entropy
- **Optimizer**: Adam with `LEARNING_RATE`
- **Metrics**: Accuracy
- Train for `EPOCHS`, monitoring:
  - Training loss/accuracy
  - Validation loss/accuracy
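
It helps to track loss and accuracy together because they measure different things. The NumPy sketch below (hypothetical helpers, toy numbers) computes binary cross-entropy and thresholded accuracy for two sets of predictions that classify every example correctly but with different confidence.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y = np.asarray(y_true, dtype=np.float64)
    p = np.clip(np.asarray(p_pred, dtype=np.float64), eps, 1 - eps)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

def accuracy(y_true, p_pred, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    pred = np.asarray(p_pred) > threshold
    return float(np.mean(pred == (np.asarray(y_true) == 1)))

y_true    = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]
hesitant  = [0.6, 0.4, 0.6, 0.4]

loss_confident = binary_cross_entropy(y_true, confident)
loss_hesitant  = binary_cross_entropy(y_true, hesitant)
print(accuracy(y_true, confident), accuracy(y_true, hesitant))
print(loss_confident, loss_hesitant)
```

Both prediction sets reach the same accuracy, yet the hesitant one has a much higher loss, which is why the loss curve often keeps moving after accuracy has flattened.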

---

## 📊 Outputs to Observe

### 📈 Training Curves

- **Loss Plot**:
  - Training vs. Validation loss
  - Lets you check that the model is learning and spot overfitting (validation loss rising while training loss keeps falling)

- **Accuracy Plot**:
  - Training vs. Validation accuracy
  - Gauges classification performance

### ✅ Final Validation Accuracy

- A quick summary of how well the trained model distinguishes **voice commands** from **background noise**

---

## 🧠 Interpretation Tips

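- If **validation accuracy lags training** (overfitting):
  - Add **regularization** (e.g., dropout, weight decay) or **data augmentation**
  - Collect **more examples**; unfreezing more backbone layers adds capacity and usually widens the gap

- If **both train & val accuracy are low** (underfitting):
  - Try **unfreezing the top backbone layers** and training them with a small learning rate
  - Or consider using a **custom CNN architecture** trained directly on spectrograms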

- 📏 **Note on Resizing**:
  - The `224×224` resizing **trades time–frequency resolution** for compatibility with the pretrained model
  - Alternatives include using **custom CNNs** that accept native spectrogram sizes
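
As a rough aid for reading the curves, the two situations above (a train/validation gap versus both accuracies being low) can be turned into a small check on the Keras-style history dictionary. The function name `diagnose` and both thresholds are arbitrary assumptions chosen for illustration; tune them to your task.

```python
def diagnose(history, gap_tol=0.10, low_acc=0.70):
    """Label a training run from its final train/validation accuracy.

    `history` maps 'accuracy' and 'val_accuracy' to per-epoch lists.
    The thresholds are illustrative assumptions, not established cut-offs.
    """
    train_acc = history['accuracy'][-1]
    val_acc = history['val_accuracy'][-1]
    if train_acc < low_acc and val_acc < low_acc:
        return 'underfitting'
    if train_acc - val_acc > gap_tol:
        return 'overfitting'
    return 'ok'

runs = {
    'gap':  {'accuracy': [0.80, 0.99], 'val_accuracy': [0.75, 0.78]},
    'low':  {'accuracy': [0.52, 0.58], 'val_accuracy': [0.50, 0.55]},
    'fine': {'accuracy': [0.85, 0.93], 'val_accuracy': [0.84, 0.91]},
}
labels_by_run = {name: diagnose(h) for name, h in runs.items()}
print(labels_by_run)
```

With a small validation set, accuracy moves in coarse steps, so look at the whole curve before acting on a single final number.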


In [None]:
# ── USER SETTINGS ────────────────────────────────────────────────────────────────
COMMANDS_DIR   = 'sounds/commands'     # ← folder containing your “voice commands” WAV/MP3 files
NOISE_DIR      = 'sounds/noise'        # ← folder containing your “background noise” WAV/MP3 files
SAMPLE_RATE    = 16000                 # ← target sample rate for all audio
DURATION       = 1.0                   # ← clip length in seconds (pad/truncate)
N_MELS         = 64                    # ← number of mel bands for spectrograms
FFT_WINDOW     = 512                   # ← FFT window size (samples)
HOP_LENGTH     = 256                   # ← hop between frames (samples)
BATCH_SIZE     = 16                    # ← batch size for training/validation
EPOCHS         = 10                    # ← number of training epochs
LEARNING_RATE  = 1e-4                  # ← optimizer learning rate
# ────────────────────────────────────────────────────────────────────────────────

import os, glob
import numpy as np
import tensorflow as tf
import librosa
import matplotlib.pyplot as plt
from tensorflow.keras import layers, models, applications, optimizers

# ── DATA LOADING & PREPROCESSING ────────────────────────────────────────────────

def preprocess_file(path):
    """
    path may be a Python str or tf.Tensor of type string.
    We convert it to a native str before loading with librosa.
    """
    # If tf.Tensor, extract Python value
    if hasattr(path, "numpy"):
        path = path.numpy()
    if isinstance(path, bytes):
        filepath = path.decode("utf-8")
    else:
        filepath = str(path)
    # 1) Load & resample
    wav, sr = librosa.load(filepath, sr=None)
    if sr != SAMPLE_RATE:
        wav = librosa.resample(wav, orig_sr=sr, target_sr=SAMPLE_RATE)
    # 2) Pad or truncate to exactly DURATION seconds
    target_len = int(SAMPLE_RATE * DURATION)
    if wav.shape[0] < target_len:
        wav = np.pad(wav, (0, target_len - wav.shape[0]))
    else:
        wav = wav[:target_len]
    # 3) Compute log-mel spectrogram (disable centering to keep frame count fixed)
    mel = librosa.feature.melspectrogram(
        y=wav,
        sr=SAMPLE_RATE,
        n_fft=FFT_WINDOW,
        hop_length=HOP_LENGTH,
        n_mels=N_MELS,
        center=False
    )
    log_mel = np.log(mel + 1e-6).astype(np.float32)
    # 4) Normalize to zero mean / unit variance
    log_mel = (log_mel - log_mel.mean()) / (log_mel.std() + 1e-6)
    # 5) Add channel dimension
    return np.expand_dims(log_mel, -1)

def tf_preprocess(path, label):
    # Wrap our preprocessing in tf.py_function
    spec = tf.py_function(preprocess_file, inp=[path], Tout=tf.float32)
    # We know the output shape exactly
    time_frames = int(np.floor((SAMPLE_RATE*DURATION - FFT_WINDOW) / HOP_LENGTH)) + 1
    spec.set_shape([N_MELS, time_frames, 1])
    return spec, label

# Gather file lists
cmd_wavs  = glob.glob(os.path.join(COMMANDS_DIR, '*.wav'))
cmd_mp3s  = glob.glob(os.path.join(COMMANDS_DIR, '*.mp3'))
cmd_files = cmd_wavs + cmd_mp3s
noise_wavs  = glob.glob(os.path.join(NOISE_DIR, '*.wav'))
noise_mp3s  = glob.glob(os.path.join(NOISE_DIR, '*.mp3'))
noise_files = noise_wavs + noise_mp3s

if not cmd_files:
    raise ValueError(f"No audio files found in {COMMANDS_DIR}")
if not noise_files:
    raise ValueError(f"No audio files found in {NOISE_DIR}")

# Compute train/validation split counts
total_samples = len(cmd_files) + len(noise_files)
train_count   = int(0.8 * total_samples)

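# Build tf.data.Dataset of (filepath, label)
ds_cmd   = tf.data.Dataset.from_tensor_slices((cmd_files, [1]*len(cmd_files)))
ds_noise = tf.data.Dataset.from_tensor_slices((noise_files, [0]*len(noise_files)))
dataset  = ds_cmd.concatenate(ds_noise)
# Shuffle ONCE in a fixed order: reshuffling on every pass would move clips
# between the take()/skip() halves and leak validation clips into training.
dataset  = dataset.shuffle(total_samples, seed=0, reshuffle_each_iteration=False)

# Split file paths into train/validation, then preprocess each half
train_ds = (dataset.take(train_count)
                   .shuffle(train_count)   # new training order every epoch
                   .map(tf_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
                   .batch(BATCH_SIZE)
                   .prefetch(tf.data.AUTOTUNE))
val_ds   = (dataset.skip(train_count)
                   .map(tf_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
                   .batch(BATCH_SIZE)
                   .prefetch(tf.data.AUTOTUNE))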

# ── MODEL DEFINITION ─────────────────────────────────────────────────────────────

# Compute time_frames again for model input shape
time_frames = int(np.floor((SAMPLE_RATE*DURATION - FFT_WINDOW) / HOP_LENGTH)) + 1
input_shape = (N_MELS, time_frames, 1)

inputs = layers.Input(shape=input_shape)

# 1) Convert 1-channel → 3-channel
x = layers.Concatenate()([inputs, inputs, inputs])   # now shape = (None, N_MELS, time_frames, 3)

# 2) Resize to 224×224 so we match MobileNetV2’s expected input
x = layers.Resizing(224, 224)(x)                     # now shape = (None, 224, 224, 3)

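# 3) Pre-trained backbone
#    Note: MobileNetV2's ImageNet weights expect inputs scaled to [-1, 1]; the
#    standardized spectrogram is only roughly in that range.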
backbone = applications.MobileNetV2(
    input_shape=(224, 224, 3),
    include_top=False,
    weights='imagenet'
)
backbone.trainable = False

x = backbone(x, training=False)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(1, activation='sigmoid')(x)

model = models.Model(inputs, outputs)
model.compile(
    optimizer=optimizers.Adam(LEARNING_RATE),
    loss='binary_crossentropy',
    metrics=['accuracy']
)
model.summary()



# ── TRAINING ─────────────────────────────────────────────────────────────────────

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=EPOCHS
)

# ── PLOTTING TRAIN/VAL METRICS ────────────────────────────────────────────────────

plt.figure(figsize=(12,4))

# Loss
plt.subplot(1,2,1)
plt.plot(history.history['loss'],     label='train loss')
plt.plot(history.history['val_loss'], label='val loss')
plt.title('Loss')
plt.legend()
plt.grid(True)

# Accuracy
plt.subplot(1,2,2)
plt.plot(history.history['accuracy'],     label='train acc')
plt.plot(history.history['val_accuracy'], label='val acc')
plt.title('Accuracy')
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()


---

## 🛠 Exercise: Deep Feature Comparison

### 🎯 Task  
Extract **deep audio embeddings** (e.g., from **Wav2Vec2**, **VGGish**, or **YAMNet**) and train a classifier (**Random Forest** or a small **MLP**) on your own **labeled audio dataset**.

---

### 🔍 Analysis  
Compare the **classification performance** of deep features vs. a **baseline model using hand-crafted features** like MFCCs and spectral descriptors.

---

### 📦 Deliverables

#### ✅ Feature Extraction Code
- Load a **pretrained model** (e.g., Wav2Vec2, VGGish, YAMNet)
- Extract **deep embeddings** from each audio clip

#### ✅ Classification Scripts

1. **Deep Embedding Pipeline**
   - Use embeddings + **Random Forest (RF)** or **Multi-Layer Perceptron (MLP)**

2. **Hand-Crafted Feature Pipeline**
   - Use features like:
     - MFCCs
     - Spectral centroid
     - Bandwidth
   - Train using the same **RF or MLP** classifier
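
For the hand-crafted pipeline, it can help to see what two of these descriptors actually compute. The sketch below derives a spectral centroid and bandwidth for a single frame using NumPy only; the function name and test signals are hypothetical. In practice, `librosa.feature.spectral_centroid` and `librosa.feature.spectral_bandwidth` compute per-frame versions for a whole clip.

```python
import numpy as np

def spectral_centroid_bandwidth(y, sr, n_fft=1024):
    """Centroid and bandwidth (Hz) of one Hann-windowed frame from the start of `y`."""
    frame = y[:n_fft] * np.hanning(n_fft)
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    weights = mag / (mag.sum() + 1e-12)               # magnitudes as a distribution
    centroid = float(np.sum(freqs * weights))         # magnitude-weighted mean frequency
    bandwidth = float(np.sqrt(np.sum(weights * (freqs - centroid) ** 2)))
    return centroid, bandwidth

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000.0 * t)                 # pure 1 kHz sine
noise = np.random.default_rng(0).normal(size=sr)      # white noise

c_tone, bw_tone = spectral_centroid_bandwidth(tone, sr)
c_noise, bw_noise = spectral_centroid_bandwidth(noise, sr)
print(c_tone, bw_tone, c_noise, bw_noise)
```

The sine has a centroid near 1 kHz and a narrow bandwidth, while white noise sits near the middle of the 0 to 8 kHz range with a wide bandwidth. A typical clip-level feature vector stacks the mean and standard deviation of such descriptors over all frames.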

---

### 📊 Performance Report

- **Metrics**:
  - Accuracy
  - Precision / Recall / F₁-score

- **Confusion Matrices**:
  - For both **deep** and **hand-crafted** models
  - Analyze **error patterns**

- **Discussion**:
  - Which feature representation performed **best**?
  - Provide **hypotheses** to explain your observations
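
The metrics listed above all derive from the confusion matrix. Here is a minimal NumPy sketch with made-up labels (the helper names are hypothetical); scikit-learn's `confusion_matrix` and `classification_report` do the same job in a real pipeline.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_class_scores(cm):
    """Per-class precision, recall, and F1 from a confusion matrix."""
    tp = np.diag(cm).astype(float)
    pred_total = cm.sum(axis=0).astype(float)    # column sums: predicted counts
    true_total = cm.sum(axis=1).astype(float)    # row sums: actual counts
    precision = np.divide(tp, pred_total, out=np.zeros_like(tp), where=pred_total > 0)
    recall = np.divide(tp, true_total, out=np.zeros_like(tp), where=true_total > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1

# Made-up labels for three classes
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred, n_classes=3)
precision, recall, f1 = per_class_scores(cm)
acc = np.trace(cm) / cm.sum()
print(cm)
print(precision, recall, f1, acc)
```

Run the same evaluation on identical train/test splits for both feature sets; otherwise differences in the matrices may reflect the split rather than the representation.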

---

💡 *Tip*: Use dimensionality reduction (e.g., PCA or t-SNE) on embeddings to **visualize feature separability**!
