## Module 12: Advanced Modelling Strategies

In this module, we’ll explore sequence models and generative models for audio, along with classical and deep-learning approaches to source separation.

### 🔑 Key Concepts
- **Sequence Models for Audio**  
  - RNNs & LSTMs for modeling temporal dependencies  
  - Transformer architectures (self-attention) for long-range context  
- **Generative Waveform Models**  
  - WaveNet autoregressive synthesis  
  - GANs for realistic waveform and spectrogram generation  
- **Source Separation**  
  - NMF (Non-negative Matrix Factorization) for classical spectral decomposition  
  - Deep-learning–based separation (e.g. U-Net, Open-Unmix)

---

### 📓 Notebook Demos

1. **Transformer-Based Speech Enhancement**  
   - Add synthetic Gaussian noise at a configurable level (`NOISE_LEVEL`)  
   - Use a small Transformer encoder to clean the speech magnitudes  
   - Change the noise level in the settings block and listen to the enhanced output  

2. **NMF Music/Vocal Separation**  
   - Factorize the magnitude spectrogram of a music clip via NMF (stereo input is downmixed to mono)  
   - Reconstruct and play back two stems (ideally vocals and accompaniment)  
   - Optional: compare to a pre-trained deep separator (e.g. Open-Unmix); the demo below omits this step

---

### 🛠 Exercise: RNN-Based Voice Activity Detection
- **Task:**  
  Build and train a simple RNN/LSTM to detect speech vs. silence frames.  
- **Steps:**  
  1. Extract short-time features (e.g. MFCC or log-mel) from a speech-silence dataset.  
  2. Train an RNN or bidirectional LSTM to output frame-wise speech probability.  
  3. Compare its performance (ROC, F₁) against an energy-threshold baseline.  
- **Deliverables:**  
  - Training code and model checkpoints  
  - Evaluation script reporting ROC curve and F₁ score  
  - Audio examples showing frame masks from both methods  


# 🔑 Key Concepts: Sequence Models for Audio

---

## 🔁 RNNs & LSTMs

**Recurrent Neural Networks (RNNs)** and their gated variants (**LSTMs**, **GRUs**) process audio as a **time-series**, maintaining a **hidden “state”** that carries information forward through time.

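- Well suited to **frame-by-frame tasks** such as:
  - Speech recognition
  - Voice-activity detection
- Capture **temporal context** from preceding frames (and from following frames when bidirectional); the gates in LSTMs and GRUs help retain it over longer spans than a plain RNN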
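
To make the hidden-state idea concrete, here is a minimal NumPy sketch of a vanilla (Elman) RNN stepping through feature frames. The weight names (`W_xh`, `W_hh`, `b_h`), the sizes, and the random inputs are illustrative assumptions, not part of the course code; LSTMs and GRUs wrap gates around the same kind of recurrence.

```python
import numpy as np

def rnn_forward(x, W_xh, W_hh, b_h):
    """Run a vanilla RNN over frames x of shape (T, n_features)."""
    h = np.zeros(W_hh.shape[0])          # initial hidden state
    states = []
    for x_t in x:                        # one step per frame
        # New state mixes the current frame with the previous state
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)              # (T, hidden)

rng = np.random.default_rng(0)
T, n_features, hidden = 50, 13, 8        # e.g. 50 frames of 13 MFCCs (assumed sizes)
x = rng.standard_normal((T, n_features))
W_xh = 0.1 * rng.standard_normal((hidden, n_features))
W_hh = 0.1 * rng.standard_normal((hidden, hidden))
b_h = np.zeros(hidden)

states = rnn_forward(x, W_xh, W_hh, b_h)
print(states.shape)                      # (50, 8)
```

Each row of `states` depends on every earlier frame through `h`, which is what lets a frame-wise classifier use context rather than a single frame.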

---

## 🔀 Transformers (Self-Attention)

Transformers **replace recurrence** with **multi-headed self-attention**, allowing **each time step** to attend to **all others** in the sequence.

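- Capture **long-range dependencies** (e.g., across sentences or musical phrases), because any two time steps are connected by a single attention step
- **Parallelizable** across time steps during training, which suits modern hardware, although attention cost grows quadratically with sequence length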
- Widely used in:
  - Audio captioning
  - Music modeling
  - Large-scale speech models
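
A single attention head can be sketched in a few lines of NumPy. The projection matrices below are random stand-ins (an assumption for illustration); a real Transformer layer learns them and adds multiple heads, residual connections, layer normalization, and positional information.

```python
import numpy as np

def self_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over frames x of shape (T, d)."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])                   # (T, T): every frame vs. every frame
    scores = scores - scores.max(axis=-1, keepdims=True)      # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the key axis
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d = 6, 4                                                   # assumed toy sizes
x = rng.standard_normal((T, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

out, weights = self_attention(x, W_q, W_k, W_v)
print(out.shape, weights.shape)                               # (6, 4) (6, 6)
```

The `(T, T)` weight matrix is why attention sees the whole sequence at once, and also why its memory cost grows with the square of the number of frames.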

---

## 🎵 Generative Waveform Models

### 🌀 WaveNet

- An **autoregressive CNN** that models **raw audio** one sample at a time
- Uses **dilated causal convolutions**, so the receptive field grows exponentially with depth while each output depends only on past samples
- Can generate **highly realistic speech or music**
- Often **conditioned** on linguistic or musical inputs
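
The effect of dilation on the receptive field is easy to check numerically. The sketch below assumes kernel size 2 and a schedule of dilations 1, 2, ..., 512 repeated three times; that schedule and the helper names are illustrative choices, not a specification of any particular WaveNet implementation.

```python
import numpy as np

def receptive_field(kernel_size, dilations):
    """Number of input samples seen by a stack of dilated causal convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)

def causal_dilated_conv(x, w, dilation):
    """Kernel-size-2 causal convolution: y[t] = w[0] * x[t - dilation] + w[1] * x[t]."""
    x_delayed = np.concatenate([np.zeros(dilation), x[:-dilation]])
    return w[0] * x_delayed + w[1] * x

# Check the formula on a small stack by pushing an impulse through it
impulse = np.zeros(16)
impulse[0] = 1.0
y = impulse
for d in [1, 2, 4]:
    y = causal_dilated_conv(y, w=(1.0, 1.0), dilation=d)
print(np.count_nonzero(y), receptive_field(2, [1, 2, 4]))   # 8 8

# Assumed WaveNet-like schedule: dilations 1, 2, ..., 512, repeated 3 times
dilations = [2 ** i for i in range(10)] * 3
rf = receptive_field(2, dilations)
print(rf, rf / 16000)   # 3070 0.191875 (about 0.19 s of context at 16 kHz)
```

Thirty layers therefore cover roughly 0.19 s of audio, whereas the same depth without dilation would cover only 31 samples.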

### 🤖 GANs for Audio

**Generative Adversarial Networks (GANs)** pair a generator with a discriminator. The generator is trained to produce:
- **Waveforms** or
- **Spectrograms**

that the discriminator cannot distinguish from real audio.

Used for tasks like:
- Speech enhancement
- Music synthesis
- Audio style transfer

---

## 🎚️ Source Separation

### 📊 NMF (Non-negative Matrix Factorization)

A **classical unsupervised method** that decomposes a spectrogram into:
- **Basis spectra**
- **Activation patterns**

Relies on **non-negativity** so that components can only add, never cancel. Groups of these **additive components** can then be assigned to sources (e.g., separating vocals from accompaniment).
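
One classical way to fit the factorization is with multiplicative updates that minimize the squared error between `V` and `W @ H`. The sketch below is a minimal NumPy version for illustration; the function name and the synthetic input are assumptions, and it is not the solver scikit-learn uses by default (that is coordinate descent).

```python
import numpy as np

def nmf_multiplicative(V, rank, n_iter=200, eps=1e-10, seed=0):
    """Factorize non-negative V (F x T) as W (F x rank) @ H (rank x T)."""
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    W = rng.random((n_freq, rank)) + eps      # basis spectra
    H = rng.random((rank, n_frames)) + eps    # activation patterns
    for _ in range(n_iter):
        # Multiplying by non-negative ratios keeps W and H non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Synthetic rank-3 non-negative "spectrogram" standing in for real audio
rng = np.random.default_rng(1)
V = rng.random((20, 3)) @ rng.random((3, 30))

W1, H1 = nmf_multiplicative(V, rank=3, n_iter=1)
W, H = nmf_multiplicative(V, rank=3, n_iter=200)
err_start = np.linalg.norm(V - W1 @ H1)
err_end = np.linalg.norm(V - W @ H)
print(err_start, err_end)                     # the error shrinks with more iterations
```

Because both runs start from the same random seed, the comparison isolates the effect of the extra iterations.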

---

### 🤖 Deep-Learning–Based Separation

Modern architectures like:
- **U-Nets**
- **Temporal Convolutional Networks (TCNs)**
- **Open-Unmix**

take a **spectrogram or waveform** as input and directly **predict separated sources**.

- Trained on **large datasets** of mixed/isolated tracks
- Generally **outperform classical methods** such as NMF on music and speech separation tasks


# 🔧 Demo 1: Transformer-Based Speech Enhancement (Overfit on One Clip)

In this demo you’ll train a tiny Transformer encoder to remove synthetic Gaussian noise from a single speech WAV file. All inputs are edited at the top of the code cell; there are no extra GUI sliders.

**What the code does:**
1. **Load** your clean speech WAV (using `soundfile` for compatibility).  
2. **Synthesize** a noisy version by adding Gaussian noise of standard deviation `NOISE_LEVEL`.  
3. **Compute STFTs** of the clean and noisy signals to get their magnitudes (`mag_clean`, `mag_noisy`); the phase is taken from the clean STFT.  
4. **Build PyTorch tensors** of shape *(T × F)* where T = time frames, F = frequency bins.  
5. **Define** a simple Transformer-based enhancer (`Enhancer`):  
   - Projects input magnitudes to a `D_MODEL`-dim embedding,  
   - Passes through `N_LAYERS` of self-attention with `N_HEADS` heads,  
   - Projects back to frequency-bin dimension.  
6. **Train** (overfit) for `EPOCHS` epochs on this one example, minimizing MSE between predicted & clean magnitudes.  
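7. **Enhance & invert**: predict magnitudes from the noisy input and invert them back to a waveform using the clean signal's phase (an oracle shortcut for this demo; in practice only the noisy phase is available).  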
8. **Play back** the noisy input and enhanced output to judge quality improvements.
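
The tensor shape in step 4 follows directly from the STFT settings. The helper below is a small sketch (the function name and the 3-second, 16 kHz clip are assumptions) that reproduces the frame and bin counts of a centered STFT, which is librosa's default.

```python
def stft_shape(n_samples, n_fft, hop_length):
    """(T, F) for a centered STFT (librosa's default, center=True)."""
    n_frames = 1 + n_samples // hop_length   # T: time frames
    n_bins = 1 + n_fft // 2                  # F: frequency bins
    return n_frames, n_bins

# Hypothetical 3-second clip at 16 kHz with N_FFT = 1024 and HOP_LENGTH = N_FFT // 4
T, F = stft_shape(n_samples=3 * 16000, n_fft=1024, hop_length=1024 // 4)
print(T, F)     # 188 513
print(T * T)    # 35344 attention weights per head and layer
```

Halving `HOP_LENGTH` roughly doubles `T` and therefore quadruples the size of each attention matrix, which is worth keeping in mind for longer clips.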

---

**USER SETTINGS** (edit these at the top of the code cell):

- `FILENAME_CLEAN`: name of your clean speech WAV file in `sounds/` (must end in `.wav`).  
- `NOISE_LEVEL`: noise standard deviation to add (valid range `0.0 … 0.2`).  
- `N_FFT`: STFT window size (power of two, e.g. `512, 1024, 2048`).  
- `HOP_LENGTH`: hop size between frames (`≤ N_FFT`, e.g. `N_FFT//4`).  
- `D_MODEL`: Transformer embedding dimension (e.g. `128, 256, 512`).  
- `N_HEADS`: number of attention heads (divisor of `D_MODEL`).  
- `N_LAYERS`: number of Transformer encoder layers.  
- `LR`: learning rate for Adam (e.g. `1e-3`).  
- `EPOCHS`: number of training epochs (e.g. `100–500` for a quick demo).
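
`NOISE_LEVEL` is a standard deviation, so the resulting signal-to-noise ratio depends on how loud the clip is. If you would rather think in dB, a conversion can be sketched as follows; the helper names and the sine-wave stand-in for speech are assumptions for illustration.

```python
import numpy as np

def noise_std_for_snr(signal, snr_db):
    """Std-dev of zero-mean noise that yields the requested SNR (in dB) for this signal."""
    signal_power = np.mean(signal ** 2)
    return np.sqrt(signal_power / 10 ** (snr_db / 10))

def measured_snr_db(clean, noisy):
    """SNR in dB, treating (noisy - clean) as the noise."""
    noise = noisy - clean
    return 10 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2))

rng = np.random.default_rng(0)
t = np.arange(160000) / 16000
clean = 0.5 * np.sin(2 * np.pi * 220 * t)      # stand-in for a speech clip

sigma = noise_std_for_snr(clean, snr_db=20)    # value you could use as NOISE_LEVEL
noisy = clean + sigma * rng.standard_normal(len(clean))
print(sigma, measured_snr_db(clean, noisy))
```

For this test tone, a 20 dB SNR corresponds to a noise level of about 0.035, which sits inside the suggested `0.0 … 0.2` range.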

---

**Outputs to Observe:**

- **Console logs:** Loss printed every 50 epochs; it should decrease steadily as the model memorizes the clip.  
- **Audio players:**  
  - ▶️ **Noisy Input**: hear the synthetic noise level.  
  - ▶️ **Transformer-Enhanced Output**: listen for noise reduction and for any artifacts introduced by the model.

- **Interpretation:**  
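  - At low `NOISE_LEVEL`, the enhanced output should be close to the original waveform.  
  - At higher `NOISE_LEVEL`, the Transformer may struggle and introduce artifacts.  
  - Because the model is trained and evaluated on the same clip, the result shows memorization, not generalization to unseen speech.  
  - Adjust `D_MODEL`, `N_HEADS`, and `N_LAYERS` to see how model capacity affects denoising quality.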


In [None]:
# ── USER SETTINGS ────────────────────────────────────────────────────────────────
FILENAME_CLEAN = 'speech.WAV'  # ← your clean speech clip in sounds/ (WAV only)
NOISE_LEVEL    = 0.05                # ← noise std-dev (0.0 … 0.2)
N_FFT          = 1024                # ← STFT window size (power of two)
HOP_LENGTH     = N_FFT // 4          # ← hop size between frames
D_MODEL        = 256                 # ← transformer embedding dim
N_HEADS        = 4                   # ← number of self-attention heads
N_LAYERS       = 2                   # ← number of transformer encoder layers
LR             = 1e-3                # ← learning rate
EPOCHS         = 200                 # ← training epochs (overfit demo)
# ────────────────────────────────────────────────────────────────────────────────

import soundfile as sf
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import librosa
from IPython.display import Audio, display
from pathlib import Path

# 1) Load clean speech with soundfile for WAV compatibility
file_path = Path('sounds') / FILENAME_CLEAN
y_clean, sr = sf.read(str(file_path), dtype='float32')
# If stereo, take first channel
if y_clean.ndim > 1:
    y_clean = y_clean[:, 0]

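# 2) Synthesize noisy signal (cast the noise so y_noisy stays float32)
y_noisy = y_clean + NOISE_LEVEL * np.random.randn(len(y_clean)).astype(np.float32)

# 3) Compute STFT magnitudes & phase
#    NOTE: the phase is taken from the *clean* STFT. This is an oracle shortcut
#    for the demo; a real enhancer only has access to the noisy phase.
D_clean   = librosa.stft(y_clean, n_fft=N_FFT, hop_length=HOP_LENGTH)
mag_clean, phase = np.abs(D_clean), np.angle(D_clean)
mag_noisy = np.abs(librosa.stft(y_noisy, n_fft=N_FFT, hop_length=HOP_LENGTH))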

# 4) Prepare PyTorch tensors of shape (T, F)
mag_noisy_t = torch.from_numpy(mag_noisy.T).float()
mag_clean_t = torch.from_numpy(mag_clean.T).float()

# 5) Define the simple Transformer-based enhancer
class Enhancer(nn.Module):
    def __init__(self, feat_dim, d_model, nhead, nlayers):
        super().__init__()
        self.input_proj  = nn.Linear(feat_dim, d_model)
        encoder_layer   = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
        self.encoder     = nn.TransformerEncoder(encoder_layer, num_layers=nlayers)
        self.output_proj = nn.Linear(d_model, feat_dim)
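    def forward(self, x):
        # x shape: (T, F). The encoder's default layout is (sequence, batch, d_model),
        # so add an explicit batch axis of size 1; this also works on PyTorch
        # versions that do not accept unbatched input.
        z = self.input_proj(x).unsqueeze(1)    # → (T, 1, d_model)
        z = self.encoder(z)                    # → (T, 1, d_model)
        return self.output_proj(z.squeeze(1))  # → (T, F)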

model     = Enhancer(mag_noisy_t.shape[1], D_MODEL, N_HEADS, N_LAYERS)
optimizer = optim.Adam(model.parameters(), lr=LR)
criterion = nn.MSELoss()

# 6) Train (overfit) on this single clip
model.train()
for epoch in range(1, EPOCHS+1):
    optimizer.zero_grad()
    pred = model(mag_noisy_t)
    loss = criterion(pred, mag_clean_t)
    loss.backward()
    optimizer.step()
    if epoch % 50 == 0:
        print(f"Epoch {epoch}/{EPOCHS}  Loss: {loss.item():.4f}")

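# 7) Enhance & invert back to waveform
model.eval()
with torch.no_grad():
    enhanced_mag = model(mag_noisy_t).cpu().numpy().T  # shape (F, T)

# The linear output layer can produce negative values; clip them so they are valid magnitudes
enhanced_mag = np.maximum(enhanced_mag, 0.0)
D_enh = enhanced_mag * np.exp(1j * phase)
# length=... trims/pads the output to match the input clip exactly
y_enh = librosa.istft(D_enh, hop_length=HOP_LENGTH, length=len(y_clean))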

# 8) Playback
print("▶️ Noisy Input")
display(Audio(y_noisy, rate=sr))
print("▶️ Transformer-Enhanced Output")
display(Audio(y_enh,   rate=sr))


# 🎶 Demo 2: NMF-Based Music/Vocal Separation

In this demo, you’ll split a **stereo music clip** into two **stems** (e.g., vocals vs. accompaniment) using **Non-negative Matrix Factorization (NMF)**.

---

## 🔍 What the Code Does

1. **Loads your stereo music file**  
   → Automatically converted to **mono**

2. **Computes the magnitude spectrogram** \( D \) via **STFT** using `N_FFT` and `HOP_LENGTH`

3. **Factorizes** \( D \approx WH \) using **NMF** with rank `N_COMPONENTS`
   - \( W \): shape = *(frequency bins × components)*
   - \( H \): shape = *(components × time frames)*

4. **Splits the components in half**:
   - Let \( K = \left\lfloor \frac{\text{N_COMPONENTS}}{2} \right\rfloor \)
   - **Stem 1**: Reconstructed from the **first K** components
   - **Stem 2**: Reconstructed from the **remaining components**

5. **Reconstructs time-domain signals** for both stems  
   → Reuses the **phase of the original mix** for the inverse STFT of each stem

6. **Plays back**:
   - The **original mix**
   - **Stem 1**
   - **Stem 2**

7. **Plots** side-by-side:
   - Original magnitude spectrogram
   - Stem 1 spectrogram
   - Stem 2 spectrogram

📝 *Note: We’ve dropped the “compare to Open-Unmix” step for simplicity.*

---

## ⚙️ Inputs (Edit at the Top)

- **`FILENAME_MUSIC`**  
  Your stereo music clip in `sounds/` (`.wav` or `.mp3`)

- **`N_FFT`**  
  STFT window size (power of 2, e.g., `1024`, `2048`)

- **`HOP_LENGTH`**  
  Hop size between frames (≤ `N_FFT`, e.g., `N_FFT // 4` or `512`)

- **`N_COMPONENTS`**  
  NMF rank (**integer ≥ 2**)  
  Typical range: **2–16**

---

## 📤 Outputs to Observe

### 🎧 Audio Players

- ▶️ **Original Mix**
- ▶️ **Stem 1** (components `1…K`)
- ▶️ **Stem 2** (components `K+1…N_COMPONENTS`)

---

### 📈 Spectrograms

- Original **magnitude spectrogram**
- **Stem 1** spectrogram
- **Stem 2** spectrogram

---

## 🧠 How to Interpret

### 🎧 Listening Test
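- Does **Stem 1** emphasize **vocals** (or lead instruments)? NMF returns its components in no particular order, so vocal content may land in either stem or be spread across both.
- Does **Stem 2** retain **rhythm/harmony**?
- Listen for any **“bleed-through” artifacts**, where one source is still audible in the other stem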

### 📊 Spectrogram Comparison
- Stem spectrograms reveal **which frequency regions** each group of components captures
- Clearer separation usually occurs when **components align with distinct timbral structures**

---

🔧 **Try this**:  
Vary `N_COMPONENTS` (and thus \( K \)) to explore how the **model rank** affects separation quality!


In [None]:
# ── USER SETTINGS ────────────────────────────────────────────────────────────────
FILENAME_MUSIC = 'song-2-302326.mp3'     # ← stereo music clip in sounds/
N_FFT          = 2048                # ← STFT window size
HOP_LENGTH     = 512                 # ← hop size between frames
N_COMPONENTS   = 8                   # ← NMF rank (e.g. 2–16)
# ────────────────────────────────────────────────────────────────────────────────

import numpy as np
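import librosa
import librosa.display  # imported explicitly; older librosa versions do not load it with `import librosa`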
import matplotlib.pyplot as plt
from sklearn.decomposition import NMF
from IPython.display import Audio, display
from pathlib import Path

# 1) Load stereo music, convert to mono
y, sr = librosa.load(str(Path('sounds')/FILENAME_MUSIC), sr=None, mono=True)

# 2) Compute the complex STFT once; keep its magnitude and phase
S = librosa.stft(y, n_fft=N_FFT, hop_length=HOP_LENGTH)
D, phase = np.abs(S), np.angle(S)

# 3) Fit NMF to magnitude
model = NMF(n_components=N_COMPONENTS, init='random', random_state=0, max_iter=200)
W = model.fit_transform(D)   # (F, N_COMPONENTS)
H = model.components_        # (N_COMPONENTS, T)

# 4) Split the components into two groups:
#    first K = N_COMPONENTS // 2 → stem 1, the rest → stem 2.
#    NMF does not order components by source, so which stem (if either)
#    sounds like "vocals" has to be checked by listening.
K = N_COMPONENTS // 2
D1 = W[:, :K] @ H[:K, :]
D2 = W[:, K:] @ H[K:, :]

# 5) Reconstruct time-domain signals, reusing the mixture phase for both stems
y1 = librosa.istft(D1 * np.exp(1j * phase), hop_length=HOP_LENGTH, length=len(y))
y2 = librosa.istft(D2 * np.exp(1j * phase), hop_length=HOP_LENGTH, length=len(y))

# 6) Playback and plot
print("▶️ Original Mix")
display(Audio(y, rate=sr))
print("▶️ Stem 1 (components 1…{})".format(K))
display(Audio(y1, rate=sr))
print("▶️ Stem 2 (components {}…{})".format(K+1, N_COMPONENTS))
display(Audio(y2, rate=sr))

# Plot spectrograms for comparison
fig, axs = plt.subplots(1,3, figsize=(15,4))
for ax, spec, title in zip(
    axs,
    [D, D1, D2],
    ['Original Magnitude','Stem 1 Mag','Stem 2 Mag']
):
    img = librosa.amplitude_to_db(spec, ref=np.max)
    librosa.display.specshow(img, sr=sr, hop_length=HOP_LENGTH,
                             x_axis='time', y_axis='log', ax=ax)
    ax.set_title(title)
plt.tight_layout()
plt.show()
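
A common refinement of step 5, which is not what the cell above does, is to turn the two NMF reconstructions into ratio (Wiener-like) masks and apply them to the complex STFT of the mix. The sketch below uses random matrices in place of real factors; `soft_mask_stems` and `S` (standing for the complex output of `librosa.stft`) are names assumed for illustration.

```python
import numpy as np

def soft_mask_stems(S, W, H, K, eps=1e-10):
    """Split complex STFT S into two stems using ratio masks built from NMF factors."""
    D1 = W[:, :K] @ H[:K, :]           # magnitude estimate for stem 1
    D2 = W[:, K:] @ H[K:, :]           # magnitude estimate for stem 2
    mask1 = D1 / (D1 + D2 + eps)       # values in [0, 1]
    return mask1 * S, (1.0 - mask1) * S

# Random stand-ins for the NMF factors and the mixture STFT
rng = np.random.default_rng(0)
n_freq, n_frames, n_components = 9, 12, 4
W = rng.random((n_freq, n_components))
H = rng.random((n_components, n_frames))
S = rng.standard_normal((n_freq, n_frames)) + 1j * rng.standard_normal((n_freq, n_frames))

S1, S2 = soft_mask_stems(S, W, H, K=n_components // 2)
print(np.allclose(S1 + S2, S))         # True: the two stems sum back to the mix
```

With masking, whatever the NMF model fails to explain is shared between the stems instead of being discarded, so the stems always add up to the original mix. Each masked STFT can then be inverted with `librosa.istft`.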


---

## 🛠 Exercise: RNN-Based Voice Activity Detection

### 🎯 Task  
Build and train a simple **RNN/LSTM** to detect **speech vs. silence** on a **frame-by-frame** basis.

---

### 🧾 Steps

#### 📥 Feature Extraction
1. Load a **labeled speech–silence dataset**  
   *(e.g., audio clips with annotated speech segments)*

2. Compute **short-time features** over **overlapping windows**:
   - MFCCs  
   - Or **log-mel spectrograms**

---

#### 🧠 Model Training
3. Define an **RNN** or **Bidirectional LSTM**:
   - Input: each frame’s feature vector
   - Output: probability of **speech**

4. Split data into **train/validation** sets (e.g., 80/20)

5. Train with **binary cross-entropy loss**, tracking **frame-level accuracy** and **loss** on both the training and validation sets
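
As a starting point for steps 3 and 5, a bidirectional LSTM with one training step might look like the sketch below. The class name `FrameVAD`, the layer sizes, and the random features and labels are assumptions for illustration; you will need to substitute your own data loading, batching, and validation loop.

```python
import torch
import torch.nn as nn

class FrameVAD(nn.Module):
    """Hypothetical BiLSTM that outputs one speech logit per frame."""
    def __init__(self, n_features=40, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x):                     # x: (batch, frames, n_features)
        h, _ = self.lstm(x)                   # (batch, frames, 2 * hidden)
        return self.head(h).squeeze(-1)       # (batch, frames) logits

torch.manual_seed(0)
feats = torch.randn(4, 100, 40)               # random stand-in for log-mel features
labels = (torch.rand(4, 100) > 0.5).float()   # random stand-in for speech/silence labels

model = FrameVAD()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()            # sigmoid + binary cross-entropy in one op

# One training step
model.train()
optimizer.zero_grad()
logits = model(feats)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()

# Frame-level accuracy at a 0.5 probability threshold
probs = torch.sigmoid(logits.detach())
accuracy = ((probs > 0.5).float() == labels).float().mean().item()
print(f"loss={loss.item():.4f}  accuracy={accuracy:.3f}")
```

Working with logits and `BCEWithLogitsLoss` is numerically safer than applying a sigmoid inside the model and then using plain binary cross-entropy.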

---

#### ⚖️ Baseline Comparison
6. Implement a simple **energy-threshold detector**:
   - Classify a frame as speech if **RMS energy > threshold**

7. Tune the threshold on the **validation set**
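
The baseline itself needs only NumPy. The sketch below assumes 25 ms frames with a 10 ms hop at 16 kHz and uses a synthetic signal (quiet noise followed by a tone) in place of real speech; the function names and the threshold value are illustrative.

```python
import numpy as np

def frame_rms(y, frame_length=400, hop_length=160):
    """RMS energy per frame (defaults: 25 ms frames, 10 ms hop at 16 kHz)."""
    n_frames = 1 + (len(y) - frame_length) // hop_length
    starts = hop_length * np.arange(n_frames)
    idx = starts[:, None] + np.arange(frame_length)[None, :]   # (n_frames, frame_length)
    return np.sqrt(np.mean(y[idx] ** 2, axis=1))

def energy_vad(y, threshold, frame_length=400, hop_length=160):
    """Boolean mask: True where the frame's RMS energy exceeds the threshold."""
    return frame_rms(y, frame_length, hop_length) > threshold

sr = 16000
rng = np.random.default_rng(0)
silence = 0.001 * rng.standard_normal(sr)                    # 1 s of near-silence
tone = 0.3 * np.sin(2 * np.pi * 200 * np.arange(sr) / sr)    # 1 s stand-in for speech
y = np.concatenate([silence, tone])

mask = energy_vad(y, threshold=0.05)
print(mask.shape, mask[:98].any(), mask[100:].all())         # (198,) False True
```

Use the same frame length and hop as your feature extraction so the baseline mask lines up frame for frame with the RNN's predictions and the labels.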

---

### 📈 Evaluation

- Compute, for both the **RNN model** and the **energy baseline**:
  - **ROC curve** and **AUC**
  - **Precision / Recall**
  - **F₁ score**

- Plot:
  - Frame-wise **speech probability** vs. **ground truth**
  - Overlay **detected speech/non-speech regions** on waveform
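
The frame-level metrics, and the threshold tuning from step 7, can be sketched as follows. The labels and scores are toy values made up for illustration. `scikit-learn` provides equivalent functions (`precision_score`, `recall_score`, `f1_score`, `roc_curve`, `roc_auc_score`), which you would normally use instead of hand-written versions.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary frame labels (1 = speech)."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)       # speech frames correctly detected
    fp = np.sum(~y_true & y_pred)      # silence frames flagged as speech
    fn = np.sum(y_true & ~y_pred)      # speech frames that were missed
    precision = float(tp / (tp + fp)) if tp + fp else 0.0
    recall = float(tp / (tp + fn)) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def best_threshold(y_true, scores, candidates):
    """Pick the candidate threshold with the highest F1 (do this on validation data)."""
    f1s = [precision_recall_f1(y_true, np.asarray(scores) > c)[2] for c in candidates]
    best = int(np.argmax(f1s))
    return candidates[best], f1s[best]

# Toy labels and scores (scores could be RNN probabilities or RMS energies)
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
scores = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]

print(precision_recall_f1(y_true, np.asarray(scores) > 0.5))   # (0.75, 0.75, 0.75)
thr, f1 = best_threshold(y_true, scores, [0.35, 0.5, 0.65])
print(thr, f1)                                                 # 0.35 gives the highest F1 here
```

Tune the threshold on the validation set only, then report the final numbers with that threshold fixed, so both methods are compared fairly.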

---

### 📦 Deliverables

- ✅ **Well-documented training script** plus saved **model checkpoint(s)**

- 📊 **Evaluation notebook or script** that outputs:
  - ROC curves
  - Confusion matrices
  - F₁ scores for both methods

- 🎧 **Audio playback examples**  
  Showing the **predicted speech mask** vs. the **baseline mask**

---

💻 **Tools**
- Python + TensorFlow or PyTorch
- `librosa` for feature extraction
- `matplotlib` / `seaborn` for plotting
- `scikit-learn` for metrics and ROC computation
