## Module 8: Cepstral & Filter-Bank Features

In this module, we’ll study how to transform audio into perceptually meaningful representations using cepstral and filter-bank techniques.

### Key Concepts
- **Cepstrum & Quefrency Domain**  
  - The “spectrum of a spectrum” obtained by taking the inverse Fourier transform of log-magnitude spectra.  
  - **Liftering** to separate source vs. filter contributions (e.g. pitch vs. envelope).
- **Mel-Filter Banks**  
  - Triangular filters spaced on the mel scale to mimic human auditory resolution.  
  - Build the mel-spectrogram and derive MFCCs by DCT of log-mel energies.
- **Δ & ΔΔ (Temporal Derivatives)**  
  - Capture short-term dynamics by computing first and second derivatives of cepstral or filter-bank features.

---

### 📓 Notebook Demos

1. **Mel-Filter Bank Visualization**  
   - Compute magnitude spectrum of a sample frame  
   - Overlay a bank of mel-filters  
   - Slider to adjust the number of filters (e.g. 20–80) and see how coverage changes

2. **MFCC Reconstruction**  
   - Compute MFCCs from an audio clip  
   - Reconstruct time-domain audio using only the first 13 coefficients vs. all coefficients (the full log-mel spectrum)  
   - Listen and compare how much fine detail is preserved

3. **Δ & ΔΔ Feature Visualization**  
   - Compute static MFCCs, Δ-MFCCs, and ΔΔ-MFCCs for a speech segment  
   - Plot their trajectories over time to observe how dynamics capture transitions

---

### 🛠 Exercise: Filter-Bank Shape Comparison
- **Task:** Evaluate how different auditory filter-bank shapes (triangular vs. gammatone) affect a simple speech-recognition pipeline.
- **Steps:**  
  1. Implement both triangular mel filters and gammatone filterbanks.  
  2. Extract filter-bank features from a small spoken-digit dataset.  
  3. Train a lightweight classifier (e.g. logistic regression) and compare recognition accuracy.  
- **Deliverables:**  
  - Plots of both filter-bank shapes overlaid on a spectrum.  
  - Recognition accuracy table for each filter type.  
  - Discussion of how filter-bank design influences feature discriminability.  


## 🔑 Key Concepts

- **Cepstrum & Quefrency Domain**  
  - The **cepstrum** is the “spectrum of a spectrum” — you take the log-magnitude of the Fourier transform, then apply an inverse Fourier transform to reveal **quefrency** (time-like) components.  
  - **Quefrency** bins correspond to periodicities in the log-spectrum, making it easy to separate vocal tract envelope from pitch harmonics.  
  - **Liftering** applies a window in the quefrency domain to isolate either the slowly-varying envelope (low-quefrency “filter” lifter) or the rapid pitch information (high-quefrency “source” lifter).  
  - This separation underpins techniques like **mel-cepstral analysis** and **pitch–envelope decomposition**.
    
- **MFCC Reconstruction**  
  - Extract **Mel-Frequency Cepstral Coefficients** (MFCCs) from an audio signal.  
  - Reconstruct the time-domain waveform by inverting the MFCCs: first using only the 13 lowest-order “envelope” coefficients, then using all coefficients, which together encode the full log-mel spectrum.  
  - **Listen & compare:**  
    - **13-coefficient reconstruction** retains the broad spectral shape (timbre) but loses fine spectral detail such as harmonic structure.  
    - **Full inversion** (as many coefficients as mel bands) recovers more detail at the cost of a larger feature vector. Phase is discarded in both cases and has to be estimated during inversion.
      
- **Δ & ΔΔ (Temporal Derivatives)**  
  - Capture short-term dynamics by computing first (Δ) and second (ΔΔ) time-derivatives of cepstral or filter-bank features, highlighting how spectral patterns evolve over successive frames.
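
The cepstrum and liftering bullets above can be sketched in plain NumPy. This is a minimal illustration on a synthetic frame, not the notebook's own method: the 8 kHz rate, the 100-sample pitch period, the 1 kHz resonance, and the lifter cutoff of 30 samples are all assumed values chosen only for the demo.

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum: inverse FFT of the log-magnitude spectrum."""
    log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-10)  # offset avoids log(0)
    return np.fft.ifft(log_mag).real

def low_quefrency_lifter(cepstrum, cutoff):
    """Zero every quefrency >= cutoff, keeping the symmetric low-quefrency part."""
    liftered = np.zeros_like(cepstrum)
    liftered[:cutoff] = cepstrum[:cutoff]
    if cutoff > 1:
        liftered[-(cutoff - 1):] = cepstrum[-(cutoff - 1):]  # mirrored half
    return liftered

# Synthetic "voiced" frame: impulse train (source) through a decaying resonance (filter).
sr, period, n = 8000, 100, 1024           # assumed values; 100 samples -> 80 Hz pitch
source = np.zeros(n)
source[::period] = 1.0
t = np.arange(200)
resonance = (0.9 ** t) * np.cos(2 * np.pi * 1000 * t / sr)   # hypothetical 1 kHz formant
frame = np.convolve(source, resonance)[:n] * np.hanning(n)

cep = real_cepstrum(frame)

# High-quefrency "source" region: look for the pitch peak between 40 and 160 samples.
search_start, search_stop = 40, 160
pitch_period = search_start + int(np.argmax(cep[search_start:search_stop]))

# Low-quefrency "filter" lifter, transformed back to a smoothed log-magnitude envelope.
envelope = np.fft.fft(low_quefrency_lifter(cep, cutoff=30)).real

print(f"estimated pitch period: {pitch_period} samples ({sr / pitch_period:.1f} Hz)")
```

Plotting `envelope` against the raw log-magnitude spectrum of `frame` shows the harmonic ripple removed while the broad resonance shape remains.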



 
# 🎚️ Demo: Mel-Filter Bank Visualization

In this demo, you’ll explore how a **bank of mel-frequency filters** maps a **linear-frequency spectrum** into **perceptual bands**.

---

## 🔍 What the Code Does

1. **Loads your audio clip** and computes its **STFT magnitude** `D` using `N_FFT` and `HOP_LENGTH`.
2. Displays an **audio player** to listen to the **original signal**.
3. **Plots the magnitude spectrum** of a single STFT frame (`FRAME_IDX`) vs. frequency.
4. **Overlays a triangular mel-filter bank** (with `N_MELS` filters) on that spectrum.

---

## ⚙️ Inputs (Edit at the Top of the Code Cell)

- **`FILENAME`**  
  Name of your audio file in `sounds/` (supports `.wav` or `.mp3`)

- **`FRAME_IDX`**  
  Integer in `[0 … n_frames - 1]` selecting the time frame to display

- **`N_FFT`**  
  FFT window size (power of 2, e.g., `1024`, `2048`)

- **`HOP_LENGTH`**  
  Hop size between STFT frames (≤ `N_FFT`, e.g., `N_FFT // 4`)

- **`N_MELS`**  
  Number of mel filters (e.g., 20 to 80)

---

## 📤 Outputs to Observe

### 🎧 Audio Player
- Listen to the **original recording**

### 📈 Spectrum Plot
- **Blue curve**: Magnitude spectrum \|STFT\| at frame `FRAME_IDX`
- **Semi-transparent lines**: Each of the `N_MELS` **mel filters** overlaid on the spectrum

---

## 🧠 How to Interpret

### 🎼 Low Frequencies
- Mel filters are **narrower** and **densely packed**  
  → Yields **higher resolution**

### 🎼 High Frequencies
- Mel filters become **wider** and **sparser**  
  → Lower resolution, matching human perception

### 🎚️ Changing `N_MELS`
- **More filters** → Narrower bands and finer resolution (especially at low frequencies); if `N_MELS` is large relative to `N_FFT`, the lowest filters may span only a bin or two
- **Fewer filters** → Coarser band coverage
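
To see why the filters widen with frequency, here is a rough NumPy sketch of a triangular mel bank built by hand. It assumes the HTK-style mel formula and unnormalized (peak = 1) triangles, so it will not match `librosa.filters.mel` exactly, which by default uses the Slaney mel scale with area normalization. The helper names and the 16 kHz sample rate are illustrative only.

```python
import numpy as np

def hz_to_mel(f):
    """HTK-style mel scale (an assumption; librosa defaults to the Slaney variant)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m) / 2595.0) - 1.0)

def triangular_mel_bank(sr, n_fft, n_mels):
    """Unnormalized triangular filters, shape (n_mels, 1 + n_fft // 2)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, 1 + n_fft // 2)
    # n_mels + 2 edge points, equally spaced on the mel axis
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, centre, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - lo) / (centre - lo)
        falling = (hi - fft_freqs) / (hi - centre)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank, edges

bank, edges = triangular_mel_bank(sr=16000, n_fft=2048, n_mels=40)
bandwidths = edges[2:] - edges[:-2]       # full base width of each triangle, in Hz
print(f"lowest filter width:  {bandwidths[0]:.1f} Hz")
print(f"highest filter width: {bandwidths[-1]:.1f} Hz")
```

Because the edge points are equally spaced in mels, each triangle's base is wider in Hz than the one before it, which is the pattern visible in the overlay plot.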

---

🔍 **Try editing `N_MELS`** in the code and **re-running the cell** to see how the **mel filter bank adapts** to match **human auditory perception**.


In [None]:
# ── USER SETTINGS ────────────────────────────────────────────────────────────────
FILENAME    = 'speech.WAV'   # ← place your audio file in `sounds/`
FRAME_IDX   = 0              # ← which STFT frame to visualize (0…n_frames-1)
N_FFT       = 2048           # ← FFT window size (power of 2)
HOP_LENGTH  = 512            # ← hop length between frames
N_MELS      = 40             # ← number of mel filters (e.g. 20…80)
# ────────────────────────────────────────────────────────────────────────────────

import numpy as np
import librosa
import matplotlib.pyplot as plt
from IPython.display import Audio, display
from pathlib import Path

# ── CONFIG (don’t edit below here) ───────────────────────────────────────────────
SOUNDS_DIR = Path('sounds')
audio_path = SOUNDS_DIR / FILENAME

# 1) Load audio and compute STFT magnitudes
y, sr = librosa.load(str(audio_path), sr=None)
D     = np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP_LENGTH))
freqs = librosa.fft_frequencies(sr=sr, n_fft=N_FFT)
n_frames = D.shape[1]

# 2) Listen to the original audio
print("▶️ Original Audio")
display(Audio(data=y, rate=sr, autoplay=False))

# 3) Compute mel filter bank
mel_fb = librosa.filters.mel(
    sr=sr,
    n_fft=N_FFT,
    n_mels=N_MELS
)

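# 4) Pick and plot the chosen frame
#    librosa's mel filters are area-normalized by default, so their weights are
#    far smaller than typical STFT magnitudes; a second y-axis keeps both visible.
if not 0 <= FRAME_IDX < n_frames:
    raise IndexError(f'FRAME_IDX must be in [0, {n_frames - 1}]')
mag_frame = D[:, FRAME_IDX]
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(freqs, mag_frame, label=f'|STFT|, frame {FRAME_IDX}', color='C0')
ax2 = ax.twinx()
for i in range(N_MELS):
    ax2.plot(freqs, mel_fb[i], alpha=0.5, linewidth=1)
ax.set_title(f'Mel Filter Bank Overlay (n_mels = {N_MELS})')
ax.set_xlabel('Frequency (Hz)')
ax.set_ylabel('Amplitude')
ax2.set_ylabel('Filter weight')
ax.legend(loc='upper right')
ax.grid(True)
fig.tight_layout()
plt.show()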


# 🎛️ Demo: MFCC Reconstruction Comparison

In this demo, you’ll explore how **retaining more (or fewer) cepstral coefficients** affects the **quality of a reconstructed audio signal**.

---

## 🔍 What the Code Does

1. **Loads your audio clip**.
2. Computes **MFCCs** using a **mel filter bank** of `N_MELS` bands and `N_FFT` FFT size:
   - **Short MFCC**: Keep only the first `N_MFCC` coefficients per frame  
     → Coarse spectral envelope
   - **Full MFCC**: Keep all `N_MELS` coefficients  
     → Retains the full log-mel spectrum (with as many coefficients as bands, the DCT is invertible)
3. **Inverts** each MFCC set back to the time domain via `librosa.feature.inverse.mfcc_to_audio`  
   *(uses the **Griffin–Lim algorithm** under the hood)*
4. **Plays back three versions**:
   - **Original**
   - **Reconstructed from first `N_MFCC` coefficients**
   - **Reconstructed from all `N_MELS` coefficients**

---

## ⚙️ Inputs (Edit at Top of the Code Cell)

- **`FILENAME`**  
  Your input file in the `sounds/` folder (`.wav` or `.mp3`)

- **`N_MELS`** (e.g., 20–80)  
  Number of **mel bands** for the filter bank

- **`N_MFCC`** (≤ `N_MELS`, e.g., 8–20)  
  Number of **cepstral coefficients** to keep in the “short” version

- **`N_FFT`** (power of 2, e.g., 512, 1024, 2048)  
  **FFT window length**

- **`HOP_LENGTH`** (≤ `N_FFT`, typically `N_FFT // 4`)  
  **Frame shift** between successive STFT frames

---

## 📤 Outputs to Observe

### 🎧 Audio Players

- **Original**: Your untouched audio clip  
- **Short MFCC**: Captures **overall timbre** but may lose **transient fine-structure**  
- **Full MFCC**: Closer to the original, preserving more **spectral detail**

---

## 🧠 What to Listen For

### 🎼 Envelope vs. Detail
- Does the **“short” MFCC version** sound **muffled** or **smeared**?

### 🌀 Artifacts
- The **Griffin–Lim reconstruction** may introduce **phasiness** — how noticeable is it?

### 📉 Coefficient Count Impact
- How many **MFCCs** are needed to **approach the original quality**?

---

## 🧪 Experiment

Try varying `N_MFCC` (e.g., **5 → 13 → 40**) and observe how **reconstruction quality improves** as you include more **cepstral detail**.
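
The effect of dropping coefficients can also be seen numerically, without any audio. The sketch below assumes an orthonormal DCT-II (the same type and normalization that librosa's MFCC uses by default) and applies it to a made-up frame of log-mel energies; the synthetic `log_mel` vector is a placeholder, not real data.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size (n, n); its transpose is its inverse."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    mat[0] /= np.sqrt(2.0)
    return mat

n_mels, n_mfcc = 40, 13
rng = np.random.default_rng(0)

# Placeholder for one frame of log-mel energies: smooth envelope plus fine ripple.
bands = np.arange(n_mels)
log_mel = -0.05 * bands + np.sin(bands / 6.0) + 0.2 * rng.standard_normal(n_mels)

C = dct_matrix(n_mels)
mfcc_all = C @ log_mel                    # "full" MFCCs: one coefficient per mel band

# Keeping every coefficient inverts exactly; truncating keeps only the smooth part.
rec_full = C.T @ mfcc_all
truncated = mfcc_all.copy()
truncated[n_mfcc:] = 0.0
rec_short = C.T @ truncated

err_full = np.max(np.abs(rec_full - log_mel))
err_short = np.sqrt(np.mean((rec_short - log_mel) ** 2))
print(f"max error, all {n_mels} coefficients: {err_full:.2e}")
print(f"RMS error, first {n_mfcc} coefficients: {err_short:.3f}")
```

Changing `n_mfcc` here mirrors the experiment above: the truncated reconstruction approaches the original log-mel frame as more coefficients are kept.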


In [None]:
# ── USER SETTINGS ────────────────────────────────────────────────────────────────
FILENAME    = 'drum_hit3.wav'  # ← place your file in `sounds/` (WAV or MP3)
N_MELS      = 40                     # ← number of mel bands used in MFCC
N_MFCC      = 13                     # ← number of MFCC coefficients to keep
N_FFT       = 2048                   # ← FFT window size (power of 2)
HOP_LENGTH  = N_FFT // 4             # ← hop length between frames
# ────────────────────────────────────────────────────────────────────────────────

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
from IPython.display import Audio, display
from pathlib import Path

# ── CONFIG (don’t edit below here) ───────────────────────────────────────────────
SOUNDS_DIR = Path('sounds')
audio_path = SOUNDS_DIR / FILENAME

# 1) Load audio
y, sr = librosa.load(str(audio_path), sr=None)

# 2) Compute MFCCs
#  - 'short' MFCCs: only first N_MFCC coefficients
mfcc_short = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=N_MFCC,
    n_mels=N_MELS,
    n_fft=N_FFT,
    hop_length=HOP_LENGTH
)
#  - 'full' MFCCs: same number of bands as mel filters
mfcc_full = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=N_MELS,
    n_mels=N_MELS,
    n_fft=N_FFT,
    hop_length=HOP_LENGTH
)

# 3) Reconstruct audio from MFCCs via inverse transform + Griffin-Lim
y_rec_short = librosa.feature.inverse.mfcc_to_audio(
    mfcc_short,
    sr=sr,
    n_mels=N_MELS,
    n_fft=N_FFT,
    hop_length=HOP_LENGTH
)
y_rec_full = librosa.feature.inverse.mfcc_to_audio(
    mfcc_full,
    sr=sr,
    n_mels=N_MELS,
    n_fft=N_FFT,
    hop_length=HOP_LENGTH
)

# 4) Playback: Original vs. Short MFCC vs. Full MFCC
print("▶️ Original Audio")
display(Audio(data=y,          rate=sr, autoplay=False))
print(f"▶️ Reconstructed from first {N_MFCC} MFCCs")
display(Audio(data=y_rec_short, rate=sr, autoplay=False))
print(f"▶️ Reconstructed from all {N_MELS} MFCCs")
display(Audio(data=y_rec_full,  rate=sr, autoplay=False))


# 🎛️ Demo: Δ & ΔΔ (Temporal Derivatives) Feature Visualization

In this demo, you’ll compute the **static MFCCs**, their **first (Δ)** and **second (ΔΔ)** derivatives, and **plot the trajectories** of the first few coefficients over time. This helps visualize how **dynamics in the cepstral domain** capture **onsets and transitions** in audio.

---

## ⚙️ User-Configurable Inputs  
*(Edit at the top of the code cell)*

- **`FILENAME`**  
  Name of your audio file in the `sounds/` folder (`.wav` or `.mp3`)

- **`SR`**  
  Sampling rate to use (`None` to use the file’s native rate)

- **`N_MFCC`**  
  Number of MFCC coefficients to compute (**≥ 3** recommended)

- **`N_FFT`**  
  FFT window size (power of two, e.g., `1024`, `2048`)

- **`HOP_LENGTH`**  
  Hop size between frames (≤ `N_FFT`, often `N_FFT // 4`)

---

## 🧪 What the Code Does

1. **Loads your clip** and plays it  
   *(uses `soundfile` for `.wav`, `librosa` otherwise)*

2. **Computes**:
   - **Static MFCCs** (`n_mfcc = N_MFCC`)
   - **Δ-MFCCs** (first derivative)
   - **ΔΔ-MFCCs** (second derivative)

3. **Plots** the time-series of the **first 3 coefficients**:
   - **Solid lines** = Static MFCC
   - **Dashed lines** = Δ-MFCC
   - **Dotted lines** = ΔΔ-MFCC

---

## 📤 Outputs to Observe

### 🎧 Audio Player
- Verify the clip **loaded and plays correctly**

### 📈 Trajectory Plots
- **Static MFCCs** show smooth **spectral envelope changes**
- **Δ (slope)** peaks at **onsets** or **rapid spectral shifts**
- **ΔΔ (acceleration)** highlights **changes in slope** — very sensitive to **transient bursts**

---

## 🧠 How to Interpret

- **Peaks in the Δ curve** correspond to **rhythmic or attack events**
- **Sharp spikes in ΔΔ** indicate **rapid changes in dynamics** or **timbre**
- Together, **Δ and ΔΔ features** capture **temporal patterns** that **static MFCCs alone cannot**, making them **crucial for tasks** like:
  - **Speech/music recognition**
  - **Onset detection**


In [None]:
# ── USER SETTINGS ────────────────────────────────────────────────────────────────
FILENAME    = 'speech.WAV'   # ← place your audio clip in `sounds/`
SR          = None           # ← sampling rate (None to use file’s native rate)
N_MFCC      = 13             # ← number of MFCC coefficients
N_FFT       = 2048           # ← FFT window size (power of two)
HOP_LENGTH  = N_FFT // 4     # ← hop length between frames
# ────────────────────────────────────────────────────────────────────────────────

import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display
from IPython.display import Audio, display
from pathlib import Path
import soundfile as sf

# ── CONFIG (don’t edit below here) ───────────────────────────────────────────────
SOUNDS_DIR = Path('sounds')
path = SOUNDS_DIR / FILENAME

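def load_audio(path, sr=None):
    """Load mono audio: WAV files via soundfile, everything else via librosa."""
    ext = path.suffix.lower()
    if ext == '.wav':
        y, sr_native = sf.read(str(path), dtype='float32')
        if y.ndim > 1:                  # soundfile returns (samples, channels)
            y = y.mean(axis=1)          # downmix to mono
        if sr is None or sr == sr_native:
            return y, sr_native
        # orig_sr / target_sr must be passed by keyword in recent librosa releases
        return librosa.resample(y, orig_sr=sr_native, target_sr=sr), sr
    return librosa.load(str(path), sr=sr)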

# 1) Load audio
y, sr = load_audio(path, sr=SR)

# 2) Play the clip
print("▶️ Original Audio")
display(Audio(data=y, rate=sr, autoplay=False))

# 3) Compute static MFCCs
mfcc     = librosa.feature.mfcc(
    y=y, sr=sr,
    n_mfcc=N_MFCC,
    n_fft=N_FFT,
    hop_length=HOP_LENGTH
)

# 4) Compute Δ and ΔΔ
mfcc_delta  = librosa.feature.delta(mfcc, order=1)
mfcc_delta2 = librosa.feature.delta(mfcc, order=2)

# 5) Prepare time axis
times = librosa.frames_to_time(
    np.arange(mfcc.shape[1]),
    sr=sr,
    hop_length=HOP_LENGTH
)

# 6) Plot trajectories for the first 3 coefficients
coeffs_to_plot = min(3, N_MFCC)
plt.figure(figsize=(12, 8))

for i in range(coeffs_to_plot):
    plt.subplot(coeffs_to_plot, 1, i+1)
    plt.plot(times, mfcc[i],       label=f'MFCC {i+1}',    linewidth=1.5)
    plt.plot(times, mfcc_delta[i],  '--', label='Δ-MFCC',    linewidth=1.2)
    plt.plot(times, mfcc_delta2[i], ':',  label='ΔΔ-MFCC',   linewidth=1.2)
    plt.title(f'Coefficient {i+1} Trends')
    plt.xlabel('Time (s)')
    plt.ylabel('Value')
    plt.legend(loc='upper right')
    plt.grid(True)

plt.tight_layout()
plt.show()


---

## 🛠 Exercise: Filter-Bank Shape Comparison

### 🎯 Task
Evaluate how different **auditory filter-bank shapes** — **triangular** (Mel) vs. **gammatone** — affect a **simple speech-recognition pipeline**.

---

### 🧾 Steps

1. **Implement both filter types**:
   - Triangular **mel filters** (e.g., via `librosa.filters.mel`)
   - **Gammatone filterbank** (e.g., using `auditory-toolbox`, `gammatone` library, or custom code)

2. **Extract filter-bank features** from a **small spoken-digit dataset**  
   *(e.g., Free Spoken Digit Dataset — FSDD)*

3. **Train a lightweight classifier**  
   (e.g., **logistic regression**, **SVM**, or **k-NN**)  
   to compare **recognition accuracy** across filter types
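
As a possible starting point for the gammatone half of step 1, the sketch below builds a frequency-domain gammatone bank from the standard impulse-response formula with ERB bandwidths (Glasberg & Moore). It is one way to do it, not a reference implementation: the function names, the 8 kHz sample rate, the 4th-order filter, and the four centre frequencies are all assumptions to replace with your own choices.

```python
import numpy as np

def erb(fc):
    """Equivalent rectangular bandwidth in Hz (Glasberg & Moore approximation)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, sr, duration=0.05, order=4):
    """Gammatone impulse response: t^(order-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t)."""
    t = np.arange(int(duration * sr)) / sr
    b = 1.019 * erb(fc)
    ir = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return ir / np.max(np.abs(ir))

def gammatone_response(fc, sr, n_fft=2048):
    """Peak-normalized magnitude response on the grid of an n_fft-point rFFT."""
    mag = np.abs(np.fft.rfft(gammatone_ir(fc, sr), n=n_fft))
    return mag / mag.max()

sr, n_fft = 8000, 2048                         # assumed analysis settings
freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
centres = [250.0, 500.0, 1000.0, 2000.0]       # hypothetical centre frequencies
bank = np.stack([gammatone_response(fc, sr, n_fft) for fc in centres])

# Each row can be plotted against `freqs` or applied to a power spectrum frame:
#   features = bank @ (magnitude_frame ** 2)
peaks = freqs[np.argmax(bank, axis=1)]
print(peaks)
```

Since `bank` has the same shape convention as a mel filter matrix, it can be swapped into the same feature-extraction code for a like-for-like comparison.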

---

### 📦 Deliverables

- 📊 **Plots** of both **filter-bank shapes** overlaid on a representative **magnitude spectrum**
- 🧮 **Recognition accuracy table** comparing **mel vs. gammatone** features
- ✍️ **Brief discussion**:  
  - How does **filter-bank design** influence **feature discriminability**?
  - Which shape better captures important cues for digit classification?

---

💡 *Tip:* Try reducing dimensionality (e.g., via PCA) to visualize feature separability between digits for each filter type!
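
A minimal PCA projection for that kind of plot can be written with NumPy's SVD, as sketched below. The two Gaussian clusters are placeholder data standing in for per-utterance filter-bank features of two digits, and `pca_project` is a hypothetical helper name.

```python
import numpy as np

def pca_project(features, n_components=2):
    """Project rows of `features` (n_samples, n_dims) onto the top principal components."""
    centred = features - features.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return centred @ vt[:n_components].T, explained[:n_components]

rng = np.random.default_rng(0)
# Placeholder data: two synthetic "digit" classes in a 40-dimensional feature space.
class_a = rng.normal(0.0, 1.0, size=(50, 40))
class_b = rng.normal(0.0, 1.0, size=(50, 40)) + 3.0
features = np.vstack([class_a, class_b])
labels = np.array([0] * 50 + [1] * 50)

projected, explained = pca_project(features)
print(projected.shape, explained)
```

Scatter-plotting `projected[:, 0]` against `projected[:, 1]`, coloured by `labels`, once per filter type gives a quick visual check of how well the digits separate.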
