# 🎓 Module 7: Hand-Crafted Feature Extraction

In this module, we’ll **revisit and extend time- and frequency-domain features**, then explore **temporal and statistical descriptors** that underpin many modern audio-analysis pipelines.

---

## 🔑 Key Concepts

### 🕒 Basic Time-Domain & Spectral Features
- **Zero-Crossing Rate**
- **RMS Energy**
- **Spectral Centroid**
- **Spectral Bandwidth**
- **Spectral Roll-off**

### 🕺 Temporal Features
- **Tempo Estimation**
- **Onset Detection**

### 📈 Statistical Descriptors
- **Higher-order moments of spectral frames**:
  - **Skewness**
  - **Kurtosis**

---

## 📓 Notebook Demos

### 🎧 Interactive Onset Detector
- Compute an **onset strength envelope** (e.g., `librosa.onset.onset_strength`)
- Experiment with:
  - **Detection threshold**
  - **Hop length**
- **Listen to "clicks"** at detected onset times

---

### 🎤 MFCC vs. Spectral Contrast
- Extract **MFCCs** and **Spectral Contrast** from example clips
- Compare feature **trajectories** for **speech vs. music**
- Visualize:
  - **Mean**
  - **Variance** of each feature set

---

## 🛠 Exercise: Feature-Based Classification

### 🎯 Task:
Extract a bank of **20 hand-crafted features** including:
- **Time-domain**
- **Spectral**
- **Temporal**
- **Statistical** descriptors  
from a **collection of audio clips**.

### 📊 Analysis:
Implement a **simple k-Nearest Neighbors (k-NN) classifier** to distinguish **speech vs. music** using the extracted features.


### 📦 Deliverables

- ✅ Code to **compute and normalize** each feature
- 📁 **Feature matrices** + **labels**
- 📈 **Classification accuracy report** and **confusion matrix**


## 🔑 Key Concepts

### 🕒 Basic Time-Domain & Spectral Features

- **Zero-Crossing Rate (ZCR)**  
  The rate at which the signal changes sign (crosses zero) per time frame.  
  - **Why it matters:**  
    - High ZCR ⇒ noise-like or percussive sounds  
    - Low ZCR  ⇒ tonal, voiced speech, or sustained notes  

- **RMS Energy**  
  The root-mean-square of the signal amplitude over a short window.  
  - **Why it matters:**  
    - Serves as a rough proxy for loudness (perceived loudness also depends on frequency content)  
    - Used for voice-activity detection, dynamic range analysis  

- **Spectral Centroid**  
  The “center of mass” of the magnitude spectrum:  
  \[
    \text{centroid} = \frac{\sum_k f_k\,|X[k]|}{\sum_k |X[k]|}
  \]  
  - **Why it matters:**  
    - Often correlates with perceived “brightness” of a sound  
    - Higher centroid ⇒ brighter/tinny timbres  

- **Spectral Bandwidth**  
  The spread of the spectrum around its centroid (e.g. the magnitude-weighted standard deviation):  
  \[
    \text{bandwidth} = \sqrt{\frac{\sum_k (f_k - \text{centroid})^2\,|X[k]|}{\sum_k |X[k]|}}
  \]  
  - **Why it matters:**  
    - Indicates how “wide” or “narrow” the spectrum is  
    - Wider bandwidth ⇒ richer harmonic content or noise  

- **Spectral Roll-off**  
  The frequency below which a fixed percentage (e.g. 85%) of the spectral energy is contained.  
  - **Why it matters:**  
    - Another measure of brightness / high-frequency content  
    - Useful for timbre classification and onset detection  
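
The five definitions above can be sketched for a single frame with plain NumPy. This is a minimal illustration, not how any particular library computes them: the function name `frame_features`, the Hann window, and the choice to measure roll-off on squared magnitudes are assumptions made here.

```python
import numpy as np

def frame_features(frame, sr, rolloff_pct=0.85):
    """Return ZCR, RMS, centroid, bandwidth and roll-off for one frame."""
    frame = np.asarray(frame, dtype=float)

    # Zero-crossing rate: fraction of adjacent sample pairs whose signs differ.
    signs = np.signbit(frame)
    zcr = np.mean(signs[1:] != signs[:-1])

    # RMS energy of the frame.
    rms = np.sqrt(np.mean(frame ** 2))

    # Magnitude spectrum |X[k]| and the bin frequencies f_k in Hz.
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    total = mag.sum() + 1e-12          # guard against silent frames

    centroid = np.sum(freqs * mag) / total
    bandwidth = np.sqrt(np.sum((freqs - centroid) ** 2 * mag) / total)

    # Roll-off: lowest frequency below which `rolloff_pct` of the energy lies.
    cum_energy = np.cumsum(mag ** 2)
    idx = np.searchsorted(cum_energy, rolloff_pct * cum_energy[-1])
    rolloff = freqs[min(idx, len(freqs) - 1)]

    return {"zcr": zcr, "rms": rms, "centroid": centroid,
            "bandwidth": bandwidth, "rolloff": rolloff}

sr = 8000
t = np.arange(1024) / sr
tone = np.sin(2 * np.pi * 1000 * t)                   # 1 kHz sine
noise = np.random.default_rng(0).normal(size=1024)    # white noise
feat_tone = frame_features(tone, sr)
feat_noise = frame_features(noise, sr)
```

Comparing `feat_tone` with `feat_noise` should show the pattern described in the bullets: the noise frame has the higher ZCR and the wider bandwidth, while the tone's centroid and roll-off sit close to 1 kHz.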


### ⏱️ Temporal Features

- **Tempo Estimation**  
  Determining the underlying “beat” or pulse of a musical excerpt, typically measured in beats per minute (BPM).  
  - **Why it matters:**  
    - Crucial for synchronization (e.g. time‐stretching, beat‐matching)  
    - Used in music information retrieval and DJ software  

- **Onset Detection**  
  Identifying the precise time instants when new events (notes, drum hits, speech plosives) begin.  
  - **Why it matters:**  
    - Enables beat tracking, segmentation, and alignment  
    - Forms the backbone of rhythm analysis and transcription  
  - **Common method:**  
    - Compute an onset strength envelope (e.g. spectral flux, energy changes)  
    - Detect peaks above a threshold to mark onset times  
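
The "common method" above can be sketched in NumPy as a spectral-flux envelope followed by a simple threshold-and-local-maximum test. Everything here is illustrative: `spectral_flux` and `pick_onsets` are hypothetical helper names, the test signal is synthetic, and the tempo figure is only a crude guess from the median inter-onset interval (real tempo estimators typically rely on autocorrelation or similar periodicity analysis).

```python
import numpy as np

def spectral_flux(y, n_fft=1024, hop=256):
    """Onset-strength envelope: summed positive magnitude change per frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))
    rise = np.maximum(np.diff(mag, axis=0), 0.0)    # keep increases only
    return np.concatenate([[0.0], rise.sum(axis=1)])

def pick_onsets(env, threshold):
    """Indices of local maxima of `env` that exceed `threshold`."""
    mid = env[1:-1]
    is_peak = (mid > env[:-2]) & (mid >= env[2:]) & (mid > threshold)
    return np.flatnonzero(is_peak) + 1

# Synthetic clip: silence, then three sustained tones starting 0.5 s apart.
sr, n_fft, hop = 8000, 1024, 256
t = np.arange(2 * sr) / sr
y = np.zeros_like(t)
true_onsets = (0.5, 1.0, 1.5)
for start, freq in zip(true_onsets, (440.0, 660.0, 880.0)):
    seg = (t >= start) & (t < start + 0.5)
    y[seg] = np.sin(2 * np.pi * freq * (t[seg] - start))

env = spectral_flux(y, n_fft=n_fft, hop=hop)
env = env / env.max()                                # scale to [0, 1]
onset_frames = pick_onsets(env, threshold=0.3)
onset_times = (onset_frames * hop + n_fft / 2) / sr  # frame centres (s)

# Crude tempo guess: one "beat" per median inter-onset interval.
tempo_bpm = 60.0 / np.median(np.diff(onset_times))
```

Because each frame spans 128 ms, the reported onset times are only accurate to within roughly one frame; a shorter hop trades computation for finer timing.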


### 📈 Statistical Descriptors

- **Skewness**  
  A measure of the asymmetry of the spectral distribution around its mean.  
  - **Interpretation:**  
    - Positive skew ⇒ energy concentrated at lower frequencies, with a long tail toward higher frequencies  
    - Negative skew ⇒ energy concentrated at higher frequencies, with a long tail toward lower frequencies  
  - **Why it matters:**  
    - Can help distinguish instruments or sounds with non-symmetric spectral shapes  

- **Kurtosis**  
  A measure of the “peakedness” or tail weight of the spectral distribution.  
  - **Interpretation:**  
    - High kurtosis ⇒ sharp spectral peaks and heavy tails (tonal, resonant sounds)  
    - Low kurtosis ⇒ flat, broad spectrum (noisy or diffuse sounds)  
  - **Why it matters:**  
    - Useful for detecting transient vs. sustained content, or for timbre classification  
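
One way to compute these moments is to treat a magnitude spectrum as a probability distribution over frequency. The sketch below assumes that convention (it is not the same as taking `skew`/`kurtosis` of the raw magnitude values), and the three toy spectra are made up for illustration.

```python
import numpy as np

def spectral_moments(mag, freqs):
    """Centroid, spread, skewness and kurtosis of one magnitude spectrum."""
    p = mag / (mag.sum() + 1e-12)          # normalise to a distribution
    mean = np.sum(freqs * p)               # spectral centroid
    std = np.sqrt(np.sum((freqs - mean) ** 2 * p))
    z = (freqs - mean) / (std + 1e-12)     # standardised frequency axis
    skew = np.sum(z ** 3 * p)              # third standardised moment
    kurt = np.sum(z ** 4 * p)              # fourth; about 3 for a Gaussian shape
    return mean, std, skew, kurt

freqs = np.linspace(0, 4000, 513)
symmetric = np.exp(-0.5 * ((freqs - 2000) / 300) ** 2)   # bell centred at 2 kHz
low_heavy = np.exp(-freqs / 500)                          # decays toward high f
flat = np.ones_like(freqs)                                # noise-like, flat

_, _, skew_sym, kurt_sym = spectral_moments(symmetric, freqs)
_, _, skew_low, _ = spectral_moments(low_heavy, freqs)
_, _, _, kurt_flat = spectral_moments(flat, freqs)
```

The symmetric bell gives a skewness near zero, the low-heavy spectrum gives a positive skewness (its tail points toward high frequencies), and the flat spectrum has a lower kurtosis than the peaked one.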


# 🎯 Demo: Interactive Onset Detection

In this demo, you’ll **compute and visualize an onset–strength envelope** to automatically detect **note or beat onsets** in an audio clip, and generate a **“click track”** at each detected onset.

---

## 🔍 What the Code Does

1. **Loads** your chosen audio file.
2. Computes the **onset strength envelope** using **sliding-window spectral flux**  
   *(via `librosa.onset.onset_strength`)*.
3. **Peak-picks local maxima** in the envelope based on your parameters.
4. **Generates a click track** at the detected onset times  
   *(using `librosa.clicks`)*.
5. Plays back both the **original audio** and the **click track**.
6. **Plots**:
   - The **onset envelope** over time
   - The **detection threshold**
   - **Vertical lines** at each detected onset

---

## ⚙️ USER SETTINGS (at the top of the cell)

- **`FILENAME`**  
  Your audio file (`.wav` / `.mp3`) placed in the `sounds/` folder.

- **`HOP_LENGTH`** *(samples)*  
  Frame step for computing the onset envelope  
  *(typical values: 256–1024)*

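- **`PRE_MAX`, `POST_MAX`** *(frames)*  
  Number of frames before (`PRE_MAX`) and after (`POST_MAX`) the current frame that are searched when identifying local peaks  
  *(values ≥ 1)*

- **`PRE_AVG`, `POST_AVG`** *(frames)*  
  Number of frames before (`PRE_AVG`) and after (`POST_AVG`) the current frame that form the local average baseline  
  *(values ≥ 1)*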

- **`DELTA`** *(onset-strength units)*  
  Minimum height above the local average to declare an onset  
  *(e.g. 0.1–1.0)*

- **`WAIT`** *(frames)*  
  Minimum frames between successive onsets  
  *(≥ 0)*
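
To make the roles of these settings concrete, here is a small NumPy sketch of a peak-picking rule with the same six parameters. It is a simplified stand-in written for this module, not librosa's implementation: `pick_peaks` is a hypothetical name, the windows here include both end frames, and librosa's exact window boundaries may differ.

```python
import numpy as np

def pick_peaks(env, pre_max, post_max, pre_avg, post_avg, delta, wait):
    """Return frame indices that pass all three peak-picking tests."""
    peaks = []
    last = -np.inf                      # index of the previously accepted peak
    for n in range(len(env)):
        lo_max, hi_max = max(0, n - pre_max), min(len(env), n + post_max + 1)
        lo_avg, hi_avg = max(0, n - pre_avg), min(len(env), n + post_avg + 1)
        is_local_max = env[n] == env[lo_max:hi_max].max()
        above_baseline = env[n] >= env[lo_avg:hi_avg].mean() + delta
        far_enough = n - last > wait
        if is_local_max and above_baseline and far_enough:
            peaks.append(n)
            last = n
    return np.array(peaks, dtype=int)

# Toy envelope with a strong peak at frame 2 and a weaker one at frame 5.
env = np.array([0.0, 0.1, 0.9, 0.2, 0.1, 0.8, 0.75, 0.1, 0.0, 0.0])

loose = pick_peaks(env, 1, 1, 2, 2, delta=0.2, wait=0)
strict = pick_peaks(env, 1, 1, 2, 2, delta=0.45, wait=0)
spaced = pick_peaks(env, 1, 1, 2, 2, delta=0.2, wait=3)
```

With the loose setting both peaks are accepted. Raising `delta` drops the weaker peak because it no longer clears its local average by enough, and raising `wait` drops it because it follows the first peak too closely.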

---

## 👀 What to Observe

### 🎧 Audio Players

- ▶️ **Original Audio** — listen to the raw signal
- ▶️ **Onset Clicks** — hear the discrete clicks marking each detected onset

---

### 📉 Threshold Effect

- **Raising `DELTA`** makes detection **more conservative** (fewer clicks)
- **Lowering `DELTA`** may detect **spurious** or noise-induced onsets

---

### 📊 Envelope & Detections Plot

- **Blue curve**: Onset strength over time
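- **Dashed orange line**: The `DELTA` level, drawn for reference. Peak picking compares each frame with its local average plus `DELTA`, so this line is a visual guide rather than the exact decision boundary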
- **Green vertical lines**: Detected **onset times**  
  → Check if they align with **perceptual beats** or **attacks** in the audio


In [None]:
# ── USER SETTINGS ────────────────────────────────────────────────────────────────
FILENAME    = 'grusel-melodie-305360.mp3'  # ← place your file in `sounds/`
HOP_LENGTH  = 512                         # ← hop length for onset envelope (samples)
PRE_MAX     = 1                           # ← frames before the current one in the local-peak window
POST_MAX    = 1                           # ← frames after the current one in the local-peak window
PRE_AVG     = 1                           # ← frames before the current one in the local-average window
POST_AVG    = 1                           # ← frames after the current one in the local-average window
DELTA       = 0.3                         # ← threshold for peak picking (onset strength units)
WAIT        = 0                           # ← minimal frames between onsets
# ────────────────────────────────────────────────────────────────────────────────

import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display
from IPython.display import Audio, display
from pathlib import Path

# ── CONFIG (don’t edit below here) ───────────────────────────────────────────────
SOUNDS_DIR = Path('sounds')
audio_path = SOUNDS_DIR / FILENAME

# 1) Load audio
y, sr = librosa.load(str(audio_path), sr=None)

# 2) Compute onset strength envelope
oenv = librosa.onset.onset_strength(y=y, sr=sr, hop_length=HOP_LENGTH)

# 3) Detect onset frames with peak-picking
onset_frames = librosa.onset.onset_detect(
    onset_envelope=oenv,
    sr=sr,
    hop_length=HOP_LENGTH,
    pre_max=PRE_MAX,
    post_max=POST_MAX,
    pre_avg=PRE_AVG,
    post_avg=POST_AVG,
    delta=DELTA,
    wait=WAIT,
    units='frames'
)
onset_times = librosa.frames_to_time(onset_frames, sr=sr, hop_length=HOP_LENGTH)

# 4) Generate click track at each onset time
click_track = librosa.clicks(times=onset_times, sr=sr, length=len(y))

# 5) Playback
print("▶️ Original Audio")
display(Audio(data=y,           rate=sr, autoplay=False))
print("▶️ Onset Clicks")
display(Audio(data=click_track, rate=sr, autoplay=False))

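# 6) Plot onset envelope and detections
# onset_detect rescales the envelope to [0, 1] before peak picking (normalize=True
# by default), so plot the same rescaled curve to keep DELTA on a comparable scale.
oenv_norm = (oenv - oenv.min()) / (oenv.max() - oenv.min() + 1e-12)
times = librosa.frames_to_time(np.arange(len(oenv)), sr=sr, hop_length=HOP_LENGTH)
plt.figure(figsize=(10, 4))
plt.plot(times, oenv_norm, label='Onset Strength (normalized)', color='C0')
plt.hlines(DELTA, times[0], times[-1], colors='C1', linestyles='--',
           label=f'DELTA = {DELTA} (offset above local average)')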
for t in onset_times:
    plt.axvline(t, color='C2', alpha=0.8)
plt.title('Onset Strength Envelope & Detected Onsets')
plt.xlabel('Time (s)')
plt.ylabel('Strength')
plt.legend(loc='upper right')
plt.grid(True)
plt.tight_layout()
plt.show()


# 🎛️ Demo: MFCC vs. Spectral Contrast Comparison

In this demo, you’ll **extract and compare two common feature sets**—**MFCCs** and **Spectral Contrast**—for a **speech clip** versus a **music clip**, and **visualize their summary statistics**.

---

## 🔍 What the Code Does

1. **Loads two audio files**:
   - One **speech** clip
   - One **music** clip  
   *(Handles unusual WAV codecs via `SoundFile`)*

2. Displays **audio players** so you can **listen** to each clip.

3. Computes:
   - **MFCCs (Mel-frequency cepstral coefficients)**
   - **Spectral Contrast** (difference between peaks and valleys in frequency sub-bands)

4. Calculates:
   - **Mean** and **variance** of each coefficient/band over time

5. Plots:
   - Two **side-by-side bar charts** for **MFCC mean & variance**
   - Two **side-by-side bar charts** for **Spectral Contrast mean & variance**
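
The "peaks versus valleys" idea behind Spectral Contrast can be illustrated with a simplified NumPy version. This is only a sketch under assumptions chosen here (the helper names, the band edges, and the 20% quantile `alpha` are all hypothetical); `librosa.feature.spectral_contrast` uses its own band layout and defaults.

```python
import numpy as np

def band_contrast(mag, freqs, edges, alpha=0.2):
    """Peak-minus-valley level (dB) in each band of one magnitude spectrum."""
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = np.sort(mag[(freqs >= lo) & (freqs < hi)])
        k = max(1, int(round(alpha * len(band))))
        valley = band[:k].mean()        # quietest bins in the band
        peak = band[-k:].mean()         # loudest bins in the band
        out.append(20 * np.log10((peak + 1e-10) / (valley + 1e-10)))
    return np.array(out)

def magnitude_spectrum(x):
    """Hann-windowed magnitude spectrum of one frame."""
    return np.abs(np.fft.rfft(x * np.hanning(len(x))))

sr, n = 8000, 2048
t = np.arange(n) / sr
harmonic = sum(np.sin(2 * np.pi * f * t) for f in (220, 440, 880, 1760, 3520))
noise = np.random.default_rng(0).normal(size=n)

freqs = np.fft.rfftfreq(n, d=1 / sr)
edges = [200, 400, 800, 1600, 3200, 4000]   # hypothetical band edges (Hz)
contrast_harmonic = band_contrast(magnitude_spectrum(harmonic), freqs, edges)
contrast_noise = band_contrast(magnitude_spectrum(noise), freqs, edges)
```

Each band of the harmonic signal holds one strong partial surrounded by near-silence, so its contrast is much higher than that of the noise frame, whose bins all sit at similar levels.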

---

## ⚙️ USER SETTINGS

- **`FILENAME_SPEECH`**: Name of your **speech clip** in `sounds/` (`.wav` or `.mp3`)
- **`FILENAME_MUSIC`** : Name of your **music clip** in `sounds/` (`.wav` or `.mp3`)
- **`SR`**: Target **sampling rate** (`None` to keep native rate)
- **`N_MFCC`**: Number of **MFCC coefficients** (e.g., `13`)
- **`N_FFT`**: **FFT window length** (e.g., `2048`)
- **`HOP_LENGTH`**: Hop size between frames (e.g., `N_FFT // 4`)
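
As a quick sanity check on these settings, the arithmetic below shows how `N_FFT` and `HOP_LENGTH` translate into time and frequency resolution. The 22050 Hz rate and 3-second duration are assumed example values, and the frame count assumes centred (padded) framing, which is librosa's default.

```python
sr = 22050                  # assumed sampling rate (Hz)
n_fft = 2048                # FFT window length (samples)
hop_length = n_fft // 4     # hop between frames (samples)
duration_s = 3.0            # assumed clip length (s)

n_samples = int(duration_s * sr)
n_frames = 1 + n_samples // hop_length   # feature columns with centred framing
window_ms = 1000 * n_fft / sr            # time span covered by one frame
hop_ms = 1000 * hop_length / sr          # time step between feature columns
freq_resolution_hz = sr / n_fft          # spacing between FFT bins
```

Doubling `n_fft` halves the bin spacing but doubles the window duration, which is the usual time versus frequency resolution trade-off to keep in mind when comparing speech and music.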

---

## 👀 What to Observe

### 🎼 MFCCs
- Compare **average MFCC profiles** for **speech vs. music**:  
  - The low-order coefficients capture the broad **spectral envelope shape**, so **speech** and music usually differ most there
- **Variance** reveals how much each coefficient **fluctuates over time**  
  - Speech may have more dynamic changes in the **lower coefficients**

---

### 🎚️ Spectral Contrast
- **Mean contrast** shows typical **“peak-to-valley” differences** in each sub-band  
  - Music with strong harmonic content often has **higher contrast**
- **Variance** shows **temporal variability** in contrast  
  - **Percussive music** may produce **larger fluctuations**

---

🧠 **Insight**:  
By comparing **MFCCs and Spectral Contrast**, you can observe how **different types of audio** (speech vs. music) produce **distinct timbral signatures**—a crucial step for tasks like **classification, clustering, and audio understanding**.


In [None]:
# ── USER SETTINGS ────────────────────────────────────────────────────────────────
FILENAME_SPEECH = 'speech.WAV'    # ← place your speech clip in `sounds/`
FILENAME_MUSIC  = 'music.mp3'     # ← place your music clip in `sounds/`
SR              = None            # ← sampling rate (None to use file’s native rate)
N_MFCC          = 13              # ← number of MFCC coefficients
N_FFT           = 2048            # ← FFT window size (power of two)
HOP_LENGTH      = N_FFT // 4      # ← hop length between frames
# ────────────────────────────────────────────────────────────────────────────────

import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display
import soundfile as sf
from IPython.display import Audio, display
from pathlib import Path

# ── CONFIG (don’t edit below here) ───────────────────────────────────────────────
SOUNDS_DIR = Path('sounds')

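def load_audio(path, sr=None):
    """Load a clip as mono float32; WAV files are read with SoundFile."""
    ext = path.suffix.lower()
    if ext == '.wav':
        y, sr_native = sf.read(str(path), dtype='float32')
        if y.ndim > 1:                      # (frames, channels) -> mono
            y = y.mean(axis=1)
        if sr is None or sr == sr_native:
            return y, sr_native
        # orig_sr/target_sr are keyword-only in recent librosa releases
        return librosa.resample(y, orig_sr=sr_native, target_sr=sr), sr
    else:
        # librosa.load already returns mono audio by default
        return librosa.load(str(path), sr=sr)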

# 1) Load audio
path_s = SOUNDS_DIR / FILENAME_SPEECH
path_m = SOUNDS_DIR / FILENAME_MUSIC
y_s, sr_s = load_audio(path_s, sr=SR)
y_m, sr_m = load_audio(path_m, sr=SR)

# 2) Display audio players
print("▶️ Speech Clip")
display(Audio(data=y_s, rate=sr_s, autoplay=False))
print("▶️ Music Clip")
display(Audio(data=y_m, rate=sr_m, autoplay=False))

# 3) Extract MFCCs
mfcc_s = librosa.feature.mfcc(y=y_s, sr=sr_s,
                              n_mfcc=N_MFCC,
                              n_fft=N_FFT,
                              hop_length=HOP_LENGTH)
mfcc_m = librosa.feature.mfcc(y=y_m, sr=sr_m,
                              n_mfcc=N_MFCC,
                              n_fft=N_FFT,
                              hop_length=HOP_LENGTH)

# 4) Extract Spectral Contrast
contrast_s = librosa.feature.spectral_contrast(y=y_s, sr=sr_s,
                                               n_fft=N_FFT,
                                               hop_length=HOP_LENGTH)
contrast_m = librosa.feature.spectral_contrast(y=y_m, sr=sr_m,
                                               n_fft=N_FFT,
                                               hop_length=HOP_LENGTH)

# 5) Compute statistics
mfcc_mean_s = mfcc_s.mean(axis=1)
mfcc_var_s  = mfcc_s.var(axis=1)
mfcc_mean_m = mfcc_m.mean(axis=1)
mfcc_var_m  = mfcc_m.var(axis=1)

cont_mean_s = contrast_s.mean(axis=1)
cont_var_s  = contrast_s.var(axis=1)
cont_mean_m = contrast_m.mean(axis=1)
cont_var_m  = contrast_m.var(axis=1)

# 6) Plot MFCC mean & variance with constrained layout
fig, axs = plt.subplots(1, 2, figsize=(14, 5), constrained_layout=True)
idx = np.arange(N_MFCC)

axs[0].bar(idx - 0.2, mfcc_mean_s, width=0.4, label='Speech')
axs[0].bar(idx + 0.2, mfcc_mean_m, width=0.4, label='Music')
axs[0].set_title('MFCC Mean')
axs[0].set_xlabel('Coefficient Index')
axs[0].set_ylabel('Mean Value')
axs[0].legend()
axs[0].grid(True)

axs[1].bar(idx - 0.2, mfcc_var_s, width=0.4, label='Speech')
axs[1].bar(idx + 0.2, mfcc_var_m, width=0.4, label='Music')
axs[1].set_title('MFCC Variance')
axs[1].set_xlabel('Coefficient Index')
axs[1].set_ylabel('Variance')
axs[1].legend()
axs[1].grid(True)

fig.suptitle('MFCC Statistics: Speech vs. Music', y=1.02)
plt.show()

# 7) Plot Spectral Contrast mean & variance with constrained layout
n_bands = contrast_s.shape[0]
idx_c   = np.arange(n_bands)

fig, axs = plt.subplots(1, 2, figsize=(14, 5), constrained_layout=True)

axs[0].bar(idx_c - 0.2, cont_mean_s, width=0.4, label='Speech')
axs[0].bar(idx_c + 0.2, cont_mean_m, width=0.4, label='Music')
axs[0].set_title('Spectral Contrast Mean')
axs[0].set_xlabel('Frequency Band Index')
axs[0].set_ylabel('Mean Value (dB)')
axs[0].legend()
axs[0].grid(True)

axs[1].bar(idx_c - 0.2, cont_var_s, width=0.4, label='Speech')
axs[1].bar(idx_c + 0.2, cont_var_m, width=0.4, label='Music')
axs[1].set_title('Spectral Contrast Variance')
axs[1].set_xlabel('Frequency Band Index')
axs[1].set_ylabel('Variance')
axs[1].legend()
axs[1].grid(True)

fig.suptitle('Spectral Contrast Statistics: Speech vs. Music', y=1.02)
plt.show()


## 🛠 Exercise: Feature-Based Classification

### 🎯 Task:
Extract a bank of **20 hand-crafted features**—spanning time-domain, spectral, temporal, and statistical descriptors—from a collection of audio clips (speech and music).

### 📊 Analysis:
1. **Feature Extraction**  
   - Compute features such as zero-crossing rate, RMS energy, spectral centroid, bandwidth, roll-off, tempo, onset rate, skewness, kurtosis, MFCC statistics, spectral contrast statistics, etc., for each clip.  
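   - Normalize or standardize each feature. To avoid leaking test-set statistics, compute the scaling parameters (e.g. mean and standard deviation) on the training split only and apply them to both splits.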

2. **k-Nearest Neighbors Classification**  
   - Split your dataset into training and test sets.  
   - Train a simple k-NN classifier on the feature vectors to distinguish **speech vs. music**.  
   - Tune the number of neighbors \(k\) via cross-validation.
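
As a starting point, the sketch below standardizes features with training-set statistics and classifies with a hand-written k-NN in NumPy. It is a minimal illustration on synthetic data, not a reference solution: `standardize` and `knn_predict` are hypothetical helpers, the random feature matrix stands in for your real 20 features, and in practice a library classifier with cross-validation would usually replace the manual vote.

```python
import numpy as np

def standardize(X_train, X_test):
    """Z-score both sets using the training mean and standard deviation."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0                  # leave constant features unscaled
    return (X_train - mu) / sigma, (X_test - mu) / sigma

def knn_predict(X_train, y_train, X_test, k=3):
    """Majority vote among the k nearest training points (Euclidean)."""
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest]                 # shape (n_test, k), values 0 or 1
    return (votes.mean(axis=1) > 0.5).astype(int)

# Synthetic stand-in for the real feature matrix: 2 classes, 20 features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (40, 20)),    # class 0 = "speech"
               rng.normal(1.5, 1.0, (40, 20))])   # class 1 = "music"
y = np.array([0] * 40 + [1] * 40)

# Shuffle, then hold out the last 20 clips as a test set.
order = rng.permutation(len(y))
train, test = order[:60], order[60:]
X_tr, X_te = standardize(X[train], X[test])
y_pred = knn_predict(X_tr, y[train], X_te, k=3)
accuracy = np.mean(y_pred == y[test])
```

Using an odd `k` avoids tied votes in a two-class problem. Standardizing matters for k-NN because features on larger numeric scales would otherwise dominate the Euclidean distance.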

### 📦 Deliverables:
- ✅ A script or notebook section that **computes and normalizes** all 20 features for each audio clip.  
- 📁 Two arrays/matrices:  
  - **X** of shape \((n\_samples, 20)\) containing feature vectors  
  - **y** of shape \((n\_samples,)\) containing binary labels (0 = speech, 1 = music)  
- 📈 A **classification report** showing accuracy, precision, recall, and F1-score.  
- 📉 A **confusion matrix** visualization for your test set.  
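
For the report and confusion matrix, the quantities can be computed directly from predictions as sketched below. The helper name `binary_report` and the example labels are made up for illustration; the label convention (0 = speech, 1 = music, with music as the positive class) follows the deliverables above.

```python
import numpy as np

def binary_report(y_true, y_pred):
    """Confusion matrix and summary metrics, treating label 1 as positive."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))

    cm = np.array([[tn, fp],
                   [fn, tp]])            # rows = true class, cols = predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    metrics = {"accuracy": (tp + tn) / len(y_true),
               "precision": precision, "recall": recall, "f1": f1}
    return cm, metrics

# Made-up labels for ten test clips (0 = speech, 1 = music).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
cm, metrics = binary_report(y_true, y_pred)
```

The matrix `cm` can be passed to `plt.imshow` (or a similar heatmap call) for the visualization deliverable.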
