# Environmental Sound Classification with ESC-50

In previous classes, we explored audio signal processing, built a digit recognizer, trained emotion classifiers, and used Whisper for speech recognition. All of these focused on **human speech**. But the world is full of non-speech sounds too!

**Environmental Sound Classification (ESC)** is the task of automatically identifying sounds in our environment — a dog barking, rain falling, a siren wailing, or a keyboard clicking.

### What you'll learn:

1. **The ESC-50 dataset** — a benchmark collection of 2,000 environmental sounds across 50 categories
2. **Data exploration** — listening to, visualizing, and understanding different types of sounds
3. **Audio features** — extracting meaningful numbers from sound (MFCCs, spectral features)
4. **Classical ML** — training Random Forest and SVM classifiers (fast, no GPU needed!)
5. **Pretrained deep learning** — using an Audio Spectrogram Transformer already trained on ESC-50
6. **Evaluation** — confusion matrices, per-class accuracy, and understanding model mistakes
7. **Real-time classification** — identifying sounds from your microphone
8. **Triggering actions** — using sound classification to control other systems (OSC, Serial, Arduino)

### Why Environmental Sound Classification?

| Application | Example |
|-------------|--------|
| **Smart Homes** | Detect a smoke alarm, glass breaking, or doorbell |
| **Wildlife Monitoring** | Identify bird species, detect poachers |
| **Urban Planning** | Map noise pollution, detect traffic patterns |
| **Accessibility** | Alert deaf/hard-of-hearing people to important sounds |
| **Creative Coding** | Trigger visuals or music based on environmental sounds |
| **Security** | Detect gunshots, screams, or breaking glass |
| **Healthcare** | Monitor coughing, snoring, or breathing patterns |

### Where the Field Stands

Accuracy on ESC-50 has climbed dramatically thanks to **transfer learning** — using models pretrained on large audio/image datasets:

| Year | Approach | ESC-50 Accuracy | Key Idea |
|------|----------|----------------|----------|
| 2015 | Piczak CNN (from scratch) | ~64% | Train a CNN on mel spectrograms |
| 2020 | PANNs (Kong et al.) | ~95% | Pretrain CNNs on AudioSet (2M clips), fine-tune |
| 2021 | AST (Gong et al.) | ~96% | Apply a Vision Transformer to spectrograms |
| 2022 | BEATs (Chen et al.) | ~98% | Self-supervised pretraining with audio tokenizers |
| — | **Human listeners** | **~81%** | Reported in the original ESC-50 paper |

The big insight: **treating mel spectrograms as images** and using image classification models (CNNs, Vision Transformers) pretrained on millions of examples works remarkably well for audio. In this notebook, you'll experience this progression firsthand!

### Background Reading & Key Papers

- [ESC-50 Paper](https://www.karolpiczak.com/papers/Piczak2015-ESC-Dataset.pdf) — Piczak (2015), the original dataset paper
- [PANNs](https://arxiv.org/abs/1912.10211) — Kong et al. (2020), systematic AudioSet pretraining
- [Audio Spectrogram Transformer](https://arxiv.org/abs/2104.01778) — Gong et al. (2021), Vision Transformer for audio
- [BEATs](https://arxiv.org/abs/2212.09058) — Chen et al. (2022), self-supervised SOTA
- [HuggingFace Audio Course (Ch. 4)](https://huggingface.co/learn/audio-course/chapter4/introduction) — Free tutorial on pretrained audio models
- [ESC-50 GitHub](https://github.com/karolpiczak/ESC-50) — Official dataset repository

---

## Part 1: Setup and Installation

We need a couple of new libraries for this class.

**New libraries:**

| Library | What it does |
|---------|-------------|
| `transformers` | HuggingFace library for loading pretrained AI models |
| `datasets` | HuggingFace library for downloading datasets |

**Libraries from previous classes (should already be installed):**

| Library | What it does |
|---------|-------------|
| `librosa` | Audio analysis and feature extraction |
| `sounddevice` | Recording audio from your microphone |
| `soundfile` | Reading and writing audio files |
| `scikit-learn` | Machine learning models and evaluation tools |
| `matplotlib` / `seaborn` | Charts and visualizations |
| `pandas` | Working with tabular data |
| `numpy` | Numerical computing |

**Install new libraries:**

```bash
uv pip install transformers datasets
```

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import librosa
import librosa.display
import sounddevice as sd
import soundfile as sf
import os
import torch

from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler
from datasets import load_dataset
from transformers import pipeline
from tqdm.notebook import tqdm
from IPython.display import Audio, display

import warnings
warnings.filterwarnings('ignore')

print("All libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"PyTorch version: {torch.__version__}")

if torch.cuda.is_available():
    print(f"GPU available: {torch.cuda.get_device_name(0)}")
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    print("Apple Silicon (MPS) detected")
else:
    print("Using CPU (this is fine for everything in this notebook!)")

## Part 2: The ESC-50 Dataset

### What is ESC-50?

ESC-50 is a labeled collection of **2,000 environmental audio recordings** designed for benchmarking sound classification methods. It was created by Karol Piczak in 2015 and is one of the most popular benchmarks in audio research.

| Property | Value |
|----------|-------|
| Total recordings | 2,000 |
| Categories | 50 |
| Clips per category | 40 |
| Clip duration | 5 seconds |
| Sample rate | 44,100 Hz |
| Format | WAV (mono) |
| Folds | 5 (for cross-validation) |

The 50 categories are organized into **5 major groups**:

1. **Animals** — dog, cat, rooster, crow, frog...
2. **Natural soundscapes** — rain, sea waves, thunderstorm, wind...
3. **Human (non-speech)** — crying baby, clapping, laughing, coughing...
4. **Domestic/Interior** — clock tick, door knock, keyboard typing, vacuum cleaner...
5. **Urban/Exterior** — helicopter, siren, car horn, train, chainsaw...

In [None]:
# Download the ESC-50 dataset from HuggingFace
# First time: downloads ~600MB of audio. After that: loads from cache instantly!

print("Loading ESC-50 dataset from HuggingFace...")
print("(First time: downloads ~600MB. After that: loads from cache)\n")

dataset = load_dataset("ashraq/esc50")
esc50 = dataset['train']

print(f"Dataset loaded! {len(esc50)} audio clips")
print(f"Columns: {esc50.column_names}")

# Look at the first sample
sample = esc50[0]
print(f"\nFirst sample:")
for key, value in sample.items():
    if key == 'audio':
        print(f"  audio: numpy array with {len(value['array'])} samples at {value['sampling_rate']} Hz")
    else:
        print(f"  {key}: {value}")

In [None]:
# Create a pandas DataFrame for easy exploration

# Drop the audio column first so building the table does not decode all 2,000 clips
metadata = esc50.remove_columns(['audio'])
df = metadata.to_pandas()[['filename', 'fold', 'target', 'category', 'esc10']]

print("ESC-50 Dataset Overview:")
print("=" * 50)
print(f"Total samples: {len(df)}")
print(f"Number of categories: {df['category'].nunique()}")
print(f"Samples per category: {len(df) // df['category'].nunique()}")
print(f"Number of folds: {df['fold'].nunique()}")
print(f"\nFold distribution:")
print(df['fold'].value_counts().sort_index())
print(f"\nFirst few rows:")
df.head(10)

In [None]:
# ESC-50's 50 classes organized into 5 major categories

MAJOR_CATEGORIES = {
    'Animals': ['dog', 'rooster', 'pig', 'cow', 'frog', 'cat', 'hen', 'insects', 'sheep', 'crow'],
    'Natural': ['rain', 'sea_waves', 'crackling_fire', 'crickets', 'chirping_birds',
                'water_drops', 'wind', 'pouring_water', 'toilet_flush', 'thunderstorm'],
    'Human': ['crying_baby', 'sneezing', 'clapping', 'breathing', 'coughing',
              'footsteps', 'laughing', 'brushing_teeth', 'snoring', 'drinking_sipping'],
    'Domestic': ['door_wood_knock', 'mouse_click', 'keyboard_typing', 'door_wood_creaks',
                 'can_opening', 'washing_machine', 'vacuum_cleaner', 'clock_alarm',
                 'clock_tick', 'glass_breaking'],
    'Urban': ['helicopter', 'chainsaw', 'siren', 'car_horn', 'engine',
              'train', 'church_bells', 'airplane', 'fireworks', 'hand_saw'],
}

def get_major_category(category):
    """Return which major category a sound class belongs to."""
    for major, classes in MAJOR_CATEGORIES.items():
        if category in classes:
            return major
    return 'Unknown'

df['major_category'] = df['category'].apply(get_major_category)

# Visualize the dataset
COLORS = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4', '#FFEAA7']

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

category_counts = df['major_category'].value_counts()
axes[0].bar(category_counts.index, category_counts.values, color=COLORS)
axes[0].set_title('Samples per Major Category')
axes[0].set_ylabel('Number of Samples')
axes[0].tick_params(axis='x', rotation=45)

fold_counts = df['fold'].value_counts().sort_index()
axes[1].bar(fold_counts.index, fold_counts.values, color='steelblue')
axes[1].set_title('Samples per Fold (for Cross-Validation)')
axes[1].set_ylabel('Number of Samples')
axes[1].set_xlabel('Fold')

plt.tight_layout()
plt.show()

# Show what's in each category
print("\nMajor category breakdown:")
for cat in MAJOR_CATEGORIES:
    classes = MAJOR_CATEGORIES[cat]
    print(f"  {cat}: {', '.join(classes)}")

### What are Folds?

ESC-50 comes with **5 pre-defined folds** for cross-validation. Each fold contains 400 clips (8 per category).

The folds are designed so that clips from the **same original source recording** are always in the same fold. This prevents **data leakage** — the same issue we discussed in Class 2!

For evaluation, we'll use **folds 1–4 for training** and **fold 5 for testing**.
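
Published ESC-50 results are typically averaged over all five folds, holding each one out in turn, rather than taken from a single split. The sketch below shows that rotation on a made-up fold column; `leave_one_fold_out` and `toy_folds` are illustrative names, not part of the dataset or any library.

```python
def leave_one_fold_out(fold_ids, n_folds=5):
    """Yield (test_fold, train_indices, test_indices), holding out each fold once."""
    for test_fold in range(1, n_folds + 1):
        train_idx = [i for i, f in enumerate(fold_ids) if f != test_fold]
        test_idx = [i for i, f in enumerate(fold_ids) if f == test_fold]
        yield test_fold, train_idx, test_idx


# Toy stand-in for the dataset's fold column: 2,000 clips, 400 per fold.
toy_folds = [(i % 5) + 1 for i in range(2000)]

splits = list(leave_one_fold_out(toy_folds))
for test_fold, train_idx, test_idx in splits:
    print(f"Test fold {test_fold}: {len(train_idx)} train clips, {len(test_idx)} test clips")
```

With the real data you would pass the dataset's `fold` column instead of `toy_folds`, train a fresh model for each split, and average the five test accuracies.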

## Part 3: Listening to Environmental Sounds

The most important step in any ML project: **understand your data before building models**.

Let's listen to examples from each major category and visualize what they look like!

In [None]:
# Listen to one example from each major category

print("One example from each major category:")
print("=" * 60)

for major_cat, classes in MAJOR_CATEGORIES.items():
    example_class = classes[0]

    for i in range(len(esc50)):
        if esc50[i]['category'] == example_class:
            sample = esc50[i]
            break

    audio = sample['audio']['array']
    sr = sample['audio']['sampling_rate']

    print(f"\n{major_cat} — {example_class}")
    display(Audio(audio, rate=sr))

In [None]:
# Compare waveforms of 5 different sounds — one from each major category
# Notice how different types of sounds have very different shapes!

example_sounds = ['dog', 'rain', 'clapping', 'keyboard_typing', 'siren']

fig, axes = plt.subplots(5, 1, figsize=(14, 12))

for idx, sound_name in enumerate(example_sounds):
    for i in range(len(esc50)):
        if esc50[i]['category'] == sound_name:
            sample = esc50[i]
            break

    audio = sample['audio']['array']
    sr = sample['audio']['sampling_rate']
    duration = len(audio) / sr
    time = np.linspace(0, duration, len(audio))

    axes[idx].plot(time, audio, color=COLORS[idx], alpha=0.8, linewidth=0.5)
    axes[idx].set_ylabel('Amplitude')
    axes[idx].set_title(f'{sound_name} ({get_major_category(sound_name)})')
    axes[idx].set_xlim(0, duration)
    axes[idx].grid(True, alpha=0.3)

axes[-1].set_xlabel('Time (seconds)')
plt.suptitle('Comparing Waveforms of Different Environmental Sounds', fontsize=14, y=1.01)
plt.tight_layout()
plt.show()

print("Notice the different patterns:")
print("  - Dog bark: short, loud bursts")
print("  - Rain: continuous random noise")
print("  - Clapping: sharp periodic impacts")
print("  - Keyboard: rapid quiet clicks")
print("  - Siren: smooth periodic oscillation")

In [None]:
# Compare mel spectrograms of different sounds
# Spectrograms show frequency content over time — this is what ML models "see"!

fig, axes = plt.subplots(5, 1, figsize=(14, 14))

for idx, sound_name in enumerate(example_sounds):
    for i in range(len(esc50)):
        if esc50[i]['category'] == sound_name:
            sample = esc50[i]
            break

    audio = sample['audio']['array'].astype(np.float32)
    sr = sample['audio']['sampling_rate']

    S = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=128)
    S_db = librosa.power_to_db(S, ref=np.max)

    librosa.display.specshow(S_db, x_axis='time', y_axis='mel', sr=sr, ax=axes[idx])
    axes[idx].set_title(f'{sound_name} ({get_major_category(sound_name)})')
    axes[idx].label_outer()

plt.suptitle('Mel Spectrograms of Different Environmental Sounds', fontsize=14, y=1.01)
plt.tight_layout()
plt.show()

print("Each sound has a unique 'fingerprint' in its spectrogram:")
print("  - Dog bark: bright horizontal bands (harmonics) in short bursts")
print("  - Rain: uniform energy spread across all frequencies")
print("  - Clapping: vertical lines (broadband impulses)")
print("  - Keyboard: faint, quick vertical marks")
print("  - Siren: rising and falling frequency sweep")
print("\nThese visual patterns are exactly what neural networks learn to recognize!")

In [None]:
# Listen to multiple examples of the SAME category
# Even within one class, there's a lot of variation!

explore_category = 'dog'  # Change this to explore other categories!

print(f"Four examples of '{explore_category}':")
print("Notice how much variation there is within a single category!\n")

count = 0
for i in range(len(esc50)):
    if count >= 4:
        break  # stop scanning once four examples have been played
    sample = esc50[i]
    if sample['category'] == explore_category:
        audio = sample['audio']['array']
        sr = sample['audio']['sampling_rate']
        print(f"Example {count + 1} ({sample['filename']}):")
        display(Audio(audio, rate=sr))
        count += 1

## Part 4: Feature Extraction

Before we can train a machine learning model, we need to convert audio into **numbers** that the model can work with. This is called **feature extraction**.

### What are Audio Features?

Think of features as a compact summary of what makes a sound unique. Instead of feeding the model hundreds of thousands of raw audio samples (a 5-second clip at 44,100 Hz has 220,500), we extract a short description.

| Feature | What it captures | Example |
|---------|-----------------|--------|
| **MFCCs** | Tonal quality (timbre) | Distinguishes a dog bark from a cat meow |
| **Chroma** | Musical pitch content | Tells apart church bells from a siren |
| **Spectral Centroid** | "Brightness" of the sound | A whistle is bright, thunder is dark |
| **Spectral Bandwidth** | How spread out the frequencies are | White noise is wide, a tone is narrow |
| **Spectral Rolloff** | Frequency below which most of the energy lies (85% by default in librosa) | High for hissing, low for rumbling |
| **Zero Crossing Rate** | How often the signal crosses zero | High for noisy sounds, low for tonal |
| **RMS Energy** | Loudness over time | Loud vs. quiet sounds |
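
To make two rows of this table concrete, here is a minimal NumPy sketch that computes a zero crossing rate and a spectral centroid by hand for a pure tone and for white noise. It is a simplification: librosa computes these per frame, while the two helper functions below (names chosen for illustration) treat the whole signal at once.

```python
import numpy as np


def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.signbit(signal)
    return np.mean(signs[1:] != signs[:-1])


def spectral_centroid(signal, sr):
    """Magnitude-weighted mean frequency of the whole signal, in Hz."""
    magnitudes = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return np.sum(freqs * magnitudes) / np.sum(magnitudes)


sr = 44100
t = np.arange(sr) / sr                  # one second of time stamps
tone = np.sin(2 * np.pi * 440 * t)      # pure 440 Hz tone
rng = np.random.default_rng(0)
noise = rng.standard_normal(sr)         # white noise

print(f"ZCR      tone: {zero_crossing_rate(tone):.3f}   noise: {zero_crossing_rate(noise):.3f}")
print(f"Centroid tone: {spectral_centroid(tone, sr):.0f} Hz   noise: {spectral_centroid(noise, sr):.0f} Hz")
```

The tone should show a low zero crossing rate and a centroid near 440 Hz, while the noise crosses zero far more often and has a much higher ("brighter") centroid.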

**MFCCs (Mel-Frequency Cepstral Coefficients)** are among the most widely used features for sound classification. They capture the overall "shape" of the sound's frequency spectrum on a mel scale, which approximates how humans perceive pitch.
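
Per-frame features have one column per time frame, so a classical classifier needs them collapsed into a vector of fixed length. One common approach, and the one this notebook uses, is to keep the mean and standard deviation of each feature over time. The sketch below shows the idea on random stand-in arrays; `summarize_over_time` is an illustrative helper, not a librosa function.

```python
import numpy as np


def summarize_over_time(frame_features):
    """Collapse a (n_features, n_frames) matrix to a 1-D vector of means and stds."""
    return np.concatenate([frame_features.mean(axis=1), frame_features.std(axis=1)])


rng = np.random.default_rng(0)
fake_mfccs = rng.standard_normal((13, 431))   # stand-in: 13 coefficients x 431 frames
fake_chroma = rng.random((12, 200))           # a different number of frames still works

print(summarize_over_time(fake_mfccs).shape)   # (26,)
print(summarize_over_time(fake_chroma).shape)  # (24,)
```

The output length depends only on the number of features, not on the clip duration. The price is that all information about when things happen within the clip is discarded.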

In [None]:
# Extract features from a single audio sample to see what they look like

sample = esc50[0]
audio = sample['audio']['array'].astype(np.float32)
sr = sample['audio']['sampling_rate']

print(f"Sample: {sample['category']} ({sample['filename']})")
print(f"Audio shape: {audio.shape}, Sample rate: {sr} Hz")
print(f"Duration: {len(audio)/sr:.1f} seconds\n")

# Extract various features
mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
chroma = librosa.feature.chroma_stft(y=audio, sr=sr)
spectral_centroid = librosa.feature.spectral_centroid(y=audio, sr=sr)
spectral_bandwidth = librosa.feature.spectral_bandwidth(y=audio, sr=sr)
spectral_rolloff = librosa.feature.spectral_rolloff(y=audio, sr=sr)
zero_crossing_rate = librosa.feature.zero_crossing_rate(audio)
rms = librosa.feature.rms(y=audio)

print("Feature shapes (values for each time frame):")
print(f"  MFCCs:              {mfccs.shape} (13 coefficients x {mfccs.shape[1]} frames)")
print(f"  Chroma:             {chroma.shape} (12 pitch classes x {chroma.shape[1]} frames)")
print(f"  Spectral Centroid:  {spectral_centroid.shape}")
print(f"  Spectral Bandwidth: {spectral_bandwidth.shape}")
print(f"  Spectral Rolloff:   {spectral_rolloff.shape}")
print(f"  Zero Crossing Rate: {zero_crossing_rate.shape}")
print(f"  RMS Energy:         {rms.shape}")

# Visualize MFCCs
fig, axes = plt.subplots(2, 1, figsize=(12, 6))

img = librosa.display.specshow(mfccs, x_axis='time', sr=sr, ax=axes[0])
axes[0].set_title(f'MFCCs for "{sample["category"]}"')
axes[0].set_ylabel('MFCC Coefficient')
plt.colorbar(img, ax=axes[0])

S = librosa.feature.melspectrogram(y=audio, sr=sr)
S_db = librosa.power_to_db(S, ref=np.max)
librosa.display.specshow(S_db, x_axis='time', y_axis='mel', sr=sr, ax=axes[1])
times = librosa.times_like(spectral_centroid, sr=sr)  # pass sr; the default (22,050 Hz) would stretch the time axis
axes[1].plot(times, spectral_centroid[0], color='white', linewidth=2, label='Spectral Centroid')
axes[1].legend()
axes[1].set_title('Spectral Centroid (white line) on Mel Spectrogram')

plt.tight_layout()
plt.show()

In [None]:
# Extract features from ALL 2,000 audio clips
# We summarize each feature over time (mean + standard deviation) to get
# a fixed-length vector per clip. This is the input for our ML classifiers.

def extract_features(audio, sr):
    """Extract audio features and return a fixed-length feature vector."""
    features = []

    # MFCCs (13 coefficients x 2 stats = 26 features)
    mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
    features.extend(np.mean(mfccs, axis=1))
    features.extend(np.std(mfccs, axis=1))

    # Chroma (12 pitch classes x 2 stats = 24 features)
    chroma = librosa.feature.chroma_stft(y=audio, sr=sr)
    features.extend(np.mean(chroma, axis=1))
    features.extend(np.std(chroma, axis=1))

    # Spectral Centroid (2 features)
    spectral_centroid = librosa.feature.spectral_centroid(y=audio, sr=sr)
    features.extend([np.mean(spectral_centroid), np.std(spectral_centroid)])

    # Spectral Bandwidth (2 features)
    spectral_bandwidth = librosa.feature.spectral_bandwidth(y=audio, sr=sr)
    features.extend([np.mean(spectral_bandwidth), np.std(spectral_bandwidth)])

    # Spectral Rolloff (2 features)
    spectral_rolloff = librosa.feature.spectral_rolloff(y=audio, sr=sr)
    features.extend([np.mean(spectral_rolloff), np.std(spectral_rolloff)])

    # Zero Crossing Rate (2 features)
    zero_crossing_rate = librosa.feature.zero_crossing_rate(audio)
    features.extend([np.mean(zero_crossing_rate), np.std(zero_crossing_rate)])

    # RMS Energy (2 features)
    rms = librosa.feature.rms(y=audio)
    features.extend([np.mean(rms), np.std(rms)])

    return np.array(features)


# Extract features from all samples
print(f"Extracting features from all {len(esc50)} samples...")
print("This takes 2-5 minutes depending on your machine.\n")

all_features = []
all_labels = []
all_folds = []
all_categories = []

for i in tqdm(range(len(esc50))):
    sample = esc50[i]
    audio = sample['audio']['array'].astype(np.float32)
    sr = sample['audio']['sampling_rate']

    try:
        features = extract_features(audio, sr)
        all_features.append(features)
        all_labels.append(sample['target'])
        all_folds.append(sample['fold'])
        all_categories.append(sample['category'])
    except Exception as e:
        print(f"Error on sample {i} ({sample['filename']}): {e}")

X = np.array(all_features)
y = np.array(all_labels)
folds = np.array(all_folds)
categories = np.array(all_categories)

print(f"\nFeature matrix shape: {X.shape} (samples x features)")
print(f"Labels shape: {y.shape}")
print(f"Total features per sample: {X.shape[1]}")
print("Done! Ready for classification.")

## Part 5: Classical ML Classification

Now that we have features extracted from every audio clip, we can train machine learning classifiers! These are similar to the models you may have used in Class 3.

### Train/Test Split

We'll use ESC-50's built-in folds: **folds 1–4 for training** (1,600 samples) and **fold 5 for testing** (400 samples).

### Models We'll Train

| Model | How it works | Strengths |
|-------|-----------|-----------|
| **Random Forest** | Ensemble of decision trees that vote on the answer | Fast, hard to overfit, handles many features well |
| **SVM** | Finds optimal boundaries between classes | Great for high-dimensional data |
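
An RBF-kernel SVM compares feature vectors by distance, so a feature measured in thousands (such as spectral centroid in Hz) can swamp one measured in fractions (such as zero crossing rate) unless both are standardized. The sketch below does by hand what `StandardScaler` does, on made-up numbers. The key detail is that the mean and standard deviation come from the training data only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up features on very different scales:
# column 0 resembles a spectral centroid in Hz, column 1 a zero crossing rate.
X_train_toy = np.column_stack([rng.normal(3000, 800, 100), rng.normal(0.1, 0.03, 100)])
X_test_toy = np.column_stack([rng.normal(3000, 800, 20), rng.normal(0.1, 0.03, 20)])

# Fit the statistics on the training set only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)

X_train_std = (X_train_toy - mean) / std
X_test_std = (X_test_toy - mean) / std   # reuse the training statistics

print("Raw feature std:   ", X_train_toy.std(axis=0))
print("Scaled feature std:", X_train_std.std(axis=0))
```

Reusing the training statistics on the test set keeps information about the test data out of preprocessing. Random Forests split on one feature at a time and are largely insensitive to this rescaling.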

In [None]:
# Split data using ESC-50's built-in folds

train_mask = folds != 5
test_mask = folds == 5

X_train, X_test = X[train_mask], X[test_mask]
y_train, y_test = y[train_mask], y[test_mask]
categories_test = categories[test_mask]

print(f"Training samples: {len(X_train)}")
print(f"Test samples:     {len(X_test)}")

# Normalize features (important for SVM!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Random Forest
print("\nTraining Random Forest...")
rf = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf.fit(X_train_scaled, y_train)
rf_accuracy = rf.score(X_test_scaled, y_test)
print(f"  Random Forest accuracy: {rf_accuracy:.1%}")

# Train SVM
print("\nTraining SVM...")
svm = SVC(kernel='rbf', C=10, gamma='scale', random_state=42)
svm.fit(X_train_scaled, y_train)
svm_accuracy = svm.score(X_test_scaled, y_test)
print(f"  SVM accuracy: {svm_accuracy:.1%}")

print(f"\nBest classical ML model: {'Random Forest' if rf_accuracy >= svm_accuracy else 'SVM'}")
print(f"(For reference: reported human accuracy on ESC-50 is about 81%)")

In [None]:
# Detailed evaluation: confusion matrix and classification report

best_model = rf if rf_accuracy >= svm_accuracy else svm
best_name = "Random Forest" if rf_accuracy >= svm_accuracy else "SVM"

y_pred = best_model.predict(X_test_scaled)

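# Order class names by their integer target ID so that index i matches label i
# (an alphabetical sort would attach the wrong names to the labels)
unique_categories = [name for _, name in sorted(set(zip(all_labels, all_categories)))]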

print(f"Classification Report ({best_name}):")
print("=" * 70)
print(classification_report(y_test, y_pred, target_names=unique_categories, zero_division=0))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(16, 14))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=unique_categories,
            yticklabels=unique_categories)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title(f'Confusion Matrix — {best_name} (Accuracy: {accuracy_score(y_test, y_pred):.1%})')
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

In [None]:
# Which sounds are easiest and hardest to classify?

per_class_acc = {}
for i, cat in enumerate(unique_categories):
    mask = y_test == i
    if mask.sum() > 0:
        per_class_acc[cat] = (y_pred[mask] == y_test[mask]).mean()

sorted_acc = sorted(per_class_acc.items(), key=lambda x: x[1], reverse=True)

fig, ax = plt.subplots(figsize=(14, 10))
cats = [x[0] for x in sorted_acc]
accs = [x[1] for x in sorted_acc]
bar_colors = ['#2ecc71' if a >= 0.7 else '#f39c12' if a >= 0.4 else '#e74c3c' for a in accs]

ax.barh(range(len(cats)), accs, color=bar_colors)
ax.set_yticks(range(len(cats)))
ax.set_yticklabels(cats, fontsize=8)
ax.set_xlabel('Accuracy')
ax.set_title(f'Per-Class Accuracy — {best_name}')
ax.axvline(x=accuracy_score(y_test, y_pred), color='black', linestyle='--',
           label=f'Overall: {accuracy_score(y_test, y_pred):.1%}')
ax.legend()
ax.set_xlim(0, 1.05)
plt.tight_layout()
plt.show()

# Top confusions
print("Top 10 Most Confused Sound Pairs:")
print("=" * 55)
confusions = []
for i in range(len(unique_categories)):
    for j in range(len(unique_categories)):
        if i != j and cm[i][j] > 0:
            confusions.append((unique_categories[i], unique_categories[j], cm[i][j]))

confusions.sort(key=lambda x: x[2], reverse=True)
for true_cat, pred_cat, count in confusions[:10]:
    print(f"  {true_cat:20s} mistaken for {pred_cat:20s} ({count} times)")

### Understanding the Results

The confusion matrix and per-class accuracy reveal interesting patterns:

- **Easy sounds** tend to be very distinctive (e.g., helicopter, clock tick, siren)
- **Hard sounds** often share acoustic properties:
  - Rain vs. water sounds (both are broadband noise)
  - Insects vs. crickets (similar continuous high-frequency sounds)
  - Engine vs. train (both are low-frequency rumbling)
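
As a small worked example of how these numbers are derived, the sketch below computes per-class accuracy and the most frequent confusions from a handful of made-up predictions. The labels and predictions are invented for illustration and are not results from this notebook.

```python
from collections import Counter

# Made-up ground truth and predictions for three classes
y_true_toy = ["rain"] * 4 + ["pouring_water"] * 4 + ["siren"] * 4
y_pred_toy = ["rain", "rain", "pouring_water", "pouring_water",
              "pouring_water", "pouring_water", "pouring_water", "rain",
              "siren", "siren", "siren", "siren"]

# Count every (true, predicted) pair: these are the confusion matrix entries
pair_counts = Counter(zip(y_true_toy, y_pred_toy))
class_totals = Counter(y_true_toy)

# Per-class accuracy: the diagonal entry divided by the row total
per_class_accuracy = {c: pair_counts[(c, c)] / n for c, n in class_totals.items()}

# Off-diagonal entries, largest first
confusions = sorted(
    ((t, p, n) for (t, p), n in pair_counts.items() if t != p),
    key=lambda item: item[2], reverse=True,
)

print(per_class_accuracy)
for true_cat, pred_cat, n in confusions:
    print(f"{true_cat} mistaken for {pred_cat}: {n} times")
```

Note that confusions are not symmetric: in this toy data, rain is mistaken for pouring water more often than the reverse.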

**Key takeaway:** The features we extracted (MFCCs, spectral features) capture a lot of useful information, but they're just summaries. A neural network looking at the **full spectrogram** can catch patterns that our handcrafted features miss.

Let's see how a pretrained deep learning model compares...

---

## Part 6: CNN on Spectrograms — The "Audio as Image" Paradigm

The single most important insight in modern audio classification:

> **A mel spectrogram is a 2D image.** Any image classification model can classify it.

This means we can take a neural network that was trained on millions of **photographs** (ImageNet) and use it to classify **sounds**! This is called **transfer learning** — the features the network learned for recognizing objects in photos (edges, textures, repeating patterns) turn out to be useful for recognizing patterns in spectrograms too.

### The Plan

1. Convert each ESC-50 audio clip to a mel spectrogram "image"
2. Load a **ResNet-18** pretrained on ImageNet (1.2 million photos, 1,000 object classes)
3. Replace its final layer to output 50 ESC-50 classes instead of 1,000 ImageNet classes
4. **Fine-tune** it on our spectrogram images

### What is ResNet?

**ResNet** (Residual Network) is one of the most influential neural network architectures in deep learning. It uses "skip connections" that let information flow more easily through the network, which solved the problem of training very deep networks.

| Component | Purpose |
|-----------|---------|
| Conv layers 1–17 | Extract visual features (edges → textures → patterns) |
| Final FC layer | Classify based on extracted features (we replace this!) |
| Skip connections | Help gradients flow during training |
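
The skip connection idea can be sketched in a few lines of NumPy. This is a deliberately simplified block, assuming dense layers instead of convolutions and leaving out batch normalization and the ReLU that the real ResNet applies after the addition. `residual_block` is an illustrative name.

```python
import numpy as np


def residual_block(x, w1, w2):
    """Simplified residual block: output = x + F(x), with F = linear -> ReLU -> linear."""
    hidden = np.maximum(0.0, x @ w1)   # ReLU activation
    return x + hidden @ w2             # the "+ x" is the skip connection


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))        # batch of 4 vectors with 8 features each

zero_w = np.zeros((8, 8))
small_w1 = 0.01 * rng.standard_normal((8, 8))
small_w2 = 0.01 * rng.standard_normal((8, 8))

identity_out = residual_block(x, zero_w, zero_w)
perturbed_out = residual_block(x, small_w1, small_w2)

print("Zero weights give back the input:", np.allclose(identity_out, x))
print("Largest change with small weights:", np.abs(perturbed_out - x).max())
```

Because the block only has to learn a correction on top of its input, an added layer can start out close to "do nothing". That is one intuition for why very deep stacks of such blocks remain trainable.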

### Training Time

| Device | Approximate Time |
|--------|-----------------|
| CPU (Intel) | ~20-40 minutes |
| Apple Silicon (MPS) | ~5-15 minutes |
| GPU (CUDA) | ~2-5 minutes |

We'll freeze the early layers (which already know how to extract useful features) and only train the later layers. This speeds up training significantly!
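
The freezing rule is just a filter on parameter names. The sketch below applies it to a hand-picked subset of names in the style torchvision uses for ResNet-18, without loading the model. `is_trainable` is an illustrative helper.

```python
def is_trainable(param_name):
    """Keep only the deepest conv block (layer4) and the classification head (fc) trainable."""
    return "layer4" in param_name or "fc" in param_name


# A small, hand-picked subset of parameter names for illustration
example_names = [
    "conv1.weight", "bn1.weight",
    "layer1.0.conv1.weight", "layer2.0.conv1.weight",
    "layer3.1.bn2.bias",
    "layer4.0.conv1.weight", "layer4.1.bn2.bias",
    "fc.weight", "fc.bias",
]

trainable = [name for name in example_names if is_trainable(name)]
frozen = [name for name in example_names if not is_trainable(name)]

print("Trainable:", trainable)
print("Frozen:   ", frozen)
```

In PyTorch the same decision is applied by setting `requires_grad = False` on each frozen parameter, so the optimizer leaves those weights unchanged.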

In [None]:
# Step 1: Create a PyTorch Dataset that converts audio clips to spectrogram "images"

from torchvision import models
from torchvision import transforms as tv_transforms

class SpectrogramDataset(torch.utils.data.Dataset):
    """Convert ESC-50 audio clips into mel spectrogram images for a CNN."""

    def __init__(self, indices, esc50_dataset):
        self.indices = indices
        self.esc50 = esc50_dataset
        self.resize = tv_transforms.Resize((224, 224), antialias=True)
        # ImageNet normalization — ResNet was trained with these statistics
        self.normalize = tv_transforms.Normalize(
            mean=[0.485, 0.456, 0.406],
            std=[0.229, 0.224, 0.225]
        )

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, idx):
        sample = self.esc50[self.indices[idx]]
        audio = sample['audio']['array'].astype(np.float32)
        sr = sample['audio']['sampling_rate']

        # Create mel spectrogram (same transform as our visualizations above!)
        S = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=128)
        S_db = librosa.power_to_db(S, ref=np.max)

        # Normalize to 0-1 range
        S_db = (S_db - S_db.min()) / (S_db.max() - S_db.min() + 1e-8)

        # ResNet expects 3-channel (RGB) input — repeat the spectrogram 3 times
        S_3ch = np.stack([S_db, S_db, S_db], axis=0)
        tensor = torch.tensor(S_3ch, dtype=torch.float32)

        # Resize to 224x224 (ResNet's expected input size) and normalize
        tensor = self.resize(tensor)
        tensor = self.normalize(tensor)

        return tensor, sample['target']


# Split by fold (same split as our classical ML evaluation)
fold_ids = list(esc50['fold'])  # read only the fold column; avoids decoding every audio clip
train_indices_cnn = [i for i, fold in enumerate(fold_ids) if fold != 5]
test_indices_cnn = [i for i, fold in enumerate(fold_ids) if fold == 5]

train_spec_dataset = SpectrogramDataset(train_indices_cnn, esc50)
test_spec_dataset = SpectrogramDataset(test_indices_cnn, esc50)

train_spec_loader = torch.utils.data.DataLoader(train_spec_dataset, batch_size=32, shuffle=True)
test_spec_loader = torch.utils.data.DataLoader(test_spec_dataset, batch_size=32, shuffle=False)

# Peek at a batch
batch_x, batch_y = next(iter(train_spec_loader))
print(f"Batch shape: {batch_x.shape}  (batch_size, channels, height, width)")
print(f"Labels shape: {batch_y.shape}")
print(f"Training batches per epoch: {len(train_spec_loader)}")
print(f"Test batches: {len(test_spec_loader)}")

# Visualize what the CNN will "see"
fig, axes = plt.subplots(1, 4, figsize=(14, 3))
for i in range(4):
    img = batch_x[i][0].numpy()
    axes[i].imshow(img, aspect='auto', origin='lower', cmap='viridis')
    cat_name = unique_categories[batch_y[i].item()]
    axes[i].set_title(cat_name, fontsize=9)
    axes[i].axis('off')
plt.suptitle('Mel Spectrograms as 224x224 Images (what ResNet sees)', fontsize=12)
plt.tight_layout()
plt.show()

In [None]:
# Step 2: Load pretrained ResNet-18 and fine-tune on ESC-50 spectrograms

# Use the best available device
device = torch.device(
    'cuda' if torch.cuda.is_available() else
    'mps' if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available() else
    'cpu'
)
print(f"Training on: {device}")
if str(device) == 'cpu':
    print("(CPU training takes ~20-40 min. Apple Silicon MPS or CUDA would be faster.)\n")

# Load ResNet-18 pretrained on ImageNet (1.2 million photographs!)
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Replace the final layer: ImageNet has 1000 classes, ESC-50 has 50
resnet.fc = torch.nn.Linear(resnet.fc.in_features, 50)

# Freeze early layers — they already extract useful visual features.
# We only fine-tune layer4 (the deepest conv block) + our new classification head.
for name, param in resnet.named_parameters():
    if 'layer4' not in name and 'fc' not in name:
        param.requires_grad = False

trainable = sum(p.numel() for p in resnet.parameters() if p.requires_grad)
total = sum(p.numel() for p in resnet.parameters())
print(f"Total parameters:     {total:,}")
print(f"Trainable parameters: {trainable:,} ({trainable/total:.0%})")
print(f"Frozen parameters:    {total - trainable:,} ({(total - trainable)/total:.0%})")

resnet = resnet.to(device)

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, resnet.parameters()), lr=1e-3
)

# Train!
EPOCHS = 10
train_losses = []
test_accuracies = []

print(f"\nTraining for {EPOCHS} epochs...")
print("=" * 65)

for epoch in range(EPOCHS):
    resnet.train()
    total_loss = 0
    correct = 0
    total = 0

    for batch_x, batch_y in tqdm(train_spec_loader, desc=f"Epoch {epoch+1}/{EPOCHS}"):
        batch_x, batch_y = batch_x.to(device), batch_y.to(device)

        optimizer.zero_grad()
        output = resnet(batch_x)
        loss = criterion(output, batch_y)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        _, predicted = output.max(1)
        total += batch_y.size(0)
        correct += predicted.eq(batch_y).sum().item()

    train_acc = correct / total
    avg_loss = total_loss / len(train_spec_loader)
    train_losses.append(avg_loss)

    # Evaluate on test fold
    resnet.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for batch_x, batch_y in test_spec_loader:
            batch_x, batch_y = batch_x.to(device), batch_y.to(device)
            output = resnet(batch_x)
            _, predicted = output.max(1)
            total += batch_y.size(0)
            correct += predicted.eq(batch_y).sum().item()

    test_acc = correct / total
    test_accuracies.append(test_acc)

    print(f"  Loss: {avg_loss:.4f} | Train: {train_acc:.1%} | Test: {test_acc:.1%}")

resnet_accuracy = max(test_accuracies)
print("=" * 65)
print(f"Best test accuracy: {resnet_accuracy:.1%} (epoch {test_accuracies.index(resnet_accuracy)+1})")

# Plot training curves
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].plot(range(1, EPOCHS + 1), train_losses, marker='o')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].set_title('Training Loss')
axes[0].grid(True, alpha=0.3)

axes[1].plot(range(1, EPOCHS + 1), test_accuracies, marker='o', color='green')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Test Accuracy (Fold 5)')
axes[1].grid(True, alpha=0.3)
axes[1].set_ylim(0, 1)

plt.tight_layout()
plt.show()

print(f"\nClassical ML ({best_name}): {accuracy_score(y_test, y_pred):.1%}")
print(f"ResNet-18 (fine-tuned):    {resnet_accuracy:.1%}")
print("(Pretrained AST will be evaluated next!)")

### What Did Transfer Learning Actually Transfer?

When we used a ResNet pretrained on ImageNet (photographs of dogs, cars, furniture...), it already knew how to detect visual features that turn out to be useful for spectrograms too:

| ImageNet Feature | Spectrogram Equivalent |
|-----------------|----------------------|
| Edges, lines | Harmonic frequency bands |
| Textures | Noise patterns (rain, static) |
| Repeating patterns | Rhythmic sounds (clock tick, engine) |
| Shape boundaries | Onset/offset of sound events |

The **early layers** of the network (which we froze) detect these low-level features. The **later layers** (which we fine-tuned) learn to combine them into sound-specific patterns.

**This is the core insight of transfer learning:** features learned for one task (image recognition) can be useful for a completely different task (sound classification) when the input representations share structural similarities — both spectrograms and photos are 2D patterns with local structure!
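
To put numbers on "freeze the early layers, fine-tune `layer4` and the head", the arithmetic below counts parameters per stage. It is a sketch that assumes the standard ResNet-18 layout (a 7×7 stem, four stages of two basic blocks with 64/128/256/512 channels, bias-free convolutions, and batch norm with one weight and one bias per channel). The helper names are made up for illustration, and nothing here touches the actual `resnet` object.

```python
def conv_params(c_in, c_out, kernel):
    """Weights in a bias-free 2D convolution."""
    return c_in * c_out * kernel * kernel


def bn_params(channels):
    """Batch norm has one learnable weight and one bias per channel."""
    return 2 * channels


def basic_block(c_in, c_out, downsample):
    """Two 3x3 convs (each followed by batch norm), plus an optional 1x1 shortcut."""
    n = conv_params(c_in, c_out, 3) + bn_params(c_out)
    n += conv_params(c_out, c_out, 3) + bn_params(c_out)
    if downsample:
        n += conv_params(c_in, c_out, 1) + bn_params(c_out)
    return n


def stage(c_in, c_out, downsample):
    """A ResNet-18 stage is two basic blocks; only the first may downsample."""
    return basic_block(c_in, c_out, downsample) + basic_block(c_out, c_out, False)


stem = conv_params(3, 64, 7) + bn_params(64)
stages = {
    "layer1": stage(64, 64, False),
    "layer2": stage(64, 128, True),
    "layer3": stage(128, 256, True),
    "layer4": stage(256, 512, True),
}
head = 512 * 50 + 50  # the new 50-class linear layer (weights + biases)

total_params = stem + sum(stages.values()) + head
trainable_params = stages["layer4"] + head

for name, count in stages.items():
    print(f"{name}: {count:>10,}")
print(f"head:   {head:>10,}")
print(f"trainable share: {trainable_params / total_params:.0%}")
```

Under these assumptions the deepest stage holds roughly three quarters of the network's weights, so freezing the early layers still leaves most parameters trainable. What stays fixed is the part that extracts generic low-level features.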

Now let's see how a model specifically pretrained on **audio data** compares...

---

## Part 7: Pretrained Deep Learning Model

Our ResNet got a nice boost over classical ML, but it was pretrained on *images* — not audio. What if we use a model pretrained on **millions of audio clips**?

Instead of training from scratch, we can download a model that someone has **already trained** on ESC-50. Same idea as using Whisper for speech recognition in Class 4!

### The Audio Spectrogram Transformer (AST)

The model we'll use is the **Audio Spectrogram Transformer (AST)**, developed by MIT researchers.

```
Audio Waveform → Mel Spectrogram → Split into Patches → Transformer Encoder → Classification
```

1. **Mel Spectrogram**: Convert audio to a 2D image of frequency vs. time (just like we did above!)
2. **Patch Embedding**: Split the spectrogram into small patches (like puzzle pieces)
3. **Transformer Encoder**: Process all patches with self-attention (the same architecture behind ChatGPT and Whisper!)
4. **Classification Head**: Output probabilities for each of the 50 ESC-50 categories
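
To get a feel for step 2, the sketch below counts how many patches a spectrogram is cut into. The defaults (128 mel bins, 1024 time frames, 16×16 patches, a stride of 10 so that neighbouring patches overlap) are assumptions based on the commonly used AudioSet AST configuration; the ESC-50 model loaded below may be set up differently, and `count_patches` is a hypothetical helper, not part of any library.

```python
def count_patches(n_mels=128, n_frames=1024, patch=16, stride=10):
    """Return (patches along frequency, patches along time, total patches)."""
    freq_patches = (n_mels - patch) // stride + 1
    time_patches = (n_frames - patch) // stride + 1
    return freq_patches, time_patches, freq_patches * time_patches


freq_p, time_p, total_p = count_patches()
print(f"Overlapping patches:     {freq_p} x {time_p} = {total_p}")

# For comparison: non-overlapping patches (stride equal to the patch size)
freq_n, time_n, total_n = count_patches(stride=16)
print(f"Non-overlapping patches: {freq_n} x {time_n} = {total_n}")
```

Each patch becomes one token for the Transformer, so the overlap more than doubles the sequence length compared with non-overlapping patches. That is one reason inference takes a second or more per clip on a CPU.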

### Why Use a Pretrained Model?

| Approach | Accuracy | Training Time | GPU Needed? |
|----------|----------|--------------|-------------|
| Classical ML (our features + Random Forest) | ~50-65% | Seconds | No |
| ResNet-18 on spectrograms (Part 6) | ~75-90% | 10-30 min | No (CPU ok) |
| Train a CNN from scratch | ~70-80% | Hours | Recommended |
| **Pretrained AST** | **~85-95%** | **None — already trained!** | **No (CPU is fine)** |

The AST was first trained on **AudioSet** (2 million audio clips, 527 categories) and then fine-tuned on ESC-50. It has already learned rich audio representations!

In [None]:
# Load the pretrained Audio Spectrogram Transformer (AST) fine-tuned on ESC-50
# First time: downloads ~350MB. After that: loads from cache.

print("Loading pretrained AST model fine-tuned on ESC-50...")
print("(First time: downloads ~350MB. After that: loads from cache)\n")

ast_classifier = pipeline(
    "audio-classification",
    model="bioamla/ast-esc50",
    device="cpu",
)

print("Model loaded successfully!")
print(f"Model type: Audio Spectrogram Transformer (AST)")
print(f"Pre-trained on: AudioSet (2M clips) then fine-tuned on ESC-50")
print(f"Number of classes: 50")

In [None]:
# Classify individual samples and see the model's predictions!
# The model returns the top predicted categories with confidence scores.

example_indices = [0, 50, 100, 200, 400, 600, 800, 1000, 1200, 1500]
temp_path = "_temp_ast_sample.wav"

print("Classifying examples with the pretrained AST:")
print("=" * 70)

for idx in example_indices:
    sample = esc50[idx]
    audio = sample['audio']['array'].astype(np.float32)
    sr = sample['audio']['sampling_rate']
    true_label = sample['category']

    sf.write(temp_path, audio, sr)
    predictions = ast_classifier(temp_path, top_k=3)
    top_pred = predictions[0]['label']
    top_score = predictions[0]['score']

    correct = "correct" if top_pred == true_label else "WRONG"
    print(f"\n  True: {true_label:20s} | Pred: {top_pred:20s} ({top_score:.1%}) [{correct}]")
    for p in predictions[1:]:
        print(f"{'':26s}  also: {p['label']:20s} ({p['score']:.1%})")

if os.path.exists(temp_path):
    os.remove(temp_path)

In [None]:
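# Evaluate the pretrained AST on fold 5 (the same test fold used for classical ML)
#
# Set MAX_EVAL_SAMPLES to a smaller number for a quick test,
# or set it to None to evaluate on all 400 test samples (takes ~10-15 min on CPU).
# With a limit set, only the first N fold-5 clips are scored, so treat the result
# as a rough estimate rather than a like-for-like comparison with classical ML.

MAX_EVAL_SAMPLES = 100  # Set to None for full evaluation

# Read just the 'fold' column so filtering does not decode every audio clip
test_indices = [i for i, fold in enumerate(esc50['fold']) if fold == 5]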
if MAX_EVAL_SAMPLES is not None:
    test_indices = test_indices[:MAX_EVAL_SAMPLES]

print(f"Evaluating pretrained AST on {len(test_indices)} test samples...")
print("(This takes ~1-3 seconds per sample on CPU)\n")

ast_predictions = []
ast_true_labels = []
temp_path = "_temp_ast_eval.wav"

for idx in tqdm(test_indices):
    sample = esc50[idx]
    audio = sample['audio']['array'].astype(np.float32)
    sr = sample['audio']['sampling_rate']

    sf.write(temp_path, audio, sr)
    predictions = ast_classifier(temp_path, top_k=1)
    ast_predictions.append(predictions[0]['label'])
    ast_true_labels.append(sample['category'])

if os.path.exists(temp_path):
    os.remove(temp_path)

ast_correct = sum(1 for p, t in zip(ast_predictions, ast_true_labels) if p == t)
ast_accuracy = ast_correct / len(ast_true_labels)

classical_accuracy = accuracy_score(y_test, y_pred)
diff_points = (ast_accuracy - classical_accuracy) * 100

print(f"\nPretrained AST Accuracy: {ast_accuracy:.1%} (on {len(ast_true_labels)} samples)")
print(f"Classical ML ({best_name}) Accuracy: {classical_accuracy:.1%}")
print(f"\nDifference (AST minus classical ML): {diff_points:+.1f} percentage points")

In [None]:
# Listen to some examples the AST got wrong (if any)

misclassified = [(p, t, i) for p, t, i in zip(ast_predictions, ast_true_labels, test_indices) if p != t]

if not misclassified:
    print("The AST classified every sample correctly! Try evaluating more samples.")
else:
    print(f"The AST got {len(misclassified)} samples wrong. Let's listen to some:\n")
    for pred, true, idx in misclassified[:5]:
        sample = esc50[idx]
        audio = sample['audio']['array']
        sr = sample['audio']['sampling_rate']
        print(f"True: {true:20s} | Predicted: {pred}")
        display(Audio(audio, rate=sr))
        print()

### Comparing the Approaches

| Approach | Accuracy | Inference Speed | Training Required? | GPU Needed? |
|----------|----------|----------------|-------------------|-------------|
| Random Forest + librosa features | ~50-65% | Instant | Yes (seconds) | No |
| SVM + librosa features | ~50-65% | Instant | Yes (seconds) | No |
| ResNet-18 on spectrograms | ~75-90% | ~0.5 sec/clip | Yes (10-30 min) | No (CPU ok) |
| Pretrained AST | ~85-95% | ~1-3 sec/clip | No! | No |
| Fine-tune AST yourself (Part 10) | ~95%+ | ~1-3 sec/clip | Yes (hours) | Yes |

**Key takeaways:**
- Each step up in complexity brings a jump in accuracy — this is the arc of the field!
- **Classical ML** is fast, interpretable, and teaches you the fundamentals
- **ResNet on spectrograms** shows the power of the "audio as image" insight and transfer learning from vision
- **Pretrained AST** is the most accurate with zero training — it was pretrained on 2 million audio clips
- For real-time applications, classical ML has the fastest inference; the AST has the best accuracy
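
The accuracy figures above are ranges, and the AST evaluation cell defaults to only 100 clips, so a measured score carries sampling uncertainty of its own. One way to gauge it is a Wilson score interval for a proportion. This is a sketch, not part of the notebook's pipeline; `wilson_interval` is a hypothetical helper, and the 90% accuracy used in the demo is made up.

```python
import math


def wilson_interval(correct, n, z=1.96):
    """Approximate 95% confidence interval for an accuracy of correct/n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Same made-up accuracy (90%) measured on a quick subset and on the full fold
for n in (100, 400):
    correct = round(0.9 * n)
    low, high = wilson_interval(correct, n)
    print(f"{correct}/{n} correct -> plausible accuracy {low:.1%} to {high:.1%}")
```

The interval narrows as the number of evaluated clips grows, which is a good reason to set `MAX_EVAL_SAMPLES = None` before drawing conclusions about which model is better.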

---

## Part 8: Real-Time Environmental Sound Classification

Now for the fun part — using the model to classify sounds from your microphone in real-time!

### Sounds You Can Make in the Classroom

Try these sounds that are in the ESC-50 dataset:
- **Clap your hands** → "clapping"
- **Type on your keyboard** → "keyboard_typing"
- **Click your mouse** → "mouse_click"
- **Knock on your desk** → "door_knock"
- **Cough** → "coughing"
- **Snap your fingers** → see what it classifies as!

### Two Approaches

| Approach | How it works | Best for |
|----------|-------------|----------|
| **Enter-to-Record** | Press Enter, make a sound, get prediction | Simple and reliable |
| **Continuous Listening** | Auto-classifies every 5 seconds | Hands-free, more interactive |
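
Continuous listening yields one prediction per window, and a single noisy window can flip the label back and forth. One possible way to steady the output is a majority vote over the last few windows. The class below is a hypothetical helper (not used elsewhere in this notebook) that only reports a label once it has appeared often enough recently.

```python
from collections import Counter, deque


class LabelSmoother:
    """Majority vote over the most recent window predictions."""

    def __init__(self, window=3, min_votes=2):
        self.history = deque(maxlen=window)  # oldest labels drop off automatically
        self.min_votes = min_votes

    def update(self, label):
        """Add the newest label; return the stable label, or None if undecided."""
        self.history.append(label)
        top_label, votes = Counter(self.history).most_common(1)[0]
        return top_label if votes >= self.min_votes else None


smoother = LabelSmoother(window=3, min_votes=2)
stream_of_labels = ["clapping", "door_knock", "clapping", "clapping", "rain"]
smoothed = [smoother.update(label) for label in stream_of_labels]
print(smoothed)
```

The trade-off is latency: with 5-second windows, a vote over three windows can take 10 to 15 seconds to react to a new sound, so shorter windows or a smaller `window` may suit interactive pieces better.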

In [None]:
# =============================================================================
# ENTER-TO-RECORD: Record a sound and classify it!
# =============================================================================

SAMPLE_RATE = 44100
DURATION = 5  # ESC-50 clips are 5 seconds
TEMP_FILE = "_temp_realtime.wav"

print("=" * 50)
print("ENTER-TO-RECORD SOUND CLASSIFICATION")
print("=" * 50)
print(f"Instructions:")
print(f"  1. Press Enter when ready to record")
print(f"  2. Make a sound! (clap, knock, type, cough, etc.)")
print(f"  3. Recording lasts {DURATION} seconds")
print(f"  4. See the top 5 predictions!")
print(f"  5. Type 'q' to quit")
print("=" * 50)

try:
    while True:
        user_input = input("\nPress Enter to record (or 'q' to quit): ")
        if user_input.lower() == 'q':
            break

        print(f"Recording for {DURATION} seconds... Make a sound!")
        recording = sd.rec(
            int(DURATION * SAMPLE_RATE),
            samplerate=SAMPLE_RATE,
            channels=1,
            dtype='float32'
        )
        sd.wait()
        print("Processing...\n")

        sf.write(TEMP_FILE, recording, SAMPLE_RATE)

        predictions = ast_classifier(TEMP_FILE, top_k=5)

        print("  Top 5 Predictions:")
        print("  " + "-" * 45)
        for i, pred in enumerate(predictions):
            bar = "*" * int(pred['score'] * 30)
            print(f"  {i+1}. {pred['label']:25s} {pred['score']:6.1%} {bar}")

        display(Audio(TEMP_FILE))

except KeyboardInterrupt:
    pass
finally:
    if os.path.exists(TEMP_FILE):
        os.remove(TEMP_FILE)
    print("\nDone!")

In [None]:
# =============================================================================
# CONTINUOUS LISTENING: Classify sounds automatically every few seconds
# =============================================================================

from IPython.display import clear_output

SAMPLE_RATE = 44100
CHUNK_DURATION = 5  # Match ESC-50 clip length
TEMP_FILE = "_temp_continuous.wav"

chunk_samples = int(CHUNK_DURATION * SAMPLE_RATE)
audio_buffer = []
is_collecting = True

def audio_callback(indata, frames, time_info, status):
    """Collect audio into a buffer."""
    global audio_buffer
    if is_collecting:
        audio_buffer.extend(indata[:, 0].tolist())

print("=" * 50)
print("CONTINUOUS SOUND CLASSIFICATION")
print("=" * 50)
print(f"Classifying every {CHUNK_DURATION} seconds...")
print("Make different sounds and watch the predictions change!")
print("\nPress the STOP button or Kernel > Interrupt to stop")
print("=" * 50)

stream = None
try:
    stream = sd.InputStream(
        samplerate=SAMPLE_RATE,
        channels=1,
        callback=audio_callback,
        blocksize=1024,
    )
    stream.start()

    while True:
        sd.sleep(int(CHUNK_DURATION * 1000))

        if len(audio_buffer) >= chunk_samples:
            chunk = np.array(audio_buffer[-chunk_samples:], dtype=np.float32)
            audio_buffer = []

            sf.write(TEMP_FILE, chunk, SAMPLE_RATE)
            predictions = ast_classifier(TEMP_FILE, top_k=5)

            clear_output(wait=True)
            print("=" * 50)
            print("CONTINUOUS SOUND CLASSIFICATION")
            print("=" * 50)
            rms = np.sqrt(np.mean(chunk**2))
            print(f"Volume: {'*' * int(rms * 100)}")
            print(f"\nTop 5 Predictions:")
            for i, pred in enumerate(predictions):
                bar = "*" * int(pred['score'] * 30)
                print(f"  {i+1}. {pred['label']:25s} {pred['score']:6.1%} {bar}")
            print(f"\nListening... (make a sound!)")
            print("Press STOP or Kernel > Interrupt to stop")

except KeyboardInterrupt:
    pass
finally:
    is_collecting = False
    if stream is not None:
        stream.stop()
        stream.close()
    audio_buffer = []
    if os.path.exists(TEMP_FILE):
        os.remove(TEMP_FILE)
    print("\nStopped!")

---

## Part 9: Triggering Actions with Sound Classification

Just like in previous classes, you can use the model's predictions to **trigger actions** in other programs!

The pattern is always the same:
1. **Classify** a sound from the microphone
2. **Check** what was detected
3. **Send** a message to another program (OSC to p5.js, Serial to Arduino, etc.)
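
In practice, step 2 usually needs two guards: ignore low-confidence predictions, and avoid firing the same action over and over while a sound continues. The sketch below shows one way to do that with a confidence threshold and a per-label cooldown. `TriggerGate` and its parameters are hypothetical names chosen for this example, and the timestamps are passed in by hand so the behaviour is easy to follow.

```python
import time


class TriggerGate:
    """Decide whether a classified sound should trigger an action."""

    def __init__(self, min_score=0.3, cooldown_s=10.0):
        self.min_score = min_score
        self.cooldown_s = cooldown_s
        self._last_fired = {}  # label -> time of the last trigger

    def should_fire(self, label, score, now=None):
        now = time.monotonic() if now is None else now
        if score < self.min_score:
            return False  # not confident enough
        last = self._last_fired.get(label)
        if last is not None and now - last < self.cooldown_s:
            return False  # same label fired too recently
        self._last_fired[label] = now
        return True


gate = TriggerGate(min_score=0.3, cooldown_s=10.0)
events = [
    (0.0, "clapping", 0.85),    # fires
    (5.0, "clapping", 0.90),    # suppressed: still cooling down
    (5.0, "door_knock", 0.20),  # suppressed: low confidence
    (12.0, "clapping", 0.80),   # fires again after the cooldown
]
fired = [gate.should_fire(label, score, now=t) for t, label, score in events]
print(fired)
```

In a live loop you would call `gate.should_fire(label, score)` without `now`, and only send the OSC or serial message when it returns `True`.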

### Project Ideas

| Sound | Action |
|-------|--------|
| Clapping | Toggle a light, change scene in p5.js |
| Dog barking | Send an alert, play a calming sound |
| Keyboard typing | Visualize typing rhythm |
| Door knock | Trigger a door animation |
| Siren | Flash red warning lights |
| Rain | Start a rain visualization |
| Church bells | Change background music |
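
A table like this maps naturally onto a dictionary, which can be easier to extend than a long `if`/`elif` chain. The sketch below routes a label to an OSC address and a serial byte. The routing entries reuse the addresses and bytes from this notebook's demos, while `ROUTES`, `DEFAULT_ROUTE`, `route`, and the `/sound/other` address are assumptions made up for illustration. Nothing is actually sent.

```python
# Hypothetical routing table: label -> (OSC address, serial byte)
ROUTES = {
    "clapping": ("/sound/clapping", b"1"),
    "door_knock": ("/sound/knock", b"2"),
    "siren": ("/sound/alert", b"3"),
    "car_horn": ("/sound/alert", b"3"),
}
DEFAULT_ROUTE = ("/sound/other", b"0")  # fallback for labels without an entry


def route(label):
    """Look up where a label should be sent, falling back to the default."""
    return ROUTES.get(label, DEFAULT_ROUTE)


def describe(predictions):
    """Summarize what would be sent for the top prediction."""
    top = predictions[0]
    address, serial_byte = route(top["label"])
    return f"{top['label']} ({top['score']:.0%}) -> OSC {address}, serial {serial_byte!r}"


print(describe([{"label": "clapping", "score": 0.91}]))
print(describe([{"label": "rain", "score": 0.78}]))
```

Adding a new sound then means adding one line to `ROUTES` instead of another branch.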

In [None]:
# =============================================================================
# SOUND CLASSIFICATION -> ACTION
# =============================================================================
# Customize the on_sound() function to trigger whatever you want!

def on_sound(predictions):
    """
    Called whenever a sound is classified.

    predictions: list of dicts with 'label' and 'score' keys
    Example: [{'label': 'clapping', 'score': 0.85}, ...]

    CUSTOMIZE THIS FUNCTION for your project!
    """
    top_label = predictions[0]['label']
    top_score = predictions[0]['score']

    if top_score < 0.3:
        print("  (Low confidence — ignoring)")
        return

    print(f"  Detected: {top_label} ({top_score:.0%})")

    # === ADD YOUR CUSTOM ACTIONS HERE ===

    if top_label == 'clapping':
        print("  -> Clapping! (toggle something)")
        # osc_client.send_message("/sound/clapping", 1)
        # arduino.write(b'1')

    elif top_label == 'door_knock':
        print("  -> Knock knock!")
        # osc_client.send_message("/sound/knock", 1)

    elif top_label == 'keyboard_typing':
        print("  -> Typing detected!")
        # osc_client.send_message("/sound/typing", 1)

    elif top_label in ['dog', 'cat', 'rooster', 'crow']:
        print(f"  -> Animal sound: {top_label}")
        # osc_client.send_message("/sound/animal", top_label)

    elif top_label in ['siren', 'car_horn', 'helicopter']:
        print(f"  -> Urban alert: {top_label}")
        # osc_client.send_message("/sound/alert", top_label)


# Demo with enter-to-record
SAMPLE_RATE = 44100
DURATION = 5
TEMP_FILE = "_temp_action.wav"

print("=" * 50)
print("SOUND -> ACTION DEMO")
print("=" * 50)
print("Make a sound and watch the action trigger!")
print("Type 'q' to quit\n")

try:
    while True:
        user_input = input("Press Enter to record: ")
        if user_input.lower() == 'q':
            break

        print(f"Recording {DURATION}s... make a sound!")
        recording = sd.rec(int(DURATION * SAMPLE_RATE),
                          samplerate=SAMPLE_RATE, channels=1, dtype='float32')
        sd.wait()

        sf.write(TEMP_FILE, recording, SAMPLE_RATE)
        predictions = ast_classifier(TEMP_FILE, top_k=5)

        on_sound(predictions)

except KeyboardInterrupt:
    pass
finally:
    if os.path.exists(TEMP_FILE):
        os.remove(TEMP_FILE)
    print("\nDone!")

In [None]:
# =============================================================================
# SOUND CLASSIFICATION -> OSC (for p5.js or any OSC receiver)
# =============================================================================
# Install: uv pip install python-osc
#
# Uncomment the OSC lines when ready to use!

# from pythonosc import udp_client
# OSC_IP = "127.0.0.1"   # localhost
# OSC_PORT = 12000        # match this to your p5.js sketch
# osc_client = udp_client.SimpleUDPClient(OSC_IP, OSC_PORT)

def send_osc_from_sound(predictions):
    """Send OSC messages based on sound classification."""
    top_label = predictions[0]['label']
    top_score = predictions[0]['score']

    # Send the label and confidence
    # osc_client.send_message("/sound/label", top_label)
    # osc_client.send_message("/sound/confidence", top_score)
    print(f"  [OSC] /sound/label -> '{top_label}'")
    print(f"  [OSC] /sound/confidence -> {top_score:.2f}")

    # Map major categories to colors for a p5.js visualization
    category_colors = {
        'Animals': [255, 165, 0],     # Orange
        'Natural': [0, 150, 0],       # Green
        'Human': [255, 0, 100],       # Pink
        'Domestic': [100, 100, 255],  # Blue
        'Urban': [255, 0, 0],         # Red
    }

    major_cat = get_major_category(top_label)
    if major_cat in category_colors:
        # osc_client.send_message("/sound/color", category_colors[major_cat])
        print(f"  [OSC] /sound/color -> {category_colors[major_cat]} ({major_cat})")

# Demo
print("OSC Sound Classification Demo:")
print("=" * 50)
print("(Uncomment OSC lines above to actually send messages)\n")

demo_preds = [
    [{'label': 'dog', 'score': 0.92}],
    [{'label': 'rain', 'score': 0.78}],
    [{'label': 'clapping', 'score': 0.95}],
]
for preds in demo_preds:
    print(f"Input: {preds[0]['label']}")
    send_osc_from_sound(preds)
    print()

In [None]:
# =============================================================================
# SOUND CLASSIFICATION -> SERIAL (for Arduino)
# =============================================================================
# Detect specific sounds and send commands to Arduino!
#
# Uncomment the serial lines when you have Arduino connected.

# import serial
# SERIAL_PORT = '/dev/cu.usbmodem...'  # Mac: ls /dev/cu.* | grep usb
# BAUD_RATE = 9600
# arduino = serial.Serial(SERIAL_PORT, BAUD_RATE)
# import time; time.sleep(2)  # Wait for Arduino reset

def send_serial_from_sound(predictions):
    """Send serial commands based on sound classification."""
    top_label = predictions[0]['label']
    top_score = predictions[0]['score']

    if top_score < 0.4:
        return

    if top_label == 'clapping':
        # arduino.write(b'1')
        print(f"  [Serial] -> '1' (LED ON — clapping detected)")
    elif top_label == 'door_knock':
        # arduino.write(b'2')
        print(f"  [Serial] -> '2' (BUZZ — knock detected)")
    elif top_label in ['siren', 'car_horn']:
        # arduino.write(b'3')
        print(f"  [Serial] -> '3' (FLASH — alert detected)")
    else:
        # arduino.write(b'0')
        print(f"  [Serial] -> '0' (OFF — {top_label} not a trigger)")

# Demo
print("Serial Sound Classification Demo:")
print("=" * 50)
print("(Uncomment serial lines above when Arduino is connected)\n")

demo_preds = [
    [{'label': 'clapping', 'score': 0.91}],
    [{'label': 'door_knock', 'score': 0.75}],
    [{'label': 'keyboard_typing', 'score': 0.60}],
]
for preds in demo_preds:
    print(f"Sound: {preds[0]['label']} ({preds[0]['score']:.0%})")
    send_serial_from_sound(preds)
    print()

print("Combine with the real-time recording code from Part 8!")

---

## Part 10: Fine-Tuning (Stretch Goal — GPU Recommended)

Everything above runs on your laptop's CPU. But if you have access to a GPU (Google Colab has free ones!), you can **fine-tune** the AST model on your own data.

### Why Fine-Tune?

| Use Case | Example |
|----------|---------|
| **Custom sounds** | Classify sounds not in ESC-50 (your specific door, your cat) |
| **Better accuracy** | Focus the model on a subset of classes you care about |
| **New domains** | Underwater sounds, industrial machinery, musical instruments |

### Running on Google Colab

1. Go to [colab.research.google.com](https://colab.research.google.com)
2. Create a new notebook
3. Go to **Runtime → Change runtime type → GPU (T4)**
4. Copy the code from the cell below and run it!

**Detailed guide**: [Fine-Tune AST with Transformers](https://towardsdatascience.com/fine-tune-the-audio-spectrogram-transformer-with-transformers-73333c9ef717/)
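
When you fine-tune, how you split the data matters. ESC-50 ships with five predefined folds, arranged so that clips cut from the same source recording stay in the same fold; evaluating by fold keeps near-duplicates out of the test set. The sketch below shows leave-one-fold-out index splitting on a toy fold list. `leave_one_fold_out` is a hypothetical helper, and `toy_folds` stands in for the dataset's `fold` column.

```python
def leave_one_fold_out(folds, test_fold):
    """Return (train indices, test indices) holding out one fold for testing."""
    train_idx = [i for i, f in enumerate(folds) if f != test_fold]
    test_idx = [i for i, f in enumerate(folds) if f == test_fold]
    return train_idx, test_idx


# Toy stand-in for the 'fold' column: 10 clips, 2 per fold
toy_folds = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]

for test_fold in sorted(set(toy_folds)):
    train_idx, test_idx = leave_one_fold_out(toy_folds, test_fold)
    # No clip may appear on both sides of the split
    assert not set(train_idx) & set(test_idx)
    print(f"fold {test_fold} held out: {len(train_idx)} train, {len(test_idx)} test")
```

Holding out fold 5 matches the rest of this notebook; averaging over all five held-out folds gives the cross-validated score that published ESC-50 results usually report.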

In [None]:
# =============================================================================
# FINE-TUNING AST ON ESC-50 (GPU / GOOGLE COLAB RECOMMENDED)
# =============================================================================
#
# To run this:
#   1. Go to colab.research.google.com
#   2. Runtime -> Change runtime type -> GPU (T4)
#   3. Install: !pip install transformers datasets torchaudio evaluate
#   4. Uncomment the code below and run!

# --- Uncomment everything below to run on Google Colab ---

# from transformers import ASTForAudioClassification, ASTFeatureExtractor
# from transformers import TrainingArguments, Trainer
# from datasets import load_dataset, Audio
# import torch

# # Load model and feature extractor
# model_name = "MIT/ast-finetuned-audioset-10-10-0.4593"
# feature_extractor = ASTFeatureExtractor.from_pretrained(model_name)
# model = ASTForAudioClassification.from_pretrained(
#     model_name,
#     num_labels=50,
#     ignore_mismatched_sizes=True,
# )

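# # Load ESC-50 and split by its predefined folds (fold 5 held out, as in the
# # rest of this notebook). A random split can put clips from the same source
# # recording on both sides and inflate the test score.
# from datasets import DatasetDict
# dataset = load_dataset("ashraq/esc50")['train']
# dataset = DatasetDict({
#     'train': dataset.filter(lambda fold: fold != 5, input_columns=['fold']),
#     'test': dataset.filter(lambda fold: fold == 5, input_columns=['fold']),
# })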

# # Preprocess audio
# def preprocess(batch):
#     audio = batch['audio']['array']
#     inputs = feature_extractor(
#         audio,
#         sampling_rate=batch['audio']['sampling_rate'],
#         return_tensors='pt',
#     )
#     batch['input_values'] = inputs['input_values'].squeeze()
#     batch['label'] = batch['target']
#     return batch

# dataset = dataset.map(preprocess, remove_columns=['audio', 'filename', 'fold',
#                                                    'target', 'category', 'esc10',
#                                                    'src_file', 'take'])

# # Training arguments
# training_args = TrainingArguments(
#     output_dir="./ast-finetuned-esc50",
#     num_train_epochs=5,
#     per_device_train_batch_size=8,
#     learning_rate=5e-5,
#     logging_steps=10,
#     save_steps=100,
#     eval_strategy="epoch",
# )

# # Train!
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=dataset['train'],
#     eval_dataset=dataset['test'],
# )
# # trainer.train()

print("=" * 60)
print("FINE-TUNING AST — STRETCH GOAL")
print("=" * 60)
print()
print("This code is designed to run on Google Colab with a GPU.")
print()
print("To try it:")
print("  1. Go to colab.research.google.com")
print("  2. Runtime -> Change runtime type -> GPU (T4)")
print("  3. !pip install transformers datasets torchaudio evaluate")
print("  4. Copy this cell's code, uncomment, and run!")
print()
print("Tutorial:")
print("  https://towardsdatascience.com/fine-tune-the-audio-spectrogram-transformer-with-transformers-73333c9ef717/")

---

## Exercises and Homework

In [None]:
# Exercise 1: Explore Misclassifications
#
# Listen to sounds that the model got WRONG and try to understand why.
# Do they sound similar to you? Would a human make the same mistake?
#
# Hint: Look at the confusion matrix to find the most confused pairs,
# then find specific samples of those categories and listen to them.

In [None]:
# Exercise 2: Sound Scavenger Hunt
#
# Go around your space and try to find real examples of sounds in the ESC-50 dataset.
# Record them with the enter-to-record cell and see how the model classifies them!
#
# Try to find at least one sound from each major category:
#   Animals:  _______________
#   Natural:  _______________
#   Human:    _______________
#   Domestic: _______________
#   Urban:    _______________
#
# Questions:
#   - Did the model classify them correctly?
#   - Were some categories easier to find real examples for?
#   - Did any real-world sounds confuse the model?

In [None]:
# Exercise 3: Build a Sound-Triggered Project
#
# Create a project that uses environmental sound classification
# to trigger an action in the real world!
#
# Ideas:
#   - Sound-reactive p5.js visualization (different visuals for different sounds)
#   - Arduino alarm that activates when it hears a specific sound
#   - "Sound diary" that logs what sounds happen in your room over time
#   - Sound-based game: make specific sounds to score points
#
# Use the on_sound() function from Part 9 as your starting point.
# Connect it to OSC, Serial, or any other output you want!

In [None]:
# Exercise 4: Compare the Models
#
# Record 10 different sounds from your environment.
# Classify each one with BOTH the classical ML model and the pretrained AST.
#
# Hint: For the classical ML model, extract features and use the trained
# Random Forest / SVM. For the AST, use the pipeline.
#
# Questions:
#   - Which model is more accurate on real-world sounds?
#   - Which model is faster?
#   - Do they make different kinds of mistakes?

In [None]:
# Exercise 5 (Stretch): ESC-10 Subset
#
# ESC-50 has a simpler subset called ESC-10 with just 10 categories.
# The dataset includes an 'esc10' column that tells you which samples are in this subset.
#
# 1. Filter the dataset to only ESC-10 samples
# 2. Train a new classical ML model on just these 10 classes
# 3. How does accuracy change with fewer, more distinct classes?
#
# Hint: df[df['esc10'] == True] gives you the ESC-10 samples

## Extra Credit

- **Fool the model**: Can you make a sound that the model confidently misclassifies? (Can you make a sound with your mouth that it thinks is a dog bark?)
- **Sound mixing**: What happens if two sounds play at the same time? Does the model pick up both?
- **Distance test**: How does distance from the microphone affect classification?
- **Compare AudioSet model**: Try loading `MIT/ast-finetuned-audioset-10-10-0.4593` (527 categories) and compare its predictions to the ESC-50 model.
- **Feature importance**: Which features matter most for Random Forest? Use `rf.feature_importances_` to find out!
- **Combine with Whisper**: Build a system that uses Whisper for speech AND the AST for environmental sounds — detect *what type* of sound is happening and handle it differently.
- **Data augmentation**: Apply the augmentation techniques from Class 2 (noise, pitch shift, time stretch) to ESC-50 and see if it improves classical ML or ResNet accuracy.
- **Try CLAP (zero-shot)**: [CLAP](https://huggingface.co/laion/larger_clap_music_and_speech) (Contrastive Language-Audio Pretraining) lets you classify audio using **text descriptions** instead of fixed labels. You write descriptions like "a dog barking loudly" and CLAP matches audio to text — no training needed! (`pip install laion-clap`)
- **Visualize attention maps**: The AST model uses attention to decide which parts of the spectrogram matter most. Try extracting and visualizing attention weights to see what the model "listens to."
- **Try UrbanSound8K**: [UrbanSound8K](https://urbansounddataset.weebly.com/) is another environmental sound dataset focused on urban sounds (10 classes, 8,732 clips). How do models trained on ESC-50 perform on it?
- **Unfreeze more ResNet layers**: The ResNet fine-tuning section already trains layer4 and the new head. Try unfreezing layer3 as well. Does it improve accuracy? How much slower is training?
- **Explore other models on HuggingFace**: Search for audio-classification models at [huggingface.co/models?pipeline_tag=audio-classification](https://huggingface.co/models?pipeline_tag=audio-classification)

---

## Summary

In this notebook you learned:

- **Environmental Sound Classification**: Identifying sounds like dog barks, rain, sirens, and keyboard clicks
- **ESC-50 Dataset**: 2,000 clips across 50 categories — the standard benchmark
- **Audio Feature Extraction**: MFCCs, chroma, spectral features — compact summaries of sound
- **Classical ML**: Random Forest and SVM classifiers using handcrafted features (~50-65%)
- **CNN on Spectrograms**: Fine-tuning ResNet-18 from ImageNet — the "audio as image" paradigm (~75-90%)
- **Pretrained Deep Learning**: Audio Spectrogram Transformer achieving ~85-95% with zero training
- **Evaluation**: Confusion matrices, per-class accuracy, understanding model mistakes
- **Real-time Classification**: Identifying sounds from your microphone
- **Triggering Actions**: Using sound classification to control OSC, Serial, and more

### Key Concepts

| Concept | What You Learned |
|---------|-----------------|
| **Feature extraction** | Turning raw audio into meaningful numbers |
| **Spectrograms as images** | Mel spectrograms can be classified by image models (CNNs, ViTs) |
| **Transfer learning** | Features learned for one task (ImageNet photos) help with another (audio spectrograms) |
| **Pretrained models** | Downloading and using models trained by others — no GPU needed! |
| **Evaluation** | Accuracy alone isn't enough — confusion matrices reveal what the model struggles with |
| **Real-time inference** | The record → classify → act pipeline for interactive applications |
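
As a concrete take on the evaluation row, the sketch below lists the most frequent (true, predicted) mistakes, which is the information you would otherwise read off a confusion matrix. `most_confused_pairs` is a hypothetical helper and the labels are made-up toy data; in the notebook you could pass `ast_true_labels` and `ast_predictions` instead.

```python
from collections import Counter


def most_confused_pairs(true_labels, predicted_labels, top_n=5):
    """Count (true, predicted) pairs for wrong predictions, most frequent first."""
    mistakes = Counter(
        (true, pred)
        for true, pred in zip(true_labels, predicted_labels)
        if true != pred
    )
    return mistakes.most_common(top_n)


# Toy data for illustration only
true_demo = ["rain", "rain", "wind", "dog", "dog", "dog", "sea_waves"]
pred_demo = ["wind", "wind", "wind", "dog", "cat", "dog", "rain"]

for (true, pred), count in most_confused_pairs(true_demo, pred_demo):
    print(f"{true:12s} mistaken for {pred:12s} x{count}")
```

Pairs that show up repeatedly are good candidates to listen to side by side: they often sound alike to people too.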

### The Big Picture

From Class 1 to now, you've gone from raw audio signals to building systems that can:

- **Class 1**: Record, visualize, and analyze audio signals
- **Class 2**: Recognize spoken digits with a CNN trained from scratch
- **Class 3**: Detect emotions in speech with classical ML
- **Class 4**: Transcribe any speech with Whisper
- **Class 5**: Classify environmental sounds — from handcrafted features to CNNs to pretrained transformers

Each class builds on the last. The fundamentals (FFT, spectrograms, mel scale) power everything from simple pitch detection to state-of-the-art AI models!