# Audio Transformer Pipeline

This notebook demonstrates an end-to-end workflow for audio classification using a “TinyAST” (Audio Spectrogram Transformer) on the ESC-50 dataset. The code is organised into the following sections:

1. **Setup & Dependencies**  
   - Install necessary Python packages.  
   - Define global paths and parameters.  

2. **Dataset Download**  
   - Clone the ESC-50 GitHub repository into `data/raw/ESC-50-master/` (if not already present).  
   - Brief overview of ESC-50 and its 50 environmental sound classes.

3. **Preprocessing**  
   - Load ESC-50 metadata and raw `.wav` files.  
   - Compute 64-bin Mel spectrograms (16 kHz, 1024-point FFT, 512-sample hop).  
   - Convert to dB scale, normalise per clip, and save as `.npy` in `data/processed/`.  
   - Generate a `metadata.csv` linking each `.npy` to its class label.

4. **Model Training & Hold-out Evaluation**  
   - Read processed metadata and merge with original labels.  
   - Reserve 10 random clips as a hold-out test set.  
   - Define a custom Dataset that crops/pads spectrograms to 1×64×64.  
   - Build and train a lightweight Vision Transformer (`TinyAST`) for **5 epochs**.  
   - Evaluate on the 10 hold-out samples and print actual vs predicted labels.  
   - **Note:** Because the model is trained from scratch on only 1,990 clips (about 40 per class across 50 classes) for just 5 epochs on CPU, overall accuracy will be low. With more training data, more epochs and GPU acceleration, accuracy should improve substantially.

5. **Batch Inference (Optional)**  
   - Load the trained model weights.  
   - Run inference on any WAV files in `data/test/` and visualise or print predictions.

Use this notebook as a template for experimenting with attention-based audio models—swap in your own datasets, adjust the model configuration or training schedule, and leverage more compute to improve accuracy.


In [None]:
# install dependencies
%pip install librosa soundfile pandas numpy torch transformers

## Download and Prepare the ESC-50 Dataset

This cell ensures that the ESC-50 environmental sound dataset is available under `data/raw/ESC-50-master`. If the folder doesn’t already exist, it will:

1. Create the `data/raw` directory (if needed).  
2. Clone the official ESC-50 repository from GitHub into `data/raw/ESC-50-master`.

### About ESC-50

ESC-50 is a benchmark dataset for environmental sound classification. It contains 2,000 five-second audio clips evenly distributed across 50 semantic classes (e.g. dog bark, rain, siren, clock tick). Each clip is annotated with a category label, making it ideal for prototyping and evaluating audio classification pipelines. You can find the original repository and detailed documentation here: https://github.com/karolpiczak/ESC-50.  
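
Each clip's file name also carries its annotation. The ESC-50 repository documents the pattern `{FOLD}-{CLIP_ID}-{TAKE}-{TARGET}.wav`, so the numeric class can be read back without opening the CSV. The following is a minimal sketch of that idea; `Esc50Name` and `parse_esc50_filename` are hypothetical helper names introduced here for illustration and are not part of the notebook or the dataset tooling.

```python
from typing import NamedTuple


class Esc50Name(NamedTuple):
    """Fields encoded in an ESC-50 file name (hypothetical helper type)."""
    fold: int      # cross-validation fold index
    clip_id: str   # ID of the original source recording
    take: str      # letter distinguishing fragments cut from the same source
    target: int    # numeric class index, matching the `target` column in esc50.csv


def parse_esc50_filename(fname: str) -> Esc50Name:
    """Split a name of the form FOLD-CLIP_ID-TAKE-TARGET.wav into its parts."""
    stem = fname.rsplit(".", 1)[0]           # drop the extension
    fold, clip_id, take, target = stem.split("-")
    return Esc50Name(int(fold), clip_id, take, int(target))


info = parse_esc50_filename("1-47714-A-16.wav")
print(info.fold, info.clip_id, info.take, info.target)
```

In this pipeline the label is still taken from `esc50.csv`; parsing the name is mainly useful as a sanity check that a processed file and its label have not drifted apart.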


In [42]:
#1. Clone ESC-50 into data/raw
import os

# Where we expect the raw ESC-50 repo to live:
RAW_ESC50 = os.path.join("data", "raw", "ESC-50-master")

if not os.path.isdir(RAW_ESC50):
    print(f"Downloading ESC-50 into {RAW_ESC50} …")
    os.makedirs(os.path.dirname(RAW_ESC50), exist_ok=True)
    # Clone the GitHub repo into that location
    !git clone https://github.com/karolpiczak/ESC-50.git "{RAW_ESC50}"
else:
    print("ESC-50 already present at", RAW_ESC50)


Downloading ESC-50 into data\raw\ESC-50-master …


Cloning into 'data\raw\ESC-50-master'...

## Preprocessing ESC-50 into Normalised Mel-Spectrograms

This cell performs the end-to-end conversion of the raw ESC-50 audio clips into training-ready NumPy arrays (*.npy*) and generates a corresponding metadata CSV. In particular, it:

1. **Sets up project paths** from an absolute `BASE` directory rather than the current working directory. `BASE` is hard-coded, so edit it to match your machine before running.
2. **Verifies** that the raw ESC-50 audio files and their metadata CSV are present in `data/raw/ESC-50-master/`.
3. **Loads** the ESC-50 metadata table, which contains filenames, class indexes, and class names.
4. **Defines preprocessing parameters**:

   * Sampling rate: 16 kHz
   * Mel filterbanks: 64 bins
   * FFT window: 1024 samples
   * Hop length: 512 samples
5. **Iterates** through each clip and:

   * Loads and resamples the waveform to 16 kHz.
   * Computes a 64-bin mel-spectrogram.
   * Converts the power spectrogram to decibel (dB) scale.
   * Normalises each clip to zero mean and unit variance.
   * Saves the result as a 2D NumPy array in `data/processed/`.
6. **Accumulates** a new metadata DataFrame mapping each `.npy` file to its numeric label.
7. **Writes** this processed metadata to `data/processed/metadata.csv` for seamless integration with downstream training and evaluation steps.

After running this cell, all of the raw audio is represented as uniform, normalised spectrogram arrays, and you have a single CSV that links each processed file to its class label.
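
To make the output shape concrete, here is a small worked sketch of the arithmetic behind these parameters. It assumes each clip is exactly five seconds long and that librosa's default centred framing (`center=True`) is in effect, in which case the frame count is `1 + n_samples // hop_length`. The per-clip normalisation is shown on a stand-in random array, not on real audio.

```python
import numpy as np

TARGET_SR = 16_000   # Hz, as in the preprocessing parameters
HOP_LEN   = 512      # samples between successive frames
N_MELS    = 64       # mel bins
CLIP_SEC  = 5        # assumed clip length in seconds

# Number of samples after resampling a 5 s clip to 16 kHz.
n_samples = TARGET_SR * CLIP_SEC

# With centred framing, the spectrogram has 1 + n_samples // hop frames.
n_frames = 1 + n_samples // HOP_LEN
print(f"{n_samples} samples -> spectrogram shape ({N_MELS}, {n_frames})")

# Per-clip standardisation, applied to placeholder values in a dB-like range.
rng = np.random.default_rng(0)
fake_mel_db = rng.uniform(-80.0, 0.0, size=(N_MELS, n_frames))
norm = (fake_mel_db - fake_mel_db.mean()) / (fake_mel_db.std() + 1e-6)
print(f"mean = {norm.mean():.6f}, std = {norm.std():.6f}")
```

Under these assumptions each saved array has shape `(64, 157)`, which is wider than the 64 time frames the model later consumes.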


In [43]:
# Full Preprocessing Pipeline with Path Check

import os
import pandas as pd
import numpy as np
# Workaround: older librosa releases reference np.complex, which NumPy 1.24 removed
np.complex = complex

import librosa
import soundfile as sf  # not called directly; librosa.load reads audio through it

# 1. Define project folder (no chdir, so it won’t fail in this environment)
BASE = r"C:\Users\IAGhe\OneDrive\Documents\Learning\portfolio\audio_spectogram_transformer"

# 2. Define raw & processed data paths
RAW_DIR  = os.path.join(BASE, "data", "raw", "ESC-50-master", "audio")
META_CSV = os.path.join(BASE, "data", "raw", "ESC-50-master", "meta",  "esc50.csv")
PROC_DIR = os.path.join(BASE, "data", "processed")
os.makedirs(PROC_DIR, exist_ok=True)

# 3. Check that raw data is present
if not os.path.isdir(RAW_DIR) or not os.path.isfile(META_CSV):
    print("ERROR: ESC-50 data not found.")
    print("Please ensure the folder structure is:")
    print("  audio_spectogram_transformer/")
    print("    data/raw/ESC-50-master/audio/*.wav")
    print("    data/raw/ESC-50-master/meta/esc50.csv")
else:
    # 4. Load ESC-50 metadata
    meta = pd.read_csv(META_CSV)

    # 5. Preprocessing parameters
    TARGET_SR = 16_000
    N_MELS    = 64
    N_FFT     = 1024
    HOP_LEN   = 512

    # 6. Process each file
    processed_rows = []
    for _, row in meta.iterrows():
        fname   = row["filename"]
        label   = int(row["target"])
        wavpath = os.path.join(RAW_DIR, fname)

        # a) Load & resample
        y, sr = librosa.load(wavpath, sr=TARGET_SR, mono=True)

        # b) Compute mel-spectrogram
        mel = librosa.feature.melspectrogram(
            y=y,
            sr=TARGET_SR,
            n_fft=N_FFT,
            hop_length=HOP_LEN,
            n_mels=N_MELS
        )

        # c) Convert to log scale (dB)
        mel_db = librosa.power_to_db(mel, ref=np.max)

        # d) Normalise per-clip
        mel_db_norm = (mel_db - mel_db.mean()) / (mel_db.std() + 1e-6)

        # e) Save as .npy
        out_name = os.path.splitext(fname)[0] + ".npy"
        out_path = os.path.join(PROC_DIR, out_name)
        np.save(out_path, mel_db_norm.astype(np.float32))

        processed_rows.append({"npy_file": out_name, "label": label})

    # 7. Save processed metadata
    proc_meta = pd.DataFrame(processed_rows)
    proc_meta.to_csv(os.path.join(PROC_DIR, "metadata.csv"), index=False)

    print(f"Saved {len(proc_meta)} mel-spectrograms to '{PROC_DIR}'")


Saved 2000 mel-spectrograms to 'C:\Users\IAGhe\OneDrive\Documents\Learning\portfolio\audio_spectogram_transformer\data\processed'


## Model Training and Evaluation Pipeline

This cell implements an end-to-end workflow for training and evaluating our `TinyAST` audio transformer on the preprocessed ESC-50 mel-spectrograms, with a 10-sample hold-out test:

1. **Paths & Hyperparameters**: Sets project directories and key settings (e.g. number of mel bins, fixed time-frames, batch size, learning rate, epochs, and test set size).
2. **Metadata Loading & Merge**: Reads the processed spectrogram metadata and original ESC-50 labels, then combines them to associate each `.npy` file with its human-readable category.
3. **Hold-out Split**: Randomly selects 10 examples as an unseen test set (seeded for reproducibility) and uses the rest for training.
4. **Fixed-Size Dataset**: Defines a `FixedSpecDataset` that crops each spectrogram to its first 64 time frames, or zero-pads it if it is shorter, giving a fixed shape of `1×64×64` that matches the transformer’s expected input size.
5. **Model Definition**: Instantiates a `TinyAST` model (a lightweight Vision Transformer adapted for single-channel, 64×64 inputs) using the Hugging Face `ViTModel` backbone.
6. **Training Loop**: Trains the model for the specified number of epochs using cross-entropy loss and the Adam optimizer, printing the mean training loss after each epoch.
7. **Evaluation on Hold-out Set**: Runs inference on the 10 held-out clips and prints each file’s actual category alongside the model’s predicted category—no visualisation, just a clear text report.

This pipeline demonstrates the full machine learning workflow, from data loading and fixed-size preprocessing through model training to hold-out evaluation, in a single Jupyter cell.
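
The fixed `1×64×64` input has two consequences worth spelling out: how much of each clip survives the crop, and how many tokens the transformer sees. The sketch below works through both under the hyperparameters used in this notebook (hop 512 at 16 kHz, 16×16 patches). `crop_or_pad` is a hypothetical standalone helper that mirrors the crop/pad rule described above; the token count assumes the usual ViT layout of non-overlapping patches plus one classification token.

```python
import numpy as np

N_MELS, FIXED_T = 64, 64
PATCH = 16
HOP_LEN, TARGET_SR = 512, 16_000


def crop_or_pad(spec: np.ndarray, fixed_t: int) -> np.ndarray:
    """Keep the first `fixed_t` frames, or zero-pad on the right if shorter."""
    t = spec.shape[-1]
    if t >= fixed_t:
        return spec[..., :fixed_t]
    pad = np.zeros(spec.shape[:-1] + (fixed_t - t,), dtype=spec.dtype)
    return np.concatenate([spec, pad], axis=-1)


# A spectrogram longer than FIXED_T is cropped; a shorter one is padded.
long_spec  = np.ones((N_MELS, 157), dtype=np.float32)
short_spec = np.ones((N_MELS, 40), dtype=np.float32)
print(crop_or_pad(long_spec, FIXED_T).shape, crop_or_pad(short_spec, FIXED_T).shape)

# Approximate audio duration spanned by the first FIXED_T frames.
crop_seconds = FIXED_T * HOP_LEN / TARGET_SR

# Non-overlapping patches over the 64x64 input, plus one classification token.
n_patches = (N_MELS // PATCH) * (FIXED_T // PATCH)
seq_len = n_patches + 1

print(f"crop spans about {crop_seconds:.3f} s of audio")
print(f"{n_patches} patches, sequence length {seq_len}")
```

With these settings the model sees roughly the first two seconds of each five-second clip, split into 16 patches. Raising `FIXED_T` would keep more of each clip, but the input would no longer be square, so the ViT configuration would need to change accordingly.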


In [44]:
# %% Full Training + 10-Sample Test Pipeline (Fixed 64×64 Inputs)

import os, random
import numpy as np
import pandas as pd
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from transformers import ViTConfig, ViTModel

# ────────────────────────────────────────────────────────────────────
# 1) Paths & hyperparameters
# ────────────────────────────────────────────────────────────────────
BASE         = r"C:\Users\IAGhe\OneDrive\Documents\Learning\portfolio\audio_spectogram_transformer"
PROC_DIR     = os.path.join(BASE, "data", "processed")
RAW_META_CSV = os.path.join(BASE, "data", "raw", "ESC-50-master", "meta", "esc50.csv")

N_MELS      = 64
FIXED_T     = 64
BATCH_SIZE  = 8
EPOCHS      = 5
LR          = 1e-3
TEST_SIZE   = 10
SEED        = 42
device      = torch.device("cpu")

# ────────────────────────────────────────────────────────────────────
# 2) Load processed & raw metadata; merge for categories
# ────────────────────────────────────────────────────────────────────
proc_meta = pd.read_csv(os.path.join(PROC_DIR, "metadata.csv"))
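# regex=False: match ".npy" literally (older pandas defaults to regex, where "." matches any character)
proc_meta["filename"] = proc_meta["npy_file"].str.replace(".npy", ".wav", regex=False)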
raw_meta  = pd.read_csv(RAW_META_CSV)
meta = proc_meta.merge(raw_meta[["filename","category","target"]], on="filename")

# Build a lookup for predictions
label_map = dict(zip(raw_meta["target"], raw_meta["category"]))

# ────────────────────────────────────────────────────────────────────
# 3) Hold-out split
# ────────────────────────────────────────────────────────────────────
random.seed(SEED)
test_idxs     = random.sample(list(meta.index), TEST_SIZE)
test_meta     = meta.loc[test_idxs].reset_index(drop=True)
trainval_meta = meta.drop(test_idxs).reset_index(drop=True)

# ────────────────────────────────────────────────────────────────────
# 4) Dataset that crops/pads to [1×N_MELS×FIXED_T]
# ────────────────────────────────────────────────────────────────────
class FixedSpecDataset(Dataset):
    def __init__(self, df, npy_dir, fixed_t):
        self.df      = df
        self.npy_dir = npy_dir
        self.fixed_t = fixed_t

    def __len__(self):
        return len(self.df)

    def __getitem__(self, i):
        row  = self.df.iloc[i]
        spec = np.load(os.path.join(self.npy_dir, row.npy_file))  # (n_mels, T)
        spec = spec[np.newaxis, :, :]                            # (1, n_mels, T)
        _, m, t = spec.shape

        if t >= self.fixed_t:
            spec = spec[:, :, :self.fixed_t]
        else:
            pad = np.zeros((1, m, self.fixed_t - t), dtype=spec.dtype)
            spec = np.concatenate([spec, pad], axis=2)

        return torch.from_numpy(spec).float(), torch.tensor(row.target, dtype=torch.long)

train_ds = FixedSpecDataset(trainval_meta, PROC_DIR, FIXED_T)
test_ds  = FixedSpecDataset(test_meta,     PROC_DIR, FIXED_T)

train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True)
test_loader  = DataLoader(test_ds,  batch_size=1,         shuffle=False)

# ────────────────────────────────────────────────────────────────────
# 5) Define TinyAST model
# ────────────────────────────────────────────────────────────────────
cfg = ViTConfig(
    image_size=N_MELS,
    patch_size=16,
    num_channels=1,
    hidden_size=128,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=256,
    num_labels=len(raw_meta["target"].unique())
)

class TinyAST(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.vit      = ViTModel(config)
        self.cls_head = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, x):
        out = self.vit(pixel_values=x)  # expects [B,1,64,64]
        return self.cls_head(out.pooler_output)

model = TinyAST(cfg).to(device)

# ────────────────────────────────────────────────────────────────────
# 6) Training
# ────────────────────────────────────────────────────────────────────
optim   = torch.optim.Adam(model.parameters(), lr=LR)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(1, EPOCHS+1):
    model.train()
    total_loss = 0.0
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)
        optim.zero_grad()
        logits = model(xb)
        loss   = loss_fn(logits, yb)
        loss.backward()
        optim.step()
        total_loss += loss.item()
    print(f"Epoch {epoch}/{EPOCHS} — Loss: {total_loss/len(train_loader):.4f}")

# ────────────────────────────────────────────────────────────────────
# 7) Evaluate on hold-out: print actual vs predicted
# ────────────────────────────────────────────────────────────────────
print("\nHold-out Test Set Results:")
model.eval()
with torch.no_grad():
    for i, (spec, actual_lbl) in enumerate(test_loader):
        spec       = spec.to(device)
        pred_lbl   = model(spec).argmax(dim=-1).item()
        fname      = test_meta.loc[i, "filename"]
        actual_cat = test_meta.loc[i, "category"]
        pred_cat   = label_map[pred_lbl]
        print(f"{fname:20s}  actual = {actual_cat:<15s}  predicted = {pred_cat}")


Epoch 1/5 — Loss: 3.8314
Epoch 2/5 — Loss: 3.5725
Epoch 3/5 — Loss: 3.4761
Epoch 4/5 — Loss: 3.3440
Epoch 5/5 — Loss: 3.2762

Hold-out Test Set Results:
4-161579-A-40.wav     actual = helicopter       predicted = clock_alarm
1-47714-A-16.wav      actual = wind             predicted = helicopter
1-17092-A-27.wav      actual = brushing_teeth   predicted = crying_baby
4-201300-A-31.wav     actual = mouse_click      predicted = can_opening
2-139748-B-15.wav     actual = water_drops      predicted = fireworks
2-118625-A-30.wav     actual = door_wood_knock  predicted = door_wood_knock
2-109231-B-9.wav      actual = crow             predicted = pouring_water
1-57163-A-38.wav      actual = clock_tick       predicted = clock_alarm
4-198360-A-49.wav     actual = hand_saw         predicted = toilet_flush
1-43807-A-47.wav      actual = airplane         predicted = train
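
To put these numbers in context, the sketch below tallies the hold-out report and compares it with two chance-level baselines for a balanced 50-class problem: an accuracy of 1/50, and a cross-entropy loss of ln(50) for uniform predictions. The `results` list is copied by hand from the report above. With only 10 test clips the accuracy estimate is very noisy, so treat it as a smoke test rather than a benchmark.

```python
import math

# (actual, predicted) pairs copied from the hold-out report.
results = [
    ("helicopter", "clock_alarm"),
    ("wind", "helicopter"),
    ("brushing_teeth", "crying_baby"),
    ("mouse_click", "can_opening"),
    ("water_drops", "fireworks"),
    ("door_wood_knock", "door_wood_knock"),
    ("crow", "pouring_water"),
    ("clock_tick", "clock_alarm"),
    ("hand_saw", "toilet_flush"),
    ("airplane", "train"),
]
N_CLASSES = 50
FINAL_TRAIN_LOSS = 3.2762  # epoch 5 loss from the training log

correct = sum(actual == predicted for actual, predicted in results)
accuracy = correct / len(results)

chance_accuracy = 1 / N_CLASSES        # uniform guessing over balanced classes
chance_loss = math.log(N_CLASSES)      # cross-entropy of a uniform prediction

print(f"hold-out accuracy: {correct}/{len(results)} = {accuracy:.0%}")
print(f"chance accuracy:   {chance_accuracy:.0%}")
print(f"chance loss:       {chance_loss:.4f} (final training loss: {FINAL_TRAIN_LOSS})")
```

The training loss starts close to the chance-level value and falls below it by epoch 5, which suggests the model has begun to learn, though it is still far from converged.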
