# 03: Feature Extraction

Extract audio features and CLAP embeddings from the preprocessed dataset.

## Background

This notebook generates two complementary representations of audio content:

1. **Audio features** (rhythm, spectral, tonal) capture low-level acoustic properties derived from signal processing. These 56 features describe tempo, timbre, harmonic content, and rhythmic patterns.

2. **CLAP embeddings** (512-d) capture high-level semantic audio content through a contrastive language-audio model. These embeddings encode musical similarity in a learned latent space.

Both representations are used downstream for manifold learning and visualization.

## Setup

In [None]:
import sys
from pathlib import Path

project_root = Path.cwd().parent.parent
sys.path.insert(0, str(project_root))

print(f"Project root: {project_root}")

In [None]:
import os
from dataclasses import asdict

import essentia
import numpy as np
import pandas as pd
from tqdm.auto import tqdm

# suppress essentia HPCP normalization warning (informational, not an error)
essentia.log.warningActive = False

from configs.dataset import load_config
from core.embed import CLAP_BATCH_SIZE, center_and_normalize, extract, save_embeddings
from eval.embed import check_embedding_sanity
from features.parallel import extract_features_parallel
from models.clap import ClapEmbedder

N_WORKERS = os.cpu_count() or 4

In [None]:
config = load_config(project_root / "configs" / "config.yaml")
print(f"Dataset root: {config.paths.dataset_root}")

In [None]:
df = pd.read_csv(project_root / "notebooks/data/merge_preprocessed.csv")
print(f"Loaded {len(df)} tracks")
df.head()

## Feature Extraction (Audio Features)

Extract rhythm, spectral, and tonal features from each audio file using librosa and essentia.

### Feature Categories

| Category | Count | Description |
|----------|-------|-------------|
| Rhythm | 23 | Tempo, onset strength, beat intervals, BPM histogram, rhythm transform, beats loudness |
| Spectral | 10 | MFCC 1-5, spectral centroid, contrast, inharmonicity |
| Tonal | 23 | Chroma entropy, scale alignment, HPCP 0-11, key strength, key encoding |

### Feature Field Reference

| Category | Field | Description |
|----------|-------|-------------|
| Rhythm | `tempo_bpm` | Estimated tempo in beats per minute |
| Rhythm | `beat_interval_cv` | Coefficient of variation of beat intervals |
| Rhythm | `bpm_histogram_entropy` | Shannon entropy of BPM distribution |
| Spectral | `mfcc_1_mean` to `mfcc_5_mean` | Mel-frequency cepstral coefficients |
| Spectral | `spectral_centroid_mean` | Center of mass of spectrum |
| Spectral | `inharmonicity_mean` | Deviation from harmonic series |
| Tonal | `chroma_entropy` | Entropy of 12-bin chromagram |
| Tonal | `key_strength` | Confidence of detected key (0-1) |
| Tonal | `hpcp_0` to `hpcp_11` | Harmonic pitch class profile |
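
Two kinds of descriptor in the table, the coefficient of variation and the entropy measures, have simple closed forms. The sketch below illustrates them with NumPy only. The function names `beat_interval_cv` and `shannon_entropy` and the base-2 logarithm are assumptions made for illustration; the project's `features` package may compute these fields differently.

```python
import numpy as np


def beat_interval_cv(beat_times: np.ndarray) -> float:
    """Coefficient of variation (std / mean) of successive beat intervals."""
    intervals = np.diff(beat_times)
    if intervals.size == 0 or intervals.mean() == 0:
        return 0.0
    return float(intervals.std() / intervals.mean())


def shannon_entropy(weights: np.ndarray) -> float:
    """Shannon entropy in bits of a non-negative vector, normalized to sum to 1."""
    total = weights.sum()
    if total <= 0:
        return 0.0
    p = weights / total
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-(p * np.log2(p)).sum())


# Perfectly regular beats (one every 0.5 s) have no interval variation
steady_beats = np.arange(0.0, 10.0, 0.5)
print(beat_interval_cv(steady_beats))

# Energy spread evenly over 12 pitch classes gives the maximum entropy, log2(12)
uniform_chroma = np.ones(12)
print(shannon_entropy(uniform_chroma))

# All energy in one pitch class gives zero entropy
single_note = np.zeros(12)
single_note[0] = 1.0
print(shannon_entropy(single_note))
```

Under these definitions, a low `beat_interval_cv` indicates a steady pulse, and a high entropy indicates energy spread across many bins rather than concentrated in a few.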

In [None]:
audio_paths = df["audio_path"].tolist()
song_ids = df["song_id"].tolist()

print(f"Extracting features from {len(audio_paths)} tracks using {N_WORKERS} workers")

features_list = extract_features_parallel(audio_paths, song_ids, n_workers=N_WORKERS)

In [None]:
features_df = pd.DataFrame(features_list)
print(f"Shape: {features_df.shape} (expected: {len(df)} rows, 57 columns)")
features_df.head()

In [None]:
features_path = project_root / "notebooks/data/merge_audio_features.csv"
features_df.to_csv(features_path, index=False)
print(f"Saved audio features to {features_path}")

## CLAP Embedding Extraction

Extract 512-dimensional CLAP embeddings using the `laion/larger_clap_music` model. These embeddings capture high-level semantic audio content through contrastive learning between audio and text.

In [None]:
embedder = ClapEmbedder.from_pretrained("laion/larger_clap_music")
print(f"Device: {embedder.device}")
print(f"Sample rate: {embedder.sample_rate}")

In [None]:
audio_paths = [Path(p) for p in df["audio_path"]]
track_ids = df["song_id"].tolist()

print(f"Extracting embeddings for {len(audio_paths)} tracks")
print(f"Batch size: {CLAP_BATCH_SIZE}")

In [None]:
embeddings = []
n_batches = (len(audio_paths) + CLAP_BATCH_SIZE - 1) // CLAP_BATCH_SIZE

for i in tqdm(range(n_batches), desc="Extracting CLAP embeddings"):
    start = i * CLAP_BATCH_SIZE
    end = min(start + CLAP_BATCH_SIZE, len(audio_paths))

    batch_paths = audio_paths[start:end]
    batch_ids = track_ids[start:end]

    batch_embeddings = extract(batch_paths, embedder, track_ids=batch_ids, batch_size=CLAP_BATCH_SIZE)
    embeddings.extend(batch_embeddings)

print(f"Extracted {len(embeddings)} embeddings")

### Embedding Validation

Check for common pathologies: NaN values, infinite values, zero-norm vectors, and embedding spread.
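
For readers without access to `eval.embed`, the checks listed above can be approximated as follows. This is a minimal sketch, not the implementation of `check_embedding_sanity`: the function name `sketch_sanity` and the returned dictionary are hypothetical, and only the reported quantities mirror the fields printed in this notebook.

```python
import numpy as np


def sketch_sanity(emb: np.ndarray) -> dict:
    """Basic health checks for an (n_samples, n_dims) embedding matrix."""
    norms = np.linalg.norm(emb, axis=1)
    # Replace zero norms with 1 so the division does not itself create inf/NaN
    unit = emb / np.where(norms == 0, 1.0, norms)[:, None]
    cos = unit @ unit.T
    # Strict upper triangle: each unordered pair once, self-similarity excluded
    pairwise = cos[np.triu_indices(len(emb), k=1)]
    return {
        "has_nan": bool(np.isnan(emb).any()),
        "has_inf": bool(np.isinf(emb).any()),
        "has_zero_norm": bool((norms == 0).any()),
        "mean_pairwise_cosine": float(pairwise.mean()),
        "std_pairwise_cosine": float(pairwise.std()),
    }


rng = np.random.default_rng(0)
# Synthetic data: a large shared offset mimics embeddings packed in a narrow cone
demo = rng.normal(size=(100, 16)) + 5.0
report = sketch_sanity(demo)
print(report)
```

The full cosine matrix needs memory quadratic in the number of tracks, so for a large dataset it would be reasonable to run this kind of check on a random subsample.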

In [None]:
emb_matrix = np.vstack([e.embedding for e in embeddings])
sanity = check_embedding_sanity(emb_matrix)

print(f"Shape: {sanity.n_samples} x {sanity.n_dims}")
print(f"NaN: {sanity.has_nan}, Inf: {sanity.has_inf}, Zero-norm: {sanity.has_zero_norm}")
print(f"Mean pairwise cosine: {sanity.mean_pairwise_cosine:.3f} +/- {sanity.std_pairwise_cosine:.3f}")

### Centering and Normalization

The high mean pairwise cosine (0.94) indicates that the embeddings cluster in a narrow cone, a common pathology in contrastive models. Subtracting the dataset mean from every embedding and then re-normalizing each vector to unit length removes the shared offset that dominates every similarity score. This improves isotropy and makes cosine similarity more discriminative for downstream tasks.
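
The effect can be reproduced on synthetic data. The sketch below is an illustration under assumed inputs (random vectors with a large shared offset) and does not call `center_and_normalize`; the helpers `l2_normalize` and `mean_pairwise_cosine` are hypothetical names introduced here.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def mean_pairwise_cosine(unit: np.ndarray) -> float:
    """Mean cosine over all distinct pairs of unit-norm rows.

    Uses sum_{i != j} u_i . u_j = ||sum_i u_i||^2 - N, which avoids
    building the full N x N similarity matrix.
    """
    n = len(unit)
    total = unit.sum(axis=0)
    return float((total @ total - n) / (n * (n - 1)))


rng = np.random.default_rng(0)
# Unit vectors that all share a dominant direction (a "narrow cone")
raw = l2_normalize(rng.normal(size=(200, 32)) + 4.0)

before = mean_pairwise_cosine(raw)
# Subtract the dataset mean, then project back onto the unit sphere
after = mean_pairwise_cosine(l2_normalize(raw - raw.mean(axis=0)))
print(f"before: {before:.3f}, after: {after:.3f}")
```

After centering, the shared direction no longer contributes to every dot product, so the mean pairwise cosine drops toward zero and the remaining variation between tracks carries the signal.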

In [None]:
embeddings_normalized = center_and_normalize(embeddings)
print(f"Centered and normalized {len(embeddings_normalized)} embeddings")

In [None]:
emb_matrix_normalized = np.vstack([e.embedding for e in embeddings_normalized])
sanity_normalized = check_embedding_sanity(emb_matrix_normalized)

print(f"Shape: {sanity_normalized.n_samples} x {sanity_normalized.n_dims}")
print(f"NaN: {sanity_normalized.has_nan}, Inf: {sanity_normalized.has_inf}, Zero-norm: {sanity_normalized.has_zero_norm}")
print(f"Mean pairwise cosine: {sanity_normalized.mean_pairwise_cosine:.3f} +/- {sanity_normalized.std_pairwise_cosine:.3f}")
print(f"Reduction: {sanity.mean_pairwise_cosine:.3f} -> {sanity_normalized.mean_pairwise_cosine:.3f}")

In [None]:
embeddings_dir = project_root / "notebooks/data/embeddings"
embeddings_dir.mkdir(parents=True, exist_ok=True)

embeddings_path = embeddings_dir / "clap_embeddings.npz"
save_embeddings(embeddings, embeddings_path)
print(f"Saved embeddings to {embeddings_path}")

In [None]:
embeddings_normalized_path = embeddings_dir / "clap_embeddings_normalized.npz"
save_embeddings(embeddings_normalized, embeddings_normalized_path)
print(f"Saved normalized embeddings to {embeddings_normalized_path}")

### Retrieval Index

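Create a retrieval index that stores the database mean vector alongside the centered, unit-norm embeddings. At query time, each query embedding must be centered with this stored mean (not the mean of the query batch) and then re-normalized, so that queries and database vectors share the same preprocessing.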
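
A query against such an index could look like the sketch below. It assumes an index with the keys `track_ids`, `embeddings`, and `mean` as written in the next cell; the `search` function and the synthetic in-memory index are hypothetical and exist only for illustration.

```python
import numpy as np


def search(query: np.ndarray, index, top_k: int = 5):
    """Return (track_id, cosine) pairs for the top_k nearest database tracks."""
    # Center with the stored database mean, not a mean computed from the query
    centered = query - index["mean"]
    norm = np.linalg.norm(centered)
    if norm == 0:
        raise ValueError("query coincides with the database mean")
    unit = centered / norm
    # Database rows are unit-norm, so the dot product is the cosine similarity
    scores = index["embeddings"] @ unit
    order = np.argsort(scores)[::-1][:top_k]
    return [(index["track_ids"][i], float(scores[i])) for i in order]


# Build a small synthetic index with the same three keys
rng = np.random.default_rng(0)
raw = rng.normal(size=(50, 16)) + 3.0
mean = raw.mean(axis=0)
centered_db = raw - mean
demo_index = {
    "track_ids": np.array([f"t{i}" for i in range(len(raw))]),
    "embeddings": (centered_db / np.linalg.norm(centered_db, axis=1, keepdims=True)).astype(np.float32),
    "mean": mean.astype(np.float32),
}

# Querying with a raw database vector should return that same track first
results = search(raw[3], demo_index, top_k=3)
print(results)
```

With the saved file, `np.load(index_path)` returns a mapping that exposes the same keys, so it can be passed as `index` in place of the dictionary used here.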

In [None]:
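# build the retrieval index and keep the database mean so queries can be centered identically
emb_matrix_raw = np.vstack([e.embedding for e in embeddings])
mean_vector = emb_matrix_raw.mean(axis=0)
centered = emb_matrix_raw - mean_vector
norms = np.linalg.norm(centered, axis=1, keepdims=True)
# guard against division by zero if an embedding coincides with the mean
emb_matrix_norm = centered / np.clip(norms, 1e-12, None)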

index_path = embeddings_dir / "clap_index.npz"
np.savez_compressed(
    index_path,
    track_ids=np.array([e.track_id for e in embeddings]),
    embeddings=emb_matrix_norm.astype(np.float32),
    mean=mean_vector.astype(np.float32),
)
print(f"Saved retrieval index to {index_path}")
print(f"  track_ids: {len(embeddings)}")
print(f"  embeddings: {emb_matrix_norm.shape}")
print(f"  mean: {mean_vector.shape}")

## Summary

**Outputs:**
- `notebooks/data/merge_audio_features.csv`: Audio features (song_id + 56 features)
- `notebooks/data/embeddings/clap_embeddings.npz`: Raw CLAP embeddings (track_ids + 512-d vectors)
- `notebooks/data/embeddings/clap_embeddings_normalized.npz`: Centered + normalized embeddings for improved isotropy
- `notebooks/data/embeddings/clap_index.npz`: Retrieval index (track_ids + embeddings + mean vector)

In [None]:
print("Feature extraction complete.")
print(f"  Audio features: {features_df.shape[0]} tracks, {features_df.shape[1] - 1} features")
print(f"  CLAP embeddings: {len(embeddings)} tracks, {embeddings[0].dim} dimensions")
print(f"  Normalized embeddings: {len(embeddings_normalized)} tracks (mean cosine: {sanity_normalized.mean_pairwise_cosine:.3f})")