# 03: Feature Extraction

Extract audio features and CLAP embeddings from the preprocessed dataset.

## Background

This notebook generates two complementary representations of audio content:

1. **Audio features** (rhythm, spectral, tonal) capture low-level acoustic properties derived from signal processing. These 56 features describe tempo, timbre, harmonic content, and rhythmic patterns.

2. **CLAP embeddings** (512-d) capture high-level semantic audio content through a contrastive language-audio model. These embeddings encode musical similarity in a learned latent space.

Both representations are used downstream for manifold learning and visualization.

## Setup

In [1]:
import sys
from pathlib import Path

project_root = Path.cwd().parent.parent
sys.path.insert(0, str(project_root))

print(f"Project root: {project_root}")

Project root: /Users/kat/Desktop/code/projects/soundspace


In [14]:
from dataclasses import asdict

import essentia
import numpy as np
import pandas as pd
from tqdm.auto import tqdm

# suppress essentia HPCP normalization warning (informational, not an error)
essentia.log.warningActive = False

from configs.dataset import load_config
from core.embed import CLAP_BATCH_SIZE, center_and_normalize, extract, save_embeddings
from eval.embed import check_embedding_sanity
from features.audio.rhythm import RhythmFeatures, extract_rhythm
from features.audio.spectral import SpectralFeatures, extract_spectral
from features.audio.tonal import TonalFeatures, extract_tonal
from models.clap import ClapEmbedder

In [3]:
config = load_config(project_root / "configs" / "config.yaml")
print(f"Dataset root: {config.paths.dataset_root}")

Dataset root: /Users/kat/Desktop/code/projects/data


In [4]:
df = pd.read_csv(project_root / "notebooks/data/merge_preprocessed.csv")
print(f"Loaded {len(df)} tracks")
df.head()

Loaded 2577 tracks


Unnamed: 0,song_id,quadrant,artist,title,duration,mood,mood_all,mood_all_weights,genre,genre_weights,theme,theme_weights,style,style_weights,arousal,valence,audio_path
0,A002,Q4,Rod Stewart,Country Comfort,282.0,"Agreeable,Positive,Relaxed,Romantic,Serious,St...","Agreeable,Positive,Relaxed,Romantic,Serious,St...",55555555,Pop/Rock,5,"Biographical,Country Life,Family,Lifecycle,Ope...",55555555,"Adult Contemporary,Contemporary Pop/Rock",55,0.375,0.7125,/Users/kat/Desktop/code/projects/data/merge-ba...
1,A014,Q1,Jamiroquai,Feels Just Like It Should,274.0,"Bright,Carefree,Celebratory,Effervescent,Energ...","Bright,Carefree,Celebratory,Effervescent,Energ...",5677777777778888,"Electronic,Pop/Rock,R&B",679,"Club,Day Driving,Partying,Pool Party,TGIF",67777,"Acid Jazz,Adult Alternative Pop/Rock,Alternati...",567999,0.9,0.7125,/Users/kat/Desktop/code/projects/data/merge-ba...
2,A090-94,Q2,2Pac,Fuck the World,253.0,"Angry,Angst-Ridden,Anguished/Distraught,Comple...","Angry,Angst-Ridden,Anguished/Distraught,Broodi...",55555555555577778889999,Rap,8,"Affirmation,Cool & Cocky,Empowering,Introspect...",5556777,"G-Funk,Gangsta Rap,West Coast Rap",888,0.7875,0.1875,/Users/kat/Desktop/code/projects/data/merge-ba...
3,A120-168,Q4,Enya,Paint the Sky With Stars,255.0,"Atmospheric,Calm/Peaceful,Circular,Complex,Det...","Atmospheric,Calm/Peaceful,Circular,Complex,Det...","8,8,8,8,8,8,8,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,...","International,New Age,Pop/Rock",888,"Introspection,Meditation,Reflection,Relaxation...",889999,"Adult Alternative,Adult Alternative Pop/Rock,A...",88888,0.0875,0.75,/Users/kat/Desktop/code/projects/data/merge-ba...
4,A148-102,Q1,Billy Joel,Uptown Girl,197.0,"Amiable/Good-Natured,Brash,Bravado,Bright,Chee...","Amiable/Good-Natured,Brash,Bravado,Bright,Chee...",579999999999,Pop/Rock,9,"In Love,Joy,New Love",599,"Album Rock,Contemporary Pop/Rock,Soft Rock",999,0.8375,0.85,/Users/kat/Desktop/code/projects/data/merge-ba...


## Feature Extraction (Audio Features)

Extract rhythm, spectral, and tonal features from each audio file using librosa and essentia.

### Feature Categories

| Category | Count | Description |
|----------|-------|-------------|
| Rhythm | 23 | Tempo, onset strength, beat intervals, BPM histogram, rhythm transform, beats loudness |
| Spectral | 10 | MFCC 1-5, spectral centroid, contrast, inharmonicity |
| Tonal | 23 | Chroma entropy, scale alignment, HPCP 0-11, key strength, key encoding |

### Feature Field Reference

| Category | Field | Description |
|----------|-------|-------------|
| Rhythm | `tempo_bpm` | Estimated tempo in beats per minute |
| Rhythm | `beat_interval_cv` | Coefficient of variation of beat intervals |
| Rhythm | `bpm_histogram_entropy` | Shannon entropy of BPM distribution |
| Spectral | `mfcc_1_mean` to `mfcc_5_mean` | Mean of Mel-frequency cepstral coefficients 1-5 |
| Spectral | `spectral_centroid_mean` | Mean center of mass of the spectrum |
| Spectral | `inharmonicity_mean` | Mean deviation of partials from the harmonic series |
| Tonal | `chroma_entropy` | Entropy of 12-bin chromagram |
| Tonal | `key_strength` | Confidence of detected key (0-1) |
| Tonal | `hpcp_0` to `hpcp_11` | Harmonic pitch class profile |
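
Several of these fields are plain summary statistics. As a rough illustration only, the coefficient of variation and Shannon entropy could be computed with NumPy as sketched here. The function names `coefficient_of_variation` and `shannon_entropy` are hypothetical, the base-2 logarithm is an assumption, and the project's `extract_rhythm` and `extract_tonal` implementations may differ.

```python
import numpy as np


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (0.0 if the mean is 0)."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    return float(values.std() / mean) if mean != 0 else 0.0


def shannon_entropy(weights, base=2.0):
    """Shannon entropy of non-negative weights, normalized to sum to 1."""
    p = np.asarray(weights, dtype=float)
    total = p.sum()
    if total <= 0:
        return 0.0
    p = p / total
    p = p[p > 0]  # 0 * log(0) is treated as 0
    return float(-(p * np.log(p)).sum() / np.log(base))


# Beat times in seconds for a perfectly steady 120 BPM track
beat_times = np.arange(0, 10, 0.5)
intervals = np.diff(beat_times)
steady_cv = coefficient_of_variation(intervals)

# A flat 12-bin profile has the maximum possible entropy, log2(12)
flat_entropy = shannon_entropy(np.ones(12))
# A single active pitch class has zero entropy
peaked_entropy = shannon_entropy([1.0] + [0.0] * 11)
```

A steady beat gives a coefficient of variation near zero, while tempo drift or irregular rhythm raises it. The extraction output in this notebook is consistent with the ratio definition: for track A002, `beat_interval_std / beat_interval_mean` is 0.006787 / 0.449038, which is about 0.0151 and matches the reported `beat_interval_cv`.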

In [5]:
features_list = []

for _, row in tqdm(df.iterrows(), total=len(df), desc="Extracting audio features"):
    path = Path(row["audio_path"])
    
    rhythm = extract_rhythm(path)
    spectral = extract_spectral(path)
    tonal = extract_tonal(path)
    
    features_list.append({
        "song_id": row["song_id"],
        **asdict(rhythm),
        **asdict(spectral),
        **asdict(tonal),
    })

Extracting audio features:   0%|          | 0/2577 [00:00<?, ?it/s]

In [6]:
features_df = pd.DataFrame(features_list)
print(f"Shape: {features_df.shape} (expected: {len(df)} rows, 57 columns)")
features_df.head()

Shape: (2577, 57) (expected: 2577 rows, 57 columns)


Unnamed: 0,song_id,tempo_bpm,onset_strength_mean,onset_strength_std,beat_interval_mean,beat_interval_std,beat_interval_cv,bpm_first_peak,bpm_first_weight,bpm_first_spread,...,hpcp_10,hpcp_11,hpcp_entropy,hpcp_std,hpcp_max,hpcp_temporal_std,key_strength,is_minor,key_cos,key_sin
0,A002,135.999178,1.004701,0.93511,0.449038,0.006787,0.015115,133.0,0.553846,0.446154,...,0.177398,0.026258,3.196343,0.062499,0.19793,0.117463,0.900473,0.0,1.0,0.0
1,A014,172.265625,1.341463,1.169756,0.347194,0.010387,0.029918,172.0,0.666667,0.0,...,0.122495,0.074231,3.471662,0.034961,0.172153,0.099603,0.885002,1.0,-1.83697e-16,-1.0
2,A090-94,95.703125,1.280073,0.999407,0.629968,0.008873,0.014085,96.0,0.73913,0.244444,...,0.060443,0.058656,3.519326,0.025764,0.128916,0.114035,0.929787,1.0,0.8660254,-0.5
3,A120-168,107.666016,0.817967,0.304054,0.611376,0.043918,0.071835,96.0,0.191489,0.653846,...,0.005724,0.073755,2.972659,0.07329,0.214168,0.106387,0.971524,0.0,-1.83697e-16,-1.0
4,A148-102,129.199219,0.937469,0.60228,0.466646,0.004587,0.00983,129.0,0.806452,0.193548,...,0.085399,0.082776,3.476839,0.031895,0.138451,0.104242,0.834914,0.0,-0.5,0.866025
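
The `key_cos` and `key_sin` columns look like a circular encoding of the detected key: every pair shown above falls on a multiple of 30 degrees on the unit circle. A minimal sketch of such an encoding follows, assuming a pitch-class index k (0 to 11) maps to the angle 2πk/12. The index-to-key mapping and the names `encode_key` and `decode_key` are assumptions, not taken from `features.audio.tonal`.

```python
import numpy as np


def encode_key(pitch_class, n_classes=12):
    """Map a pitch-class index to (cos, sin) on the unit circle."""
    angle = 2.0 * np.pi * pitch_class / n_classes
    return float(np.cos(angle)), float(np.sin(angle))


def decode_key(key_cos, key_sin, n_classes=12):
    """Recover the pitch-class index from its (cos, sin) encoding."""
    angle = np.arctan2(key_sin, key_cos) % (2.0 * np.pi)
    return int(round(angle * n_classes / (2.0 * np.pi))) % n_classes


# Index 0 and index 11 are neighbors on the circle,
# unlike in a plain 0-11 integer code where they are 11 apart.
c0 = np.array(encode_key(0))
c11 = np.array(encode_key(11))
c6 = np.array(encode_key(6))
near = float(np.linalg.norm(c0 - c11))
far = float(np.linalg.norm(c0 - c6))
```

The point of the two-column form is that distance-based methods see adjacent indices as close even across the wrap-around, which a single integer column cannot express.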


In [7]:
features_path = project_root / "notebooks/data/merge_audio_features.csv"
features_df.to_csv(features_path, index=False)
print(f"Saved audio features to {features_path}")

Saved audio features to /Users/kat/Desktop/code/projects/soundspace/notebooks/data/merge_audio_features.csv


## CLAP Embedding Extraction

Extract 512-dimensional CLAP embeddings using the `laion/larger_clap_music` model. These embeddings capture high-level semantic audio content through contrastive learning between audio and text.

In [8]:
embedder = ClapEmbedder.from_pretrained("laion/larger_clap_music")
print(f"Device: {embedder.device}")
print(f"Sample rate: {embedder.sample_rate}")

`torch_dtype` is deprecated! Use `dtype` instead!


Device: cpu
Sample rate: 48000


In [9]:
audio_paths = [Path(p) for p in df["audio_path"]]
track_ids = df["song_id"].tolist()

print(f"Extracting embeddings for {len(audio_paths)} tracks")
print(f"Batch size: {CLAP_BATCH_SIZE}")

Extracting embeddings for 2577 tracks
Batch size: 8


In [10]:
embeddings = extract(
    audio_paths,
    embedder,
    track_ids=track_ids,
    batch_size=CLAP_BATCH_SIZE,
)

print(f"Extracted {len(embeddings)} embeddings")



Extracted 2577 embeddings


### Embedding Validation

Check for common pathologies: NaN values, infinite values, zero-norm vectors, and poor embedding spread (a mean pairwise cosine close to 1).

In [11]:
emb_matrix = np.vstack([e.embedding for e in embeddings])
sanity = check_embedding_sanity(emb_matrix)

print(f"Shape: {sanity.n_samples} x {sanity.n_dims}")
print(f"NaN: {sanity.has_nan}, Inf: {sanity.has_inf}, Zero-norm: {sanity.has_zero_norm}")
print(f"Mean pairwise cosine: {sanity.mean_pairwise_cosine:.3f} +/- {sanity.std_pairwise_cosine:.3f}")

Shape: 2577 x 512
NaN: False, Inf: False, Zero-norm: False
Mean pairwise cosine: 0.943 +/- 0.073
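
For readers without access to `eval.embed`, the same checks can be approximated in plain NumPy. This is a sketch under assumptions: `embedding_sanity` is a hypothetical stand-in, and the real `check_embedding_sanity` may compute its statistics differently (for example, by sampling pairs rather than using all of them).

```python
import numpy as np


def embedding_sanity(matrix, eps=1e-12):
    """Basic health checks for an (n_samples, n_dims) embedding matrix."""
    matrix = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(matrix, axis=1)
    has_zero_norm = bool((norms < eps).any())

    # Cosine similarity of every pair, excluding each vector with itself
    unit = matrix / np.maximum(norms, eps)[:, None]
    cosine = unit @ unit.T
    upper = cosine[np.triu_indices(len(matrix), k=1)]

    return {
        "has_nan": bool(np.isnan(matrix).any()),
        "has_inf": bool(np.isinf(matrix).any()),
        "has_zero_norm": has_zero_norm,
        "mean_pairwise_cosine": float(upper.mean()),
        "std_pairwise_cosine": float(upper.std()),
    }


# Synthetic "narrow cone": a large shared offset plus small per-sample noise
rng = np.random.default_rng(0)
cone = 5.0 + rng.normal(scale=0.5, size=(200, 32))
report = embedding_sanity(cone)
```

On the synthetic cone the mean pairwise cosine comes out close to 1, the same symptom as the 0.943 reported for the CLAP embeddings.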


### Centering and Normalization

The high mean pairwise cosine (0.94) indicates that the embeddings cluster in a narrow cone, a common pathology in contrastive models. Subtracting the dataset mean from every embedding and then re-normalizing to unit length removes the shared offset, which improves isotropy and makes cosine similarity more discriminative for downstream tasks.
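
The effect is easy to reproduce on synthetic data. The sketch below assumes the operation is "subtract the dataset mean vector, then rescale each row to unit L2 norm"; the project's `center_and_normalize` in `core.embed` is not shown here and may differ in detail. `center_and_l2_normalize` and `mean_pairwise_cosine` are hypothetical names.

```python
import numpy as np


def center_and_l2_normalize(matrix, eps=1e-12):
    """Subtract the mean vector, then scale each row to unit L2 norm."""
    matrix = np.asarray(matrix, dtype=float)
    centered = matrix - matrix.mean(axis=0, keepdims=True)
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    return centered / np.maximum(norms, eps)


def mean_pairwise_cosine(matrix):
    """Mean cosine similarity over all distinct pairs of rows."""
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    cosine = unit @ unit.T
    return float(cosine[np.triu_indices(len(matrix), k=1)].mean())


# Synthetic narrow cone: the shared offset dominates per-sample variation
rng = np.random.default_rng(0)
raw = 5.0 + rng.normal(scale=0.5, size=(200, 32))
normalized = center_and_l2_normalize(raw)

before = mean_pairwise_cosine(raw)
after = mean_pairwise_cosine(normalized)
```

Here `before` is close to 1 and `after` is close to 0. One design consequence: the mean vector is a property of the dataset, so tracks embedded later would need the same stored mean subtracted to stay comparable.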

In [15]:
embeddings_normalized = center_and_normalize(embeddings)
print(f"Centered and normalized {len(embeddings_normalized)} embeddings")

Centered and normalized 2577 embeddings


In [16]:
emb_matrix_normalized = np.vstack([e.embedding for e in embeddings_normalized])
sanity_normalized = check_embedding_sanity(emb_matrix_normalized)

print(f"Shape: {sanity_normalized.n_samples} x {sanity_normalized.n_dims}")
print(f"NaN: {sanity_normalized.has_nan}, Inf: {sanity_normalized.has_inf}, Zero-norm: {sanity_normalized.has_zero_norm}")
print(f"Mean pairwise cosine: {sanity_normalized.mean_pairwise_cosine:.3f} +/- {sanity_normalized.std_pairwise_cosine:.3f}")
print(f"Reduction: {sanity.mean_pairwise_cosine:.3f} -> {sanity_normalized.mean_pairwise_cosine:.3f}")

Shape: 2577 x 512
NaN: False, Inf: False, Zero-norm: False
Mean pairwise cosine: 0.021 +/- 0.392
Reduction: 0.943 -> 0.021


In [18]:
embeddings_dir = project_root / "notebooks/data/embeddings"
embeddings_dir.mkdir(parents=True, exist_ok=True)  # also create notebooks/data if it is missing

embeddings_path = embeddings_dir / "clap_embeddings.npz"
save_embeddings(embeddings, embeddings_path)
print(f"Saved embeddings to {embeddings_path}")

Saved embeddings to /Users/kat/Desktop/code/projects/soundspace/notebooks/data/embeddings/clap_embeddings.npz


In [21]:
embeddings_normalized_path = embeddings_dir / "clap_embeddings_normalized.npz"
save_embeddings(embeddings_normalized, embeddings_normalized_path)
print(f"Saved normalized embeddings to {embeddings_normalized_path}")

Saved normalized embeddings to /Users/kat/Desktop/code/projects/soundspace/notebooks/data/embeddings/clap_embeddings_normalized.npz


## Summary

**Outputs:**
- `notebooks/data/merge_audio_features.csv`: Audio features (song_id + 56 features)
- `notebooks/data/embeddings/clap_embeddings.npz`: Raw CLAP embeddings (track_ids + 512-d vectors)
- `notebooks/data/embeddings/clap_embeddings_normalized.npz`: Centered + normalized embeddings for improved isotropy
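
Downstream notebooks need to read these files back. The archive layout written by `save_embeddings` is not shown in this notebook, so the sketch below assumes two arrays stored under the hypothetical keys `track_ids` and `embeddings`. Inspecting `np.load(path).files` on the real file would confirm the actual key names.

```python
import tempfile
from pathlib import Path

import numpy as np

# Stand-in data: 3 tracks with 512-d vectors (IDs borrowed from this dataset)
track_ids = np.array(["A002", "A014", "A090-94"])
vectors = np.random.default_rng(0).normal(size=(3, 512)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example_embeddings.npz"
    # Assumed layout: one array of IDs, one (n_tracks, n_dims) matrix
    np.savez(path, track_ids=track_ids, embeddings=vectors)

    with np.load(path) as archive:
        keys = sorted(archive.files)
        loaded_ids = archive["track_ids"]
        loaded_vectors = archive["embeddings"]

# Row i of the matrix belongs to loaded_ids[i]
lookup = dict(zip(loaded_ids.tolist(), loaded_vectors))
```

Keeping the IDs in the same archive as the vectors means the row order never has to be inferred from a separate CSV.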

In [22]:
print("Feature extraction complete.")
print(f"  Audio features: {features_df.shape[0]} tracks, {features_df.shape[1] - 1} features")
print(f"  CLAP embeddings: {len(embeddings)} tracks, {embeddings[0].dim} dimensions")
print(f"  Normalized embeddings: {len(embeddings_normalized)} tracks (mean cosine: {sanity_normalized.mean_pairwise_cosine:.3f})")

Feature extraction complete.
  Audio features: 2577 tracks, 56 features
  CLAP embeddings: 2577 tracks, 512 dimensions
  Normalized embeddings: 2577 tracks (mean cosine: 0.021)
