# 05: Audio Retrieval

Query the CLAP embedding index using audio.

## Background

CLAP (Contrastive Language-Audio Pretraining) maps audio into a 512-dimensional embedding space. Similar sounds cluster together, enabling **content-based retrieval**: given an audio query, find the most similar tracks in the database.

Query embeddings must receive the same preprocessing as the indexed embeddings:
- Center by subtracting the database mean (removes the offset shared by all tracks)
- L2-normalize (so that a dot product equals cosine similarity)
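
The two steps above can be sketched in NumPy as follows. This is a minimal illustration on toy data, not the project's `preprocess_query`; the function name `center_and_normalize` and the `eps` guard are assumptions made for the example.

```python
import numpy as np


def center_and_normalize(x: np.ndarray, mean: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Subtract the database mean, then scale each vector to unit L2 norm.

    Hypothetical helper for illustration; accepts a single vector (d,)
    or a batch (n, d).
    """
    centered = np.asarray(x, dtype=np.float32) - np.asarray(mean, dtype=np.float32)
    norms = np.linalg.norm(centered, axis=-1, keepdims=True)
    # eps guards against division by zero if a vector equals the mean exactly.
    return centered / np.maximum(norms, eps)


# Toy data standing in for real CLAP embeddings.
rng = np.random.default_rng(0)
raw = rng.normal(loc=2.0, size=(4, 512)).astype(np.float32)
toy_mean = raw.mean(axis=0)

unit = center_and_normalize(raw, toy_mean)
# After normalization, a plain dot product equals cosine similarity.
dot = float(unit[0] @ unit[1])
```

Because every vector ends up with unit norm, ranking by dot product and ranking by cosine similarity give the same order.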

## Setup

In [None]:
import sys
from pathlib import Path

project_root = Path.cwd().parent.parent
sys.path.insert(0, str(project_root))

In [None]:
import numpy as np
import pandas as pd
import torch
from IPython.display import Audio, display
from transformers import ClapModel, ClapProcessor

from search.query import embed_audio, preprocess_query, retrieve_top_k

In [None]:
MODEL_ID = "laion/larger_clap_music"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = ClapModel.from_pretrained(MODEL_ID).to(device).eval()
processor = ClapProcessor.from_pretrained(MODEL_ID)

print(f"Model: {MODEL_ID}")
print(f"Device: {device}")

## Load Index

The index contains:
- `track_ids`: song identifiers
- `embeddings`: centered + L2-normalized vectors (n, 512)
- `mean`: database mean used for centering (512,)
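
For reference, an archive with this layout could be produced and checked roughly as shown here. The sketch makes no claim about how the real `clap_index.npz` was built; the toy data, the file name `toy_index.npz`, and the temporary directory are all illustrative.

```python
import tempfile
from pathlib import Path

import numpy as np

# Toy stand-ins for raw (uncentered, unnormalized) embeddings and their ids.
rng = np.random.default_rng(0)
raw = rng.normal(size=(6, 512)).astype(np.float32)
ids = np.array([f"track_{i}" for i in range(len(raw))])  # unicode array, loadable without pickle

# Center by the database mean, then L2-normalize each row.
db_mean = raw.mean(axis=0)
centered = raw - db_mean
emb = centered / np.linalg.norm(centered, axis=1, keepdims=True)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_index.npz"
    np.savez(path, track_ids=ids, embeddings=emb, mean=db_mean)

    # Read every array eagerly so the temporary file can be deleted afterwards.
    with np.load(path, allow_pickle=False) as data:
        loaded = {key: data[key] for key in data.files}

# Basic consistency checks one might run on any such index.
assert loaded["embeddings"].shape == (len(loaded["track_ids"]), loaded["mean"].shape[0])
assert np.allclose(np.linalg.norm(loaded["embeddings"], axis=1), 1.0, atol=1e-5)
```

Storing `mean` alongside the embeddings is what lets a later query be centered with exactly the same vector as the indexed tracks.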

In [None]:
index_path = project_root / "notebooks/data/embeddings/clap_index.npz"
index_data = np.load(index_path, allow_pickle=False)

track_ids = index_data["track_ids"]
embeddings = index_data["embeddings"]
mean = index_data["mean"]

print(f"Index: {len(track_ids)} tracks, {embeddings.shape[1]}-d embeddings")

In [None]:
df_meta = pd.read_csv(project_root / "notebooks/data/merge_preprocessed.csv")
track_to_meta = {str(row["song_id"]): row for _, row in df_meta.iterrows()}
print(f"Metadata: {len(df_meta)} tracks")

## Retrieval Demo

### Self-Retrieval Test

Sanity check: querying with an indexed track's own embedding should return that track as top-1 with a score of about 1.0, since the dot product of a unit-norm vector with itself is 1.
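
A minimal sketch of what a dot-product top-k lookup could look like, run on toy data. `top_k_by_dot` is a hypothetical stand-in written for this example and may differ from the project's `retrieve_top_k`; it assumes the query and every indexed row are already L2-normalized.

```python
import numpy as np


def top_k_by_dot(query: np.ndarray, embeddings: np.ndarray, ids: np.ndarray, k: int = 5):
    """Return the k (id, score) pairs with the highest dot product, best first.

    Hypothetical stand-in for illustration; assumes `query` and every row of
    `embeddings` are already L2-normalized.
    """
    scores = embeddings @ query  # cosine similarities, shape (n,)
    k = min(k, len(scores))
    order = np.argsort(-scores)[:k]  # indices sorted by descending score
    return [(str(ids[i]), float(scores[i])) for i in order]


# Toy unit-norm index.
rng = np.random.default_rng(0)
toy = rng.normal(size=(8, 512)).astype(np.float32)
toy /= np.linalg.norm(toy, axis=1, keepdims=True)
toy_ids = np.array([f"track_{i}" for i in range(len(toy))])

# Querying with row 3 should rank track_3 first with a score close to 1.
hits = top_k_by_dot(toy[3], toy, toy_ids, k=3)
```

A full `argsort` is the simplest choice for a small index; for very large indexes, `np.argpartition` or an approximate nearest-neighbor library would likely be preferable.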

In [None]:
test_idx = 0
q_self = embeddings[test_idx]
results_self = retrieve_top_k(q_self, embeddings, track_ids, k=5)

print(f"Query: {track_ids[test_idx]}")
print(f"Top-1: {results_self[0][0]} (score: {results_self[0][1]:.4f})")
assert results_self[0][0] == str(track_ids[test_idx]), "Self-retrieval failed!"
assert abs(results_self[0][1] - 1.0) < 1e-4, "Score should be ~1.0"  # loose enough for float32 rounding
print("\nSelf-retrieval test passed.")

### Audio Query

Query with an audio file. The file is embedded, centered using the database mean, and L2-normalized before retrieval.

In [None]:
query_audio = project_root / "notebooks/data/audio/sample-3.mp3"

if query_audio.exists():
    q_raw = embed_audio(query_audio, model, processor, device)
    q = preprocess_query(q_raw, mean)
    print(f"Query: {query_audio.name}")
    print(f"Raw norm: {np.linalg.norm(q_raw):.4f}, Preprocessed norm: {np.linalg.norm(q):.4f}")
else:
    print(f"Query audio not found: {query_audio}")

In [None]:
if query_audio.exists():
    print("Query audio:")
    display(Audio(query_audio))

In [None]:
if query_audio.exists():
    results = retrieve_top_k(q, embeddings, track_ids, k=10)
    print("Top-10 results:\n")
    for rank, (tid, score) in enumerate(results, 1):
        # dict and pandas.Series both provide .get(), so one lookup path suffices
        meta = track_to_meta.get(tid, {})
        artist = meta.get("artist", "?")
        title = meta.get("title", "?")
        audio_path = meta.get("audio_path")
        print(f"{rank:2}. {artist} - {title} ({score:.4f})")
        # Missing CSV values load as NaN (a float), so require a non-empty string path
        if isinstance(audio_path, str) and audio_path and Path(audio_path).exists():
            display(Audio(audio_path))

## Summary

This notebook demonstrates audio-to-audio retrieval using the CLAP embedding index. Query audio is embedded, centered by the database mean, and L2-normalized before computing cosine similarity against indexed tracks.

In [None]:
print(f"Index: {len(track_ids)} tracks, {embeddings.shape[1]}-d")
print(f"Model: {MODEL_ID}")