<a href="https://colab.research.google.com/github/karlbuscheck/whisper-semantic-audio-search/blob/main/whisper_semantic_audio_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Semantic Audio Search With Whisper

This notebook demonstrates a lightweight, end-to-end pipeline for searching long-form audio *by meaning*, not keywords.

### The Pipeline:
1. Transcribe a 14-minute audio clip using OpenAI’s Whisper (`small`) model  
2. Split the transcript using Whisper segments  
3. Embed each segment with a sentence transformer (`all-MiniLM-L6-v2`)  
4. Use FAISS to retrieve the most relevant moments in the audio for a natural-language query

The result is a fast, interpretable semantic search system that returns *what was said* and *exactly when it was said*.

### Models used:
- Speech-to-text: [OpenAI Whisper (small)](https://github.com/openai/whisper)
- Text embeddings: [Sentence-Transformers all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)


## Load an audio file and install dependencies

For this demo, we use a long-form audio clip generated by NotebookLM: a Deep Dive on **[Merlin](https://github.com/karlbuscheck/merlin)**, a what-if engine that forecasts YouTube video performance *before* publication. The specific content is incidental. The goal is to demonstrate how this pipeline scales to **any long-form audio** (podcasts, interviews, lectures, Zoom calls, or videos).

In [None]:
# Load the audio file
from google.colab import files
uploaded = files.upload()
filename = next(iter(uploaded))

Next, we install `ffmpeg` to handle audio decoding and `openai-whisper`, a transformer model that transcribes long-form audio into text.

In [None]:
# Install whisper + the dependencies
!apt-get install -y ffmpeg
!pip install -q openai-whisper

## Transcribe the audio file with Whisper (`whisper-small`)

We use `whisper-small` (the `small` checkpoint in the `openai-whisper` package), a transformer-based speech-to-text model that balances speed and accuracy, to transcribe the 14-minute audio file.

In [None]:
# Import whisper and transcribe the above audio clip
import whisper

# Load the `small` checkpoint (chosen for speed), then transcribe the file
model = whisper.load_model("small")
result = model.transcribe(filename)

# Display the transcript
print(result["text"])

## Embed the transcript with `all-MiniLM-L6-v2`

With the transcription in hand, we use `all-MiniLM-L6-v2`, a lightweight sentence-transformer, to embed it into a 384-dimensional vector that captures semantic meaning.

In [None]:
# Extract transcript text from Whisper result
text = result["text"]

# Load the embedding model (note: this rebinds `model`, replacing the Whisper model)
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed the text
emb = model.encode(text, convert_to_tensor=True)

# Display the shape of the embedding
print(emb.shape)

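The embedding above is of little use for search: it is a single 384-dimensional vector for the entire 14-minute transcript. By default, `all-MiniLM-L6-v2` also truncates input longer than 256 word pieces, so most of the transcript never reaches the model. To make this useful, we need to divide the transcript into digestible chunks.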

## Chunk the transcript using Whisper segments

Whisper doesn't produce a single transcript block. Instead, it segments speech into short, timestamped spans based on pauses and sentence boundaries, creating natural units of meaning that are ideal for embedding and search.

Let's extract those segments to create our meaningful embeddings.

In [None]:
# Pull the timestamped segments out of the Whisper result
segments = result["segments"]

# Extract just the text chunks
texts = [seg["text"] for seg in segments]

# Embed the transcript chunk-by-chunk
emb_seg = model.encode(texts, convert_to_tensor=True)

# Display the new embedding shape
print(emb_seg.shape)

And there it is. We have 169 semantically meaningful chunks, each represented by a 384-dimensional vector.
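
Individual Whisper segments can be short, sometimes only a few words. If that turns out to carry too little context for retrieval, one option is to merge neighboring segments into longer chunks while keeping the first start time and the last end time. The sketch below is only an illustration: `merge_segments`, `min_chars`, and the toy segments are hypothetical names and made-up data, not part of Whisper or of the pipeline above.

```python
def merge_segments(segments, min_chars=200):
    """Greedily merge consecutive segments until each chunk has at least min_chars characters."""
    merged, buffer = [], None
    for seg in segments:
        text = seg["text"].strip()
        if buffer is None:
            # Start a new chunk at this segment's start time
            buffer = {"start": seg["start"], "end": seg["end"], "text": text}
        else:
            # Extend the current chunk and push its end time forward
            buffer["end"] = seg["end"]
            buffer["text"] += " " + text
        if len(buffer["text"]) >= min_chars:
            merged.append(buffer)
            buffer = None
    if buffer is not None:
        # Keep whatever is left over, even if it is shorter than min_chars
        merged.append(buffer)
    return merged

# Toy data shaped like Whisper segments (values are made up for illustration)
toy_segments = [
    {"start": 0.0, "end": 2.0, "text": " So today"},
    {"start": 2.0, "end": 5.0, "text": " we look at the model."},
    {"start": 5.0, "end": 9.0, "text": " It predicts views."},
]
toy_chunks = merge_segments(toy_segments, min_chars=25)
for chunk in toy_chunks:
    print(chunk["start"], chunk["end"], chunk["text"])
```

Longer chunks give the embedding model more context but make the returned timestamps coarser, so any threshold is a trade-off to tune per recording.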

## Build a vector index with FAISS

We now have one embedding per Whisper segment -- as illustrated by the `emb_seg` shape of (169, 384). FAISS is a library for fast nearest-neighbor search over vectors. It lets us quickly find which transcript chunks are most semantically similar to a given query.
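
To make "nearest-neighbor search" concrete, here is a dependency-free sketch of what an exact (flat) L2 index does conceptually: compare the query against every stored vector and keep the `k` smallest squared Euclidean distances. The names `squared_l2` and `brute_force_search` and the toy vectors are hypothetical; this is not FAISS's implementation, which is heavily optimized.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_search(vectors, query, k):
    """Return (distances, indices) of the k stored vectors closest to query."""
    scored = ((squared_l2(vec, query), i) for i, vec in enumerate(vectors))
    top = heapq.nsmallest(k, scored)  # sorted by distance, smallest first
    return [d for d, _ in top], [i for _, i in top]

# Toy 2-dimensional "embeddings" (made up for illustration)
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
toy_distances, toy_indices = brute_force_search(toy_vectors, [0.9, 0.1], k=2)
print(toy_indices, toy_distances)
```

With only a few hundred segments, exhaustive search like this is cheap; approximate indexes start to matter at much larger scales.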

In [None]:
!pip install faiss-cpu

In [None]:
# Build a flat (exact) L2 index over the 384-dimensional segment embeddings
import faiss
index = faiss.IndexFlatL2(384)
index.add(emb_seg.cpu().numpy())

## Ask a question using a natural-language query

We embed a question using the same `all-MiniLM-L6-v2` model, so that the query lives in the same 384-dimensional space as the transcript chunks. Then, `index.search(..., k=5)` returns the top-k closest chunks (ranked by FAISS distance, smallest first).

In [None]:
# Run a query
query = "what did the model say about skyscrapers and log transforms?"
q_emb = model.encode([query])

distances, indices = index.search(q_emb, k=5)
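
The distances returned by `IndexFlatL2` are squared Euclidean distances, which are harder to read than cosine similarity. If the embeddings are unit-length, the two are related by `cosine = 1 - squared_distance / 2`. The sketch below checks that identity on toy vectors; `normalize` and `squared_l2_to_cosine` are hypothetical helpers, and the unit-length assumption should be verified for your embeddings (for example by inspecting their norms) before relying on the conversion.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2_to_cosine(squared_distance):
    """Convert a squared L2 distance to cosine similarity (valid for unit-length vectors only)."""
    return 1.0 - squared_distance / 2.0

# Toy vectors (made up), normalized so the identity applies
a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])

squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
cosine = sum(x * y for x, y in zip(a, b))

print(squared_l2_to_cosine(squared_distance), cosine)
```

Under that assumption, ranking by smallest L2 distance and ranking by largest cosine similarity produce the same order.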

For readability, we turn Whisper’s raw second-based timestamps into minute:second format.

In [None]:
# Turn seconds into minutes (and seconds) to improve interpretability
def format_timestamp(seconds: float) -> str:
    m = int(seconds // 60)
    s = seconds % 60
    return f"{m:02d}:{s:05.2f}"
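
For clips longer than an hour, or for values that round up to a full minute (the helper above prints `00:60.00` for `59.999`), a slightly more defensive variant may be useful. The sketch below rounds to hundredths of a second first and adds an hours field only when needed; `format_timestamp_hms` is a hypothetical name, not part of the original notebook.

```python
def format_timestamp_hms(seconds: float) -> str:
    """Format seconds as MM:SS.ss, or H:MM:SS.ss for durations of an hour or more."""
    # Work in whole centiseconds so rounding cannot produce a "60.00" seconds field
    total_cs = round(seconds * 100)
    hours, rem = divmod(total_cs, 360_000)
    minutes, cs = divmod(rem, 6_000)
    secs = cs / 100
    if hours:
        return f"{hours:d}:{minutes:02d}:{secs:05.2f}"
    return f"{minutes:02d}:{secs:05.2f}"

print(format_timestamp_hms(75.5))     # 01:15.50
print(format_timestamp_hms(59.999))   # 01:00.00
print(format_timestamp_hms(3725.25))  # 1:02:05.25
```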

In [None]:
# Retrieve timestamps + text
for idx in indices[0]:
    seg = segments[idx]
    start = format_timestamp(seg["start"])
    end = format_timestamp(seg["end"])
    print(f"[{start} → {end}] {seg['text']}")
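
If the results need to be reused rather than just printed, it can help to collect them into plain dictionaries alongside their distances. The sketch below assumes the `(1, k)` array layout returned by `index.search` for a single query and skips `-1` entries, which FAISS uses as padding when fewer than `k` results are available. `collect_hits` and the toy inputs are hypothetical and made up for illustration.

```python
def collect_hits(distances, indices, segments):
    """Pair each retrieved segment with its distance, skipping FAISS's -1 padding."""
    hits = []
    for dist, idx in zip(distances[0], indices[0]):
        if idx < 0:
            # No result in this slot (fewer than k vectors in the index)
            continue
        seg = segments[int(idx)]
        hits.append({
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"].strip(),  # segment text often carries a leading space
            "distance": float(dist),
        })
    return hits

# Toy stand-ins for `segments` and the arrays returned by `index.search`
toy_segments = [
    {"start": 0.0, "end": 4.2, "text": " first toy segment"},
    {"start": 4.2, "end": 9.0, "text": " second toy segment"},
]
toy_hits = collect_hits([[0.31, 0.87, 3.4e38]], [[1, 0, -1]], toy_segments)
for hit in toy_hits:
    print(hit)
```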

In under 100 lines of code, we’ve built a lightweight end-to-end system that transcribes long-form audio, embeds it into a semantic space, and makes it searchable (with precise timestamps!) via natural language.

This architecture is generalizable to all sorts of audio-first knowledge retrieval tasks, from searching meeting notes to YouTube videos -- any long-form audio where *meaning* matters more than exact words.