# Multimodal Speech, Slide Summarisation and Semantic Search

This notebook documents the end-to-end pipeline used in this project to transform a
video (for example, a long technical talk) into:

- a time-stamped transcript generated from the audio track,
- text extracted from the slides that appear in the video,
- an aligned representation that links speech segments to the most relevant slide text,
- one global abstractive summary suitable for fast review,
- a semantic search index over the aligned segments, ready to be used in a RAG-style workflow.

## Project configuration

The implementation is intentionally modular. Each processing stage is implemented in a separate
module and uses a small number of typed data structures to pass information to the
next stage. The notebook provides a thin orchestration layer around these modules and can be
used as both documentation and a reproducible demonstration of the pipeline.

The pipeline has five core stages, each implemented by a dedicated Python module under `src/`:

1. **Ingest (`ingest.py`)**: Inspect the input video container, extract the raw audio track, and sample frames at regular intervals using `moviepy` and `PIL`.

2. **ASR (`asr.py`)**: Run automatic speech recognition (`faster-whisper`) on the audio track to obtain a time-stamped transcript, returned as a list of `TranscriptSegment` objects with start and end times (in seconds) and the recognised text.

3. **OCR (`ocr.py`)**: Run OCR with `pytesseract` on the sampled frames and return one `OCRRecord` per frame, containing the frame name, approximate timestamp, and the recognised text.

4. **Alignment (`align.py`)**: Align the ASR and OCR streams and produce a list of `Segment` objects that carry both speech and slide text, with an explicit link to the frame used.

5. **Summarisation (`summarise.py`)**: Use a transformer model from `transformers` to compress the concatenated segment texts into a short abstractive global summary.

On top of these, an additional stage builds an embedding-based semantic index:

6. **Embeddings and semantic search (`embeddings.py`, `utils.py`)**: Compute sentence embeddings for each aligned segment, construct a `SemanticIndex`, and run nearest-neighbour queries for interactive exploration or retrieval-augmented generation.

In addition, `models.py` defines the data structures exchanged between stages and helper functions for serialisation into JSON. Keeping these concerns separated makes it straightforward to reuse the notebook with alternative implementations (for example a different ASR backend or a different summarisation model) by changing only the internals of the corresponding module.

In [1]:
import os, sys, warnings
warnings.filterwarnings("ignore", message=".*IProgress not found.*")

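# Assumes the notebook is run from a direct subdirectory of the project root.
nb_root = os.getcwd()
project_root = os.path.dirname(nb_root)
src_dir = os.path.join(project_root, "src")

# Make the `src` package importable without adding duplicate entries on re-runs.
if project_root not in sys.path:
    sys.path.append(project_root)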

from src.ingest import inspect_video, extract_audio, extract_frames
from src.asr import run_asr, preview_transcript
from src.ocr import run_ocr_on_frames, preview_ocr
from src.align import align_transcript_and_ocr, preview_segments
from src.summarise import summarise_segments

from src.embeddings import EmbeddingConfig, load_embedding_model, build_index_from_output_dir
from src.utils import (
    load_segments,
    pretty_print_results,
    semantic_search,
    get_segment_time_range,
    get_segment_text,
)

## Data model and JSON serialisation

The pipeline exchanges data between stages using a small set of dataclasses defined in
`src/models.py`:

- `TranscriptSegment` with fields such as `start`, `end`, `text`.
- `OCRRecord` with fields `time`, `frame`, `text`.
- `Segment` that combines speech and slide information with fields such as
  `start`, `end`, `speech`, `slide_text`, `slide_time`, `slide_frame`.

Each dataclass exposes:

- a `to_dict` method,
- a `from_dict` static constructor,

and there are helper functions such as `transcript_to_jsonable`, `ocr_to_jsonable` and
`segments_to_jsonable` that convert lists of objects to standard lists of dictionaries.
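
As an illustration, a dataclass following this pattern could look like the sketch below. The field names come from the list above; the method bodies and the body of `transcript_to_jsonable` are assumptions made for this example and are not copied from `src/models.py`.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass
class TranscriptSegment:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    text: str     # recognised text

    def to_dict(self) -> Dict[str, Any]:
        # asdict() turns the dataclass into a plain dictionary.
        return asdict(self)

    @staticmethod
    def from_dict(data: Dict[str, Any]) -> "TranscriptSegment":
        return TranscriptSegment(
            start=float(data["start"]),
            end=float(data["end"]),
            text=str(data["text"]),
        )


def transcript_to_jsonable(segments: List[TranscriptSegment]) -> List[Dict[str, Any]]:
    """Convert a list of segments into plain dictionaries that json.dumps accepts."""
    return [seg.to_dict() for seg in segments]


# Round trip through a JSON string: objects -> dicts -> text -> dicts -> objects.
original = [TranscriptSegment(0.0, 6.04, "This is a 3.")]
payload = json.dumps(transcript_to_jsonable(original))
restored = [TranscriptSegment.from_dict(d) for d in json.loads(payload)]
```

The round trip at the end shows why the `to_dict` / `from_dict` pair is useful: the restored objects compare equal to the originals, so a stage can be resumed from its JSON file.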

This design means that all intermediate results can be stored as human readable JSON files:

- `transcript.json` is a list of transcript segments in chronological order.
- `ocr.json` is a list of OCR records corresponding to processed frames.
- `segments.json` is a list of aligned multimodal segments.
- `summary.json` holds the structured `VideoSummary` object written by the summarisation stage.

Persisting these artefacts has several practical benefits:

- experiments can be resumed from intermediate stages without recomputing heavy steps
  such as ASR and OCR,
- debugging is easier because each stage can be inspected individually,
- training data for downstream models can be built from these JSON files without requiring
  access to the original video or audio.

In [2]:
# Root data directories
data_dir = os.path.join(project_root, "data")
interim_dir = os.path.join(data_dir, "interim")

# Input video to analyse (update this to your own file)
video_path = os.path.join(data_dir, "input.mp4")

# Intermediate artefacts
frame_dir = os.path.join(interim_dir, "frames")
audio_path = os.path.join(interim_dir, "audio.wav")
transcript_path = os.path.join(interim_dir, "transcript.json")
ocr_output_path = os.path.join(interim_dir, "ocr.json")
segments_path = os.path.join(interim_dir, "segments.json")

for d in [data_dir, interim_dir, frame_dir]:
    os.makedirs(d, exist_ok=True)

# Frame extraction and OCR
frame_interval_seconds = 3  # distance between extracted frames in seconds
ocr_frame_stride = 2        # process every Nth frame for OCR to reduce cost

# ASR configuration for faster-whisper
asr_model_size = "small"       # for example: "tiny", "base", "small", "medium", "large-v2"
asr_device = "cuda"            # "cuda" or "cpu"
asr_compute_type = "int8"      # quantisation used by faster-whisper, see its documentation

# Summarisation configuration
summary_model_name = "facebook/bart-large-cnn"
summary_device = 0            # GPU index, or -1 for CPU
summary_max_chars = 3000      # max number of characters per chunk of transcript
summary_max_length = 500      # max number of tokens in the summary of one chunk
summary_min_length = 40       # min number of tokens in the summary of one chunk

The cell above defines the paths and stage parameters used throughout the notebook and creates the expected directory structure if it does not already exist.

The convention used here is:

- input video files live under `data/`,
- heavy intermediate artefacts are stored in `data/interim/`,
- additional cleaned datasets, if any, can be stored in `data/processed/` by other notebooks.

The frame directory is created under `data/interim/frames/`. It acts both as a cache for already
extracted frames and as the input source for the OCR stage.

## 1. Inspect input video

The `inspect_video` helper uses `moviepy` to open the input file and returns basic information,
such as:

- file name and path,
- file size (MB),
- duration (seconds and minutes),
- frame rate.

This serves mainly as a sanity check that the file is readable and helps to choose reasonable
values for frame sampling (for example, how many frames will be produced with the current
`frame_interval_seconds` setting).
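
The expected number of frames can be estimated before running the extraction. The helper below is a hypothetical sketch (it is not part of `src/ingest.py`) and assumes that frames are sampled at t = 0, interval, 2 × interval and so on; the real `extract_frames` may round differently.

```python
import math


def planned_frame_count(duration_seconds: float, interval_seconds: float) -> int:
    """Count the sampling instants 0, interval, 2*interval, ... that fit in the clip."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # floor() counts the full intervals; the +1 accounts for the frame at t = 0.
    return math.floor(duration_seconds / interval_seconds) + 1


# Duration of the example video and the interval configured above.
n_frames = planned_frame_count(1075.39, 3)
```

For the example video (1075.39 s) and a 3 s interval this gives 359 frames, which matches the count reported by `extract_frames` in the next section.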

In [3]:
inspect_video(str(video_path))

{'filename': 'input.mp4',
 'path': 'C:\\Users\\kevin\\OneDrive\\Documents\\Work\\Python\\NLP-Videos\\data\\input.mp4',
 'size_mb': 93.71,
 'duration_seconds': 1075.39,
 'duration_minutes': 17.92,
 'fps': 60.0}

## 2. Extract audio and frames

The ingest stage performs two operations:

1. **Audio extraction**: The `extract_audio` function opens the input video with `moviepy`, extracts the audio track and writes it to a `wav` file. If the file already exists on disk, the extraction is skipped. This audio file is the only input required by the ASR stage.

2. **Frame sampling**: The `extract_frames` function iterates over the video at a fixed temporal interval (`frame_interval_seconds`) and saves each sampled frame to the `frame_dir` folder as a JPEG image. File names encode the zero-padded frame index (for example `frame_00040.jpg`, which corresponds to t ≈ 120 s at the 3 s interval used here).

The function also exposes a light preview mode that prints how many frames were written
and lists the first few frame names. This is useful to verify that the sampling frequency
is appropriate for the type of video being processed.

Both steps are idempotent: running the cell multiple times is safe and will not corrupt
the existing artefacts.
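
The "skip if already on disk" behaviour can be sketched with a small guard function. `ensure_artifact` and `write_placeholder` are hypothetical names introduced for this example; how `src/ingest.py` actually implements the check is not shown in this notebook.

```python
import os
import tempfile
from typing import Callable


def ensure_artifact(path: str, produce: Callable[[str], None]) -> bool:
    """Run produce(path) only if path does not exist yet. Return True if it ran."""
    if os.path.exists(path):
        print(f"Artefact already exists: {path}")
        return False
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    produce(path)
    return True


def write_placeholder(path: str) -> None:
    # Stand-in for an expensive step such as audio extraction.
    with open(path, "wb") as handle:
        handle.write(b"placeholder")


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "interim", "audio.wav")
    first_run = ensure_artifact(target, write_placeholder)   # produces the file
    second_run = ensure_artifact(target, write_placeholder)  # skipped, file exists
```

Deleting the artefact by hand is then the simplest way to force a stage to run again.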

In [4]:
extract_audio(str(video_path), str(audio_path))

extract_frames(
    video_path=str(video_path),
    frame_dir=str(frame_dir),
    interval_seconds=frame_interval_seconds,
)

Audio file already exists: C:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data\interim\audio.wav
Planned number of frames: 359
Saving frames to: C:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data\interim\frames


100%|████████████████████████████████████████████████████████████████████████████████| 359/359 [00:54<00:00,  6.55it/s]

Audio present: True
Number of frame files: 359
First few frame files:
 - frame_00000.jpg
 - frame_00001.jpg
 - frame_00002.jpg
 - frame_00003.jpg
 - frame_00004.jpg





## 3. Automatic speech recognition (ASR)

The ASR stage converts raw audio into a list of time-stamped transcript segments. The
implementation in `src/asr.py` is based on `faster-whisper`, a reimplementation of OpenAI's
Whisper models on top of the CTranslate2 inference engine that is designed to be fast and
memory efficient.

Main characteristics of `run_asr`:

- Loads the specified Whisper checkpoint (for example `"small"` or `"medium"`).
- Runs inference on the `audio_path` in streaming mode and collects the segments emitted by
  Whisper. Each segment has a start time, an end time and a text field.
- Wraps each raw segment into a `TranscriptSegment` dataclass instance.
- Writes the full list of segments to `transcript_path` as JSON, using the helper
  `transcript_to_jsonable` defined in `models.py`.
- Prints the number of segments produced and the location of the JSON file.

The choice of `asr_model_size`, `asr_device` and `asr_compute_type` controls the trade-off
between speed, resource usage and recognition quality.

The helper `preview_transcript` is provided as a light inspection tool. It prints the first `n`
segments with human readable timestamps and is designed to be called from notebooks.
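
A minimal sketch of this flow with `faster-whisper` is shown below. `transcribe` and `format_segment` are hypothetical names and the local `TranscriptSegment` is a stand-in for the one in `src/models.py`; the real `run_asr` additionally handles caching and JSON output. The library import is done lazily so that the formatting helper can be used without `faster-whisper` installed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TranscriptSegment:
    start: float
    end: float
    text: str


def transcribe(audio_path: str, model_size: str = "small",
               device: str = "cpu", compute_type: str = "int8") -> List[TranscriptSegment]:
    """Run faster-whisper on an audio file and collect the emitted segments."""
    from faster_whisper import WhisperModel  # lazy import, only needed here

    model = WhisperModel(model_size, device=device, compute_type=compute_type)
    # transcribe() returns a lazy generator of segments plus an info object;
    # decoding only happens while the generator is consumed.
    raw_segments, _info = model.transcribe(audio_path)
    return [TranscriptSegment(s.start, s.end, s.text.strip()) for s in raw_segments]


def format_segment(seg: TranscriptSegment) -> str:
    """Render one segment in the '[start -> end] text' style used by the preview."""
    return f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}"


example = format_segment(TranscriptSegment(0.0, 6.04, "This is a 3."))
```

Because the segments are produced lazily, progress reporting or early stopping can be added inside the loop that consumes them.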

In [5]:
transcript_segments = run_asr(
    audio_path=str(audio_path),
    transcript_path=str(transcript_path),
    model_size=asr_model_size,
    device=asr_device,
    compute_type=asr_compute_type,
)

preview_transcript(transcript_segments, n=10)

Transcript file already exists: C:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data\interim\transcript.json
Loaded 265 transcript segments.
Total segments: 265
[0.00 -> 6.04] This is a 3.
[6.04 -> 11.52] It's sloppily written and rendered at an extremely low resolution of 28x28 pixels, but your brain
[11.52 -> 14.34] has no trouble recognizing it as a 3.
[14.34 -> 18.52] And I want you to take a moment to appreciate how crazy it is that brains can do this so
[18.52 -> 19.52] effortlessly.
[19.52 -> 25.08] I mean, this, this, and this are also recognizable as 3s, even though the specific values
[25.08 -> 28.84] of each pixel is very different from one image to the next.
[28.84 -> 34.16] The particular light-sensitive cells in your eye that are firing when you see this 3 are
[34.16 -> 37.60] very different from the ones firing when you see this 3.
[37.60 -> 43.00] But something in that crazy smart visual cortex of yours resolves these as representing the


## 4. Optical character recognition (OCR) on frames

The OCR stage recovers slide text from the video frames sampled during ingestion. It operates
on the images stored in `frame_dir` and proceeds as follows:

- Lists all frame files that match the expected naming pattern (for example
  `frame_00040.jpg`). Each file name encodes the frame index, from which the approximate
  capture time is derived.
- Iterates over the frames in chronological order and processes only every
  `ocr_frame_stride`-th frame in order to keep the total OCR cost manageable.
- For each selected frame:
  - Opens the file with `PIL.Image`.
  - Runs OCR using `pytesseract.image_to_string` with a configuration tuned for slide text.
  - Constructs an `OCRRecord` object with the estimated timestamp (derived from the frame
    index and `frame_interval_seconds`), the frame file name and the recognised text.

The list of `OCRRecord` instances is serialised to `ocr_output_path` as JSON using the helper
`ocr_to_jsonable`. The `preview_ocr` function prints the first records in a readable form
including the approximate time and the corresponding frame file.

The downstream alignment step only relies on the timestamps and the extracted text. If OCR
quality is low for a particular video, the sampling frequency and OCR configuration can be
tuned independently of the rest of the pipeline.
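
The frame selection and timestamp estimation can be sketched as follows. `select_frames_for_ocr` and `FRAME_PATTERN` are hypothetical names; the sketch assumes that the stride is applied to the sorted list of frames and that the timestamp is the frame index multiplied by `frame_interval_seconds`, which is consistent with the preview output below.

```python
import re
from typing import Iterable, List, Tuple

# Matches names such as "frame_00002.jpg" and captures the numeric index.
FRAME_PATTERN = re.compile(r"^frame_(\d+)\.jpg$")


def select_frames_for_ocr(frame_names: Iterable[str],
                          frame_interval_seconds: float,
                          ocr_frame_stride: int) -> List[Tuple[float, str]]:
    """Return (estimated_time_seconds, frame_name) for every Nth frame, in order."""
    indexed = []
    for name in frame_names:
        match = FRAME_PATTERN.match(name)
        if match:  # silently ignore files that do not follow the pattern
            indexed.append((int(match.group(1)), name))
    indexed.sort()
    selected = indexed[::ocr_frame_stride]
    return [(float(index * frame_interval_seconds), name) for index, name in selected]


names = [f"frame_{i:05d}.jpg" for i in range(20)]
plan = select_frames_for_ocr(names, frame_interval_seconds=3, ocr_frame_stride=2)
```

Each selected frame would then be opened with `PIL.Image` and passed to `pytesseract.image_to_string`, as described above.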

In [6]:
ocr_records = run_ocr_on_frames(
    frame_dir=str(frame_dir),
    ocr_output_path=str(ocr_output_path),
    frame_interval_seconds=frame_interval_seconds,
    ocr_frame_stride=ocr_frame_stride,
)

preview_ocr(ocr_records, n=10)

OCR output already exists: C:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data\interim\ocr.json
Loaded OCR text for 180 frames.
OCR records: 180
[t ~ 0.0s] frame=frame_00000.jpg

----------------------------------------
[t ~ 6.0s] frame=frame_00002.jpg

----------------------------------------
[t ~ 12.0s] frame=frame_00004.jpg

----------------------------------------
[t ~ 18.0s] frame=frame_00006.jpg

----------------------------------------
[t ~ 24.0s] frame=frame_00008.jpg

----------------------------------------
[t ~ 30.0s] frame=frame_00010.jpg

----------------------------------------
[t ~ 36.0s] frame=frame_00012.jpg

----------------------------------------
[t ~ 42.0s] frame=frame_00014.jpg

----------------------------------------
[t ~ 48.0s] frame=frame_00016.jpg

----------------------------------------
[t ~ 54.0s] frame=frame_00018.jpg
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0

## 5. Align transcript and slide text

The goal of the alignment stage is to attach the most relevant slide text to each speech
segment. The implementation in `src/align.py` performs a simple but effective temporal
nearest-neighbour matching.

High level algorithm of `align_transcript_and_ocr`:

1. Precompute a list of OCR times (one value per `OCRRecord`).
2. For each `TranscriptSegment` in chronological order:
   - locate the index of the OCR time nearest to the segment midpoint (stored as `mid` in `segments.json`),
   - copy the corresponding slide text and frame information into a new `Segment` instance.
3. Optionally apply light post processing that merges very short segments and removes borderline
   duplicates in order to obtain smoother segments.

Each resulting `Segment` therefore contains:

- the start and end time of the speech segment,
- the speech text as produced by ASR,
- the slide text and corresponding slide timestamp,
- the name of the frame from which the slide text was extracted.

The list of segments is saved to `segments_path` as JSON to make it easy to inspect or reuse
in downstream tasks.

The helper `preview_segments` prints the first `n` aligned segments, showing both the speech
and the associated slide text.
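
The matching step can be sketched with a binary search over the sorted OCR times. This is an illustration, not the code in `src/align.py`: `nearest_index` is a hypothetical helper, and matching on the segment midpoint is an assumption inferred from the `mid` field and from the preview output of this notebook.

```python
from bisect import bisect_left
from typing import Sequence


def nearest_index(sorted_times: Sequence[float], t: float) -> int:
    """Return the index of the value in sorted_times that is closest to t."""
    if not sorted_times:
        raise ValueError("sorted_times must not be empty")
    pos = bisect_left(sorted_times, t)
    if pos == 0:
        return 0
    if pos == len(sorted_times):
        return len(sorted_times) - 1
    before, after = sorted_times[pos - 1], sorted_times[pos]
    # Ties go to the earlier frame.
    return pos if (after - t) < (t - before) else pos - 1


# OCR times for stride 2 and a 3 s interval, and the first speech spans of the transcript.
ocr_times = [0.0, 6.0, 12.0, 18.0, 24.0]
speech_spans = [(0.00, 6.04), (6.04, 11.52), (11.52, 14.34),
                (14.34, 18.52), (18.52, 19.52), (19.52, 25.08)]

matched = [ocr_times[nearest_index(ocr_times, (start + end) / 2)]
           for start, end in speech_spans]
# matched == [6.0, 6.0, 12.0, 18.0, 18.0, 24.0]
```

The binary search keeps the cost at O(log n) per segment, although for a few hundred OCR records a linear scan would be just as practical.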

In [7]:
segments = align_transcript_and_ocr(
    transcript_segments=transcript_segments,
    ocr_records=ocr_records,
    segments_path=str(segments_path),
)

preview_segments(segments, n=10)

Saved aligned multimodal segments to: C:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data\interim\segments.json
Total segments: 265
Aligned segments: 265
[0.00 -> 6.04] (slide at 6.0) speech='This is a 3....'
Slide text: 
----------------------------------------
[6.04 -> 11.52] (slide at 6.0) speech='It's sloppily written and rendered at an extremely low resol...'
Slide text: 
----------------------------------------
[11.52 -> 14.34] (slide at 12.0) speech='has no trouble recognizing it as a 3....'
Slide text: 
----------------------------------------
[14.34 -> 18.52] (slide at 18.0) speech='And I want you to take a moment to appreciate how crazy it i...'
Slide text: 
----------------------------------------
[18.52 -> 19.52] (slide at 18.0) speech='effortlessly....'
Slide text: 
----------------------------------------
[19.52 -> 25.08] (slide at 24.0) speech='I mean, this, this, and this are also recognizable as 3s, ev...'
Slide text: 
----------------------------------------

## 6. Global abstractive summary

The last core stage compresses the aligned transcript into a short textual summary. The function
`summarise_segments` in `src/summarise.py` uses the following strategy:

1. Concatenate the `speech` fields of all segments into a single long transcript string.
2. Split this string into chunks using `_chunk_text`, which respects sentence boundaries
   when possible and ensures that each chunk does not exceed `max_chunk_chars` characters.
3. Instantiate a Hugging Face `pipeline` with a summarisation model (by default
   `"facebook/bart-large-cnn"`) on the requested device (GPU index or CPU).
4. For each chunk:
   - call the summariser with `max_length` and `min_length` constraints,
   - collect the resulting partial summary text.
5. Concatenate all chunk level summaries into one global summary string.

The chunking mechanism is important for two reasons:

- it allows the use of standard encoder-decoder models with relatively short maximum input
  lengths on arbitrarily long transcripts,
- it constrains the worst case memory usage during generation.

`summarise_segments` prints the global summary to standard output and writes a structured
`VideoSummary` object as `summary.json` under the output directory.
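
The chunking step can be illustrated with a greedy, sentence-aware splitter. The function below is a sketch under stated assumptions and not the actual `_chunk_text`: sentences are detected with a simple punctuation rule, and a single sentence longer than the limit is cut at the character level.

```python
import re
from typing import List


def chunk_text(text: str, max_chars: int) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    # Split after ., ! or ? followed by whitespace (a deliberately simple heuristic).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # Fallback: a sentence that is too long on its own is hard-split.
        while len(sentence) > max_chars:
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks


chunks = chunk_text("One two. Three four. Five six.", max_chars=20)
```

Each chunk would then be passed to the summariser separately, and the partial summaries joined in order.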

In [8]:
summary = summarise_segments(
    segments=segments,
    model_name=summary_model_name,
    device=summary_device,
    max_chunk_chars=summary_max_chars,
    max_length=summary_max_length,
    min_length=summary_min_length,
    video_path=video_path,
    output_dir=interim_dir,
    summary_filename="summary.json",
)

Total transcript length (characters): 17906
Number of chunks for summarisation: 7


Device set to use cuda:0


Summarising chunk 1/7 …
Summarising chunk 2/7 …
Summarising chunk 3/7 …
Summarising chunk 4/7 …
Summarising chunk 5/7 …
Summarising chunk 6/7 …


Your max_length is set to 500, but your input_length is only 57. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=28)


Summarising chunk 7/7 …

=== GLOBAL SUMMARY ===

In this video, we look at how a neural network can learn to recognize handwritten digits. In the next, we'll look at the structure component of that. At the end of the two videos, I want to point you to a couple of good resources where you can learn more.
The network starts with a bunch of neurons corresponding to each of the 28 times 28 pixels of the input image, which is 784 neurons in total. The activation in these neurons, again some number that's between 0 and 1, represents how much the system thinks that a given image corresponds with a given digit.
The goal is to have some mechanism that could conceivably combine pixels into edges, or edges into patterns, or patterns into digits. In a perfect world, we might hope that each neuron in the second to last layer of the network corresponds with one of these subcomponents.
The question at hand is, what parameters should the network have? What dials and knobs should you be able to tweak s

## 7. Segment embeddings and semantic search

At this point, the pipeline has produced `segments.json` and `summary.json` under `data/interim/`.
The segments file contains the aligned multimodal units that combine speech and slide text.

The next step is to build a sentence embedding index over these segments so that free form
queries can retrieve the most relevant parts of the talk. This section turns the aligned segments
into a small RAG ready store using `sentence-transformers` and the `SemanticIndex` defined in
`src/utils.py`.

### 7.1 Inspect the segments file

Inspecting a few raw segments clarifies the structure, for example:

- which keys hold the transcript text,
- how timestamps are represented,
- whether slide text or OCR text is present.

The semantic search index uses a combination of such text fields to build embeddings.

In [9]:
segments_records = load_segments(segments_path)
print(f"Loaded {len(segments_records)} segments")

for i, seg in enumerate(segments_records[:3]):
    print(f"--- Segment {i} ---")
    for key, value in seg.items():
        if isinstance(value, str) and len(value) > 120:
            value_display = value[:120] + "..."
        else:
            value_display = value
        print(f"{key}: {value_display}")
    print()

Loaded 265 segments
--- Segment 0 ---
start: 0.0
end: 6.04
mid: 3.02
speech: This is a 3.
slide_text: 
slide_time: 6.0
slide_frame: frame_00002.jpg

--- Segment 1 ---
start: 6.04
end: 11.52
mid: 8.78
speech: It's sloppily written and rendered at an extremely low resolution of 28x28 pixels, but your brain
slide_text: 
slide_time: 6.0
slide_frame: frame_00002.jpg

--- Segment 2 ---
start: 11.52
end: 14.34
mid: 12.93
speech: has no trouble recognizing it as a 3.
slide_text: 
slide_time: 12.0
slide_frame: frame_00004.jpg



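In this run every record exposes `start`, `end`, `mid`, `speech`, `slide_text`, `slide_time` and
`slide_frame`. Other pipelines or schema versions might instead use keys such as:

- `segment_id` or `id`,
- `start_sec` and `end_sec` in place of `start` and `end`,
- `text`, `transcript`, or `asr_text` for the speech text,
- `ocr_text` in place of `slide_text`.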

The helper function `build_segment_text` (used internally when computing embeddings) already
handles common combinations of these fields, so small schema changes generally do not require
modifications to this notebook.
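
A helper of this kind could look roughly like the sketch below. `combine_segment_text` and the two key tuples are hypothetical and only illustrate the fallback idea; the `[SLIDE]` separator mirrors the format visible in the search results later in this notebook, but the real `build_segment_text` may differ in its details.

```python
from typing import Any, Dict, Sequence

# Candidate keys, checked in order of preference (assumed for this sketch).
SPEECH_KEYS = ("speech", "text", "transcript", "asr_text")
SLIDE_KEYS = ("slide_text", "ocr_text")


def first_non_empty(record: Dict[str, Any], keys: Sequence[str]) -> str:
    """Return the first non-blank string found under any of the given keys."""
    for key in keys:
        value = record.get(key)
        if isinstance(value, str) and value.strip():
            return value.strip()
    return ""


def combine_segment_text(record: Dict[str, Any]) -> str:
    """Join speech and slide text into the single string that gets embedded."""
    speech = first_non_empty(record, SPEECH_KEYS)
    slide = first_non_empty(record, SLIDE_KEYS)
    if speech and slide:
        return f"{speech}\n\n[SLIDE]\n{slide}"
    return speech or slide


speech_only = combine_segment_text({"speech": "This is a 3.", "slide_text": ""})
with_slide = combine_segment_text({"asr_text": "Hello", "ocr_text": "Sigmoid"})
```

Segments with empty slide text, such as the first three shown above, then contribute only their speech to the embedding.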

### 7.2 Configure and load the embedding model

`EmbeddingConfig` controls aspects such as:

- the sentence-transformers model to use,
- batch size,
- device (`"cpu"` or `"cuda"`),
- whether to apply L2 normalisation to embeddings,
- the name of the embedding file written in the output directory.

In [10]:
config = EmbeddingConfig(
    model_name="sentence-transformers/all-MiniLM-L12-v2",
    batch_size=32,
    device="cuda",  # set to "cpu" if no GPU is available
    normalize_embeddings=True,
    embeddings_filename="segment_embeddings.npy",
)

print(config)

embed_model = load_embedding_model(config)
print(f"Loaded embedding model: {config.model_name}")
print(f"Embedding dimension: {embed_model.get_sentence_embedding_dimension()}")

EmbeddingConfig(model_name='sentence-transformers/all-MiniLM-L12-v2', batch_size=32, device='cuda', normalize_embeddings=True, embeddings_filename='segment_embeddings.npy')
Loaded embedding model: sentence-transformers/all-MiniLM-L12-v2
Embedding dimension: 384


### 7.3 Build or load the semantic search index

The helper `build_index_from_output_dir` performs the following steps:

1. Loads `segments.json` from the specified output directory.
2. Loads `segment_embeddings.npy` if present.
3. Otherwise computes embeddings and saves them to `segment_embeddings.npy`.
4. Builds an in memory nearest neighbour index over the embeddings using `SemanticIndex`.

After this step, semantic search queries operate directly on the in memory index.

In [11]:
index = build_index_from_output_dir(output_dir=interim_dir, config=config, model=embed_model)

print("Semantic search index ready.")
print(f"Number of segments indexed: {len(index.segments)}")

Semantic search index ready.
Number of segments indexed: 265


The raw embeddings array is available on disk and can be inspected directly if required.

In [12]:
import numpy as np

embeddings_path = os.path.join(interim_dir, config.embeddings_filename)
embeddings = np.load(embeddings_path)

print(f"Embeddings shape: {embeddings.shape}")
print("Sample embedding vector (first segment):")
print(embeddings[0][:10])

Embeddings shape: (265, 384)
Sample embedding vector (first segment):
[-0.03239828  0.01507453  0.06563838 -0.03154001  0.05292244 -0.04820523
  0.0812026  -0.00355402  0.05419927 -0.03468784]


### 7.4 Run example search queries

Free form queries are now evaluated against the indexed segments:

- the query is embedded with the same sentence-transformers model,
- nearest neighbours are retrieved in the embedding space,
- results are returned with similarity scores and segment metadata.
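
The retrieval step can be sketched in plain Python on toy vectors. This assumes cosine similarity as the scoring function, which is what a dot product yields for L2-normalised embeddings; the functions `l2_normalise` and `top_k` are hypothetical, and the real `SemanticIndex` would more likely operate on the NumPy array of stored embeddings.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalise(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def top_k(query_vec: Sequence[float],
          segment_vecs: Sequence[Sequence[float]],
          k: int = 3) -> List[Tuple[int, float]]:
    """Return (segment_index, cosine_similarity) for the k best matches."""
    q = l2_normalise(query_vec)
    scored = []
    for idx, vec in enumerate(segment_vecs):
        v = l2_normalise(vec)
        scored.append((idx, sum(a * b for a, b in zip(q, v))))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Three toy "segment embeddings" and a query that points mostly along the first axis.
toy_segments = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([1.0, 0.1], toy_segments, k=2)
```

Segments whose combined text is identical, for instance because the same slide text dominates the embedding, receive very similar scores, which explains the repeated score values in the outputs below.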

In [18]:
semantic_search(index, embed_model,
                "How does the network recognize handwritten digits?")

Query: How does the network recognize handwritten digits?

Rank 1 | score=0.429 | 89.32 s → 93.44 s | id=[no id]
feel like you know what it means when you read or hear about a neural network quote

[SLIDE]
aji1 = 0(Wa; + 7) Machine learning
O) Neural network

ERA BN

SIM SE WEA

KS CPS
7

ARS
SAW,

N

eo
Ea

MSN

SK

Why the layers?
--------------------------------------------------------------------------------
Rank 2 | score=0.429 | 85.04 s → 89.32 s | id=[no id]
My hope is just that you come away feeling like the structure itself is motivated, and to

[SLIDE]
aji1 = 0(Wa; + 7) Machine learning
O) Neural network

ERA BN

SIM SE WEA

KS CPS
7

ARS
SAW,

N

eo
Ea

MSN

SK

Why the layers?
--------------------------------------------------------------------------------
Rank 3 | score=0.427 | 169.04 s → 172.60 s | id=[no id]
What are the neurons, and in what sense are they linked together?

[SLIDE]
Neural network

What are

the me uc so:
--------------------------------------------------

In [19]:
semantic_search(index, embed_model,
                "What is the role of the hidden layer in this example?")

Query: What is the role of the hidden layer in this example?

Rank 1 | score=0.497 | 85.04 s → 89.32 s | id=[no id]
My hope is just that you come away feeling like the structure itself is motivated, and to

[SLIDE]
aji1 = 0(Wa; + 7) Machine learning
O) Neural network

ERA BN

SIM SE WEA

KS CPS
7

ARS
SAW,

N

eo
Ea

MSN

SK

Why the layers?
--------------------------------------------------------------------------------
Rank 2 | score=0.497 | 89.32 s → 93.44 s | id=[no id]
feel like you know what it means when you read or hear about a neural network quote

[SLIDE]
aji1 = 0(Wa; + 7) Machine learning
O) Neural network

ERA BN

SIM SE WEA

KS CPS
7

ARS
SAW,

N

eo
Ea

MSN

SK

Why the layers?
--------------------------------------------------------------------------------
Rank 3 | score=0.436 | 278.88 s → 283.52 s | id=[no id]
And of course the heart of the network, as an information processing mechanism, comes

[SLIDE]
“Hidden layers”

784

EAS
Sra
LAL SES
gn Oe

O24

LLG
My)
LE

EY
'S

In [20]:
semantic_search(index, embed_model,
                "How many parameters does the network have?")


Query: How many parameters does the network have?

Rank 1 | score=0.468 | 947.72 s → 951.96 s | id=[no id]
It's an absurdly complicated function, one that involves 13,000 parameters

[SLIDE]
Network

|
Function

NuUuMbpE
--------------------------------------------------------------------------------
Rank 2 | score=0.446 | 965.72 s → 969.96 s | id=[no id]
looks complicated. I mean, if it were any simpler, what hope would we have that it

[SLIDE]
Network

|
Function

o
(ay
a
<
co
a)
a
Lo
--------------------------------------------------------------------------------
Rank 3 | score=0.446 | 960.84 s → 965.72 s | id=[no id]
But it's just a function nonetheless. And in a way, it's kind of reassuring that it

[SLIDE]
Network

|
Function

o
(ay
a
<
co
a)
a
Lo
--------------------------------------------------------------------------------
Rank 4 | score=0.427 | 540.36 s → 545.24 s | id=[no id]
to potentially capture this pattern, or any other pixel pattern, or the pattern

[SLIDE]
What param

In [21]:
semantic_search(index, embed_model,
                "Why is the sigmoid activation function used?")

Query: Why is the sigmoid activation function used?

Rank 1 | score=0.657 | 841.96 s → 845.92 s | id=[no id]
the first layer, according to these weights, corresponds to one of the terms

[SLIDE]
Sigmoid
--------------------------------------------------------------------------------
Rank 2 | score=0.657 | 649.48 s → 654.60 s | id=[no id]
So the activation of the neuron here is basically a measure of how positive

[SLIDE]
Sigmoid
--------------------------------------------------------------------------------
Rank 3 | score=0.657 | 1059.48 s → 1065.00 s | id=[no id]
simplification. Using sigmoids didn't help training or it was very difficult to train at some point,

[SLIDE]
Sigmoid
--------------------------------------------------------------------------------
Rank 4 | score=0.657 | 654.60 s → 657.84 s | id=[no id]
the relevant weighted sum is.

[SLIDE]
Sigmoid
--------------------------------------------------------------------------------
Rank 5 | score=0.570 | 632.56 s → 635.84 s | 

### 7.5 Using search results programmatically

In a RAG setup, retrieved segments usually serve as context for downstream question answering.
The `search` method returns `SearchResult` instances that carry both similarity metadata and the
original segment payload.

The example below extracts timestamps and segment identifiers for further processing.

In [22]:
query = "How is audio aligned with slide text in the system?"
results = index.search(query=query, model=embed_model, top_k=3)

print(f"Raw result fields: {list(vars(results[0]).keys())}\n")

for r in results:
    seg = r.segment            # dict record
    score = r.score            # float
    rank = r.rank              # int

    # Nested defaults instead of `or`, so that a valid id of 0 is not reported as missing.
    seg_id = seg.get("segment_id", seg.get("id", "[no id]"))
    time_range = get_segment_time_range(seg)
    text = get_segment_text(seg)

    snippet = text[:200] + ("..." if len(text) > 200 else "")
    print(f"Rank = {rank}, segment id = {seg_id}, time = {time_range}, score = {score:.3f}")
    print(snippet)
    print()

Raw result fields: ['rank', 'score', 'text', 'segment', 'extra']

Rank = 1, segment id = [no id], time = 487.92 s → 493.08 s, score = 0.461
Parsing speech, for example, involves taking raw audio and picking out distinct sounds

[SLIDE]
Raw audio

Rank = 2, segment id = [no id], time = 493.08 s → 497.52 s, score = 0.388
which combine to make certain syllables, which combine to form words, which combine

[SLIDE]
t-te —> recognition — re-cog-ni-tion — recognition

Raw audio

titmt

Rank = 3, segment id = [no id], time = 497.52 s → 501.32 s, score = 0.388
to make up phrases and more abstract thoughts, etc.

[SLIDE]
t-te —> recognition — re-cog-ni-tion — recognition

Raw audio

titmt



The resulting structures typically contain enough information to:

- link back to a video player through timestamps and segment identifiers,
- pass relevant text fields to an LLM along with the user question,
- build citation objects of the form `{ "segment_id": ..., "start_sec": ..., "end_sec": ... }`.

This completes a basic bridge between the multimodal segment pipeline and a retrieval-augmented
question answering system.
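
As a sketch of that bridge, the helpers below turn retrieved segment records into citation objects and into a plain-text context block for an LLM prompt. `build_citation` and `build_context` are hypothetical names, the key fallbacks are assumptions, and the sketch expects every record to carry numeric timestamps.

```python
from typing import Any, Dict, List, Optional, Sequence


def _first_present(record: Dict[str, Any], keys: Sequence[str]) -> Optional[Any]:
    """Return the first value that exists and is not None under the given keys."""
    for key in keys:
        if record.get(key) is not None:
            return record[key]
    return None


def build_citation(segment: Dict[str, Any]) -> Dict[str, Any]:
    """Map a segment record onto the citation shape described above."""
    return {
        "segment_id": _first_present(segment, ("segment_id", "id")),
        "start_sec": _first_present(segment, ("start_sec", "start")),
        "end_sec": _first_present(segment, ("end_sec", "end")),
    }


def build_context(segments: List[Dict[str, Any]]) -> str:
    """Render retrieved segments as time-prefixed lines for a prompt."""
    lines = []
    for seg in segments:
        cite = build_citation(seg)
        lines.append(
            f"[{cite['start_sec']:.2f}s-{cite['end_sec']:.2f}s] {seg.get('speech', '')}"
        )
    return "\n".join(lines)


retrieved = [{"start": 487.92, "end": 493.08, "speech": "Parsing speech, for example"}]
citation = build_citation(retrieved[0])
context = build_context(retrieved)
```

With the records from this run, `segment_id` is `None` because the alignment stage does not assign identifiers; the time range alone is enough to seek to the right position in a video player.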

## Conclusion

This notebook demonstrates the complete multimodal pipeline, starting from a raw
video file and producing:

- an audio transcript derived from the video soundtrack,
- slide text extracted from sampled frames,
- aligned multimodal segments that combine both modalities,
- an abstractive summary suitable for fast review or indexing,
- a semantic search index that turns these segments into a compact, queryable knowledge base.

All heavy computation is delegated to the reusable modules under `src/`, while the notebook
focuses on configuration, orchestration, and interactive exploration. This separation permits
straightforward extension of the project with additional experiments, visualisations or
alternative models in dedicated notebooks.