# Multimodal Speech and Slide Summarisation

This notebook documents the end-to-end pipeline used in this project to transform a teaching
video (or any long technical talk) into:

- a timestamped transcript generated from the audio track,
- text extracted from the slides that appear in the video,
- an aligned representation that links speech segments to the most relevant slide text,
- one global abstractive summary suitable for fast review.

The implementation is intentionally modular. Each processing stage is implemented in a separate
module under `src/` and uses a small number of typed data structures to pass information to the
next stage. The notebook provides a thin orchestration layer around these modules and can be
used as both documentation and a reproducible demonstration of the pipeline.

The pipeline has five main stages:

1. **Ingest**: inspect the video container and extract the raw audio track and sampled frames.
2. **ASR**: run automatic speech recognition (ASR) on the audio track to obtain a
   timestamped transcript.
3. **OCR**: run optical character recognition (OCR) on sampled frames to recover slide text.
4. **Alignment**: match each speech segment with the temporally closest slide text and build
   rich multimodal segments.
5. **Summarisation**: compress the entire transcript into a short abstractive summary with a
   transformer-based text summarisation model.


## Project structure

A typical layout for this repository is:

```text
project_root/
  data/
    interim/    # heavy intermediate artefacts (frames, transcripts, OCR output)
    processed/  # cleaned / final data used for modelling (optional)
  src/
    __init__.py
    ingest.py
    asr.py
    ocr.py
    align.py
    summarise.py
    models.py
  notebooks/
    0_introduction.ipynb

```

This introduction notebook lives in the `notebooks/` folder and provides a compact but
technically detailed walkthrough of the full multimodal pipeline. More specialised notebooks
can then focus on exploratory data analysis, model comparison or MLOps concerns without
repeating the core logic.


## Environment and imports

The following cell tries to infer the project root from the current working directory. The
assumption is that:

- you open the notebook from inside the `notebooks/` directory, or
- you open it from the project root.

In both cases we derive the root path and register it on `sys.path` so that the `src.*` modules
can be imported without any additional configuration. If your tree is different, you only need
to adjust the definition of `project_root`.


In [1]:
import os
import sys

# Current notebook directory
nb_root = os.getcwd()

# Infer project root from the notebook location
if os.path.basename(nb_root) == "notebooks":
    project_root = os.path.dirname(nb_root)
else:
    project_root = nb_root

# Conventional source directory
src_dir = os.path.join(project_root, "src")

# Register project root on sys.path so that `import src.*` works in-place
if project_root not in sys.path:
    sys.path.append(project_root)

print(f"Notebook directory: {nb_root}")
print(f"Project root:       {project_root}")
print(f"Source directory:   {src_dir}")

if not os.path.exists(src_dir):
    print("Warning: src directory not found. Adjust project_root if needed.")


Notebook directory: C:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\notebooks
Project root:       C:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos
Source directory:   C:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\src


## Import project modules

The core functionality of the pipeline is implemented as plain Python modules under `src/`:

- `ingest.py` exposes utilities to inspect the input video, extract its audio track and sample
  frames at regular intervals using `moviepy` and `PIL`.
- `asr.py` wraps a `faster-whisper` model and returns a list of `TranscriptSegment` objects
  with precise start and end times (seconds) and the recognised text.
- `ocr.py` runs OCR with `pytesseract` on sampled frames and returns one `OCRRecord` per frame,
  containing the frame name, the approximate timestamp and the recognised text.
- `align.py` aligns the two previous streams and produces a list of `Segment` objects that
  carry both speech and slide text, with an explicit link to the frame used.
- `summarise.py` uses a transformer model from `transformers` to generate a global abstractive
  summary from the concatenated segment texts.
- `models.py` defines the data structures used between the stages, and helper functions for
  serialisation into JSON.

Keeping these concerns separated makes it straightforward to reuse the notebook with alternative
implementations (for example a different ASR backend or a different summarisation model) by
changing only the internals of the corresponding module.


In [2]:
from src.ingest import inspect_video, extract_audio, extract_frames
from src.asr import run_asr, preview_transcript
from src.ocr import run_ocr_on_frames, preview_ocr
from src.align import align_transcript_and_ocr, preview_segments
from src.summarise import summarise_segments

print("Project modules imported successfully.")



Project modules imported successfully.


## Data model and JSON serialisation

The pipeline exchanges data between stages using a small set of dataclasses defined in
`src/models.py`:

- `TranscriptSegment` with fields `start`, `end`, `text`.
- `OCRRecord` with fields `time`, `frame`, `text`.
- `Segment` that combines speech and slide information with fields such as
  `start`, `end`, `speech`, `slide_text`, `slide_time`, `slide_frame`.

Each dataclass exposes:

- a `to_dict` method,
- a `from_dict` static constructor,

and there are helper functions such as `transcript_to_jsonable`, `ocr_to_jsonable` and
`segments_to_jsonable` that convert lists of objects to standard lists of dictionaries.
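
As an illustration of this pattern, the following is a minimal sketch of how one of these dataclasses and its list helper could look. The field names come from the description above; the method bodies are an assumption, since the actual code in `src/models.py` is not reproduced in this notebook.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass
class TranscriptSegment:
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # recognised text

    def to_dict(self) -> Dict[str, Any]:
        # asdict() converts the dataclass fields into a plain dictionary.
        return asdict(self)

    @staticmethod
    def from_dict(data: Dict[str, Any]) -> "TranscriptSegment":
        # Explicit casts keep the types stable after a JSON round trip.
        return TranscriptSegment(
            start=float(data["start"]),
            end=float(data["end"]),
            text=str(data["text"]),
        )


def transcript_to_jsonable(segments: List[TranscriptSegment]) -> List[Dict[str, Any]]:
    # A list of plain dictionaries can be passed directly to json.dump().
    return [segment.to_dict() for segment in segments]


# Round trip: objects -> JSON string -> objects.
original = [TranscriptSegment(0.0, 6.04, "This is a 3.")]
payload = json.dumps(transcript_to_jsonable(original))
restored = [TranscriptSegment.from_dict(item) for item in json.loads(payload)]
```

`OCRRecord` and `Segment` would follow the same shape with their own fields.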

This design means that all intermediate results can be stored as human-readable JSON files:

- `transcript.json` is a list of transcript segments in chronological order.
- `ocr.json` is a list of OCR records corresponding to processed frames.
- `segments.json` is a list of aligned multimodal segments.

Persisting these artefacts has several practical benefits:

- experiments can be resumed from intermediate stages without recomputing heavy steps
  such as ASR and OCR,
- debugging is easier because each stage can be inspected individually,
- training data for downstream models can be built from these JSON files without requiring
  access to the original video or audio.
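
Resuming from an intermediate stage can be sketched as a small load-or-compute helper. The name `load_or_compute` is hypothetical and not part of `src/`; it only illustrates how a cached JSON artefact can replace an expensive step.

```python
import json
import os
from typing import Callable, List


def load_or_compute(path: str, compute: Callable[[], List[dict]]) -> List[dict]:
    """Return the records stored at `path`; compute and save them if the file is missing."""
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as handle:
            return json.load(handle)

    records = compute()
    # Create the parent directory on first use (for example data/interim/).
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, ensure_ascii=False, indent=2)
    return records
```

With such a helper, a second run reads `transcript.json` from disk instead of repeating the ASR step, as long as the file has not been deleted.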


## Configuration

This section defines the configuration used throughout the notebook:

- File system locations under `data/`.
- Video specific parameters (input path, frame sampling interval).
- ASR configuration (model size, device, quantisation type).
- OCR configuration (how many frames are actually processed).
- Summarisation configuration (model, device and text length limits).

The values below are reasonable defaults for a first run and can be adapted per project.
In particular:

- `video_path` must be updated so that it points to a valid local file.
- When no GPU is available, `asr_device` should be set to `"cpu"` and `summary_device`
  to `-1`. This will be slower, but the pipeline remains fully functional.
- The summarisation limits (`summary_max_chars`, `summary_max_length`, `summary_min_length`)
  control the trade-off between execution time and level of detail in the final summary.
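
The CPU fallback can also be selected automatically. The sketch below is an assumption rather than part of the project: the helper `pick_devices` is hypothetical, and it uses PyTorch's CUDA check as a proxy. `faster-whisper` runs on CTranslate2 rather than PyTorch, so a positive result here does not guarantee that the ASR backend can use the GPU.

```python
from typing import Tuple


def pick_devices() -> Tuple[str, int]:
    """Return (asr_device, summary_device) depending on GPU availability."""
    try:
        import torch  # imported lazily; treated as optional here
        has_gpu = torch.cuda.is_available()
    except ImportError:
        has_gpu = False

    if has_gpu:
        return "cuda", 0  # first GPU for both stages
    return "cpu", -1      # -1 is the CPU convention used by summary_device


detected_asr_device, detected_summary_device = pick_devices()
```

The detected values can then be copied into `asr_device` and `summary_device` in the configuration cell.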


In [3]:
import os

# Root data directories
data_dir = os.path.join(project_root, "data")
interim_dir = os.path.join(data_dir, "interim") # heavy intermediate files (frames, JSON)

# Input video to analyse (update this to your own file)
video_path = os.path.join(data_dir, "input.mp4")

# Intermediate artefacts
frame_dir = os.path.join(interim_dir, "frames")
audio_path = os.path.join(interim_dir, "audio.wav")
transcript_path = os.path.join(interim_dir, "transcript.json")
ocr_output_path = os.path.join(interim_dir, "ocr.json")
segments_path = os.path.join(interim_dir, "segments.json")

# Frame extraction and OCR
frame_interval_seconds = 3  # distance between extracted frames in seconds
ocr_frame_stride = 2        # process every Nth frame for OCR in order to reduce cost

# ASR configuration for faster-whisper
asr_model_size = "small"       # for example: "tiny", "base", "small", "medium", "large-v2"
asr_device = "cuda"            # "cuda" or "cpu"
asr_compute_type = "int8"      # quantisation used by faster-whisper, see its documentation

# Summarisation configuration
summary_model_name = "facebook/bart-large-cnn"
summary_device = 0            # GPU index, or -1 for CPU
summary_max_chars = 3000      # max number of characters per chunk of transcript
summary_max_length = 500      # max number of tokens in the summary of one chunk
summary_min_length = 40       # min number of tokens in the summary of one chunk

print("Video file:     ", video_path)
print("Frames dir:     ", frame_dir)
print("Audio path:     ", audio_path)
print("Transcript path:", transcript_path)
print("OCR output:     ", ocr_output_path)
print("Segments path:  ", segments_path)


Video file:      C:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data\input.mp4
Frames dir:      C:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data\interim\frames
Audio path:      C:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data\interim\audio.wav
Transcript path: C:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data\interim\transcript.json
OCR output:      C:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data\interim\ocr.json
Segments path:   C:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data\interim\segments.json


## Prepare directories

The following cell creates the expected directory structure if it does not already exist.

The convention used here is:

- the input video is read directly from `data/` (the configuration uses `data/input.mp4`),
- all heavy intermediate artefacts are stored in `data/interim/`,
- additional cleaned datasets, if any, can be stored in `data/processed/` by other notebooks.

The frame directory is created under `data/interim/frames/`. It acts both as a cache for
already extracted frames and as the input source for the OCR stage.


In [5]:
import os

for d in [data_dir, interim_dir, frame_dir]:
    os.makedirs(d, exist_ok=True)

print("Directory structure prepared.")


Directory structure prepared.


## 1. Inspect input video

The `inspect_video` helper uses `moviepy` to open the input file and prints basic information,
such as:

- duration (seconds),
- resolution,
- frame rate,
- file size.

This serves mainly as a sanity check that the file is readable and helps to choose reasonable
values for frame sampling (for example, how many frames will be produced with the current
`frame_interval_seconds` setting).
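
The expected number of frames can be estimated before extraction. This is a sketch under the assumption that frames are sampled at t = 0, `interval`, 2 × `interval`, and so on up to the video duration; the exact rule used in `src/ingest.py` is not shown here, and `planned_frame_count` is a hypothetical helper.

```python
import math


def planned_frame_count(duration_seconds: float, interval_seconds: float) -> int:
    """Number of samples at t = 0, interval, 2*interval, ... not exceeding the duration."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return math.floor(duration_seconds / interval_seconds) + 1


# Duration reported by inspect_video for the sample video, sampled every 3 s.
print(planned_frame_count(1075.39, 3))  # 359
```

Halving the interval roughly doubles both the disk usage of `frame_dir` and the OCR workload, so this estimate is worth checking for long videos.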

In [6]:
try:
    inspect_video(str(video_path))
except FileNotFoundError as e:
    print(e)


Found video file: input.mp4
Size: 93.71 MB
Duration: 1075.39 seconds (17.92 minutes)
Frame rate (fps): 60.0


## 2. Extract audio and frames

The ingest stage performs two operations:

1. **Audio extraction**

   The `extract_audio` function opens the input video with `moviepy`, extracts the audio track
   and writes it to a WAV file. The extraction is skipped when the file already exists on
   disk.

   This audio file is the only input required by the ASR stage.

2. **Frame sampling**

   The `extract_frames` function iterates over the video at a fixed temporal interval
   (`frame_interval_seconds`) and saves each sampled frame to the `frame_dir` folder as a JPEG
   image. File names encode the sequential frame index (for example `frame_00040.jpg`); with
   the default interval of 3 s, that frame is taken at t ≈ 120 s.

   The function also exposes a light `preview` mode that prints how many frames were written
   and lists the first few frame names. This is useful to verify that the sampling frequency
   is appropriate for the type of video being processed.

Both steps are idempotent: running the cell multiple times is safe and will not corrupt
the existing artefacts.
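
The naming scheme and the skip-if-present behaviour can be sketched with the standard library alone. The five-digit pattern matches the file names listed in the run output; the helper names and the way existing files are skipped are assumptions, not the code in `src/ingest.py`.

```python
import os
from typing import List


def frame_name(index: int) -> str:
    # Five-digit, zero-padded frame index, as in frame_00004.jpg.
    return f"frame_{index:05d}.jpg"


def frame_time(index: int, interval_seconds: float) -> float:
    # Approximate capture time of the frame with the given index.
    return index * interval_seconds


def missing_frame_indices(frame_dir: str, n_frames: int) -> List[int]:
    """Indices whose image file is not on disk yet; only these need to be extracted."""
    return [
        index
        for index in range(n_frames)
        if not os.path.exists(os.path.join(frame_dir, frame_name(index)))
    ]
```

Writing only the missing indices is one simple way to make repeated runs cheap and safe, because existing files are never overwritten.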


In [7]:
extract_audio(str(video_path), str(audio_path))

extract_frames(
    video_path=str(video_path),
    frame_dir=str(frame_dir),
    interval_seconds=frame_interval_seconds,
)


Extracting audio track ...
MoviePy - Writing audio in C:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data\interim\audio.wav


                                                                                                                       

MoviePy - Done.
Saved audio to: C:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data\interim\audio.wav
Planned number of frames: 359
Saving frames to: C:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data\interim\frames


100%|████████████████████████████████████████████████████████████████████████████████| 359/359 [00:59<00:00,  6.04it/s]

Audio present: True
Number of frame files: 359
First few frame files:
 - frame_00000.jpg
 - frame_00001.jpg
 - frame_00002.jpg
 - frame_00003.jpg
 - frame_00004.jpg





## 3. Automatic speech recognition (ASR)

The ASR stage converts raw audio into a list of timestamped transcript segments. The
implementation in `src/asr.py` is based on `faster-whisper`, a reimplementation of OpenAI's
Whisper models on top of CTranslate2 that provides fast and memory-efficient inference.

Main characteristics of `run_asr`:

- Loads the specified Whisper checkpoint (for example `"small"` or `"medium"`).
- Runs inference on the `audio_path` in streaming mode and collects the segments emitted by
  Whisper. Each segment has a start time, an end time and a text field.
- Wraps each raw segment into a `TranscriptSegment` dataclass instance.
- Writes the full list of segments to `transcript_path` as JSON, using the helper
  `transcript_to_jsonable` defined in `models.py`.
- Prints the number of segments produced and the location of the JSON file.

The choice of `asr_model_size`, `asr_device` and `asr_compute_type` controls the trade-off
between speed, resource usage and recognition quality.

The helper `preview_transcript` is provided as a light inspection tool. It prints the first `n`
segments with human-readable timestamps and is designed to be called from notebooks.
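
The core of such a wrapper can be sketched as follows. This is not the code of `run_asr`: the function names `collect_segments`, `transcribe` and `format_segment` are hypothetical, and plain dictionaries stand in for `TranscriptSegment`. The `faster-whisper` calls used are `WhisperModel(...)` and `model.transcribe(...)`, which returns a lazy generator of segments together with an info object.

```python
from typing import Any, Dict, Iterable, List


def collect_segments(raw_segments: Iterable[Any]) -> List[Dict[str, Any]]:
    """Convert objects exposing .start, .end and .text into plain dictionaries."""
    return [
        {"start": float(seg.start), "end": float(seg.end), "text": seg.text.strip()}
        for seg in raw_segments
    ]


def transcribe(
    audio_path: str,
    model_size: str = "small",
    device: str = "cpu",
    compute_type: str = "int8",
) -> List[Dict[str, Any]]:
    # Imported lazily so the helpers in this cell work without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device=device, compute_type=compute_type)
    # The audio is only decoded while the returned generator is consumed.
    raw_segments, _info = model.transcribe(audio_path)
    return collect_segments(raw_segments)


def format_segment(segment: Dict[str, Any]) -> str:
    # Same layout as the preview lines: [start -> end] text
    return f"[{segment['start']:.2f} -> {segment['end']:.2f}] {segment['text']}"
```

Keeping the conversion in `collect_segments` separate from the model call makes the formatting logic testable with stub objects, without loading a Whisper checkpoint.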


In [8]:
transcript_segments = run_asr(
    audio_path=str(audio_path),
    transcript_path=str(transcript_path),
    model_size=asr_model_size,
    device=asr_device,
    compute_type=asr_compute_type,
)

preview_transcript(transcript_segments, n=5)


Loading faster-whisper model …
Transcribing audio: C:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data\interim\audio.wav
Saved transcript to: C:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data\interim\transcript.json
Number of transcript segments: 265
Total segments: 265
[0.00 -> 6.04] This is a 3.
[6.04 -> 11.52] It's sloppily written and rendered at an extremely low resolution of 28x28 pixels, but your brain
[11.52 -> 14.34] has no trouble recognizing it as a 3.
[14.34 -> 18.52] And I want you to take a moment to appreciate how crazy it is that brains can do this so
[18.52 -> 19.52] effortlessly.


## 4. Optical character recognition (OCR) on frames

The OCR stage recovers slide text from the video frames sampled during ingestion. It operates
on the images stored in `frame_dir` and proceeds as follows:

- Lists all frame files that match the expected naming pattern (for example
  `frame_00040.jpg`). Each file name encodes the frame index, from which the approximate
  capture time is derived.
- Iterates over the frames in chronological order and processes only every
  `ocr_frame_stride`-th frame in order to keep the total OCR cost manageable.
- For each selected frame:
  - Opens the file with `PIL.Image`.
  - Runs OCR using `pytesseract.image_to_string` with a configuration tuned for slide text
    (single uniform block, no page segmentation into multiple columns).
  - Constructs an `OCRRecord` object with the estimated timestamp (derived from the frame
    index and `frame_interval_seconds`), the frame file name and the recognised text.

The list of `OCRRecord` instances is serialised to `ocr_output_path` as JSON using the helper
`ocr_to_jsonable`. The `preview_ocr` function prints the first `n` records in a readable form,
including the approximate time and the corresponding frame file.

The downstream alignment step only relies on the timestamps and the extracted text. If OCR
quality is low for a particular video, the sampling frequency and OCR configuration can be
tuned independently of the rest of the pipeline.
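
The frame selection and timestamp logic described in this section can be sketched as below. The helper names are hypothetical, and `--psm 6` (Tesseract's "single uniform block of text" page segmentation mode) is an assumption about the configuration, since the exact options used in `src/ocr.py` are not shown in this notebook.

```python
import re
from typing import List

FRAME_PATTERN = re.compile(r"^frame_(\d+)\.jpg$")


def frame_index(name: str) -> int:
    match = FRAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected frame file name: {name}")
    return int(match.group(1))


def select_frames(names: List[str], stride: int) -> List[str]:
    """Keep every `stride`-th frame file, in index order; other files are ignored."""
    frames = sorted((n for n in names if FRAME_PATTERN.match(n)), key=frame_index)
    return frames[::stride]


def frame_timestamp(name: str, interval_seconds: float) -> float:
    # Approximate capture time derived from the frame index.
    return frame_index(name) * interval_seconds


def ocr_frame(path: str) -> str:
    # Imported lazily so the selection helpers work without Tesseract installed.
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:
        # Assumed configuration: treat the slide as a single uniform block of text.
        return pytesseract.image_to_string(image, config="--psm 6").strip()
```

With a stride of 2 and an interval of 3 s, the selected frames are `frame_00000.jpg`, `frame_00002.jpg`, `frame_00004.jpg` and so on, with timestamps 0 s, 6 s and 12 s, which matches the preview produced by the run in this notebook.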


In [9]:
ocr_records = run_ocr_on_frames(
    frame_dir=str(frame_dir),
    ocr_output_path=str(ocr_output_path),
    frame_interval_seconds=frame_interval_seconds,
    ocr_frame_stride=ocr_frame_stride,
)

preview_ocr(ocr_records, n=5)


Running OCR on sampled frames …


100%|████████████████████████████████████████████████████████████████████████████████| 359/359 [00:49<00:00,  7.24it/s]

Saved OCR output for 180 frames to: C:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data\interim\ocr.json
OCR records: 180
[t ~ 0.0s] frame=frame_00000.jpg

----------------------------------------
[t ~ 6.0s] frame=frame_00002.jpg

----------------------------------------
[t ~ 12.0s] frame=frame_00004.jpg

----------------------------------------
[t ~ 18.0s] frame=frame_00006.jpg

----------------------------------------
[t ~ 24.0s] frame=frame_00008.jpg

----------------------------------------





## 5. Align transcript and slide text

The goal of the alignment stage is to attach the most relevant slide text to each speech
segment. The implementation in `src/align.py` performs a simple but effective temporal
nearest-neighbour matching.

High-level algorithm of `align_transcript_and_ocr`:

1. Precompute a list of OCR times (one value per `OCRRecord`).
2. For each `TranscriptSegment` in chronological order:
   - locate the index of the nearest OCR time using a binary search over the sorted list
     (this is implemented in the helper `_find_nearest_ocr_index`),
   - copy the corresponding slide text and frame information into a new `Segment` instance.
3. Apply a light post-processing step that merges very short segments and removes
   near-duplicate segments in order to obtain a smoother result.

Each resulting `Segment` therefore contains:

- the start and end time of the speech segment,
- the speech text as produced by ASR,
- the slide text and corresponding slide timestamp,
- the name of the frame from which the slide text was extracted.

The list of segments is saved to `segments_path` as JSON to make it easy to inspect or reuse
in downstream tasks.

The helper `preview_segments` prints the first `n` aligned segments, showing both the speech
and the associated slide text.
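
The nearest-neighbour step can be sketched with `bisect` from the standard library. Two points are assumptions: the helper below is a stand-in for `_find_nearest_ocr_index`, not a copy of it, and the segment midpoint is used as the reference time. The midpoint choice is consistent with the preview printed in this notebook, but the source does not state which reference time `src/align.py` actually uses. Plain dictionaries replace the dataclasses, and the post-processing step is omitted.

```python
from bisect import bisect_left
from typing import Any, Dict, List


def find_nearest_index(sorted_times: List[float], t: float) -> int:
    """Index of the value in `sorted_times` closest to `t` (ties go to the earlier one)."""
    if not sorted_times:
        raise ValueError("no OCR records to align against")
    pos = bisect_left(sorted_times, t)
    if pos == 0:
        return 0
    if pos == len(sorted_times):
        return len(sorted_times) - 1
    before, after = sorted_times[pos - 1], sorted_times[pos]
    return pos - 1 if t - before <= after - t else pos


def align(transcript: List[Dict[str, Any]], ocr_records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    # OCR records are assumed to be sorted by time already.
    times = [record["time"] for record in ocr_records]
    aligned = []
    for seg in transcript:
        midpoint = (seg["start"] + seg["end"]) / 2.0  # assumed reference time
        record = ocr_records[find_nearest_index(times, midpoint)]
        aligned.append({
            "start": seg["start"],
            "end": seg["end"],
            "speech": seg["text"],
            "slide_text": record["text"],
            "slide_time": record["time"],
            "slide_frame": record["frame"],
        })
    return aligned
```

Because the OCR times are sorted, each lookup costs O(log n), so aligning a few hundred speech segments against a few hundred OCR records is effectively instantaneous.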


In [10]:
segments = align_transcript_and_ocr(
    transcript_segments=transcript_segments,
    ocr_records=ocr_records,
    segments_path=str(segments_path),
)

preview_segments(segments, n=5)


Saved aligned multimodal segments to: C:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data\interim\segments.json
Total segments: 265
Aligned segments: 265
[0.00 -> 6.04] (slide at 6.0) speech='This is a 3....'
Slide text: 
----------------------------------------
[6.04 -> 11.52] (slide at 6.0) speech='It's sloppily written and rendered at an extremely low resol...'
Slide text: 
----------------------------------------
[11.52 -> 14.34] (slide at 12.0) speech='has no trouble recognizing it as a 3....'
Slide text: 
----------------------------------------
[14.34 -> 18.52] (slide at 18.0) speech='And I want you to take a moment to appreciate how crazy it i...'
Slide text: 
----------------------------------------
[18.52 -> 19.52] (slide at 18.0) speech='effortlessly....'
Slide text: 
----------------------------------------


## 6. Global abstractive summary

The final stage compresses the aligned transcript into a short textual summary. The function
`summarise_segments` in `src/summarise.py` uses the following strategy:

1. Concatenate the `speech` fields of all segments into a single long transcript string.
2. Split this string into chunks using `_chunk_text`, which respects sentence boundaries
   when possible and ensures that each chunk does not exceed `max_chunk_chars` characters.
3. Instantiate a Hugging Face `pipeline` with a summarisation model (by default
   `"facebook/bart-large-cnn"`) on the requested device (GPU index or CPU).
4. For each chunk:
   - call the summariser with `max_length` and `min_length` constraints,
   - collect the resulting partial summary text.
5. Concatenate all chunk-level summaries into one global summary string.

The chunking mechanism is important for two reasons:

- it allows the use of standard encoder-decoder models with relatively short maximum input
  lengths on arbitrarily long transcripts,
- it constrains the worst case memory usage during generation.

`summarise_segments` prints the global summary to standard output and returns it as a string
for further processing in the notebook.
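
The chunk-then-summarise strategy can be sketched as follows. `chunk_text` is a hypothetical stand-in for `_chunk_text`, whose real implementation is not shown here; the regular expression used to detect sentence boundaries and the hard split of over-long sentences are assumptions. The `summarise` function relies on the `transformers` summarisation pipeline as available in the release used for this run.

```python
import re
from typing import List


def chunk_text(text: str, max_chars: int) -> List[str]:
    """Pack whole sentences into chunks of at most `max_chars` characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is split at the character level.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def summarise(
    text: str,
    model_name: str = "facebook/bart-large-cnn",
    device: int = -1,
    max_chunk_chars: int = 3000,
    max_length: int = 500,
    min_length: int = 40,
) -> str:
    # Imported lazily so chunk_text can be used without transformers installed.
    from transformers import pipeline

    summariser = pipeline("summarization", model=model_name, device=device)
    parts = []
    for chunk in chunk_text(text, max_chunk_chars):
        # truncation=True guards against chunks that exceed the model's input limit.
        result = summariser(chunk, max_length=max_length, min_length=min_length, truncation=True)
        parts.append(result[0]["summary_text"].strip())
    return "\n".join(parts)
```

A character limit is only a proxy for the model's token limit, which is why the sketch also enables truncation. A short final chunk can trigger the `max_length` warning from `transformers`, because the requested summary length is then longer than the input itself.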


In [12]:
summary = summarise_segments(
    segments=segments,
    model_name=summary_model_name,
    device=summary_device,
    max_chunk_chars=summary_max_chars,
    max_length=summary_max_length,
    min_length=summary_min_length,
)

Total transcript length (characters): 17906
Number of chunks for summarisation: 7


Device set to use cuda:0


Summarising chunk 1/7 …
Summarising chunk 2/7 …
Summarising chunk 3/7 …
Summarising chunk 4/7 …
Summarising chunk 5/7 …
Summarising chunk 6/7 …


Your max_length is set to 500, but your input_length is only 57. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=28)


Summarising chunk 7/7 …

=== GLOBAL SUMMARY ===

In this video, we look at how a neural network can learn to recognize handwritten digits. In the next, we'll look at the structure component of that. At the end of the two videos, I want to point you to a couple of good resources where you can learn more.
The network starts with a bunch of neurons corresponding to each of the 28 times 28 pixels of the input image, which is 784 neurons in total. The activation in these neurons, again some number that's between 0 and 1, represents how much the system thinks that a given image corresponds with a given digit.
The goal is to have some mechanism that could conceivably combine pixels into edges, or edges into patterns, or patterns into digits. In a perfect world, we might hope that each neuron in the second to last layer of the network corresponds with one of these subcomponents.
The question at hand is, what parameters should the network have? What dials and knobs should you be able to tweak s

## Conclusion

This introduction notebook demonstrates the complete multimodal pipeline, starting from a raw
video file and producing:

- an audio transcript derived from the video soundtrack,
- slide text extracted from sampled frames,
- aligned multimodal segments that combine both modalities,
- an abstractive summary suitable for fast review or indexing.

All heavy computation is delegated to the reusable modules under `src/`, while the notebook
focuses on configuration and orchestration. This separation permits straightforward extension
of the project with additional experiments, visualisations or alternative models in dedicated
notebooks.
