# Multimodal NLP: Ingestion + Core Pipeline

This notebook combines the previous **video ingestion** and **core processing** logic into a single end‑to‑end pipeline:

1. Inspect a local input video `input.mp4`.
2. Extract audio as `audio.wav`.
3. Extract frames at a fixed interval.
4. Run ASR with `faster-whisper` and save a time-coded transcript.
5. Run OCR on sampled frames.
6. Align ASR segments with nearby OCR frames and export `segments.json`.
7. Build a global abstractive summary of the transcript.

All artefacts are written under `data_example_video/`.


In [1]:
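# Optional installations (uncomment in a fresh environment).
# `from moviepy import VideoFileClip` needs moviepy >= 2.0; the transformers pipeline needs a backend such as torch.
# !pip install "moviepy>=2.0" tqdm faster-whisper pytesseract pillow transformers sentencepiece torch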

import os
import json
from pathlib import Path

from moviepy import VideoFileClip
from tqdm import tqdm

from faster_whisper import WhisperModel

from PIL import Image
import pytesseract

from transformers import pipeline

# Configure Tesseract path if needed (example for Windows)
# pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

DATA_DIR = Path.cwd()
VIDEO_PATH = DATA_DIR / "input.mp4"
VIDEO_DIR = DATA_DIR / "data_example_video"
AUDIO_PATH = VIDEO_DIR / "audio.wav"
FRAME_DIR = VIDEO_DIR / "frames"

VIDEO_DIR.mkdir(exist_ok=True, parents=True)
FRAME_DIR.mkdir(exist_ok=True, parents=True)

print(f"Working directory: {DATA_DIR}")
print(f"Video path: {VIDEO_PATH}")
print(f"Output directory: {VIDEO_DIR}")
print(f"Frame directory: {FRAME_DIR}")



Working directory: c:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos
Video path: c:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\input.mp4
Output directory: c:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data_example_video
Frame directory: c:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data_example_video\frames


## 1. Video ingestion
### 1.1 Basic checks on `input.mp4`


In [2]:
if not VIDEO_PATH.exists():
    raise FileNotFoundError(
        f"Could not find {VIDEO_PATH}. Please place 'input.mp4' next to this notebook."
    )

file_size_mb = VIDEO_PATH.stat().st_size / (1024 * 1024)
print(f"Found video file: {VIDEO_PATH.name}")
print(f"Size: {file_size_mb:.2f} MB")

clip = VideoFileClip(str(VIDEO_PATH))
duration = clip.duration  # seconds
fps = clip.fps

print(f"Duration: {duration:.2f} seconds ({duration/60:.2f} minutes)")
print(f"Frame rate (fps): {fps}")


Found video file: input.mp4
Size: 93.71 MB
Duration: 1075.39 seconds (17.92 minutes)
Frame rate (fps): 60.0


### 1.2 Audio extraction


In [3]:
if AUDIO_PATH.exists():
    print(f"Audio file already exists: {AUDIO_PATH}")
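else:
    if clip.audio is None:
        # VideoFileClip.audio is None when the file has no audio track
        raise ValueError(f"{VIDEO_PATH.name} has no audio track to extract.")
    print("Extracting audio track ...")
    clip.audio.write_audiofile(str(AUDIO_PATH))
    print(f"Saved audio to: {AUDIO_PATH}")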


Audio file already exists: c:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data_example_video\audio.wav


### 1.3 Frame extraction
Extract JPEG frames at a fixed interval (in seconds).


In [4]:
FRAME_INTERVAL_SECONDS = 3  # adjust as needed

n_frames = int(duration // FRAME_INTERVAL_SECONDS) + 1
print(f"Planned number of frames: {n_frames}")
print(f"Saving frames to: {FRAME_DIR}")

for i, t in enumerate(tqdm(range(0, int(duration) + 1, FRAME_INTERVAL_SECONDS))):
    frame_path = FRAME_DIR / f"frame_{i:05d}.jpg"
    if frame_path.exists():
        continue  # skip decoding frames already saved by a previous run
    frame_time = min(t, duration)  # never seek past the end of the clip
    frame = clip.get_frame(frame_time)
    Image.fromarray(frame).save(frame_path, format="JPEG")


Planned number of frames: 359
Saving frames to: c:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data_example_video\frames


100%|██████████| 359/359 [00:56<00:00,  6.31it/s]
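
The sampling schedule can be checked without touching the video. The following is a minimal sketch, assuming a hypothetical helper `frame_timestamps` (not part of the notebook) that reproduces the `range` used in the extraction loop; the duration is the value reported in section 1.1.

```python
def frame_timestamps(duration: float, interval: int) -> list[int]:
    """Return the sample times (whole seconds) visited by the extraction loop."""
    # Same schedule as range(0, int(duration) + 1, interval) in the cell above.
    return list(range(0, int(duration) + 1, interval))


times = frame_timestamps(1075.39, 3)
print(f"{len(times)} frames, first at {times[0]} s, last at {times[-1]} s")
```

For an integer interval, the length of this list equals `int(duration // interval) + 1`, which is why the planned count and the number of saved files agree.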


### 1.4 Sanity check of extracted artefacts


In [5]:
audio_exists = AUDIO_PATH.exists()
frame_files = sorted(FRAME_DIR.glob("frame_*.jpg"))
n_extracted_frames = len(frame_files)

print(f"Audio present: {audio_exists}")
print(f"Number of frame files: {n_extracted_frames}")
print("First few frame files:")
for f in frame_files[:5]:
    print(" -", f.name)


Audio present: True
Number of frame files: 359
First few frame files:
 - frame_00000.jpg
 - frame_00001.jpg
 - frame_00002.jpg
 - frame_00003.jpg
 - frame_00004.jpg


## 2. Core processing: ASR, OCR, alignment, summary
### 2.1 Checks on audio and frames


In [6]:
if not AUDIO_PATH.exists():
    raise FileNotFoundError(f"Expected audio file at {AUDIO_PATH}, but it does not exist.")

if not FRAME_DIR.exists():
    raise FileNotFoundError(f"Expected frame directory at {FRAME_DIR}, but it does not exist.")

frame_files = sorted(FRAME_DIR.glob("frame_*.jpg"))
print(f"Found audio: {AUDIO_PATH}")
print(f"Number of frames found: {len(frame_files)}")
print("First few frames:")
for f in frame_files[:5]:
    print(" -", f.name)


Found audio: c:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data_example_video\audio.wav
Number of frames found: 359
First few frames:
 - frame_00000.jpg
 - frame_00001.jpg
 - frame_00002.jpg
 - frame_00003.jpg
 - frame_00004.jpg


### 2.2 ASR with `faster-whisper`
Produces `transcript_segments.json` with `start`, `end`, and `text` fields.


In [7]:
TRANSCRIPT_PATH = VIDEO_DIR / "transcript_segments.json"

if TRANSCRIPT_PATH.exists():
    print(f"Transcript file already exists: {TRANSCRIPT_PATH}")
    with open(TRANSCRIPT_PATH, "r", encoding="utf-8") as f:
        transcript_segments = json.load(f)
    print(f"Loaded {len(transcript_segments)} transcript segments.")
else:
    print("Loading faster-whisper model ...")
    model = WhisperModel("small", device="cuda", compute_type="int8")

    print(f"Transcribing audio: {AUDIO_PATH}")
    segments, info = model.transcribe(str(AUDIO_PATH), beam_size=5)

    transcript_segments = []
    for seg in segments:
        transcript_segments.append(
            {
                "start": float(seg.start),
                "end": float(seg.end),
                "text": seg.text.strip(),
            }
        )

    with open(TRANSCRIPT_PATH, "w", encoding="utf-8") as f:
        json.dump(transcript_segments, f, ensure_ascii=False, indent=2)

    print(f"Saved transcript to: {TRANSCRIPT_PATH}")
    print(f"Number of transcript segments: {len(transcript_segments)}")


Transcript file already exists: c:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data_example_video\transcript_segments.json
Loaded 373 transcript segments.


In [8]:
print(f"Total segments: {len(transcript_segments)}")
for seg in transcript_segments[:5]:
    print(f"[{seg['start']:.2f} -> {seg['end']:.2f}] {seg['text']}")


Total segments: 373
[0.00 -> 6.04] This is a 3.
[6.04 -> 11.52] It's sloppily written and rendered at an extremely low resolution of 28x28 pixels, but your brain
[11.52 -> 14.34] has no trouble recognizing it as a 3.
[14.34 -> 18.52] And I want you to take a moment to appreciate how crazy it is that brains can do this so
[18.52 -> 19.52] effortlessly.
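
The `start` and `end` fields are plain seconds. If a more conventional time code is wanted for display, something like the following sketch could be used; `format_timecode` is a hypothetical helper, not part of `faster-whisper`, and the two example segments are copied from the preview above (times rounded to two decimals).

```python
def format_timecode(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


# Same schema as transcript_segments.json: start, end, text
example_segments = [
    {"start": 0.0, "end": 6.04, "text": "This is a 3."},
    {"start": 11.52, "end": 14.34, "text": "has no trouble recognizing it as a 3."},
]

for seg in example_segments:
    print(f"{format_timecode(seg['start'])} --> {format_timecode(seg['end'])}  {seg['text']}")
```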


### 2.3 OCR on sampled frames
Runs Tesseract OCR on every `OCR_FRAME_STRIDE`-th frame and stores the results in `ocr_frames.json`, one record per frame with `time`, `frame`, and `text` fields.


In [9]:
OCR_OUTPUT_PATH = VIDEO_DIR / "ocr_frames.json"

# Must match the interval used when extracting frames
FRAME_INTERVAL_SECONDS = 3

# Sample every nth frame for OCR
OCR_FRAME_STRIDE = 2

if OCR_OUTPUT_PATH.exists():
    print(f"OCR output already exists: {OCR_OUTPUT_PATH}")
    with open(OCR_OUTPUT_PATH, "r", encoding="utf-8") as f:
        ocr_records = json.load(f)
    print(f"Loaded OCR text for {len(ocr_records)} frames.")
else:
    ocr_records = []
    print("Running OCR on sampled frames ...")

    frame_files = sorted(FRAME_DIR.glob("frame_*.jpg"))

    for idx, frame_path in enumerate(tqdm(frame_files)):
        if idx % OCR_FRAME_STRIDE != 0:
            continue

        approx_time = idx * FRAME_INTERVAL_SECONDS

        with Image.open(frame_path) as img:
            text = pytesseract.image_to_string(img)

        ocr_records.append(
            {
                "time": float(approx_time),
                "frame": frame_path.name,
                "text": text.strip(),
            }
        )

    with open(OCR_OUTPUT_PATH, "w", encoding="utf-8") as f:
        json.dump(ocr_records, f, ensure_ascii=False, indent=2)

    print(f"Saved OCR output for {len(ocr_records)} frames to: {OCR_OUTPUT_PATH}")


OCR output already exists: c:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data_example_video\ocr_frames.json
Loaded OCR text for 180 frames.


In [10]:
print(f"OCR records: {len(ocr_records)}")
for rec in ocr_records[:5]:
    print(f"[t ~ {rec['time']:.1f}s] frame={rec['frame']}")
    print(rec['text'])
    print("-" * 40)


OCR records: 180
[t ~ 0.0s] frame=frame_00000.jpg

----------------------------------------
[t ~ 6.0s] frame=frame_00002.jpg

----------------------------------------
[t ~ 12.0s] frame=frame_00004.jpg

----------------------------------------
[t ~ 18.0s] frame=frame_00006.jpg

----------------------------------------
[t ~ 24.0s] frame=frame_00008.jpg

----------------------------------------
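
The OCR cell derives each timestamp from the frame's position in the sorted file list, which is only correct while no frame file is missing. A possible alternative, sketched below, reads the index from the file name instead. `frame_time_from_name` is a hypothetical helper, and the file names are generated here rather than read from disk.

```python
import re

FRAME_INTERVAL_SECONDS = 3  # must match the extraction interval
OCR_FRAME_STRIDE = 2        # OCR every 2nd frame, as in the cell above


def frame_time_from_name(name: str, interval: float = FRAME_INTERVAL_SECONDS) -> float:
    """Recover a frame's approximate timestamp from a name like 'frame_00004.jpg'."""
    match = re.fullmatch(r"frame_(\d+)\.jpg", name)
    if match is None:
        raise ValueError(f"Unexpected frame file name: {name}")
    return float(int(match.group(1)) * interval)


# Stand-in for sorted(FRAME_DIR.glob("frame_*.jpg")) with 359 frames
names = [f"frame_{i:05d}.jpg" for i in range(359)]
sampled = names[::OCR_FRAME_STRIDE]

print(f"{len(sampled)} frames sampled, last at {frame_time_from_name(sampled[-1])} s")
```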


### 2.4 Temporal alignment of ASR and OCR
Aligns each ASR segment with the OCR record whose timestamp is nearest to the segment's midpoint, then exports the result as `segments.json`.


In [11]:
SEGMENTS_PATH = VIDEO_DIR / "segments.json"

ocr_times = [rec["time"] for rec in ocr_records]

def find_nearest_ocr_index(target_time: float) -> int:
    if not ocr_times:
        return -1
    best_idx = 0
    best_diff = abs(ocr_times[0] - target_time)
    for i, t in enumerate(ocr_times[1:], start=1):
        diff = abs(t - target_time)
        if diff < best_diff:
            best_diff = diff
            best_idx = i
    return best_idx

segments_merged = []

for seg in transcript_segments:
    mid = 0.5 * (seg["start"] + seg["end"])
    ocr_idx = find_nearest_ocr_index(mid)

    if ocr_idx == -1:
        ocr_text = ""
        ocr_time = None
        ocr_frame = None
    else:
        ocr_rec = ocr_records[ocr_idx]
        ocr_text = ocr_rec["text"]
        ocr_time = ocr_rec["time"]
        ocr_frame = ocr_rec["frame"]

    segments_merged.append(
        {
            "start": seg["start"],
            "end": seg["end"],
            "mid": mid,
            "speech": seg["text"],
            "slide_text": ocr_text,
            "slide_time": ocr_time,
            "slide_frame": ocr_frame,
        }
    )

with open(SEGMENTS_PATH, "w", encoding="utf-8") as f:
    json.dump(segments_merged, f, ensure_ascii=False, indent=2)

print(f"Saved aligned multimodal segments to: {SEGMENTS_PATH}")
print(f"Total segments: {len(segments_merged)}")


Saved aligned multimodal segments to: c:\Users\kevin\OneDrive\Documents\Work\Python\NLP-Videos\data_example_video\segments.json
Total segments: 373


In [12]:
print(f"Aligned segments: {len(segments_merged)}")
for seg in segments_merged[:5]:
    print(
        f"[{seg['start']:.2f} -> {seg['end']:.2f}] (slide at {seg['slide_time']}) speech='{seg['speech'][:60]}...'"
    )
    print(f"Slide text: {seg['slide_text'][:200]}")
    print("-" * 40)


Aligned segments: 373
[0.00 -> 6.04] (slide at 6.0) speech='This is a 3....'
Slide text: 
----------------------------------------
[6.04 -> 11.52] (slide at 6.0) speech='It's sloppily written and rendered at an extremely low resol...'
Slide text: 
----------------------------------------
[11.52 -> 14.34] (slide at 12.0) speech='has no trouble recognizing it as a 3....'
Slide text: 
----------------------------------------
[14.34 -> 18.52] (slide at 18.0) speech='And I want you to take a moment to appreciate how crazy it i...'
Slide text: 
----------------------------------------
[18.52 -> 19.52] (slide at 18.0) speech='effortlessly....'
Slide text: 
----------------------------------------
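
The linear scan in `find_nearest_ocr_index` is fine for 180 OCR records. Because the OCR timestamps are in ascending order, a binary search is a possible alternative for longer videos. The sketch below assumes sorted input; `find_nearest_index_sorted` is a hypothetical name, and the midpoints are those of the first five segments shown above.

```python
from bisect import bisect_left


def find_nearest_index_sorted(times: list[float], target: float) -> int:
    """Index of the value in ascending `times` closest to `target`, or -1 if empty."""
    if not times:
        return -1
    pos = bisect_left(times, target)
    if pos == 0:
        return 0
    if pos == len(times):
        return len(times) - 1
    before, after = times[pos - 1], times[pos]
    # Ties go to the earlier timestamp, matching the strict `<` in the linear scan.
    return pos - 1 if target - before <= after - target else pos


ocr_times_example = [0.0, 6.0, 12.0, 18.0]
for mid in (3.02, 8.78, 12.93, 16.43, 19.02):
    idx = find_nearest_index_sorted(ocr_times_example, mid)
    print(f"mid={mid:.2f} -> slide at {ocr_times_example[idx]}")
```

Each lookup costs O(log n) instead of O(n). The first segment maps to the slide at 6.0 rather than 0.0 because its midpoint, 3.02 s, is marginally closer to 6.0.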


### 2.5 Global abstractive summary
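Concatenates all speech segments, splits the text into chunks of at most `max_chunk_chars` characters on line boundaries, summarises each chunk with `facebook/bart-large-cnn`, and joins the chunk summaries into one text.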


In [13]:
full_transcript_text = "\n".join(seg["speech"] for seg in segments_merged)

print(f"Total transcript length (characters): {len(full_transcript_text)}")
preview = full_transcript_text[:1000]
if len(full_transcript_text) > 1000:
    preview += "..."
print(preview)


Total transcript length (characters): 17934
This is a 3.
It's sloppily written and rendered at an extremely low resolution of 28x28 pixels, but your brain
has no trouble recognizing it as a 3.
And I want you to take a moment to appreciate how crazy it is that brains can do this so
effortlessly.
I mean, this, this, and this are also recognizable as 3s, even though the specific values
of each pixel is very different from one image to the next.
The particular light-sensitive cells in your eye that are firing when you see this 3 are
very different from the ones firing when you see this 3.
But something in that crazy smart visual cortex of yours resolves these as representing the
same idea, while at the same time recognizing other images as their own distinct ideas.
But if I told you, hey, sit down and write for me a program that takes in a grid of
28x28 pixels like this, and outputs a single number between 0 and 10, telling you what it
thinks the digit is, well, the task goes from comicall

In [17]:
summariser = pipeline(
    "summarization",
    model="facebook/bart-large-cnn",
    device=0,  # first GPU; use -1 for CPU
)

max_chunk_chars = 3000  # chunk size in characters; about 700 tokens here, within BART's 1024-token input limit
max_length = 150        # maximum summary length in tokens (not characters)

def chunk_text(text, max_chars):
    chunks = []
    current = []
    current_len = 0
    for line in text.split("\n"):
        line_len = len(line) + 1
        if current_len + line_len > max_chars and current:
            chunks.append("\n".join(current))
            current = [line]
            current_len = line_len
        else:
            current.append(line)
            current_len += line_len
    if current:
        chunks.append("\n".join(current))
    return chunks

chunks = chunk_text(full_transcript_text, max_chunk_chars)
print(f"Number of chunks for summarisation: {len(chunks)}")

summaries = []
for idx, ch in enumerate(chunks):
    print(f"Summarising chunk {idx + 1}/{len(chunks)} ...")
    # truncation=True guards against chunks that exceed the model's input limit
    out = summariser(ch, max_length=max_length, min_length=40, do_sample=False, truncation=True)
    summaries.append(out[0]["summary_text"])

global_summary = "\n".join(summaries)

print("\n=== GLOBAL SUMMARY ===\n")
print(global_summary)


Device set to use cuda:0


Number of chunks for summarisation: 7
Summarising chunk 1/7 ...
Summarising chunk 2/7 ...
Summarising chunk 3/7 ...
Summarising chunk 4/7 ...
Summarising chunk 5/7 ...
Summarising chunk 6/7 ...
Summarising chunk 7/7 ...

=== GLOBAL SUMMARY ===

In this video, we look at how a neural network can learn to recognize handwritten digits. In the next, we'll look at the structure of the network, and how it's connected to other neurons.
The network I'm showing here has already been trained to recognize digits. The way the network operates, activations in one layer determine the activations of the next layer. The brightest neuron of that output layer is the network's choice, so to speak, for what digit this image represents.
In a perfect world, we might hope that each neuron in the second to last layer of the network corresponds with one of these subcomponents. That way, going from the third layer to the last one just requires learning which combination of subcomp components corresponds to which digits.
The question at hand is what parameters should the network have? What dials and knobs should you be able to tweak so that it's expressive enough to potentially capture this pattern or 
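
The behaviour of `chunk_text` is easiest to see on a toy input. The sketch below repeats the helper from the cell above (with type hints added) and runs it with an artificially small budget of 10 characters; the input strings are made up for illustration.

```python
def chunk_text(text: str, max_chars: int) -> list[str]:
    """Greedily pack whole lines into chunks of at most `max_chars` characters."""
    chunks = []
    current = []
    current_len = 0
    for line in text.split("\n"):
        line_len = len(line) + 1  # +1 accounts for the newline separator
        if current_len + line_len > max_chars and current:
            chunks.append("\n".join(current))
            current = [line]
            current_len = line_len
        else:
            current.append(line)
            current_len += line_len
    if current:
        chunks.append("\n".join(current))
    return chunks


toy_chunks = chunk_text("aaaa\nbbbb\ncccc\ndd", max_chars=10)
print(toy_chunks)

# A single line longer than the budget is kept whole rather than split.
long_line_chunks = chunk_text("x" * 25, max_chars=10)
print([len(c) for c in long_line_chunks])
```

Lines are never split, so a chunk can exceed `max_chars` when a single line is longer than the budget. With Whisper segments of a sentence or so this is unlikely to matter, but it is worth keeping in mind for other inputs.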

## 3. Generated artefacts

Under `data_example_video/`:

- `audio.wav`: extracted audio track.
- `frames/frame_XXXXX.jpg`: frames every `FRAME_INTERVAL_SECONDS` seconds.
- `transcript_segments.json`: ASR segments with time codes.
- `ocr_frames.json`: OCR results on sampled frames.
- `segments.json`: aligned multimodal segments (speech + slide text).
