In [None]:
%pip install --quiet --upgrade transformers timm supervision tqdm "scenedetect[opencv]" pillow einops
%pip install --quiet --upgrade openai "google-genai" pydantic pandas 
%pip install --quiet --upgrade yt-dlp ffmpeg-python

In [None]:
# API KEYS GO HERE, ASK SOMA!!

# Regions of interest

Oftentimes you don't want to analyze the *whole* video or the *whole* image. For example, in our last notebook we just looked at the middle frame in each scene. That's called a **region of interest.** By adding in an extra step that cuts out unwanted information (or selects wanted information) you can often use cheaper, faster, or less powerful tools later on in the process.

For videos, regions of interest are usually thought of in terms of **time**. For images, they're usually thought of in terms of **objects**.

## Downloading our video

Let's download that same video *again*.

In [None]:
%pip install --quiet --upgrade yt-dlp

In [None]:
import yt_dlp

url = "https://www.youtube.com/watch?v=rDXubdQdJYs"

# 720p or less
ydl_opts = {
    'format': 'bestvideo[height<=720]+bestaudio/best[height<=720]',
    'outtmpl': '%(id)s.%(ext)s',
    'merge_output_format': 'mp4',
    'postprocessors': [{
        'key': 'FFmpegVideoConvertor',
        'preferedformat': 'mp4',  # force re-encode into Quick Look–friendly format
    }],
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download([url])

In [None]:
from IPython.display import Video

Video("rDXubdQdJYs.mp4")

## Selecting times based on transcripts

When DocumentedNY analyzed [misinformation on TikTok](https://pulitzercenter.org/misinformation-tiktok-how-documented-examined-hundreds-videos-different-languages) they didn't need to actually analyze the video: the transcript did almost all of the work!

Turning audio into easily parseable text is a great way to filter down a long stretch of video into just the portions you want. Instead of watching a million videos start-to-finish of RFK Jr. or Pete Hegseth talking on podcasts and Fox News, techniques like this allow you to quickly narrow down your target (and *then* maybe use some image- or video-based AI).

In [None]:
%pip install --quiet --upgrade openai-whisper ffmpeg-python

We'll start by using ffmpeg to convert our video to an mp3. While we *could* have downloaded just the audio, we want to slice up the video later, so we took it all.

In [None]:
import ffmpeg

# Strip the video stream and encode the audio as mp3
process = ffmpeg.input('rDXubdQdJYs.mp4').output(
    'rDXubdQdJYs.mp3',
    vn=None,          # no video
    acodec='libmp3lame'
).run(quiet=True, overwrite_output=True)

Now we'll use [Whisper](https://github.com/openai/whisper) to transcribe it. I usually use [WhisperX](https://github.com/m-bain/whisperX), but I've had weird setup problems on other people's computers so we'll stick with the default.

Whisper comes in several sizes, with smaller models being faster and less accurate. The newest one, `turbo`, is a great combination of speed and accuracy.

> You [can't trust transcription](https://apnews.com/article/ai-artificial-intelligence-health-business-90020cdf5fa16c79ca2e5b6c4c9bbb14), but you're not an idiot so you're going to be watching these clips regardless.

In [None]:
import whisper

model = whisper.load_model("turbo")
result = model.transcribe("rDXubdQdJYs.mp3")
text = result['text']
text[:5000]

Along with the transcribed text of the audio, Whisper also provides timestamped segments. We can use that to filter for specific segments of the video!
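
Each segment is a dictionary that includes, among other fields, an `id`, a `start` and `end` time in seconds, and the `text` spoken in between. Here's a minimal sketch of reading them with friendlier timestamps. The segments below are invented stand-ins for `result['segments']`, and `format_timestamp` is a hypothetical helper, not part of Whisper.

```python
# Invented segments shaped like Whisper's result['segments'];
# the text and times are made up for illustration.
segments = [
    {"id": 0, "start": 0.0, "end": 4.2, "text": " Good evening."},
    {"id": 1, "start": 4.2, "end": 75.5, "text": " Let's get started."},
]

def format_timestamp(seconds):
    """Turn a number of seconds into H:MM:SS (fractions are dropped)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"

for segment in segments:
    start = format_timestamp(segment["start"])
    end = format_timestamp(segment["end"])
    print(f"[{start} - {end}]{segment['text']}")
```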

## Time-based regions of interest

First we'll throw the **segments** into a dataframe.

In [None]:
import pandas as pd

pd.options.display.max_colwidth = None

df = pd.DataFrame(result['segments'])
df.head()

Trump loved to talk about Hunter Biden, Joe Biden's son. Let's filter for `son` to see how that went.

In [None]:
selected = df[df['text'].str.contains("son")]
selected.head()
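
One catch: `str.contains("son")` matches any text with those three letters in a row, so "person" and "reason" count too. If that adds too much noise, a regular expression with word boundaries is one way to tighten it up. This sketch uses plain Python and invented sentences rather than the real transcript; the same pattern should work in pandas as `df['text'].str.contains(r"\bson\b", case=False)`.

```python
import re

# Invented example sentences, not taken from the transcript
texts = [
    " My son started school this year.",
    " Every person in this country knows that.",
    " That's the reason we're here.",
    " His son's name came up again.",
]

# \b marks a word boundary, so "person" and "reason" no longer match
pattern = re.compile(r"\bson\b", flags=re.IGNORECASE)

matches = [text for text in texts if pattern.search(text)]
for text in matches:
    print(text)
```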

While we can guess who's speaking at each of those moments, *can we really know?* And even if the answer is "yes," don't we want to use some AI to verify it?

First, let's cut out those sections as images: one frame from the middle of each selected segment.

In [None]:
import os
import ffmpeg

filename = 'rDXubdQdJYs.mp4'
output = f'{filename}_frames'

# Ensure output directory exists
os.makedirs(output, exist_ok=True)

# Loop through rows in the DataFrame
for i, row in selected.iterrows():
    midpoint = (row['start'] + row['end']) / 2
    output_path = os.path.join(
        output, 
        f"frame_{i:03d}.jpg"
    )
    
    (
        ffmpeg
        .input(filename, ss=midpoint)
        .output(output_path, vframes=1)
        .run(quiet=True, overwrite_output=True)
    )
    print(f"Saved {output_path}")

We can also cut out those sections as video clips, with a couple of seconds of padding on either side.

In [None]:
import ffmpeg
import os

filename = 'rDXubdQdJYs.mp4'
output = f'{filename}_clips'
buffer_seconds = 2  # padding added before and after each segment

os.makedirs(output, exist_ok=True)

# Loop through each row to generate clips
for idx, row in selected.iterrows():
    start = max(row['start'] - buffer_seconds, 0)
    end = row['end'] + buffer_seconds
    duration = end - start

    output_path = os.path.join(
        output,
        f"clip_{row['id']}_{int(start)}-{int(end)}.mp4"
    )

    (
        ffmpeg
        .input(filename, ss=start, t=duration)
        # Fast copy (no re-encode). Copied streams are cut at keyframes,
        # so the start of each clip may be slightly off.
        .output(output_path, c='copy')
        .run(quiet=True, overwrite_output=True)
    )

    print(f"Saved {output_path}")
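
Because every clip gets a buffer on both sides, segments that sit close together can produce overlapping clips. If you'd rather have one longer clip than several overlapping ones, you can merge the time ranges first. This is a rough sketch: `merge_ranges` is a hypothetical helper and the times are invented.

```python
def merge_ranges(ranges):
    """Merge (start, end) pairs that overlap or touch."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous range, so extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

buffer_seconds = 2
# Invented (start, end) times in seconds
segment_times = [(10.0, 14.0), (15.0, 19.0), (40.0, 45.0)]

# Pad each range, never going below zero
buffered = [
    (max(start - buffer_seconds, 0), end + buffer_seconds)
    for start, end in segment_times
]
clips = merge_ranges(buffered)
print(clips)
```

With the real data, `segment_times` could be built from `zip(selected['start'], selected['end'])`.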

We're not actually going to verify the speakers here. But if you want to do the classification with either LLMs or transformers you're welcome to: the images are *right there waiting for you*.

## Space-based regions of interest

If you're looking at an individual image, you might only be interested in **part of the image**. Things like, are there guns in this image? Does this protest contain hate symbols on flags? How many journalists can fit in Bar Populaire?

These kinds of questions involve **object detection**, the process of...... detecting objects. You can [read more here](https://huggingface.co/docs/transformers/en/tasks/object_detection).

Again, we're going to use a **zero shot model**. We aren't doing anything specific or weird – only general stuff – so it should know what we're talking about.

In [None]:
from transformers import pipeline

checkpoint = "google/owlv2-base-patch16-ensemble"
detector = pipeline(model=checkpoint, task="zero-shot-object-detection")

While we *could* grab Trump and Biden's faces, let's try something else. **Cars!**

In [None]:
from PIL import Image

image = Image.open("cars/28246634.jpg")
image

What can we find in the image? Wheels? License plates? Vehicles?

In [None]:
predictions = detector(
    image,
    candidate_labels=["wheel", "license plate", "vehicle"],
)
predictions


We'll use PIL to draw some boxes and let us know what we're seeing.

In [None]:
from PIL import ImageDraw

annotated = image.copy()
draw = ImageDraw.Draw(annotated)

for prediction in predictions:
    box = prediction["box"]
    label = prediction["label"]
    score = prediction["score"]

    xmin, ymin, xmax, ymax = box.values()
    draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1)
    draw.text((xmin, ymin), f"{label}: {round(score,2)}", fill="white")

annotated


**That looks awful**. Let's try filtering for things with a score above 0.5.

In [None]:
from PIL import ImageDraw

annotated = image.copy()
draw = ImageDraw.Draw(annotated)

good_predictions = [pred for pred in predictions if pred['score'] > 0.5]
for prediction in good_predictions:
    box = prediction["box"]
    label = prediction["label"]
    score = prediction["score"]

    xmin, ymin, xmax, ymax = box.values()
    draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1)
    draw.text((xmin, ymin), f"{label}: {round(score,2)}", fill="white")

annotated


Perfect! If we're just interested in that area, we can crop it out by using the bounding box.

In [None]:
prediction = good_predictions[0]
prediction

In [None]:
bbox = [
    prediction['box']['xmin'],
    prediction['box']['ymin'],
    prediction['box']['xmax'],
    prediction['box']['ymax'],
]
cropped = image.crop(bbox)
cropped
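
Note that `good_predictions[0]` is just the first prediction that cleared the threshold, which isn't guaranteed to be the license plate. One option is to pick the best-scoring box for the label you care about, and maybe pad it by a few pixels so the crop isn't too tight. This is a sketch under those assumptions: `best_box` and `pad_box` are hypothetical helpers, and the predictions are invented but shaped like the detector's output.

```python
# Invented predictions in the same shape the detector returns
predictions = [
    {"score": 0.71, "label": "vehicle",
     "box": {"xmin": 20, "ymin": 40, "xmax": 620, "ymax": 460}},
    {"score": 0.64, "label": "license plate",
     "box": {"xmin": 280, "ymin": 350, "xmax": 380, "ymax": 380}},
    {"score": 0.52, "label": "license plate",
     "box": {"xmin": 5, "ymin": 5, "xmax": 30, "ymax": 15}},
]

def best_box(predictions, label):
    """Return the highest-scoring box for a label, or None if there isn't one."""
    candidates = [p for p in predictions if p["label"] == label]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p["score"])["box"]

def pad_box(box, padding, width, height):
    """Grow a box by `padding` pixels without leaving the image."""
    return (
        max(box["xmin"] - padding, 0),
        max(box["ymin"] - padding, 0),
        min(box["xmax"] + padding, width),
        min(box["ymax"] + padding, height),
    )

plate = best_box(predictions, "license plate")
crop_box = pad_box(plate, padding=10, width=640, height=480)
print(crop_box)
```

With a real PIL image you'd pass `image.width` and `image.height`, then call `image.crop(crop_box)`.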


Now that we've cut out the rest of the image, we have a nice tiny section we can send to an LLM for description and analysis.

In [None]:
from pydantic import BaseModel, Field
from openai import OpenAI
import base64
from io import BytesIO

# Just ask "write me a Pydantic model for XXXX"
class ImageDescription(BaseModel):
    # Pass the text as description=...; a bare string would become the field's default value
    country_guess: str = Field(description="Best guess of the country the license plate is from")
    plate_number: str = Field(description="Text of license plate number")
    additional_notes: str

client = OpenAI(
    base_url='https://generativelanguage.googleapis.com/v1beta/openai/',
    api_key=GEMINI_API_KEY
)

def ask_llm_cropped(cropped):
    buffer = BytesIO()
    cropped.save(buffer, format="PNG")
    b64_image = base64.b64encode(buffer.getvalue()).decode("utf-8")

    completion = client.beta.chat.completions.parse(
        model="gemini-2.5-flash-preview-05-20",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image"},
                    {"type": "image_url", "image_url": { "url": f"data:image/png;base64,{b64_image}" } }
                ],
            },
        ],
        temperature=0,
        response_format=ImageDescription,
    )

    result = completion.choices[0].message.parsed
    return result

In [None]:
ask_llm_cropped(cropped)

And again, **we can just do it for a series of images**. Let's crop the license plates out of every image in the `cars` folder and save them into their own folder.

In [None]:
from pathlib import Path
from PIL import Image
import glob

output_dir = Path("cars_license_plates")
output_dir.mkdir(exist_ok=True)
filenames = glob.glob("cars/*.jpg")

results = []
for filename in filenames:
    image = Image.open(filename)
    predictions = detector(
        image,
        candidate_labels=["license plate"],
    )

    base = Path(filename).stem  # filename without extension
    for i, prediction in enumerate(predictions):
        if prediction['score'] > 0.5:
            box = prediction["box"]
            xmin, ymin, xmax, ymax = box.values()
            cropped = image.crop((xmin, ymin, xmax, ymax))
            cropped_filename = output_dir / f"{base}_{i+1}.jpg"
            cropped.save(cropped_filename)
            
            result = prediction.copy()
            result['filename'] = filename
            result['cropped'] = cropped_filename
            results.append(result)

In [None]:
import pandas as pd

df = pd.json_normalize(results)
df['preview'] = df['cropped'].apply(lambda filename: f'<img src="{filename}" width="100"/>')

df.head()

In [None]:
from IPython.display import HTML

HTML(df.to_html(escape=False))

I didn't run `ask_llm_cropped` in the interests of finishing the notebook, but adding it might be a good exercise for you!
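
If you try it, one thing to plan for is that a single failed call shouldn't sink the whole batch. Here's a rough sketch of that pattern. `describe_all` is a hypothetical helper and `fake_describe` is a stand-in so the example runs without an API key; in the notebook you'd swap in something that opens the file and calls `ask_llm_cropped`.

```python
def describe_all(paths, describe):
    """Call describe(path) for each path, recording failures instead of stopping."""
    rows = []
    for path in paths:
        try:
            rows.append({"cropped": path, "description": describe(path), "error": None})
        except Exception as exc:
            # Keep going, but remember what went wrong
            rows.append({"cropped": path, "description": None, "error": str(exc)})
    return rows

# Stand-in for the real LLM call, only here to make the sketch runnable
def fake_describe(path):
    if "bad" in path:
        raise ValueError("could not read image")
    return f"description of {path}"

rows = describe_all(["a_1.jpg", "bad_1.jpg"], fake_describe)
for row in rows:
    print(row)
```

Something like `describe_all(df['cropped'], lambda path: ask_llm_cropped(Image.open(path)))` would be the real version.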