In [None]:
%pip install --quiet --upgrade transformers timm "scenedetect[opencv]==0.6.4"
%pip install --quiet --upgrade openai "google-genai" pydantic pandas 
%pip install --quiet --upgrade ffmpeg-python pillow tqdm

In [None]:
# API KEYS GO HERE, ASK SOMA!!

# Analyzing videos with AI

Now that we've done a little image analysis, we can look at videos. Remember that **videos are just images and audio and a little bit of time**, and converting between those formats is often a convenient way to analyze content.

## Downloading videos

To download the videos we'll use [yt-dlp](https://github.com/yt-dlp/yt-dlp), the most fantastic piece of software ever written. It can download content from *anywhere*. Give it a URL to YouTube, TikTok, Instagram, old weird websites with videos: a-n-y-w-h-e-r-e!

It works by doing a little magic scraping, though, so you'll often need to `--upgrade` it with the line below. Even running a version a couple months old is likely to end up with unsuccessful downloads.

In [None]:
%pip install --quiet --upgrade yt-dlp

yt-dlp looks the best on the command line, buuuut I'll keep things cleaner and give you the Python code. I always always always ask a chatbot to write the `ydl_opts` for me because I absolutely cannot remember any of them.

In [None]:
import yt_dlp

url = "https://www.youtube.com/watch?v=rDXubdQdJYs"

ydl_opts = {
    'format': 'bestvideo[height<=720][vcodec!^=av01]+bestaudio/best[height<=720]',
    'outtmpl': '%(id)s.%(ext)s',
    'merge_output_format': 'mp4',
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download([url])
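
The `outtmpl` of `%(id)s.%(ext)s` names the file after the video's ID, and `merge_output_format` sets the extension to `mp4`. Here is a small sketch of predicting that filename from the URL. The helper name `expected_filename` is made up for illustration, and it only handles `watch?v=` style YouTube URLs.

```python
from urllib.parse import urlparse, parse_qs

def expected_filename(url, ext="mp4"):
    """Guess the output filename for a YouTube watch URL (hypothetical helper)."""
    query = parse_qs(urlparse(url).query)
    video_id = query["v"][0]  # the part after ?v=
    return f"{video_id}.{ext}"

filename = expected_filename("https://www.youtube.com/watch?v=rDXubdQdJYs")
print(filename)
```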

And now we have our video!

In [None]:
from IPython.display import Video

Video("rDXubdQdJYs.mp4")

## Sending videos to an LLM

As of this writing, Google Gemini is the main LLM API that accepts video directly - see
https://ai.google.dev/gemini-api/docs/video-understanding for many, many, MANY examples of how to use it.

For these features we also need to use Google's `google-genai` package instead of the OpenAI one. Sorry!!! Also note that the package is *not* `google-generativeai`, which is Google's previous AI library. I do not name things around here.

We also can't send the video through base64, it would be giant and blow up the planet. Instead we need to upload the video to Google before we analyze it.

In [None]:
from google import genai
from google.genai import types
import time

client = genai.Client(api_key=GEMINI_API_KEY)

def upload_video(video_file_name):
    video_file = client.files.upload(file=video_file_name)

    # Google needs a moment to process the upload before we can use it
    while video_file.state == "PROCESSING":
        print('Waiting for video to be processed.')
        time.sleep(10)
        video_file = client.files.get(name=video_file.name)

    if video_file.state == "FAILED":
        raise ValueError(video_file.state)
    print(f'Video processing complete: {video_file.uri}')

    return video_file
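
The polling loop in `upload_video` keeps waiting for as long as the file stays in `PROCESSING`. If you'd rather give up after a while, here's a rough sketch of the same idea with a timeout. `wait_until_ready` and `get_state` are hypothetical names: `get_state` stands in for whatever call checks the file's state, and the fake states below exist only so the example runs without an API key.

```python
import time

def wait_until_ready(get_state, poll_seconds=10, timeout_seconds=600, sleep=time.sleep):
    """Poll get_state() until it stops returning "PROCESSING" or time runs out."""
    waited = 0
    state = get_state()
    while state == "PROCESSING":
        if waited >= timeout_seconds:
            raise TimeoutError(f"Still processing after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        state = get_state()
    return state

# Stand-in for the real API: pretend the file needs two checks before it is ready
fake_states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
final_state = wait_until_ready(lambda: next(fake_states), sleep=lambda seconds: None)
print(final_state)
```

Passing `sleep` in as an argument is just a convenience so you can try the function without actually waiting.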

That was all setup: now we upload it.

In [None]:
video = upload_video('rDXubdQdJYs.mp4')

After the video is uploaded, we can finally make a request. You can put **anything** in the prompt – again, you should definitely check out the [video understanding](https://ai.google.dev/gemini-api/docs/video-understanding) page.

In [None]:
from IPython.display import display, Markdown, HTML
from google.genai.types import GenerateContentConfig

prompt = """
Describe the video.
"""

# `video` is the uploaded file returned by upload_video() above

response = client.models.generate_content(
    model="gemini-2.0-flash-lite",
    contents=[
        video,
        prompt,
    ],
    config=GenerateContentConfig(temperature=0)
)

print(response.text)

We're trying to see whether Biden or Trump got more screen time (bias!!! bias!!!) so we'll go the easiest route first: **we can just ask Gemini.**

In [None]:
from IPython.display import display, Markdown, HTML
from google.genai.types import GenerateContentConfig

prompt = """
Describe the video. Count the number of seconds Trump is alone on the screen and
the number of seconds Biden is alone on the screen
"""

# `video` is the uploaded file returned by upload_video() above

response = client.models.generate_content(
    model="gemini-2.0-flash-lite",
    contents=[
        video,
        prompt,
    ],
    config=GenerateContentConfig(temperature=0)
)

print(response.text)

Great! Wonderful! A response! **But is it correct?**

We'll find out later.

## Analyzing videos *directly from YouTube*

A fun trick when working with YouTube videos is that you can feed them directly to Gemini. You can even analyze like a *two hour long video* this way, it's wild. Only available with Gemini, obviously.

> Sometimes you get [weird errors](https://github.com/googleapis/python-genai/issues/378), but the code below seems to work for now.

In [None]:
from google import genai
from google.genai import types
from google.genai.types import HttpOptions

client = genai.Client(
    api_key=GEMINI_API_KEY,
    http_options=HttpOptions(api_version="v1alpha")
)

prompt = """
Transcribe the audio from this video, giving timestamps for salient events in the video. 
Also provide visual descriptions.
"""

response = client.models.generate_content(
    model='models/gemini-2.0-flash',
    contents= types.Content(
        parts=[
            types.Part(text=prompt),
            types.Part(
                file_data=types.FileData(file_uri='https://www.youtube.com/watch?v=rDXubdQdJYs')
            )
        ]
    )
)

print(response.text)

So now we can do the same "count the screen time" thing we did before, but this time instead of uploading the video we can just pass it the YouTube URL.

In [None]:
from google import genai
from google.genai import types
from google.genai.types import HttpOptions

client = genai.Client(
    api_key=GEMINI_API_KEY,
    http_options=HttpOptions(api_version="v1alpha")
)

prompt = """
Describe the video. Count the number of seconds Trump is alone on the screen and
the number of seconds Biden is alone on the screen
"""

response = client.models.generate_content(
    model='models/gemini-2.0-flash',
    contents= types.Content(
        parts=[
            types.Part(text=prompt),
            types.Part(
                file_data=types.FileData(file_uri='https://www.youtube.com/watch?v=rDXubdQdJYs')
            )
        ]
    )
)

print(response.text)

Did the two results match? **How can we verify them?**
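
One way to check the arithmetic is to ask Gemini for start and end timestamps instead of totals, then add them up yourself. A minimal sketch, assuming the timestamps come back as `MM:SS` strings. The intervals below are made up for illustration and are not real model output.

```python
def to_seconds(timestamp):
    """Convert an "MM:SS" string into a number of seconds."""
    minutes, seconds = timestamp.split(":")
    return int(minutes) * 60 + int(seconds)

# Made-up intervals for illustration, NOT real model output
intervals = [
    ("Donald Trump", "00:00", "00:12"),
    ("Joe Biden", "00:12", "00:20"),
    ("Donald Trump", "00:20", "01:05"),
]

totals = {}
for person, start, end in intervals:
    totals[person] = totals.get(person, 0) + to_seconds(end) - to_seconds(start)

print(totals)
```

This only checks the adding-up part. Whether the timestamps themselves are right is a separate question.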

## Manually splitting videos into frames

Instead of analyzing the video as a holistic object, we can also just break it down into frames. The code below uses ffmpeg – the craziest, most capable video library ever – to pull out a frame of the video every 2 seconds.

We can then see who is on screen when and add it up for a rough count.

In [None]:
import ffmpeg
from pathlib import Path

interval_seconds = 2
output_dir = Path("debate")
output_dir.mkdir(exist_ok=True)
output_pattern = str(output_dir / "frame-%03d.jpg")

(
    ffmpeg
    .input("rDXubdQdJYs.mp4")
    .output(output_pattern, vf=f'fps=1/{interval_seconds}')
    .run(capture_stdout=True, capture_stderr=True)
)

print(f"Frames saved to '{output_dir}' (1 every {interval_seconds} seconds)")
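
Since the frames are numbered in order, you can estimate roughly when each one was taken. This is a sketch with a hypothetical helper name, and the timing is approximate: ffmpeg's `fps` filter decides exactly which source frame lands in each slot, so treat the result as "around this time" rather than an exact timestamp.

```python
import re

def approximate_seconds(filename, interval_seconds=2):
    """Estimate when a frame was taken from its number (frame-001 is near 0s)."""
    frame_number = int(re.search(r"frame-(\d+)\.jpg$", filename).group(1))
    return (frame_number - 1) * interval_seconds

seconds = approximate_seconds("debate/frame-004.jpg")
print(seconds)
```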

Let's look at a single image...

In [None]:
from IPython.display import Image

Image("debate/frame-004.jpg")

*We* know who that is, but we're lazy! Let's leverage our skills from before and use Pydantic and an LLM to classify the image as being either:

- Joe Biden
- Donald Trump
- Both/None/Other

In [None]:
from openai import OpenAI
from typing import Literal, List
from pydantic import BaseModel, Field
import base64

class ImageDescription(BaseModel):
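    # description= tells the LLM what we want; a bare string would be treated as the default value
    person: Literal['Joe Biden', 'Donald Trump', 'Both/None/Other'] = Field(description="What politician is in this image?")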

client = OpenAI(
    base_url='https://generativelanguage.googleapis.com/v1beta/openai/',
    api_key=GEMINI_API_KEY
)

def ask_llm(filename):
    with open(filename, "rb") as image_file:
        b64_image = base64.b64encode(image_file.read()).decode("utf-8")
    
    completion = client.beta.chat.completions.parse(
        model="gemini-2.5-flash-preview-05-20",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Describe this image"},
                    # Our frames are JPEGs, so label the data URL as image/jpeg
                    {"type": "image_url", "image_url": { "url": f"data:image/jpeg;base64,{b64_image}" } }
                ],
            },
        ],
        temperature=0,
        response_format=ImageDescription
    )
    
    result = completion.choices[0].message.parsed
    return result

That code allows us to use `ask_llm` with a filename to get a response. How well does it work?

In [None]:
ask_llm('debate/frame-004.jpg')

In [None]:
Image('debate/frame-007.jpg')

In [None]:
ask_llm('debate/frame-007.jpg')

In [None]:
Image('debate/frame-001.jpg')

In [None]:
ask_llm('debate/frame-001.jpg')

Looking pretty good to me! Now let's **process all the frames**, just like we did in the last notebook when working with individual, unrelated images.

In [None]:
import glob

filenames = glob.glob("debate/*.jpg")
filenames = sorted(filenames)
filenames[:5]

In [None]:
from tqdm import tqdm

results = []
for filename in tqdm(filenames):
    result = ask_llm(filename)
    results.append(result)
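
If the loop crashes halfway through, you'd have to pay for every frame again on the next run. Here's a sketch of one way around that: save each answer to a JSON file as you go and skip anything already answered. `classify_all`, the cache filename, and `fake_classify` are all made up for illustration; `fake_classify` stands in for a real call so the example runs without an API key.

```python
import json
import tempfile
from pathlib import Path

def classify_all(filenames, classify, cache_path="frame-cache.json"):
    """Run classify() on each file, skipping any file already in the cache."""
    cache_file = Path(cache_path)
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}
    for filename in filenames:
        if filename not in cache:
            cache[filename] = classify(filename)
            # Save after every call so a crash doesn't lose paid-for answers
            cache_file.write_text(json.dumps(cache))
    return cache

# Demo with a fake classifier that records which files it was asked about
calls = []
def fake_classify(filename):
    calls.append(filename)
    return "Joe Biden"

with tempfile.TemporaryDirectory() as folder:
    cache_path = str(Path(folder) / "cache.json")
    classify_all(["a.jpg", "b.jpg"], fake_classify, cache_path)
    labels = classify_all(["a.jpg", "b.jpg", "c.jpg"], fake_classify, cache_path)

print(calls)
```

With the real function you could pass something like `lambda f: ask_llm(f).person`, since the cache needs plain strings rather than Pydantic objects.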

In [None]:
import pandas as pd

# Build into dataframe
data = [result.model_dump() for result in results]
df = pd.DataFrame(data)

# Add new columns
df['filename'] = filenames
df['preview'] = df['filename'].apply(lambda filename: f'<img src="{filename}" width="100"/>')

df.head()

Let's count them up and see how it looks!

In [None]:
df['person'].value_counts()
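
Those are counts of *frames*, not seconds. Each frame stands in for roughly `interval_seconds` of video, so multiplying gives a rough screen-time estimate you can compare with Gemini's answer. A sketch with made-up labels (in the notebook they'd come from `df['person']`):

```python
from collections import Counter

interval_seconds = 2  # same spacing used when extracting frames

# Made-up labels for illustration; in the notebook these come from df['person']
labels = ["Joe Biden", "Joe Biden", "Donald Trump",
          "Both/None/Other", "Donald Trump", "Donald Trump"]

counts = Counter(labels)
rough_seconds = {person: count * interval_seconds for person, count in counts.items()}
print(rough_seconds)
```

The pandas equivalent would be `df['person'].value_counts() * interval_seconds`.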

Does that match what Gemini said? And if not, how might we still be wrong?

...and while we're at it: **let's verify the results.**

In [None]:
from IPython.display import HTML

HTML(df.to_html(escape=False))

## Splitting on scenes

If you do this kind of analysis on anything that isn't a quick-cuts YouTube short, you might end up spending a *lot* of time and money on the process. Imagine if Biden was on the screen for 6 minutes – at one frame every 2 seconds, that's 180 images you're processing for *no good reason*.

So let's try another approach: [pySceneDetect](https://github.com/Breakthrough/PySceneDetect) is an incredible library that uses basic algorithms to split videos into scenes. It can save CSV files describing the splits, save an array of images from each scene, make a nice little HTML guide, cut out video clips, all sorts of things. You should read the documentation!

Below we're going to detect a change in scenes using the `ContentDetector` and save some details about each scene (images, a csv, and an HTML guide).

In [None]:
from scenedetect import detect, ContentDetector, open_video
from scenedetect.scene_manager import save_images, write_scene_list_html, write_scene_list
import os

video_path = "rDXubdQdJYs.mp4"
output_dir = "debate-scenes"
image_width = 300
num_images = 5

os.makedirs(output_dir, exist_ok=True)

scene_list = detect(video_path, ContentDetector())

video_stream = open_video(video_path)

save_images(
    scene_list,
    video_stream,
    output_dir=output_dir,
    num_images=num_images,
    width=image_width,
)

write_scene_list_html(
    scene_list=scene_list,
    output_html_filename=os.path.join(output_dir, "scenes.html"),
    image_width=image_width,
)

csv_path = os.path.join(output_dir, "scenes.csv")
with open(csv_path, "w", encoding="utf-8") as csvfile:
    write_scene_list(
        output_csv_file=csvfile,
        scene_list=scene_list,
        include_cut_list=False,
    )

# Print start/end timecodes for each scene
for idx, (start, end) in enumerate(scene_list, 1):
    print(f"Scene {idx}: {start.get_timecode()} – {end.get_timecode()}")

If you're running this on your own computer, you can open up the HTML file yourself. If not..................... it's fine, just watch me.

We can also get details on every scene – including length – by looking at the CSV file it built.

In [None]:
import pandas as pd

df = pd.read_csv("debate-scenes/scenes.csv")
df.head()

In order to see who is on screen in each scene, I think taking the middle image of the scene should be a reasonable representation. If we have 5 images from each scene, the 3rd of each should be what we're looking for.

We're using glob again below, but adjusting the pattern to only take the `-03.jpg` images.

In [None]:
import glob

# Get the third image of each scene
filenames = glob.glob("debate-scenes/*-03.jpg")
filenames = sorted(filenames)
filenames[:5]
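
The `-03` in that pattern is hard-coded for `num_images = 5`. If you change the number of images per scene, the middle one moves. A small sketch of working out the pattern instead, assuming the image numbers are zero-padded to two digits like the filenames above (the helper name is made up):

```python
def middle_image_pattern(num_images, folder="debate-scenes"):
    """Build a glob pattern for the middle image of each scene."""
    middle = num_images // 2 + 1  # 5 images -> image 3
    return f"{folder}/*-{middle:02d}.jpg"

pattern = middle_image_pattern(5)
print(pattern)
```

With an even number of images there's no true middle, so this picks the one just past it.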

Now we can ask the LLM again.

In [None]:
answers = []
for filename in filenames:
    answer = ask_llm(filename)
    answers.append(answer)
    print(f"{filename} received response {answer}")

It was less time, even though it might have been more work. How did it turn out? Let's *not really pay attention yet*, we have more information to combine with it.

In [None]:
dicts = [obj.model_dump() for obj in answers]

results = pd.DataFrame(dicts)
results['filename'] = filenames
results['preview'] = results.filename.apply(lambda filename: f'<img src="{filename}" width="100"/>')

results.head()

### Combining results with scene data

By adding in the scene data we can get a *lot* of additional information beyond just "there was a scene with this image."

In [None]:
merged = df.join(results)
merged.head()
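
`join` lines the two tables up by row position, which works as long as every scene produced exactly one `-03.jpg` image and the sorted filenames are in scene order. If you want something a bit more defensive, you could pull the scene number out of each filename and match on that instead. A sketch, assuming filenames shaped like `rDXubdQdJYs-Scene-002-03.jpg` (the helper name is made up):

```python
import re

def scene_number(filename):
    """Pull the scene number out of a name like 'video-Scene-002-03.jpg'."""
    match = re.search(r"-Scene-(\d+)-\d+\.jpg$", filename)
    if match is None:
        raise ValueError(f"No scene number found in {filename}")
    return int(match.group(1))

number = scene_number("debate-scenes/rDXubdQdJYs-Scene-002-03.jpg")
print(number)
```

From there, something like `results['Scene Number'] = results['filename'].apply(scene_number)` followed by `df.merge(results, on='Scene Number')` should line things up, assuming the CSV's column is called `Scene Number`.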

In [None]:
from IPython import display

# Show every scene alongside its preview image
display.HTML(merged.to_html(escape=False))

And now, most importantly: we can count up the seconds, grouped by each person.

In [None]:
merged.groupby('person')['Length (seconds)'].sum()

## Local models: faster, more private, more information

Let's say you're doing something like this, but *it's a bunch of secret stuff*. You don't want to share it with OpenAI or Google or anyone like that. You're also tired of giving money to Big Tech, and every request takes way too long!

The answer is **local models!** As long as your question isn't *too complicated*, code that runs on your own computer can probably handle it. You'll want to investigate [zero-shot classification models](https://huggingface.co/tasks/zero-shot-image-classification), which means "this model can put things into categories without you teaching it anything."

The biggest vision-capable LLMs are huge, giant beasts that you could never hope to run on your own laptop. But once you start to move beyond an all-powerful single tool, a [million tiny tools](https://huggingface.co/models) appear that [can do much of what you're looking for](https://huggingface.co/tasks).

In [None]:
from PIL import Image

filename = 'debate-scenes/rDXubdQdJYs-Scene-002-03.jpg'
image = Image.open(filename)
image

[OpenAI's clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) isn't *new*, but it will work fine for our cases (and it's plenty popular). Let's see how it does on one of these images.

In [None]:
from transformers import pipeline

# Load a classifier
detector = pipeline("zero-shot-image-classification", model="openai/clip-vit-large-patch14") 

# Use the classifier
results = detector(image, candidate_labels=["donald trump", "joe biden"])
results

Amazing???? And we get a confidence score????? And it was so fast???????
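
One caveat about that confidence score: for CLIP-style models the pipeline turns the image-to-label similarities into probabilities with a softmax across your `candidate_labels`, so the scores always add up to 1. That means a frame showing *neither* man still has to split its confidence between the two names. A sketch of the math with made-up similarity numbers (not real model output):

```python
import math

def softmax(values):
    """Turn raw similarity scores into probabilities that sum to 1."""
    biggest = max(values)
    exps = [math.exp(v - biggest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up similarity scores for an image of neither candidate
labels = ["donald trump", "joe biden"]
scores = softmax([18.2, 17.9])

for label, score in zip(labels, scores):
    print(label, round(score, 3))
```

When the raw similarities are close, neither label gets a very high score, which is why a cutoff on the score is a handy way to catch the "neither" frames.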

### Run the classifier on ALL the images

Well, not *all* the images, just the middle image. But close enough.

In [None]:
import glob

# Get the third image of each scene
filenames = glob.glob("debate-scenes/*-03.jpg")
filenames = sorted(filenames)
filenames[:5]

Take a close look at the code below: if the confidence score is below 98% we mark it as "unknown." This is the kind of flexibility you get when you move away from LLMs!

In [None]:
from transformers import pipeline
from PIL import Image

# Build the detector
detector = pipeline("zero-shot-image-classification", model="openai/clip-vit-large-patch14") 

answers = []
for filename in filenames:
    # Open the image
    image = Image.open(filename)

    # See the results
    results = detector(image, candidate_labels=["donald trump", "joe biden"])
    top_result = results[0]

    top_score = top_result['score']
    # Be 98% sure
    if top_score > 0.98:
        label = top_result['label']
    else:
        label = 'unknown'

    answers.append({
        'filename': filename,
        'label': label,
        'score': top_score,
        'preview': f'<img src="{filename}" width="100"/>'
    })

    print(filename, label, top_score)

Now let's do the typical combining...

In [None]:
import pandas as pd

results_df = pd.DataFrame(answers)
results_df.head()

In [None]:
import pandas as pd

df = pd.read_csv("debate-scenes/scenes.csv")
df.head()

...and how's it look?

In [None]:
from IPython.display import HTML

merged = results_df.join(df)

HTML(merged.to_html(escape=False))

In [None]:
merged.to_csv("merged.csv", index=False)

And now, the **final calculation!**

In [None]:
merged.groupby('label')['Length (seconds)'].sum()

We can also nudge up the required confidence score if we really want to be *certain* certain.

In [None]:
merged.query('score > 0.99').groupby('label')['Length (seconds)'].sum()
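
To get a feel for how much the cutoff matters, you can try a few thresholds and see how the totals shift. A sketch in plain Python with made-up rows (the `label` here is the raw top label before any "unknown" replacement, and none of these numbers are real results):

```python
# Made-up rows for illustration; in the notebook these would come from `merged`
rows = [
    {"label": "donald trump", "score": 0.999, "seconds": 12.0},
    {"label": "joe biden", "score": 0.985, "seconds": 8.0},
    {"label": "joe biden", "score": 0.995, "seconds": 5.0},
    {"label": "donald trump", "score": 0.970, "seconds": 3.0},
]

def totals_at(rows, threshold):
    """Sum seconds per label, keeping only rows scoring above the threshold."""
    totals = {}
    for row in rows:
        if row["score"] > threshold:
            totals[row["label"]] = totals.get(row["label"], 0) + row["seconds"]
    return totals

for threshold in (0.95, 0.98, 0.99):
    print(threshold, totals_at(rows, threshold))
```

If the winner flips as you nudge the threshold, that's a sign the result is too close to call from this method alone.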