In [None]:
%pip install --quiet --upgrade "google-genai" pydantic pandas

# Understanding videos with AI

We're going to be using [Google Gemini](https://ai.google.dev/gemini-api/docs), which has [a really nice cookbook](https://github.com/google-gemini/cookbook) of examples and walkthroughs.

Most of the time when we talk about AI we mean **LLMs**, large language models. But Gemini (like the other popular models) isn't just a *language* model, it's a *multimodal model*, which means it can take input other than text. GPT can handle things like PDFs and images, but Gemini is magical because **it can process videos**.

While you might be used to talking to AI tools through [the chat interface](https://gemini.google.com/app), we're going to be working through Python code. Technically we'll be using an **API**, which is a fancy way of saying "computers talking to other computers."

## Connecting to Gemini

We'll start by creating a connection to Gemini. To do this we'll need an **API key**, which is your access code. It's kind of like a username and password, so don't share it with the world! You can find guidance on how to create your own [right here](https://github.com/google-gemini/cookbook?tab=readme-ov-file#1-quick-starts). For now we'll use mine!

In [1]:
from google import genai
from google.genai import types

client = genai.Client(api_key='AIzaSyD--gLjge7h1wXqL3wRnIG1HskVS_JkrJo')
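
One common way to keep a key out of the notebook itself is to read it from an environment variable. Here's a minimal sketch, assuming you've exported your key under the name `GEMINI_API_KEY` (both that variable name and the `load_api_key` helper are made up for this example):

```python
import os


def load_api_key(env_var="GEMINI_API_KEY"):
    """Return the API key stored in an environment variable.

    The variable name is an assumption for this sketch; use whatever
    name you exported the key under.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage (not run here): client = genai.Client(api_key=load_api_key())
```

That way the notebook can be shared without the key travelling along with it.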

## Uploading a video

To send a video to Gemini, first you need to upload it.

In [19]:
import time

def upload_video(video_file_name):
    # Upload the file, then poll until Gemini has finished processing it
    video_file = client.files.upload(file=video_file_name)

    while video_file.state == "PROCESSING":
        print('Waiting for video to be processed.')
        time.sleep(10)
        video_file = client.files.get(name=video_file.name)

    if video_file.state == "FAILED":
        raise ValueError(video_file.state)
    print(f'Video processing complete: {video_file.uri}')

    return video_file
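
The loop above waits forever if a video never leaves the processing state. As a rough sketch of the same polling idea with an upper limit on the wait, here's a generic helper. `wait_until_ready`, its parameters, and the 600-second default are all invented for illustration, and it assumes the state compares equal to the string `"PROCESSING"` the same way it does in `upload_video`:

```python
import time


def wait_until_ready(get_state, poll_seconds=10, timeout_seconds=600, sleep=time.sleep):
    """Poll get_state() until it stops returning "PROCESSING".

    get_state is any zero-argument callable that returns the current
    state; timeout_seconds caps the total time spent waiting.
    """
    waited = 0
    state = get_state()
    while state == "PROCESSING":
        if waited >= timeout_seconds:
            raise TimeoutError(f"Still processing after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        state = get_state()
    return state


# Usage (not run here):
# wait_until_ready(lambda: client.files.get(name=video_file.name).state)
```

The `sleep` argument is only there so the helper can be tried out without actually waiting.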

We're going to analyze the video [The CNN Presidential debate in 60 seconds](https://www.youtube.com/watch?v=rDXubdQdJYs):

<div style="margin: auto 10px; text-align: center;">
    <video height="500" src="rDXubdQdJYs.webm" type="video/webm" controls></video>
</div>

I have it saved here as `rDXubdQdJYs.webm`. Let's upload it!

In [None]:
video = upload_video('rDXubdQdJYs.webm')

## Ask Gemini about the video

We can start by asking various questions about the video in a *very* unstructured way.

In [26]:
from IPython.display import display, Markdown, HTML
from google.genai.types import GenerateContentConfig

prompt = """
Describe the video. Count the number of seconds Trump is alone on the screen and
the number of seconds Biden is alone on the screen
"""

response = client.models.generate_content(
    model="gemini-2.0-flash-lite",
    contents=[
        video,
        prompt,
    ],
    config=GenerateContentConfig(temperature=0)
)

print(response.text)

Here is a description of the video, including the requested counts:

**Video Description:**

The video shows clips from the CNN Presidential Debate held in Atlanta, Georgia on June 27, 2024. It features Joe Biden and Donald Trump. The clips show the candidates speaking and reacting to each other's statements. The topics discussed include border patrol, asylum officers, felony convictions, inflation, and cognitive tests. The video includes some heated exchanges between the two candidates.

**Timestamps:**

*   Trump alone on screen: 20 seconds
*   Biden alone on screen: 23 seconds


The response is in Markdown, which will look nice if we let the notebook render it:

In [25]:
from IPython.display import Markdown

Markdown(response.text)

Here is a description of the video, including the requested counts:

**Video Description:**

The video shows clips from the CNN Presidential Debate held in Atlanta, Georgia on June 27, 2024. It features Joe Biden and Donald Trump. The clips show the candidates speaking and reacting to each other's statements. The topics discussed include border patrol, asylum officers, felony convictions, inflation, and cognitive tests. The video includes some heated exchanges between the two candidates.

**Timestamps:**

*   Trump alone on screen: 15 seconds
*   Biden alone on screen: 21 seconds

----

**For now we won't fact-check it,** we'll just assume that went well.

### More detailed responses

If we want to go deeper, we can start to ask for more details. Below, we ask for **a breakdown of each scene, along with quotes.**

In [3]:
from IPython.display import display, Markdown, HTML

prompt = """
For each scene in this video, generate captions that describe the 
scene along with any spoken text placed in quotation marks. Place each caption 
into an object with the timecode of the caption in the video.
"""

response = client.models.generate_content(
    model="gemini-2.0-flash-lite",
    contents=[
        video,
        prompt,
    ]
)

Markdown(response.text)

Okay! Here are the captions describing each scene of the video with the timecode of the caption in each object:

```json
[
  {
    "00:00": "Split screen showing Biden and Trump walking onto the stage during the presidential debate."
  },
  {
    "00:02": "Close-up of Biden speaking, “...relative to what we’re going to do with...”"
  },
  {
    "00:04": "Close-up of Trump speaking, “...more border patrol and more asylum officers.”"
  },
  {
    "00:06": "Close-up of Biden listening. Close-up of Trump listening."
  },
  {
    "00:07": "Close-up of Biden speaking, “...more border patrol and more asylum officers.”"
  },
  {
    "00:08": "Close-up of Trump speaking, “I really don’t know what he said at the end of that sentence. Don’t think he knows what he said either.”"
  },
  {
    "00:12": "Close-up of Biden speaking, “The only person on this stage that is a convicted felon is the man I’m looking at right now.”"
  },
  {
    "00:16": "Close-up of Trump speaking, “...but when he talks about a convicted felon, his son is a convicted felon.”"
  },
  {
    "00:20": "Close-up of Biden speaking, “What are you talking about? You have the morals of an alley cat.”"
  },
  {
    "00:25": "Close-up of Biden speaking, “My son was not a loser. He was not a sucker. You’re the sucker. You’re the loser.”"
  },
  {
    "00:29": "Close-up of Trump speaking, “He’s blaming inflation and he’s right it’s been very bad. He caused the inflation.”"
  },
  {
    "00:35": "Close-up of Biden speaking, “Excuse me with, dealing with everything we have to do with...”"
  },
  {
    "00:40": "Close-up of Trump listening. Close-up of Biden listening."
  },
  {
    "00:41": "Close-up of Biden speaking, “Look, if...We finally beat Medicare.”"
  },
  {
    "00:46": "Close-up of Biden listening. Question from the moderator."
  },
  {
    "00:47": "Close-up of Trump speaking, “Well, I took two tests. Cognitive tests. I aced them. Go through the first five questions. He couldn’t do it.”"
  },
  {
    "00:53": "Close-up of Biden speaking, “This guy is three years younger and a lot less competent. I think that just look at the record. Look at what I’ve done.”"
  }
]
```
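
Since that JSON arrives wrapped in a Markdown code fence, we'd have to dig it out before we could use it as data. Here's a rough sketch of one way to do that with a regular expression. The `extract_json_block` helper and the shortened `sample` string are made up for this example, and it assumes the reply contains exactly the kind of `json` fence shown above:

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so they don't end this code block


def extract_json_block(markdown_text):
    """Parse the first json-fenced block found in a Markdown string."""
    pattern = FENCE + r"json\s*(.*?)" + FENCE
    match = re.search(pattern, markdown_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("No json block found in the response text")
    return json.loads(match.group(1))


# Hypothetical, shortened stand-in for response.text
sample = (
    "Okay! Here are the captions:\n\n"
    + FENCE + "json\n"
    + '[\n  {"00:00": "Split screen showing Biden and Trump."}\n]\n'
    + FENCE
)

captions = extract_json_block(sample)

# Each object holds a single timecode -> caption pair, so flatten them into rows
rows = [(timecode, caption) for item in captions for timecode, caption in item.items()]
```

This works only as long as the model keeps answering in the same shape, which nothing in the prompt guarantees.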

## Structured responses

Instead of just humbly asking for a certain format, we can use [structured outputs with Pydantic](https://platform.openai.com/docs/guides/structured-outputs). It's a way of demanding a specific output shape. It started with OpenAI and has since spread to a lot of the other providers.

If you intend to push your responses into a CSV or database or require certain fields, you should *really* be using this approach.

In [36]:
from pydantic import BaseModel, Field
from typing import List, Literal
import json

class Scene(BaseModel):
    description: str
    scene_description: str
    spoken_text: str
    speaker_name: Literal['trump', 'biden', 'both/neither']
    person_on_screen: Literal['trump', 'biden', 'both/neither']
    start_time: str = Field(..., description="Scene start timecode in format hh:mm:ss")
    end_time: str = Field(..., description="Scene end timecode in format hh:mm:ss")
    duration: int
    
class SceneList(BaseModel):
    description: str
    length_seconds: int = Field(..., description="Total duration of all scenes in seconds")
    scenes: List[Scene]
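
A Pydantic class can also check data on our side, not just describe what we want. Here's a small sketch, assuming Pydantic v2 (which provides `model_validate_json`). `MiniScene` is a trimmed, hypothetical stand-in for the `Scene` class above, and both JSON strings are invented for the example:

```python
from typing import Literal

from pydantic import BaseModel, ValidationError


class MiniScene(BaseModel):
    # Trimmed, hypothetical stand-in for the Scene class above
    spoken_text: str
    speaker_name: Literal['trump', 'biden', 'both/neither']
    duration: int


good = '{"spoken_text": "", "speaker_name": "biden", "duration": 2}'
bad = '{"spoken_text": "", "speaker_name": "moderator", "duration": 2}'

# Parses the JSON and checks every field's type in one step
scene = MiniScene.model_validate_json(good)

try:
    MiniScene.model_validate_json(bad)
    rejected = False
except ValidationError:
    # "moderator" is not one of the allowed Literal values
    rejected = True
```

The `Literal` fields are what keep speaker names consistent enough to group by later.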

In [47]:
prompt = """
For each scene in this video, generate captions that describe the 
scene along with any spoken text placed in quotation marks. At the
conclusion of the video, count up the number of seconds of Biden
scenes vs Trump scenes and display the tabulation.
"""

response = client.models.generate_content(
    model="gemini-2.0-flash-lite",
    contents=[
        video,
        prompt,
    ],
    config=genai.types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=SceneList,
        temperature=0
    ),
)

result = json.loads(response.text)
result

{'description': 'This video shows clips of the 2024 Presidential Debate between Joe Biden and Donald Trump.',
 'length_seconds': 59,
 'scenes': [{'description': 'Split screen of Biden and Trump walking onto the stage.',
   'scene_description': 'Split screen of Biden and Trump walking onto the stage.',
   'spoken_text': '',
   'speaker_name': 'both/neither',
   'person_on_screen': 'both/neither',
   'start_time': '00:00',
   'end_time': '00:01',
   'duration': 1},
  {'description': 'Close up of Biden speaking.',
   'scene_description': 'Close up of Biden speaking.',
   'spoken_text': "...relative to what we're going to do with",
   'speaker_name': 'biden',
   'person_on_screen': 'biden',
   'start_time': '00:01',
   'end_time': '00:03',
   'duration': 2},
  {'description': 'Close up of Trump speaking.',
   'scene_description': 'Close up of Trump speaking.',
   'spoken_text': 'more border patrol and more asylum officers',
   'speaker_name': 'trump',
   'person_on_screen': 'trump',
   'st

In [48]:
import pandas as pd
pd.options.display.max_colwidth = 200

df = pd.json_normalize(result['scenes'])
df

| |description|scene_description|spoken_text|speaker_name|person_on_screen|start_time|end_time|duration|
|---|---|---|---|---|---|---|---|---|
|0|Split screen of Biden and Trump walking onto the stage.|Split screen of Biden and Trump walking onto the stage.||both/neither|both/neither|00:00|00:01|1|
|1|Close up of Biden speaking.|Close up of Biden speaking.|...relative to what we're going to do with|biden|biden|00:01|00:03|2|
|2|Close up of Trump speaking.|Close up of Trump speaking.|more border patrol and more asylum officers|trump|trump|00:03|00:06|3|
|3|Close up of Biden speaking.|Close up of Biden speaking.|more border patrol and more asylum officers|biden|biden|00:06|00:08|2|
|4|Close up of Trump speaking.|Close up of Trump speaking.|I really don't know what he said at the end of that sentence.|trump|trump|00:08|00:11|3|
|5|Close up of Biden speaking.|Close up of Biden speaking.|The only person on this stage that is a convicted felon|biden|biden|00:12|00:15|3|
|6|Close up of Trump speaking.|Close up of Trump speaking.|...but when he talks about a convicted felon, his son is a convicted felon.|trump|trump|00:16|00:20|4|
|7|Close up of Biden speaking.|Close up of Biden speaking.|What are you talking about? You have the morals of an alley cat.|biden|biden|00:20|00:24|4|
|8|Close up of Biden speaking.|Close up of Biden speaking.|My son was not a loser. He was not a sucker. You're the sucker. You're the loser.|biden|biden|00:24|00:29|5|
|9|Close up of Trump speaking.|Close up of Trump speaking.|He's blaming inflation and he's right it's been very bad. He caused the inflation.|trump|trump|00:29|00:35|6|
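
One cheap sanity check on a table like this is to recompute each duration from its start and end timecodes and see whether the model's own arithmetic holds up. Here's a minimal sketch. The helper names are made up, the first two rows are copied from the table above, and the third row is an invented error so there's something to catch. Note that we asked for `hh:mm:ss` but got `mm:ss` back, so the converter accepts both:

```python
def timecode_to_seconds(timecode):
    """Convert 'mm:ss' or 'hh:mm:ss' into a number of seconds."""
    seconds = 0
    for part in timecode.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def duration_mismatches(scenes):
    """Return (index, reported, computed) for scenes whose duration is not end - start."""
    problems = []
    for i, scene in enumerate(scenes):
        computed = timecode_to_seconds(scene["end_time"]) - timecode_to_seconds(scene["start_time"])
        if computed != scene["duration"]:
            problems.append((i, scene["duration"], computed))
    return problems


scenes = [
    {"start_time": "00:01", "end_time": "00:03", "duration": 2},
    {"start_time": "00:29", "end_time": "00:35", "duration": 6},
    {"start_time": "00:35", "end_time": "00:40", "duration": 9},  # hypothetical bad row
]
problems = duration_mismatches(scenes)
```

With the real response, the call would be `duration_mismatches(result['scenes'])`. This only tests whether the numbers agree with each other, not whether the timecodes match the video.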


In [49]:
df.groupby('speaker_name')['duration'].sum()

speaker_name
biden           30
both/neither     1
trump           25
Name: duration, dtype: int64

Seems great, right? Perfect, right?

## Let's run it again.

And again. And again?

|Speaker|Run 1|Run 2|Run 3|Run 4|
|---|---|---|---|---|
|Biden|30|31|32|30|
|Trump|26|26|24|25|
|Both/neither|1|1|1|1|

...the *numerical result changes each time*. It might be close to the same each time, but is that good enough? And is it even accurate? 
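
To put a number on "close to the same," we can summarize how far apart the runs land. Here's a small sketch using Biden's four totals from the table above (`summarize_runs` is a made-up helper):

```python
from statistics import mean


def summarize_runs(values):
    """Summarize repeated measurements of the same quantity."""
    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "mean": mean(values),
    }


# Biden's totals from the four runs in the table above
biden_runs = [30, 31, 32, 30]
summary = summarize_runs(biden_runs)
# summary == {"min": 30, "max": 32, "range": 2, "mean": 30.75}
```

A two-second range on a total of about thirty seconds may or may not matter, depending on what the number is used for.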

You might want to use this for **qualitative** work and maybe not for **quantitative** work.

In the [next notebook](02-Video%20scene%20and%20image%20analysis.ipynb) we'll look at how to do this with a little more nuance and accuracy.