##### Copyright 2024 Google LLC.

In [1]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Video understanding with Gemini

<a target="_blank" href="https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/Video_understanding.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" height=30/></a>

Gemini has been a multimodal model from the beginning, capable of analyzing all sorts of media using its [long context window](https://developers.googleblog.com/en/new-features-for-the-gemini-api-and-google-ai-studio/).

[Gemini 2.0](https://ai.google.dev/gemini-api/docs/models/gemini-v2) and later bring video analysis to a whole new level as illustrated in [this video](https://www.youtube.com/watch?v=Mot-JEU26GQ):


In [2]:
#@title Building with Gemini 2.0: Video understanding
%%html
<iframe width="560" height="315" src="https://www.youtube.com/embed/Mot-JEU26GQ?si=pcb7-_MZTSi_1Zkw" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

This notebook shows you how to use Gemini to perform the same kind of video analysis. Each example comes with several prompts that you can select from the dropdown, and you can also experiment with your own.

You can also check the [live demo](https://aistudio.google.com/starter-apps/video) and try it on your own videos on [AI Studio](https://aistudio.google.com/starter-apps/video).

## Setup

This section installs the SDK, sets it up using your [API key](../quickstarts/Authentication.ipynb), imports the relevant libraries, downloads the sample videos, and uploads them to Gemini.

Expand the section if you are curious, but you can also just run it (it should take a couple of minutes since there are large files) and go straight to the examples.

### Install SDK

The new **[Google Gen AI SDK](https://ai.google.dev/gemini-api/docs/sdks)** provides programmatic access to Gemini 2.0 (and previous models) using both the [Google AI for Developers](https://ai.google.dev/gemini-api/docs) and [Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs/overview) APIs. With a few exceptions, code that runs on one platform will run on both. This means that you can prototype an application using the Developer API and then migrate the application to Vertex AI without rewriting your code.

More details about this new SDK are available in the [documentation](https://ai.google.dev/gemini-api/docs/sdks) or in the [Getting started](../gemini-2/get_started.ipynb) notebook.

In [3]:
%pip install -U -q 'google-genai'

### Set up your API key

To run the following cell, your API key must be stored in a Colab Secret named `GOOGLE_API_KEY`. If you don't already have an API key, or you're not sure how to create a Colab Secret, see [Authentication](../quickstarts/Authentication.ipynb) for an example.

In [4]:
from google.colab import userdata

GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')

### Initialize SDK client

With the new SDK, you only need to initialize a client with your API key (or OAuth if using [Vertex AI](https://cloud.google.com/vertex-ai)). The model is now set in each call.

In [5]:
from google import genai
from google.genai import types

client = genai.Client(api_key=GOOGLE_API_KEY)

### Select the Gemini model

Video understanding works best with the Gemini 2.5 Pro model. You can also select earlier models to compare their behavior, but it is recommended to use at least the 2.0 ones.

For more information about all Gemini models, check the [documentation](https://ai.google.dev/gemini-api/docs/models/gemini).


In [6]:
model_name = "gemini-2.5-pro-exp-03-25" # @param ["gemini-1.5-flash-latest","gemini-2.0-flash-lite","gemini-2.0-flash","gemini-2.5-pro-exp-03-25"] {"allow-input":true, isTemplate: true}

### Get sample videos

You will start with uploaded videos, as that is the more common use case, but you will see later that you can also use YouTube videos.

In [7]:
# Download the sample videos
!wget https://storage.googleapis.com/generativeai-downloads/videos/Pottery.mp4 -O Pottery.mp4 -q
!wget https://storage.googleapis.com/generativeai-downloads/videos/Jukin_Trailcam_Videounderstanding.mp4 -O Trailcam.mp4 -q
!wget https://storage.googleapis.com/generativeai-downloads/videos/post_its.mp4 -O Post_its.mp4 -q
!wget https://storage.googleapis.com/generativeai-downloads/videos/user_study.mp4 -O User_study.mp4 -q

### Upload the videos

Upload all the videos using the File API. You can find more details about how to use it in the [Get Started](../gemini-2/get_started.ipynb#scrollTo=KdUjkIQP-G_i) notebook.

This can take a couple of minutes as the videos will need to be processed and tokenized.

In [8]:
import time

def upload_video(video_file_name):
  # Upload the file, then poll until the File API has finished processing it.
  video_file = client.files.upload(file=video_file_name)

  while video_file.state == "PROCESSING":
    print('Waiting for video to be processed.')
    time.sleep(10)
    video_file = client.files.get(name=video_file.name)

  if video_file.state == "FAILED":
    raise ValueError(video_file.state)
  print('Video processing complete: ' + video_file.uri)

  return video_file

pottery_video = upload_video('Pottery.mp4')
trailcam_video = upload_video('Trailcam.mp4')
post_its_video = upload_video('Post_its.mp4')
user_study_video = upload_video('User_study.mp4')

Waiting for video to be processed.
Video processing complete: https://generativelanguage.googleapis.com/v1beta/files/a8oqdle18w7l
Waiting for video to be processed.
Waiting for video to be processed.
Waiting for video to be processed.
Video processing complete: https://generativelanguage.googleapis.com/v1beta/files/up8zhje3r0kt
Waiting for video to be processed.
Video processing complete: https://generativelanguage.googleapis.com/v1beta/files/h3edbtghktrw
Waiting for video to be processed.
Video processing complete: https://generativelanguage.googleapis.com/v1beta/files/mggylqo1cari
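
The polling loop in `upload_video` waits for as long as the file reports `PROCESSING`. If you would rather give up after a while, one option is to put a time limit on the loop. The sketch below is only an illustration: `wait_until_done`, its parameters, and the simulated states are hypothetical and not part of the SDK.

```python
import time


def wait_until_done(get_state, poll_seconds=10.0, timeout_seconds=600.0, sleep=time.sleep):
    """Poll get_state() until it stops returning "PROCESSING" or the timeout expires."""
    waited = 0.0
    state = get_state()
    while state == "PROCESSING":
        if waited >= timeout_seconds:
            raise TimeoutError(f"Still processing after {waited:.0f} seconds")
        sleep(poll_seconds)
        waited += poll_seconds
        state = get_state()
    return state


# Simulated states standing in for the state of successive client.files.get(...) results.
fake_states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
final_state = wait_until_done(lambda: next(fake_states), sleep=lambda seconds: None)
print(final_state)
```

With a real upload, `get_state` could be something like `lambda: client.files.get(name=video_file.name).state`, relying on the same string comparison that `upload_video` uses.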


### Imports

In [9]:
import json
from PIL import Image
from IPython.display import display, Markdown, HTML

# Search within videos

First, try using the model to search within your videos and describe all the animal sightings in the trailcam video.

<video controls width="500"><source src="https://storage.googleapis.com/generativeai-downloads/videos/Jukin_Trailcam_Videounderstanding.mp4" type="video/mp4"></video>

In [10]:
prompt = "For each scene in this video, generate captions that describe the scene along with any spoken text placed in quotation marks. Place each caption into an object with the timecode of the caption in the video."  # @param ["For each scene in this video, generate captions that describe the scene along with any spoken text placed in quotation marks. Place each caption into an object with the timecode of the caption in the video.", "Organize all scenes from this video in a table, along with timecode, a short description, a list of objects visible in the scene (with representative emojis) and an estimation of the level of excitement on a scale of 1 to 10"] {"allow-input":true}

video = trailcam_video # @param ["trailcam_video", "pottery_video", "post_its_video", "user_study_video"] {"type":"raw","allow-input":true}

response = client.models.generate_content(
    model=model_name,
    contents=[
        video,
        prompt,
    ]
)

Markdown(response.text)

```json
[
  {"timecode": "00:00:00", "caption": "The camera view is obscured by the fur of an animal sniffing it, before pulling back to show two gray foxes exploring a rocky, leaf-strewn woodland area during the day."},
  {"timecode": "00:11:00", "caption": "One of the gray foxes leaps onto a large boulder while the other continues sniffing the ground."},
  {"timecode": "00:17:00", "caption": "In black and white infrared footage, a mountain lion walks through the same woodland area, sniffing the ground intently."},
  {"timecode": "00:28:00", "caption": "The mountain lion pauses, shakes its head, looks around, and continues sniffing before walking out of frame."},
  {"timecode": "00:34:00", "caption": "At night, under infrared light, two gray foxes are in the woodland. One sniffs the ground while the other rolls playfully on its back."},
  {"timecode": "00:43:00", "caption": "The rolling fox sits up as the other approaches. They interact briefly before darting away quickly."},
  {"timecode": "00:50:00", "caption": "Three gray foxes are now visible in the rocky area at night. One knocks the camera over as it passes."},
  {"timecode": "00:54:00", "caption": "Two foxes scramble up the rocks while one remains lower down, sniffing around before also climbing."},
  {"timecode": "01:04:00", "caption": "A mountain lion stands on the rocks at night, looking around, before walking further up and disappearing behind the boulders."},
  {"timecode": "01:17:00", "caption": "Another mountain lion, possibly a cub, appears briefly on the rocks in the background."},
  {"timecode": "01:18:00", "caption": "An adult mountain lion walks towards the camera and then turns right, while a smaller mountain lion (cub) walks along the rocks behind it."},
  {"timecode": "01:28:00", "caption": "A bobcat stands in the woodland at night, illuminated by infrared light."},
  {"timecode": "01:32:00", "caption": "The bobcat sniffs the ground, briefly rolls onto its side, gets up, looks around, and then continues sniffing."},
  {"timecode": "01:50:00", "caption": "During the day, a large black bear walks towards the camera in the woodland, then turns and walks away."},
  {"timecode": "01:56:00", "caption": "Under infrared light, a mountain lion walks briskly through the woods from left to right."},
  {"timecode": "02:04:00", "caption": "A bear cub bumps the camera with its nose."},
  {"timecode": "02:07:00", "caption": "The bear cub walks away as another cub enters the frame, sniffs the ground, and follows the first."},
  {"timecode": "02:17:00", "caption": "The two bear cubs sniff the ground near a tree and then walk off together."},
  {"timecode": "02:22:00", "caption": "At night, a gray fox stands on a ridge overlooking the distant lights of a city, sniffing the ground."},
  {"timecode": "02:34:00", "caption": "A black bear walks along the same ridge at night, passing in front of the camera with the city lights in the background."},
  {"timecode": "02:41:00", "caption": "A mountain lion walks along the ridge overlooking the city lights at night."},
  {"timecode": "02:51:00", "caption": "At night, a mountain lion backs up to a tree and scent-marks it by spraying."},
  {"timecode": "02:56:00", "caption": "The mountain lion turns, sniffs the ground where it marked, looks up briefly, and continues sniffing."},
  {"timecode": "03:04:00", "caption": "An adult black bear stands in the sunlit woodland, facing the camera."},
  {"timecode": "03:09:00", "caption": "The bear looks around, sniffs the air with its mouth slightly open, and then walks towards the camera."},
  {"timecode": "03:22:00", "caption": "A light brown (cinnamon phase) bear cub stands sideways to the camera in the woodland."},
  {"timecode": "03:26:00", "caption": "The cub sniffs the ground, while a second, darker cub appears behind it."},
  {"timecode": "03:31:00", "caption": "The first cub walks towards the camera, followed closely by the second cub."},
  {"timecode": "03:40:00", "caption": "The two cubs are now very close to the camera, sniffing the ground. A third cub walks past in the background."},
  {"timecode": "03:49:00", "caption": "The first cub sits down facing away from the camera and scratches its side with a hind leg."},
  {"timecode": "03:57:00", "caption": "The sitting cub gets up, and all three cubs walk away from the camera together."},
  {"timecode": "04:03:00", "caption": "A light brown bear cub walks towards the camera, followed by another."},
  {"timecode": "04:11:00", "caption": "The two cubs stand near each other, looking around the woodland."},
  {"timecode": "04:21:00", "caption": "At night, a bobcat sits in the undergrowth, looking towards the camera with glowing eyes."},
  {"timecode": "04:24:00", "caption": "The bobcat looks away, then walks off into the darkness."},
  {"timecode": "04:29:00", "caption": "A gray fox emerges from the undergrowth at night, its eyes reflecting the infrared light."},
  {"timecode": "04:35:00", "caption": "The fox sniffs around near a fallen log, looks at the camera, and continues exploring."},
  {"timecode": "04:44:00", "caption": "Another gray fox appears briefly in the background and runs off quickly."},
  {"timecode": "04:49:00", "caption": "The first gray fox walks away from the camera into the dark woods."},
  {"timecode": "04:57:00", "caption": "A mountain lion sniffs the ground near the base of a tree at night."},
  {"timecode": "05:03:00", "caption": "The mountain lion looks up briefly, then turns and walks away from the camera."}
]
```

The prompt used here is quite generic, but you can get even better results if you customize it to your needs (for example, by asking specifically for foxes).

The [live demo on AI Studio](https://aistudio.google.com/starter-apps/video) shows how you can postprocess this output to jump directly to a specific part of the video by clicking on the timecodes. If you are interested, you can check the [code of that demo on GitHub](https://github.com/google-gemini/starter-applets/tree/main/video).
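
The JSON reply can also be post-processed directly in Python. The sketch below is a minimal illustration and not the demo's implementation: `parse_json_reply` and `find_captions` are hypothetical helper names, and `sample_reply` is a shortened, hand-written stand-in for `response.text`. It assumes the reply is either bare JSON or JSON wrapped in a Markdown code fence, as in the output above.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep this example readable


def parse_json_reply(text: str):
    """Parse a model reply that may wrap its JSON in a Markdown code fence."""
    match = re.search(r"`{3}(?:json)?\s*(.*?)`{3}", text, re.DOTALL)
    payload = match.group(1) if match else text
    return json.loads(payload)


def find_captions(captions, keyword: str):
    """Return the entries whose caption mentions the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [entry for entry in captions if keyword in entry["caption"].lower()]


# Hypothetical reply, shortened from the kind of output shown above.
sample_body = '''[
  {"timecode": "00:00:00", "caption": "Two gray foxes explore a rocky woodland area."},
  {"timecode": "00:17:00", "caption": "A mountain lion walks through the same area."},
  {"timecode": "01:28:00", "caption": "A bobcat stands in the woodland at night."}
]'''
sample_reply = f"{FENCE}json\n{sample_body}\n{FENCE}"

captions = parse_json_reply(sample_reply)
fox_scenes = find_captions(captions, "fox")
for scene in fox_scenes:
    print(scene["timecode"], scene["caption"])
```

In the notebook you would pass `response.text` instead of `sample_reply`. Filtering afterwards is a cheap alternative to re-running the model with a more specific prompt.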

# Extract and organize text

Gemini can also read the text that appears in a video and extract it in an organized way. You can even use Gemini's reasoning capabilities to generate new ideas for you.

<video controls width="400"><source src="https://storage.googleapis.com/generativeai-downloads/videos/post_its.mp4" type="video/mp4"></video>

In [11]:
prompt = "Transcribe the sticky notes, organize them and put it in a table. Can you come up with a few more ideas?" # @param ["Transcribe the sticky notes, organize them and put it in a table. Can you come up with a few more ideas?", "Which of those names would fit an AI product that can resolve complex questions using its thinking abilities?"] {"allow-input":true}

video = post_its_video # @param ["trailcam_video", "pottery_video", "post_its_video", "user_study_video"] {"type":"raw","allow-input":true}

response = client.models.generate_content(
    model=model_name,
    contents=[
        video,
        prompt,
    ]
)

Markdown(response.text)

Okay, here is the transcription of the sticky notes from the video, organized alphabetically into a table, followed by a few additional brainstormed ideas in a similar style.

**Transcribed Project Name Ideas:**

|                   |                     |                     |                    |
| :---------------- | :------------------ | :------------------ | :----------------- |
| Aether            | Chaos Field         | Euler's Path        | Odin               |
| Andromeda's Reach | Chaos Theory        | Fractal             | Orion's Belt       |
| Astral Forge      | Chimera Dream       | Galactic Core       | Orion's Sword      |
| Athena            | Comet's Tail        | Golden Ratio        | Pandora's Box      |
| Athena's Eye      | Convergence         | Hera                | Perseus Shield     |
| Bayes Theorem     | Delphinus           | Infinity Loop       | Phoenix            |
| Canis Major       | Draco               | Leo Minor           | Prometheus Rising  |
| Centaurus         | Echo                | Lunar Eclipse       | Riemann's Hypoth.  |
| Cerberus          | Equilibrium         | Lynx                | Sagitta            |
| Celestial Drift   |                     | Lyra                | Serpens            |
| Stellar Nexus     | Stokes Theorem      | Supernova Echo      | Symmetry           |
| Taylor Series     | Titan               | Vector              | Zephyr             |

*(Note: "Riemann's Hypothesis" was abbreviated slightly in the table for space.)*

**Additional Brainstormed Ideas:**

1.  **Event Horizon:** (Astronomy/Physics - Point of no return near a black hole)
2.  **Project Icarus:** (Mythology - Referencing ambition and flight)
3.  **Quantum Leap:** (Physics - Sudden change in state, implies advancement)
4.  **Fibonacci Spiral:** (Mathematics/Nature - Pattern of growth and beauty)
5.  **Vanguard:** (General/Abstract - Leading position, forefront of development)
6.  **Nebula Prime:** (Astronomy - Primary or first cloud of gas/dust)
7.  **Axiom Core:** (Mathematics/Logic - Fundamental truth or core principle)

# Structure information

Gemini 2.0 can not only read text but also reason about real-world objects and structure the information it finds, as in this video of a display of ceramics with handwritten prices and notes.

<video controls width="500"><source src="https://storage.googleapis.com/generativeai-downloads/videos/Pottery.mp4" type="video/mp4"></video>

In [12]:
prompt = "Give me a table of my items and notes" # @param ["Give me a table of my items and notes", "Help me come up with a selling pitch for my potteries"] {"allow-input":true}

video = pottery_video # @param ["trailcam_video", "pottery_video", "post_its_video", "user_study_video"] {"type":"raw","allow-input":true}

response = client.models.generate_content(
    model=model_name,
    contents=[
        video,
        prompt,
    ],
    config = types.GenerateContentConfig(
        system_instruction="Don't forget to escape the dollar signs",
    )
)

Markdown(response.text)

Okay, here is a table summarizing the items and notes shown in the image:

| Item          | Description / Notes                       | Price |
| :------------ | :---------------------------------------- | :---- |
| Tumblers      | Glaze: #5 Artichoke double dip<br>4"h x 3"d (-ish) | \$20  |
| Small bowls   | 3.5"h x 6.5"d                             | \$35  |
| Med bowls     | 4"h x 7"d                                 | \$40  |
| *Glaze Info*  | #5 Artichoke double dip (Test tile shown) | N/A   |
| *Glaze Info*  | #6 Gemini double dip, SLOW COOL (Test tile shown, marked 6rb) | N/A   |

**Note:** The glaze for the small and medium bowls appears to be the "#6 Gemini double dip" based on visual similarity to the test tile, although the notes next to the bowls don't explicitly state the glaze name.

As you can see, Gemini is able to grasp which item each note corresponds to, including the last one.
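
If you want to reuse the table in code rather than just display it, you can convert the Markdown rows into Python dictionaries. This is a rough sketch under a few assumptions: `parse_markdown_table` is a hypothetical helper, the reply contains a single pipe table, no cell contains a `|` character, and `sample_reply` is a trimmed stand-in for `response.text`.

```python
def parse_markdown_table(text: str):
    """Turn the pipe-table lines in `text` into a list of dicts (assumes one table)."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # skip prose around the table
        rows.append([cell.strip() for cell in line[1:-1].split("|")])
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |:---| separator line
    return [dict(zip(header, row)) for row in body]


# Hypothetical reply, trimmed from the kind of output shown above.
sample_reply = r"""
Okay, here is a table summarizing the items:

| Item        | Description / Notes | Price |
| :---------- | :------------------ | :---- |
| Tumblers    | 4"h x 3"d (-ish)    | \$20  |
| Small bowls | 3.5"h x 6.5"d       | \$35  |
"""

items = parse_markdown_table(sample_reply)
# The dollar signs are escaped for Markdown rendering, so strip "\$" before converting.
prices = {item["Item"]: float(item["Price"].replace("\\$", "")) for item in items}
print(prices)
```

For anything beyond a quick experiment, asking the model for JSON output is usually easier to parse than a Markdown table.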

# Analyze screen recordings for key moments

You can also use the model to analyze screen recordings. Let's say you're doing user studies on how people use your product, so you end up with lots of screen recordings, like this one, that you have to manually comb through.
With just one prompt, the model can describe all the actions in your video.

<video controls width="400"><source src="https://storage.googleapis.com/generativeai-downloads/videos/user_study.mp4" type="video/mp4"></video>

In [13]:
prompt = "Generate a paragraph that summarizes this video. Keep it to 3 to 5 sentences with corresponding timecodes." # @param ["Generate a paragraph that summarizes this video. Keep it to 3 to 5 sentences with corresponding timecodes.", "Choose 5 key shots from this video and put them in a table with the timecode, text description of 10 words or less, and a list of objects visible in the scene (with representative emojis).", "Generate bullet points for the video. Place each bullet point into an object with the timecode of the bullet point in the video."] {"allow-input":true}

video = user_study_video # @param ["trailcam_video", "pottery_video", "post_its_video", "user_study_video"] {"type":"raw","allow-input":true}

response = client.models.generate_content(
    model=model_name,
    contents=[
        video,
        prompt,
    ]
)

Markdown(response.text)

Here is a summary of the video:

(00:00-00:10) The video displays a mobile application called "My Garden App" showcasing various plants available for purchase.
(00:10-00:17) The user interacts with the app by clicking the "Like" button for the Rose Plant, Fern, and Cactus, turning the buttons red.
(00:13-00:25) They proceed to add the Fern, Cactus, and Hibiscus plants to the shopping cart, indicated by the "Add to Cart" button briefly changing to "Added!".
(00:29-00:34) The user navigates to the "Cart" tab, showing the three selected items and the total price, and then briefly views the "Profile" tab showing counts for liked plants and cart items.
(00:37-00:45) After returning to the home screen, the user unlikes the Hibiscus, likes the Snake Plant, and adds the Orchid to their cart.
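
To find these moments in a player, it helps to turn the timecode ranges into seconds. The following sketch assumes each sentence starts with a `(MM:SS-MM:SS)` range as in the summary above. The model may format a reply differently, so treat the pattern as an assumption. `parse_timed_summary` is a hypothetical helper and `sample_summary` is a shortened stand-in for `response.text`.

```python
import re

# Matches ranges such as "(00:10-00:17)" at the start of a line.
RANGE_PATTERN = re.compile(r"^\((\d+):(\d{2})-(\d+):(\d{2})\)\s*(.*)$", re.MULTILINE)


def parse_timed_summary(text: str):
    """Return (start_seconds, end_seconds, sentence) tuples from a timed summary."""
    segments = []
    for match in RANGE_PATTERN.finditer(text):
        start = int(match.group(1)) * 60 + int(match.group(2))
        end = int(match.group(3)) * 60 + int(match.group(4))
        segments.append((start, end, match.group(5).strip()))
    return segments


# Hypothetical reply, shortened from the kind of output shown above.
sample_summary = """Here is a summary of the video:

(00:00-00:10) The video displays a mobile application.
(00:10-00:17) The user likes several plants.
(00:29-00:34) The user opens the Cart tab."""

segments = parse_timed_summary(sample_summary)
for start, end, sentence in segments:
    print(f"{start:>3}s -> {end:>3}s  {sentence}")
```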

# Analyze YouTube videos

In addition to using your own videos, you can also ask Gemini to fetch a video from YouTube and analyze it. Here's an example using the keynote from Google I/O 2023. Guess what the main theme was?


In [14]:
response = client.models.generate_content(
    model=model_name,
    contents=types.Content(
        parts=[
            types.Part(text="Find all the instances where Sundar says \"AI\". Provide timestamps and broader context for each instance."),
            types.Part(
                file_data=types.FileData(file_uri='https://www.youtube.com/watch?v=ixRanV-rdAQ')
            )
        ]
    )
)

Markdown(response.text)

Okay, here are all the instances where Sundar Pichai says "AI" in the provided video, along with timestamps and context:

1.  **0:29** - Context: Setting the stage for the keynote, acknowledging the current focus on AI.
    > "As you may have heard, **AI** is having a very busy year."

2.  **0:38** - Context: Highlighting Google's long-term commitment and shift towards AI.
    > "Seven years into our journey as an **AI**-first company, we are at an exciting inflection point."

3.  **0:45** - Context: Describing the goal and potential impact of AI development.
    > "We have an opportunity to make **AI** even more helpful for people, for businesses, for communities, for everyone."

4.  **0:54** - Context: Referring to Google's ongoing use of AI to improve its products.
    > "We've been applying **AI** to make our products radically more helpful for a while."

5.  **1:41** - Context: Discussing advancements in Google Workspace features like Smart Compose.
    > "Smart Compose led to more advanced writing features powered by **AI**."

6.  **3:02** - Context: Explaining the technology behind Google Street View.
    > "Since the early days of Street View, **AI** has stitched together billions of panoramic images..."

7.  **3:15** - Context: Describing the technology used for the Immersive View feature in Google Maps.
    > "...Immersive View, which uses **AI** to create a high-fidelity representation of a place..."

8.  **5:08** - Context: Introducing Google Photos as an example of an AI-enhanced product.
    > "Another product made better by **AI** is Google Photos."

9.  **5:38** - Context: Explaining how AI enables more powerful photo editing features.
    > "**AI** advancements give us more powerful ways to do this."

10. **7:40** - Context: Summarizing the product examples shown (Gmail, Photos, Maps).
    > "...these are just a few examples of how **AI** can help you in moments that matter."

11. **7:47** - Context: Stating the broader goal of leveraging AI across all Google products.
    > "...so much more we can do to deliver the full potential of **AI** across the products you know and love."

12. **8:22** - Context: Positioning AI as central to Google's mission going forward.
    > "Looking ahead, making **AI** helpful for everyone is the most profound way we will advance our mission."

13. **8:53** - Context: Highlighting the importance of responsible AI development and deployment.
    > "...by building and deploying **AI** responsibly so that everyone can benefit equally."

14. **9:02** - Context: Linking the advancement of foundation models to the goal of making AI helpful.
    > "Our ability to make **AI** helpful for everyone relies on continuously advancing our foundation models."

15. **11:26** - Context: Describing the capabilities of Sec-PaLM, a security-focused AI model.
    > "It uses **AI** to better detect malicious scripts..."

16. **12:14** - Context: Discussing future potential applications of Med-PaLM 2 in healthcare.
    > "You can imagine an **AI** collaborator that helps radiologists interpret images..."

17. **12:46** - Context: Framing PaLM 2 within Google's long-term vision for responsible AI.
    > "...latest step in our decade-long journey to bring **AI** in responsible ways to billions of people."

18. **14:09** - Context: Emphasizing the parallel investment in AI responsibility alongside model development.
    > "As we invest in more advanced models, we're also deeply investing in **AI** responsibility."

19. **15:04** - Context: Discussing metadata as a tool for identifying AI-generated content.
    > "We'll ensure every one of our **AI**-generated images has that metadata."

20. **15:11** - Context: Referring to a future segment specifically on responsible AI.
    > "James will talk about our responsible approach to **AI** later."

21. **15:30** - Context: Introducing Bard as Google's platform for conversational AI interaction.
    > "...our experiment for conversational **AI**."

Once again, the [live demo on AI Studio](https://aistudio.google.com/starter-apps/video) shows an example of how to postprocess this output. Check the [code of that demo](https://github.com/google-gemini/starter-applets/tree/main/video) for more details.
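
For a YouTube video, a simple form of postprocessing is to turn each timestamp into a link that starts playback at that moment. This sketch is independent of the AI Studio demo. It assumes the reply keeps the numbered `**M:SS**` layout shown above, and that YouTube's `t` query parameter accepts a value in seconds. `timestamp_to_seconds`, `youtube_link`, and `sample_reply` are hypothetical names used for illustration.

```python
import re
from urllib.parse import urlencode


def timestamp_to_seconds(timestamp: str) -> int:
    """Convert "M:SS" or "H:MM:SS" into a number of seconds."""
    seconds = 0
    for part in timestamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def youtube_link(video_id: str, timestamp: str) -> str:
    """Build a watch URL that starts playback at the given timestamp."""
    query = urlencode({"v": video_id, "t": f"{timestamp_to_seconds(timestamp)}s"})
    return f"https://www.youtube.com/watch?{query}"


# Matches the bold timestamp in list items such as "1.  **0:29** - Context: ..."
TIMESTAMP_PATTERN = re.compile(r"^\d+\.\s+\*\*(\d+(?::\d{2}){1,2})\*\*", re.MULTILINE)

# Hypothetical reply, shortened from the kind of output shown above.
sample_reply = """1.  **0:29** - Context: Setting the stage for the keynote.
2.  **11:26** - Context: Describing a security-focused model."""

links = [youtube_link("ixRanV-rdAQ", ts) for ts in TIMESTAMP_PATTERN.findall(sample_reply)]
for link in links:
    print(link)
```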

# Next Steps

Try with your own videos using the [AI Studio's live demo](https://aistudio.google.com/starter-apps/video) or play with the examples from this notebook (in case you missed them, there are other prompts you can try in the dropdowns).

For more examples of Gemini 2.0's capabilities, check the [Gemini 2.0 folder of the cookbook](https://github.com/google-gemini/cookbook/tree/main/gemini-2/). You'll learn how to use the [Live API](../quickstarts/Get_started_LiveAPI.ipynb), juggle [multiple tools](../quickstarts/Get_started_LiveAPI_tools.ipynb), or use Gemini 2.0's [spatial understanding](../quickstarts/Spatial_understanding.ipynb) abilities.

The [examples](https://github.com/google-gemini/cookbook/tree/main/examples/) folder of the cookbook is also full of code samples illustrating creative ways to use Gemini's multimodal capabilities and long context.