##### Copyright 2025 Google LLC.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Video understanding with Gemini

<a target="_blank" href="https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/Video_understanding.ipynb"><img src="https://colab.research.google.com/assets/colab-badge.svg" height=30/></a>

Gemini has been a multimodal model from the beginning, capable of analyzing all sorts of media using its [long context window](https://developers.googleblog.com/en/new-features-for-the-gemini-api-and-google-ai-studio/).

[Gemini models](https://ai.google.dev/gemini-api/docs/models/) bring video analysis to a whole new level as illustrated in [this video](https://www.youtube.com/watch?v=Mot-JEU26GQ):


In [None]:
#@title Building with Gemini 2.0: Video understanding
%%html
<iframe width="560" height="315" src="https://www.youtube.com/embed/Mot-JEU26GQ?si=pcb7-_MZTSi_1Zkw" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

This notebook shows you how to use Gemini to perform the same kind of video analysis. Each example comes with several prompts that you can select from the dropdown; feel free to experiment with your own.

You can also check the [live demo](https://aistudio.google.com/starter-apps/video) and try it on your own videos on [AI Studio](https://aistudio.google.com/starter-apps/video).

## Setup

This section installs the SDK, sets it up using your [API key](../quickstarts/Authentication.ipynb), imports the relevant libraries, downloads the sample videos and uploads them to Gemini.

Expand the section if you are curious, but you can also just run it (it should take a couple of minutes since there are large files) and go straight to the examples.

### Install SDK

The new **[Google Gen AI SDK](https://ai.google.dev/gemini-api/docs/sdks)** provides programmatic access to Gemini 2.0 (and previous models) using both the [Google AI for Developers](https://ai.google.dev/gemini-api/docs) and [Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs/overview) APIs. With a few exceptions, code that runs on one platform will run on both. This means that you can prototype an application using the Developer API and then migrate the application to Vertex AI without rewriting your code.

You can find more details about this new SDK in the [documentation](https://ai.google.dev/gemini-api/docs/sdks) or in the [Getting started](../quickstarts/Get_started.ipynb) notebook.

In [2]:
%pip install -U -q "google-genai>=1.16.0"


### Setup your API key

To run the following cell, your API key must be stored in a Colab Secret named `GOOGLE_API_KEY`. If you don't already have an API key, or you're not sure how to create a Colab Secret, see [Authentication](../quickstarts/Authentication.ipynb) for an example.

In [3]:
from google.colab import userdata

GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')
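
Outside Colab, `google.colab` cannot be imported. A minimal sketch of a fallback, assuming the key is then exported as a `GOOGLE_API_KEY` environment variable (the helper name `load_api_key` is hypothetical and not part of any SDK):

```python
import os


def load_api_key(name="GOOGLE_API_KEY"):
    """Return the API key from Colab Secrets, else from the environment."""
    try:
        # This import only succeeds inside Colab.
        from google.colab import userdata
        key = userdata.get(name)
        if key:
            return key
    except Exception:
        # Not running in Colab, or the secret is missing or not accessible.
        pass
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(
            f"No API key found: set the {name} secret or environment variable."
        )
    return key
```

You could then write `GOOGLE_API_KEY = load_api_key()` in place of the `userdata.get` call.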

### Initialize SDK client

With the new SDK, you only need to initialize a client with your API key (or OAuth if using [Vertex AI](https://cloud.google.com/vertex-ai)). The model is now set in each call.

In [4]:
from google import genai
from google.genai import types

client = genai.Client(api_key=GOOGLE_API_KEY)

### Select the Gemini model

Video understanding works best with Gemini 2.5 models. You can also select earlier models to compare their behavior, but it is recommended to use at least the 2.0 ones.

For more information about all Gemini models, check the [documentation](https://ai.google.dev/gemini-api/docs/models/gemini).


In [5]:
MODEL_ID = "gemini-2.5-flash-preview-05-20" # @param ["gemini-2.5-flash-preview-05-20", "gemini-2.5-pro-preview-05-06","gemini-2.0-flash","gemini-2.0-flash-lite"] {"allow-input":true, isTemplate: true}

### Get sample videos

You will start with uploaded videos, as it's a more common use case, but you will see later that you can also use YouTube videos.

In [6]:
# Download the sample videos
!wget https://storage.googleapis.com/generativeai-downloads/videos/Pottery.mp4 -O Pottery.mp4 -q
!wget https://storage.googleapis.com/generativeai-downloads/videos/Jukin_Trailcam_Videounderstanding.mp4 -O Trailcam.mp4 -q
!wget https://storage.googleapis.com/generativeai-downloads/videos/post_its.mp4 -O Post_its.mp4 -q
!wget https://storage.googleapis.com/generativeai-downloads/videos/user_study.mp4 -O User_study.mp4 -q

### Upload the videos

Upload all the videos using the File API. You can find more details about how to use it in the [Get Started](../quickstarts/Get_started.ipynb#scrollTo=KdUjkIQP-G_i) notebook.

This can take a couple of minutes as the videos will need to be processed and tokenized.

In [7]:
import time

def upload_video(video_file_name):
  video_file = client.files.upload(file=video_file_name)

  # Poll until the File API has finished processing the video.
  while video_file.state == "PROCESSING":
    print('Waiting for video to be processed.')
    time.sleep(10)
    video_file = client.files.get(name=video_file.name)

  if video_file.state == "FAILED":
    raise ValueError(video_file.state)
  print(f'Video processing complete: {video_file.uri}')

  return video_file

pottery_video = upload_video('Pottery.mp4')
trailcam_video = upload_video('Trailcam.mp4')
post_its_video = upload_video('Post_its.mp4')
user_study_video = upload_video('User_study.mp4')

Waiting for video to be processed.
Video processing complete: https://generativelanguage.googleapis.com/v1beta/files/kf6plbysnobp
Waiting for video to be processed.
Waiting for video to be processed.
Video processing complete: https://generativelanguage.googleapis.com/v1beta/files/fy5ajdxy7ayy
Waiting for video to be processed.
Video processing complete: https://generativelanguage.googleapis.com/v1beta/files/iate7gdvvcw1
Waiting for video to be processed.
Video processing complete: https://generativelanguage.googleapis.com/v1beta/files/phl4zyzf11gr
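
The polling loop in `upload_video` waits indefinitely if a file never leaves the processing state. Here is a hedged sketch of the same pattern with a timeout, written against a generic `fetch_state` callable so that it does not depend on the SDK. The name `wait_for_state` and all of its parameters are illustrative, not part of `google-genai`.

```python
import time


def wait_for_state(fetch_state, pending="PROCESSING", poll_seconds=10,
                   timeout_seconds=600, sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_state() until it stops returning `pending` or time runs out."""
    deadline = clock() + timeout_seconds
    state = fetch_state()
    while state == pending:
        if clock() >= deadline:
            raise TimeoutError(f"Still {pending} after {timeout_seconds} seconds")
        sleep(poll_seconds)
        state = fetch_state()
    return state
```

With the client defined earlier, `fetch_state` could be something like `lambda: client.files.get(name=video_file.name).state`. The `sleep` and `clock` parameters exist only to make the helper easy to test without real waiting.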


### Imports

In [8]:
import json
from PIL import Image
from IPython.display import display, Markdown, HTML

# Search within videos

First, try using the model to search within your videos and describe all the animal sightings in the trailcam video.

<video controls width="500"><source src="https://storage.googleapis.com/generativeai-downloads/videos/Jukin_Trailcam_Videounderstanding.mp4" type="video/mp4"></video>

In [9]:
prompt = "For each scene in this video, generate captions that describe the scene along with any spoken text placed in quotation marks. Place each caption into an object with the timecode of the caption in the video."  # @param ["For each scene in this video, generate captions that describe the scene along with any spoken text placed in quotation marks. Place each caption into an object with the timecode of the caption in the video.", "Organize all scenes from this video in a table, along with timecode, a short description, a list of objects visible in the scene (with representative emojis) and an estimation of the level of excitement on a scale of 1 to 10"] {"allow-input":true}

video = trailcam_video # @param ["trailcam_video", "pottery_video", "post_its_video", "user_study_video"] {"type":"raw","allow-input":true}

response = client.models.generate_content(
    model=MODEL_ID,
    contents=[
        video,
        prompt,
    ]
)

Markdown(response.text)

- **00:00**- The camera lens is covered by brown fur. A "chirp" sound is heard.
- **00:01**- The camera view clears, showing two gray foxes interacting in a rocky, leaf-covered area. One fox walks in from the right, tail first, and sniffs the ground. Another fox enters and joins it, both foraging. "Rustling sounds" "Clicks"
- **00:11**- One fox jumps onto a large rock, looking around, while the other continues to forage below. "Rustling sounds"
- **00:17**- A mountain lion appears in infrared (black and white footage). It walks slowly, sniffing the ground intently, then pauses to look up, before continuing to walk out of frame. "Clicks" "Rustling sounds"
- **00:34**- Two gray foxes are seen at night in infrared. One appears to be digging or scratching at the ground, while the other approaches and seems to lunge playfully at the first one. "Clicks" "Rustling sounds"
- **00:50**- Two gray foxes are visible at night in infrared. One climbs onto a large rock, then jumps off and continues to move around. "Clicks" "Rustling sounds"
- **01:04**- A mountain lion is seen walking away from the camera at night in infrared. "Clicks"
- **01:17**- Two mountain lions appear at night in infrared. The first one walks across the foreground and out of sight. The second one then appears behind it, walking across some rocks and disappearing from view. "Clicks"
- **01:29**- A bobcat is seen at night in infrared. It walks into the frame, pauses to sniff the ground, then walks out of frame to the right. "Clicks"
- **01:51**- A brown-colored black bear walks from right to left across the frame in daylight. "Clicks" "Rustling sounds"
- **01:56**- A mountain lion walks from right to left across the frame in infrared (black and white footage). "Clicks"
- **02:05**- A mother black bear (brown phase) and her cub appear in daylight. The cub walks into the frame first, followed by the mother. They both forage on the ground before walking away from the camera. "Clicks" "Rustling sounds"
- **02:22**- A gray fox is seen at night in infrared, on a ridge overlooking a city illuminated by lights. It sniffs the ground. "Clicks"
- **02:34**- A black bear (brown phase) walks from right to left across the frame at night in infrared, with city lights visible in the background. "Clicks"
- **02:42**- A mountain lion walks from left to right across the frame at night in infrared, with city lights visible in the background. "Clicks"
- **02:51**- A mountain lion is seen at night in infrared. It approaches the camera from behind, then turns around and walks away from the camera. "Clicks"
- **03:04**- A black bear (dark phase) stands in the frame in daylight. It sniffs the air, looks around, then walks off to the left. "Clicks" "Rustling sounds"
- **03:22**- A black bear (brown phase) is seen sniffing the ground in daylight. "Clicks"
- **03:32**- A mother black bear (brown phase) and her cub walk across the frame in daylight. The cub is seen from behind. "Clicks"
- **03:40**- Two black bears (brown phase) are seen foraging on the ground in daylight. "Clicks" "Rustling sounds"
- **04:03**- Two black bears (brown phase) walk towards the camera in daylight, then pass by it. "Clicks"
- **04:22**- A bobcat walks from left to right across the frame at night in infrared. "Clicks"
- **04:30**- A gray fox walks towards the camera at night in infrared. "Clicks"
- **04:49**- A gray fox walks away from the camera at night in infrared. "Clicks"
- **04:57**- A mountain lion approaches the camera, sniffs the ground, and then walks away from the camera at night in infrared. "Clicks" "Rustling sounds"

The prompt used here is quite generic; you can get even better results if you customize it to your needs (for example, by asking specifically for foxes).

The [live demo on AI Studio](https://aistudio.google.com/starter-apps/video) shows how you can postprocess this output to jump directly to the specific part of the video by clicking on the timecodes. If you are interested, you can check the [code of that demo on GitHub](https://github.com/google-gemini/starter-applets/tree/main/video).
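
As a rough illustration of that kind of postprocessing, the sketch below turns caption lines shaped like the output above (`- **MM:SS**- text`) into second offsets. The regular expression and the `parse_captions` helper are assumptions about this particular output format, which the model is free to vary between runs; this is not the code used by the demo.

```python
import re

# Matches lines such as "- **01:29**- A bobcat is seen at night."
CAPTION_RE = re.compile(
    r"^\s*[-*]\s*\*\*(?:(\d+):)?(\d{1,2}):(\d{2})\*\*\s*-?\s*(.*)$"
)


def parse_captions(markdown_text):
    """Return a list of (seconds, caption) tuples, one per timecoded bullet."""
    captions = []
    for line in markdown_text.splitlines():
        match = CAPTION_RE.match(line)
        if not match:
            continue  # Skip anything that is not a timecoded bullet.
        hours, minutes, seconds, text = match.groups()
        total = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
        captions.append((total, text.strip()))
    return captions


sample = """- **00:17**- A mountain lion appears in infrared.
- **01:29**- A bobcat is seen at night in infrared."""
print(parse_captions(sample))
```

Each second offset can then drive a video player's seek position, or be reused as a clipping offset in a follow-up request.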

# Extract and organize text

Gemini models can also read what's in the video and extract it in an organized way. You can even use Gemini's reasoning capabilities to generate new ideas for you.

<video controls width="400"><source src="https://storage.googleapis.com/generativeai-downloads/videos/post_its.mp4" type="video/mp4"></video>

In [10]:
prompt = "Transcribe the sticky notes, organize them and put it in a table. Can you come up with a few more ideas?" # @param ["Transcribe the sticky notes, organize them and put it in a table. Can you come up with a few more ideas?", "Which of those names would fit an AI product that can resolve complex questions using its thinking abilities?"] {"allow-input":true}

video = post_its_video # @param ["trailcam_video", "pottery_video", "post_its_video", "user_study_video"] {"type":"raw","allow-input":true}

response = client.models.generate_content(
    model=MODEL_ID,
    contents=[
        video,
        prompt,
    ]
)

Markdown(response.text)

Here are the project name ideas transcribed from the sticky notes, organized in a table, followed by a few more ideas:

## Brainstorm: Project Name Ideas

| Project Name Ideas   |
| :------------------- |
| Aether               |
| Andromeda's Reach    |
| Astral Forge         |
| Athena               |
| Athena's Eye         |
| Bayes Theorem        |
| Canis Major          |
| Celestial Drift      |
| Centaurus            |
| Cerberus             |
| Chaos Field          |
| Chaos Theory         |
| Chimera Dream        |
| Comets Tail          |
| Convergence          |
| Delphinus            |
| Draco                |
| Echo                 |
| Equilibrium          |
| Euler's Path         |
| Fractal              |
| Galactic Core        |
| Golden Ratio         |
| Hera                 |
| Infinity Loop        |
| Leo Minor            |
| Lunar Eclipse        |
| Lyra                 |
| Lynx                 |
| Medusa               |
| Odin                 |
| Orion's Belt         |
| Orion's Sword        |
| Pandora's Box        |
| Perseus Shield       |
| Phoenix              |
| Prometheus Rising    |
| Riemann's Hypothesis |
| Sagitta              |
| Serpens              |
| Stellar Nexus        |
| Stokes Theorem       |
| Supernova Echo       |
| Symmetry             |
| Taylor Series        |
| Titan                |
| Vector               |
| Zephyr               |

---

## A Few More Project Name Ideas:

1.  **Quantum Leap**
2.  **Cosmic Weave**
3.  **Ares Vanguard**
4.  **Vortex Protocol**
5.  **Event Horizon**

# Structure information

Gemini 2.0 is not only able to read text but also to reason about real-world objects and structure what it sees, as in this video of a display of ceramics with handwritten prices and notes.

<video controls width="500"><source src="https://storage.googleapis.com/generativeai-downloads/videos/Pottery.mp4" type="video/mp4"></video>

In [11]:
prompt = "Give me a table of my items and notes" # @param ["Give me a table of my items and notes", "Help me come up with a selling pitch for my potteries"] {"allow-input":true}

video = pottery_video # @param ["trailcam_video", "pottery_video", "post_its_video", "user_study_video"] {"type":"raw","allow-input":true}

response = client.models.generate_content(
    model=MODEL_ID,
    contents=[
        video,
        prompt,
    ],
    config = types.GenerateContentConfig(
        system_instruction="Don't forget to escape the dollar signs",
    )
)

Markdown(response.text)

Here's a table summarizing the items and their details from the image:

| Item Type       | Description                                                                                                                                                                                                                             | Quantity | Dimensions                       | Price         | Notes/Glaze Information                                                      |
| :-------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------- | :------------------------------- | :------------ | :--------------------------------------------------------------------------- |
| **Tumblers**    | Ceramic tumblers with a speckled light brown/beige base and a distinct horizontal line where a second, lighter bluish/white glaze has been applied over the top half. The bottom portion appears unglazed.                                     | 7        | ~4"h x 3"d                       | \$20 each     | Glaze: #5 Artichoke double dip                                               |
| **Small Bowls** | Round ceramic bowls with a reddish-brown, speckled clay body and a dark, glossy glaze that shows hints of metallic or greenish tones, particularly around the rim.                                                                     | 2        | 3.5"h x 6.5"d                    | \$35 each     |                                                                              |
| **Medium Bowls**| Larger and deeper round ceramic bowls, similar in clay body and glaze to the small bowls but with more prominent greenish/bluish variations in the dark, glossy glaze, especially on the inner rim.                                     | 3        | 4"h x 7"d                        | \$40 each     |                                                                              |
| **Glaze Test Tile** | Small, irregular ceramic pieces; one is reddish-brown, the other displays a mottled green/blue/gray glaze over reddish-brown. This likely represents the "#5 Artichoke double dip" glaze used on the tumblers. (Not for sale as an item) | 2        | N/A                              | N/A           | Glaze sample: #5 Artichoke double dip                                        |
| **Glaze Test Tile** | Small, rectangular ceramic piece with a textured surface, showing a dark brown/black base with lighter brown specks and a distinct area of bluish/white glaze on one side. Marked "6rb." (Not for sale as an item)                       | 1        | N/A                              | N/A           | Glaze sample: #6 Gemini double dip SLOW COOL. Marked "6rb" (possibly a batch/kiln number). |

As you can see, Gemini is able to grasp which item each note corresponds to, including the last one.
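
If you need the table as data rather than as rendered Markdown, a small parser is often enough. The sketch below assumes the response contains a pipe-delimited table like the one above, with no `|` characters inside cells. `parse_markdown_table` is a hypothetical helper, not an SDK function.

```python
def parse_markdown_table(markdown_text):
    """Parse the first pipe-delimited Markdown table into a list of dicts."""
    rows = []
    for line in markdown_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            if rows:
                break  # The table has ended.
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        rows.append(cells)
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |:---| separator line.
    # Undo the "\$" escaping that the system instruction asked for.
    return [
        {key: value.replace("\\$", "$") for key, value in zip(header, row)}
        for row in body
    ]
```

Calling `parse_markdown_table(response.text)` would then give one dictionary per item, keyed by the column headers the model chose.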

# Analyze screen recordings for key moments

You can also use the model to analyze screen recordings. Let's say you're doing user studies on how people use your product, so you end up with lots of screen recordings, like this one, that you have to manually comb through.
With just one prompt, the model can describe all the actions in your video.

<video controls width="400"><source src="https://storage.googleapis.com/generativeai-downloads/videos/user_study.mp4" type="video/mp4"></video>

In [12]:
prompt = "Generate a paragraph that summarizes this video. Keep it to 3 to 5 sentences with corresponding timecodes." # @param ["Generate a paragraph that summarizes this video. Keep it to 3 to 5 sentences with corresponding timecodes.", "Choose 5 key shots from this video and put them in a table with the timecode, text description of 10 words or less, and a list of objects visible in the scene (with representative emojis).", "Generate bullet points for the video. Place each bullet point into an object with the timecode of the bullet point in the video."] {"allow-input":true}

video = user_study_video # @param ["trailcam_video", "pottery_video", "post_its_video", "user_study_video"] {"type":"raw","allow-input":true}

response = client.models.generate_content(
    model=MODEL_ID,
    contents=[
        video,
        prompt,
    ]
)

Markdown(response.text)

This video demonstrates the functionalities of a "My Garden App" for purchasing plants. It begins by showcasing a list of various plants with their descriptions and prices (0:00-0:09). Users can interact with each plant by clicking a "Like" button, which turns red when activated, or an "Add to Cart" button, which confirms the item addition (0:09-0:25).

The video illustrates adding a Fern, Cactus, and Hibiscus to the cart. Navigating to the "Cart" tab displays the selected items along with their individual prices and a total cost of $58.97 (0:30-0:33). The "Profile" tab then summarizes user activity, showing the number of liked plants and cart items (0:33-0:35). The user then returns to the home screen and continues browsing and adding more items (0:37-0:45).

# Analyze YouTube videos

On top of using your own videos, you can also ask Gemini to fetch a video from YouTube and analyze it. Here's an example using the keynote from Google I/O 2023. Guess what the main theme was?


In [13]:
response = client.models.generate_content(
    model=MODEL_ID,
    contents=types.Content(
        parts=[
            types.Part(text="Find all the instances where Sundar says \"AI\". Provide timestamps and broader context for each instance."),
            types.Part(
                file_data=types.FileData(file_uri='https://www.youtube.com/watch?v=ixRanV-rdAQ')
            )
        ]
    )
)

Markdown(response.text)

Here are all the instances where Sundar says "AI" in the video, along with timestamps and broader context:

1.  **0:28 - 0:31:** "As you may have heard, **AI** is having a very busy year."
    *   **Context:** Sundar is opening the Google I/O keynote and immediately highlights the significant activity and progress in the field of Artificial Intelligence.

2.  **0:38 - 0:41:** "Seven years into our journey as an **AI**-first company."
    *   **Context:** He references Google's foundational shift seven years prior to prioritize AI, emphasizing that AI is now at an "exciting inflection point."

3.  **0:45 - 0:48:** "We have an opportunity to make **AI** even more helpful for people, for businesses, for communities, for everyone."
    *   **Context:** Sundar articulates the core mission of Google's AI development: to enhance helpfulness on a broad scale.

4.  **0:54 - 0:57:** "We've been applying **AI** to make our products radically more helpful for a while."
    *   **Context:** He establishes that Google has a history of integrating AI into its products to improve their utility.

5.  **0:59 - 1:02:** "With generative **AI**, we are taking the next step."
    *   **Context:** Sundar introduces generative AI as the next frontier for product innovation at Google.

6.  **1:16 - 1:19:** "Let me start with few examples of how generative **AI** is helping to evolve our products, starting with Gmail."
    *   **Context:** He transitions into specific demonstrations of generative AI's application in various Google products.

7.  **1:40 - 1:42:** "Smart Compose led to more advanced writing features powered by **AI**."
    *   **Context:** Discussing how existing features like Smart Compose in Gmail have evolved through AI, leading to new capabilities like "Help me write."

8.  **3:02 - 3:05:** "Since the early days of Street View, **AI** has stitched together billions of panoramic images so people can explore the world from their device."
    *   **Context:** Explaining AI's long-standing role in creating and enhancing Google Maps' Street View experience.

9.  **3:13 - 3:17:** "At last year's I/O, we introduced immersive view, which uses **AI** to create a high-fidelity representation of a place, so you can experience it before you visit."
    *   **Context:** Highlighting a specific AI-powered feature in Google Maps that provides detailed virtual exploration.

10. **5:15 - 5:17:** "It was one of our first **AI**-native products."
    *   **Context:** Referring to Google Photos as an early and successful example of a product built from the ground up with AI.

11. **5:40 - 5:42:** "**AI** advancements give us more powerful ways to do this."
    *   **Context:** Speaking about how advancements in AI are enabling more sophisticated photo editing capabilities.

12. **5:47 - 5:51:** "Magic Eraser, launched first on Pixel, uses **AI**-powered computational photography to remove unwanted distractions."
    *   **Context:** Providing a concrete example of an existing AI feature in Google Photos that helps users edit images.

13. **5:57 - 6:00:** "And later this year, using a combination of semantic understanding and generative **AI**, you can do much more with a new experience called Magic Editor."
    *   **Context:** Introducing the next-generation photo editing tool, Magic Editor, built on advanced AI.

14. **7:39 - 7:42:** "These are just a few examples of how **AI** can help you in moments that matter."
    *   **Context:** Summarizing the product demonstrations and reiterating the helpfulness of AI.

15. **7:47 - 7:49:** "And there is so much more we can do to deliver the full potential of **AI** across the products you know and love."
    *   **Context:** Expressing continued optimism and ambition for integrating AI more deeply across Google's entire product suite.

16. **8:23 - 8:26:** "And looking ahead, making **AI** helpful for everyone is the most profound way we will advance our mission."
    *   **Context:** Reiterating the central theme of the keynote: making AI accessible and beneficial to all.

17. **8:52 - 8:57:** "And finally, by building and deploying **AI** responsibly, so that everyone can benefit equally."
    *   **Context:** Emphasizing the critical importance of ethical and responsible development and deployment of AI.

18. **9:03 - 9:09:** "Our ability to make **AI** helpful for everyone relies on continuously advancing our foundation models."
    *   **Context:** Connecting the vision of helpful AI to the underlying technological progress in foundation models.

19. **11:25 - 11:33:** "We recently released Sec-PaLM, a version of PaLM 2 fine-tuned for security use cases. It uses **AI** to better detect malicious scripts and can help security experts understand and resolve threats."
    *   **Context:** Illustrating how specialized AI models like Sec-PaLM are being applied to enhance cybersecurity.

20. **12:44 - 12:49:** "PaLM 2 is the latest step in our decade-long journey to bring **AI** in responsible ways to billions of people."
    *   **Context:** Positioning PaLM 2 as a continuation of Google's long-term commitment to making AI widely available and beneficial.

21. **12:58 - 13:03:** "Looking back at the defining **AI** breakthroughs over the last decade, these teams have contributed to a significant number of them."
    *   **Context:** Reflecting on Google's historical contributions to significant AI advancements.

22. **14:09 - 14:11:** "As we invest in more advanced models, we are also deeply investing in **AI** responsibility."
    *   **Context:** Stressing that technological progress in AI is coupled with a strong focus on ethical considerations.

23. **14:18 - 14:19:** "Every one of our **AI**-generated images has that metadata."
    *   **Context:** Explaining the mechanisms (watermarking and metadata) Google is implementing to ensure transparency and accountability for AI-generated content.

24. **15:09 - 15:12:** "James will talk about our responsible approach to **AI** later."
    *   **Context:** Transitioning to another speaker who will delve deeper into Google's ethical framework for AI.

25. **15:28 - 15:31:** "That's the opportunity we have with Bard, our experiment for conversational **AI**."
    *   **Context:** Concluding the segment on products with an introduction to Bard as a public-facing conversational AI tool.

# Customizing video preprocessing

The Gemini API lets you define some preprocessing steps to improve how well the model understands and extracts information from videos.

You can use clipping intervals (time offsets that focus on specific parts of the video) and a custom FPS (to define how many frames per second are sampled for analysis).

For more details about those features, see the [Customizing video preprocessing](https://ai.google.dev/gemini-api/docs/video-understanding#customize-video-preprocessing) section of the Gemini API documentation.

## Analyze specific parts of videos using clipping intervals

Sometimes you only care about a specific part of a video. You can define time offsets in your request to tell the model which interval of the video to focus on.

**Note:** The time offsets in `video_metadata` must be expressed in seconds.

In this example, you are using the [Google I/O 2024 keynote](https://www.youtube.com/watch?v=XEzRZ35urlk) and asking the model to consider only the interval between 20min50s and 26min10s.

In [16]:
response = client.models.generate_content(
    model=MODEL_ID,
    contents=types.Content(
        parts=[
            types.Part(
                file_data=types.FileData(file_uri='https://www.youtube.com/watch?v=XEzRZ35urlk'),
                video_metadata=types.VideoMetadata(
                    start_offset='1250s',
                    end_offset='1570s'
                )
            ),
            types.Part(text='Please summarize the video in 3 sentences.')
        ]
    )
)

Markdown(response.text)

Demis Hassabis of Google DeepMind outlined the company's long-term goal of building Artificial General Intelligence (AGI) responsibly to benefit humanity, showcasing recent advancements like AlphaFold 3 for molecular modeling. He then announced Gemini 1.5 Flash, a new lightweight, fast, and cost-efficient multimodal model designed for high-speed and efficient applications, available now with a 1 million token context window. Finally, Hassabis introduced Project Astra, Google DeepMind's vision for a universal AI agent that can proactively understand and respond to the complex real world in a conversational, teachable, and personal manner, aiming for seamless everyday interaction.
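
The offsets in the request above are second counts with an `s` suffix. A small helper, sketched here under the assumption that you start from `MM:SS` or `HH:MM:SS` timecodes (the name `to_offset` is illustrative), avoids doing that arithmetic by hand:

```python
def to_offset(timecode):
    """Convert "MM:SS" or "HH:MM:SS" into an offset string such as "1250s"."""
    parts = [int(part) for part in timecode.split(":")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"Unsupported timecode: {timecode!r}")
    seconds = 0
    for part in parts:
        seconds = seconds * 60 + part  # Shift hours/minutes up one unit per step.
    return f"{seconds}s"


print(to_offset("20:50"), to_offset("26:10"))  # 1250s 1570s
```

The two printed values match the `start_offset` and `end_offset` used in the request.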

You can also use clipping intervals with videos uploaded to the File API, as well as with videos sent inline in your prompt (keeping in mind that inline data cannot exceed 20MB in size).

In [19]:
prompt = "Summarize this video in few short bullets"  # @param ["For each scene in this video, generate captions that describe the scene along with any spoken text placed in quotation marks. Place each caption into an object with the timecode of the caption in the video.", "Organize all scenes from this video in a table, along with timecode, a short description, a list of objects visible in the scene (with representative emojis) and an estimation of the level of excitement on a scale of 1 to 10"] {"allow-input":true}

video = trailcam_video # @param ["trailcam_video", "pottery_video", "post_its_video", "user_study_video"] {"type":"raw","allow-input":true}

response = client.models.generate_content(
    model=MODEL_ID,
    contents=types.Content(
        parts=[
            types.Part(
                file_data=types.FileData(
                    file_uri=video.uri,
                    mime_type=video.mime_type),
                video_metadata=types.VideoMetadata(
                    start_offset='60s',
                    end_offset='120s'
                )
            ),
            types.Part(text=prompt)
        ]
    )
)

Markdown(response.text)

Here's a summary of the video:

*   Multiple mountain lions (cougars) are captured by a trail camera at night, walking through the forest.
*   In some clips, two mountain lions appear together.
*   A bobcat is seen sniffing the ground and briefly lying down at night.
*   A black bear walks across the frame during the daytime.
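
For the inline option mentioned earlier in this section, it helps to check the file size before building the request. This is a minimal sketch that reads the 20MB limit as 20 * 1024 * 1024 bytes and keeps some headroom for the rest of the request. Both choices are assumptions, and `fits_inline` is a hypothetical helper.

```python
import os

INLINE_LIMIT_BYTES = 20 * 1024 * 1024  # Assumed reading of the 20MB limit.


def fits_inline(path, headroom_bytes=1024 * 1024):
    """Return True if the file is small enough to be sent as inline data."""
    return os.path.getsize(path) + headroom_bytes <= INLINE_LIMIT_BYTES
```

Files that do not fit can go through `upload_video` and the File API as before.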

## Customize the number of video frames per second (FPS) analyzed

By default, the Gemini API samples 1 frame per second (FPS) to analyze your videos. That can be more than needed for videos with little activity, such as a lecture, and too little for fast-changing visuals, where a higher FPS preserves more detail.

In this example, you focus on a specific interval of a NASCAR pit stop and sample it at a higher rate (24 FPS).

In [24]:
response = client.models.generate_content(
    model=MODEL_ID,
    contents=types.Content(
        parts=[
            types.Part(
                file_data=types.FileData(file_uri='https://www.youtube.com/watch?v=McN0-DpyHzE'),
                video_metadata=types.VideoMetadata(
                    start_offset='15s',
                    end_offset='35s',
                    fps=24
                )
            ),
            types.Part(text='How many tires were changed? Front tires or rear tires?')
        ]
    )
)

Markdown(response.text)

Based on the video, the pit crew changed **all four tires** on the car:

*   The **left front** and **left rear** tires are changed first (from 00:17 to 00:22).
*   Then, the **right front** and **right rear** tires are changed (from 00:23 to 00:29).

They also refuel the car during this pit stop.
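
A quick way to reason about the cost of a higher FPS is to count the frames sampled. The sketch below assumes frames are sampled uniformly over the clipped interval; `sampled_frames` is an illustrative helper, not an SDK function.

```python
def sampled_frames(start_seconds, end_seconds, fps=1.0):
    """Approximate how many frames are sampled from a clipped interval."""
    if end_seconds < start_seconds or fps <= 0:
        raise ValueError("Expected end >= start and fps > 0")
    return round((end_seconds - start_seconds) * fps)


print(sampled_frames(15, 35))          # 20 frames at the default 1 FPS
print(sampled_frames(15, 35, fps=24))  # 480 frames for the pit-stop request
```

Under that assumption, the 24 FPS request has the model look at roughly 24 times as many frames as the default, so it is worth combining a high FPS with a short clipping interval.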

Once again, the [live demo on AI Studio](https://aistudio.google.com/starter-apps/video) shows an example of how to postprocess this output. Check the [code of that demo](https://github.com/google-gemini/starter-applets/tree/main/video) for more details.

# Next Steps

Try it with your own videos using [AI Studio's live demo](https://aistudio.google.com/starter-apps/video) or play with the examples from this notebook (in case you missed them, there are other prompts you can try in the dropdowns).

For more examples of Gemini's capabilities, check the other guides in the [Cookbook](https://github.com/google-gemini/cookbook/). You'll learn how to use the [Live API](../quickstarts/Get_started_LiveAPI.ipynb), juggle [multiple tools](../quickstarts/Get_started_LiveAPI_tools.ipynb) or use Gemini 2.0 [spatial understanding](../quickstarts/Spatial_understanding.ipynb) abilities.

The [examples](https://github.com/google-gemini/cookbook/tree/main/examples/) folder of the Cookbook is also full of code samples illustrating creative ways to use Gemini's multimodal capabilities and long context.