##### Copyright 2024 Google LLC.

In [1]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Video understanding with Gemini 2.0

<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/google-gemini/cookbook/blob/main/quickstarts/Video_understanding.ipynb"><img src="https://github.com/google-gemini/cookbook/blob/main/images/colab_logo_32px.png?raw=1" />Run in Google Colab</a>
  </td>
</table>

Gemini has been a multimodal model from the beginning, capable of analyzing many kinds of media using its [long context window](https://developers.googleblog.com/en/new-features-for-the-gemini-api-and-google-ai-studio/).

[Gemini 2.0](https://ai.google.dev/gemini-api/docs/models/gemini-v2) brings video analysis to a whole new level, as illustrated in [this video](https://www.youtube.com/watch?v=Mot-JEU26GQ):


In [2]:
#@title Building with Gemini 2.0: Video understanding
%%html
<iframe width="560" height="315" src="https://www.youtube.com/embed/Mot-JEU26GQ?si=pcb7-_MZTSi_1Zkw" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

This notebook shows you how to use Gemini to perform the same kinds of video analysis. Each example comes with several prompts that you can select from the dropdown, and you can also experiment with your own.

You can also check the [live demo](https://aistudio.google.com/starter-apps/video) and try it on your own videos on [AI Studio](https://aistudio.google.com/starter-apps/video).

## Setup

This section installs the SDK, sets it up using your [API key](../quickstarts/Authentication.ipynb), imports the relevant libraries, downloads the sample videos, and uploads them to Gemini.

Expand the section if you are curious, but you can also just run it (it should take a couple of minutes since there are large files) and go straight to the examples.

### Install SDK

The new **[Google Gen AI SDK](https://ai.google.dev/gemini-api/docs/sdks)** provides programmatic access to Gemini 2.0 (and previous models) using both the [Google AI for Developers](https://ai.google.dev/gemini-api/docs) and [Vertex AI](https://cloud.google.com/vertex-ai/generative-ai/docs/overview) APIs. With a few exceptions, code that runs on one platform will run on both. This means that you can prototype an application using the Developer API and then migrate the application to Vertex AI without rewriting your code.

More details about this new SDK are available in the [documentation](https://ai.google.dev/gemini-api/docs/sdks) or in the [Getting started](../gemini-2/get_started.ipynb) notebook.

In [3]:
%pip install -U -q 'google-genai'


### Setup your API key

To run the following cell, your API key must be stored in a Colab Secret named `GOOGLE_API_KEY`. If you don't already have an API key, or you're not sure how to create a Colab Secret, see [Authentication](../quickstarts/Authentication.ipynb) for an example.

In [4]:
from google.colab import userdata

GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')

### Initialize SDK client

With the new SDK, you only need to initialize a client with your API key (or OAuth if using [Vertex AI](https://cloud.google.com/vertex-ai)). The model is now set in each call.

In [5]:
from google import genai
from google.genai import types

client = genai.Client(api_key=GOOGLE_API_KEY)

### Select the Gemini 2.0 model

Video understanding works best with the Gemini 2.0 Flash model. You can also select earlier models to compare their behavior, but the 2.0 one is recommended.

For more information about all Gemini models, check the [documentation](https://ai.google.dev/gemini-api/docs/models/gemini).


In [6]:
model_name = "gemini-2.0-flash" # @param ["gemini-1.5-flash-latest","gemini-2.0-flash-lite-preview-02-05","gemini-2.0-flash","gemini-2.0-pro-exp-02-05"] {"allow-input":true, isTemplate: true}

### System instructions

With the new SDK, the system instructions (the `system_instruction` field of the config) and the `model` parameter must be passed in every `generate_content` call, so save them in variables to avoid retyping them each time.

In [7]:
system_instructions = """
    When given a video and a query, call the relevant function only once with the appropriate timecodes and text for the video
  """


These system instructions ask the model to call the relevant function only once, with the timecodes and text that match the video.
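
As a minimal sketch of how the saved values could be reused, the helper below bundles them into keyword arguments for `generate_content`. The name `build_request` is hypothetical, the placeholder strings stand in for an uploaded file and a real prompt, and passing `config` as a plain dict instead of `types.GenerateContentConfig` is an assumption about the SDK.

```python
# Hypothetical helper: bundles the saved model name and system instructions
# so they do not have to be retyped in every call.
def build_request(video, prompt, model_name="gemini-2.0-flash",
                  system_instructions=None):
    """Return keyword arguments for client.models.generate_content."""
    request = {"model": model_name, "contents": [video, prompt]}
    if system_instructions:
        # Assumption: the SDK accepts a plain dict for `config`, equivalent to
        # types.GenerateContentConfig(system_instruction=...).
        request["config"] = {"system_instruction": system_instructions.strip()}
    return request

# Placeholder values; in the notebook these would be an uploaded file and a prompt.
request = build_request("<uploaded video>", "Describe this video.",
                        system_instructions="  Answer briefly.  ")
print(request["config"])

# Usage in the notebook (needs the client created earlier):
# response = client.models.generate_content(**request)
```

Keeping the request as a plain dictionary makes it easy to inspect before sending it, at the cost of losing the type checking that `types.GenerateContentConfig` provides.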

### Get sample videos

In [8]:
# Download the sample videos
!wget https://storage.googleapis.com/generativeai-downloads/videos/Pottery.mp4 -O Pottery.mp4 -q
!wget https://storage.googleapis.com/generativeai-downloads/videos/Jukin_Trailcam_Videounderstanding.mp4 -O Trailcam.mp4 -q
!wget https://storage.googleapis.com/generativeai-downloads/videos/post_its.mp4 -O Post_its.mp4 -q
!wget https://storage.googleapis.com/generativeai-downloads/videos/user_study.mp4 -O User_study.mp4 -q

### Upload the videos

Upload all the videos using the File API. You can find more details about how to use it in the [Get Started](../gemini-2/get_started.ipynb#scrollTo=KdUjkIQP-G_i) notebook.

This can take a couple of minutes as the videos will need to be processed and tokenized.

In [9]:
import time

def upload_video(video_file_name):
  # Upload the file, then poll the File API until processing is finished.
  video_file = client.files.upload(file=video_file_name)

  while video_file.state == "PROCESSING":
    print('Waiting for video to be processed.')
    time.sleep(10)
    video_file = client.files.get(name=video_file.name)

  if video_file.state == "FAILED":
    raise ValueError(video_file.state)
  print(f'Video processing complete: {video_file.uri}')

  return video_file

pottery_video = upload_video('Pottery.mp4')
trailcam_video = upload_video('Trailcam.mp4')
post_its_video = upload_video('Post_its.mp4')
user_study_video = upload_video('User_study.mp4')

Waiting for video to be processed.
Video processing complete: https://generativelanguage.googleapis.com/v1beta/files/xwsb6bn7ncfe
Waiting for video to be processed.
Waiting for video to be processed.
Video processing complete: https://generativelanguage.googleapis.com/v1beta/files/e3yx20tll08p
Waiting for video to be processed.
Video processing complete: https://generativelanguage.googleapis.com/v1beta/files/3927bjw2m52g
Waiting for video to be processed.
Video processing complete: https://generativelanguage.googleapis.com/v1beta/files/1pherzdnzfrr
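
The upload loop above waits for as long as a file stays in the `PROCESSING` state. If you would rather give up after a while, the sketch below shows one possible polling helper with a timeout. The name `wait_for_state`, its parameters, and the `"ACTIVE"` state used in the demonstration are assumptions for illustration; the state source is faked so the example runs without the API.

```python
import time

def wait_for_state(get_state, timeout_s=300, interval_s=10, sleep=time.sleep):
    """Poll get_state() until it stops returning "PROCESSING".

    get_state is any zero-argument callable returning the current state.
    Raises TimeoutError once timeout_s seconds of waiting have elapsed.
    """
    waited = 0
    state = get_state()
    while state == "PROCESSING":
        if waited >= timeout_s:
            raise TimeoutError(f"Still processing after {waited} seconds")
        sleep(interval_s)
        waited += interval_s
        state = get_state()
    return state

# Demonstration with a fake state source and no real sleeping.
states = iter(["PROCESSING", "PROCESSING", "ACTIVE"])
final_state = wait_for_state(lambda: next(states), sleep=lambda seconds: None)
print(final_state)
```

In the notebook, `get_state` could plausibly be `lambda: client.files.get(name=video_file.name).state`, mirroring the call used in `upload_video`.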


### Imports

In [10]:
import json
from PIL import Image
from IPython.display import display, Markdown, HTML

# Search within videos

First, try using the model to search within your videos and describe all the animal sightings in the trailcam video.

<video controls width="500"><source src="https://storage.googleapis.com/generativeai-downloads/videos/Jukin_Trailcam_Videounderstanding.mp4" type="video/mp4"></video>

In [11]:
prompt = "For each scene in this video, generate captions that describe the scene along with any spoken text placed in quotation marks. Place each caption into an object with the timecode of the caption in the video."  # @param ["For each scene in this video, generate captions that describe the scene along with any spoken text placed in quotation marks. Place each caption into an object with the timecode of the caption in the video.", "Organize all scenes from this video in a table, along with timecode, a short description, a list of objects visible in the scene (with representative emojis) and an estimation of the level of excitement on a scale of 1 to 10"] {"allow-input":true}

video = trailcam_video # @param ["trailcam_video", "pottery_video", "post_its_video", "user_study_video"] {"type":"raw","allow-input":true}

response = client.models.generate_content(
    model=model_name,
    contents=[
        video,
        prompt,
    ]
)

Markdown(response.text)

Okay, here are the video captions, described according to the scenes and including the spoken text, in a JSON format:

```json
[
  {
    "00:00": "A close-up of light brown fur, then a grey fox walks into the frame on a forest floor covered with dead leaves, and is soon joined by a second grey fox. They sniff and explore amongst rocks, trees and leaves."
  },
  {
    "00:16": "A black and white nighttime video of a mountain lion walking through a wooded area, sniffs around, then walks away."
  },
  {
    "00:35": "A night vision video of two grey foxes. One is on the ground exploring while the other rolls around on its back. One of the foxes starts to chase the other and they both run out of frame, leaving a small cloud of dust."
  },
  {
    "00:50": "The camera suddenly points up and a flash of light overwhelms the picture, then is redirected to the previous scene, but from another angle. Two grey foxes run up and around rocks, disappear, and the camera returns to the flash of light."
  },
  {
    "01:05": "The camera stabilizes and a mountain lion is in the frame, staring uphill. Another mountain lion comes into frame from behind a rock, walks down, and then walks right past the camera. The mountain lion that was in frame comes down, then the camera shuts off."
  },
  {
    "01:28": "A clear picture of the mountain lion and a similar scene then seen in a black and white recording, the mountain lion sniffing and sniffing around the forest floor."
  },
  {
    "01:51": "A medium shot of a brown bear walking to the camera, getting within a few feet of it, then turning around and walking away."
  },
  {
    "01:56": "A black and white nighttime recording of a mountain lion walking by."
  },
  {
    "02:05": "The camera lens is covered with brown fur, then pulls back to show two bears, again. One is near the camera, another is in the distance."
  },
  {
    "02:23": "A grey fox stands on a hill overlooking city lights, and sniffs around, and looks at the camera."
  },
  {
    "02:35": "A bear walks onto the same hill with the city lights in the distance, sniffs, and keeps walking down, getting close to the camera and overwhelming the picture."
  },
  {
    "02:52": "A mountain lion smells something near a tree."
  },
  {
    "03:05": "A brown bear approaches the camera, looking at it, grumbling, and getting within feet of the camera. It turns around, walks away, and stands up."
  },
  {
    "03:22": "A more blondish bear approaches the camera with the same brown bear, grumbling, and smelling around."
  },
  {
    "03:40": "Both bears are in front of the camera with closeups of fur and backs."
  },
  {
    "04:22": "A lynx sits still on the forest floor, looks around, climbs on a log, and continues to sniff."
  }
]
```
Let me know if you need anything else!

The prompt used here is quite generic, but you can get even better results if you customize it to your needs (for example, by asking specifically for foxes).

The [live demo on AI Studio](https://aistudio.google.com/starter-apps/video) shows how you can postprocess this output to jump directly to a specific part of the video by clicking on the timecodes. If you are interested, you can check the [code of that demo on GitHub](https://github.com/google-gemini/starter-applets/tree/main/video).
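
As a rough sketch of that kind of postprocessing (not the code of the demo itself), the snippet below pulls the fenced JSON out of a response like the one above and converts each timecode into seconds, which is the form a video player typically needs for seeking. The helper names and the shortened sample text are hypothetical, and the code assumes the model answers with a fenced `json` block of single-key objects, which is not guaranteed.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built here to keep this example readable

def extract_json(text):
    """Parse the content of the first json-fenced block found in text."""
    match = re.search(FENCE + r"json\s*(.*?)" + FENCE, text, re.DOTALL)
    if match is None:
        raise ValueError("No fenced JSON block found")
    return json.loads(match.group(1))

def timecode_to_seconds(timecode):
    """Convert "MM:SS" or "HH:MM:SS" into a number of seconds."""
    seconds = 0
    for part in timecode.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

# Shortened stand-in for response.text
sample = (
    "Okay, here are the video captions:\n"
    + FENCE + "json\n"
    + '[{"00:16": "A mountain lion walks through a wooded area."},\n'
    + ' {"01:51": "A brown bear walks to the camera."}]\n'
    + FENCE + "\nLet me know if you need anything else!"
)

captions = [
    (timecode_to_seconds(timecode), caption)
    for entry in extract_json(sample)
    for timecode, caption in entry.items()
]
print(captions)
```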

# Extract and organize text

Gemini can also read the text that appears in a video and extract it in an organized way. You can even use Gemini's reasoning capabilities to generate new ideas for you.

<video controls width="400"><source src="https://storage.googleapis.com/generativeai-downloads/videos/post_its.mp4" type="video/mp4"></video>

In [16]:
prompt = "Transcribe the sticky notes, organize them and put it in a table. Can you come up with a few more ideas?" # @param ["Transcribe the sticky notes, organize them and put it in a table. Can you come up with a few more ideas?", "Which of those names would fit an AI product that can resolve complex questions using its thinking abilities?"] {"allow-input":true}

video = post_its_video # @param ["trailcam_video", "pottery_video", "post_its_video", "user_study_video"] {"type":"raw","allow-input":true}

response = client.models.generate_content(
    model=model_name,
    contents=[
        video,
        prompt,
    ]
)

Markdown(response.text)

Okay, here's the transcription of the sticky notes, organized into a table, and a few additional ideas added at the end:

**Brainstormed Project Names**

| Category       | Name             |   
|----------------|------------------|
| **Astronomy/Cosmology** | Lunar Eclipse      |
|   | Canis Major      |
|   | Leo Minor        |
|   | Andromeda's Reach|
|   | Convergence      |
|   | Stellar Nexus    |
|   | Orion's Belt    |
|   | Lyra             |
|   | Sagitta         |
|   | Celestial Drift  |
|   | Supernova Echo   |
|   | Astral Forge     |
|   | Comets Tail      |
|   | Delphinus        |
|   | Serpens          |
|   | Centaurus        |
|   | Chaos Field     |
|   | Lynx             |
|   | Draco            |
|   | Perseus Shield   |
|   | Galactic Core    |
|   | Titan            |
|   | Zephyr           |
|   | Echo             |
|   | Vector           |
|   | Orion’s Sword    |
|   | Pandora's Box  |
|   | Fractal          |
| **Mathematics/Physics** | Bayes Theorem     |
|   | Riemann's Hypothesis|
|   | Taylor Series    |
|   | Stokes Theorem   |
|   | Symmetry         |
|   | Golden Ratio     |
|   | Infinity Loop    |
|   | Euler's Path     |
|   | Equilibrium      |
|   | Chaos Theory    |
| **Mythology/Ancient History** | Prometheus Rising  |
|   | Chimera Dream    |
|   | Athena's Eye    |
|   | Medusa           |
|   | Hera             |
|   | Athena           |
|   | Odin             |
|   | Aether           |
|   | Cerberus         |

**Additional Ideas:**

*   **Nebula's Veil**
*   **Singularity Point**
*   **Oracle's Gaze**
*   **Quantum Leap**

# Structure information

Gemini 2.0 can not only read text but also reason about real-world objects and structure what it sees, as in this video of a display of ceramics with handwritten prices and notes.

<video controls width="500"><source src="https://storage.googleapis.com/generativeai-downloads/videos/Pottery.mp4" type="video/mp4"></video>

In [17]:
prompt = "Give me a table of my items and notes" # @param ["Give me a table of my items and notes", "Help me come up with a selling pitch for my potteries"] {"allow-input":true}

video = pottery_video # @param ["trailcam_video", "pottery_video", "post_its_video", "user_study_video"] {"type":"raw","allow-input":true}

response = client.models.generate_content(
    model=model_name,
    contents=[
        video,
        prompt,
    ],
    config = types.GenerateContentConfig(
        system_instruction="Don't forget to escape the dollar signs",
    )
)

Markdown(response.text)

Okay, here's a table of your items and notes based on the image:

| Item                | Description                                    | Price | Notes                                       |
|---------------------|------------------------------------------------|-------|---------------------------------------------|
| Tumblers            | #5 Artichoke double dip, 4"h x 3"d (ish)     | \$20   |                                             |
| Small Bowls          | 3.5"h x 6.5"d                                  | \$35   |                                             |
| Medium Bowls         | 4"h x 7"d                                    | \$40   |                                             |
| Glaze Sample        | #5 Artichoke double dip                        | N/A    |                                             |
| Glaze Sample        | #6 Gemini double dip, Slow cool               | N/A    |                                             |

As you can see, Gemini is able to grasp which item each note corresponds to, including the last one.

# Analyze screen recordings for key moments

You can also use the model to analyze screen recordings. Let's say you're doing user studies on how people use your product, so you end up with lots of screen recordings, like this one, that you have to manually comb through.
With just one prompt, the model can describe all the actions in your video.

<video controls width="400"><source src="https://storage.googleapis.com/generativeai-downloads/videos/user_study.mp4" type="video/mp4"></video>

In [14]:
prompt = "Generate a paragraph that summarizes this video. Keep it to 3 to 5 sentences with corresponding timecodes." # @param ["Generate a paragraph that summarizes this video. Keep it to 3 to 5 sentences with corresponding timecodes.", "Choose 5 key shots from this video and put them in a table with the timecode, text description of 10 words or less, and a list of objects visible in the scene (with representative emojis).", "Generate bullet points for the video. Place each bullet point into an object with the timecode of the bullet point in the video."] {"allow-input":true}

video = user_study_video # @param ["trailcam_video", "pottery_video", "post_its_video", "user_study_video"] {"type":"raw","allow-input":true}

response = client.models.generate_content(
    model=model_name,
    contents=[
        video,
        prompt,
    ]
)

Markdown(response.text)

Here is a summary of the video in 3-5 sentences:

The video demonstrates a mobile app called "My Garden App."  The user adds a Fern, Cactus, and Hibiscus to the cart by clicking "Like" and "Add to Cart" [00:00:09-00:00:25].  The user navigates to the cart and profile pages to view the items in the cart and the number of liked plants and cart items [00:00:31-00:00:34]. The user adds an Orchid and Snake Plant to the cart, likes the Palm, and removes the Hibiscus from the cart [00:00:41-00:00:48].


Once again, the [live demo on AI Studio](https://aistudio.google.com/starter-apps/video) shows an example of how to postprocess this output. Check the [code of that demo](https://github.com/google-gemini/starter-applets/tree/main/video) for more details.
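
For a summary like the one above, a simple form of postprocessing is to pull out the bracketed timecode ranges so that each sentence can be linked to a segment of the recording. The sketch below is one way to do it under the assumption that the model keeps the `[HH:MM:SS-HH:MM:SS]` format seen in this output; the helper names and the shortened sample are hypothetical.

```python
import re

# Matches markers such as [00:00:09-00:00:25]
RANGE_PATTERN = re.compile(r"\[(\d{2}:\d{2}:\d{2})-(\d{2}:\d{2}:\d{2})\]")

def to_seconds(timecode):
    """Convert "HH:MM:SS" into a number of seconds."""
    hours, minutes, seconds = (int(part) for part in timecode.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def find_ranges(summary):
    """Return (start_seconds, end_seconds) for each timecode range marker."""
    return [
        (to_seconds(start), to_seconds(end))
        for start, end in RANGE_PATTERN.findall(summary)
    ]

# Shortened stand-in for response.text
sample = (
    "The user adds a Fern to the cart [00:00:09-00:00:25]. "
    "The user opens the cart page [00:00:31-00:00:34]."
)

ranges = find_ranges(sample)
print(ranges)  # → [(9, 25), (31, 34)]
```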

# Next Steps

Try it with your own videos using the [AI Studio's live demo](https://aistudio.google.com/starter-apps/video) or play with the examples from this notebook (in case you missed them, there are other prompts you can try in the dropdowns).

For more examples of the Gemini 2.0 capabilities, check the [Gemini 2.0 folder of the cookbook](https://github.com/google-gemini/cookbook/tree/main/gemini-2/). You'll learn how to use the [Live API](../quickstarts/Get_started_LiveAPI.ipynb), juggle [multiple tools](../quickstarts/Get_started_LiveAPI_tools.ipynb), or use Gemini 2.0's [spatial understanding](../quickstarts/Spatial_understanding.ipynb) abilities.

The [examples](https://github.com/google-gemini/cookbook/tree/main/examples/) folder from the cookbook is also full of code samples illustrating creative ways to use Gemini's multimodal capabilities and long context.