# Generative models for video understanding

A few recent language models can understand video. They generalise across a variety of novel use cases and can be fine-tuned to extend their capabilities further.

These models typically combine video encoders (often building on CLIP-style frameworks) with large language models, which lets them both process temporal visual information and generate natural-language responses.

In [1]:
import os
from google import genai

In [2]:
# Read the API key from the environment; raises KeyError if GEMINI_API_KEY is not set
gemini_api_key = os.environ['GEMINI_API_KEY']

In [3]:
client = genai.Client(api_key=gemini_api_key)

In [4]:
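# Upload a local image so it can be referenced in the request
image_file = client.files.upload(file="./assets/pickleball.jpg")

# Send the uploaded image together with a text prompt
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[image_file, "Explain what is happening in the image"]
)

print(response.text)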

The image shows a group of people playing pickleball on an indoor court. There are two teams of two people each facing each other across a net in the foreground. A yellow ball is in the air above the net. In the background, there is another pair playing pickleball on a separate court. The court is marked with white lines on a blue surface. There is a mural on the wall that reads "AT SLEEP PICKLE REPEAT." There are also advertisements and signs visible on the walls. In the bottom right corner is a logo for "HiFy."


In [5]:
video_file = client.files.upload(file="./assets/basketball-top-5.mp4")
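
Uploaded videos are usually processed on the server before they can be used in a request, so it can help to wait for processing to finish before calling `generate_content`. The block below is a minimal sketch of such a wait loop, not part of the original notebook. The helper name `wait_until_active` and its parameters are hypothetical. The state strings `"ACTIVE"` and `"FAILED"`, and the `client.files.get(name=...)` call shown in the trailing comment, are assumptions about the `google-genai` SDK that should be checked against its documentation.

```python
import time


def wait_until_active(get_state, poll_seconds=2.0, timeout_seconds=300.0, sleep=time.sleep):
    """Poll get_state() until it returns "ACTIVE".

    get_state: zero-argument callable returning the current state as a string.
    Raises RuntimeError if the state becomes "FAILED" and TimeoutError if the
    file is still not active once timeout_seconds of waiting have elapsed.
    """
    waited = 0.0
    while True:
        state = get_state()
        if state == "ACTIVE":
            return state
        if state == "FAILED":
            raise RuntimeError("file processing failed")
        if waited >= timeout_seconds:
            raise TimeoutError(f"file still in state {state!r} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Assumed usage with the client created earlier (requires network access):
# wait_until_active(lambda: client.files.get(name=video_file.name).state.name)
```

Passing the state lookup in as a callable keeps the loop independent of the SDK, so it can be exercised with a stub before being pointed at a real upload.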


In [6]:
# Pass the uploaded video and a text prompt together in one request
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[video_file, "Give me timestamps of every time someone scores a point"]
)

print(response.text)

Here are the timestamps of every time someone scores a point in the provided video.

[00:00:15]
[00:00:25]
[00:00:31]
[00:00:41]
[00:00:54]
[00:01:14]
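
To use these timestamps programmatically (for example, to seek to or cut clips around each score), the text response can be converted into seconds. The block below is a small sketch, assuming the response keeps the `[HH:MM:SS]` format seen in the output above. The model returns free-form text, so this format is not guaranteed, and `parse_timestamps` is a hypothetical helper rather than part of the SDK.

```python
import re

# Matches bracketed timestamps such as [00:01:14]
TIMESTAMP_PATTERN = re.compile(r"\[(\d{2}):(\d{2}):(\d{2})\]")


def parse_timestamps(text):
    """Return every [HH:MM:SS] timestamp found in text, as seconds from the start."""
    return [
        int(hours) * 3600 + int(minutes) * 60 + int(seconds)
        for hours, minutes, seconds in TIMESTAMP_PATTERN.findall(text)
    ]


sample = "[00:00:15]\n[00:00:25]\n[00:01:14]"
print(parse_timestamps(sample))  # [15, 25, 74]
```

In the notebook, `parse_timestamps(response.text)` would be the natural call. An empty list means no timestamp in the expected format was found.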
