<a href="https://colab.research.google.com/github/merveenoyan/smol-vision/blob/main/Gemma_3n_Video_Vibe_Tests.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Gemma 3n Video with Audio Inference

In this notebook we'll run inference with Gemma-3n on a video together with its audio track.

In [None]:
!pip install -U -q transformers timm datasets

We will load a test video and the Gemma-3n model, so make sure you have access to the model on the Hugging Face Hub and provide an access token.

In [None]:
from huggingface_hub import login
login()

In [None]:
from transformers import AutoProcessor, Gemma3nForConditionalGeneration
import torch
model = Gemma3nForConditionalGeneration.from_pretrained(
    "google/gemma-3n-E4B-it", torch_dtype=torch.bfloat16,
).to("cuda")
processor = AutoProcessor.from_pretrained(
    "google/gemma-3n-E4B-it",
)
processor.tokenizer.padding_side = "right"

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Download the video for inference.

In [None]:
!wget https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/IMG_8137.mp4


Extract the audio track from the video.

In [None]:
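import os
import subprocess

filename = "IMG_8137.mp4"
audio_path = os.path.join("audios", "audio.wav")
# ffmpeg does not create missing directories, so make sure the output folder exists
os.makedirs("audios", exist_ok=True)

# Write only the audio stream ("-map a") to a WAV file; "-y" overwrites existing output
subprocess.run([
    "ffmpeg", "-i", filename,
    "-q:a", "0", "-map", "a",
    audio_path,
    "-y"
], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)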

CompletedProcess(args=['ffmpeg', '-i', 'IMG_8137.mp4', '-q:a', '0', '-map', 'a', 'audios/audio.wav', '-y'], returncode=0)
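
The extraction step above discards ffmpeg's output, so a failure would go unnoticed unless you inspect the return code. The sketch below is one possible variant, not the notebook's method: `build_ffmpeg_cmd` and `extract_audio` are hypothetical helper names, and resampling to 16 kHz mono (`-ar 16000 -ac 1`) is an assumption about what suits the audio encoder; the processor may already resample for you.

```python
import os
import shutil
import subprocess


def build_ffmpeg_cmd(video_path, audio_path, sample_rate=16000):
    """Return an ffmpeg command that writes the audio track as mono WAV."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", video_path,         # input video
        "-vn",                    # drop the video stream
        "-ac", "1",               # downmix to one channel (assumption)
        "-ar", str(sample_rate),  # resample (assumption: 16 kHz)
        audio_path,
    ]


def extract_audio(video_path, audio_path):
    """Run ffmpeg and raise if it is missing or exits with an error."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    os.makedirs(os.path.dirname(audio_path) or ".", exist_ok=True)
    # check=True raises CalledProcessError on a non-zero exit code
    subprocess.run(build_ffmpeg_cmd(video_path, audio_path),
                   check=True, capture_output=True)
    return audio_path


print(build_ffmpeg_cmd("IMG_8137.mp4", "audios/audio.wav"))
```

Calling `extract_audio("IMG_8137.mp4", "audios/audio.wav")` would then either return the output path or raise, instead of failing silently.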

In [None]:
import cv2
from PIL import Image
import numpy as np

def downsample_video(video_path, num_frames=7):
    """Return up to `num_frames` evenly spaced (PIL image, timestamp in seconds) pairs."""
    vidcap = cv2.VideoCapture(video_path)
    total_frames = int(vidcap.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = vidcap.get(cv2.CAP_PROP_FPS)

    frames = []
    # Evenly spaced frame indices from the first to the last frame
    frame_indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)

    for i in frame_indices:
        vidcap.set(cv2.CAP_PROP_POS_FRAMES, i)
        success, image = vidcap.read()
        if success:  # frames that fail to decode are skipped
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB) # Convert from BGR to RGB
            pil_image = Image.fromarray(image)
            timestamp = round(i / fps, 2)
            frames.append((pil_image, timestamp))

    vidcap.release()
    return frames
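
To see how the sampling in `downsample_video` picks its frames, here is a minimal sketch of the same `np.linspace` logic without any video I/O. The helper names `sample_frame_indices` and `frame_timestamps` are hypothetical, and the 61-frame, 10 fps clip is a made-up example rather than the notebook's video.

```python
import numpy as np


def sample_frame_indices(total_frames, num_frames=7):
    """Evenly spaced frame indices from the first to the last frame."""
    return np.linspace(0, total_frames - 1, num_frames, dtype=int)


def frame_timestamps(indices, fps):
    """Convert frame indices to timestamps in seconds, rounded to 2 places."""
    return [round(float(i) / fps, 2) for i in indices]


# Hypothetical clip: 61 frames at 10 fps (not the notebook's video)
indices = sample_frame_indices(61)
print(indices.tolist())               # [0, 10, 20, 30, 40, 50, 60]
print(frame_timestamps(indices, 10))  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

Because the first and last indices are always included, the sampled frames span the whole clip regardless of its length.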


We will generate a description of the video and compare it with what actually happens in the clip as a vibe check.

First, we need to downsample the video to frames.

In [None]:
frames = downsample_video(filename)

In [None]:
frames

[(<PIL.Image.Image image mode=RGB size=1080x1920>, np.float64(0.0)),
 (<PIL.Image.Image image mode=RGB size=1080x1920>, np.float64(1.03)),
 (<PIL.Image.Image image mode=RGB size=1080x1920>, np.float64(2.09)),
 (<PIL.Image.Image image mode=RGB size=1080x1920>, np.float64(3.12)),
 (<PIL.Image.Image image mode=RGB size=1080x1920>, np.float64(4.17)),
 (<PIL.Image.Image image mode=RGB size=1080x1920>, np.float64(5.21)),
 (<PIL.Image.Image image mode=RGB size=1080x1920>, np.float64(6.26))]

In [None]:
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in this video? Summarize the events."}]
    }
]
# Interleave a timestamp label and the saved frame for each sampled image
for image, timestamp in frames:
    messages[1]["content"].append({"type": "text", "text": f"Frame {timestamp}: "})
    image.save(f"image_{timestamp}.png")
    messages[1]["content"].append({"type": "image", "url": f"./image_{timestamp}.png"})
# Attach the extracted audio track last
messages[1]["content"].append({"type": "audio", "audio": "audios/audio.wav"})

In [None]:
messages

[{'role': 'system',
  'content': [{'type': 'text', 'text': 'You are a helpful assistant.'}]},
 {'role': 'user',
  'content': [{'type': 'text',
    'text': 'What is happening in this video? Summarize the events.'},
   {'type': 'text', 'text': 'Frame 0.0: '},
   {'type': 'image', 'url': './image_0.0.png'},
   {'type': 'text', 'text': 'Frame 1.03: '},
   {'type': 'image', 'url': './image_1.03.png'},
   {'type': 'text', 'text': 'Frame 2.09: '},
   {'type': 'image', 'url': './image_2.09.png'},
   {'type': 'text', 'text': 'Frame 3.12: '},
   {'type': 'image', 'url': './image_3.12.png'},
   {'type': 'text', 'text': 'Frame 4.17: '},
   {'type': 'image', 'url': './image_4.17.png'},
   {'type': 'text', 'text': 'Frame 5.21: '},
   {'type': 'image', 'url': './image_5.21.png'},
   {'type': 'text', 'text': 'Frame 6.26: '},
   {'type': 'image', 'url': './image_6.26.png'},
   {'type': 'audio', 'audio': 'audios/audio.wav'}]}]
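
The interleaved structure shown above (prompt, then a timestamp label and image per frame, then the audio) can also be produced by a small pure function, which makes it easier to reuse for other clips. This is only a sketch; `build_user_content` is a hypothetical helper and the file paths are placeholders.

```python
def build_user_content(prompt, frame_entries, audio_path=None):
    """Interleave a text prompt, timestamped frames, and optional audio.

    frame_entries is a list of (image_path, timestamp) pairs.
    """
    content = [{"type": "text", "text": prompt}]
    for image_path, timestamp in frame_entries:
        content.append({"type": "text", "text": f"Frame {timestamp}: "})
        content.append({"type": "image", "url": image_path})
    if audio_path is not None:
        content.append({"type": "audio", "audio": audio_path})
    return content


content = build_user_content(
    "What is happening in this video? Summarize the events.",
    [("./image_0.0.png", 0.0), ("./image_1.03.png", 1.03)],
    audio_path="audios/audio.wav",
)
print(len(content))  # 6
print([item["type"] for item in content])
```

The returned list would go under the `"content"` key of the user message.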

In [None]:
# Tokenize the chat, load the images and audio, then move tensors to the model's device and dtype
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device).to(model.dtype)

In [None]:
inputs["input_ids"].shape[-1]

2087
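
Most of these 2087 tokens come from the media rather than the text. As a rough back-of-the-envelope check, the sketch below assumes 256 soft tokens per image and 188 per audio clip. Both numbers are assumptions about the processor's defaults and are not read from the loaded processor, so verify them against your processor configuration before relying on them.

```python
# Assumed defaults, not read from the loaded processor:
IMAGE_SOFT_TOKENS = 256   # assumption: soft tokens per image
AUDIO_SOFT_TOKENS = 188   # assumption: soft tokens per audio clip


def estimate_media_tokens(num_images, num_audio_clips):
    """Estimate how many prompt positions the images and audio occupy."""
    return num_images * IMAGE_SOFT_TOKENS + num_audio_clips * AUDIO_SOFT_TOKENS


total_prompt_tokens = 2087          # value printed above
media = estimate_media_tokens(7, 1)
print(media)                        # 1980
print(total_prompt_tokens - media)  # 107 tokens left for text and markers
```

Under these assumptions, adding one more sampled frame would cost roughly another 256 tokens of context.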

In [None]:
with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=200, do_sample=False)

The following generation flags are not valid and may be ignored: ['top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
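
This warning is harmless here. With `do_sample=False` generation is greedy, so sampling settings such as `top_p` and `top_k` (most likely inherited from the model's default generation config) have nothing to act on. A toy sketch of the difference, using made-up logits and hypothetical helper names rather than anything from `transformers`:

```python
import numpy as np


def greedy_pick(logits):
    """Greedy decoding: always take the highest-scoring token."""
    return int(np.argmax(logits))


def top_k_sample(logits, k, rng):
    """Sampling: keep the k best tokens, then draw one by softmax probability."""
    logits = np.asarray(logits, dtype=float)
    top = np.argsort(logits)[-k:]                    # indices of the k largest logits
    probs = np.exp(logits[top] - logits[top].max())  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))


logits = [0.1, 2.0, -1.0, 1.5]             # made-up scores for 4 tokens
print(greedy_pick(logits))                 # 1
rng = np.random.default_rng(0)
print(top_k_sample(logits, k=2, rng=rng))  # either 1 or 3
```

Greedy decoding returns the same token every time for the same logits, which is why the run above is deterministic.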


In [None]:
input_len = inputs["input_ids"].shape[-1]

generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)

Here's a summary of what's happening in the video:

The video appears to be taken at a ski resort. The main subject is a person snowboarding down a snowy slope. 

**Initial Scene (0.0 - 1.03):** The snowboarder is initially positioned on the slope, seemingly having fallen or stopped. Other skiers and snowboarders are visible in the background, waiting at what looks like a lift station.

**Mid-Video (1.03 - 6.26):** The snowboarder gets back up and continues down the slope. They navigate past other people, including skiers and snowboarders, and eventually reach a lift station. The video shows the snowboarder interacting with others at the lift, possibly waiting for the lift to start or having just gotten off. There are also other skiers and snowboarders around the lift station.

**End Scene (6.26):** The snowboarder is still at the lift station,
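
The summary stops mid-sentence because `max_new_tokens=200` caps the generated length; raising it would let the model finish. The decoding cell also slices the output at `input_len`, since `generate` returns the prompt tokens followed by the newly generated ones. A toy sketch of that slicing with plain lists and made-up token ids (`strip_prompt` is a hypothetical helper):

```python
def strip_prompt(sequence, input_len):
    """Drop the first input_len tokens (the prompt) and keep the continuation."""
    return sequence[input_len:]


prompt_ids = [2, 105, 2364, 107]      # made-up prompt token ids
new_ids = [818, 3519, 106]            # made-up generated token ids
full_sequence = prompt_ids + new_ids  # layout of one row returned by generate

print(strip_prompt(full_sequence, len(prompt_ids)))  # [818, 3519, 106]
```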
