## Cookbook for Video Understanding with Seed-1.8

Seed-1.8 offers strong video understanding and reasoning capabilities including video temporal grounding, spatiotemporal grounding, video reasoning, and long video understanding.

Compared with Seed1.5-VL, Seed-1.8 uses fewer tokens per video frame (as few as 64 tokens per image), accepts up to 1280 frames per video input, and is optimized for inference on long videos.

The examples below show how to use Seed-1.8 for video understanding.

### 0. Set up the code

#### Prerequisites
- API Key: [Obtain an API key](https://console.volcengine.com/ark/region:ark+cn-beijing/apiKey) for the Seed-1.8 model.
- Model Endpoint: Note the base URL of the Ark API (`https://ark.cn-beijing.volces.com/api/v3` in this notebook).
- Activate the Model: Activate the Seed-1.8 model in [the Ark Console](https://console.volcengine.com/ark/region:ark+cn-beijing/openManagement).
- Install Dependencies: Install the required Python packages.

In [1]:
pip install -r ../requirements.txt

Looking in indexes: https://bytedpypi.byted.org/simple/
Note: you may need to restart the kernel to use updated packages.


In [2]:
# Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
# SPDX-License-Identifier: Apache-2.0

# Set your API key and model endpoint here
import os

os.environ['ARK_API_KEY'] = 'your_ark_api_key'
os.environ['ARK_MODEL_ENDPOINT'] = 'doubao-seed-1-8-251215'
base_url = "https://ark.cn-beijing.volces.com/api/v3"

In [3]:
from typing import Optional
from enum import Enum
import os
import base64
import shutil

import cv2
import numpy as np
from openai import OpenAI

from video_processing import process_video

In [4]:
# Create the client; it reads the API key set in the environment above
client = OpenAI(
    base_url=base_url,
    api_key=os.environ.get("ARK_API_KEY"),
)

#### Video Processing

In [5]:
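def preprocess_video(video_path,
                     sampling_fps=1,
                     max_frames=1280,
                     max_video_length=81920):
    """
    Read a video file and sample frames from it with process_video.

    Returns the base64-encoded frames and their timestamps in seconds.
    """
    if not os.path.exists(video_path):
        raise FileNotFoundError(f'Video file does not exist: {video_path}')
    with open(video_path, 'rb') as f:
        video_bytes = f.read()

    # process_video returns a third value that this notebook does not use
    video_base64_list, timestamps, _ = process_video(video_bytes, sampling_fps,
                                                     max_frames,
                                                     max_video_length)
    return video_base64_list, timestamps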
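
The log lines that `process_video` prints later in this notebook (for example `22.4 second video, sampling 22 frames, 384 tokens per frame`) hint at how these parameters interact. The sketch below reproduces that arithmetic under two assumptions that the source does not confirm: that the frame count is `floor(duration * sampling_fps)` capped at `max_frames`, and that `max_video_length` is a total token budget for the video. `estimate_video_budget` is a hypothetical helper, not part of `video_processing`.

```python
import math


def estimate_video_budget(duration_s, sampling_fps=1, max_frames=1280,
                          tokens_per_frame=384, max_video_length=81920):
    """Rough estimate only; the real sampling logic lives in process_video."""
    # Assumption: frames are sampled at a fixed rate and capped at max_frames.
    n_frames = min(math.floor(duration_s * sampling_fps), max_frames)
    total_tokens = n_frames * tokens_per_frame
    # Assumption: max_video_length is the total token budget for one video.
    fits = total_tokens <= max_video_length
    return n_frames, total_tokens, fits


print(estimate_video_budget(22.4, sampling_fps=1))
print(estimate_video_budget(22.4, sampling_fps=2))
```

Under the same assumptions, 1280 frames at the minimum of 64 tokens per frame come to exactly 81920 tokens, which matches the default `max_video_length`.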

#### API Messages

In [6]:
def construct_video_message(prompt, video_frames, video_timestamps):
    """
    Construct a message for video understanding.
    """
    video_contents = []
    for image_bytes, timestamp in zip(video_frames, video_timestamps):
        video_contents.append({
            "type": "text",
            "text": f'[{round(timestamp, 1)} second]'
        })
        video_contents.append({
            "type": "image_url",
            "image_url": f"data:image/jpeg;base64,{image_bytes}",
        })
    contents = video_contents + [{"type": "text", "text": prompt}]
    message = [{"role": "user", "content": contents}]
    return message
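
Before sending a request with hundreds of frames, it can help to check that the message has the expected shape: one timestamp text part per image, followed by the prompt. The sketch below is a hypothetical debugging helper (`describe_message` is not part of the cookbook), run on a hand-built message with two placeholder frames that mirrors the layout produced by `construct_video_message`.

```python
from collections import Counter


def describe_message(messages):
    """Summarize the content parts of a single-turn user message."""
    parts = messages[0]["content"]
    counts = Counter(part["type"] for part in parts)
    return {
        "text_parts": counts.get("text", 0),
        "image_parts": counts.get("image_url", 0),
        # The prompt is appended last, after all frames.
        "prompt": parts[-1]["text"],
    }


# Placeholder payloads; real frames are base64-encoded JPEG data.
example = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "[0.0 second]"},
        {"type": "image_url", "image_url": "data:image/jpeg;base64,AAAA"},
        {"type": "text", "text": "[1.0 second]"},
        {"type": "image_url", "image_url": "data:image/jpeg;base64,BBBB"},
        {"type": "text", "text": "Describe this video in detail."},
    ],
}]
print(describe_message(example))
```

For a well-formed message, `text_parts` should equal `image_parts + 1`.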

#### Send request to API to get response

In [7]:
def api_complete(client, messages):
    response = client.chat.completions.create(model=os.environ['ARK_MODEL_ENDPOINT'],
                                              messages=messages,
                                              reasoning_effort="medium",
                                              stream=False,
                                              max_completion_tokens=65536)
    return response.choices[0].message
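
The cells below read `result.reasoning_content` directly. That field is an extra returned by the endpoint rather than part of the standard OpenAI message type, so it may be missing or empty for some models or settings. A minimal sketch of a more defensive formatter follows; `format_reply` is a hypothetical helper, and `SimpleNamespace` objects stand in for real API messages.

```python
from types import SimpleNamespace


def format_reply(message):
    """Format a reply, including the reasoning trace only when it exists."""
    reasoning = getattr(message, "reasoning_content", None)
    answer = message.content or ""
    if reasoning:
        return "<think>{}</think>\n{}".format(reasoning, answer)
    return answer


# Stand-ins for API responses, used only to illustrate both branches.
with_reasoning = SimpleNamespace(content="Two jumps.",
                                 reasoning_content="Count the jumps.")
without_reasoning = SimpleNamespace(content="Two jumps.")

print(format_reply(with_reasoning))
print(format_reply(without_reasoning))
```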

### 1. General Video Question-Answering

In [8]:
video_path = "samples/demo1.mp4"
text_prompts = "Describe this video in detail."
# sampling video frames
sampling_fps = 1
sampled_frames, timestamps = preprocess_video(video_path, sampling_fps=sampling_fps)
message = construct_video_message(prompt=text_prompts, video_timestamps=timestamps, video_frames=sampled_frames)
result = api_complete(client, message)
response = result.content
reasoning_content = result.reasoning_content
print("Seed-1.8:\n<think>{}</think>\n{}".format(reasoning_content, response))

[video] 22.4 second video, sampling 22 frames, 384 tokens per frame
Seed-1.8:
<think>Got it, let's break this video down step by step, two attempts by the female high jumper.

First attempt:
0.0s-3.0s: The blonde female athlete in a yellow/brown track suit starts her approach run down the blue track, gradually picking up speed.
3.0s-7.0s: She sprints down the runway, maintaining form as she nears the high jump bar.
7.0s-8.5s: She launches into her jump, arching over the bar successfully.
8.5s-11.5s: She lands on the large UCS high jump mat, then stands up and walks off the mat, while an official to the right holds up a red flag (probably indicating... wait no, wait the second attempt, wait no first attempt flag is red? Wait no, let's go to second attempt.

12.0s-16.0s: The athlete starts her second approach run down the track again, getting into sprint rhythm.
16.0s-19.0s: She sprints towards the bar, plants her takeoff foot, and launches into the jump, again clearing the bar with her 

### 2. Video Temporal Grounding
Seed-1.8 supports video temporal grounding for a single segment or for multiple segments. Given a user prompt, it locates the matching segments in the video and reports each one as a start time and an end time in seconds.

In [9]:
video_path = "samples/demo1.mp4"
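text_prompts = "Please output the precise start and end times of each high jump by the woman in the video. If there are multiple jumps, output the start and end times of each one separately."
# sampling video frames
sampling_fps = 2
sampled_frames, timestamps = preprocess_video(video_path, sampling_fps=sampling_fps)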
message = construct_video_message(prompt=text_prompts, video_timestamps=timestamps, video_frames=sampled_frames)
result = api_complete(client, message)
response = result.content
reasoning_content = result.reasoning_content
print("Seed-1.8:\n", "<think>{}</think>\n{}".format(reasoning_content, response))

[video] 22.4 second video, sampling 44 frames, 384 tokens per frame


Seed-1.8:
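 <think>The user wants the start and end times of the two high jumps in the video: the first is 0.0 to 8.1 seconds, the second is 11.8 to 19.7 seconds. First jump: the run-up starts at 00:00 and she clears the bar at 00:08. Second jump: the run-up starts at 00:11 and she clears the bar at 00:19. Let me put that together.</think>
First high jump: 00:00-00:08 (from the start of the run-up to clearing the bar and landing)
Second high jump: 00:11-00:19 (from the start of the run-up to clearing the bar and landing)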
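
The answer above reports each segment as an `mm:ss-mm:ss` range. To use such ranges downstream (for example, to cut clips), they can be converted to seconds. The sketch below is one way to do that with a regular expression; `parse_mmss_ranges` is a hypothetical helper, and it assumes the model keeps this exact range format, which is not guaranteed.

```python
import re

# Matches ranges such as "00:11-00:19", with optional spaces around the dash.
RANGE_RE = re.compile(r"(\d{1,2}):(\d{2})\s*-\s*(\d{1,2}):(\d{2})")


def parse_mmss_ranges(text):
    """Return a list of (start_seconds, end_seconds) found in the text."""
    segments = []
    for m1, s1, m2, s2 in RANGE_RE.findall(text):
        start = int(m1) * 60 + int(s1)
        end = int(m2) * 60 + int(s2)
        segments.append((start, end))
    return segments


sample = "First high jump: 00:00-00:08\nSecond high jump: 00:11-00:19"
print(parse_mmss_ranges(sample))  # [(0, 8), (11, 19)]
```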


### 3. Dense Video Captioning
Building on its temporal grounding capabilities, the model can split a video into its key events and caption each one with start and end timestamps.

In [11]:
video_path = "samples/demo1.mp4"
text_prompts = "Please watch this video carefully and find out all key events in this video, and output the events along with the start/end timestamps using second format."
# sampling video frames
sampling_fps = 1
sampled_frames, timestamps = preprocess_video(video_path, sampling_fps=sampling_fps)
message = construct_video_message(prompt=text_prompts, video_timestamps=timestamps, video_frames=sampled_frames)
result = api_complete(client, message)
response = result.content
reasoning_content = result.reasoning_content
print("Seed-1.8:\n<think>{}</think>\n{}".format(reasoning_content, response))

[video] 22.4 second video, sampling 22 frames, 384 tokens per frame


Seed-1.8:
<think>Got it, let's look at the video step by step.

First attempt:
0.0 to 7.0 seconds: The female athlete starts her run-up towards the high jump bar.
7.0 to 10.0 seconds: She jumps over the bar and lands on the mat, then gets up.
10.0 to 12.0 seconds: She walks off the mat after the first attempt.

Second attempt:
12.5 to 18.0 seconds: She starts her second run-up towards the high jump bar.
18.0 to 20.5 seconds: She jumps over the bar again, lands on the mat, and gets up.
20.5 to 22.3 seconds: She walks off the mat after the second attempt.</think>
0.0 - 7.0 second: A female athlete begins her run-up for the high jump attempt.
7.0 - 10.5 second: The athlete jumps over the high jump bar, lands on the mat and stands up.
10.5 - 12.5 second: The athlete walks off the high jump mat after the first try.
12.5 - 18.0 second: The athlete starts her second run-up for another high jump attempt.
18.0 - 20.5 second: The athlete jumps over the high jump bar again, lands on the mat and s
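
The captions above follow a `start - end second: caption` pattern, one event per line. A minimal sketch for turning that text into structured events is shown below. `parse_events` is a hypothetical helper; it assumes the model keeps this line format and silently skips any line that does not match.

```python
import re

# One event per line: "<start> - <end> second: <caption>"
EVENT_RE = re.compile(
    r"^[ \t]*(\d+(?:\.\d+)?)[ \t]*-[ \t]*(\d+(?:\.\d+)?)[ \t]*seconds?[ \t]*:[ \t]*(.+)$",
    re.MULTILINE,
)


def parse_events(text):
    """Return a list of (start_s, end_s, caption) tuples."""
    return [(float(start), float(end), caption.strip())
            for start, end, caption in EVENT_RE.findall(text)]


sample = (
    "0.0 - 7.0 second: A female athlete begins her run-up for the high jump attempt.\n"
    "7.0 - 10.5 second: The athlete jumps over the high jump bar, lands on the mat and stands up."
)
events = parse_events(sample)
for start, end, caption in events:
    print(f"{start:>5.1f}s to {end:>5.1f}s  {caption}")
```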

### 4. Video Spatiotemporal Grounding
Seed-1.8 extends the spatiotemporal grounding capabilities of Seed1.5-VL. Given a text or image prompt, the model can output the start and end timestamps of the target in the video, together with its localization coordinates (`<bbox>` or `<point>`).


In [12]:
question_template = "When and where does '{text}' occur? Return temporal and spatial boundaries."
SPATIOTEMPORAL_PROMPT = """Output a JSON list of timestamps (seconds) with associated bounding boxes.

Example:
{
  "9.0 second": [
      {"bbox": "<bbox>71 26 178 239</bbox>"}
  ],
  "10.0 second": [
      {"bbox": "<bbox>63 24 170 239</bbox>"}
  ],
  ...
  "39.0 second": [
      {"bbox": "<bbox>64 3 187 233</bbox>"}
  ],
  "40.0 second": [
      {"bbox": "<bbox>93 19 187 231</bbox>"}
  ]
}

Focus on:
- giving exact timestamps (seconds)
- returning <bbox>x1 y1 x2 y2</bbox> box coordinates
- only including frames where the object clearly appears
"""

In [13]:
video_path = "samples/demo1.mp4"

text = "the man holding up a red flag"
text_prompts = question_template.format(text=text) + "\n"+ SPATIOTEMPORAL_PROMPT
# sampling video frames
sampling_fps = 1
sampled_frames, timestamps = preprocess_video(video_path, sampling_fps=sampling_fps)
message = construct_video_message(prompt=text_prompts, video_timestamps=timestamps, video_frames=sampled_frames)
result = api_complete(client, message)
response = result.content
reasoning_content = result.reasoning_content
print("Seed-1.8:\n<think>{}</think>\n{}".format(reasoning_content, response))

[video] 22.4 second video, sampling 22 frames, 384 tokens per frame
Seed-1.8:
<think>Got it, let's see. Starting from 5.0 second, the man appears:

5.0 second: <bbox>938 605 999 999</bbox>
6.0 second: <bbox>680 577 770 999</bbox>
7.0 second: <bbox>934 608 999 999</bbox>
8.0 second: <bbox>932 598 999 999</bbox>
9.0 second: <bbox>896 566 987 999</bbox>
10.0 second: <bbox>881 564 980 972</bbox>
11.0 second: <bbox>854 564 968 961</bbox>
18.0 second: <bbox>820 577 926 999</bbox>
19.0 second: <bbox>929 575 999 999</bbox>
20.0 second: <bbox>850 580 958 999</bbox>
21.0 second: <bbox>826 569 941 999</bbox>
22.0 second: <bbox>780 556 879 945</bbox></think>
{"5.0 second": [{"bbox": "<bbox>938 605 999 999</bbox>"}], "6.0 second": [{"bbox": "<bbox>680 577 770 999</bbox>"}], "7.0 second": [{"bbox": "<bbox>934 608 999 999</bbox>"}], "8.0 second": [{"bbox": "<bbox>932 598 999 999</bbox>"}], "9.0 second": [{"bbox": "<bbox>896 566 987 999</bbox>"}], "10.0 second": [{"bbox": "<bbox>881 564 980 972</bbox>
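
To draw these boxes on the original frames, the coordinates need to be mapped to pixels. The sketch below makes two assumptions that the source does not state: that coordinates are normalized to a 0 to 1000 grid (the frequent `999` values in the output suggest this), and that the video is 1920x1080 (a made-up size for illustration). `parse_boxes` is a hypothetical helper. It uses a regular expression instead of `json.loads` so that it still works on a response that is cut off mid-way, and it matches only the first box listed for each timestamp.

```python
import re

# Matches: "<t> second": [{"bbox": "<bbox>x1 y1 x2 y2</bbox>
BOX_RE = re.compile(
    r'"(\d+(?:\.\d+)?) second":\s*\[\s*\{\s*"bbox":\s*'
    r'"<bbox>(\d+) (\d+) (\d+) (\d+)</bbox>'
)


def parse_boxes(text, width, height, grid=1000):
    """Map each timestamp to a pixel-space (x1, y1, x2, y2) box."""
    boxes = {}
    for ts, x1, y1, x2, y2 in BOX_RE.findall(text):
        # Assumption: coordinates are normalized to a grid of size `grid`.
        boxes[float(ts)] = (
            round(int(x1) * width / grid),
            round(int(y1) * height / grid),
            round(int(x2) * width / grid),
            round(int(y2) * height / grid),
        )
    return boxes


# Truncated on purpose, like a response that hit an output limit.
sample = ('{"5.0 second": [{"bbox": "<bbox>938 605 999 999</bbox>"}], '
          '"6.0 second": [{"bbox": "<bbox>680 577 770 999</bbox>')
print(parse_boxes(sample, width=1920, height=1080))
```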