In [None]:
!pip install -qU transformers accelerate flash_attn

# Video-Text-to-Text

**Video-text-to-text** models, also known as **video language models** or **vision language models with video input**, are language models that take video as input alongside text.

These models have nearly the same architecture as **image-text-to-text** models except for some changes to accept video data, since video data is essentially image frames with temporal dependencies.

Video-text-to-text models are often trained on all vision modalities. Each example may contain a single video, multiple videos, a single image, or multiple images. Some of these models can also take interleaved inputs.

Types of video LMs:
* base models used for fine-tuning
* chat fine-tuned models for conversation
* instruction fine-tuned models

We will focus on inference with an instruction-tuned model, `llava-hf/llava-interleave-qwen-7b-hf`, which can take in interleaved data.

The term **"interleave"** in the model name `llava-interleave-qwen-7b-hf` refers to how the model handles different types of tokens, typically visual tokens (from an image encoder) and textual tokens (language tokens): they are mixed together within a single sequence.

Instead of requiring image tokens to be placed only before or after the text, the model accepts them alternated (or "interleaved") with the text. This design lets the model align and fuse visual and textual information more closely. Interleaving can help the transformer capture fine-grained relationships between the image content and the corresponding text, potentially improving its ability to understand and respond to multimodal inputs.

In essence, "interleave" means that multimodal information is processed in an integrated fashion rather than each modality being handled in isolation. Several multimodal models (such as LLaVA) take this approach to tie visual cues more closely to language understanding.
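
To make the idea concrete, here is a minimal sketch of how an interleaved prompt could be assembled from mixed text and image segments. The helper `build_interleaved_prompt` and its `("text", ...)` / `("image", ...)` segment format are hypothetical, chosen only for illustration; the sketch assumes the model marks each image position with an `<image>` placeholder.

```python
def build_interleaved_prompt(segments, image_token="<image>"):
    """Join text and image segments into one prompt, preserving their order.

    `segments` is a list of ("text", str) or ("image", obj) tuples.
    This format is an assumption made for illustration only.
    Returns the prompt string and the images in placeholder order.
    """
    parts, images = [], []
    for kind, value in segments:
        if kind == "text":
            parts.append(value)
        elif kind == "image":
            # Each image contributes one placeholder at its position in the text
            parts.append(image_token)
            images.append(value)
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return "".join(parts), images


prompt, images = build_interleaved_prompt([
    ("text", "Compare this frame "),
    ("image", "frame_a"),  # stand-in for a PIL image
    ("text", " with this one "),
    ("image", "frame_b"),
    ("text", ". What changed?"),
])
print(prompt)
```

The returned image list is in the same order as the placeholders, which is the order in which the images would be paired with them.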

In [None]:
from transformers import LlavaProcessor, LlavaForConditionalGeneration
import torch

model_id = 'llava-hf/llava-interleave-qwen-7b-hf'

processor = LlavaProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16
).to('cuda')

Some models consume a `<video>` token directly, while others expect one `<image>` token per sampled frame. This model takes the latter approach:

In [None]:
import uuid
import requests
import cv2
from PIL import Image


def replace_video_with_images(text, frames):
    # Expand each <video> placeholder into one <image> token per sampled frame
    return text.replace(
        '<video>',
        '<image>' * frames
    )


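def sample_frames(url, num_frames):
    # Download the video to a uniquely named local file
    response = requests.get(url)
    response.raise_for_status()
    path_id = str(uuid.uuid4())

    path = f"./{path_id}.mp4"

    with open(path, 'wb') as f:
        f.write(response.content)

    video = cv2.VideoCapture(path)
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    # Avoid an interval of 0 when the video has fewer frames than requested
    interval = max(total_frames // num_frames, 1)

    frames = []
    for i in range(total_frames):
        ret, frame = video.read()
        # Skip frames that could not be decoded (frame is None in that case)
        if not ret:
            continue
        if i % interval == 0:
            # OpenCV decodes to BGR; convert to RGB before building the PIL image
            pil_img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            frames.append(pil_img)

    video.release()

    return frames[:num_frames]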
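
The sampling step can also be expressed as choosing the frame indices up front, which makes it easy to check how many frames will be returned before decoding anything. This is a minimal sketch of evenly spaced index selection; `uniform_frame_indices` is a hypothetical helper and not part of OpenCV or `transformers`.

```python
def uniform_frame_indices(total_frames, num_frames):
    """Return up to `num_frames` evenly spaced frame indices in [0, total_frames)."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    # A short video cannot yield more frames than it contains
    num_frames = min(num_frames, total_frames)
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]


print(uniform_frame_indices(60, 6))
print(uniform_frame_indices(4, 6))
```

Because the step is a float, the indices stay evenly spread even when `total_frames` is not a multiple of `num_frames`.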

In [None]:
# prepare some inputs
video_1 = "https://huggingface.co/spaces/merve/llava-interleave/resolve/main/cats_1.mp4"
video_2 = "https://huggingface.co/spaces/merve/llava-interleave/resolve/main/cats_2.mp4"

video_1 = sample_frames(video_1, 6)
video_2 = sample_frames(video_2, 6)

videos = video_1 + video_2

videos

Now we can preprocess the inputs.

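This model expects a specific prompt template. To build the inputs:
1. Put all the sampled frames into one list.
2. Add one `<image>` token per frame to the user turn, and end the prompt with `assistant` to prompt the model to answer.
3. Preprocess the prompt and the frames with the processor.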
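
The steps above can be sketched as a small helper that builds the prompt string for any number of frames. `build_chat_prompt` is a hypothetical name; the template string mirrors the one used in this notebook and is assumed to be what this model expects.

```python
def build_chat_prompt(question, num_frames, image_token="<image>"):
    """Build a single-turn prompt with one image placeholder per sampled frame."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    placeholders = image_token * num_frames
    # The trailing 'assistant' marker tells the model to start its answer
    return (
        f"<|im_start|>user{placeholders}\n"
        f"{question}<|im_end|><|im_start|>assistant"
    )


example = build_chat_prompt("What is happening?", num_frames=3)
print(example)
```

Deriving the placeholder count from the number of frames keeps the prompt and the frame list in sync.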

In [None]:
user_prompt = "Are these two cats in these two videos doing the same thing?"

tokens = '<image>' * len(videos)  # 6 frames sampled from each of the 2 videos, so 12 <image> tokens
prompt = "<|im_start|>user" + tokens + f"\n{user_prompt}<|im_end|><|im_start|>assistant"

inputs = processor(
    text=prompt,
    images=videos,
    return_tensors='pt'
).to(model.device, model.dtype)

We can now call `generate()` to run inference:

In [None]:
output = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=False
)

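# generate() returns the prompt tokens followed by the new tokens,
# so skip the prompt and decode only the generated answer
prompt_length = inputs['input_ids'].shape[1]

processor.decode(
    output[0][prompt_length:],
    skip_special_tokens=True
)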

The sequence returned by the model contains our input prompt followed by the answer, so we keep only the part that comes after the prompt and the `assistant` marker.
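
When only the decoded string is available (for example, from a saved log), the answer can also be recovered by splitting on the `assistant` marker. This is a minimal sketch, assuming the first occurrence of the word `assistant` in the decoded text is the one the prompt template inserted; `extract_reply` is a hypothetical helper and not part of `transformers`.

```python
def extract_reply(decoded, marker="assistant"):
    """Return the text after the first `marker`, or the whole text if it is absent."""
    _, found, reply = decoded.partition(marker)
    # If the marker is missing, fall back to the full decoded string
    return reply.strip() if found else decoded.strip()


sample = "user\nAre these cats playing?assistant\nYes, both cats are playing."
print(extract_reply(sample))
```

Note that this would cut in the wrong place if the user question itself contained the word `assistant`.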