## Video LLaVA

Video question answering (VQA) on a video of a person making coffee.

Model Card:
* https://huggingface.co/LanguageBind/Video-LLaVA-7B-hf

In [2]:
%%script echo skip
!pip install av
!pip install --upgrade transformers

skip


In [1]:
import os
import sys
current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)
sys.path.append(parent_dir)

from utils.dataset import get_iterater_samples_simplified
from utils.metric import calculate_scores
from utils.monitoring import calculate_utilization, format_utilization_narrow, print_utilization

In [2]:
import av
import numpy as np
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

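def read_video_pyav(container, indices):
    """Decode the selected frames of a video with PyAV.

    Args:
        container: an opened PyAV input container.
        indices: sorted frame indices to decode.

    Returns:
        np.ndarray of shape (num_frames, height, width, 3) in RGB order.
    """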
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])


model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf", device_map=0)
# The processor only prepares inputs on the CPU, so it takes no device_map.
processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### making-coffee_low-quality.mp4: lowest quality, 2 MB, 352x240, 503 kbps, 30 fps

In [5]:
video_path = "videos/making-coffee_low-quality.mp4"
container = av.open(video_path)

# uniformly sample 8 frames across the whole video
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)
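
The sampling step above picks 8 evenly spaced frame indices. As a minimal sketch, the same idea can be written as a reusable function. The name `uniform_frame_indices` and the `num_frames` parameter are hypothetical, and `np.linspace` with `endpoint=False` is swapped in for `np.arange` so that the number of indices stays exact for any frame count. It assumes the container reports a non-zero frame count; PyAV returns 0 when the count is unknown.

```python
import numpy as np


def uniform_frame_indices(total_frames: int, num_frames: int = 8) -> np.ndarray:
    """Return `num_frames` evenly spaced frame indices in [0, total_frames)."""
    if total_frames <= 0:
        raise ValueError("total_frames must be positive (metadata may be missing)")
    # Never ask for more frames than the video contains.
    num_frames = min(num_frames, total_frames)
    # endpoint=False keeps every index strictly below total_frames.
    return np.linspace(0, total_frames, num_frames, endpoint=False).astype(int)


# Example: a 240-frame clip sampled at 8 positions.
print(uniform_frame_indices(240))
```

The result could be passed to `read_video_pyav(container, indices)` in place of the inline `np.arange` expression.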

In [9]:
prompt = "USER: <video>Describe the process in the video in detail. ASSISTANT:"
inputs = processor(text=prompt, videos=clip, return_tensors="pt").to("cuda")

# generate_ids = model.generate(**inputs, max_length=80)
generate_ids = model.generate(**inputs, max_length=800)
# generate_ids = model.generate(**inputs, max_length=8000)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])

utilization = calculate_utilization()
utilization_str = format_utilization_narrow(utilization)
print(
    f"total/used/cuda/res/ram (Gb): {utilization_str['total_memory']}/{utilization_str['memory_used']}/"
    f"{utilization_str['cuda_allocated']}/{utilization_str['cuda_reserved']}/{utilization_str['ram_usage']}"
)

USER: Describe the process in the video in detail. ASSISTANT: In the video, a person is seen pouring milk into a cup using a coffee maker. The person starts by pouring milk into the cup, and then proceeds to add coffee to the cup. The person then uses a spoon to stir the mixture, ensuring that the coffee and milk are well combined. The cup is then placed on a countertop, and the person takes a sip of the beverage. The video showcases the process of making a coffee and milk drink, highlighting the steps involved in preparing the beverage. The person in the video demonstrates the importance of stirring the mixture to ensure that the coffee and milk are well combined, resulting in a smooth and enjoyable drink.
total/used/cuda/res/ram (Gb): 79.15/36.12/27.46/34.76/5.71
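
The prompt, generate, and decode steps in the cell above are the same for every question, so they could be wrapped in a small helper. This is only a sketch: `build_prompt` and `ask_video` are hypothetical names, and the helper assumes `model`, `processor`, and `clip` are the objects created earlier in this notebook and that a CUDA device is available.

```python
def build_prompt(question: str) -> str:
    """Wrap a question in the USER/ASSISTANT template used in the cell above."""
    return f"USER: <video>{question} ASSISTANT:"


def ask_video(model, processor, clip, question: str, max_length: int = 800) -> str:
    """Run one question against a sampled clip and return the decoded text."""
    inputs = processor(
        text=build_prompt(question), videos=clip, return_tensors="pt"
    ).to("cuda")
    generate_ids = model.generate(**inputs, max_length=max_length)
    # The decoded string still contains the echoed prompt, as in the output above.
    return processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]


print(build_prompt("Describe the process in the video in detail."))
```

With this in place, each question becomes a single call such as `ask_video(model, processor, clip, "Describe the process in the video in detail.")`.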


### making-coffee_20.mp4: higher quality, 20 MB, 720x480, 5273 kbps, 15 fps

In [4]:
video_path = "videos/making-coffee_20.mp4"
container = av.open(video_path)

# uniformly sample 8 frames across the whole video
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)

In [6]:
prompt = "USER: <video>What the person did after turning on a coffee machine? ASSISTANT:"
inputs = processor(text=prompt, videos=clip, return_tensors="pt").to("cuda")

# generate_ids = model.generate(**inputs, max_length=80)
generate_ids = model.generate(**inputs, max_length=800)
# generate_ids = model.generate(**inputs, max_length=8000)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])

utilization = calculate_utilization()
utilization_str = format_utilization_narrow(utilization)
print(
    f"total/used/cuda/res/ram (Gb): {utilization_str['total_memory']}/{utilization_str['memory_used']}/"
    f"{utilization_str['cuda_allocated']}/{utilization_str['cuda_reserved']}/{utilization_str['ram_usage']}"
)

USER: What the person did after turning on a coffee machine? ASSISTANT: After turning on the coffee machine, the person poured milk into a cup.Ъ
total/used/cuda/res/ram (Gb): 79.15/34.23/27.46/32.87/6.05
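
The decoded text repeats the prompt before the answer, and the short answers in this section end with a stray `Ъ` character. A minimal post-processing sketch follows. The function name `extract_answer` is hypothetical, and treating `Ъ` as a decoding artifact that is safe to strip is an assumption, not something confirmed by the model card.

```python
def extract_answer(decoded: str, stray: str = "Ъ") -> str:
    """Return only the assistant's reply from a decoded generation."""
    # Everything after the last "ASSISTANT:" marker is the model's reply.
    answer = decoded.rsplit("ASSISTANT:", 1)[-1].strip()
    # Drop the stray trailing character seen in the outputs (assumed artifact).
    return answer.rstrip(stray).rstrip()


decoded = (
    "USER: What the person did after turning on a coffee machine? "
    "ASSISTANT: After turning on the coffee machine, the person poured milk into a cup.Ъ"
)
print(extract_answer(decoded))
```

Splitting on the last `ASSISTANT:` marker keeps the helper safe even if the question itself happens to contain that word earlier in the prompt.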


In [11]:
prompt = "USER: <video>Does a person take a sip of beferage and when? ASSISTANT:"
inputs = processor(text=prompt, videos=clip, return_tensors="pt").to("cuda")

# generate_ids = model.generate(**inputs, max_length=80)
generate_ids = model.generate(**inputs, max_length=800)
# generate_ids = model.generate(**inputs, max_length=8000)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])

utilization = calculate_utilization()
utilization_str = format_utilization_narrow(utilization)
print(
    f"total/used/cuda/res/ram (Gb): {utilization_str['total_memory']}/{utilization_str['memory_used']}/"
    f"{utilization_str['cuda_allocated']}/{utilization_str['cuda_reserved']}/{utilization_str['ram_usage']}"
)

USER: Does a person take a sip of beferage and when? ASSISTANT: Yes, a person takes a sip of beverage from the cup after the coffee is done brewing.Ъ
total/used/cuda/res/ram (Gb): 79.15/34.23/27.46/32.87/6.01


In [12]:
prompt = "USER: <video>What was the boiler's pressure during the procedure? ASSISTANT:"
inputs = processor(text=prompt, videos=clip, return_tensors="pt").to("cuda")

# generate_ids = model.generate(**inputs, max_length=80)
generate_ids = model.generate(**inputs, max_length=800)
# generate_ids = model.generate(**inputs, max_length=8000)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])

utilization = calculate_utilization()
utilization_str = format_utilization_narrow(utilization)
print(
    f"total/used/cuda/res/ram (Gb): {utilization_str['total_memory']}/{utilization_str['memory_used']}/"
    f"{utilization_str['cuda_allocated']}/{utilization_str['cuda_reserved']}/{utilization_str['ram_usage']}"
)

USER: What was the boiler's pressure during the procedure? ASSISTANT: The boiler's pressure was 1 bar during the procedure.Ъ
total/used/cuda/res/ram (Gb): 79.15/34.23/27.46/32.87/6.03


In [13]:
prompt = "USER: <video>What was the number on a display of the coffee machine the camera has captured? ASSISTANT:"
inputs = processor(text=prompt, videos=clip, return_tensors="pt").to("cuda")

# generate_ids = model.generate(**inputs, max_length=80)
generate_ids = model.generate(**inputs, max_length=800)
# generate_ids = model.generate(**inputs, max_length=8000)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])

utilization = calculate_utilization()
utilization_str = format_utilization_narrow(utilization)
print(
    f"total/used/cuda/res/ram (Gb): {utilization_str['total_memory']}/{utilization_str['memory_used']}/"
    f"{utilization_str['cuda_allocated']}/{utilization_str['cuda_reserved']}/{utilization_str['ram_usage']}"
)

USER: What was the number on a display of the coffee machine the camera has captured? ASSISTANT: The camera captured a display of the coffee machine that shows the number 100.Ъ
total/used/cuda/res/ram (Gb): 79.15/34.23/27.46/32.87/6.09


In [16]:
prompt = "USER: <video>What is the brand of coffee machine and the model name? ASSISTANT:"
inputs = processor(text=prompt, videos=clip, return_tensors="pt").to("cuda")

# generate_ids = model.generate(**inputs, max_length=80)
generate_ids = model.generate(**inputs, max_length=800)
# generate_ids = model.generate(**inputs, max_length=8000)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])

utilization = calculate_utilization()
utilization_str = format_utilization_narrow(utilization)
print(
    f"total/used/cuda/res/ram (Gb): {utilization_str['total_memory']}/{utilization_str['memory_used']}/"
    f"{utilization_str['cuda_allocated']}/{utilization_str['cuda_reserved']}/{utilization_str['ram_usage']}"
)

USER: What is the brand of coffee machine and the model name? ASSISTANT: The brand of the coffee machine is Breville, and the model name is BES870XL.Ъ
total/used/cuda/res/ram (Gb): 79.15/34.23/27.46/32.87/6.08


In [15]:
prompt = "USER: <video>How much does this coffee machine cost? ASSISTANT:"
inputs = processor(text=prompt, videos=clip, return_tensors="pt").to("cuda")

# generate_ids = model.generate(**inputs, max_length=80)
generate_ids = model.generate(**inputs, max_length=800)
# generate_ids = model.generate(**inputs, max_length=8000)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])

utilization = calculate_utilization()
utilization_str = format_utilization_narrow(utilization)
print(
    f"total/used/cuda/res/ram (Gb): {utilization_str['total_memory']}/{utilization_str['memory_used']}/"
    f"{utilization_str['cuda_allocated']}/{utilization_str['cuda_reserved']}/{utilization_str['ram_usage']}"
)

USER: How much does this coffee machine cost? ASSISTANT: The coffee machine in the video is a Breville coffee machine, and it costs $199.Ъ
total/used/cuda/res/ram (Gb): 79.15/34.23/27.46/32.87/6.16
