MLLM Pressure Test: Needle In A Video Haystack


Table of Contents

  • Introduction
  • Getting Started
  • Evaluation Results
  • Acknowledgement
  • Citation

Introduction

To address the limitations of existing long-form video-language understanding benchmarks, we take inspiration from recent developments in the field and introduce a benchmark designed for identifying specific content within long videos, a challenge we call the Multimodal Needle In A Video Haystack (NIAVH). The benchmark is unique in its flexibility: it supports text, images, and videos as needles, and it accommodates video content of any length, enabling a more comprehensive assessment of a model's video understanding capabilities.

In our benchmark, we use egocentric videos from the Ego4D dataset as the "haystack". Within this haystack, we seek to locate the "needle", which we provide in three distinct modalities. For the text modality, we supply a crafted description. For the image modality, we use DALL-E to create an image that visually represents this description. For the video modality, we use Sora to generate a short video clip based on the same description. In each case, the needle, whether text, image, or video, is set to a duration of 1 second.

Getting Started

Currently supported models: GPT4O, LLaVA-NeXT-Video, PLLaVA, and MA-LMM.

For additional models, please refer to the baselines we have provided for setup.

Installation

  • GPT4O
    • create a .env file and add API_KEY=xxx and, optionally, API_BASE=xxx (see the example .env sketch after this list)
    • install numpy and opencv-python
  • LLaVA-NeXT-Video
    • set up the environment following its instructions
    • download the checkpoint into needlehaystack/baselines/checkpoints/LLaVA-NeXT-Video/LLaVA-NeXT-Video-7B-DPO
  • PLLaVA
    • set up the environment following its instructions
    • download the checkpoint into needlehaystack/baselines/checkpoints/PLLaVA/pllava-7b
  • MA-LMM
    • set up the environment following its instructions
    • download the checkpoints into needlehaystack/baselines/checkpoints/MA-LMM, including vicuna-7b-v1.1, eva_vit_g.pth, and instruct_blip_vicuna7b_trimmed.pth
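
A minimal sketch of the .env file for the GPT4O baseline (placeholder values; API_BASE is only needed if you use a non-default, OpenAI-compatible endpoint):

# .env (placeholder values)
API_KEY=your-api-key
API_BASE=https://your-openai-compatible-endpoint/v1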

Install

Install the additional packages in the environment of the model to be tested:

pip install -r requirements.txt

Run Test

single needle debug

python -m needlehaystack.run --provider GPT4O --video_depth_percents "[50]" --context_lengths "[10]"

pressure test

python -m needlehaystack.run --provider GPT4O

parameters

  • provider - currently supported models: GPT4O, LLaVA-NeXT, PLLaVA, MA-LMM
  • evaluator_model_name - currently supported evaluation API: gpt-35-turbo-0125
  • needle - the needle content: a text string, a video filename, or an image filename
  • needle_modality - currently supports text, image, and video
  • needle_desc - required for image or video needles (the expected answer to the retrieval question)
  • retrieval_question - required for image or video needles (the question posed to the model)
  • needle_dir - required for image or video needles (directory where the needles are saved)
  • haystack_dir - directory where the haystack videos are saved
  • context_lengths - explicit list of video context lengths to test (in seconds), e.g. "[10]"
  • video_depth_percents - explicit list of needle depth percents to test, e.g. "[50]"
  • context_lengths_min - The minimum length of the context. Default is 1 second.
  • context_lengths_max - The maximum length of the context. Default is 320 seconds.
  • context_lengths_num_intervals - The number of intervals for the context length. Default is 40.
  • video_depth_percent_min - The minimum needle depth percent in the video. Default is 0.
  • video_depth_percent_max - The maximum needle depth percent in the video. Default is 100.
  • video_depth_percent_intervals - The number of intervals for the needle depth percent. Default is 12.

note: you can add more videos to the needlehaystack/haystack directory to support longer haystack videos
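
For example, a single-needle debug run with a video needle might combine these parameters as follows (a sketch: needle.mp4 is a placeholder filename, and the description and question are the ones used in the Evaluation Results below):

python -m needlehaystack.run --provider GPT4O --needle_modality video --needle "needle.mp4" --needle_desc "the young man seated on a cloud in the sky is reading a book" --retrieval_question "What is the young man seated on a cloud in the sky doing?" --video_depth_percents "[50]" --context_lengths "[10]"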

Visualization

Use viz/visualization.ipynb to visualize your results.
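
If you want to plot outside the notebook, the sketch below produces the same kind of score heatmap shown under Evaluation Results. The results directory, file pattern, JSON field names, and the 1-10 score range are assumptions; adjust them to match your actual output files.

import glob
import json
import matplotlib.pyplot as plt
import numpy as np

# Collect (context_length, depth_percent, score) triples from the result files.
# NOTE: the glob pattern and field names are assumptions about the output format.
records = []
for path in glob.glob("results/*.json"):
    with open(path) as f:
        result = json.load(f)
    records.append((result["context_length"], result["depth_percent"], result["score"]))

lengths = sorted({c for c, _, _ in records})
depths = sorted({d for _, d, _ in records})
grid = np.full((len(depths), len(lengths)), np.nan)
for c, d, s in records:
    grid[depths.index(d), lengths.index(c)] = s

# Heatmap: x = video context length (s), y = needle depth (%), color = score.
fig, ax = plt.subplots(figsize=(10, 4))
im = ax.imshow(grid, aspect="auto", cmap="RdYlGn", vmin=1, vmax=10)  # assumes 1-10 scores
ax.set_xticks(range(len(lengths)), [str(c) for c in lengths])
ax.set_yticks(range(len(depths)), [str(d) for d in depths])
ax.set_xlabel("Video context length (s)")
ax.set_ylabel("Needle depth (%)")
fig.colorbar(im, label="Score")
fig.tight_layout()
fig.savefig("needle_heatmap.png")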

Evaluation Results

Given the limitations of current methods in understanding long videos, we designed an experiment in which the "haystack" is a 320-second video. The "needle" is a 1-second video clip generated by Sora from the prompt "the young man seated on a cloud in the sky is reading a book". The question posed to the model is "What is the young man seated on a cloud in the sky doing?". We divided the context length into 40 intervals and the video depth into 12 intervals.
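
With the parameters documented above, this setup corresponds roughly to a command of the following shape (a sketch only: the sweep values are the documented defaults made explicit, and needle.mp4 is a placeholder for the Sora-generated clip in your needle_dir):

python -m needlehaystack.run --provider PLLaVA --needle_modality video --needle "needle.mp4" --needle_desc "the young man seated on a cloud in the sky is reading a book" --retrieval_question "What is the young man seated on a cloud in the sky doing?" --context_lengths_min 1 --context_lengths_max 320 --context_lengths_num_intervals 40 --video_depth_percent_intervals 12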

Needle Case

PLLaVA-7B (Run 6/5/2024)

PLLaVA-7B Context Testing

LLaVA-NeXT-Video-DPO-7B (Run 6/5/2024)

LLaVA-NeXT-Video-DPO-7B Context Testing

MA-LMM (Run 6/5/2024)

MA-LMM Context Testing

Acknowledgement

This code is built on LLMTest_NeedleInAHaystack. Many thanks to them for their work.

Citation

If you find our work helpful, please consider citing it.

@misc{mm-niavh,
    title={MLLM Pressure Test: Needle In A Video Haystack},
    author={Wang, Yuxuan and Xie, Cihang and Liu, Yang and Zheng, Zilong},
    publisher={GitHub},
    url={https://github.com/bigai-nlco/NeedleInAVideoHaystack},
    year={2024}
}
@article{memobridge,
    title={MemoBridge: Perception with Memory Cache in Video-Language Understanding using Recurrent Memory Bridges},
    author={Wang, Yuxuan and Xie, Cihang and Liu, Yang and Zheng, Zilong},
    journal={arXiv preprint},
    year={2024}
}
