MLLM Pressure Test: Needle In A Video Haystack


Table of Contents

  • Introduction
  • Getting Started
  • Evaluation Results
  • Acknowledgement
  • Citation

Introduction

To address the limitations of existing long-form video-language understanding benchmarks, we take inspiration from recent developments in the field and introduce a benchmark designed for identifying specific content within long videos, a challenge we call the Multimodal Needle In A Video Haystack (NIAVH). The benchmark is unique in its flexibility: it supports text, images, and videos as needles, and it accommodates video content of any length, enabling a more comprehensive assessment of a model's video understanding capabilities.

In our benchmark, we use egocentric videos from the Ego4D dataset as the "haystack". Within this haystack, we seek to locate the "needle", which we provide in three distinct modalities. For the text modality, we supply a crafted description. For the image modality, we use DALL-E to create an image that visually represents this description. For the video modality, we use Sora to generate a short video clip based on the same description. In each case, the needle, whether text, image, or video, is set to a duration of 1 second.

Getting Started

Currently supported models: GPT4O, LLaVA-NeXT-Video, PLLaVA, and MA-LMM.

For additional models, please refer to the baselines we have provided for setup.

Installation

  • GPT4O
    • create a .env file and add API_KEY=xxx and, optionally, API_BASE=xxx (see the example .env sketch after this list)
    • install numpy and opencv-python
  • LLaVA-NeXT-Video
    • set up the environment following its instructions
    • download the checkpoint into needlehaystack/baselines/checkpoints/LLaVA-NeXT-Video/LLaVA-NeXT-Video-7B-DPO
  • PLLaVA
    • set up the environment following its instructions
    • download the checkpoint into needlehaystack/baselines/checkpoints/PLLaVA/pllava-7b
  • MA-LMM
    • set up the environment following its instructions
    • download the checkpoints into needlehaystack/baselines/checkpoints/MA-LMM, including vicuna-7b-v1.1, eva_vit_g.pth, and instruct_blip_vicuna7b_trimmed.pth
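
A minimal sketch of the .env file for the GPT4O baseline (placeholder values; API_BASE is only needed if you use a non-default, OpenAI-compatible endpoint):

# .env (placeholder values)
API_KEY=your-api-key
API_BASE=https://your-openai-compatible-endpoint/v1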

Install

Install the additional packages in the environment of the model to be tested:

pip install -r requirements.txt

Run Test

single needle debug

python -m needlehaystack.run --provider GPT4O --video_depth_percents "[50]" --context_lengths "[10]"

pressure test

python -m needlehaystack.run --provider GPT4O

parameters

  • provider - currently supported models: GPT4O, LLaVA-NeXT, PLLaVA, MA-LMM
  • evaluator_model_name - currently supported evaluation API: gpt-35-turbo-0125
  • needle - the needle content: a text string, a video filename, or an image filename
  • needle_modality - currently supports text, image, and video
  • needle_desc - required for image or video needles (the expected answer to the retrieval question)
  • retrieval_question - required for image or video needles (the question posed to the model)
  • needle_dir - required for image or video needles (directory where the needles are saved)
  • haystack_dir - directory where the haystack videos are saved
  • context_lengths - explicit list of video context lengths to test (in seconds), e.g. "[10]"
  • video_depth_percents - explicit list of needle depth percents to test, e.g. "[50]"
  • context_lengths_min - The minimum length of the context. Default is 1 second.
  • context_lengths_max - The maximum length of the context. Default is 320 seconds.
  • context_lengths_num_intervals - The number of intervals for the context length. Default is 40.
  • video_depth_percent_min - The minimum needle depth percent in the video. Default is 0.
  • video_depth_percent_max - The maximum needle depth percent in the video. Default is 100.
  • video_depth_percent_intervals - The number of intervals for the needle depth percent. Default is 12.

note: you can add more videos to the needlehaystack/haystack directory to support longer haystack videos
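
For example, a single-needle debug run with a video needle might combine these parameters as follows (a sketch: needle.mp4 is a placeholder filename, and the description and question are the ones used in the Evaluation Results below):

python -m needlehaystack.run --provider GPT4O --needle_modality video --needle "needle.mp4" --needle_desc "the young man seated on a cloud in the sky is reading a book" --retrieval_question "What is the young man seated on a cloud in the sky doing?" --video_depth_percents "[50]" --context_lengths "[10]"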

Visualization

Use viz/visualization.ipynb to visualize your results.
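
If you want to plot outside the notebook, the sketch below produces the same kind of score heatmap shown under Evaluation Results. The results directory, file pattern, JSON field names, and the 1-10 score range are assumptions; adjust them to match your actual output files.

import glob
import json
import matplotlib.pyplot as plt
import numpy as np

# Collect (context_length, depth_percent, score) triples from the result files.
# NOTE: the glob pattern and field names are assumptions about the output format.
records = []
for path in glob.glob("results/*.json"):
    with open(path) as f:
        result = json.load(f)
    records.append((result["context_length"], result["depth_percent"], result["score"]))

lengths = sorted({c for c, _, _ in records})
depths = sorted({d for _, d, _ in records})
grid = np.full((len(depths), len(lengths)), np.nan)
for c, d, s in records:
    grid[depths.index(d), lengths.index(c)] = s

# Heatmap: x = video context length (s), y = needle depth (%), color = score.
fig, ax = plt.subplots(figsize=(10, 4))
im = ax.imshow(grid, aspect="auto", cmap="RdYlGn", vmin=1, vmax=10)  # assumes 1-10 scores
ax.set_xticks(range(len(lengths)), [str(c) for c in lengths])
ax.set_yticks(range(len(depths)), [str(d) for d in depths])
ax.set_xlabel("Video context length (s)")
ax.set_ylabel("Needle depth (%)")
fig.colorbar(im, label="Score")
fig.tight_layout()
fig.savefig("needle_heatmap.png")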

Evaluation Results

Given the limitations of current methods in understanding long videos, we designed an experiment in which the "haystack" is a 320-second video. The "needle" is a 1-second video clip generated by Sora from the prompt "the young man seated on a cloud in the sky is reading a book". The question posed to the model is "What is the young man seated on a cloud in the sky doing?". We divided the context length into 40 intervals and the video depth into 12 intervals.
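
With the parameters documented above, this setup corresponds roughly to a command of the following shape (a sketch only: the sweep values are the documented defaults made explicit, and needle.mp4 is a placeholder for the Sora-generated clip in your needle_dir):

python -m needlehaystack.run --provider PLLaVA --needle_modality video --needle "needle.mp4" --needle_desc "the young man seated on a cloud in the sky is reading a book" --retrieval_question "What is the young man seated on a cloud in the sky doing?" --context_lengths_min 1 --context_lengths_max 320 --context_lengths_num_intervals 40 --video_depth_percent_intervals 12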

Needle Case

PLLaVA-7B (Run 6/5/2024)

PLLaVA-7B Context Testing

LLaVA-NeXT-Video-DPO-7B (Run 6/5/2024)

LLaVA-NeXT-Video-DPO-7B Context Testing

MA-LMM (Run 6/5/2024)

MA-LMM Context Testing

Acknowledgement

This code is built on LLMTest_NeedleInAHaystack. Many thanks to them for their work.

Citation

If you find our work helpful, please consider citing it.

@misc{mm-niavh,
    title={MLLM Pressure Test: Needle In A Video Haystack},
    author={Wang, Yuxuan and Xie, Cihang and Liu, Yang and Zheng, Zilong},
    publisher={GitHub},
    url={https://github.com/bigai-nlco/NeedleInAVideoHaystack},
    year={2024}
}
@article{memobridge,
    title={MemoBridge: Perception with Memory Cache in Video-Language Understanding using Recurrent Memory Bridges},
    author={Wang, Yuxuan and Xie, Cihang and Liu, Yang and Zheng, Zilong},
    journal={arXiv preprint},
    year={2024}
}
