VideoCon: Robust Video-Language Alignment via Contrast Captions (Accepted to CVPR 2024)

[Paper] [Project Page] [Demo 🤗] [Dataset 🤗] [Model 🤗]

Authors: Hritik Bansal (UCLA), Yonatan Bitton (Google), Idan Szpektor (Google), Kai-Wei Chang (UCLA), Aditya Grover (UCLA)

This repository contains the data and instructions to reproduce the results of the paper "VideoCon: Robust Video-Language Alignment via Contrast Captions".

Getting Started

The following steps are relevant for training and evaluating the model.

  1. Create the conda environment:
     conda create -n videocon python=3.10
     conda activate videocon
  2. Install PyTorch:
     conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia
  3. Install the remaining dependencies:
     pip install -r requirements.txt

VideoCon Data

We present the fully processed dataset for training your models on the entailment and natural language explanation generation tasks.

LLM (PaLM-2) Generated Data

🤗 Entailment Data

source: one of MSR-VTT, VaTeX, TEMPO-HL
videopath: path to the video in the source dataset
caption: video caption
neg_caption: PaLM-2 generated caption
split: one of train, val, and test
misalignment: one of the seven misalignments described in the paper
youtube_key: YouTube ID for MSR-VTT and VaTeX videos (metadata)

🤗 Feedback Data

source: one of MSR-VTT, VaTeX, TEMPO-HL
videopath: path to the video in the source dataset
caption: video caption
neg_caption: PaLM-2 generated caption
nle: PaLM-2 generated natural language explanation
split: one of train, val, and test
misalignment: one of the seven misalignments described in the paper
youtube_key: YouTube ID for MSR-VTT and VaTeX videos (metadata)
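
As a quick sanity check, here is a minimal sketch that loads the two CSVs with pandas and inspects the fields listed above. It assumes the files have been downloaded from the Hugging Face links into the data/ folder under the filenames used later in this README.

```python
# A minimal sketch for inspecting the LLM-generated data, assuming the CSVs were
# downloaded from the Hugging Face links above into the data/ folder.
import pandas as pd

entail = pd.read_csv("data/videocon_llm_entailment.csv")
feedback = pd.read_csv("data/videocon_llm_feedback.csv")

print(entail.columns.tolist())                      # source, videopath, caption, neg_caption, split, misalignment, youtube_key
print(entail["misalignment"].value_counts())        # distribution over the seven misalignment types
print(feedback[["caption", "neg_caption", "nle"]].head())
```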

Note: the original dataset licenses apply to the individual source data.

Downloading Videos

We provide detailed steps to download the source dataset videos in the individual README files in the datasets folder.

Human Generated Data

🤗 VideoCon-Human

We collect the video-caption pairs from the validation set of the ActivityNet dataset.

video_url: s3 link to the video
caption: caption associated with the video 
neg_caption: human-written negative caption
nle: human-written natural language explanation
hard: True or False (see the definition of Human-Hard in our paper)
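
A minimal sketch for browsing VideoCon-Human, assuming the CSV has been downloaded from the Hugging Face link above as videocon_human.csv (the local filename is our assumption):

```python
# A minimal sketch for working with VideoCon-Human; the local filename is an assumption.
import pandas as pd
import requests

df = pd.read_csv("videocon_human.csv")
# Human-Hard subset (see the paper for the definition); handles bool or string values.
hard = df[df["hard"].astype(str).str.lower() == "true"]
print(f"{len(hard)} Human-Hard examples out of {len(df)}")

# Download one video from its s3 URL for a quick sanity check.
url = hard.iloc[0]["video_url"]
with open("sample.mp4", "wb") as f:
    f.write(requests.get(url, timeout=60).content)
```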

Finetuning

We finetune mPLUG-Owl-7B-Video from this repo using Low-Rank Adaptation (LoRA).

Specifically, the model is finetuned jointly on the entailment and natural language explanation generation tasks. First, we need to process the data files into the format expected by the mPLUG-Owl-7B-Video training code.

Data Processing

  1. Change the videopath column in the data CSVs (videocon_llm_entailment.csv, videocon_llm_feedback.csv, videocon_llm_human.csv) so that the paths point to the videos on your local machine (see the sketch after this list).
  2. Run the following command to create the entailment task prompts:
     python src/prepare_data_for_train.py --input_csv data/videocon_llm_entailment.csv --output_csv data/train_llm_entailment.csv --entailment
     This will generate three files: train, val, and test.
  3. Run the following command to create the feedback task prompts:
     python src/prepare_data_for_train.py --input_csv data/videocon_llm_feedback.csv --output_csv data/train_llm_feedback.csv --feedback
     This will also generate train, val, and test files.
  4. Merge these files before finetuning the model:
     python src/merge.py
     This will create data/train_llm_mix_entail_feedback.csv, data/val_llm_mix_entail_feedback.csv, and data/test_llm_mix_entail_feedback.csv.
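
For step 1 above, a minimal path-rewriting sketch. The old and new prefixes are placeholders, and the assumption is that only the directory prefix of videopath needs to change:

```python
# A minimal sketch for step 1: rewrite the videopath prefix in the training CSVs.
# OLD_PREFIX and NEW_PREFIX are placeholders for your own paths; the same pattern
# applies to any of the data CSVs that contain a videopath column.
import pandas as pd

OLD_PREFIX = "/original/prefix/"     # prefix currently stored in the CSV (assumption)
NEW_PREFIX = "/data/videos/"         # where the videos live on your machine

for name in ["videocon_llm_entailment.csv", "videocon_llm_feedback.csv"]:
    df = pd.read_csv(f"data/{name}")
    df["videopath"] = df["videopath"].str.replace(OLD_PREFIX, NEW_PREFIX, regex=False)
    df.to_csv(f"data/{name}", index=False)
```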

LLM Prompts for Contrast Caption Generation

  1. We add the prompts for generating contrast captions from PaLM-2 in misalignment_prompts.py.
  2. The prompts will work well with other LLM APIs too.
  3. Example code for PaLM-2 is provided in this colab notebook. You will need to create a project on the Google Cloud console first.

Setup

  1. Download the mPLUG-Owl-7B-Video pretrained checkpoint to your local machine.
  2. Add the data file paths and the mplug-owl-7b path to video.yaml.
  3. Set the save path, experiment name, nproc_per_node, path to mplug-owl-7b, and CUDA_VISIBLE_DEVICES in the train_it.sh script.
  4. Run the following command to launch training:
     bash train_it.sh
  5. You will find the finetuned checkpoints in your SAVE_PATH.

Pretrained Checkpoint

Our finetuned VideoCon model (Owl-Con) is available 🤗 here.

Evaluation

Download the mPLUG-Owl-7B-Video and Owl-Con checkpoints to your local machine. Their paths are required for evaluation.

Custom Inference

Entailment Inference

  1. Create a csv with two columns: videopath and text. An example csv is here (see also the sketch after these steps).
  2. Run the following command to embed the entailment prompt into the text field:
     python src/prepare_data_for_inference.py --input_csv examples/test.csv --output_csv examples/final_test.csv
  3. Run the following command to get the scores for the videos and texts in final_test.csv using the entailment_inference script:
     CUDA_VISIBLE_DEVICES=0 python entailment_inference.py --input_csv ../../examples/final_test.csv --output_csv ../../examples/final_test_scores.csv --trained_ckpt <path to pytorch.bin of videocon ckpt> --pretrained_ckpt <path to mplugowl-7b-video folder> --use_lora --all-params
     This will save the entailment scores as an additional column in final_test_scores.csv.
  4. (Optional) Remove the --use_lora and --trained_ckpt arguments from the command above to perform the entailment task with the pretrained model instead.
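
For step 1, a minimal sketch that builds the input csv; the paths and texts are placeholders:

```python
# A minimal sketch of step 1: build the inference CSV with videopath and text columns.
import pandas as pd

rows = [
    {"videopath": "/data/videos/video1.mp4", "text": "a person opens a door"},
    {"videopath": "/data/videos/video1.mp4", "text": "a person closes a door"},
]
pd.DataFrame(rows).to_csv("examples/test.csv", index=False)
```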

Entailment Evaluation (ROC-AUC)

  1. It is straightforward to calculate the ROC-AUC score using the Custom Inference code discussed above.
  2. First, convert your data into a csv with videopath and caption using steps 1 and 2 from the section above.
  3. Run the entailment inference command to get an entailment score for every videopath and caption.
  4. Assign each caption a label of 1 if it is grounded in the video and 0 otherwise.
  5. Use roc_auc_score from sklearn (here), with the model's entailment score as the predicted score; a minimal sketch follows this list.
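
A minimal sketch of step 5, assuming the scores file exposes the entailment score in a column named score and that you have added a binary label column (both column names are assumptions):

```python
# A minimal sketch of the ROC-AUC computation; "score" and "label" column names are assumptions.
import pandas as pd
from sklearn.metrics import roc_auc_score

df = pd.read_csv("examples/final_test_scores.csv")
print("ROC-AUC:", roc_auc_score(df["label"], df["score"]))
```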

NLE Inference

  1. Create a csv with two columns: videopath and neg_caption. An example csv is here.
  2. Run the following command to generate NLEs using the nle_inference script:
     CUDA_VISIBLE_DEVICES=0 python nle_inference.py --input_file ../../examples/test_nle.csv --output_file ../../examples/final_test_nle.csv --pretrained_ckpt <path to mplugowl-7b-video folder> --trained_ckpt <path to pytorch.bin of videocon ckpt> --use_lora --all_params
     This will save the generated NLEs in final_test_nle.csv.

NLE Evaluation

In our work, we propose two methods which achieve high agreement with human evaluation.

LLM Prompt

  1. We use the prompt in nle_eval_prompt to get the LLM (PaLM-2) decision.
  2. Replace c1 with the positive caption, c2 with the negative caption, c3 with the ground-truth NLE, and c4 with the Owl-Con generated NLE.
  3. Note: the prompt should work well with any other LLM API.

Q^2

  1. We use this script to get the entailment score.
  2. We set the premise to the ground-truth feedback and the hypothesis to the model-generated NLE (see the sketch below).
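
For intuition only, here is a hedged sketch of entailment scoring with an off-the-shelf NLI model; this is not the Q^2 script referenced above, and the model choice is ours, not the paper's:

```python
# A hedged sketch of entailment scoring with an off-the-shelf NLI model. This is NOT the
# Q^2 script referenced above; it only illustrates the premise/hypothesis setup.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-large-mnli"                      # model choice is ours, not the paper's
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

premise = "The person slices a tomato, not an onion."            # ground-truth feedback
hypothesis = "The video shows a tomato being sliced, not an onion."  # model-generated NLE

inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
entailment_score = probs[2].item()               # roberta-large-mnli: index 2 = ENTAILMENT
print(f"entailment score: {entailment_score:.3f}")
```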

Text to Video Retrieval (SSv2)

  1. We download the videos from here. If there are any issues with the download, feel free to email me.

SSv2-Temporal

  1. We provide the SSv2-Temporal data in eval_ssv2_temporal.csv. Here, each caption has 216 candidate videos, so the number of comparisons is 216 * 18 (query actions).
  2. You can use the above file directly to get the entailment scores from our finetuned model:
     CUDA_VISIBLE_DEVICES=0 python entailment_inference.py --input_csv datasets/eval_ssv2_temporal.csv --output_csv eval_ssv2_temporal_scores.csv --trained_ckpt <path to pytorch.bin of videocon ckpt> --pretrained_ckpt <path to mplugowl-7b-video folder> --use_lora --all-params
     (Optional) Remove the --use_lora and --trained_ckpt arguments from the command above to perform the entailment task with the pretrained model instead.
     This will generate an output file like eval_ssv2_temporal_scores.csv. Ignore the values in the last two columns; the number in the third column is the entailment score.
  3. Use calc_ssv2.py to get the mAP and Recall scores (a metric sketch follows this list):
     python src/calc_ssv2.py --input_file_1 datasets/eval_ssv2_temporal.csv --input_file_2 datasets/eval_ssv2_temporal_scores.csv --vid_per_caption 216
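
For intuition, here is a hedged sketch of how mAP and Recall@k can be computed from per-caption score and relevance arrays; calc_ssv2.py remains the authoritative implementation, and the input format here is assumed:

```python
# A hedged sketch of text-to-video retrieval metrics, not the repo's calc_ssv2.py.
# Assumes, for each query caption, a 1-D numpy array of entailment scores over its
# candidate videos and a matching binary relevance array (1 = video matches the caption).
import numpy as np
from sklearn.metrics import average_precision_score

def retrieval_metrics(scores_per_query, relevance_per_query, k=10):
    aps, recalls = [], []
    for scores, rel in zip(scores_per_query, relevance_per_query):
        aps.append(average_precision_score(rel, scores))   # AP for this query caption
        topk = np.argsort(-scores)[:k]                      # k highest-scoring videos
        recalls.append(rel[topk].sum() / rel.sum())         # Recall@k for this query
    return float(np.mean(aps)), float(np.mean(recalls))
```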

SSv2-Events

  1. We provide the SSv2-Events data in eval_ssv2_events.csv. Here, each caption has 588 candidate videos, so the number of comparisons is 588 * 49 (query actions).
  2. You can use the above file directly to get the entailment scores from our finetuned model:
     CUDA_VISIBLE_DEVICES=0 python entailment_inference.py --input_csv datasets/eval_ssv2_events.csv --output_csv eval_ssv2_events_scores.csv --trained_ckpt <path to pytorch.bin of videocon ckpt> --pretrained_ckpt <path to mplugowl-7b-video folder> --use_lora --all-params
     (Optional) Remove the --use_lora and --trained_ckpt arguments from the command above to perform the entailment task with the pretrained model instead.
     This will generate an output file like eval_ssv2_events_scores.csv. Ignore the values in the last two columns; the number in the third column is the entailment score.
  3. Use calc_ssv2.py to get the mAP and Recall scores:
     python src/calc_ssv2.py --input_file_1 datasets/eval_ssv2_events.csv --input_file_2 datasets/eval_ssv2_events_scores.csv --vid_per_caption 588

Video Question Answering (ATP-Hard)

  1. The videos for this dataset are available here, i.e., the NextQA dataset.
  2. The original NextQA validation set questions and answers are available here. The ATP-Hard subset consists of the indices listed here. We provide the ATP-Hard data here.
  3. We use an LLM API to convert the question-answer pairs into imperative statements. The prompt for this is in atp_hard_prompt.py.
  4. The LLM-generated statements are added to atp_hard_statements.csv.
  5. Use the eval_nextqa script to prepare the data for entailment score generation:
     python src/create_data_for_eval_nextqa.py --input_csv datasets/nextqa-atphard-statements.csv --output_csv eval-nextqa-atphard.csv --map_json map_vid_vidorID.json --video_dir <location of the videos>
     Here, map_vid_vidorID.json comes from the NextQA dataset itself. This will generate a csv like eval-nextqa-atphard.csv.
  6. Use the Entailment Inference code to generate the entailment scores. It produces a file like atphard_scores; ignore the last two columns in this file.
  7. Use the following command to get the final accuracies (a minimal sketch of the accuracy computation follows this list):
     python src/eval_atphard.py --input_csv_1 datasets/eval-nextqa-atphard.csv [ground-truth] --input_csv_2 datasets/atphard_scores.csv [prediction] --input_csv_3 datasets/nextqa-atphard.csv
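
For intuition, a hedged sketch of the accuracy computation; this is not eval_atphard.py, and the merged-file layout and column names are assumptions:

```python
# A hedged sketch of multiple-choice accuracy from entailment scores, not the repo's
# eval_atphard.py. Assumes a long-format table with one row per (question, answer option)
# containing that option's entailment score and a binary is_correct flag; the filename
# and all column names are assumptions for illustration.
import pandas as pd

df = pd.read_csv("atphard_with_scores.csv")        # hypothetical merged ground-truth + scores file
# For each question, predict the answer option with the highest entailment score.
pred_idx = df.groupby("question_id")["score"].idxmax()
accuracy = df.loc[pred_idx, "is_correct"].mean()
print(f"ATP-Hard accuracy: {accuracy:.4f}")
```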
