Open-Vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models

This is the official implementation of OVQA (ICCV 2023) (arXiv).

Dohwan Ko, Ji Soo Lee, Miso Choi, Jaewon Chu, Jihwan Park, Hyunwoo J. Kim.

Department of Computer Science and Engineering, Korea University

Figure: (a) Closed-vocabulary Video Question Answering vs. (b) Open-vocabulary Video Question Answering (Ours).

Setup

To install requirements, run:

conda create -n ovqa python=3.8
conda activate ovqa
sh setup.sh

Data Preparation

Download the preprocessed data and visual features

The pretrained checkpoint, preprocessed data, and data annotations are provided here. You can download the pretrained DeBERTa-v2-xlarge model here.
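
Alternatively, the DeBERTa-v2-xlarge weights can be fetched programmatically. The sketch below is illustrative and not part of the official repo; it assumes the public Hugging Face checkpoint microsoft/deberta-v2-xlarge is the one expected here.

# download_deberta.py -- illustrative sketch; assumes the public Hugging Face
# checkpoint "microsoft/deberta-v2-xlarge" matches the model expected by OVQA.
from transformers import AutoModel, AutoTokenizer

model_id = "microsoft/deberta-v2-xlarge"
save_dir = "./pretrained/deberta-v2-xlarge"

# Download the tokenizer and weights, then save them to the expected path.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
tokenizer.save_pretrained(save_dir)
model.save_pretrained(save_dir)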

Then, place the files as follows:

./pretrained
   ├─ pretrained.pth
   └─ deberta-v2-xlarge
./meta_data
   ├─ activitynet
   │   ├─ train.csv
   │   ├─ test.csv
   │   ├─ train_vocab.json
   │   ├─ test_vocab.json
   │   ├─ clipvitl14.pth
   │   ├─ subtitles.pkl
   │   ├─ ans2cat.json
   │   └─ answer_graph
   │       ├─ train_edge_index.pth
   │       ├─ train_x.pth
   │       ├─ test_edge_index.pth
   │       └─ test_x.pth
   │
   ├─ msvd
   │   ├─ train.csv
   │   ├─ test.csv
   │   ├─ train_vocab.json
   │   ├─ test_vocab.json
   │   ├─ clipvitl14.pth
   │   ├─ subtitles.pkl
   │   ├─ ans2cat.json
   │   └─ answer_graph
   │       ├─ train_edge_index.pth
   │       ├─ train_x.pth
   │       ├─ test_edge_index.pth
   │       └─ test_x.pth
   │   :
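
Before training, it can be worth sanity-checking that every file landed where the layout above expects it. The script below is an illustrative sketch, not part of the official repo; the file list is transcribed from the layout above.

# check_data.py -- illustrative sanity check, not part of the official repo.
# Verifies that the files from the layout above exist for a given dataset.
import os
import sys

DATASET_FILES = [
    "train.csv", "test.csv", "train_vocab.json", "test_vocab.json",
    "clipvitl14.pth", "subtitles.pkl", "ans2cat.json",
    "answer_graph/train_edge_index.pth", "answer_graph/train_x.pth",
    "answer_graph/test_edge_index.pth", "answer_graph/test_x.pth",
]

def check(dataset):
    """Print any missing files for one dataset; return True if all exist."""
    root = os.path.join("./meta_data", dataset)
    missing = [f for f in DATASET_FILES
               if not os.path.exists(os.path.join(root, f))]
    for f in missing:
        print(f"missing: {os.path.join(root, f)}")
    return not missing

if __name__ == "__main__":
    # Usage: python check_data.py activitynet msvd ...
    datasets = sys.argv[1:] or ["activitynet"]
    ok = all([check(d) for d in datasets])
    sys.exit(0 if ok else 1)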

Train FrozenBiLM+ on OVQA

To train on ActivityNet-QA, MSVD-QA, TGIF-QA, or MSRVTT-QA, run the command below. Change --dataset activitynet to select a different dataset; an example follows the command.

python -m torch.distributed.launch --nproc_per_node 4 --use_env train.py --dist-url tcp://127.0.0.1:12345 \
--dataset activitynet --lr 5e-5 --batch_size 8 --batch_size_test 32 --save_dir ./path/to/save/files --epochs 20 --eps 0.7
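
For example, the same command for MSVD-QA only changes the dataset flag (the other hyperparameters here simply mirror the ActivityNet-QA example; the paper's per-dataset settings may differ):

python -m torch.distributed.launch --nproc_per_node 4 --use_env train.py --dist-url tcp://127.0.0.1:12345 \
--dataset msvd --lr 5e-5 --batch_size 8 --batch_size_test 32 --save_dir ./path/to/save/files --epochs 20 --eps 0.7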

Acknowledgements

This repo is built upon FrozenBiLM.

Citation

@inproceedings{ko2023open,
  title={Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models},
  author={Ko, Dohwan and Lee, Ji Soo and Choi, Miso and Chu, Jaewon and Park, Jihwan and Kim, Hyunwoo J},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2023}
}
