# Multi-Scale Progressive Attention Network for Video Question Answering (ACL 2021)

Zhicheng Guo, Jiaxuan Zhao, Licheng Jiao, Xu Liu, Lingling Li
- Install the Python dependency packages:

  ```bash
  pip install -r requirements.txt
  ```
- Download the TGIF-QA, MSVD-QA, and MSRVTT-QA datasets, and edit the absolute paths in `preprocess/question_features.py`, `preprocess/appearance_features.py`, and `preprocess/motion_features.py` according to where your data is located.

  For these three VideoQA datasets, `--dataset` takes one of 3 options: `tgif-qa`, `msvd-qa`, and `msrvtt-qa`.

  Depending on the dataset, `--question_type` takes one of 5 options: `none`, `action`, `count`, `frameqa`, and `transition`; see the sketch below for which pairs go together.
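  Not every pairing is valid: TGIF-QA defines the four sub-tasks, while the open-ended MSVD-QA and MSRVTT-QA use only `none`. A small illustrative helper (not part of the repository) that captures the pairing:

  ```python
  # Valid --dataset / --question_type pairs, inferred from the commands in
  # this README; this helper is illustrative and not part of the repository.
  QUESTION_TYPES = {
      'tgif-qa': ['action', 'count', 'frameqa', 'transition'],
      'msvd-qa': ['none'],
      'msrvtt-qa': ['none'],
  }

  def check_args(dataset, question_type):
      if question_type not in QUESTION_TYPES[dataset]:
          raise ValueError(f'--question_type {question_type} is invalid for {dataset}')
  ```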
- Download the GloVe 300D vectors to `preprocess/pretrained/` and process them into a pickle file:

  ```bash
  python preprocess/txt2pickle.py
  ```
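  Conceptually, this step just parses the plain-text vectors into a fast-to-load binary file. A minimal sketch, assuming the common `glove.840B.300d.txt` file name and a word-to-vector dictionary as output (both are assumptions; `preprocess/txt2pickle.py` defines the actual names and format):

  ```python
  import pickle

  import numpy as np

  # Sketch only: parse GloVe text vectors into a {word: vector} dict and
  # pickle it. File names and output layout are assumptions.
  embeddings = {}
  with open('preprocess/pretrained/glove.840B.300d.txt', encoding='utf-8') as f:
      for line in f:
          parts = line.rstrip().split(' ')
          # The last 300 fields are the vector; anything before is the token
          # (a few tokens in the 840B release contain spaces).
          embeddings[' '.join(parts[:-300])] = np.asarray(parts[-300:], dtype=np.float32)

  with open('preprocess/pretrained/glove.pkl', 'wb') as f:
      pickle.dump(embeddings, f)
  ```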
- Extract question features.

  For the TGIF-QA dataset:

  ```bash
  python preprocess/question_features.py \
      --dataset tgif-qa \
      --question_type action \
      --mode total
  python preprocess/question_features.py \
      --dataset tgif-qa \
      --question_type action \
      --mode train
  python preprocess/question_features.py \
      --dataset tgif-qa \
      --question_type action \
      --mode test
  ```

  For the MSVD-QA/MSRVTT-QA datasets:

  ```bash
  python preprocess/question_features.py \
      --dataset msvd-qa \
      --question_type none \
      --mode total
  python preprocess/question_features.py \
      --dataset msvd-qa \
      --question_type none \
      --mode train
  python preprocess/question_features.py \
      --dataset msvd-qa \
      --question_type none \
      --mode val
  python preprocess/question_features.py \
      --dataset msvd-qa \
      --question_type none \
      --mode test
  ```
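  For intuition, a question feature of this kind is typically the sequence of GloVe vectors for the question's tokens. A rough sketch (tokenization and storage details are assumptions; the script's actual pipeline may differ):

  ```python
  import pickle

  import numpy as np

  # Sketch of GloVe-based question encoding; question_features.py may
  # tokenize, pad, and store features differently.
  with open('preprocess/pretrained/glove.pkl', 'rb') as f:
      glove = pickle.load(f)

  def encode_question(question, dim=300):
      """Return a (num_tokens, dim) array of word vectors; OOV words map to zeros."""
      tokens = question.lower().rstrip('?').split()
      return np.stack([glove.get(t, np.zeros(dim, dtype=np.float32)) for t in tokens])

  features = encode_question('what is the man doing')  # shape (5, 300)
  ```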
- Download the pre-trained 3D-ResNet152 to `preprocess/pretrained/`. You can learn more about this model in the following paper: "Would Mega-scale Datasets Further Enhance Spatiotemporal 3D CNNs?", arXiv preprint, 2020.
- Extract appearance features:

  ```bash
  python preprocess/appearance_features.py \
      --gpu_id 0 \
      --dataset tgif-qa \
      --question_type action \
      --feature_type pool5 \
      --num_frames 16
  ```
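  The usual recipe behind a script like this, sketched here with a torchvision ResNet-152 (the repository's actual backbone and I/O may differ), is to sample `--num_frames` frames uniformly and keep each frame's globally pooled (`pool5`) activation:

  ```python
  import torch
  import torchvision.models as models
  import torchvision.transforms as T

  # Sketch of pool5 appearance features with a 2D ResNet-152; treat the
  # backbone choice and preprocessing as assumptions, not the script's code.
  resnet = models.resnet152(pretrained=True)
  backbone = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()  # drop the fc layer

  preprocess = T.Compose([
      T.ToPILImage(),
      T.Resize((224, 224)),
      T.ToTensor(),
      T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
  ])

  @torch.no_grad()
  def appearance_features(frames):
      """frames: list of num_frames HxWx3 uint8 arrays sampled uniformly from a video."""
      batch = torch.stack([preprocess(f) for f in frames])
      return backbone(batch).flatten(1)  # (num_frames, 2048) pool5 vectors
  ```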
- Extract motion features:

  ```bash
  python preprocess/motion_features.py \
      --gpu_id 0 \
      --dataset tgif-qa \
      --question_type action \
      --num_frames 16
  ```
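  Unlike appearance features, motion features come from a 3D CNN that consumes the whole sampled clip at once, yielding one vector per clip rather than one per frame. A schematic sketch (checkpoint loading and preprocessing are elided; `model` is a placeholder for the 3D-ResNet152 with its classification head removed):

  ```python
  import torch

  # Schematic only; motion_features.py handles the real model construction,
  # frame sampling, and storage.
  @torch.no_grad()
  def motion_features(model, clip):
      """clip: normalized float tensor of shape (3, num_frames, 112, 112)."""
      return model(clip.unsqueeze(0)).squeeze(0)  # one motion vector per clip
  ```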
You can choose the suitable `--dataset` and `--question_type` to start training:
```bash
python train.py \
    --dataset tgif-qa \
    --question_type action \
    --T 2 \
    --K 3 \
    --num_scale 8 \
    --num_frames 16 \
    --gpu_id 0 \
    --max_epochs 30 \
    --batch_size 64 \
    --dropout 0.1 \
    --model_id 0 \
    --use_test \
    --use_train
```
Or, you can run the following command to start training:

```bash
sh train_sh/action.sh
```

You can see the training commands for all datasets and tasks under the `train_sh` folder.
You can download our pre-trained models from here.
To evaluate the trained model, run the following command:

```bash
sh test_sh/action.sh
```

You can see the evaluation commands for all datasets and tasks under the `test_sh` folder.
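For orientation, evaluation on the multiple-choice tasks reduces to accuracy over the test split. A generic sketch (the model and data loader are placeholders; the real logic lives in the repository's test scripts):

```python
import torch

# Generic accuracy loop; `model` and `test_loader` stand in for whatever
# the repository's test scripts actually construct.
@torch.no_grad()
def evaluate(model, test_loader, device='cuda:0'):
    model.eval()
    correct = total = 0
    for questions, videos, answers in test_loader:
        logits = model(questions.to(device), videos.to(device))
        correct += (logits.argmax(dim=1) == answers.to(device)).sum().item()
        total += answers.size(0)
    return correct / total
```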
To cite this work:

```bibtex
@inproceedings{guo2021multi,
  title={Multi-scale progressive attention network for video question answering},
  author={Guo, Zhicheng and Zhao, Jiaxuan and Jiao, Licheng and Liu, Xu and Li, Lingling},
  booktitle={Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)},
  pages={973--978},
  year={2021}
}
```