Is a Video worth $n\times n$ Images? A Highly Efficient Approach to Transformer-based Video Question Answering 💡

Chenyang Lyu, Tianbo Ji, Yvette Graham, Jennifer Foster

School of Computing, Dublin City University, Dublin, Ireland 🏫

This repository contains the code for the Efficient-VideoQA system, which is a highly efficient approach for Transformer-based Video Question Answering. The system utilizes existing vision-language pre-trained models and converts video frames into a $n\times n$ matrix, reducing the computational requirements while maintaining the temporal structure of the original video.

1. Introduction 📚

Conventional Transformer-based Video Question Answering (VideoQA) approaches generally encode frames independently through one or more image encoders followed by interaction between frames and questions. However, such approach incurs significant memory usage and inevitably slows down the training and inference speed. In this work, we present a highly efficient approach for VideoQA based on existing vision-language pre-trained models. We concatenate video frames into a $n\times n$ matrix and then convert it into one image. By doing so, we reduce the use of the image encoder from $n^{2}$ to $1$ while maintaining the temporal structure of the original video.

2. Dataset 📊

Please download the dataset from this link: https://www.mediafire.com/folder/h14iarbs62e7p/shared including videos and corresponding annotations. Move them under the data/ directory.

Please download the TrafficQA dataset from this link: https://sutdcv.github.io/SUTD-TrafficQA/#/download including videos and corresponding annotations. Move them under the data/ directory.

3. Pre-processing 🔧

To pre-process the data, use data_preprocess.py to extract and combine frames from videos in the MSR-VTT and TrafficQA dataset. Then tokenize the annotation data to tensor dataset.

4. Training 🎓

To train the model, use the following scripts:

For TrafficQA dataset: python run_trafficqa_concat_image.py --do_train --do_eval --num_train_epochs 2 --learning_rate 5e-6 --train_batch_size 8 --eval_batch_size 16 --attention_heads 8 --eval_steps 50
For MSR-VTT dataset: python run_msrvtt_concat_image.py --do_train --do_eval --num_train_epochs 3 --learning_rate 5e-6 --train_batch_size 16 --eval_batch_size 16 --attention_heads 8 --eval_steps 5000

5. Usage 🚀

Once the model is trained, you can use it for VideoQA tasks. Provide a video, and the system will give the most probable answer based on the video. 🔎

6. Dependencies ⚙️

Make sure to install the following dependencies before running the code:

Python (>=3.8) 🐍
PyTorch (>=2.0) 🔥
MoviePy 🧮
ffmpeg 🐼

Citation 📄

If you find our paper useful, please cite it using the bibtex below:

@article{lyu2023video,
  title={Is a Video worth $ n$\backslash$times n $ Images? A Highly Efficient Approach to Transformer-based Video Question Answering},
  author={Lyu, Chenyang and Ji, Tianbo and Graham, Yvette and Foster, Jennifer},
  journal={arXiv preprint arXiv:2305.09107},
  year={2023}
}

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
data		data
LICENSE		LICENSE
README.md		README.md
data_preprocess.py		data_preprocess.py
modeling.py		modeling.py
run_msrvtt_concat_image.py		run_msrvtt_concat_image.py
run_trafficqa_concat_image.py		run_trafficqa_concat_image.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data

data

LICENSE

LICENSE

README.md

README.md

data_preprocess.py

data_preprocess.py

modeling.py

modeling.py

run_msrvtt_concat_image.py

run_msrvtt_concat_image.py

run_trafficqa_concat_image.py

run_trafficqa_concat_image.py

Repository files navigation

Is a Video worth $n\times n$ Images? A Highly Efficient Approach to Transformer-based Video Question Answering 💡

Table of Contents

1. Introduction 📚

2. Dataset 📊

3. Pre-processing 🔧

4. Training 🎓

5. Usage 🚀

6. Dependencies ⚙️

Citation 📄

About

Releases

Packages

Languages

License

lyuchenyang/Efficient-VideoQA

Folders and files

Latest commit

History

Repository files navigation

Is a Video worth $n\times n$ Images? A Highly Efficient Approach to Transformer-based Video Question Answering 💡

Table of Contents

1. Introduction 📚

2. Dataset 📊

3. Pre-processing 🔧

4. Training 🎓

5. Usage 🚀

6. Dependencies ⚙️

Citation 📄

About

Topics

Resources

License

Stars

Watchers

Forks

Languages