Official implementation of the MM'22 paper *A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA*.
There are two config files in cfgs, one for the OK-VQA dataset and one for the FVQA dataset. Note that we mainly test our method on OK-VQA.
- python==3.7
- pytorch==1.10.0
First of all, make sure all the data are placed in the right locations according to the config file settings.
- Please download the OK-VQA dataset from the link given in the original paper.
- The image features can be found at the LXMERT repository (if you only need the ViLT model, skip these features and download only the MSCOCO images).
The last preprocessing step below applies to LXMERT and VisualBERT only.
- Process answers (see the answer-parsing sketch below):
python tools/answer_parse_okvqa.py
- Extract the knowledge base with RoBERTa (see the encoding sketch below):
python tools/kb_parse.py
- Convert image features to h5 (optional; see the HDF5 sketch below):
python tools/detection_features_converter.py
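
For orientation, here is a rough, hypothetical sketch of what answer processing for OK-VQA typically involves: normalizing the ten human answers per question, picking the most frequent one as the target, and building an answer vocabulary. The annotation filename is assumed to follow the official OK-VQA release; the output path and the vocabulary size of 2000 are assumptions, not taken from tools/answer_parse_okvqa.py.

```python
# Hypothetical sketch (NOT the repo's actual script): normalize answers,
# keep the most frequent human answer per question, build an answer vocabulary.
import json
from collections import Counter

with open('data/okvqa/mscoco_train2014_annotations.json') as f:  # assumed path
    annotations = json.load(f)['annotations']

answer_counter = Counter()
targets = {}
for ann in annotations:
    answers = [a['answer'].strip().lower() for a in ann['answers']]
    answer_counter.update(answers)
    # most common of the ten human answers becomes the training target
    targets[ann['question_id']] = Counter(answers).most_common(1)[0][0]

# fixed answer vocabulary from the most frequent training answers (size assumed)
vocab = [a for a, _ in answer_counter.most_common(2000)]
with open('data/okvqa/answers_parsed.json', 'w') as f:            # assumed path
    json.dump({'targets': targets, 'vocab': vocab}, f)
```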
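
The knowledge-base step encodes facts with RoBERTa; below is a minimal sketch of such an encoding pass using HuggingFace transformers. The fact file, output path, batch size, and the roberta-base checkpoint are assumptions and may differ from what tools/kb_parse.py actually does.

```python
# Minimal sketch: encode one textual fact per line into RoBERTa embeddings.
import torch
from transformers import RobertaTokenizer, RobertaModel

device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base').to(device).eval()

with open('data/kb/facts.txt') as f:                  # assumed input path
    facts = [line.strip() for line in f if line.strip()]

embeddings = []
with torch.no_grad():
    for i in range(0, len(facts), 64):                # batched encoding
        batch = tokenizer(facts[i:i + 64], padding=True, truncation=True,
                          max_length=64, return_tensors='pt').to(device)
        out = model(**batch).last_hidden_state[:, 0]  # <s> (CLS-like) embedding
        embeddings.append(out.cpu())

torch.save(torch.cat(embeddings), 'data/kb/kb_roberta.pt')  # [num_facts, 768]
```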
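
Likewise, a minimal sketch of packing per-image detection features into a single HDF5 file with h5py, as the optional conversion step produces. The paths, the 36-box count, and the 2048-d feature size are assumptions (typical bottom-up-attention settings), not read from tools/detection_features_converter.py.

```python
# Minimal sketch: collect per-image .npy feature arrays into one HDF5 file.
import glob
import os
import h5py
import numpy as np

files = sorted(glob.glob('data/features/*.npy'))  # one [36, 2048] array per image (assumed)
with h5py.File('data/features/okvqa_features.h5', 'w') as h5:
    feats = h5.create_dataset('features', (len(files), 36, 2048), dtype='float32')
    image_ids = []
    for idx, path in enumerate(files):
        feats[idx] = np.load(path)
        # filenames are assumed to be numeric image ids, e.g. 123456.npy
        image_ids.append(int(os.path.splitext(os.path.basename(path))[0]))
    h5.create_dataset('image_ids', data=np.array(image_ids, dtype='int64'))
```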
Train the model:
python main.py --name unifer --gpu 0

Run evaluation only:
python main.py --name unifer --test-only
If you find this repo helpful, please consider citing the following paper 👍:
@inproceedings{unifer,
  author    = {Yangyang Guo and Liqiang Nie and Yongkang Wong and Yibing Liu and Zhiyong Cheng and Mohan S. Kankanhalli},
  title     = {A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA},
  booktitle = {ACM Multimedia Conference},
  publisher = {ACM},
  year      = {2022}
}