Can multimodal transformers leverage explicit knowledge in their reasoning?
Existing, primarily unimodal, methods have explored approaches under the paradigm of knowledge retrieval followed by answer prediction, but leave open questions about the quality and relevance of the retrieved knowledge used, and how the reasoning processes over implicit and explicit knowledge should be integrated.
To address these challenges, we propose the Knowledge Augmented Transformer (KAT), which achieves a strong state-of-the-art result on the open-domain multimodal task of OK-VQA. Our approach integrates implicit and explicit knowledge in an encoder-decoder architecture and jointly reasons over both knowledge sources during answer generation. In our analysis, integrating explicit knowledge also improves the interpretability of model predictions.
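At a high level, KAT pairs the question with each piece of retrieved knowledge, encodes every pair separately, and lets the decoder attend over all of the encodings jointly while generating the answer. The snippet below is a minimal, illustrative sketch of that fusion pattern using a Hugging Face T5 backbone; it is not the repository's implementation, and the question, knowledge strings, and prompt format are invented for the example.

```python
# Minimal fusion-in-decoder style sketch (illustrative only, not the KAT codebase):
# encode (question, knowledge) pairs independently, then let the decoder attend
# over the concatenation of all encodings while generating the answer.
from transformers import AutoTokenizer, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

question = "What sport is this disc used for?"
knowledge = [
    "explicit: frisbee, a flying disc thrown and caught for recreation",     # e.g. a Wikidata entity description
    "implicit: the disc in the image is most likely used to play frisbee",   # e.g. a GPT-3 generated statement
]

# Encode each (question, knowledge) pair on its own.
inputs = tokenizer(
    [f"question: {question} context: {k}" for k in knowledge],
    return_tensors="pt", padding=True, truncation=True, max_length=64,
)
encoded = model.encoder(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask)

# Fuse: flatten the per-passage encodings into one long sequence so the decoder
# jointly reasons over every knowledge source during answer generation.
n_passages, seq_len, d_model = encoded.last_hidden_state.shape
fused = BaseModelOutput(
    last_hidden_state=encoded.last_hidden_state.reshape(1, n_passages * seq_len, d_model)
)
fused_mask = inputs.attention_mask.reshape(1, n_passages * seq_len)

answer_ids = model.generate(encoder_outputs=fused, attention_mask=fused_mask, max_length=8)
print(tokenizer.decode(answer_ids[0], skip_special_tokens=True))
```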
```bash
pip install -r requirements.txt
pip install -e .
```
We provide the pre-processed data and pre-extracted explicit/implicit knowledge here. We build an entity database based on Wikidata; here is a tutorial on how to write Wikidata queries.
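The repository's preprocessing is not reproduced here, but as a small, self-contained example of querying Wikidata programmatically, the snippet below sends a SPARQL query to the public endpoint and prints English labels and descriptions for two example entity IDs (chosen only for illustration).

```python
# Illustrative only: fetch English labels/descriptions for a couple of Wikidata
# entities via the public SPARQL endpoint. The entity IDs are arbitrary examples,
# and this is not the preprocessing used to build the entity database.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
query = """
SELECT ?item ?itemLabel ?itemDescription WHERE {
  VALUES ?item { wd:Q1420 wd:Q11442 }  # motor car, bicycle
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

resp = requests.get(
    SPARQL_ENDPOINT,
    params={"query": query, "format": "json"},
    headers={"User-Agent": "kat-wikidata-demo/0.1"},
)
for row in resp.json()["results"]["bindings"]:
    label = row["itemLabel"]["value"]
    desc = row.get("itemDescription", {}).get("value", "")
    print(f"{label}: {desc}")
```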
Model | Description | Accuracy | Download |
---|---|---|---|
base_both_knowledge | base size, both implicit and explicit knowledge | 50.58 | base_both_knowledge.zip |
large_explicit_only | large size, explicit only | 44.25 | large_explicit_only.zip |
large_both_knowledge | large size, both implicit and explicit knowledge | 53.09 | large_both_knowledge.zip |
You can specify `--model_size` as `large` or `base`. The `--use_gpt` flag controls whether implicit knowledge is used in addition to explicit knowledge.
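For reference, here is a hypothetical argparse sketch of what these two flags express; the actual option definitions live in the repository's option-parsing code and may differ.

```python
# Hypothetical sketch of the two flags described above; not the repository's
# actual option definitions.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--model_size", choices=["base", "large"], default="base",
                    help="size of the encoder-decoder backbone")
parser.add_argument("--use_gpt", action="store_true",
                    help="if set, use implicit knowledge in addition to explicit knowledge")

args = parser.parse_args(["--model_size", "large", "--use_gpt"])
print(args.model_size, args.use_gpt)  # large True
```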
```bash
python -m torch.distributed.launch --nproc_per_node=16 train_KAT.py \
    --train_data /mnt/root/knowledge_reasoning/okvqa/train2014 \
    --eval_data /mnt/root/knowledge_reasoning/okvqa/val2014 \
    --model_size large \
    --lr 0.00003 \
    --optim adamw \
    --scheduler linear \
    --weight_decay 0.01 \
    --text_maxlength 64 \
    --per_gpu_batch_size 1 \
    --n_context 40 \
    --total_step 8000 \
    --warmup_step 1000 \
    --name check_kat \
    --checkpoint_dir /mnt/root/checkpoint \
    --accumulation_steps 1 \
    --use_gpt
```
```bash
python -m torch.distributed.launch --nproc_per_node=1 evaluate_KAT.py \
    --train_data /mnt/root/knowledge_reasoning/okvqa/train2014 \
    --eval_data /mnt/root/knowledge_reasoning/okvqa/val2014 \
    --model_size base \
    --text_maxlength 64 \
    --per_gpu_batch_size 8 \
    --n_context 40 \
    --model_path /mnt/root/okvqa_best_models/base_w_gpt3_best_5058 \
    --use_gpt
```
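The accuracy numbers above follow the soft VQA-style metric used for OK-VQA, where a predicted answer is credited according to how many annotators gave it. As a rough illustration of that metric (not the repository's evaluation script):

```python
# Illustrative soft VQA accuracy (not the repo's evaluation code): an answer
# scores min(#annotators who gave it / 3, 1); scores are averaged over questions.
from collections import Counter

def vqa_soft_accuracy(prediction, gt_answers):
    counts = Counter(a.strip().lower() for a in gt_answers)
    return min(counts[prediction.strip().lower()] / 3.0, 1.0)

print(vqa_soft_accuracy("frisbee", ["frisbee", "frisbee", "frisbee", "disc", "frisbee"]))  # 1.0
print(vqa_soft_accuracy("disc", ["frisbee", "frisbee", "frisbee", "disc", "frisbee"]))     # ~0.33
```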
If you find this work useful, please cite our paper, KAT: A Knowledge Augmented Transformer for Vision-and-Language:
```bibtex
@inproceedings{gui2021kat,
  title     = {KAT: A Knowledge Augmented Transformer for Vision-and-Language},
  author    = {Gui, Liangke and Wang, Borui and Huang, Qiuyuan and Hauptmann, Alex and Bisk, Yonatan and Gao, Jianfeng},
  booktitle = {NAACL},
  year      = {2022}
}
```