Research code for "KAT: A Knowledge Augmented Transformer for Vision-and-Language"


KAT: A Knowledge Augmented Transformer for Vision-and-Language

Introduction

Can multimodal transformers leverage explicit knowledge in their reasoning?

Existing, primarily unimodal, methods have explored approaches under the paradigm of knowledge retrieval followed by answer prediction, but leave open questions about the quality and relevance of the retrieved knowledge used, and how the reasoning processes over implicit and explicit knowledge should be integrated.

To address these challenges, we propose a Knowledge Augmented Transformer (KAT), which achieves a strong state-of-the-art result on the open-domain multimodal task of OK-VQA. Our approach integrates implicit and explicit knowledge in an encoder-decoder architecture, while still jointly reasoning over both knowledge sources during answer generation. Additionally, explicit knowledge integration improves the interpretability of model predictions in our analysis.

Install

pip install -r requirements.txt
pip install -e .

Pre-processed Data

We provide pre-processed data and pre-extracted explicit and implicit knowledge here. We built an entity database based on Wikidata. Here is one tutorial about how to write Wikidata queries.
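As a rough illustration of what a Wikidata query looks like, the sketch below builds a small SPARQL query and the matching request URL for the public Wikidata query service. It is not the query used to build the KAT entity database; the helper names `build_entity_query` and `build_request_url` are hypothetical, and the example class `Q146` (house cat) is chosen only for illustration.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_entity_query(class_qid: str, limit: int = 10) -> str:
    """Return a SPARQL query listing entities that are instances of `class_qid`.

    Hypothetical helper for illustration; not part of the KAT code base.
    """
    if not (class_qid.startswith("Q") and class_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item id: {class_qid!r}")
    return (
        "SELECT ?item ?itemLabel ?itemDescription WHERE {\n"
        # P31 is the Wikidata property "instance of".
        f"  ?item wdt:P31 wd:{class_qid} .\n"
        # The label service fills in ?itemLabel and ?itemDescription.
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}\n"
        f"LIMIT {limit}"
    )


def build_request_url(query: str) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    return WIKIDATA_SPARQL_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


query = build_entity_query("Q146", limit=5)  # Q146 = house cat
url = build_request_url(query)
print(query)
print(url)
```

The URL can be fetched with any HTTP client. The sketch stops at building the request so that it runs without network access.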

Pre-trained models

| Model | Description | Accuracy (%) | Download |
| --- | --- | --- | --- |
| base_both_knowledge | base size, both implicit and explicit knowledge | 50.58 | base_both_knowledge.zip |
| large_explicit_only | large size, explicit only | 44.25 | large_explicit_only.zip |
| large_both_knowledge | large size, both implicit and explicit knowledge | 53.09 | large_both_knowledge.zip |

Train

Set --model_size to large or base. Pass --use_gpt to train with implicit knowledge; omit it to train without implicit knowledge.

python -m torch.distributed.launch --nproc_per_node=16 train_KAT.py \
  --train_data /mnt/root/knowledge_reasoning/okvqa/train2014 \
  --eval_data /mnt/root/knowledge_reasoning/okvqa/val2014 \
  --model_size large \
  --lr 0.00003 \
  --optim adamw \
  --scheduler linear \
  --weight_decay 0.01 \
  --text_maxlength 64 \
  --per_gpu_batch_size 1 \
  --n_context 40 \
  --total_step 8000 \
  --warmup_step 1000 \
  --name check_kat \
  --checkpoint_dir /mnt/root/checkpoint \
  --accumulation_steps 1 \
  --use_gpt
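When running on a different number of GPUs, it can help to check how the flags above combine into an effective batch size. The sketch below is an assumption about how the flags interact: it treats --accumulation_steps as standard gradient accumulation and --total_step as a count of optimizer updates, neither of which is confirmed by this README. The function name `effective_batch_size` is hypothetical.

```python
def effective_batch_size(nproc_per_node: int,
                         per_gpu_batch_size: int,
                         accumulation_steps: int,
                         nnodes: int = 1) -> int:
    """Examples contributing to one optimizer update (assumed semantics)."""
    return nnodes * nproc_per_node * per_gpu_batch_size * accumulation_steps


# Values copied from the training command above.
flags = {
    "nproc_per_node": 16,
    "per_gpu_batch_size": 1,
    "accumulation_steps": 1,
    "total_step": 8000,
    "warmup_step": 1000,
}

batch = effective_batch_size(flags["nproc_per_node"],
                             flags["per_gpu_batch_size"],
                             flags["accumulation_steps"])
# Assumes total_step counts optimizer updates.
examples_seen = batch * flags["total_step"]
warmup_fraction = flags["warmup_step"] / flags["total_step"]

print(batch)            # 16
print(examples_seen)    # 128000
print(warmup_fraction)  # 0.125

# The same effective batch on 8 GPUs, using two accumulation steps.
batch_8_gpus = effective_batch_size(8, 1, 2)
print(batch_8_gpus)     # 16
```

Under these assumptions, halving the GPU count and doubling --accumulation_steps keeps the effective batch size at 16.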

Test

python -m torch.distributed.launch --nproc_per_node=1 evaluate_KAT.py \
  --train_data /mnt/root/knowledge_reasoning/okvqa/train2014 \
  --eval_data /mnt/root/knowledge_reasoning/okvqa/val2014 \
  --model_size base \
  --text_maxlength 64 \
  --per_gpu_batch_size 8 \
  --n_context 40 \
  --model_path /mnt/root/okvqa_best_models/base_w_gpt3_best_5058 \
  --use_gpt
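The command above evaluates the base model with both knowledge sources. A minimal sketch of how the flags might be adapted for the other released checkpoints follows. The mapping from checkpoint name to --model_size and --use_gpt is inferred from the model descriptions and has not been verified against the code. The helper `build_eval_command` and the `/path/to/...` locations are hypothetical placeholders.

```python
import shlex

# Assumed mapping, inferred from the model descriptions in the table.
CHECKPOINTS = {
    "base_both_knowledge": {"model_size": "base", "use_gpt": True},
    "large_explicit_only": {"model_size": "large", "use_gpt": False},
    "large_both_knowledge": {"model_size": "large", "use_gpt": True},
}


def build_eval_command(name, model_path, train_data, eval_data, n_gpus=1):
    """Return the evaluation command as an argument list (hypothetical helper)."""
    cfg = CHECKPOINTS[name]
    args = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={n_gpus}", "evaluate_KAT.py",
        "--train_data", train_data,
        "--eval_data", eval_data,
        "--model_size", cfg["model_size"],
        # The next three values are copied from the evaluation command above.
        "--text_maxlength", "64",
        "--per_gpu_batch_size", "8",
        "--n_context", "40",
        "--model_path", model_path,
    ]
    if cfg["use_gpt"]:
        # Only pass --use_gpt for checkpoints trained with implicit knowledge.
        args.append("--use_gpt")
    return args


cmd_explicit = build_eval_command(
    "large_explicit_only",
    model_path="/path/to/large_explicit_only",  # placeholder path
    train_data="/path/to/okvqa/train2014",      # placeholder path
    eval_data="/path/to/okvqa/val2014",         # placeholder path
)
cmd_both = build_eval_command(
    "large_both_knowledge",
    model_path="/path/to/large_both_knowledge",
    train_data="/path/to/okvqa/train2014",
    eval_data="/path/to/okvqa/val2014",
)
print(shlex.join(cmd_explicit))
print(shlex.join(cmd_both))
```

The printed strings can be pasted into a shell, or the lists can be passed directly to `subprocess.run`.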

References

KAT: A Knowledge Augmented Transformer for Vision-and-Language

@inproceedings{gui2021kat,
  title={KAT: A Knowledge Augmented Transformer for Vision-and-Language},
  author={Gui, Liangke and Wang, Borui and Huang, Qiuyuan and Hauptmann, Alex and Bisk, Yonatan and Gao, Jianfeng},
  booktitle={NAACL},
  year={2022}
}

Acknowledgements

Our code is built on FiD, which is distributed under the LICENSE.
