
Dataset Release Page for InfoSeek

Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? (EMNLP 2023)

Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter and Ming-Wei Chang.

[Project Page] [Annotation] [Images] [Contributed Code] [Leaderboard (Coming Soon)]


InfoSeek: A New VQA Benchmark Focusing on Visual Information-Seeking Questions

Please use the following BibTeX entry to cite this paper if you use any resources from this repo.

@article{chen2023infoseek,
  title={Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?},
  author={Chen, Yang and Hu, Hexiang and Luan, Yi and Sun, Haitian and Changpinyo, Soravit and Ritter, Alan and Chang, Ming-Wei},
  journal={arXiv preprint arXiv:2302.11713},
  year={2023}
}

Introduction

In this project, we introduce InfoSeek, a visual question answering dataset tailored for information-seeking questions that cannot be answered with common sense knowledge alone. Using InfoSeek, we analyze various pre-trained visual question answering models and gain insights into their characteristics. Our findings reveal that state-of-the-art pre-trained multi-modal models (e.g., PaLI-X, BLIP2, etc.) struggle to answer visual information-seeking questions, but fine-tuning on the InfoSeek dataset enables them to draw on fine-grained knowledge learned during pre-training.

InfoSeek Annotation

The annotations are released as JSON Lines (jsonl) files, one per data split, as discussed in the paper.

Below is an example of the format of a training instance:

{
	"data_id": "infoseek_train_00000000",
	"image_id": "oven_01963180",
	"question": "Which place is this animal endemic to?",
	"answer": ["People's Republic of China"],
	"answer_eval": ["cn", "People's Republic of China", "China", "Mainland China", "China PR", "PR China", "CHN", "CN", "PRC", "\ud83c\udde8\ud83c\uddf3"], 
	"data_split": "train"
}

Here, image_id indicates which image file this annotation is associated with (note that InfoSeek images are derived from OVEN). The answer field gives the most standard term for the answer, while the answer_eval field is reserved for evaluation: it lists other acceptable, equivalent forms of the answer to increase the precision of evaluation.
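As a minimal sketch of how the annotations can be consumed, the snippet below loads a jsonl annotation file and checks a predicted string against the accepted answer forms. The file name is a placeholder, and the exact-match check is a simplification; the official evaluation described in the paper may apply additional normalization.

import json

def load_annotations(path):
    """Read an InfoSeek annotation file in JSON Lines format (one record per line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def string_match(prediction, example):
    """Simplified check: does the prediction match any accepted answer form?"""
    accepted = {a.strip().lower() for a in example["answer_eval"]}
    return prediction.strip().lower() in accepted

# "infoseek_train.jsonl" is a placeholder for the annotation file you downloaded.
examples = load_annotations("infoseek_train.jsonl")
print(examples[0]["question"])             # "Which place is this animal endemic to?"
print(string_match("China", examples[0]))  # True, via the answer_eval list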

The following are links to the annotation files:

We also release the Wikipedia text information for 6M entities (derived from the Wikipedia dump of 2022/10/01).

To use multimodal Wikipedia information, you will need to download the images from the URLs in the wikipedia_image_url field.
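Below is a minimal download sketch. The source file name is a placeholder, and it assumes each line of the released Wikipedia file is a JSON record carrying the wikipedia_image_url field; Wikimedia servers typically reject requests without a descriptive User-Agent header.

import json
import os
import urllib.request

SRC = "wikipedia_6m.jsonl"       # placeholder for the released Wikipedia file
OUT_DIR = "wikipedia_images"
os.makedirs(OUT_DIR, exist_ok=True)

with open(SRC, encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        url = entry.get("wikipedia_image_url")
        if not url:
            continue  # not every entity has an associated image
        # Wikipedia image URLs normally end in the file name, so reuse it locally.
        dest = os.path.join(OUT_DIR, os.path.basename(url))
        if os.path.exists(dest):
            continue  # skip images that were already fetched
        req = urllib.request.Request(url, headers={"User-Agent": "InfoSeekDownloader/1.0"})
        try:
            with urllib.request.urlopen(req, timeout=30) as resp, open(dest, "wb") as out:
                out.write(resp.read())
        except Exception as e:
            print(f"failed: {url} ({e})")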

InfoSeek Images

See this guideline for downloading the images.