SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers
Shraman Pramanick, Rama Chellappa, Subhashini Venugopalan
arXiv, 2024
Paper | SPIQA Dataset
TL;DR: we introduce SPIQA (Scientific Paper Image Question Answering), the first large-scale QA dataset specifically designed to interpret complex figures and tables within the context of scientific research articles across various domains of computer science.
- [Sept, 2024] SPIQA has been accepted for publication at NeurIPS 2024 in the Datasets and Benchmarks track.
- [July, 2024] We update instructions to run evaluation with different baselines on all three tasks, and release the responses by baselines to fully reproduce the reported numbers.
- [July, 2024] SPIQA Paper is now up on arXiv.
- [June, 2024] SPIQA is now live on Hugging Face 🤗.
- Instructions to run metric computation scripts.
- Starter code snippet for L3Score.
- Release responses by baselines to fully reproduce the reported numbers.
- Instructions to run evaluation.
The contents of this repository are structured as follows:
spiqa
├── evals
│   ├── Evaluation of all open- and closed-source models on test-A
│   ├── Evaluation of all open- and closed-source models on test-B
│   └── Evaluation of all open- and closed-source models on test-C
└── metrics
    └── Computation of BLEU, ROUGE, CIDEr, METEOR, BERTScore and L3Score
Each directory contains Python scripts to evaluate various models on the three tasks and to compute metrics.
SPIQA is publicly available on Hugging Face 🤗.
We recommend that users download the metadata and images to their local machine.
- Download the whole dataset (all splits).
from huggingface_hub import snapshot_download
snapshot_download(repo_id="google/spiqa", repo_type="dataset", local_dir='.') ### Mention the local directory path
- Download a specific file.
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="google/spiqa", filename="test-A/SPIQA_testA.json", repo_type="dataset", local_dir='.') ### Mention the local directory path

import json
testA_metadata = json.load(open('test-A/SPIQA_testA.json', 'r'))
paper_id = '1702.03584v3'
print(testA_metadata[paper_id]['qa'])

import json
testB_metadata = json.load(open('test-B/SPIQA_testB.json', 'r'))
paper_id = '1707.07012'
print(testB_metadata[paper_id]['question']) ## Questions
print(testB_metadata[paper_id]['composition']) ## Answers

import json
testC_metadata = json.load(open('test-C/SPIQA_testC.json', 'r'))
paper_id = '1808.08780'
print(testC_metadata[paper_id]['question']) ## Questions
print(testC_metadata[paper_id]['answer']) ## Answers

We use conda-pack to share the environment required by each baseline model, which makes the environments more portable. Start by downloading the environment tarballs.
wget http://www.cis.jhu.edu/~shraman/SPIQA/conda_envs_spiqa.tar.gz
tar -xvzf conda_envs_spiqa.tar.gz && rm conda_envs_spiqa.tar.gz

Activate individual environments as follows. The following snippet shows an example for running the Gemini 1.5 Pro model.
mkdir -p gemini_env
tar -xzf envs/gemini.tar.gz -C gemini_env
source gemini_env/bin/activate

To run the closed-weight models, first provide the API key from the corresponding account. For example, to run Gemini, fill in the api_key in the scripts: genai.configure(api_key=<Your_API_Key>).
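Hardcoding a key in a script makes it easy to commit by accident. As one possible alternative, the sketch below reads the key from an environment variable before it is passed to genai.configure. The variable name GEMINI_API_KEY and the helper read_api_key are hypothetical and are not used by the scripts in this repository.

```python
import os


def read_api_key(env_var: str = "GEMINI_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is missing or empty.

    Both the helper and the default variable name are illustrative assumptions.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the evaluation."
        )
    return key


# Possible usage inside an evaluation script (needs the google-generativeai package):
#   import google.generativeai as genai
#   genai.configure(api_key=read_api_key())
```

The key can then be exported in the shell session before the evaluation commands are run.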
cd evals/test-a/closed_models/
python gemini_qa_test-a_evaluation_image+caption.py --response_root <path_to_save_responses> --image_resolution -1 --model_id gemini-1.5-pro

cd evals/test-a/closed_models/
python gemini_qa_test-a_evaluation_image+caption+full_text.py --response_root <path_to_save_responses> --image_resolution -1 --model_id gemini-1.5-pro

cd evals/test-a/closed_models/
python gemini_cot_qa_test-a_evaluation_image+caption.py --response_root <path_to_save_responses> --image_resolution -1 --model_id gemini-1.5-pro

We list the URLs/Model IDs of all baselines in the MODEL Zoo. The names of the various scripts clearly indicate the respective tasks, baseline settings, and evaluation splits.
NOTE: To run the SPHINX-v2 baseline model, clone the LLaMA2-Accessory GitHub repository, create an environment following its installation guidelines, and download the SPHINX-v2-1k checkpoint.
To reproduce the results reported in our paper, we provide the outputs of all open- and closed-source models here. Please find the instructions for the metric computation below.
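The L3Score starter code that follows asks an LLM whether a candidate answer matches the ground truth and returns a probability for "yes". As rough intuition, a score of this kind can be obtained by renormalizing the log-probabilities the model assigns to the "yes" and "no" continuations. The sketch below illustrates that idea only: the function name and its two-argument form are assumptions, and the actual computation lives in metrics/llmlogscore and may differ.

```python
import math


def normalized_yes_probability(logprob_yes: float, logprob_no: float) -> float:
    """Renormalize two finite log-probabilities into a probability of "yes".

    Illustrative assumption, not the repository's implementation.
    """
    # Subtract the larger log-probability first so exp() cannot overflow.
    shift = max(logprob_yes, logprob_no)
    weight_yes = math.exp(logprob_yes - shift)
    weight_no = math.exp(logprob_no - shift)
    return weight_yes / (weight_yes + weight_no)


# A confident "yes" gives a score close to 1, a confident "no" a score close to 0.
confident_yes = normalized_yes_probability(-0.01, -12.0)
confident_no = normalized_yes_probability(-12.0, -0.01)
```

Under this reading, the score stays in [0, 1] and equals 0.5 when the model is equally likely to say "yes" or "no".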
from metrics.llmlogscore.llmlogscore import OpenAIClient
client = OpenAIClient(
    model_name='gpt-4o',
    api_key=<openai_api_key>,
    json_output_path='./saved_output_l3score/',
)
_PROMPT = 'You are given a question, ground-truth answer, and a candidate answer. Question: <question> \nGround-truth answer: <GT> \nCandidate answer: <answer> \n\
Is the semantic meaning of the ground-truth and candidate answers similar? Answer in one word - Yes or No.'
_SUFFIXES_TO_SCORE = [' yes', ' yeah']
_COMPLEMENT_SUFFIXES = [' no']
question = 'Where is Niagara falls located?'
gt = 'Niagara Falls is located on the border between the United States and Canada, specifically between New York State and Ontario Province.'
candidate_answer = 'Niagara Falls is situated on the Niagara River, which connects Lake Erie to Lake Ontario, \
and lies on the international border between the United States (New York State) and Canada (Ontario Province).'
prompt_current = _PROMPT.replace('<question>', question).replace('<GT>', gt).replace('<answer>', candidate_answer)
response, prob_yes = client.call_openai_with_score(
    prompt=prompt_current,
    suffixes=_SUFFIXES_TO_SCORE,
    complement_suffixes=_COMPLEMENT_SUFFIXES,
    output_prefix=''
)
print('L3Score: ', prob_yes)
#### >>> L3Score: 0.9999999899999982
wrong_answer = 'Niagara Falls is located on the border between the United States and Mexico, specifically between New York State and Ontario Province.'
prompt_current = _PROMPT.replace('<question>', question).replace('<GT>', gt).replace('<answer>', wrong_answer)
response, prob_yes = client.call_openai_with_score(
    prompt=prompt_current,
    suffixes=_SUFFIXES_TO_SCORE,
    complement_suffixes=_COMPLEMENT_SUFFIXES,
    output_prefix=''
)
print('L3Score: ', prob_yes)
#### >>> L3Score: 3.653482080241728e-08

This repository is created and maintained by Shraman and Subhashini. Questions and discussions are welcome via spraman3@jhu.edu and vsubhashini@google.com.
We evaluate six open-source models on SPIQA: LLaVA 1.5, InstructBLIP, XGen-MM, InternLM-XC, SPHINX-v2 and CogVLM. We thank the respective authors for releasing the model weights. We are grateful to our colleagues in the Science Assistant team at Google Research for valuable discussions and support of our project.
The SPIQA evaluation code and the L3Score library in this GitHub repository are licensed under the Apache 2.0 License.
@article{pramanick2024spiqa,
title={SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers},
author={Pramanick, Shraman and Chellappa, Rama and Venugopalan, Subhashini},
journal={NeurIPS},
year={2024}
}
