OLIVE: Open-world Language Instruction for Visual-language Evaluation

This repository provides download instructions for the OLIVE dataset, introduced in the NAACL 2024 paper "What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases."

About OLIVE

The OLIVE dataset is a highly diverse, human-corrected multimodal collection designed to simulate the variety and idiosyncrasies of user queries that vision-language models (VLMs) face in real-world scenarios. It supports the training and evaluation of VLMs under conditions that more closely resemble their ultimate use cases.

The dataset contains 9,450 images, 30,120 unique instructions, and 47,250 responses. The images are randomly sampled from LAION-Aesthetics. The instructions and responses are generated using ChatGPT and subsequently refined through human curation.

Each image is paired with five instruction-response pairs. Each pair has a unique response, although instructions may be reused across different pairs. The pairs fall into four broad categories: visual recognition, creative writing, knowledge-based, and elaborated description.


The dataset is split into 6,750 instruction-response pairs for training, 6,750 pairs for validation, and the remaining 33,750 pairs for testing.

| Split | Visual Recognition | Creative Writing | Knowledge-Based | Elaborated Description | Total |
| --- | --- | --- | --- | --- | --- |
| Train | 1905 | 1415 | 1750 | 1680 | 6750 |
| Validation | 1900 | 1455 | 1755 | 1640 | 6750 |
| Test_v1.0 | 7285 | 5910 | 6805 | 7285 | 27000 |
| Test_v2.0 | 9145 | 7220 | 8595 | 8790 | 33750 |

Note: Test_v1.0 is a subset of Test_v2.0.
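
Once the annotation files (see "Download OLIVE" below) are saved locally, the split statistics above can be tallied directly from the "category" field. A minimal sketch, assuming hypothetical local filenames `train.json`, `val.json`, and `test_v2.json`:

```python
import json
from collections import Counter

# Hypothetical local filenames; adjust to wherever the annotation files were saved.
SPLITS = {"train": "train.json", "validation": "val.json", "test_v2.0": "test_v2.json"}

for split, path in SPLITS.items():
    with open(path, encoding="utf-8") as f:
        pairs = json.load(f)  # a list of instruction-response pairs
    counts = Counter(pair["category"] for pair in pairs)
    print(split, len(pairs), dict(counts))
```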

Download OLIVE

The OLIVE dataset can be downloaded via the following links:

Images: train, validation, test

Annotations: train, validation, test_v1.0, test_v2.0

The annotations are in the following format:

[
  {
    "image": "filename for the image, e.g. -1612194712933037756",
    "category": "category of the instruction-response pair, e.g. visual_recognition",
    "instruction": "task instruction related to the image, e.g. What is the item in the image?",
    "output": "response to the instruction, e.g. The item in the image is a solar sail, which is a device that is designed to harness the energy from sunlight to propel a spacecraft through space without the use of fuel. It is a square shaped piece of cloth that acts like a sail and captures the radiation pressure from the sun to propel the spacecraft forward.",
    "id": "composite index that uniquely identifies each instruction-response pair associated with a specific image, e.g. res_3_1486, where 3 is the id of the instruction-response pair and 1486 is the unique id for the image"
  }
]
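
For orientation, the sketch below loads an annotation file, splits the composite `id` into its two parts, and builds a path to the corresponding image. The local filenames, the `images/` directory, and the `.jpg` extension are assumptions about how the downloads are stored, not part of the release:

```python
import json
import os

ANNOTATION_FILE = "train.json"  # hypothetical local filename for a downloaded annotation split
IMAGE_DIR = "images"            # hypothetical directory holding the extracted images

with open(ANNOTATION_FILE, encoding="utf-8") as f:
    annotations = json.load(f)

for ann in annotations:
    # "id" is a composite index such as "res_3_1486": pair id 3 for image id 1486.
    _, pair_id, image_id = ann["id"].split("_")

    # The file extension is an assumption; check the downloaded image archive.
    image_path = os.path.join(IMAGE_DIR, ann["image"] + ".jpg")

    print(image_id, pair_id, ann["category"], ann["instruction"][:60])
```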

Performance

The table below reports the zero-shot performance of several models on the test_v1.0 split, using CIDEr as the evaluation metric.

| Model | Vision Encoder | Language Model | Size | CIDEr |
| --- | --- | --- | --- | --- |
| BLIP-2 | ViT-G | FlanT5-XL | 4B | 5.2 |
| MiniGPT-4 | ViT-G | Vicuna-7B | 8B | 1.6 |
| mPLUG-Owl | ViT-L | LLaMA-7B | 7B | 4.4 |
| LLaVA | ViT-L | LLaMA-7B | 7B | 29.6 |
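
The numbers above are taken from the paper. To score your own model outputs with CIDEr, one option is the pycocoevalcap package; the sketch below is a rough illustration under that assumption, not necessarily the authors' evaluation script, and `model_outputs.json` (a mapping from annotation `id` to generated response) is a hypothetical file:

```python
import json
from pycocoevalcap.cider.cider import Cider  # pip install pycocoevalcap

# Hypothetical inputs: the test_v1.0 annotation file and a JSON file mapping
# each annotation "id" to the model's generated response.
with open("test_v1.json", encoding="utf-8") as f:
    annotations = json.load(f)
with open("model_outputs.json", encoding="utf-8") as f:
    predictions = json.load(f)

# CIDEr expects, per id, a list of reference strings and a single-item list
# holding the hypothesis. (Texts are often tokenized first, e.g. with
# pycocoevalcap's PTBTokenizer; that step is omitted here for brevity.)
gts = {ann["id"]: [ann["output"]] for ann in annotations}
res = {ann["id"]: [predictions[ann["id"]]] for ann in annotations}

score, _ = Cider().compute_score(gts, res)
print(f"CIDEr: {score:.3f}")  # the table above appears to use a x100 scale (assumption)
```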

Citation

If you find this work useful for your research, please consider citing it.

@inproceedings{tiong2024we,
  title={What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases},
  author={Tiong, Anthony Meng Huat and Zhao, Junqi and Li, Boyang and Li, Junnan and Hoi, Steven CH and Xiong, Caiming},
  booktitle={Proceedings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
  year={2024}
}
