We detail the steps to reproduce our multilingual instruction mix and evaluation data. Note that most scripts have to be updated with your local path to the raw data.
Use of the data must comply with the licenses of the original datasets from which it was generated.
Translations are produced with NLLB, so use must also comply with its license.
- MSCOCO: CC BY 4.0 for annotations, Flickr Terms of Use for images
- BLIP captions (Web CapFilt): BSD 3-Clause
- LLaVA: CC BY-NC 4.0; use should also abide by OpenAI's terms of use: https://openai.com/policies/terms-of-use
- VQAv2: CC BY 4.0
- A-OKVQA: Apache 2.0
- ImageNet: Non-Commercial; Babel-ImageNet: BabelNet Non-Commercial License
You will need to download the respective raw data from the websites of MSCOCO (including images), A-OKVQA, LLaVA, ImageNet, and BLIP (the Web CapFilt captions).
- Run `filter.py` and `download_images.py` in `pretrain` to sample captions from the full data and download the images.
- Run `generate_train.py` to generate an English intermediate file, `translate_train.py` to generate the translations (see the NLLB sketch below), and `generate_train.py` again for the final data file.
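For reference, a minimal sketch of what the NLLB translation step can look like, assuming the Hugging Face checkpoint `facebook/nllb-200-distilled-600M` and illustrative language codes; `translate_train.py` defines the actual model size, batching, and language list:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def translate(texts, tgt_lang="deu_Latn"):
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        generated = model.generate(
            **inputs,
            # NLLB language codes are special tokens; forcing the target code
            # as the first generated token selects the output language
            forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
            max_length=128,
        )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

print(translate(["A dog is running on the beach."]))
```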
As exactly reproducing our sampling is impossible due to randomness, we include our result after step 1 here. This file also includes image URLs, which you can use to download the images (see the download sketch below). As of June 2023, all links were still available.
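A hypothetical download loop over that file could look as follows; the JSON-lines layout and the field names `url` and `image_id` are assumptions, so check the released file for its actual schema:

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import requests

OUT_DIR = Path("images")
OUT_DIR.mkdir(exist_ok=True)

def fetch(example):
    out = OUT_DIR / example["image_id"]  # assumed field name
    if out.exists():
        return
    response = requests.get(example["url"], timeout=30)  # assumed field name
    response.raise_for_status()
    out.write_bytes(response.content)

# assumed JSON-lines layout, one example per line
with open("pretrain_step1.jsonl") as f:
    examples = [json.loads(line) for line in f]

with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(fetch, examples))
```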
To generate the image-caption matching data, first run `hard_match.py` to generate the English examples (this takes a while), then `translate_match_train.py` to generate the translations, and finally `generate_match_train.py` to produce the final file.
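`hard_match.py` defines the exact procedure; purely as an illustration, a common way to build "hard" matching negatives is to pair each caption with its most similar caption from a different image, e.g. via TF-IDF similarity:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

captions = [
    "a dog runs on the beach",
    "a cat sleeps on a sofa",
    "a dog plays with a ball",
    "a plane takes off at dusk",
]

# Lexical similarity between all caption pairs
tfidf = TfidfVectorizer().fit_transform(captions)
sim = cosine_similarity(tfidf)
np.fill_diagonal(sim, -1.0)  # a caption cannot be its own negative

for i, caption in enumerate(captions):
    hard_negative = captions[int(sim[i].argmax())]
    print(f"positive: {caption!r} | hard negative: {hard_negative!r}")
```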
Run `generate_train.py` once to generate an intermediate file, `translate_train.py` to generate the translations, and `generate_train.py` again for the final data files.
The translation step is not needed for A-OKVQA, and the second `generate_train.py` run is not needed for LLaVA.
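As a sketch of the kind of transformation `generate_train.py` performs, here is a hypothetical conversion of a raw VQA annotation into an instruction-style example; the template and field names are illustrative, not the repository's exact format:

```python
import json

def to_instruction(annotation):
    """Map one raw VQA annotation to an instruction-style example."""
    return {
        "image": annotation["image_id"],
        "instruction": f"Question: {annotation['question']} Short answer:",
        "output": annotation["answer"],
    }

annotation = {
    "image_id": "COCO_val2014_000000262148.jpg",
    "question": "What color is the bus?",
    "answer": "red",
}
print(json.dumps(to_instruction(annotation), indent=2))
```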
For ImageNet examples, you need the label file from https://github.com/gregor-ge/Babel-ImageNet.
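A minimal sketch of consuming such a label file, assuming it is a JSON object mapping language codes to per-class label lists; the Babel-ImageNet repository documents the actual schema:

```python
import json

# Assumed schema: a JSON object mapping language codes to per-class label
# lists; verify against the Babel-ImageNet repository before relying on it.
with open("babel_imagenet.json") as f:
    labels_per_language = json.load(f)

german_labels = labels_per_language["de"]
print(german_labels[0])  # translated name(s) of ImageNet class index 0
```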
Run `pretrain/merge_train.py` to combine the different files into one task mix file.
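Conceptually, the merge step boils down to something like the following sketch (file names and shuffling are illustrative; `pretrain/merge_train.py` is the authoritative implementation):

```python
import json
import random

# Assumed file names; use the files produced by the steps above.
task_files = ["captions.json", "matching.json", "vqa.json", "llava.json"]

mix = []
for path in task_files:
    with open(path) as f:
        mix.extend(json.load(f))

random.seed(42)
random.shuffle(mix)

with open("task_mix.json", "w") as f:
    json.dump(mix, f)
```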
Note for captioning: both XM3600 and xFlickrCo also generate files used by the pycocoeval library for evaluation; those files contain `coco` in their name.
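These `coco` files can be scored with the standard pycocoevalcap workflow; the file names below are placeholders for the ground-truth and prediction files the scripts produce:

```python
from pycocotools.coco import COCO
from pycocoevalcap.eval import COCOEvalCap

# Placeholder names for the generated ground-truth and prediction files
coco = COCO("xm3600_coco_gt.json")
results = coco.loadRes("xm3600_coco_predictions.json")

evaluator = COCOEvalCap(coco, results)
evaluator.params["image_id"] = results.getImgIds()
evaluator.evaluate()

for metric, score in evaluator.eval.items():  # BLEU-4, CIDEr, ...
    print(f"{metric}: {score:.3f}")
```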
Download the data from the IGLUE repository along with the images and run the scripts in the folders.
Download the raw data and images from the CrossModal3600 and MaXM repositories and run the respective scripts in the folders.
Clone the POPE repository (https://github.com/AoiDragon/POPE) and then run the respective scripts.
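POPE is evaluated as binary yes/no classification; below is a minimal scoring sketch with assumed field names (`answer`, `label`) that you should check against the POPE scripts' actual output:

```python
import json

def pope_scores(path):
    """Compute accuracy/precision/recall/F1 over yes/no predictions."""
    tp = fp = tn = fn = 0
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            pred = example["answer"].strip().lower().startswith("yes")
            gold = example["label"].strip().lower() == "yes"
            if pred and gold:
                tp += 1
            elif pred and not gold:
                fp += 1
            elif gold:
                fn += 1
            else:
                tn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

print(pope_scores("pope_predictions.jsonl"))
```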