Data

Pretraining Dataset

The pretraining dataset used in this release is a subset of the CC-3M dataset, filtered to give a more balanced distribution of concept coverage. Please see here for a detailed description of the dataset structure and how to download the images.

If you already have the CC-3M dataset on disk, its image names follow this format: GCC_train_000000000.jpg. If your local file names or paths differ from those in the chat file, you may edit the image field of each entry accordingly.
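The renaming step can be sketched in Python as below. This is a minimal illustration, not the project's own tooling. It assumes the chat file is a JSON list of objects that each carry an "image" key, and that the numeric part of the file name is nine zero-padded digits, as in the example name above. The helper names (gcc_train_name, remap_image_fields, missing_images, rewrite_chat_file) are hypothetical.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Callable, Iterable


def gcc_train_name(index: int) -> str:
    """Format an index in the GCC_train_000000000.jpg style (9 zero-padded digits)."""
    return f"GCC_train_{index:09d}.jpg"


def remap_image_fields(records: Iterable[dict], rename: Callable[[str], str]) -> list[dict]:
    """Return copies of the records with each "image" value passed through rename.

    Records without an "image" key are copied unchanged (assumed layout).
    """
    updated = []
    for record in records:
        copy = dict(record)
        if "image" in copy:
            copy["image"] = rename(copy["image"])
        updated.append(copy)
    return updated


def missing_images(records: Iterable[dict], image_dir: str | Path) -> list[str]:
    """List "image" values that do not exist as files under image_dir."""
    root = Path(image_dir)
    return [
        r["image"]
        for r in records
        if "image" in r and not (root / r["image"]).is_file()
    ]


def rewrite_chat_file(src: str | Path, dst: str | Path, rename: Callable[[str], str]) -> int:
    """Read a chat file, remap its image fields, write the result, return the record count."""
    records = json.loads(Path(src).read_text(encoding="utf-8"))
    updated = remap_image_fields(records, rename)
    Path(dst).write_text(json.dumps(updated, ensure_ascii=False, indent=2), encoding="utf-8")
    return len(updated)
```

The rename callable is left to the caller because the mapping between the names in the chat file and your local CC-3M copy depends on how that copy was downloaded. Running missing_images against your image directory afterwards is one way to spot entries that still do not resolve.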

Data	Chat File	Meta Data	Size
CC-3M Concept-balanced 595K	chat.json	metadata.json	211 MB
LAION/CC/SBU BLIP-Caption Concept-balanced 558K	blip_laion_cc_sbu_558k.json	metadata.json	181 MB

Important notice: At the community's request, and because ~15% of the images in the original CC-3M dataset are no longer accessible, we have uploaded images.zip so that the research community can better reproduce our work. It must not be used for any other purpose. Use of these images must comply with the CC-3M license. The archive may be taken down at any time at the request of the original CC-3M dataset owner or the owners of the referenced images.
