The pretraining dataset used in this release is a subset of CC-3M dataset, filtered with a more balanced concept coverage distribution. Please see here for a detailed description of the dataset structure and how to download the images.
If you already have CC-3M dataset on your disk, the image names follow this format: GCC_train_000000000.jpg
. You may edit the image
field correspondingly if necessary.
Data | Chat File | Meta Data | Size |
---|---|---|---|
CC-3M Concept-balanced 595K | chat.json | metadata.json | 211 MB |
LAION/CC/SBU BLIP-Caption Concept-balanced 558K | blip_laion_cc_sbu_558k.json | metadata.json | 181 MB |
Important notice: Upon the request from the community, as ~15% images of the original CC-3M dataset are no longer accessible, we upload images.zip
for better reproducing our work in research community. It must not be used for any other purposes. The use of these images must comply with the CC-3M license. This may be taken down at any time when requested by the original CC-3M dataset owner or owners of the referenced images.