Keshigeyan Chandrasegaran*1,
Kyle Sargent*1,
Suchir Agarwal1,
Michael Jang1,
Michael Poli1,2,
Juan Carlos Niebles1,4,
Justin Johnson3,
Jiajun Wu1,
Li Fei-Fei1
1 Stanford University
2 Radical Numerics
3 University of Michigan
4 Salesforce Research
* Equal contribution
📄 arXiv |
🌎 Website |
🤗 Dataset |
🤗 Models |
🥇 Evaluation toolkit
- [2026-05-28]: GPIC benchmark dataset released
Studying scalable methods for visual generative modeling requires large, accessible, and stable datasets. We introduce GPIC, a Giant Permissive Image Corpus of approximately 28 trillion pixels. GPIC comprises diverse internet images captioned by a state-of-the-art vision-language model, including 100M training, 200K validation, and 1M test examples. Moreover, all GPIC images are permissively licensed for both research and commercial use. GPIC is safety-filtered, deduplicated, and centrally hosted on Hugging Face. We provide a benchmarking protocol for generative modeling on GPIC. Finally, we provide a reference baseline for pixel-space flow matching on GPIC. Our dataset, benchmark, and models are available on Hugging Face. Evaluation toolkit and code are available at gpic.stanford.edu. We hope GPIC supports open, accessible, and reproducible research on large-scale visual generative modeling.
The train split is packaged as 8000 tar archives, the validation split as 32, and the test split as 128.
giant-permissive-image-corpus/
├── train/ (8000 files, gpic_train_{00000–07999}.tar)
├── val/ (32 files, gpic_val_{00000–00031}.tar)
├── test/ (128 files, gpic_test_{00000–00127}.tar)
├── .gitattributes
└── README.md
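The layout above can be turned into a list of shard paths with a few lines of Python. This is a minimal sketch: the helper name `shard_paths` is hypothetical, and it assumes every split uses the same five-digit, zero-padded index seen in the train and test file names.

```python
# Shard counts are taken from the directory listing above. The five-digit
# zero padding for every split is an assumption based on the file names.
SHARD_COUNTS = {"train": 8000, "val": 32, "test": 128}


def shard_paths(split: str) -> list[str]:
    """Return repository-relative tar paths for one split, in index order."""
    count = SHARD_COUNTS[split]
    return [f"{split}/gpic_{split}_{i:05d}.tar" for i in range(count)]


train_shards = shard_paths("train")
print(train_shards[0])   # train/gpic_train_00000.tar
print(train_shards[-1])  # train/gpic_train_07999.tar
```

A list like this is handy for downloading or streaming only a subset of shards rather than the whole corpus.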
Each tar archive contains alternating image–metadata pairs:
- {key}.json: metadata and caption
- {key}.jpg / {key}.png: corresponding image
Files are stored sequentially such that each JSON entry is followed by its corresponding image. For example:
{key_1}.json
{key_1}.jpg
{key_2}.json
{key_2}.png
{key_3}.json
{key_3}.jpg
Each json includes metadata in the following format:
{
"retrieved_at": str,
"license": str,
"license_url": str,
"attribution": str,
"key": str, # unique identifier for the image
"img_width": int,
"img_height": int,
"split": [str], # dataset split(s); each entry is one of {"nano", "lite", "full"}
"caption_type": str, # one of {"tag", "short", "medium", "long"}
"caption": str
},
// Next image record
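A quick sanity check of a metadata record against this format might look like the sketch below. The helper `check_record` is hypothetical, every value in the example record is a placeholder rather than real GPIC data, and the check assumes `split` is a list whose entries come from the three names listed above.

```python
import json

# Field names and types follow the metadata format listed above.
EXPECTED_TYPES = {
    "retrieved_at": str, "license": str, "license_url": str,
    "attribution": str, "key": str, "img_width": int, "img_height": int,
    "split": list, "caption_type": str, "caption": str,
}
SPLITS = {"nano", "lite", "full"}
CAPTION_TYPES = {"tag", "short", "medium", "long"}


def check_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    if not problems:
        if not set(record["split"]) <= SPLITS:
            problems.append("split: unknown value")
        if record["caption_type"] not in CAPTION_TYPES:
            problems.append("caption_type: unknown value")
    return problems


# Placeholder record for illustration only; not taken from the dataset.
example = json.loads("""
{
  "retrieved_at": "placeholder-timestamp",
  "license": "placeholder-license",
  "license_url": "https://example.com/license",
  "attribution": "placeholder attribution",
  "key": "placeholder-key",
  "img_width": 640,
  "img_height": 480,
  "split": ["full"],
  "caption_type": "short",
  "caption": "A placeholder caption."
}
""")
print(check_record(example))  # []
```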
The baseline code is located in baselines/PixelGen/. Our implementation builds on PixelGen by Ma et al.
Installation
conda create -n pixgen python=3.10
conda activate pixgen
cd baselines/PixelGen
pip install -r requirements.txt
Training
cd baselines/PixelGen
sbatch sbatch_pretrain_gpic_full_jit.sh
Sampling
cd baselines/PixelGen
sbatch sbatch_sample_wl.sh
Update --ckpt_path in the script to point to your checkpoint.
Evaluation
# First install the GPIC evaluation toolkit
cd gpic_eval
pip install -e .
cd ../baselines/PixelGen
bash benchmark_jit_gpic_full.sh
Update PRED_DIRS and REF_NPZ in the script to point to your predictions and reference file.
See baselines/PixelGen/README.md for full details on model architecture and configuration.
The evaluation toolkit is located in gpic_eval/.
- Keshigeyan Chandrasegaran: keshik@cs.stanford.edu
- Kyle Sargent: ksarge@cs.stanford.edu
@misc{chandrasegaran2026gpic,
title={GPIC: A Giant Permissive Image Corpus for Visual Generation},
author={Keshigeyan Chandrasegaran and Kyle Sargent and Suchir Agarwal and Michael Jang and Michael Poli and Juan Carlos Niebles and Justin Johnson and Jiajun Wu and Li Fei-Fei},
year={2026},
eprint={2605.30341},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2605.30341},
}
