GPIC: A Giant Permissive Image Corpus for Visual Generation

Keshigeyan Chandrasegaran*¹, Kyle Sargent*¹, Suchir Agarwal¹, Michael Jang¹,
Michael Poli¹,², Juan Carlos Niebles¹,⁴, Justin Johnson³, Jiajun Wu¹, Li Fei-Fei¹
¹ Stanford University ² Radical Numerics ³ University of Michigan ⁴ Salesforce Research
* Equal contribution
📄 arXiv | 🌎 Website | 🤗 Dataset | 🤗 Models | 🥇 Evaluation toolkit

📣 News

[2026-05-28]: GPIC benchmark dataset released

Abstract

Studying scalable methods for visual generative modeling requires large, accessible, and stable datasets. We introduce GPIC, a Giant Permissive Image Corpus of approximately 28 trillion pixels. GPIC comprises diverse internet images captioned by a state-of-the-art vision-language model, including 100M training, 200K validation, and 1M test examples. Moreover, all GPIC images are permissively licensed for both research and commercial use. GPIC is safety-filtered, deduplicated, and centrally hosted on Hugging Face. We provide a benchmarking protocol for generative modeling on GPIC. Finally, we provide a reference baseline for pixel-space flow matching on GPIC. Our dataset, benchmark, and models are available on Hugging Face. Evaluation toolkit and code are available at gpic.stanford.edu. We hope GPIC supports open, accessible, and reproducible research on large-scale visual generative modeling.

GPIC Statistics

Dataset Organization

GPIC is sharded into 8000 tar files for train, 32 for validation, and 128 for test.

giant-permissive-image-corpus/
├── train/      (8000 files, gpic_train_{00000–07999}.tar)
├── val/        (32 files, gpic_val_{00000–00031}.tar)
├── test/       (128 files, gpic_test_{00000–00127}.tar)
├── .gitattributes
└── README.md
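
The shard layout above can be enumerated programmatically. The following is a minimal sketch, not part of the GPIC codebase: the helper names `shard_name` and `approx_examples_per_shard` are hypothetical, the five-digit zero padding is inferred from the listed file name patterns, and the per-shard averages are rough estimates based on the rounded example counts quoted in the abstract.

```python
# Shard counts come from the README; the five-digit zero padding is an
# assumption inferred from the listed file name patterns.
SHARD_COUNTS = {"train": 8000, "val": 32, "test": 128}

# Rounded example counts quoted in the abstract (approximate, not exact).
APPROX_EXAMPLES = {"train": 100_000_000, "val": 200_000, "test": 1_000_000}


def shard_name(split: str, index: int) -> str:
    """Return the relative path of one shard, e.g. train/gpic_train_00000.tar."""
    count = SHARD_COUNTS[split]
    if not 0 <= index < count:
        raise IndexError(f"{split} has shards 0..{count - 1}, got {index}")
    return f"{split}/gpic_{split}_{index:05d}.tar"


def approx_examples_per_shard(split: str) -> float:
    """Rough average number of examples per shard for a split."""
    return APPROX_EXAMPLES[split] / SHARD_COUNTS[split]


if __name__ == "__main__":
    for split, count in SHARD_COUNTS.items():
        first, last = shard_name(split, 0), shard_name(split, count - 1)
        average = approx_examples_per_shard(split)
        print(f"{first} ... {last} (~{average:,.0f} examples per shard)")
```

Because the example counts are rounded, the averages are only useful for sizing downloads or planning how many shards to read per worker.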

Tar File Format

Each tar archive contains alternating metadata–image pairs:

{key}.json — metadata and caption
{key}.jpg / {key}.png — corresponding image

Files are stored sequentially such that each JSON entry is followed by its corresponding image. For example:

{key_1}.json
{key_1}.jpg
{key_2}.json
{key_2}.png
{key_3}.json
{key_3}.jpg

JSON Format

Each JSON file contains metadata in the following format:

{
    "retrieved_at": str,
    "license": str,
    "license_url": str,
    "attribution": str,
    "key": str,              # unique identifier for the image
    "img_width": int,
    "img_height": int,
    "split": [str],          # list of dataset splits, each one of {"nano", "lite", "full"}
    "caption_type": str,     # one of {"tag", "short", "medium", "long"}
    "caption": str
},
# Next image record
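
A record can be checked against the schema above before it enters a training pipeline. The following is a minimal sketch under stated assumptions: `check_record` is a hypothetical helper, `split` is treated as a list of strings (as the `[str]` type suggests), and every value in `EXAMPLE` is a made-up placeholder, not real GPIC data.

```python
# Field names and types as listed in the README's JSON format.
SCHEMA = {
    "retrieved_at": str, "license": str, "license_url": str, "attribution": str,
    "key": str, "img_width": int, "img_height": int, "split": list,
    "caption_type": str, "caption": str,
}
SPLITS = {"nano", "lite", "full"}
CAPTION_TYPES = {"tag", "short", "medium", "long"}

# Placeholder record for illustration only; all values are invented.
EXAMPLE = {
    "retrieved_at": "2025-01-01T00:00:00Z",
    "license": "CC0-1.0",
    "license_url": "https://creativecommons.org/publicdomain/zero/1.0/",
    "attribution": "Example Author",
    "key": "example_key",
    "img_width": 640,
    "img_height": 480,
    "split": ["lite", "full"],
    "caption_type": "short",
    "caption": "A placeholder caption.",
}


def check_record(record: dict) -> list:
    """Return a list of problems found; an empty list means the record looks valid."""
    problems = []
    for field, expected in SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    split = record.get("split")
    if isinstance(split, list):
        unknown = [s for s in split if not isinstance(s, str) or s not in SPLITS]
        if unknown:
            problems.append(f"split: unknown values {unknown}")
    caption_type = record.get("caption_type")
    if isinstance(caption_type, str) and caption_type not in CAPTION_TYPES:
        problems.append(f"caption_type: unknown value {caption_type!r}")
    return problems


if __name__ == "__main__":
    print(check_record(EXAMPLE))
```

Returning a list of problems rather than raising on the first one makes it easier to log every issue in a malformed record at once.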

Baselines

The baseline code is located in baselines/PixelGen/. Our implementation builds on PixelGen by Ma et al.

Installation

conda create -n pixgen python=3.10
conda activate pixgen
cd baselines/PixelGen
pip install -r requirements.txt

Training

cd baselines/PixelGen
sbatch sbatch_pretrain_gpic_full_jit.sh

Sampling

cd baselines/PixelGen
sbatch sbatch_sample_wl.sh

Before submitting the job, update --ckpt_path in sbatch_sample_wl.sh to point to your checkpoint.

Evaluation

# First install the GPIC evaluation toolkit
cd gpic_eval
pip install -e .
cd ../baselines/PixelGen
bash benchmark_jit_gpic_full.sh

Before running the benchmark, update PRED_DIRS and REF_NPZ in benchmark_jit_gpic_full.sh to point to your predictions and reference file.

See baselines/PixelGen/README.md for full details on model architecture and configuration.

GPIC Evaluation Toolkit

The evaluation toolkit is located in gpic_eval/.

Contact

Keshigeyan Chandrasegaran: keshik@cs.stanford.edu
Kyle Sargent: ksarge@cs.stanford.edu

Citation

@misc{chandrasegaran2026gpic,
      title={GPIC: A Giant Permissive Image Corpus for Visual Generation}, 
      author={Keshigeyan Chandrasegaran and Kyle Sargent and Suchir Agarwal and Michael Jang and Michael Poli and Juan Carlos Niebles and Justin Johnson and Jiajun Wu and Li Fei-Fei},
      year={2026},
      eprint={2605.30341},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.30341}, 
}
