keshik6/gpic

GPIC: A Giant Permissive Image Corpus for Visual Generation

Keshigeyan Chandrasegaran*¹, Kyle Sargent*¹, Suchir Agarwal¹, Michael Jang¹,
Michael Poli¹,², Juan Carlos Niebles¹,⁴, Justin Johnson³, Jiajun Wu¹, Li Fei-Fei¹
¹ Stanford University   ² Radical Numerics   ³ University of Michigan   ⁴ Salesforce Research
* Equal contribution
📄 arXiv | 🌎 Website | 🤗 Dataset | 🤗 Models | 🥇 Evaluation toolkit

[Figure: GPIC Dataset Overview]

📣 News

  • [2026-05-28]: GPIC benchmark dataset released

Abstract

Studying scalable methods for visual generative modeling requires large, accessible, and stable datasets. We introduce GPIC, a Giant Permissive Image Corpus of approximately 28 trillion pixels. GPIC comprises diverse internet images captioned by a state-of-the-art vision-language model, including 100M training, 200K validation, and 1M test examples. Moreover, all GPIC images are permissively licensed for both research and commercial use. GPIC is safety-filtered, deduplicated, and centrally hosted on Hugging Face. We provide a benchmarking protocol for generative modeling on GPIC. Finally, we provide a reference baseline for pixel-space flow matching on GPIC. Our dataset, benchmark, and models are available on Hugging Face. Evaluation toolkit and code are available at gpic.stanford.edu. We hope GPIC supports open, accessible, and reproducible research on large-scale visual generative modeling.

GPIC Statistics

[Figure: GPIC Stats Overview]

Dataset Organization

GPIC is sharded into tar archives: 8000 for train, 32 for validation, and 128 for test.

giant-permissive-image-corpus/
├── train/      (8000 files, gpic_train_{00000–07999}.tar)
├── val/        (32 files, gpic_val_{00000–00031}.tar)
├── test/       (128 files, gpic_test_{00000–00127}.tar)
├── .gitattributes
└── README.md
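As a minimal sketch of the layout above, the shard paths for a split can be enumerated programmatically. The helper name `shard_names` is hypothetical, and the five-digit zero padding for every split is an assumption inferred from the train and test patterns.

```python
# Shard counts as stated in this README. Five-digit zero padding for all
# splits is an assumption based on the train and test filename patterns.
SHARD_COUNTS = {"train": 8000, "val": 32, "test": 128}


def shard_names(split: str) -> list[str]:
    """Return the relative paths of all tar shards for one split."""
    count = SHARD_COUNTS[split]
    return [f"{split}/gpic_{split}_{i:05d}.tar" for i in range(count)]
```

For example, `shard_names("test")[-1]` evaluates to `test/gpic_test_00127.tar`. Taking the nominal split sizes from the abstract, this sharding works out to roughly 12,500 examples per train shard (100M / 8000) and 6,250 per validation shard (200K / 32); actual shard sizes may differ.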

Tar File Format

Each tar archive contains alternating metadata–image pairs:

  • {key}.json — metadata and caption
  • {key}.jpg / {key}.png — corresponding image

Files are stored sequentially such that each JSON entry is followed by its corresponding image. For example:

{key_1}.json
{key_1}.jpg
{key_2}.json
{key_2}.png
{key_3}.json
{key_3}.jpg
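Because each JSON entry is immediately followed by its image, a shard can be read in a single sequential pass. The following is a minimal sketch using only the Python standard library; the function name `iter_pairs` is hypothetical and this is not the loader used by the baselines.

```python
import io
import json
import tarfile
from typing import Iterator

IMAGE_EXTS = {"jpg", "png"}


def iter_pairs(fileobj) -> Iterator[tuple[dict, str, bytes]]:
    """Yield (metadata, image_name, image_bytes) for each pair in one shard.

    Relies on every {key}.json member being directly followed by its image.
    """
    pending = None  # (key, metadata) of the JSON still waiting for its image
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            data = tar.extractfile(member).read()
            stem, _, ext = member.name.rpartition(".")
            if ext == "json":
                pending = (stem, json.loads(data))
            elif pending is not None and stem == pending[0] and ext in IMAGE_EXTS:
                yield pending[1], member.name, data
                pending = None
            else:
                raise ValueError(f"unexpected member order at {member.name}")
```

A shard on disk would be passed in as an open binary file, for example `with open(path, "rb") as f: for meta, name, img in iter_pairs(f): ...`. The image bytes are returned undecoded so that the caller can choose an image library.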

JSON Format

Each JSON file contains metadata in the following format:

{
    "retrieved_at": str,
    "license": str,
    "license_url": str,
    "attribution": str,
    "key": str,              # unique identifier for the image
    "img_width": int,
    "img_height": int,
    "split": [str],          # dataset split, one of {"nano", "lite", "full"}
    "caption_type": str,     # one of {"tag", "short", "medium", "long"}
    "caption": str
},
// Next image record
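To illustrate the schema above, here is a minimal sketch of a record check. The helper `validate_record` is hypothetical and not part of the GPIC tooling; it only tests the field names, types, and allowed values listed in this section.

```python
# Field names and types copied from the schema above.
REQUIRED_FIELDS = {
    "retrieved_at": str,
    "license": str,
    "license_url": str,
    "attribution": str,
    "key": str,
    "img_width": int,
    "img_height": int,
    "split": list,
    "caption_type": str,
    "caption": str,
}
SPLITS = {"nano", "lite", "full"}
CAPTION_TYPES = {"tag", "short", "medium", "long"}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one metadata record (empty if none)."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"{name}: expected {expected.__name__}")
    # Every entry of "split" should be one of the documented split names.
    if isinstance(record.get("split"), list):
        for s in record["split"]:
            if not isinstance(s, str) or s not in SPLITS:
                problems.append(f"unknown split: {s!r}")
    caption_type = record.get("caption_type")
    if isinstance(caption_type, str) and caption_type not in CAPTION_TYPES:
        problems.append(f"unknown caption_type: {caption_type!r}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a record when scanning a large shard.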

Baselines

The baseline code is located in baselines/PixelGen/. Our implementation builds on PixelGen by Ma et al.

Installation

conda create -n pixgen python=3.10
conda activate pixgen
cd baselines/PixelGen
pip install -r requirements.txt

Training

cd baselines/PixelGen
sbatch sbatch_pretrain_gpic_full_jit.sh

Sampling

cd baselines/PixelGen
sbatch sbatch_sample_wl.sh

Update --ckpt_path in the script to point to your checkpoint.

Evaluation

# First install the GPIC evaluation toolkit
cd gpic_eval
pip install -e .
cd ../baselines/PixelGen
bash benchmark_jit_gpic_full.sh

Update PRED_DIRS and REF_NPZ in the script to point to your predictions and reference file.

See baselines/PixelGen/README.md for full details on model architecture and configuration.

GPIC Evaluation Toolkit

The evaluation toolkit is located in gpic_eval/.

Contact

Citation

@misc{chandrasegaran2026gpic,
      title={GPIC: A Giant Permissive Image Corpus for Visual Generation}, 
      author={Keshigeyan Chandrasegaran and Kyle Sargent and Suchir Agarwal and Michael Jang and Michael Poli and Juan Carlos Niebles and Justin Johnson and Jiajun Wu and Li Fei-Fei},
      year={2026},
      eprint={2605.30341},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.30341}, 
}
