TL;DR — PiD is a plug-and-play diffusion decoder that replaces VAE/RAE decoders, turning latent representations directly into super-resolved pixels in a single pass.
PiD reformulates the latent-to-pixel decoder as a conditional pixel-space diffusion model, unifying decoding and upsampling into a single generative module. It directly denoises in high-resolution pixel space and produces a super-resolved image in one pass.
Paper, Project Page, Model Weights
Yifan Lu,
Qi Wu,
Jay Zhangjie Wu,
Zian Wang,
Huan Ling,
Sanja Fidler,
Xuanchi Ren
Tip
Quick Start — if your environment already has PyTorch (with CUDA), transformers>=4.57.x, and diffusers>=0.37, you don't need to build a new conda env. Just install the small set of utility deps the inference code pulls eagerly and you're ready to run the diffusers backbones (flux/flux2/sd3/zimage):
pip install hydra-core==1.3.2 omegaconf==2.3.0 \
attrs einops loguru termcolor fvcore iopath pynvml wandb \
imageio opencv-python-headless pandas \
safetensors "huggingface-hub>=1.0" sentencepiece boto3 botocore
pip install -e .

For the dinov2 / siglip backbones you additionally need the upstream RAE / Scale-RAE repos plus a couple of extra packages; see docs/dinov2_siglip.md.
Full conda-managed install (preferred if you're starting from scratch):
# 1. Create and activate the conda environment.
conda env create -f environment.yml
conda activate pid
# 2. Install this package in editable mode.
pip install -e .

Pretrained PiD checkpoints live under checkpoints/. Each diffusers backbone ships
two variants — the original 2k decoder (trained at 2048px) and a 2kto4k decoder
(trained with multi-resolution data bucketing 2048→3840 + an SD3-style dynamic
shift, intended for 1024 LDM → 4K decoding). Pick the variant at the CLI via
--pid_ckpt_type {2k,2kto4k} (default: 2k).
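
As a rough illustration of that choice, the sketch below suggests a variant from the LDM resolution and the PiD upscale factor. The helper name `pick_ckpt_type` and the 2048px cut-off are assumptions drawn from the description above; neither is part of the PiD codebase.

```python
def pick_ckpt_type(ldm_resolution: int, scale: int = 4) -> str:
    """Hypothetical helper: suggest a value for --pid_ckpt_type."""
    # PiD output side length is the baseline resolution times the upscale factor.
    output_resolution = ldm_resolution * scale
    # Assumption: prefer 2k up to its 2048px training resolution,
    # and switch to 2kto4k only for larger outputs.
    return "2k" if output_resolution <= 2048 else "2kto4k"


if __name__ == "__main__":
    for res in (512, 1024):
        print(f"{res}px LDM -> {res * 4}px output: {pick_ckpt_type(res)}")
```

Treat the threshold as a starting point and compare both variants on your own prompts when the output size sits near 2048px.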
The released decoder weights and the encoder/decoder ("VAE") weights they
depend on are hosted at nvidia/PiD on
the Hugging Face Hub. Pull just the checkpoints/ tree into this repo:
hf download nvidia/PiD --local-dir . --include "checkpoints/*"

PiD ships two complementary entry points per backbone:
| Backbone | from_clean_* (image → encode → PiD) | from_ldm_* (text/class → LDM → PiD) |
|---|---|---|
| flux | from_clean_flux.py | from_ldm_flux.py |
| flux2 | from_clean_flux2.py | from_ldm_flux2.py |
| sd3 | from_clean_sd3.py | from_ldm_sd3.py |
| zimage | reuses flux | from_ldm_zimage.py |
| dinov2 | from_clean_dinov2.py | from_ldm_dinov2.py |
| siglip | from_clean_siglip.py | from_ldm_siglip.py |
All scripts live under pid/_src/inference/ and decode each captured latent
twice — once with the backbone's native VAE (baseline) and once with PiD.
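
The entry-point table can be read as a simple lookup from (flow, backbone) to a module to run with `python -m`. The sketch below encodes that mapping, including the zimage case that reuses the flux script; `entry_module` is a hypothetical helper written for illustration and does not exist in the repository.

```python
BACKBONES = ("flux", "flux2", "sd3", "zimage", "dinov2", "siglip")
FLOWS = ("from_clean", "from_ldm")


def entry_module(flow: str, backbone: str) -> str:
    """Hypothetical helper: module path for `python -m`, following the table."""
    if flow not in FLOWS:
        raise ValueError(f"unknown flow: {flow!r}")
    if backbone not in BACKBONES:
        raise ValueError(f"unknown backbone: {backbone!r}")
    # zimage has no from_clean script of its own; the table says it reuses flux.
    if flow == "from_clean" and backbone == "zimage":
        backbone = "flux"
    return f"pid._src.inference.{flow}_{backbone}"


if __name__ == "__main__":
    print(entry_module("from_ldm", "zimage"))
    print(entry_module("from_clean", "zimage"))
```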
Important
Picking the checkpoint variant — --pid_ckpt_type
Every entry point accepts --pid_ckpt_type {2k,2kto4k} (default 2k):
- 2k: the original 2048px-trained decoder.
- 2kto4k: the up-to-4K-resolution decoder. Available for flux / flux2 / sd3 / zimage only. Worse than 2k at 2048px resolution.
For the exact checkpoint path for each backbone, see docs/checkpoints.md.
A quick sanity check that the right variant loaded: when 2kto4k is active you
should see PixelDiT dynamic shift: base_shift=4.0 base_image_size=1024 in the
init log; for 2k that line is absent. Both 2k and 2kto4k support non-square aspect ratios.
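
If you capture the init log to a file or string, the sanity check above can be automated. This is a minimal sketch: `loaded_variant` is a hypothetical helper, and it relies only on the presence of the `PixelDiT dynamic shift:` prefix quoted above.

```python
DYNAMIC_SHIFT_MARKER = "PixelDiT dynamic shift:"


def loaded_variant(init_log: str) -> str:
    """Hypothetical helper: infer the loaded variant from the init log text."""
    # The dynamic-shift line is printed only when the 2kto4k decoder is active.
    return "2kto4k" if DYNAMIC_SHIFT_MARKER in init_log else "2k"


if __name__ == "__main__":
    sample = "PixelDiT dynamic shift: base_shift=4.0 base_image_size=1024"
    print(loaded_variant(sample))
```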
Runs the corresponding latent-diffusion backbone on a prompt (or class id for
the class-conditional dinov2 backbone), captures the intermediate x_t at
user-specified denoising steps (early LDM termination) and the final clean x_0, then decodes
each captured latent with both the native VAE / RAE decoder (baseline) and PiD.
For flux / flux2 / sd3 / zimage the LDM is a HuggingFace diffusers
pipeline (FluxPipeline, Flux2Pipeline, StableDiffusion3Pipeline,
ZImagePipeline).
For dinov2 and siglip the LDM is the upstream
RAE (class-conditional ImageNet-512) or
Scale-RAE (text-conditional
256px) repo — see the optional-deps section below for installation.
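
The capture logic described above can be pictured with a toy denoising loop. This sketch is not the PiD or diffusers implementation: `run_and_capture` and `step_fn` are hypothetical names, and the convention that step `i` means "the latent after `i` denoising steps" is an assumption.

```python
from typing import Callable, Dict, Iterable, TypeVar

Latent = TypeVar("Latent")


def run_and_capture(
    x: Latent,
    step_fn: Callable[[Latent, int], Latent],
    num_steps: int,
    save_xt_steps: Iterable[int],
) -> Dict[str, Latent]:
    """Run num_steps denoising steps, keeping selected x_t and the final x_0."""
    wanted = set(save_xt_steps)
    captured: Dict[str, Latent] = {}
    for i in range(1, num_steps + 1):
        x = step_fn(x, i)  # one denoising update
        if i in wanted:
            captured[f"x_t@{i}"] = x  # early-terminated latent
    captured["x_0"] = x  # fully denoised latent
    return captured


if __name__ == "__main__":
    # Toy "latent": a counter that increases by one per step.
    out = run_and_capture(0, lambda x, i: x + 1, num_steps=28, save_xt_steps=[24])
    print(out)
```

Each captured entry would then be decoded twice, once by the native decoder and once by PiD.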
PYTHONPATH=. python -m pid._src.inference.from_ldm_flux \
--prompt "A photorealistic half-body portrait of a brown tabby cat with bold stripes sitting attentively on a rustic wooden kitchen table, soft morning light streaming sideways through a large window, fine fur detail and stripe patterns sharply visible, intense amber-green eyes in razor-sharp focus, warm farmhouse kitchen softly out of focus, cinematic shallow depth of field, ultra-detailed fur texture, photorealistic" \
--ldm_inference_steps 28 --save_xt_steps 24 \
--output_dir ./results/official_demo/flux \
--cfg_scale 1 --pid_inference_steps 4 --scale 4

Same backbone as Example 1 but with --resolution 1024 --pid_ckpt_type 2kto4k,
so the LDM produces a 1024² latent and PiD decodes it to 4K.
PYTHONPATH=. python -m pid._src.inference.from_ldm_flux \
--prompt "A photorealistic half-body portrait of a brown tabby cat with bold stripes sitting attentively on a rustic wooden kitchen table, soft morning light streaming sideways through a large window, fine fur detail and stripe patterns sharply visible, intense amber-green eyes in razor-sharp focus, warm farmhouse kitchen softly out of focus, cinematic shallow depth of field, ultra-detailed fur texture, photorealistic" \
--resolution 1024 --pid_ckpt_type 2kto4k \
--ldm_inference_steps 28 --save_xt_steps 24 \
--output_dir ./results/official_demo/flux_4k \
--cfg_scale 1 --pid_inference_steps 4 --scale 4

torchrun shards --prompt_file across ranks; each rank writes to
--output_dir independently.
PYTHONPATH=. torchrun --nproc_per_node=4 \
-m pid._src.inference.from_ldm_zimage \
--prompt_file pid/_src/inference/prompts/prompt_creative.txt \
--ldm_inference_steps 50 --save_xt_steps 46 \
--output_dir ./results/official_demo/zimage \
--cfg_scale 1 --pid_inference_steps 4 --scale 4

The upstream RAE / Scale-RAE LDMs don't live in diffusers; see
docs/dinov2_siglip.md for setup and end-to-end
examples.
(See each script's docstring for the exact recipe.)
| Backbone | LDM steps flag | Default steps | --save_xt_steps (example) | Best --save_xt_steps |
|---|---|---|---|---|
| flux | --ldm_inference_steps | 28 | 22 24 26 | 24 |
| sd3 | --ldm_inference_steps | 28 | 22 24 26 | 24 |
| flux2 | --ldm_inference_steps | 50 | 44 46 48 | 46 |
| zimage | --ldm_inference_steps | 50 | 44 46 48 | 46 |
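
For scripted runs, the recommended values in the table can be kept in a small lookup. The sketch below copies the "Default steps" and "Best --save_xt_steps" columns; the `RECOMMENDED` dict and `ldm_step_args` helper are hypothetical and not part of the repository.

```python
from typing import Dict, List, Tuple

# (default LDM steps, best --save_xt_steps), copied from the table above.
RECOMMENDED: Dict[str, Tuple[int, int]] = {
    "flux": (28, 24),
    "sd3": (28, 24),
    "flux2": (50, 46),
    "zimage": (50, 46),
}


def ldm_step_args(backbone: str) -> List[str]:
    """Hypothetical helper: CLI arguments for the recommended step settings."""
    steps, save_step = RECOMMENDED[backbone]
    return ["--ldm_inference_steps", str(steps), "--save_xt_steps", str(save_step)]


if __name__ == "__main__":
    print(" ".join(ldm_step_args("flux")))
```

In every row the best capture point is four steps before the end of the LDM schedule.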
No latent diffusion model is run. The input image is encoded by the VAE,
optionally corrupted with Gaussian noise at each
sigma in --degrade_sigmas, then decoded by PiD at --scale * input_resolution.
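
To make the degradation step concrete, here is a minimal sketch of noising a latent at a given sigma. The exact noising rule is not stated in this README; the sketch assumes the rectified-flow interpolation x_sigma = (1 - sigma) * x_0 + sigma * eps used by flow-matching models, and `degrade` is a hypothetical helper operating on a flat list of floats rather than a real tensor.

```python
import random
from typing import List, Sequence


def degrade(latent: Sequence[float], sigma: float, rng: random.Random) -> List[float]:
    """Assumed rule: blend the clean latent with unit Gaussian noise."""
    # sigma = 0.0 keeps the latent unchanged; sigma = 1.0 yields pure noise.
    return [(1.0 - sigma) * v + sigma * rng.gauss(0.0, 1.0) for v in latent]


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [0.5, -1.0, 2.0]
    for sigma in (0.0, 0.2, 0.4, 0.8):
        print(sigma, degrade(clean, sigma, rng))
    # Output side length: --scale * input_resolution, e.g. 4 * 512 = 2048.
    print(4 * 512)
```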
Single-GPU example (Flux):
PYTHONPATH=. python -m pid._src.inference.from_clean_flux \
--manifest assets/clean_image_manifest.jsonl \
--input_resolution 512 \
--degrade_sigmas 0.0 \
--output_dir ./results/official_demo_from_clean/flux \
--cfg_scale 1 --pid_inference_steps 4 --scale 4

You can pass a single image with --input_path and a prompt with --prompt
instead of --manifest, and a sigma sweep such as --degrade_sigmas 0.0 0.2 0.4 0.8
to decode noise-corrupted latents.
The dinov2 / siglip from_clean_* flows take the same flags but with
different default resolutions and scales —
see docs/dinov2_siglip.md.
| Flag | Meaning |
|---|---|
| --pid_inference_steps | Number of denoising steps for PiD (4 for the released distilled checkpoints) |
| --scale | PiD upscale factor (output = baseline * scale); 8 for Scale-RAE and 4 for other backbones |
| --cfg_scale | Classifier-free guidance scale for PiD |
| --output_dir | Where to write the side-by-side comparison images |
| --seed | Base random seed |
Multi-GPU runs use torchrun --nproc_per_node=N; each rank processes a shard
of the prompts / manifest entries and writes to --output_dir independently.
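
One common way to implement this kind of per-rank sharding is a strided slice keyed on the `RANK` and `WORLD_SIZE` environment variables that torchrun sets. The sketch below shows that pattern; whether PiD shards by stride or by contiguous chunks is an assumption, and `shard_for_rank` is a hypothetical helper.

```python
import os
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def shard_for_rank(
    items: Sequence[T],
    rank: Optional[int] = None,
    world_size: Optional[int] = None,
) -> List[T]:
    """Return the items this rank should process (strided split, assumed)."""
    # torchrun exports RANK and WORLD_SIZE; fall back to a single process.
    if rank is None:
        rank = int(os.environ.get("RANK", "0"))
    if world_size is None:
        world_size = int(os.environ.get("WORLD_SIZE", "1"))
    return list(items[rank::world_size])


if __name__ == "__main__":
    prompts = [f"prompt {i}" for i in range(10)]
    for r in range(4):
        print(r, shard_for_rank(prompts, rank=r, world_size=4))
```

Because every rank writes to the same --output_dir, output file names need to be unique per prompt or per rank to avoid collisions.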
pid/_src/inference/
├── from_ldm_{flux,flux2,sd3,zimage,dinov2,siglip}.py # text/class → LDM → PiD decode
├── from_clean_{flux,flux2,sd3,dinov2,siglip}.py # image → encode → PiD decode
├── _demo_common.py # shared CLI + run loop for from_ldm_*
├── _demo_from_clean_common.py # shared CLI + run loop for from_clean_*
├── checkpoint_registry.py # backbone → PiD checkpoint mapping
├── pipeline_registry.py # diffusers backbone → HF pipeline mapping
├── rae_generation.py # DINOv2-RAE LDM helpers (from_ldm_dinov2)
├── scale_rae_generation.py # Scale-RAE LDM helpers (from_ldm_siglip)
└── prompts/ # prompt files for from_ldm_*
The PiD codebase is licensed under the Apache License 2.0.
See CONTRIBUTING.md for development setup, code style,
and the DCO sign-off requirement.
The authors would like to acknowledge Yongsheng Yu and Wei Xiong for open-sourcing PixelDiT's model and weights, and thank Product Managers Aditya Mahajan and Matt Cragun for their valuable support and guidance.
@article{lu2026pid,
title={PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion},
author={Lu, Yifan and Wu, Qi and Wu, Jay Zhangjie and Wang, Zian and Ling, Huan and Fidler, Sanja and Ren, Xuanchi},
journal={arXiv preprint arXiv:2605.23902},
year={2026}
}