Small standalone demos for the public PSI-0.5 model.
PSIv0.5 is an RGBCFD model: it integrates four modalities. RGB is the standard pixel description of video frames; C is camera motion between frames; F is optical flow between adjacent frames; and D is the depth of a given frame. PSIv0.5 was trained on a large dataset of real-world videos, similar to the one described in the PSI paper.
PSIv0.5 treats visual reasoning tasks as sequence-completion problems over visual tokens. A task is defined by a sequence of RGB, camera, flow, and depth tokens, together with a choice of which tokens are observed and which ones the model should predict. Under this view, next-frame prediction, interpolation, motion estimation, and controlled generation are not separate capabilities, but different queries to the same autoregressive model. This lets us use intermediate structure, such as flow and depth, as a control surface for interacting with scenes: poking objects, opening doors, folding paper, or asking how the world might evolve under a different motion.
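As a rough illustration of the "observed versus predicted" split, the sketch below parses the notation strings used by the scripts in this README (for example "rgb0,c01->rgb1") into the tokens the model conditions on and the tokens it is asked to generate. `parse_notation` and `Query` are hypothetical helpers written for this example; they are not part of the PSI codebase, and the real parser may accept or reject different strings.

```python
from dataclasses import dataclass

# Token names taken from the geometry inputs listed in this README.
KNOWN_TOKENS = {"rgb0", "rgb1", "d0", "d1", "f01", "c01"}


@dataclass(frozen=True)
class Query:
    observed: tuple   # tokens given to the model as conditioning
    predicted: tuple  # tokens the model is asked to generate, in order


def parse_notation(notation: str) -> Query:
    """Split 'a,b->c,d' into observed and predicted token names.

    Hypothetical helper for illustration only.
    """
    if notation.count("->") != 1:
        raise ValueError(f"expected exactly one '->' in {notation!r}")
    left, right = notation.split("->")
    observed = tuple(t.strip() for t in left.split(",") if t.strip())
    predicted = tuple(t.strip() for t in right.split(",") if t.strip())
    if not predicted:
        raise ValueError("at least one token must be predicted")
    unknown = [t for t in observed + predicted if t not in KNOWN_TOKENS]
    if unknown:
        raise ValueError(f"unknown tokens: {unknown}")
    overlap = set(observed) & set(predicted)
    if overlap:
        raise ValueError(f"tokens both observed and predicted: {sorted(overlap)}")
    return Query(observed, predicted)


if __name__ == "__main__":
    # Camera-conditioned novel view synthesis: observe frame 0 and the camera
    # motion, predict frame 1.
    print(parse_notation("rgb0,c01->rgb1"))
    # Staged prediction: depth of frame 1 first, then its RGB.
    print(parse_notation("rgb0,c01,f01->d1,rgb1"))
```

Under this reading, changing the task means changing which side of the arrow a token sits on, not switching models.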
These scripts load PSI only through Hugging Face Transformers remote code:
from transformers import AutoModel
predictor = AutoModel.from_pretrained(
"StanfordNeuroAILab/psi0_5",
trust_remote_code=True,
device="cuda:0",
)

git clone https://github.com/neuroailab/psi-demos.git
cd psi-demos
conda create -n psi-demos python=3.10 -y
conda activate psi-demos
pip install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt

The PyTorch command above installs the CUDA 12.6 wheel used on the ccn2 A40
nodes. For other machines, install the PyTorch build recommended for your
driver/platform, then run pip install -r requirements.txt.
If the model is gated for your account, log in first:
huggingface-cli login

Depth-conditioned examples can either load a precomputed depth map via
--depth0-npy or compute depth with Depth Anything 3 from an environment where
depth_anything_3 is installed.
python generate_unified.py --server-port 7860 --device cuda:0

The app combines next-frame generation, sparse-flow prompting, camera-conditioned novel view synthesis, free-form notation prompting, automatic Depth Anything 3 (DA3) depth, point-cloud preview, 3D axis rotation prompts, 3D pokes, and sparse-to-dense flow prompting in one interface.
python nvs_from_geometry.py \
--image /path/to/image.png \
--notation "rgb0,c01->rgb1" \
--device cuda:0

The notation controls both inputs and outputs. Supported geometry inputs are
rgb0, rgb1, d0, d1, f01, and c01; depth is loaded from
--depth0-npy or computed with Depth Anything 3, and f01/projected d1 are
computed analytically from camera motion when not supplied. The camera-only
variant above works in the base install; depth-conditioned variants require
either --depth0-npy or an environment with Depth Anything 3 installed.
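To make "computed analytically from camera motion" concrete, here is a minimal sketch of the standard rigid-flow construction: unproject each pixel with its depth, apply the camera motion, and reproject. It is not the repository's implementation. The pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the convention that `R` and `t` map frame-0 camera coordinates to frame-1 camera coordinates, and the function name `rigid_flow` are all assumptions made for this example.

```python
def rigid_flow(depth, fx, fy, cx, cy, R, t):
    """Per-pixel flow and frame-1 depth induced by a rigid camera motion.

    depth: 2D list of frame-0 depths (rows of floats).
    R, t:  3x3 rotation (nested lists) and 3-vector, assumed to map frame-0
           camera coordinates to frame-1 camera coordinates.
    Returns (flow, depth1), both indexed like `depth`. Entries are None for
    points that end up behind the camera.
    """
    flow, depth1 = [], []
    for v, row in enumerate(depth):
        flow_row, depth_row = [], []
        for u, z in enumerate(row):
            # Unproject pixel (u, v) to a 3D point in frame-0 camera coordinates.
            p = ((u - cx) / fx * z, (v - cy) / fy * z, z)
            # Apply the camera motion to get frame-1 camera coordinates.
            x1, y1, z1 = (
                sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)
            )
            if z1 <= 0:
                flow_row.append(None)
                depth_row.append(None)
                continue
            # Reproject and take the displacement in pixels.
            u1 = fx * x1 / z1 + cx
            v1 = fy * y1 / z1 + cy
            flow_row.append((u1 - u, v1 - v))
            depth_row.append(z1)
        flow.append(flow_row)
        depth1.append(depth_row)
    return flow, depth1


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    depth0 = [[2.0, 2.0], [2.0, 2.0]]
    flow, depth1 = rigid_flow(depth0, 100.0, 100.0, 0.5, 0.5, identity, (0.1, 0.0, 0.0))
    print(flow[0][0], depth1[0][0])
```

Note that `depth1` here is the new depth of each frame-0 pixel, stored at its frame-0 location. Resampling it onto the frame-1 pixel grid would need an additional warping step that this sketch omits.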
Useful variants:
python nvs_from_geometry.py --image /path/to/image.png --depth0-npy /path/to/d0.npy --notation "rgb0,d0,f01->rgb1"
python nvs_from_geometry.py --image /path/to/image.png --flow01-npy /path/to/f01.npy --notation "rgb0,f01->rgb1"
python nvs_from_geometry.py --image /path/to/image.png --notation "rgb0,c01->rgb1"
python nvs_from_geometry.py --image /path/to/image.png --depth0-npy /path/to/d0.npy --notation "rgb0,c01,f01->d1,rgb1" --rollout-strategy staged_dense

python visual_statistics.py --image /path/to/image.png --device cuda:0

This writes the generated frame plus a patchwise entropy heatmap from PSI-0.5 RGB logits.
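For readers unfamiliar with entropy heatmaps, the sketch below shows one way such a map can be derived: take a vector of logits per patch, convert it to probabilities with a softmax, and report the Shannon entropy. It assumes exactly one logit vector per patch; how visual_statistics.py actually groups tokens into patches is not specified here, and the function names are hypothetical.

```python
import math


def entropy_from_logits(logits):
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max to avoid overflow
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_heatmap(patch_logits):
    """Map a 2D grid of per-patch logit vectors to a 2D grid of entropies."""
    return [[entropy_from_logits(vec) for vec in row] for row in patch_logits]


if __name__ == "__main__":
    grid = [
        [[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]],  # uncertain, confident
    ]
    for row in entropy_heatmap(grid):
        print([round(h, 3) for h in row])
```

High values mark patches where the model spreads probability over many RGB tokens; values near zero mark patches where it is close to certain.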
@misc{kotar2025worldmodelingpsi,
title={World Modeling with Probabilistic Structure Integration},
author={Klemen Kotar and Wanhee Lee and Rahul Venkatesh and Honglin Chen and Daniel Bear and Jared Watrous and Simon Kim and Khai Loong Aw and Lilian Naing Chen and Stefan Stojanov and Kevin Feigelis and Imran Thobani and Alex Durango and Khaled Jedoui and Atlas Kazemian and Dan Yamins},
year={2025},
eprint={2509.09737},
archivePrefix={arXiv},
url={https://arxiv.org/abs/2509.09737}
}

PSIv0.5 is a modestly sized model that has not undergone any post-training yet.
Some of its rollouts diverge. We recommend unrestricted sampling for flow
prediction and top_p=0.9, top_k=1000 for RGB rendering. Careful prompting
can significantly improve generations, and simple harnesses such as those in the
provided Gradio app steer the model much more effectively. We believe this
approach can scale to more comprehensive models of the world while keeping the
same highly controllable API.
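The sampling settings above can be read as follows. This is a generic sketch of top-k followed by top-p (nucleus) filtering, not the PSI sampler. It uses one common convention: renormalize after top-k, then keep the smallest prefix whose cumulative probability reaches top_p. The function names are hypothetical, and "unrestricted" is taken to mean top_k=None with top_p=1.0.

```python
import math
import random


def filter_top_k_top_p(logits, top_k=None, top_p=1.0):
    """Return {token_index: probability} after top-k, then top-p filtering."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep the k most likely tokens (all of them if top_k is None).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    order = order[:top_k]
    k_mass = sum(probs[i] for i in order)

    # Top-p: keep the smallest prefix whose renormalized mass reaches top_p.
    kept = order
    if top_p < 1.0:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i] / k_mass
            if cumulative >= top_p:
                break

    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


def sample(logits, top_k=None, top_p=1.0, rng=random):
    """Draw one token index from the filtered distribution."""
    dist = filter_top_k_top_p(logits, top_k=top_k, top_p=top_p)
    indices = list(dist)
    return rng.choices(indices, weights=[dist[i] for i in indices])[0]


if __name__ == "__main__":
    logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
    print(sorted(filter_top_k_top_p(logits, top_k=1000, top_p=0.9)))  # RGB-style
    print(sorted(filter_top_k_top_p(logits)))                         # unrestricted
```

With the RGB-style settings the low-probability tail is cut before sampling, while the unrestricted call leaves the full distribution available, which is the behavior recommended for flow prediction.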