neuroailab/psi-demos

PSI-0.5 Demos

Small standalone demos for the public PSI-0.5 model.

PSI paper | Release blog post | Example gallery

PSIv0.5 is an RGBCFD model, meaning it integrates the following modalities: RGB, the standard pixel description of video frames; C, camera motion between frames; F, optical flow between adjacent frames; and D, depth of a given frame. PSIv0.5 was trained on a large dataset of real-world videos, similar to the one described in the PSI paper.

PSIv0.5 treats visual reasoning tasks as sequence-completion problems over visual tokens. A task is defined by a sequence of RGB, camera, flow, and depth tokens, together with a choice of which tokens are observed and which ones the model should predict. Under this view, next-frame prediction, interpolation, motion estimation, and controlled generation are not separate capabilities, but different queries to the same autoregressive model. This lets us use intermediate structure, such as flow and depth, as a control surface for interacting with scenes: poking objects, opening doors, folding paper, or asking how the world might evolve under a different motion.

These scripts load PSI exclusively through the Hugging Face Transformers remote-code mechanism, so no separate PSI package is installed:

from transformers import AutoModel

predictor = AutoModel.from_pretrained(
    "StanfordNeuroAILab/psi0_5",
    trust_remote_code=True,
    device="cuda:0",
)

Setup

git clone https://github.com/neuroailab/psi-demos.git
cd psi-demos
conda create -n psi-demos python=3.10 -y
conda activate psi-demos
pip install torch==2.7.1 --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt

The PyTorch command above installs the CUDA 12.6 wheel used on the ccn2 A40 nodes. For other machines, install the PyTorch build recommended for your driver/platform, then run pip install -r requirements.txt.

If the model is gated for your account, log in first:

huggingface-cli login

Depth-conditioned examples can either load --depth0-npy or compute depth with Depth Anything 3 from an environment where depth_anything_3 is installed.

Gradio App

python generate_unified.py --server-port 7860 --device cuda:0

The app combines next-frame generation, sparse-flow prompting, camera-conditioned novel view synthesis, free-form notation prompting, automatic DA3 depth, point-cloud preview, 3D axis rotation prompts, 3D pokes, and sparse-to-dense flow prompting in one interface.

Novel View Synthesis From Geometry

python nvs_from_geometry.py \
  --image /path/to/image.png \
  --notation "rgb0,c01->rgb1" \
  --device cuda:0

The notation controls both inputs and outputs. Supported geometry inputs are rgb0, rgb1, d0, d1, f01, and c01; depth is loaded from --depth0-npy or computed with Depth Anything 3, and f01/projected d1 are computed analytically from camera motion when not supplied. The camera-only variant above works in the base install; depth-conditioned variants require either --depth0-npy or an environment with Depth Anything 3 installed.
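To make the input/output split in a notation string concrete, here is a minimal sketch of how such a string could be parsed. The helper `parse_notation` is hypothetical and is not taken from the repository; it assumes single-digit frame indices and only the modality prefixes listed above (rgb, d, f, c).

```python
import re

# Modality prefix followed by one or more single-digit frame indices,
# e.g. "rgb0", "d1", "f01", "c01". This grammar is an assumption for illustration.
TOKEN_RE = re.compile(r"^(rgb|d|f|c)(\d+)$")


def parse_notation(notation):
    """Split 'inputs->outputs' into two lists of (modality, frame_indices) tuples."""
    if notation.count("->") != 1:
        raise ValueError(f"expected exactly one '->' in {notation!r}")
    left, right = notation.split("->")

    def parse_side(side):
        items = []
        for raw in side.split(","):
            name = raw.strip()
            match = TOKEN_RE.match(name)
            if match is None:
                raise ValueError(f"unrecognized token {name!r}")
            modality, digits = match.groups()
            # "01" is read as the frame pair (0, 1); "0" as the single frame (0,).
            items.append((modality, tuple(int(ch) for ch in digits)))
        return items

    return parse_side(left), parse_side(right)


observed, predicted = parse_notation("rgb0,c01->rgb1")
print("observed:", observed)
print("predicted:", predicted)
```

Read this way, everything to the left of `->` is what the model observes and everything to the right is what it is asked to predict, in order.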

Useful variants:

python nvs_from_geometry.py --image /path/to/image.png --depth0-npy /path/to/d0.npy --notation "rgb0,d0,f01->rgb1"
python nvs_from_geometry.py --image /path/to/image.png --flow01-npy /path/to/f01.npy --notation "rgb0,f01->rgb1"
python nvs_from_geometry.py --image /path/to/image.png --notation "rgb0,c01->rgb1"
python nvs_from_geometry.py --image /path/to/image.png --depth0-npy /path/to/d0.npy --notation "rgb0,c01,f01->d1,rgb1" --rollout-strategy staged_dense
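The analytic computation of f01 from camera motion mentioned in this section can be illustrated with standard pinhole geometry. The sketch below is not the repository's implementation: the function name, the intrinsics tuple `(fx, fy, cx, cy)`, and the convention that `R` and `t` map frame-0 camera coordinates to frame-1 camera coordinates are all assumptions, and the result describes a static scene only.

```python
def flow_from_depth_and_camera(depth, intrinsics, R, t):
    """Per-pixel flow (du, dv) induced by camera motion over a static scene.

    depth:      H x W nested list of frame-0 depths (distance along the optical axis)
    intrinsics: (fx, fy, cx, cy) pinhole parameters, assumed shared by both frames
    R, t:       3x3 rotation (nested list) and length-3 translation taking
                frame-0 camera coordinates to frame-1 camera coordinates
    """
    fx, fy, cx, cy = intrinsics
    flow = []
    for v, row in enumerate(depth):
        out_row = []
        for u, z in enumerate(row):
            # Unproject pixel (u, v) to a 3D point in the frame-0 camera.
            point = ((u - cx) / fx * z, (v - cy) / fy * z, z)
            # Apply the rigid camera motion.
            moved = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
            if moved[2] <= 0:
                # The point ends up behind the frame-1 camera: no valid flow.
                out_row.append(None)
                continue
            # Reproject into frame 1 and take the pixel displacement.
            u1 = fx * moved[0] / moved[2] + cx
            v1 = fy * moved[1] / moved[2] + cy
            out_row.append((u1 - u, v1 - v))
        flow.append(out_row)
    return flow


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
depth = [[2.0, 2.0], [2.0, 2.0]]
flow = flow_from_depth_and_camera(depth, (100.0, 100.0, 0.5, 0.5), identity, [0.1, 0.0, 0.0])
print(flow[0][0])
```

With a pure sideways translation, the displacement scales as focal length times translation divided by depth, so nearer pixels move more than distant ones.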

Visual Statistics

python visual_statistics.py --image /path/to/image.png --device cuda:0

This writes the generated frame plus a patchwise entropy heatmap from PSI-0.5 RGB logits.
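As a rough illustration of what a patchwise entropy heatmap measures, the sketch below computes the Shannon entropy of the softmax distribution at each patch. It assumes the logits are available as a rows x cols x vocabulary nested list; that layout and both function names are hypothetical, and the script's actual tensor shapes may differ.

```python
import math


def entropy_from_logits(logits):
    """Shannon entropy, in nats, of softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_heatmap(patch_logits):
    """Map a rows x cols x vocab nested list of logits to rows x cols entropies."""
    return [[entropy_from_logits(cell) for cell in row] for row in patch_logits]


# One confident patch and one maximally uncertain patch over a 4-token vocabulary.
heatmap = entropy_heatmap([[[100.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]])
print(heatmap)
```

High-entropy patches are those where the model spreads probability over many RGB tokens, which is what a heatmap of this kind highlights.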

Citation

@misc{kotar2025worldmodelingpsi,
  title={World Modeling with Probabilistic Structure Integration},
  author={Klemen Kotar and Wanhee Lee and Rahul Venkatesh and Honglin Chen and Daniel Bear and Jared Watrous and Simon Kim and Khai Loong Aw and Lilian Naing Chen and Stefan Stojanov and Kevin Feigelis and Imran Thobani and Alex Durango and Khaled Jedoui and Atlas Kazemian and Dan Yamins},
  year={2025},
  eprint={2509.09737},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2509.09737}
}

PSIv0.5 is a modestly sized model that has not undergone any post-training yet, and some of its rollouts diverge. We recommend unrestricted sampling (no top_p or top_k truncation) for flow prediction, and top_p=0.9, top_k=1000 for RGB rendering. Careful prompting can significantly improve generations, and simple harnesses such as those in the provided Gradio app can steer the model much more effectively. We believe this direction has great potential for scaling to even more comprehensive models of the world while maintaining this highly controllable API.
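For readers unfamiliar with the top_p and top_k settings recommended above, here is a minimal, generic sketch of top-k followed by nucleus (top-p) truncation over a single logit vector. It is not PSI's sampling code; the function names are hypothetical, and passing `None` for both parameters corresponds to unrestricted sampling.

```python
import math
import random


def filter_top_k_top_p(logits, top_k=None, top_p=None):
    """Return {token_id: probability} after top-k, then top-p, truncation."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    # Rank tokens from most to least probable (ties broken by token id).
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)),
                    key=lambda pair: (-pair[0], pair[1]))
    if top_k is not None:
        ranked = ranked[:top_k]
        mass = sum(p for p, _ in ranked)
        ranked = [(p / mass, i) for p, i in ranked]  # renormalize before top-p
    if top_p is not None:
        kept, cumulative = [], 0.0
        for p, i in ranked:
            kept.append((p, i))
            cumulative += p
            if cumulative >= top_p:  # smallest prefix whose mass reaches top_p
                break
        ranked = kept
    mass = sum(p for p, _ in ranked)
    return {i: p / mass for p, i in ranked}


def sample_token(logits, top_k=None, top_p=None, rng=random):
    """Draw one token id from the truncated, renormalized distribution."""
    dist = filter_top_k_top_p(logits, top_k=top_k, top_p=top_p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


logits = [math.log(0.5), math.log(0.3), math.log(0.15), math.log(0.05)]
print(sorted(filter_top_k_top_p(logits, top_k=1000, top_p=0.9)))
```

Truncation removes the low-probability tail, which tends to matter more for RGB rendering than for flow, where the full distribution carries the multimodality of possible motions.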
