Hypothesis: robotic policies would benefit from a memory bank of sorts. If we can demonstrate this on a toy problem with Pi0, it provides evidence that something resembling memory may benefit the model. I'd love to call this pi-e (pi with experience).
Implement Physical Intelligence's Pi architecture from scratch, building up incrementally through the key papers/techniques that led to it:
- Behavior cloning - supervised learning baseline
- Action chunking - predict action sequences (key idea from ACT)
- Transformer decoder with action queries - ACT-style architecture
- ViT encoder - replace CNN with Vision Transformer
- Flow matching - generative action modeling that replaces direct regression (a minimal sketch follows this list)
- VLA - add language conditioning for full Pi0-style model
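To make the flow matching bullet concrete before we get there: instead of regressing actions directly, a network learns the velocity that transports noise toward the expert action, and inference integrates that velocity field. Below is a minimal sketch of the generic conditional flow-matching recipe, not Pi0's actual implementation; `VelocityNet`, the MLP sizes, and the straight-line noise-to-action path are all assumptions.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Predicts velocity v(a_t, t | obs) for a noisy action. Hypothetical module, not Pi0's."""
    def __init__(self, action_dim, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + obs_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, noisy_action, obs_emb, t):
        return self.net(torch.cat([noisy_action, obs_emb, t], dim=-1))

def flow_matching_loss(model, obs_emb, expert_action):
    """Linear-path conditional flow matching: the target velocity is (a1 - a0)."""
    a0 = torch.randn_like(expert_action)            # noise sample
    t = torch.rand(expert_action.shape[0], 1)       # random time in [0, 1]
    a_t = (1 - t) * a0 + t * expert_action          # point on the straight noise-to-action path
    target_v = expert_action - a0                   # constant velocity along that path
    pred_v = model(a_t, obs_emb, t)
    return ((pred_v - target_v) ** 2).mean()

@torch.no_grad()
def sample_action(model, obs_emb, action_dim, steps=10):
    """Euler integration of the learned velocity field, from noise to an action."""
    a = torch.randn(obs_emb.shape[0], action_dim)
    for i in range(steps):
        t = torch.full((obs_emb.shape[0], 1), i / steps)
        a = a + model(a, obs_emb, t) / steps
    return a
```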
- ACT (2023, from researchers who later co-founded Physical Intelligence) introduced action chunking and a transformer decoder with learned action queries
- Pi0 builds on those ideas, adding flow matching and language conditioning
- Understanding ACT's transformer decoder is key to understanding Pi0; a minimal sketch follows
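A minimal sketch of that decoder idea, built from generic PyTorch transformer modules: a fixed set of learned action-query embeddings cross-attends to image tokens, and each query decodes into one action of the chunk. Sizes and names are illustrative, not ACT's actual hyperparameters.

```python
import torch
import torch.nn as nn

class ActionQueryDecoder(nn.Module):
    """ACT-style decoder: K learned queries -> K actions per chunk (illustrative sizes)."""
    def __init__(self, chunk_size=8, action_dim=2, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        self.action_queries = nn.Parameter(torch.randn(chunk_size, d_model))
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, image_tokens):
        # image_tokens: (B, num_tokens, d_model) from a CNN or ViT encoder
        B = image_tokens.shape[0]
        queries = self.action_queries.unsqueeze(0).expand(B, -1, -1)  # (B, K, d_model)
        decoded = self.decoder(tgt=queries, memory=image_tokens)      # cross-attend to image tokens
        return self.action_head(decoded)                              # (B, K, action_dim)
```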
Simple 2D ball interception task (an env sketch follows the list):
- Red ball bounces around the screen
- Blue end-effector (robot) must intercept it
- Observation: 256x256 RGB image
- Action: (dx, dy) velocity command
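A gym-style sketch of what this interface could look like; `MovingObjectEnv` here is a stand-in written from the spec above, not the actual `env/moving_object.py`.

```python
import numpy as np

class MovingObjectEnv:
    """Toy interception env matching the spec above: 256x256 RGB obs, (dx, dy) action."""
    SIZE = 256

    def __init__(self, max_speed=4.0):
        self.max_speed = max_speed
        self.reset()

    def reset(self):
        self.ball = np.random.uniform(20, self.SIZE - 20, size=2)
        self.ball_vel = np.random.uniform(-3, 3, size=2)
        self.effector = np.array([self.SIZE / 2, self.SIZE / 2])
        return self._render()

    def step(self, action):
        # Clip the velocity command and move the end-effector.
        move = np.clip(np.asarray(action, dtype=float), -self.max_speed, self.max_speed)
        self.effector = np.clip(self.effector + move, 0, self.SIZE - 1)
        # Ball bounces off the walls.
        self.ball += self.ball_vel
        for i in range(2):
            if not 0 <= self.ball[i] <= self.SIZE - 1:
                self.ball_vel[i] *= -1
                self.ball[i] = np.clip(self.ball[i], 0, self.SIZE - 1)
        caught = bool(np.linalg.norm(self.ball - self.effector) < 10.0)
        return self._render(), float(caught), caught, {}

    def _render(self):
        # Red ball, blue end-effector on a black 256x256 RGB canvas.
        img = np.zeros((self.SIZE, self.SIZE, 3), dtype=np.uint8)
        bx, by = self.ball.astype(int)
        ex, ey = self.effector.astype(int)
        img[max(by - 5, 0):by + 5, max(bx - 5, 0):bx + 5] = (255, 0, 0)
        img[max(ey - 5, 0):ey + 5, max(ex - 5, 0):ex + 5] = (0, 0, 255)
        return img
```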
pi/
├── env/ # Environment
│ └── moving_object.py
├── expert/ # Expert policy for demonstrations
│ └── expert_policy.py
├── policy/ # Learned policies
├── scripts/ # Training and evaluation
├── data/ # Collected demonstrations
└── visualize.py # Visualization
Visualize with expert policy:
python visualize.py

Milestones:
- Environment
- Expert policy
- Data collection
- Behavior cloning policy (single-frame)
- DAgger for single-frame BC
- Action chunking
- Transformer decoder with action queries (ACT-style)
- ViT encoder
- Flow matching
- Language conditioning (VLA)
Attempted stacking 3 frames along the channel dimension (H, W, 9) to provide temporal information. This approach had several problems:
- Zero-frame contamination: the first observations in each episode have zero-padding for older frames, teaching the model to ignore the temporal channels.
- No temporal inductive bias: Conv2d treats all 9 channels equally; it doesn't know that channels 0-2, 3-5, and 6-8 represent different time steps.
- DAgger bootstrap failure: collecting DAgger data with a poorly trained policy produces low-quality data that doesn't help.
- It's a 2015-era technique: frame stacking was popularized by DQN for Atari. Modern approaches (ACT, Pi0) handle temporal information differently:
- Encode each frame separately with a vision encoder
- Use transformer attention over frame embeddings
- Or use action chunking, which implicitly captures dynamics
Takeaway: skip frame stacking. For temporal reasoning, use action chunking or attention over frame sequences (sketched below).
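For reference, a sketch of the attention-over-frames alternative: encode each frame separately, then run a small transformer encoder across the per-frame embeddings, with a learned time embedding so the model knows which step is which. All module names and sizes are placeholders, not anything implemented in this repo.

```python
import torch
import torch.nn as nn

class FrameAttentionEncoder(nn.Module):
    """Encode each frame separately, then attend over the sequence of frame embeddings."""
    def __init__(self, d_model=256, nhead=8, num_layers=2):
        super().__init__()
        self.frame_encoder = nn.Sequential(          # stand-in for a per-frame CNN/ViT encoder
            nn.Conv2d(3, 32, 5, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, d_model),
        )
        self.time_embed = nn.Parameter(torch.randn(16, d_model))  # supports up to 16 frames
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frames):
        # frames: (B, T, 3, H, W) -> one embedding per frame, then temporal attention
        B, T = frames.shape[:2]
        emb = self.frame_encoder(frames.flatten(0, 1)).view(B, T, -1)
        emb = emb + self.time_embed[:T]              # tell the model which time step is which
        return self.temporal(emb)[:, -1]             # contextual embedding of the latest frame
```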
After abandoning multi-frame BC, we returned to single-frame BC and implemented DAgger properly.
Training setup (same for BC and BC+DAgger):
- 10k datapoints
- Learning rate: 1e-3
- Train/val split: 70/30
- Batch size: 64
- Epochs: 45
- Simple CNN encoder (2 conv layers → MLP), sketched below
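A sketch of what that encoder can look like; the layer widths here are guesses, not the trained model's.

```python
import torch.nn as nn

class BCPolicy(nn.Module):
    """Single-frame BC: 2 conv layers -> MLP -> (dx, dy). Layer sizes are illustrative."""
    def __init__(self, action_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=4), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(32 * 15 * 15, 128), nn.ReLU(),   # 256x256 input -> 15x15 feature map
            nn.Linear(128, action_dim),
        )

    def forward(self, image):
        # image: (B, 3, 256, 256), normalized to [0, 1]
        return self.head(self.encoder(image))
```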
DAgger collection (loop sketched after the notes below):
- Run the learned policy in the environment (it drifts toward edges/corners)
- Label those edge states with expert recovery actions
- Aggregate with original expert data and retrain from scratch
Why it works:
- Expert is good → stays centered → expert data lacks edge samples
- Learned policy is imperfect → visits edges → DAgger captures edge recovery data
- Result: policy learns to avoid/recover from edges
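A minimal sketch of that collection loop, assuming hypothetical helpers `expert_action(obs)` and `train_bc(dataset)` with the semantics described above.

```python
def dagger_round(env, policy, expert_action, dataset, steps=2000):
    """Roll out the learned policy, but label every visited state with the expert's action."""
    obs = env.reset()
    for _ in range(steps):
        dataset.append((obs, expert_action(obs)))  # expert labels on the learner's state distribution
        obs, _, done, _ = env.step(policy(obs))    # learner decides where to go (drifts to edges)
        if done:
            obs = env.reset()
    return dataset

# Aggregate with the original expert demos and retrain from scratch:
# dataset = expert_demos.copy()
# for _ in range(num_rounds):
#     policy = train_bc(dataset)                  # hypothetical training helper
#     dataset = dagger_round(env, policy, expert_action, dataset)
```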
Results:
| Policy | Video |
|---|---|
| Random | 00_random_policy.mp4 |
| Expert | 01_expert_policy.mp4 |
| BC (single-frame) | 03_bc_policy.mp4 |
| BC + DAgger | 06_bc_policy_dagger.mp4 |
| Multi-frame BC (abandoned) | 04_multi_img_bc_policy.mp4 |
| Action chunking (open-loop) | 07_action_chunking_policy.mp4 |
| Action chunking (RH4 + episode ends + padding) | 07_action_chunking_policy_rh4_episode_ends_padded.mp4 |
Predict 8 future actions at once instead of 1. Key idea from ACT that carries into Pi0.
Execution modes:
- Open-loop: execute all 8 actions, then re-predict
- Receding horizon: predict 8, execute fewer (e.g., 4), then re-predict (sketched below)
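The two modes differ only in how many of the predicted actions get executed before re-predicting; a sketch, with `predict_chunk` standing in for the trained chunking policy.

```python
def run_chunking_policy(env, predict_chunk, chunk_size=8, execute_horizon=4, max_steps=500):
    """Receding horizon: predict chunk_size actions, execute execute_horizon, re-predict.
    Setting execute_horizon == chunk_size gives the open-loop mode."""
    obs = env.reset()
    for _ in range(max_steps // execute_horizon):
        chunk = predict_chunk(obs)                 # (chunk_size, action_dim)
        for action in chunk[:execute_horizon]:
            obs, _, done, _ = env.step(action)
            if done:
                return
```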
Data improvements:
- Added `episode_ends` tracking to avoid cross-episode contamination
- Zero-pad action chunks at episode boundaries (chunk construction sketched below)
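A sketch of how those two data fixes can fit together when building training chunks; `episode_ends` is assumed to hold the exclusive end index of each episode, and the function name is made up.

```python
import numpy as np

def build_chunks(actions, episode_ends, chunk_size=8):
    """Build (start_index, padded_chunk) pairs without crossing episode boundaries."""
    chunks, start = [], 0
    for end in episode_ends:                       # exclusive end index of each episode
        for t in range(start, end):
            chunk = actions[t:min(t + chunk_size, end)]
            if len(chunk) < chunk_size:            # zero-pad at the episode boundary
                pad = np.zeros((chunk_size - len(chunk), actions.shape[1]))
                chunk = np.concatenate([chunk, pad], axis=0)
            chunks.append((t, chunk))
        start = end
    return chunks
```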
Results: Receding horizon (4) + clean episode data produces smoother execution. However, for this simple task, it's not dramatically better than BC + DAgger. Action chunking likely shines more on complex tasks with temporal structure.
We're already performing this task quite well with BC + DAgger. But we'll continue with this toy example because it's easier to build and compare architectures when keeping the task the same. Once we understand the full Pi0 architecture (transformers, flow matching, VLA), we can expand to harder scenarios:
- Multi-step tasks (catch ball → carry to goal)
- Partial observability (ball goes behind occluder, needs memory)
- Variable dynamics (ball behavior changes mid-episode)
- Multi-object reasoning (multiple balls, specific order)
- Longer horizon planning