Hypothesis: robotic policies would benefit from a memory bank of sorts. If we can demonstrate this on a toy problem with Pi0, it provides evidence that something resembling memory may benefit the model. I'd love to call this pi-e (pi with experience).
Implement Physical Intelligence's Pi architecture from scratch, building up incrementally through the key papers/techniques that led to it:
- Behavior cloning - supervised learning baseline
- Action chunking - predict action sequences (key idea from ACT)
- Transformer decoder with action queries - ACT-style architecture
- ViT encoder - replace CNN with Vision Transformer
- Flow matching - generative action modeling that replaces direct regression (a minimal sketch follows this list)
- VLA - add language conditioning for full Pi0-style model
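To make the flow matching bullet concrete before we get there: instead of regressing actions directly, a network learns the velocity that transports noise toward the expert action, and inference integrates that velocity field. Below is a minimal sketch of the generic conditional flow-matching recipe, not Pi0's actual implementation; `VelocityNet`, the MLP sizes, and the straight-line noise-to-action path are all assumptions.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Predicts velocity v(a_t, t | obs) for a noisy action. Hypothetical module, not Pi0's."""
    def __init__(self, action_dim, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + obs_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, noisy_action, obs_emb, t):
        return self.net(torch.cat([noisy_action, obs_emb, t], dim=-1))

def flow_matching_loss(model, obs_emb, expert_action):
    """Linear-path conditional flow matching: the target velocity is (a1 - a0)."""
    a0 = torch.randn_like(expert_action)            # noise sample
    t = torch.rand(expert_action.shape[0], 1)       # random time in [0, 1]
    a_t = (1 - t) * a0 + t * expert_action          # point on the straight noise-to-action path
    target_v = expert_action - a0                   # constant velocity along that path
    pred_v = model(a_t, obs_emb, t)
    return ((pred_v - target_v) ** 2).mean()

@torch.no_grad()
def sample_action(model, obs_emb, action_dim, steps=10):
    """Euler integration of the learned velocity field, from noise to an action."""
    a = torch.randn(obs_emb.shape[0], action_dim)
    for i in range(steps):
        t = torch.full((obs_emb.shape[0], 1), i / steps)
        a = a + model(a, obs_emb, t) / steps
    return a
```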
- ACT (2023, from researchers who later co-founded Physical Intelligence) introduced action chunking and a transformer decoder with learned action queries
- Pi0 builds on those ideas, adding flow matching and language conditioning
- Understanding ACT's transformer decoder is key to understanding Pi0; a minimal sketch follows
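A minimal sketch of that decoder idea, built from generic PyTorch transformer modules: a fixed set of learned action-query embeddings cross-attends to image tokens, and each query decodes into one action of the chunk. Sizes and names are illustrative, not ACT's actual hyperparameters.

```python
import torch
import torch.nn as nn

class ActionQueryDecoder(nn.Module):
    """ACT-style decoder: K learned queries -> K actions per chunk (illustrative sizes)."""
    def __init__(self, chunk_size=8, action_dim=2, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        self.action_queries = nn.Parameter(torch.randn(chunk_size, d_model))
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, image_tokens):
        # image_tokens: (B, num_tokens, d_model) from a CNN or ViT encoder
        B = image_tokens.shape[0]
        queries = self.action_queries.unsqueeze(0).expand(B, -1, -1)  # (B, K, d_model)
        decoded = self.decoder(tgt=queries, memory=image_tokens)      # cross-attend to image tokens
        return self.action_head(decoded)                              # (B, K, action_dim)
```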
Simple 2D ball interception task (an env sketch follows the list):
- Red ball bounces around the screen
- Blue end-effector (robot) must intercept it
- Observation: 256x256 RGB image
- Action: (dx, dy) velocity command
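A gym-style sketch of what this interface could look like; `MovingObjectEnv` here is a stand-in written from the spec above, not the actual `env/moving_object.py`.

```python
import numpy as np

class MovingObjectEnv:
    """Toy interception env matching the spec above: 256x256 RGB obs, (dx, dy) action."""
    SIZE = 256

    def __init__(self, max_speed=4.0):
        self.max_speed = max_speed
        self.reset()

    def reset(self):
        self.ball = np.random.uniform(20, self.SIZE - 20, size=2)
        self.ball_vel = np.random.uniform(-3, 3, size=2)
        self.effector = np.array([self.SIZE / 2, self.SIZE / 2])
        return self._render()

    def step(self, action):
        # Clip the velocity command and move the end-effector.
        move = np.clip(np.asarray(action, dtype=float), -self.max_speed, self.max_speed)
        self.effector = np.clip(self.effector + move, 0, self.SIZE - 1)
        # Ball bounces off the walls.
        self.ball += self.ball_vel
        for i in range(2):
            if not 0 <= self.ball[i] <= self.SIZE - 1:
                self.ball_vel[i] *= -1
                self.ball[i] = np.clip(self.ball[i], 0, self.SIZE - 1)
        caught = bool(np.linalg.norm(self.ball - self.effector) < 10.0)
        return self._render(), float(caught), caught, {}

    def _render(self):
        # Red ball, blue end-effector on a black 256x256 RGB canvas.
        img = np.zeros((self.SIZE, self.SIZE, 3), dtype=np.uint8)
        bx, by = self.ball.astype(int)
        ex, ey = self.effector.astype(int)
        img[max(by - 5, 0):by + 5, max(bx - 5, 0):bx + 5] = (255, 0, 0)
        img[max(ey - 5, 0):ey + 5, max(ex - 5, 0):ex + 5] = (0, 0, 255)
        return img
```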
pi/
├── env/ # Environment
│ └── moving_object.py
├── expert/ # Expert policy for demonstrations
│ └── expert_policy.py
├── policy/ # Learned policies
├── scripts/ # Training and evaluation
├── data/ # Collected demonstrations
└── visualize.py # Visualization
Visualize with expert policy:
python visualize.py

Milestones:
- Environment
- Expert policy
- Data collection
- Behavior cloning policy (single-frame)
- DAgger for single-frame BC
- Action chunking
- Transformer decoder with action queries (ACT-style)
- ViT encoder
- Flow matching
- Language conditioning (VLA)
Attempted stacking 3 frames along the channel dimension (H, W, 9) to provide temporal information. This approach had several problems:
- Zero-frame contamination: the first observations in each episode have zero-padding for older frames, teaching the model to ignore the temporal channels.
- No temporal inductive bias: Conv2d treats all 9 channels equally; it doesn't know that channels 0-2, 3-5, and 6-8 represent different time steps.
- DAgger bootstrap failure: collecting DAgger data with a poorly trained policy produces low-quality data that doesn't help.
- It's a 2015-era technique: frame stacking was popularized by DQN for Atari. Modern approaches (ACT, Pi0) handle temporal information differently:
- Encode each frame separately with a vision encoder
- Use transformer attention over frame embeddings
- Or use action chunking, which implicitly captures dynamics
Takeaway: skip frame stacking. For temporal reasoning, use action chunking or attention over frame sequences (sketched below).
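For reference, a sketch of the attention-over-frames alternative: encode each frame separately, then run a small transformer encoder across the per-frame embeddings, with a learned time embedding so the model knows which step is which. All module names and sizes are placeholders, not anything implemented in this repo.

```python
import torch
import torch.nn as nn

class FrameAttentionEncoder(nn.Module):
    """Encode each frame separately, then attend over the sequence of frame embeddings."""
    def __init__(self, d_model=256, nhead=8, num_layers=2):
        super().__init__()
        self.frame_encoder = nn.Sequential(          # stand-in for a per-frame CNN/ViT encoder
            nn.Conv2d(3, 32, 5, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, d_model),
        )
        self.time_embed = nn.Parameter(torch.randn(16, d_model))  # supports up to 16 frames
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frames):
        # frames: (B, T, 3, H, W) -> one embedding per frame, then temporal attention
        B, T = frames.shape[:2]
        emb = self.frame_encoder(frames.flatten(0, 1)).view(B, T, -1)
        emb = emb + self.time_embed[:T]              # tell the model which time step is which
        return self.temporal(emb)[:, -1]             # contextual embedding of the latest frame
```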
After abandoning multi-frame BC, we returned to single-frame BC and implemented DAgger properly.
Training setup (same for BC and BC+DAgger):
- 10k datapoints
- Learning rate: 1e-3
- Train/val split: 70/30
- Batch size: 64
- Epochs: 45
- Simple CNN encoder (2 conv layers → MLP), sketched below
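A sketch of what that encoder can look like; the layer widths here are guesses, not the trained model's.

```python
import torch.nn as nn

class BCPolicy(nn.Module):
    """Single-frame BC: 2 conv layers -> MLP -> (dx, dy). Layer sizes are illustrative."""
    def __init__(self, action_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=4), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(32 * 15 * 15, 128), nn.ReLU(),   # 256x256 input -> 15x15 feature map
            nn.Linear(128, action_dim),
        )

    def forward(self, image):
        # image: (B, 3, 256, 256), normalized to [0, 1]
        return self.head(self.encoder(image))
```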
DAgger collection (loop sketched after the notes below):
- Run the learned policy in the environment (it drifts toward edges/corners)
- Label those edge states with expert recovery actions
- Aggregate with original expert data and retrain from scratch
Why it works:
- Expert is good → stays centered → expert data lacks edge samples
- Learned policy is imperfect → visits edges → DAgger captures edge recovery data
- Result: policy learns to avoid/recover from edges
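A minimal sketch of that collection loop, assuming hypothetical helpers `expert_action(obs)` and `train_bc(dataset)` with the semantics described above.

```python
def dagger_round(env, policy, expert_action, dataset, steps=2000):
    """Roll out the learned policy, but label every visited state with the expert's action."""
    obs = env.reset()
    for _ in range(steps):
        dataset.append((obs, expert_action(obs)))  # expert labels on the learner's state distribution
        obs, _, done, _ = env.step(policy(obs))    # learner decides where to go (drifts to edges)
        if done:
            obs = env.reset()
    return dataset

# Aggregate with the original expert demos and retrain from scratch:
# dataset = expert_demos.copy()
# for _ in range(num_rounds):
#     policy = train_bc(dataset)                  # hypothetical training helper
#     dataset = dagger_round(env, policy, expert_action, dataset)
```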
Results:
| Policy | Video |
|---|---|
| Random | 00_random_policy.mp4 |
| Expert | 01_expert_policy.mp4 |
| BC (single-frame) | 03_bc_policy.mp4 |
| BC + DAgger | 06_bc_policy_dagger.mp4 |
| Multi-frame BC (abandoned) | 04_multi_img_bc_policy.mp4 |
| Action chunking (open-loop) | 07_action_chunking_policy.mp4 |
| Action chunking (RH4 + episode ends + padding) | 07_action_chunking_policy_rh4_episode_ends_padded.mp4 |
Predict 8 future actions at once instead of 1. Key idea from ACT that carries into Pi0.
Execution modes:
- Open-loop: execute all 8 actions, then re-predict
- Receding horizon: predict 8, execute fewer (e.g., 4), then re-predict (sketched below)
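The two modes differ only in how many of the predicted actions get executed before re-predicting; a sketch, with `predict_chunk` standing in for the trained chunking policy.

```python
def run_chunking_policy(env, predict_chunk, chunk_size=8, execute_horizon=4, max_steps=500):
    """Receding horizon: predict chunk_size actions, execute execute_horizon, re-predict.
    Setting execute_horizon == chunk_size gives the open-loop mode."""
    obs = env.reset()
    for _ in range(max_steps // execute_horizon):
        chunk = predict_chunk(obs)                 # (chunk_size, action_dim)
        for action in chunk[:execute_horizon]:
            obs, _, done, _ = env.step(action)
            if done:
                return
```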
Data improvements:
- Added `episode_ends` tracking to avoid cross-episode contamination
- Zero-pad action chunks at episode boundaries (chunk construction sketched below)
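A sketch of how those two data fixes can fit together when building training chunks; `episode_ends` is assumed to hold the exclusive end index of each episode, and the function name is made up.

```python
import numpy as np

def build_chunks(actions, episode_ends, chunk_size=8):
    """Build (start_index, padded_chunk) pairs without crossing episode boundaries."""
    chunks, start = [], 0
    for end in episode_ends:                       # exclusive end index of each episode
        for t in range(start, end):
            chunk = actions[t:min(t + chunk_size, end)]
            if len(chunk) < chunk_size:            # zero-pad at the episode boundary
                pad = np.zeros((chunk_size - len(chunk), actions.shape[1]))
                chunk = np.concatenate([chunk, pad], axis=0)
            chunks.append((t, chunk))
        start = end
    return chunks
```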
Results: Receding horizon (4) + clean episode data produces smoother execution. However, for this simple task, it's not dramatically better than BC + DAgger. Action chunking likely shines more on complex tasks with temporal structure.
We're already performing this task quite well with BC + DAgger. But we'll continue with this toy example because it's easier to build and compare architectures when keeping the task the same. Once we understand the full Pi0 architecture (transformers, flow matching, VLA), we can expand to harder scenarios:
- Multi-step tasks (catch ball → carry to goal)
- Partial observability (ball goes behind occluder, needs memory)
- Variable dynamics (ball behavior changes mid-episode)
- Multi-object reasoning (multiple balls, specific order)
- Longer horizon planning