Haihong Hao🍕 · Lei Chen🍕 · Mingfei Han🌭 · Changlin Li🍔 · Dong An⭐ · Yuqiang Yang🌈 · Zhihui Li🍕 · Xiaojun Chang🍕
🍕University of Science and Technology of China 🌭MBZUAI 🍔Stanford University ⭐Amap, Alibaba Group 🌈Shanghai AI Laboratory
- Dream ahead before acting. LatentPilot learns action-conditioned visual dynamics from future observations during training.
- No future frames at inference. The model internalizes future-aware reasoning while requiring only current observations at test time.
- Latent visual reasoning across time. Learned latent tokens are carried across steps, enabling compact memory and scene-aware decision making.
- Flywheel-style training. On-policy trajectory collection and retraining progressively align learning with the agent's real behavior distribution.
- Strong performance in simulation and reality. LatentPilot achieves new SOTA on R2R-CE, RxR-CE, and R2R-PE, and transfers effectively to real robots.
Existing vision-and-language navigation models mainly reason over past and current observations, while largely overlooking how actions reshape future views. LatentPilot addresses this limitation by learning action-conditioned visual dynamics from future observations during training.
Its learned latent tokens evolve across time, serve as both output and next-step input, and enable the agent to reason about what the scene will look like after acting. This future-aware mechanism helps the policy better understand environment-action causality and make more robust navigation decisions.
Existing vision-and-language navigation (VLN) models primarily reason over past and current visual observations, while largely ignoring the future visual dynamics induced by actions. As a result, they often lack an effective understanding of the causal relationship between actions and how the visual world changes, limiting robust decision-making. Humans, in contrast, can "imagine" the near future by leveraging the causal link between actions and visual dynamics, which improves both environmental understanding and navigation choices.
Inspired by this capability, we propose LatentPilot, a new paradigm that exploits future observations during training as a valuable data source for learning action-conditioned visual dynamics, while requiring no access to future frames at inference. Concretely, we design a flywheel-style training mechanism that iteratively collects on-policy trajectories and retrains the model to better match the agent's behavior distribution, with an expert takeover triggered when the agent deviates excessively.
LatentPilot further learns visual latent tokens without explicit supervision; these latent tokens attend globally in a continuous latent space and are carried across steps, serving as both the current output and the next input, thereby enabling the agent to "dream ahead" and reason about how actions will affect subsequent observations. LatentPilot achieves new SOTA results on the R2R-CE, RxR-CE, and R2R-PE benchmarks, and real-robot tests across diverse environments demonstrate its superior understanding of environment-action dynamics in real scenes.
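The carry-over of latent tokens can be sketched in a few lines. This is a minimal illustration, not the actual model: all weights, dimensions, and function names below are hypothetical, and the transformer's global attention is abstracted as simple mean-pooled mixing. What it shows is only the recurrence structure: at each step the policy consumes current-frame features plus the latent tokens emitted at the previous step, and outputs action logits together with updated latent tokens for the next step.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_LATENT, N_ACTIONS = 16, 4, 6

# Hypothetical projection weights standing in for the trained network.
W_obs = rng.standard_normal((D, D)) * 0.1
W_lat = rng.standard_normal((D, D)) * 0.1
W_act = rng.standard_normal((D, N_ACTIONS)) * 0.1

def step(obs_feat, latent_tokens):
    """One decision step: fuse the current observation with carried latents."""
    # Global attention abstracted as mean-pooled mixing for brevity.
    ctx = obs_feat @ W_obs + latent_tokens.mean(axis=0) @ W_lat
    action_logits = ctx @ W_act
    new_latents = np.tanh(latent_tokens + ctx)  # updated, carried to t+1
    return action_logits, new_latents

latents = np.zeros((N_LATENT, D))        # initial latent tokens
for t in range(3):                       # unroll a short trajectory
    obs = rng.standard_normal(D)         # current-frame features only
    logits, latents = step(obs, latents) # latents are both output and next input
```

Note that only the current observation is fed in at each step; everything the agent "remembers" or "imagines" about the scene is compressed into the small set of latent tokens it passes forward.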
LatentPilot is built around three key ideas:
- Future supervision during training to learn action-conditioned scene dynamics.
- Latent tokens across time to maintain compact, future-aware visual reasoning.
- Flywheel-style on-policy retraining to adapt learning to the policy's own behavior distribution.
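The flywheel-style loop can be sketched as a DAgger-like procedure in a toy 1-D environment. Everything here is hypothetical (the environment, the deviation threshold, and the "retraining" stand-in); the sketch captures only the loop structure: roll out the current policy, let an expert take over once the agent drifts too far from the reference path, label visited states with expert actions, then retrain on the aggregated on-policy data.

```python
import random

GOAL, MAX_DEV, HORIZON = 10, 2, 15

def expert_action(pos):
    return 1 if pos < GOAL else 0        # expert always moves toward the goal

def rollout(policy, seed):
    """Collect one on-policy trajectory with expert takeover on drift."""
    rng = random.Random(seed)
    pos, expert_pos, data = 0, 0, []
    for _ in range(HORIZON):
        a_exp = expert_action(pos)
        # Expert takes over once the agent deviates excessively.
        a = a_exp if abs(pos - expert_pos) >= MAX_DEV else policy(pos, rng)
        data.append((pos, a_exp))        # label visited states with expert actions
        pos += a
        expert_pos += expert_action(expert_pos)
    return data

def retrain(dataset):
    """Stand-in for retraining: memorize the majority expert action per state."""
    table = {}
    for s, a in dataset:
        table.setdefault(s, []).append(a)
    lookup = {s: max(set(acts), key=acts.count) for s, acts in table.items()}
    return lambda s, rng: lookup.get(s, rng.choice([0, 1]))

policy = lambda s, rng: rng.choice([0, 1])  # untrained initial policy
dataset = []
for it in range(3):                         # flywheel iterations
    for seed in range(5):
        dataset += rollout(policy, seed)
    policy = retrain(dataset)
```

Because each iteration collects data under the *current* policy, the training distribution progressively shifts toward the states the agent actually visits, which is the point of the flywheel; the expert takeover keeps trajectories from collapsing when the early policy wanders.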
- Release inference code
- Release model weights
- Release data preparation scripts
