Haihong Hao🍕 · Lei Chen🍕 · Mingfei Han🌭 · Changlin Li🍔 · Dong An⭐ · Yuqiang Yang🌈 · Zhihui Li🍕 · Xiaojun Chang🍕
🍕University of Science and Technology of China 🌭MBZUAI 🍔Stanford University ⭐Amap, Alibaba Group 🌈Shanghai AI Laboratory
- Dream ahead before acting. LatentPilot learns action-conditioned visual dynamics from future observations during training.
- No future frames at inference. The model internalizes future-aware reasoning while requiring only current observations at test time.
- Latent visual reasoning across time. Learned latent tokens are carried across steps, enabling compact memory and scene-aware decision making.
- Flywheel-style training. On-policy trajectory collection and retraining progressively align learning with the agent's real behavior distribution.
- Strong performance in simulation and reality. LatentPilot achieves new SOTA on R2R-CE, RxR-CE, and R2R-PE, and transfers effectively to real robots.
Existing vision-and-language navigation models mainly reason over past and current observations, while largely overlooking how actions reshape future views. LatentPilot addresses this limitation by learning action-conditioned visual dynamics from future observations during training.
Its learned latent tokens evolve across time, serve as both output and next-step input, and enable the agent to reason about what the scene will look like after acting. This future-aware mechanism helps the policy better understand environment-action causality and make more robust navigation decisions.
Existing vision-and-language navigation (VLN) models primarily reason over past and current visual observations, while largely ignoring the future visual dynamics induced by actions. As a result, they often lack an effective understanding of the causal relationship between actions and how the visual world changes, limiting robust decision-making. Humans, in contrast, can "imagine" the near future by leveraging the causal link between actions and visual dynamics, which improves both environmental understanding and navigation choices.
Inspired by this capability, we propose LatentPilot, a new paradigm that exploits future observations during training as a valuable data source for learning action-conditioned visual dynamics, while requiring no access to future frames at inference. Concretely, we design a flywheel-style training mechanism that iteratively collects on-policy trajectories and retrains the model to better match the agent's behavior distribution, with an expert takeover triggered when the agent deviates excessively.
LatentPilot further learns visual latent tokens without explicit supervision; these latent tokens attend globally in a continuous latent space and are carried across steps, serving as both the current output and the next input, thereby enabling the agent to "dream ahead" and reason about how actions will affect subsequent observations. LatentPilot achieves new SOTA results on the R2R-CE, RxR-CE, and R2R-PE benchmarks, and real-robot tests across diverse environments demonstrate its superior understanding of environment-action dynamics in real scenes.
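The carry-over of latent tokens can be sketched in a few lines. This is a minimal illustration, not the actual model: all weights, dimensions, and function names below are hypothetical, and the transformer's global attention is abstracted as simple mean-pooled mixing. What it shows is only the recurrence structure: at each step the policy consumes current-frame features plus the latent tokens emitted at the previous step, and outputs action logits together with updated latent tokens for the next step.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_LATENT, N_ACTIONS = 16, 4, 6

# Hypothetical projection weights standing in for the trained network.
W_obs = rng.standard_normal((D, D)) * 0.1
W_lat = rng.standard_normal((D, D)) * 0.1
W_act = rng.standard_normal((D, N_ACTIONS)) * 0.1

def step(obs_feat, latent_tokens):
    """One decision step: fuse the current observation with carried latents."""
    # Global attention abstracted as mean-pooled mixing for brevity.
    ctx = obs_feat @ W_obs + latent_tokens.mean(axis=0) @ W_lat
    action_logits = ctx @ W_act
    new_latents = np.tanh(latent_tokens + ctx)  # updated, carried to t+1
    return action_logits, new_latents

latents = np.zeros((N_LATENT, D))        # initial latent tokens
for t in range(3):                       # unroll a short trajectory
    obs = rng.standard_normal(D)         # current-frame features only
    logits, latents = step(obs, latents) # latents are both output and next input
```

Note that only the current observation is fed in at each step; everything the agent "remembers" or "imagines" about the scene is compressed into the small set of latent tokens it passes forward.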
LatentPilot is built around three key ideas:
- Future supervision during training to learn action-conditioned scene dynamics.
- Latent tokens across time to maintain compact, future-aware visual reasoning.
- Flywheel-style on-policy retraining to adapt learning to the policy's own behavior distribution.
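The flywheel-style loop can be sketched as a DAgger-like procedure in a toy 1-D environment. Everything here is hypothetical (the environment, the deviation threshold, and the "retraining" stand-in); the sketch captures only the loop structure: roll out the current policy, let an expert take over once the agent drifts too far from the reference path, label visited states with expert actions, then retrain on the aggregated on-policy data.

```python
import random

GOAL, MAX_DEV, HORIZON = 10, 2, 15

def expert_action(pos):
    return 1 if pos < GOAL else 0        # expert always moves toward the goal

def rollout(policy, seed):
    """Collect one on-policy trajectory with expert takeover on drift."""
    rng = random.Random(seed)
    pos, expert_pos, data = 0, 0, []
    for _ in range(HORIZON):
        a_exp = expert_action(pos)
        # Expert takes over once the agent deviates excessively.
        a = a_exp if abs(pos - expert_pos) >= MAX_DEV else policy(pos, rng)
        data.append((pos, a_exp))        # label visited states with expert actions
        pos += a
        expert_pos += expert_action(expert_pos)
    return data

def retrain(dataset):
    """Stand-in for retraining: memorize the majority expert action per state."""
    table = {}
    for s, a in dataset:
        table.setdefault(s, []).append(a)
    lookup = {s: max(set(acts), key=acts.count) for s, acts in table.items()}
    return lambda s, rng: lookup.get(s, rng.choice([0, 1]))

policy = lambda s, rng: rng.choice([0, 1])  # untrained initial policy
dataset = []
for it in range(3):                         # flywheel iterations
    for seed in range(5):
        dataset += rollout(policy, seed)
    policy = retrain(dataset)
```

Because each iteration collects data under the *current* policy, the training distribution progressively shifts toward the states the agent actually visits, which is the point of the flywheel; the expert takeover keeps trajectories from collapsing when the early policy wanders.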
- Release inference code
- Release model weights
- Release data preparation scripts
