oceanhao/latentpilot

Scene-Aware Vision-and-Language Navigation by Dreaming Ahead with Latent Visual Reasoning

Haihong Hao🍕 · Lei Chen🍕 · Mingfei Han🌭 · Changlin Li🍔 · Dong An · Yuqiang Yang🌈 · Zhihui Li🍕 · Xiaojun Chang🍕

🍕University of Science and Technology of China    🌭MBZUAI    🍔Stanford University    Amap, Alibaba Group    🌈Shanghai AI Laboratory

LatentPilot Logo

Links: project page · paper · code · models · license


✨ Highlights

  • Dream ahead before acting. LatentPilot learns action-conditioned visual dynamics from future observations during training.
  • No future frames at inference. The model internalizes future-aware reasoning while requiring only current observations at test time.
  • Latent visual reasoning across time. Learned latent tokens are carried across steps, enabling compact memory and scene-aware decision making.
  • Flywheel-style training. On-policy trajectory collection and retraining progressively align learning with the agent's real behavior distribution.
  • Strong performance in simulation and reality. LatentPilot achieves new SOTA on R2R-CE, RxR-CE, and R2R-PE, and transfers effectively to real robots.

🎬 Video Showcase

Main Teaser

Direct video link

First-Person View

Direct video link

Third-Person View

Direct video link


🧠 Overview

Existing vision-and-language navigation models mainly reason over past and current observations, while largely overlooking how actions reshape future views. LatentPilot addresses this limitation by learning action-conditioned visual dynamics from future observations during training.

Its learned latent tokens evolve across time, serve as both output and next-step input, and enable the agent to reason about what the scene will look like after acting. This future-aware mechanism helps the policy better understand environment-action causality and make more robust navigation decisions.
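The carry-over of latent tokens described above can be sketched as a simple recurrence: each step consumes the current observation plus the previous step's latent tokens and emits an action together with updated tokens, which become the next step's input. This is a minimal toy illustration; the function names, token update rule, and action head are all hypothetical, not the released LatentPilot architecture.

```python
from typing import List, Tuple

Latent = List[float]

def policy_step(observation: float, latents: Latent) -> Tuple[str, Latent]:
    """Toy stand-in for the policy: mixes the observation into each latent
    token and picks an action from the aggregate (illustrative only)."""
    new_latents = [0.9 * z + 0.1 * observation for z in latents]
    action = "forward" if sum(new_latents) >= 0 else "turn"
    return action, new_latents

def rollout(observations: List[float], num_tokens: int = 4) -> List[str]:
    latents: Latent = [0.0] * num_tokens   # a learned init in the real model
    actions = []
    for obs in observations:
        # Latent tokens are both this step's output and the next step's input.
        action, latents = policy_step(obs, latents)
        actions.append(action)
    return actions

print(rollout([1.0, -2.0, 0.5]))  # → ['forward', 'turn', 'turn']
```

Because the tokens persist across steps, they act as a compact memory of past observations without storing raw frames.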


🍹 Abstract

Existing vision-and-language navigation (VLN) models primarily reason over past and current visual observations, while largely ignoring the future visual dynamics induced by actions. As a result, they often lack an effective understanding of the causal relationship between actions and how the visual world changes, limiting robust decision-making. Humans, in contrast, can "imagine" the near future by leveraging action-dynamics causality, which improves both environmental understanding and navigation choices.

Inspired by this capability, we propose LatentPilot, a new paradigm that exploits future observations during training as a valuable data source to learn action-conditioned visual dynamics, while requiring no access to future frames at inference. Concretely, we propose a flywheel-style training mechanism that iteratively collects on-policy trajectories and retrains the model to better match the agent's behavior distribution, with an expert takeover triggered when the agent deviates excessively.

LatentPilot further learns visual latent tokens without explicit supervision; these latent tokens attend globally in a continuous latent space and are carried across steps, serving as both the current output and the next input, thereby enabling the agent to "dream ahead" and reason about how actions will affect subsequent observations. LatentPilot achieves new SOTA results on the R2R-CE, RxR-CE, and R2R-PE benchmarks, and real-robot tests across diverse environments demonstrate its superior understanding of environment-action dynamics in real scenes.
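The flywheel-style collection with expert takeover can be sketched as a DAgger-like loop: roll out the current policy, let an expert intervene when the agent drifts too far from the expert's reference path, and relabel every visited state with the expert's action for retraining. Everything below is a hypothetical 1-D toy — the threshold, expert, and relabeling scheme are illustrative assumptions, not the paper's exact procedure.

```python
from typing import Callable, List, Tuple

def expert(pos: float, goal: float) -> float:
    """Optimal one-step move toward the goal (step size capped at 1)."""
    return max(-1.0, min(1.0, goal - pos))

def collect_on_policy(policy: Callable[[float, float], float],
                      goal: float, steps: int,
                      takeover_dist: float = 2.0):
    """Roll out `policy`; the expert takes over whenever the agent deviates
    more than `takeover_dist` from the expert's reference position."""
    pos, ref = 0.0, 0.0
    data: List[Tuple[float, float]] = []        # (state, expert action) pairs
    takeovers = 0
    for _ in range(steps):
        ref += expert(ref, goal)                # expert reference rollout
        if abs(pos - ref) > takeover_dist:      # excessive deviation
            act = expert(pos, goal)
            takeovers += 1
        else:
            act = policy(pos, goal)
        data.append((pos, expert(pos, goal)))   # relabel with expert action
        pos += act
    return data, takeovers

lazy = lambda pos, goal: -0.5                   # a deliberately poor policy
data, takeovers = collect_on_policy(lazy, goal=5.0, steps=6)
```

Retraining on `data` then shifts the supervision toward states the agent actually visits, which is what progressively aligns learning with the policy's own behavior distribution.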


🧩 Method at a Glance

LatentPilot is built around three key ideas:

  1. Future supervision during training to learn action-conditioned scene dynamics.
  2. Latent tokens across time to maintain compact, future-aware visual reasoning.
  3. Flywheel-style on-policy retraining to adapt learning to the policy's own behavior distribution.
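Idea 1 — using future frames as a training-only signal — can be illustrated with a tiny regression toy: a predictor is fit so that (state, action) maps to an encoding of the *next* observation, which is available during training but never needed at inference. The linear model, learning rate, and ground-truth dynamics below are all illustrative assumptions, not the paper's network.

```python
def train_dynamics(transitions, lr=0.05, epochs=200):
    """Fit pred = w_s * state + w_a * action to the observed next state
    with per-sample gradient descent on a squared error."""
    w_s, w_a = 0.0, 0.0
    for _ in range(epochs):
        for s, a, s_next in transitions:   # s_next is used only in training
            err = (w_s * s + w_a * a) - s_next
            w_s -= lr * err * s            # gradient of 0.5 * err**2
            w_a -= lr * err * a
    return w_s, w_a

# Toy ground-truth dynamics: s_next = s + a (future frames seen in training).
data = [(s, a, s + a) for s in (-1.0, 0.0, 1.0, 2.0) for a in (-1.0, 1.0)]
w_s, w_a = train_dynamics(data)

# At inference the model "dreams ahead" without ever seeing a future frame:
predicted_next = w_s * 3.0 + w_a * 1.0    # close to the true next state, 4.0
```

The same asymmetry — future observations supervise training, only current observations feed inference — is what lets the policy internalize action-conditioned dynamics.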

🍻 TODOs

  • Release inference code
  • Release model weights
  • Release data preparation scripts

About

Champion of VLN-PE in InternUtopia and Real World Challenge, IROS 2025
