NVIDIA University of Modena and Reggio Emilia University of Toronto ETH Zurich
Alessandro Burzio*, Tobias Fischer*, Sven Elflein, Qunjie Zhou, Riccardo de Lutio, Jiawei Ren, Jiahui Huang, Shengyu Huang, Marc Pollefeys, Laura Leal-Taixé, Zan Gojcic+, Haithem Turki+
DéjàView (DVLT) is a recurrent transformer for multi-view 3D reconstruction. It
loops a shared block of frame/global attention with discrete depth indexing,
producing per-pixel rays, depth, confidence, and camera poses from an unordered
set of images. Trained once, the number of refinement steps K becomes an
inference-time compute knob, matching or outperforming substantially larger
feed-forward baselines at a fraction of their parameters.
This repository contains:
- The DVLT model + four configurable ablations (vanilla, decoupled blocks,
no
s_outtoken, no depth-scaling). - Evaluation wrappers for five baselines: VGGT, VGGT-Omega, Depth-Anything-3, MapAnything, and Pi3. Each wrapper imports the upstream package (installed separately — see INSTALL.md).
- A training stack built on
accelerate+ Hydra, with optional W&B logging. - A Stage-2 fine-tune recipe for the depth-conv head.
- Rerun-based visualization tools.
- Inference code
- Model weights
- Evaluation code
- eval datasets preprocess and loaders
- Training code
- ScanNet++ training dataset loader
- other training dataset loaders
See docs/INSTALL.md. The short version:
conda create -n dvlt python=3.12 && conda activate dvlt
conda install pytorch=2.5.1 torchvision pytorch-cuda=12.4 -c pytorch -c nvidia -c conda-forge
pip install -e .[all]Quick example script:
import torch
from accelerate import Accelerator
from dvlt.model.dvlt.model import DVLT
from dvlt.util.preprocess import load_sequence, preprocess_images
checkpoint_path = "nvidia/dvlt" # local dir, HTTPS URL, or HF Hub repo id
# load_sequence accepts a directory, a single video, or an explicit list of files.
input_path = "path/to/scene_dir"
# Or: input_path = "path/to/clip.mp4"
# Or: from glob import glob; input_path = sorted(glob("path/to/scene_dir/*.png"))
accelerator = Accelerator(mixed_precision="bf16")
model = DVLT(img_size=504)
model.load_pretrained(checkpoint_path, strict=True)
model.setup_test(accelerator)
_, frames = load_sequence(input_path)
batch = preprocess_images(frames, img_size=504, patch_size=14, device=accelerator.device)
with torch.no_grad(), accelerator.autocast():
predictions = model.predict(batch, accelerator)
cameras = predictions["cameras"][0] # Cameras object with shape [S]
extrinsics_c2w = cameras.camera_to_worlds # (S, 3, 4) — OpenCV convention [R | t]
intrinsics = cameras.get_intrinsics_matrices() # (S, 3, 3)
depths = predictions["depths"][0] # (S, H, W)
world_points = predictions["world_points"][0] # (S, H, W, 3)# Single-GPU
python -m dvlt.scripts.train --config-name dvlt-large data=scannetpp
# Multi-GPU (4 GPUs)
accelerate launch --num-processes 4 -m dvlt.scripts.train --config-name dvlt-large data=scannetpp
# Resume
python -m dvlt.scripts.train \
--config-dir=outputs/<run> \
--config-name=config.yaml \
trainer.resume_from_checkpoint=latestbenchmark_lite (DTU, ETH3D, 7Scenes) is a convenience benchmark over the
datasets that don't require heavy preprocessing; the full benchmark adds
ScanNet++ and
NuScenes.
python -m dvlt.scripts.test --config-name dvlt data=benchmark
# multi-GPU: accelerate launch --num-processes <N> -m dvlt.scripts.test --config-name dvlt data=benchmark
python -m dvlt.scripts.test --config-name dvlt data=benchmark_lite
# multi-GPU: accelerate launch --num-processes <N> -m dvlt.scripts.test --config-name dvlt data=benchmark_liteDVLT reference results on the full benchmark:
| Dataset | Pose AUC@3 | Pose AUC@30 | Depth inlier@3% | Depth AbsRel |
|---|---|---|---|---|
| DTU | 0.8319 | 0.9880 | 0.9706 | 0.0093 |
| ETH3D | 0.6604 | 0.9536 | 0.7717 | 0.0267 |
| 7Scenes | 0.1393 | 0.8172 | 0.7437 | 0.0349 |
| ScanNet++ | 0.7941 | 0.9803 | 0.9239 | 0.0167 |
| NuScenes | 0.4340 | 0.8534 | 0.5853 | 0.0673 |
Browser UI for uploading images / video and exploring the predicted 3D point cloud, depth maps and camera trajectory. The dropdown switches between DVLT and the baseline wrappers (VGGT, VGGT-Omega, DA3, Pi3, MapAnything); each baseline requires its upstream package installed (see docs/INSTALL.md).
# Launch on http://localhost:7860 (DVLT preselected)
python -m dvlt.scripts.gradio_demoThe same script also has a headless offline mode that skips Gradio
and writes a .glb + .rrd per (sequence, model) under
demo_outputs/<sequence_name>/. --input accepts a directory of images, a
single image, or a video file (mp4/mov/gif/...), and may be repeated to
process multiple sequences in one go; --models is a comma-separated list of
config names from the curated registry (or all).
# Run two models on two sequences (one image dir, one video)
python -m dvlt.scripts.gradio_demo --offline \
--input /path/to/scene_dir \
--input /path/to/clip.mp4 \
--models dvlt
# Run every registered model on one sequence
python -m dvlt.scripts.gradio_demo --offline --input /path/to/scene_dir --models allDVLT uses Hydra for configuration. Top-level experiment
configs live in src/dvlt/config/experiments/:
| Config | Description |
|---|---|
dvlt-large |
Stage-1 recipe (large model, full training schedule, linear depth head). |
dvlt-large-ablation |
Vanilla ablation parent — toggle decoupled blocks, no-s_out, no-depthscale via overrides. |
dvlt-large-ablation-decoupled |
Fully decoupled blocks (recurrence_mode=none, no looping): a distinct block per step, fixed 16 steps. |
dvlt-large-depthconv-stage2 |
Stage-2 depth-conv head fine-tune (matches the released checkpoint and the model's default depth_head_type="conv"). |
dvlt |
Inference-only alias for the released stage-2 checkpoint. |
vggt, vggt_omega, da3-{base,large,giant}, pi3, pi3x, mapanything |
Eval-only baseline wrappers. Require the upstream package installed (see INSTALL.md). |
Per-user settings (most importantly, the dataset root) live in
src/dvlt/config/experiments/user/. Copy default.yaml to local.yaml,
edit data_root, and select it via user=local:
python -m dvlt.scripts.train --config-name dvlt-large data=scannetpp user=localuser.data_root can also be overridden inline or via the DVLT_DATA_ROOT
environment variable.
Pick a single curated dataset config:
python -m dvlt.scripts.train --config-name dvlt-large data=scannetpp
python -m dvlt.scripts.train --config-name dvlt-large data=mixed_allFor scripts using the @cli decorator (train, test, visualize):
eval "$(python -m dvlt.scripts.train -sc install)"
# later, to remove:
eval "$(python -m dvlt.scripts.train -sc uninstall)"- docs/INSTALL.md — environment setup + baseline installs
- docs/data/DATA.md — data pipeline overview + how to add a new dataset parser
- docs/CONTRIB.md — dev setup, code style, tests
- docs/TESTING.md — full test-runner documentation
We are also grateful to several other open-source repositories that we drew inspiration from or built upon during the development of our pipeline:
If you find this work useful, please cite:
@article{burzio2026dejaview,
title = {D\'ej\`a View: Looping Transformers for Multi-View 3D Reconstruction},
author = {Burzio, Alessandro and Fischer, Tobias and Elflein, Sven and Zhou, Qunjie and de Lutio, Riccardo and Ren, Jiawei and Huang, Jiahui and Huang, Shengyu and Pollefeys, Marc and Leal-Taix{\'e}, Laura and Gojcic, Zan and Turki, Haithem},
journal = {arXiv preprint arXiv:2605.30215},
year = {2026}
}The DVLT code is released under the Apache License, Version 2.0 — see
LICENSE. The model weights (the nvidia/dvlt checkpoint) are
released under the NVIDIA License — non-commercial, research-and-evaluation
use only; see LICENSES/NVIDIA-LICENSE.txt.
Portions of the codebase are adapted from third-party open-source projects (DINOv2, PyTorch3D, MoGe, AnyCalib, MultiNeRF, Depth-Anything-3, VGGT). Each adapted file carries the upstream copyright + license notice in its header; see THIRD_PARTY_LICENSES.md for the full attribution map and full upstream license texts. The VGGT-derived files are distributed under the VGGT License; see LICENSES/VGGT-LICENSE.txt.
The baseline evaluation wrappers in src/dvlt/model/{vggt,vggt_omega,da3,mapanything,pi3}/
import (do not vendor) their respective upstream packages, each of which is
governed by its own license — see
THIRD_PARTY_LICENSES.md §"Upstream packages used
for evaluation".
