This repository contains the code accompanying our ICCV 2023 paper XVO: Generalized Visual Odometry via Cross-Modal Self-Training. Please see our project page for more details.
We propose XVO, a semi-supervised learning method for training generalized monocular Visual Odometry (VO) models with robust off-the-shelf operation across diverse datasets and settings. XVO efficiently learns to recover relative pose with real-world scale from visual scene semantics, i.e., without relying on any known camera parameters. Our key contributions are twofold. First, we empirically demonstrate the benefits of semi-supervised training for learning a general-purpose direct VO regression network. Second, we demonstrate multi-modal supervision, including segmentation, flow, depth, and audio auxiliary prediction tasks, to facilitate generalized representations for the VO task.
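At a high level, the semi-supervised stage follows a pseudo-labeling self-training pattern: a teacher model labels unlabeled video, confident predictions are kept, and a student is retrained on them. The sketch below is a minimal illustration of that pattern only; the helper names (`teacher_predict`, `confidence`, `train_student`) are hypothetical and not the repository's actual API.

```python
# Minimal sketch of pseudo-label self-training for VO.
# All helper callables here are hypothetical placeholders, not XVO's real API.

def self_train(teacher_predict, confidence, train_student, unlabeled_clips,
               threshold=0.8):
    """Generate pseudo pose labels on unlabeled video and retrain a student.

    teacher_predict: clip -> predicted relative pose
    confidence:      (clip, pose) -> scalar confidence in [0, 1]
    train_student:   list[(clip, pose)] -> trained student model
    """
    pseudo_labeled = []
    for clip in unlabeled_clips:
        pose = teacher_predict(clip)             # teacher's pose estimate
        if confidence(clip, pose) >= threshold:  # keep only confident labels
            pseudo_labeled.append((clip, pose))
    return train_student(pseudo_labeled)
```

In XVO, the auxiliary segmentation, flow, depth, and audio prediction tasks are trained alongside this loop to regularize the shared representation.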
We use the KITTI, Argoverse 2, and nuScenes datasets, along with in-the-wild YouTube videos (available soon). Please see their websites for dataset setup.
| Dataset | Download Link |
|---|---|
| KITTI | Link |
| Argoverse 2 | Link |
| nuScenes | Link |
| YouTube | Coming Soon |
```shell
# create a new environment
conda create --name xvo python=3.9
conda activate xvo
# install PyTorch 1.13.0 with CUDA 11.6
conda install pytorch=1.13.0 torchvision pytorch-cuda=11.6 -c pytorch -c nvidia
```
Our environment also requires PyTorch3D; please refer to the pytorch3d repository for installation guidelines.
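For reference, PyTorch3D can typically be installed via conda or pip; check the official instructions for the build matching your CUDA and PyTorch versions.

```shell
# Option 1: conda package from the official pytorch3d channel
conda install pytorch3d -c pytorch3d

# Option 2: build from source via pip
pip install "git+https://github.com/facebookresearch/pytorch3d.git"
```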
Coming soon!
```shell
python3 test.py
```
```shell
cd vo-eval-tool
python3 eval_odom.py
```
The VO evaluation tool is adapted from https://github.com/Huangying-Zhan/kitti-odom-eval.
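As a rough illustration of the kind of metric such tools report, the translation component of absolute trajectory error (ATE) is the RMSE between corresponding camera positions. The sketch below is simplified: the actual tool also aligns the trajectories and reports per-length relative translation and rotation errors.

```python
import numpy as np

def translation_ate(gt_xyz, pred_xyz):
    """Root-mean-square error between corresponding 3D camera positions.

    gt_xyz, pred_xyz: (N, 3) arrays of positions per frame.
    Note: real evaluation tools first align the trajectories (e.g. with a
    rigid or similarity transform); this sketch assumes pre-aligned input.
    """
    gt = np.asarray(gt_xyz, dtype=float)
    pred = np.asarray(pred_xyz, dtype=float)
    errors = np.linalg.norm(gt - pred, axis=1)  # per-frame position error
    return float(np.sqrt(np.mean(errors ** 2)))
```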
We find that incorporating audio and segmentation tasks as part of the semi-supervised learning process significantly improves ego-pose estimation on KITTI.
Please don't hesitate to contact us if you have any remarks or questions at leilai@bu.edu or sgzk@bu.edu.
Our work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
- Test code release
- Training code release
- README update