UniM²AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving

Paper | BibTeX

This is the official PyTorch implementation of the paper "UniM²AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving".

Results

Pre-training

We provide our pretrained weights. You can load the pretrained UniM²AE (UniM²AE for BEVFusion and UniM²AE-sst-pre for SST) to train the multi-modal detector (BEVFusion) or the LiDAR-only detector (SST).

Model	Modality	Checkpoint
UniM²AE	C+L	Link
UniM²AE-sst-pre	L	Link
swint-nuImages	C	Link

Note: The checkpoint pretrained on nuImages (denoted as swint-nuImages) is provided by BEVFusion.

3D Object Detection (on nuScenes validation)

Model	Modality	mAP	NDS	Checkpoint
TransFusion-L-SST	L	65.0	69.9	Link
UniM²AE-L	L	65.7	70.4	Link
BEVFusion-SST	C+L	68.2	71.5	Link
UniM²AE	C+L	68.4	71.9	Link
UniM²AE w/MMIM	C+L	69.7	72.7	Link

3D Object Detection (on nuScenes test)

Model	Modality	mAP	NDS
UniM²AE-L	L	67.9	72.2
UniM²AE	C+L	70.3	73.3

Here, we train UniM²AE-L and UniM²AE on the trainval split of the nuScenes dataset and test them without any test-time augmentation.

BEV Map Segmentation (on nuScenes validation)

Model	Modality	mIoU	Checkpoint
BEVFusion	C	51.2	Link
UniM²AE	C	52.9	Link
BEVFusion-SST	C+L	61.3	Link
UniM²AE	C+L	61.4	Link
UniM²AE w/MMIM	C+L	67.8	Link

Prerequisites

Pre-training

Python == 3.8
mmcv-full == 1.4.0
mmdetection == 2.14.0
torch == 1.9.1+cu111
torchvision == 0.10.1+cu111
numpy == 1.19.5
matplotlib == 3.6.2
pyquaternion == 0.9.9
scikit-learn == 1.1.3
setuptools == 59.5.0

After installing these dependencies, please run this command to install the codebase:

cd Pretrain
python setup.py develop
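
The pinned versions listed above can be compared against the current environment with a short script. This is a minimal sketch, not part of the repository: the helper names are hypothetical, and the pip distribution name `mmdet` for mmdetection is an assumption.

```python
from importlib import metadata

# Pinned versions copied from the prerequisites list in this README.
# Assumption: mmdetection is installed under the distribution name "mmdet".
PINNED = {
    "mmcv-full": "1.4.0",
    "mmdet": "2.14.0",
    "torch": "1.9.1+cu111",
    "torchvision": "0.10.1+cu111",
    "numpy": "1.19.5",
    "matplotlib": "3.6.2",
    "pyquaternion": "0.9.9",
    "scikit-learn": "1.1.3",
    "setuptools": "59.5.0",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose version differs to (expected, found)."""
    mismatches = {}
    for name, expected in pinned.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

The script prints nothing when every package matches its pin. The comparison is an exact string match, so a build with a different CUDA suffix is reported as a mismatch.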

Fine-tuning

The fine-tuning code is built on different libraries. Please refer to BEVFusion and Voxel-MAE.

Data Preparation

We follow the instructions from here to download the nuScenes dataset. Please remember to download both the detection dataset and the map extension (needed for BEV map segmentation).

After downloading the nuScenes dataset, please preprocess it by running:

cd Finetune/bevfusion/
python tools/create_data.py nuscenes --root-path ./data/nuscenes --out-dir ./data/nuscenes --extra-tag nuscenes

Then create soft links to the data in Pretrain/data and Finetune/sst/data with ln -s.
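
For readers who prefer to script the linking step, the following is a minimal sketch of an equivalent to `ln -s`. The helper `link_dataset` is hypothetical, and linking the `nuscenes` directory itself (rather than the whole `data` directory) is an assumption based on the directory structure shown in this README.

```python
import os
from pathlib import Path


def link_dataset(source, target_dirs, name="nuscenes"):
    """Create `<target_dir>/<name>` symlinks that point at `source`.

    Existing links or directories are left untouched. Returns the links
    that were actually created.
    """
    source = Path(source).resolve()
    created = []
    for target_dir in target_dirs:
        target_dir = Path(target_dir)
        target_dir.mkdir(parents=True, exist_ok=True)
        link = target_dir / name
        if link.is_symlink() or link.exists():
            continue
        os.symlink(source, link, target_is_directory=True)
        created.append(link)
    return created


# Example call, run from the repository root (paths follow this README):
# link_dataset("Finetune/bevfusion/data/nuscenes",
#              ["Pretrain/data", "Finetune/sst/data"])
```

The same helper could be called with `name="can_bus"` if the CAN bus data should be linked as well; whether that is required depends on the configs in use.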

After data preparation, the directory structure is as follows:

UniM2AE
├──Finetune
│   ├──bevfusion
│   │   ├──tools
│   │   ├──configs
│   │   ├──data
│   │   │   ├── can_bus
│   │   │   │   ├── ...
│   │   │   ├──nuscenes
│   │   │   │   ├── maps
│   │   │   │   ├── samples
│   │   │   │   ├── sweeps
│   │   │   │   ├── v1.0-test
│   │   │   │   ├── v1.0-trainval
│   │   │   │   ├── nuscenes_database
│   │   │   │   ├── nuscenes_infos_train.pkl
│   │   │   │   ├── nuscenes_infos_val.pkl
│   │   │   │   ├── nuscenes_infos_test.pkl
│   │   │   │   ├── nuscenes_dbinfos_train.pkl
│   ├──sst
│   │   ├──data
│   │   │   ├──nuscenes
│   │   │   │   ├── ...
├──Pretrain
│   ├──mmdet3d
│   ├──tools
│   ├──configs
│   ├──data
│   │   ├── can_bus
│   │   │   ├── ...
│   │   ├──nuscenes
│   │   │   ├── ...
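
A quick way to confirm that the layout above is in place is to check each expected path. This is a minimal sketch with a hypothetical helper, `missing_entries`; the path list is transcribed from the tree above and can be trimmed, for example if the test split is not needed.

```python
from pathlib import Path

# Paths transcribed from the directory tree in this README,
# relative to the repository root.
EXPECTED = [
    "Finetune/bevfusion/data/can_bus",
    "Finetune/bevfusion/data/nuscenes/maps",
    "Finetune/bevfusion/data/nuscenes/samples",
    "Finetune/bevfusion/data/nuscenes/sweeps",
    "Finetune/bevfusion/data/nuscenes/v1.0-test",
    "Finetune/bevfusion/data/nuscenes/v1.0-trainval",
    "Finetune/bevfusion/data/nuscenes/nuscenes_database",
    "Finetune/bevfusion/data/nuscenes/nuscenes_infos_train.pkl",
    "Finetune/bevfusion/data/nuscenes/nuscenes_infos_val.pkl",
    "Finetune/bevfusion/data/nuscenes/nuscenes_infos_test.pkl",
    "Finetune/bevfusion/data/nuscenes/nuscenes_dbinfos_train.pkl",
    "Finetune/sst/data/nuscenes",
    "Pretrain/data/can_bus",
    "Pretrain/data/nuscenes",
]


def missing_entries(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    for rel in missing_entries("."):
        print(f"missing: {rel}")
```

Symlinked directories count as present as long as the link target exists, so a dangling soft link is reported as missing.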

Pre-training

Training

Please run:

cd Pretrain
bash tools/dist_train.sh configs/unim2ae_mmim.py 8

Then convert the pre-trained checkpoint so that it can be used for fine-tuning:

cd Pretrain
python tools/convert.py --source work_dirs/unim2ae_mmim/epoch_200.pth --target ../Finetune/bevfusion/pretrained/unim2ae-pre.pth
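
The conversion command above expects `epoch_200.pth` to exist. If pre-training stopped at a different epoch, a helper along these lines can pick the highest-numbered checkpoint to pass to `--source`. This is a sketch under the assumption that checkpoints follow the `epoch_<N>.pth` naming seen in the command; `latest_epoch_checkpoint` is a hypothetical name, not a function from the repository.

```python
import re
from pathlib import Path

_EPOCH_PATTERN = re.compile(r"epoch_(\d+)\.pth")


def latest_epoch_checkpoint(work_dir):
    """Return the `epoch_<N>.pth` file with the largest N, or None."""
    best = None
    for path in Path(work_dir).glob("epoch_*.pth"):
        match = _EPOCH_PATTERN.fullmatch(path.name)
        if match is None:
            continue  # skip names such as epoch_best.pth
        epoch = int(match.group(1))
        # Compare numerically so that epoch_10 ranks above epoch_9.
        if best is None or epoch > best[0]:
            best = (epoch, path)
    return None if best is None else best[1]


if __name__ == "__main__":
    print(latest_epoch_checkpoint("work_dirs/unim2ae_mmim"))
```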

Visualization

To get the reconstruction results of the images and the LiDAR point cloud, please run:

cd Pretrain
python tools/test.py configs/unim2ae_mmim.py --checkpoint [pretrain checkpoint path] --show-pretrain --show-dir viz

Fine-tuning

We provide instructions to fine-tune BEVFusion and Voxel-MAE.

BEVFusion

Training

If you want to train the LiDAR-only UniM²AE-L for object detection, please run:

cd Finetune/bevfusion
torchpack dist-run -np 8 python tools/train.py configs/nuscenes/det/transfusion/secfpn/lidar/sstv2.yaml --load_from pretrained/unim2ae-lidar-only-pre.pth

For the UniM²AE w/MMIM detection model, please run:

cd Finetune/bevfusion

python tools/convert.py --source [lidar-only UniM2AE-L checkpoint file path] --fuser pretrained/unim2ae-pre.pth --target pretrained/unim2ae-stage1.pth --stage2

torchpack dist-run -np 8 python tools/train.py configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/unim2ae_MMIM.yaml --load_from pretrained/unim2ae-stage1.pth

If you want to initialize the camera backbone with weights pretrained on nuImages, please run:

torchpack dist-run -np 8 python tools/train.py configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/unim2ae_MMIM.yaml --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth --load_from pretrained/unim2ae-stage1-L.pth

For the UniM²AE detection model, please run:

cd Finetune/bevfusion

torchpack dist-run -np 8 python tools/train.py configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/bevfusion_sst.yaml --load_from pretrained/unim2ae-stage1.pth

If you want to initialize the camera backbone with weights pretrained on nuImages, please run:

cd Finetune/bevfusion

torchpack dist-run -np 8 python tools/train.py configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/bevfusion_sst.yaml --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth --load_from pretrained/unim2ae-L-det.pth

Note: unim2ae-L-det.pth is the checkpoint obtained by training the LiDAR-only UniM²AE-L for object detection.

For the camera-only UniM²AE segmentation model, please run:

cd Finetune/bevfusion
torchpack dist-run -np 8 python tools/train.py configs/nuscenes/seg/camera-bev256d2.yaml --load_from pretrained/unim2ae-seg-c-pre.pth

For the UniM²AE segmentation model, please run:

cd Finetune/bevfusion
torchpack dist-run -np 8 python tools/train.py configs/nuscenes/seg/fusion-sst.yaml --load_from pretrained/unim2ae-pre.pth

If you want to initialize the camera backbone with weights pretrained on nuImages, please run:

cd Finetune/bevfusion
torchpack dist-run -np 8 python tools/train.py configs/nuscenes/seg/fusion-sst.yaml --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth --load_from pretrained/unim2ae-seg-pre.pth

For the UniM²AE w/MMIM segmentation model, please run:

cd Finetune/bevfusion
torchpack dist-run -np 8 python tools/train.py configs/nuscenes/seg/unim2ae_MMIM.yaml --load_from pretrained/unim2ae-pre.pth

If you want to initialize the camera backbone with weights pretrained on nuImages, please run:

cd Finetune/bevfusion
torchpack dist-run -np 8 python tools/train.py configs/nuscenes/seg/unim2ae_MMIM.yaml --model.encoders.camera.backbone.init_cfg.checkpoint pretrained/swint-nuimages-pretrained.pth --load_from pretrained/unim2ae-seg-pre.pth
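
The training commands in this section share one shape: a config, a `--load_from` checkpoint, and an optional override for the camera backbone checkpoint. The sketch below assembles that command line from its parts. `build_train_command` is a hypothetical helper written for illustration; the option names are copied from the commands above, not from any documented API.

```python
import shlex

# Override option exactly as it appears in the commands in this README.
CAMERA_CKPT_OPTION = "--model.encoders.camera.backbone.init_cfg.checkpoint"


def build_train_command(config, load_from, camera_checkpoint=None, num_gpus=8):
    """Assemble the torchpack training command as a list of arguments."""
    command = [
        "torchpack", "dist-run", "-np", str(num_gpus),
        "python", "tools/train.py", config,
    ]
    if camera_checkpoint is not None:
        # Optional: initialize the camera backbone from a separate checkpoint.
        command += [CAMERA_CKPT_OPTION, camera_checkpoint]
    command += ["--load_from", load_from]
    return command


if __name__ == "__main__":
    cmd = build_train_command(
        "configs/nuscenes/seg/unim2ae_MMIM.yaml",
        "pretrained/unim2ae-seg-pre.pth",
        camera_checkpoint="pretrained/swint-nuimages-pretrained.pth",
    )
    print(shlex.join(cmd))
```

Returning a list keeps the result usable with `subprocess.run(cmd, cwd="Finetune/bevfusion")` without shell quoting concerns, while `shlex.join` gives a copy-pasteable string.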

Evaluation

Please run:

cd Finetune/bevfusion
torchpack dist-run -np 8 python tools/test.py [config file path] pretrained/[checkpoint name].pth --eval [evaluation type]

For example, if you want to evaluate the detection model, please run:

cd Finetune/bevfusion
torchpack dist-run -np 8 python tools/test.py configs/nuscenes/det/transfusion/secfpn/camera+lidar/swint_v0p075/unim2ae_MMIM.yaml pretrained/unim2ae-mmim-det.pth --eval bbox

If you want to evaluate the segmentation model, please run:

cd Finetune/bevfusion
torchpack dist-run -np 8 python tools/test.py configs/nuscenes/seg/unim2ae_MMIM.yaml pretrained/unim2ae-mmim-seg.pth --eval map
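
In the two examples above, detection is evaluated with `--eval bbox` and segmentation with `--eval map`. The following sketch picks the evaluation type from the config path. It assumes that a `det` or `seg` path component identifies the task, which holds for the configs shown in this README but is not stated by the repository; both function names are hypothetical.

```python
# Evaluation types as used in the example commands in this README.
EVAL_TYPE_BY_TASK = {"det": "bbox", "seg": "map"}


def eval_type_for_config(config_path):
    """Infer the --eval value from a `det` or `seg` path component."""
    parts = config_path.split("/")
    for task, eval_type in EVAL_TYPE_BY_TASK.items():
        if task in parts:
            return eval_type
    raise ValueError(f"cannot infer task from config path: {config_path}")


def build_test_command(config, checkpoint, num_gpus=8):
    """Assemble the torchpack evaluation command as a list of arguments."""
    return [
        "torchpack", "dist-run", "-np", str(num_gpus),
        "python", "tools/test.py", config, checkpoint,
        "--eval", eval_type_for_config(config),
    ]


if __name__ == "__main__":
    print(" ".join(build_test_command(
        "configs/nuscenes/seg/unim2ae_MMIM.yaml",
        "pretrained/unim2ae-mmim-seg.pth",
    )))
```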

SST

Training

To train the LiDAR-only anchor-based detector, please run:

cd Finetune/sst
bash tools/dist_train.sh configs/sst_refactor/sst_10sweeps_VS0.5_WS16_ED8_epochs288_intensity.py 8 --cfg-options 'load_from=pretrained/unim2ae-sst-pre.pth'

Evaluation

To evaluate the LiDAR-only anchor-based detector, please run:

cd Finetune/sst
bash tools/dist_test.sh configs/sst_refactor/sst_10sweeps_VS0.5_WS16_ED8_epochs288_intensity.py [checkpoint file path] 8

Acknowledgement

UniM²AE is based on mmdetection3d. This repository is also inspired by the following outstanding contributions to the open-source community: 3DETR, BEVFormer, DETR, BEVFusion, MAE, Voxel-MAE, GreenMIM, SST, TransFusion.

Citation

If you find UniM²AE helpful to your research, please consider citing our work:

@article{zou2023unim,
  title={UniM$^2$AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving},
  author={Zou, Jian and Huang, Tianyu and Yang, Guanglei and Guo, Zhenhua and Zuo, Wangmeng},
  journal={arXiv preprint arXiv:2308.10421},
  year={2023}
}
