This repository contains the official code for "DMR: Decomposed Multi-Modality Representations for Frames and Events Fusion in Visual Reinforcement Learning".
- [2024/02/26]: DMR is accepted to CVPR 2024.
- [2023/11/18]: DMR is submitted to CVPR 2024 for review.
If this work is helpful for your research, please consider citing the following BibTeX entry.
@inproceedings{xu2024dmr,
  title={DMR: Decomposed Multi-Modality Representations for Frames and Events Fusion in Visual Reinforcement Learning},
  author={Xu, Haoran and Peng, Peixi and Tan, Guang and Li, Yuan and Xu, Xinhai and Tian, Yonghong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={26508--26518},
  year={2024}
}
We explore visual reinforcement learning (RL) with two complementary visual modalities: a frame-based RGB camera and an event-based Dynamic Vision Sensor (DVS). Existing multi-modality visual RL methods struggle to extract task-relevant information from the multiple modalities while suppressing the increased noise, because they rely only on indirect reward signals rather than pixel-level supervision. To tackle this, we propose the Decomposed Multi-Modality Representation (DMR) framework for visual RL. It explicitly decomposes the inputs into three distinct components: combined task-relevant features (co-features), RGB-specific noise, and DVS-specific noise. The co-features capture all information from both modalities that is relevant to the RL task; the two noise components, each constrained by a data-reconstruction loss to avoid information leakage, are contrasted with the co-features to maximize their difference.
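For intuition, below is a minimal PyTorch sketch of this decomposition. It is illustrative only and is not the released implementation: the encoder/decoder architectures, the averaging-based fusion, the channel counts, and the cosine-based separation term are all placeholder assumptions standing in for the actual contrastive machinery.

```python
# Illustrative sketch only (not the released implementation): two modality
# encoders produce a shared co-feature plus a modality-specific "noise" code;
# each modality is reconstructed from (co-features + its own noise), and the
# noise codes are pushed away from the co-features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvEncoder(nn.Module):
    """Tiny CNN encoder; layer sizes are placeholders."""
    def __init__(self, in_ch, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, x):
        return self.net(x)

class DMRSketch(nn.Module):
    """Assumes img_size x img_size inputs; channel counts are placeholders."""
    def __init__(self, rgb_ch=3, dvs_ch=5, feat_dim=128, img_size=64):
        super().__init__()
        self.rgb_enc = ConvEncoder(rgb_ch, feat_dim)
        self.dvs_enc = ConvEncoder(dvs_ch, feat_dim)
        # heads that split each embedding into a co-feature part and a noise part
        self.rgb_co = nn.Linear(feat_dim, feat_dim)
        self.rgb_noise = nn.Linear(feat_dim, feat_dim)
        self.dvs_co = nn.Linear(feat_dim, feat_dim)
        self.dvs_noise = nn.Linear(feat_dim, feat_dim)
        # toy decoders: reconstruct each modality from (co-features, its own noise)
        self.rgb_dec = nn.Linear(2 * feat_dim, rgb_ch * img_size * img_size)
        self.dvs_dec = nn.Linear(2 * feat_dim, dvs_ch * img_size * img_size)

    def forward(self, rgb, dvs):
        hr, hd = self.rgb_enc(rgb), self.dvs_enc(dvs)
        co = 0.5 * (self.rgb_co(hr) + self.dvs_co(hd))   # fused task-relevant code
        nr, nd = self.rgb_noise(hr), self.dvs_noise(hd)  # modality-specific noise
        rec_rgb = self.rgb_dec(torch.cat([co, nr], dim=-1)).view(rgb.shape)
        rec_dvs = self.dvs_dec(torch.cat([co, nd], dim=-1)).view(dvs.shape)
        return co, nr, nd, rec_rgb, rec_dvs

def dmr_losses(rgb, dvs, co, nr, nd, rec_rgb, rec_dvs):
    # reconstruction constrains each noise branch so no information is discarded
    rec = F.mse_loss(rec_rgb, rgb) + F.mse_loss(rec_dvs, dvs)
    # separation term: drive the noise codes away from the co-features
    sep = F.cosine_similarity(co, nr).mean() + F.cosine_similarity(co, nd).mean()
    return rec + sep  # the RL objective would be added on top of this
```

In this sketch the reconstruction keeps each noise code faithful to its modality while the separation term plays the role of the contrastive objective described above; the actual repository couples these representations to the RL policy and uses its own fusion and contrastive losses.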
Overview of the DMR learning framework:
- Illustration of our motivation
- Illustration of the decomposition capability of DMR
- A long-sequence demonstration
| Time | RGB Frame | DVS Events | RGB Noise | DVS Noise | Co-features on RGB | Co-features on DVS |
| --- | --- | --- | --- | --- | --- | --- |
| Time #1 | | | | | | |
| Time #2 | | | | | | |
| Time #3 | | | | | | |
The table above shows a vehicle with high-beam headlights in the opposite lane approaching from far to near at three time instances: Time #1, #2, and #3. The RGB noise emphasizes the vehicle's high-beam headlights and the buildings on the right, whereas the DVS noise focuses on the dense event region on the right. Both types of noise contain a substantial amount of task-irrelevant information and cover unnecessarily broad areas. In contrast, the co-features produce a more focused activation that excludes irrelevant regions and precisely covers the vehicle in the opposite lane and the right roadside, which are crucial cues for driving policies.
The variations in Class Activation Mapping (CAM) closely mirror the changes in the real scene throughout the sequence. As the vehicle approaches, the RGB noise broadens due to the illumination change, while the co-features focus more tightly on the vehicle. Within the co-features there is also a gradual increase in emphasis on the left roadside, and the CAM uniformly covers the right roadside.
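For readers who want to produce this kind of heatmap themselves, the sketch below shows a generic Grad-CAM-style routine over a convolutional encoder. It is a stand-in under stated assumptions (which layer is tapped and which scalar is backpropagated are not specified in this README), not the authors' visualization code.

```python
# Generic Grad-CAM-style heatmap over a conv encoder (illustrative only; the
# tapped layer and the backpropagated scalar are assumptions here).
import torch
import torch.nn.functional as F

def grad_cam(conv_trunk, head, x, scalar_fn=lambda z: z.norm(dim=1).sum()):
    feats = conv_trunk(x)                           # (B, C, H, W) feature maps
    z = head(feats)                                 # embedding fed to the policy
    grads, = torch.autograd.grad(scalar_fn(z), feats)
    weights = grads.mean(dim=(2, 3), keepdim=True)  # per-channel importance
    cam = F.relu((weights * feats).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[-2:], mode='bilinear', align_corners=False)
    lo = cam.amin(dim=(2, 3), keepdim=True)
    hi = cam.amax(dim=(2, 3), keepdim=True)
    return (cam - lo) / (hi - lo + 1e-8)            # (B, 1, H, W) map in [0, 1]
```

The returned map can be overlaid on the corresponding RGB frame or DVS representation to obtain figures like the rows above.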
- create a Python environment with conda:
conda create -n carla-py37 python=3.7 -y
conda activate carla-py37
pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu117
pip install -U gym==0.17.3 cloudpickle==1.5.0 numba==0.51.2 wincertstore==0.2 tornado==4.5.3 msgpack-python==0.5.6 msgpack-rpc-python==0.4.1 stable-baselines3==0.8.0 opencv-python==4.7.0.72 imageio[ffmpeg]==2.28.0 dotmap==1.3.30 termcolor==2.3.0 matplotlib==3.5.3 seaborn-image==0.4.4 scipy==1.7.3 info-nce-pytorch==0.1.4 spikingjelly cupy-cuda117 scikit-image tensorboard kornia timm einops -i https://pypi.tuna.tsinghua.edu.cn/simple
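Optionally (this check is not part of the original instructions), verify that the CUDA 11.7 build of PyTorch is the one that got installed:

```python
# optional sanity check: confirm the CUDA build of PyTorch is active
import torch
print(torch.__version__)          # expected: 1.13.1+cu117
print(torch.cuda.is_available())  # expected: True on a machine with a CUDA GPU
```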
- download CARLA 0.9.13 from https://github.com/carla-simulator/carla/releases
- unzip CARLA and install its Python API:
cd carla_root_directory/PythonAPI/carla/dist
pip install carla-0.9.13-cp37-cp37m-manylinux_2_27_x86_64.whl
- run CARLA in headless mode:
DISPLAY= ./CarlaUE4.sh -opengl -RenderOffScreen -carla-rpc-port=12121 # headless mode
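Before launching training, a small check with the CARLA Python client can confirm the simulator is reachable (the host and timeout below are assumptions; the port matches the -carla-rpc-port flag above):

```python
# optional: verify the headless CARLA server started above is reachable
import carla

client = carla.Client('localhost', 12121)  # port must match -carla-rpc-port
client.set_timeout(10.0)
print(client.get_server_version())         # expect 0.9.13
print(client.get_world().get_map().name)   # currently loaded town
```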
- run DMR with:
bash auto_run_batch_modal.sh
- choices of some key parameters in train_testm.py (an illustrative combination is sketched after this list):
  - selected_scenario: 'jaywalk', 'highbeam'
  - selected_weather: 'midnight', 'hard_rain'
  - perception_type:
    - single-modality perception: 'RGB-Frame', 'DVS-Frame', 'DVS-Voxel-Grid', 'LiDAR-BEV', 'Depth-Frame'
    - multi-modality perception: 'RGB-Frame+DVS-Frame', 'RGB-Frame+DVS-Voxel-Grid', 'RGB-Frame+Depth-Frame', 'RGB-Frame+LiDAR-BEV'
  - encoder_type:
    - single-modality encoder: 'pixelCarla098'
    - multi-modality encoder: 'DMR_CNN', 'pixelEFNet', 'pixelFPNNet', 'pixelRENet', ...
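The snippet below is purely illustrative: it shows one valid combination of the values listed above, but how train_testm.py actually consumes these parameters (command-line flags, a config file, or in-script constants) is not documented here.

```python
# hypothetical combination of the parameter values listed above; the way
# train_testm.py receives them is not specified in this README
config = {
    'selected_scenario': 'highbeam',
    'selected_weather': 'midnight',
    'perception_type': 'RGB-Frame+DVS-Frame',
    'encoder_type': 'DMR_CNN',
}
```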