TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes

This repository is an official implementation of TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes, created by Bu Jin, Yupeng Zheng, Pengfei Li, Weize Li, Yuhang Zheng, Sujie Hu, Xinyu Liu, Jinwei Zhu, Zhijie Yan, Haiyang Sun, Kun Zhan, Peng Jia, Xiaoxiao Long, Yilun Chen, Hao Zhao.

Introduction

We introduce the new task of outdoor 3D dense captioning. As input, we assume a LiDAR point cloud and a set of RGB images captured by the panoramic camera rig. The expected output is a set of object boxes with captions. To tackle this task, we propose the TOD3Cap network, which leverages the BEV representation to generate object box proposals and integrates Relation Q-Former with LLaMA-Adapter to generate rich captions for these objects. We also introduce the TOD3Cap dataset, the largest one to our knowledge for 3D dense captioning in outdoor scenes, which contains 2.3M descriptions of 64.3K outdoor objects from 850 scenes in nuScenes.
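For orientation, the sketch below shows how these pieces could fit together. It is a minimal, hypothetical sketch of the pipeline described above, not the repository's actual code: every class and attribute name (the BEV encoder, proposal head, RelationQFormer, and LLaMA-Adapter captioner wrappers) is an illustrative placeholder.

# Hypothetical sketch of the TOD3Cap pipeline described above.
# All class and attribute names are illustrative placeholders.
import torch.nn as nn

class TOD3CapSketch(nn.Module):
    def __init__(self, bev_encoder, proposal_head, relation_qformer, captioner):
        super().__init__()
        self.bev_encoder = bev_encoder            # fuses LiDAR points + panoramic images into BEV features
        self.proposal_head = proposal_head        # predicts 3D object boxes from the BEV map
        self.relation_qformer = relation_qformer  # gathers inter-object and scene context per box
        self.captioner = captioner                # LLaMA-Adapter-style language head

    def forward(self, lidar_points, images):
        bev = self.bev_encoder(lidar_points, images)        # (B, C, H, W) BEV features
        boxes, box_feats = self.proposal_head(bev)          # box proposals + per-box features
        ctx_tokens = self.relation_qformer(box_feats, bev)  # relation-aware prefix tokens
        captions = self.captioner.generate(ctx_tokens)      # one caption per proposal
        return boxes, captions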

Note

This repository will be updated soon, including:

  • Initialization.
  • Uploading the TOD3Cap dataset.
  • Uploading the annotation tools.
  • Uploading the code of the TOD3Cap network.
  • Uploading the installation guidelines.
  • Uploading the training and evaluation scripts.
  • Uploading the visualization scripts for ground-truth data and predicted results.
  • Uploading the baseline implementations.


Getting Started

The setup follows https://mmdetection3d.readthedocs.io/en/latest/getting_started.html#installation.

a. Create a conda virtual environment and activate it.

conda create -n tod3cap python=3.8 -y 
conda activate tod3cap

b. Install PyTorch and torchvision following the official instructions.

pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
# Recommended torch>=1.9

c. Install gcc>=5 in conda env (optional).

conda install -c omgarcia gcc-6 # gcc-6.2

d. Install mmcv-full.

pip install mmcv-full==1.4.0
# Or install a prebuilt wheel matching your CUDA/PyTorch versions, e.g.:
# pip install mmcv-full==1.4.0 -f https://download.openmmlab.com/mmcv/dist/cu111/torch1.9.0/index.html

e. Install mmdet and mmseg.

pip install mmdet==2.14.0
pip install mmsegmentation==0.14.1

f. Install mmdet3d from source code.

git clone https://github.com/open-mmlab/mmdetection3d.git
cd mmdetection3d
git checkout v0.17.1 # Other versions may not be compatible.
python setup.py install

g. Install Detectron2 and Timm.

pip install einops fvcore seaborn iopath==0.1.9 timm==0.6.13  typing-extensions==4.5.0 pylint ipython==8.12  numpy==1.19.5 matplotlib==3.5.2 numba==0.48.0 pandas==1.4.4 scikit-image==0.19.3 setuptools==59.5.0
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'

h. Install other dependencies.

pip install -r requirements.txt
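After installing everything, a quick import check can catch version mismatches early. This optional sanity check is not part of the official setup:

# Optional sanity check: confirm the core packages import and report their versions.
import torch
import mmcv
import mmdet
import mmseg
import mmdet3d

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("mmcv-full:", mmcv.__version__)
print("mmdet:", mmdet.__version__)
print("mmsegmentation:", mmseg.__version__)
print("mmdet3d:", mmdet3d.__version__)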

Models

  • We release our best-performing checkpoints. You can download these models at [Google Drive] and place them under the checkpoints directory, creating it if it does not exist (a quick loading check is sketched after this list).

  • We release the pretrained detector models we used during training in [Google Drive]. If you want to use other pretrained video-swin models, you can refer to BEVFormer and BEVFusion.
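Once a checkpoint is downloaded, the snippet below can confirm it deserializes; the filename is a placeholder for whichever file you downloaded, and the "state_dict" key is an assumption based on the usual mmdetection3d checkpoint layout.

# Optional check that a downloaded checkpoint loads.
# "checkpoints/tod3cap_best.pth" is a placeholder filename.
import torch

ckpt = torch.load("checkpoints/tod3cap_best.pth", map_location="cpu")
# mmdetection3d-style checkpoints usually keep weights under "state_dict".
state_dict = ckpt.get("state_dict", ckpt)
print(f"loaded {len(state_dict)} tensors")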

Dataset Preparation

You can either download the preprocessed data from this site, or download and preprocess the raw files yourself by following this site.
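Because the dataset is built on nuScenes, a quick way to confirm the raw files are in place is the official nuscenes-devkit. This optional check assumes the data lives under data/nuscenes; adjust version and dataroot to your layout.

# Optional: verify the raw nuScenes files load with the official devkit.
# version/dataroot are assumptions about your local setup.
from nuscenes.nuscenes import NuScenes

nusc = NuScenes(version="v1.0-trainval", dataroot="data/nuscenes", verbose=True)
print("scenes:", len(nusc.scene))  # the full trainval split contains 850 scenes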

Train

We provide example scripts to train our model.

./tools/dist_train.sh

Evaluate

We provide example scripts to evaluate pre-trained checkpoints.

./tools/dist_evaluate.sh
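For context, 3D dense captioning is commonly scored with the m@kIoU protocol popularized by indoor work such as Scan2Cap: a caption's score under a standard metric (CIDEr, BLEU-4, METEOR, ROUGE-L) counts only when its predicted box overlaps the ground-truth box with IoU at or above the threshold k. The sketch below illustrates the idea only; the caption_score stand-in and the data layout are hypothetical, not this repository's evaluation code.

# Illustrative m@kIoU sketch; not the repository's evaluation code.
def caption_score(pred, ref):
    # Toy stand-in for a real captioning metric (CIDEr, BLEU-4, METEOR, ROUGE-L):
    # unigram overlap with the reference, for illustration only.
    p, r = set(pred.split()), set(ref.split())
    return len(p & r) / max(len(r), 1)

def m_at_k_iou(num_gt, matches, k=0.5):
    # matches: {gt_index: (iou, predicted_caption, reference_caption)}
    # Ground truths without a match at IoU >= k contribute a score of 0.
    total = 0.0
    for gt_idx in range(num_gt):
        m = matches.get(gt_idx)
        if m and m[0] >= k:
            total += caption_score(m[1], m[2])
    return total / max(num_gt, 1)

# Tiny usage example with made-up values:
demo = {0: (0.62, "a parked black car", "the black car parked by the curb")}
print(m_at_k_iou(num_gt=2, matches=demo, k=0.5))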

Qualitative results

Citation

If you find our work useful in your research, please consider citing:

@article{jin2024tod3cap,
  title={TOD3Cap: Towards 3D Dense Captioning in Outdoor Scenes},
  author={Jin, Bu and Zheng, Yupeng and Li, Pengfei and Li, Weize and Zheng, Yuhang and Hu, Sujie and Liu, Xinyu and Zhu, Jinwei and Yan, Zhijie and Sun, Haiyang and others},
  journal={arXiv preprint arXiv:2403.19589},
  year={2024}
}

Acknowledgments

Our code is built on top of the following open-source repositories. We thank all the authors for making their code public, which greatly accelerated our progress. If you find these works helpful, please consider citing them as well.

open-mmlab/mmdetection3d

fundamentalvision/BEVFormer

mit-han-lab/bevfusion

OpenGVLab/LLaMA-Adapter
