Scan2Cap: Context-aware Dense Captioning in RGB-D Scans

Introduction

We introduce the task of dense captioning in 3D scans from commodity RGB-D sensors. As input, we assume a point cloud of a 3D scene; the expected output is the bounding boxes along with the descriptions for the underlying objects. To address the 3D object detection and description problems, we propose Scan2Cap, an end-to-end trained method, to detect objects in the input scene and describe them in natural language. We use an attention mechanism that generates descriptive tokens while referring to the related components in the local context. To reflect object relations (i.e. relative spatial relations) in the generated captions, we use a message passing graph module to facilitate learning object relation features. Our method can effectively localize and describe 3D objects in scenes from the ScanRefer dataset, outperforming 2D baseline methods by a significant margin (27.61% CiDEr@0.5IoU improvement).

Please also check out the project website here.

For additional detail, please see the Scan2Cap paper:
"Scan2Cap: Context-aware Dense Captioning in RGB-D Scans"
by Dave Zhenyu Chen, Ali Gholami, Matthias Nießner and Angel X. Chang
from Technical University of Munich and Simon Fraser University.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
demo		demo
docs		docs
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Scan2Cap: Context-aware Dense Captioning in RGB-D Scans

Introduction

Coming soon!

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Scan2Cap: Context-aware Dense Captioning in RGB-D Scans

Introduction

Coming soon!

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages