Skip to content

A collection of 3D vision and language (e.g., 3D Visual Grounding, 3D Question Answering and 3D Dense Caption) papers and datasets.

License

Notifications You must be signed in to change notification settings

jianghaojun/Awesome-3D-Vision-and-Language

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 

Repository files navigation

Awesome-3D-Vision-and-Language Awesome

A curated list of research papers in 3D visual grounding. (Contact: jhj20 at mails.tsinghua.edu.cn)

💬 News

[2022/04/15]: Create this repository.
[2022/05/25]: Expend the scope to 3D-Vision-and-Language, e.g., 3D Visual Grounding, 3D Dense Caption and 3D Question Answering.

Table of Contents

3D Visual Grounding

3D VG Paper Roadmap (Chronological Order)

ECCV 2020

  1. Achlioptas, Panos, et al. ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes. ECCV 2020, Oral. [Paper] [Code] [Website]
  2. Chen, Dave Zhenyu, et al. ScanRefer 3D Object Localization in RGB-D Scans Using Natural Language. ECCV 2020. [Paper] [Code] [Website]

AAAI 2021

  1. Huang, Pin-Hao, et al. Text-guided graph neural networks for referring 3d instance segmentation. AAAI 2021. [Paper] [Code]

CVPR 2021

  1. Feng, Mingtao, et al. Free-form Description Guided 3D Visual Graph Network for Object Grounding in Point Cloud. CVPR 2021. [Paper] [Code]
  2. Liu, Haolin, et al. Refer-It-in-RGBD: A Bottom-Up Approach for 3D Visual Grounding in RGBD Images. CVPR 2021. [Paper] [Code] [Website]

ICCV 2021

  1. Yang, Zhengyuan, et al. SAT: 2D Semantics Assisted Training for 3D Visual Grounding. ICCV 2021, Oral. [Paper] [Code]

    Personal Notes:

    • Use corresponding 2D image data(ROI feature, label, bbox coordinates and camera pose) to assist 3D grounding.
    • Very solid experiments.
  2. Yuan, Zhihao, et al. InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring . ICCV 2021. [Paper] [Code]

  3. Zhao, Lichen, et al. 3DVG-Transformer: Relation modeling for visual grounding on point clouds. ICCV 2021. [Paper] [Code]

    Personal Notes:

    • The novelty of this paper comes from the coordinate-guied contextual aggregation module.

ACM-MM 2021

  1. He, Dailan, et al. TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding. ACM-MM 2021. [Paper]

CVPR 2022

  1. Huang, Shijia, et al. Multi-View Transformer for 3D Visual Grounding. CVPR 2022. [Paper] [Code]

    Personal Notes:

    • Rotating the center xyz of objects to provide view-related positional information before going through a Tranformer decoder.
    • SOTA results on Nr3D and Sr3D, good reuslts on ScanRefer.
  2. Luo, Junyu, et al. 3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection. CVPR 2022, Oral. [Paper] [Code]

    Personal Notes:

    • First single stage work in 3D Visual Grounding !!!
    • The general idea is similar to the iterative shrinking work in 2D Visual Grounding, but the design is more elegant.
  3. Cai, Daigang, et al. 3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds. CVPR 2022.

3D VG Datasets

  1. ReferIt3D(Nr3D, Sr3D/Sr3D+): Achlioptas, Panos, et al. ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes. ECCV 2020, Oral. [Paper] [Code] [Website] [Leaderboard]

    Dataset Statistics:

    • Natural Reference in 3D (Nr3D)
    • Spatial Reference in 3D (Sr3D)
  2. ScanRefer: Chen, Dave Zhenyu, et al. ScanRefer 3D Object Localization in RGB-D Scans Using Natural Language. ECCV 2020. [Paper] [Code] [Website] [Leaderboard]

    Dataset Statistics:

    • On average, there are 13.81 objects, 64.48 descriptions per scene, and 4.67 descriptions per object.
    • Average length of descriptions is 20.27. Frequency of object attributes: spatial language (98.7%), color (74.7%), shape terms (64.9%), and size information (14.2%).

3D VG Workshops

  1. CVPR 2021 1st Workshop on Language for 3D Scenes. [Website]

3D Question Answering

3D QA Paper Roadmap (Chronological Order)

CVPR 2022

  1. Azuma, Daichi, et al. ScanQA: 3D Question Answering for Spatial Scene Understanding. CVPR 2022. [Paper] [Code]

ICLR 2023

  1. Ma, Xiaojian and Yong, Silong, et al. SQA3D: Situated Question Answering in 3D Scenes. ICLR 2023. [Paper] [Data & Code]

3D QA Datasets

  1. ScanQA: Azuma, Daichi, et al. ScanQA: 3D Question Answering for Spatial Scene Understanding. CVPR 2022. [Paper] [Data Preparation]

  2. SQA3D: Ma, Xiaojian and Yong, Silong, et al. SQA3D: Situated Question Answering in 3D Scenes. ICLR 2023. [Paper] [Data & Code]

3D Dense Caption

Pending...

About

A collection of 3D vision and language (e.g., 3D Visual Grounding, 3D Question Answering and 3D Dense Caption) papers and datasets.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published