Language-conditioned Detection Transformer
Jang Hyun Cho and Philipp Krähenbühl
arXiv 2311.17902

What is DECOLA?

We design a new open-vocabulary detection framework that adjusts the inner mechanism of the object detector to the concepts it reasons over. This language-conditioned detector (DECOLA) trains as easily as classical detectors, but generalizes much better to novel concepts. DECOLA trains in three steps: (1) learning to condition on a set of concepts, (2) pseudo-labeling image-level data to scale up the training data, and (3) learning a general-purpose detector for downstream open-vocabulary detection. We show strong zero-shot performance on open-vocabulary and standard LVIS benchmarks. [Full abstract]

TL;DR: We design a special detector for pseudo-labeling, and scale up open-vocabulary detection through self-training.

Please feel free to reach out for any questions or discussions!

📧 Jang Hyun Cho [email]

🔥 News 🔥

  • Added zero-shot eval configs, and fixed some config and dependency issues.
  • Added missing configs and weights.
  • Metadata uploaded.
  • Integrated Segment Anything Model (SAM) into DECOLA to generate high-quality, open-vocabulary instance segmentation. (Try it out!)
  • First commit.

Features

  • Detection transformer that adapts its inner mechanism to specific classes represented in language (How?).
  • Highly accurate self-labeling to improve DETR as well as CenterNet2 and other detection frameworks.
  • State-of-the-art results on open-vocabulary LVIS (DETR and CenterNet2).
  • Highly box-efficient object detector when language-conditioned (analysis).

Installation

See installation instructions.

Demo

We provide a demo based on the detectron2 demo interface.

  • DECOLA has two phases: Phase 1 is a language-conditioned detector, and Phase 2 is a general-purpose detector. Use the --language-condition flag to run the Phase 1 model.

  • For visualizing CenterNet2 models, use the --c2 flag.

  • To use the Segment Anything Model for mask generation, add the --use-sam flag.

DECOLA Phase 1: Language-conditioned detection.

First, please download the appropriate model checkpoint. Then run the demo as follows:

python demo.py --config-file configs/DECOLA_PHASE1_L_CLIP_SwinB_4x.yaml --input figs/input/pizza.jpg --output figs/output/pizza.jpg --vocabulary custom --custom_vocabulary cola,piza,fork,knif,table --confidence-threshold 0.3 --language-condition --opts MODEL.WEIGHTS weights/DECOLA_PHASE1_L_CLIP_SwinB_4x.pth 

The model above is DECOLA Phase 1 with a Swin-B backbone (config), trained only on the LVIS dataset. If set up properly, the output image should look like the one below:

Note that cola is not in the LVIS vocabulary, and piza and knif contain intentional typos. Similarly,

python demo.py --config-file configs/DECOLA_PHASE1_L_CLIP_SwinB_4x.yaml --input figs/input/cola.jpg --output figs/output/cola.jpg --vocabulary custom --custom_vocabulary cola,cat,mentos,table --confidence-threshold 0.3 --language-condition --opts MODEL.WEIGHTS weights/DECOLA_PHASE1_L_CLIP_SwinB_4x.pth 

here DECOLA successfully predicts mentos and cola, which are again outside the LVIS vocabulary.

DECOLA Phase 2: General-purpose detection.

General-purpose detection with Phase 2 of DECOLA is also available, both with a custom vocabulary

python demo.py --config-file configs/DECOLA_PHASE2_LI_CLIP_SwinB_4x_ft4x.yaml --input figs/input/desk.jpg --output figs/output/desk1.jpg --vocabulary custom --custom_vocabulary water_bottle,wallet,webcam,mug,headphone,drawer,keyboard,laptop,straw,mouse,paper,plastic_bag --confidence-threshold 0.2 --opts MODEL.WEIGHTS weights/DECOLA_PHASE2_LI_CLIP_SwinB_4x_ft4x.pth 

and a pre-defined vocabulary (e.g., LVIS).

python demo.py --config-file configs/DECOLA_PHASE2_LI_CLIP_SwinB_4x_ft4x.yaml --input figs/input/desk.jpg --output figs/output/desk2.jpg --vocabulary lvis --confidence-threshold 0.2 --opts MODEL.WEIGHTS weights/DECOLA_PHASE2_LI_CLIP_SwinB_4x_ft4x.pth 

Integrating Segment Anything Model

We combine DECOLA's powerful language-conditioned, open-vocabulary detection with the Segment Anything Model (SAM). DECOLA's box outputs prompt SAM to generate high-quality, class-aware instance segmentation. Simply install SAM and add the --use-sam flag:

python demo.py --config-file configs/DECOLA_PHASE2_LI_CLIP_SwinB_4x_ft4x.yaml --input figs/input/desk.jpg --output figs/output_sam/desk2.jpg --vocabulary lvis --confidence-threshold 0.2 --use-sam --opts MODEL.WEIGHTS weights/DECOLA_PHASE2_LI_CLIP_SwinB_4x_ft4x.pth 

Image credit: David Fouhey.
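
Under the hood, the --use-sam path feeds each DECOLA box to SAM as a box prompt. The snippet below is a minimal sketch of that step using the segment-anything package directly; the checkpoint path and the decola_boxes array are illustrative assumptions, not code from this repository.

```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load SAM (the ViT-H checkpoint path here is an assumption).
sam = sam_model_registry["vit_h"](checkpoint="weights/sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# SamPredictor expects an RGB image.
image = cv2.cvtColor(cv2.imread("figs/input/desk.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Hypothetical DECOLA output: one XYXY box per detected instance.
decola_boxes = np.array([[120, 80, 340, 260]])

# Prompt SAM with a detector box to get one instance mask per box prompt.
masks, scores, _ = predictor.predict(box=decola_boxes[0], multimask_output=False)
```

Because the boxes come from DECOLA, the resulting masks inherit its class labels, which is what makes the segmentation class-aware.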

Training DECOLA

Please prepare the datasets first, then follow the training scripts to reproduce our results.
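
For orientation, Detic-style detectron2 codebases are typically launched as below. The script name and flags in this sketch are assumptions, so defer to the linked training scripts for the exact commands.

python train_net.py --num-gpus 8 --config-file configs/DECOLA_PHASE1_L_CLIP_SwinB_4x.yaml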

Testing DECOLA

Check out all the checkpoints of our models as well as the baselines.
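
Detectron2-based models are usually evaluated with an --eval-only run; the command below is a hedged sketch along those lines (the script name and flags are assumptions), reusing a config and checkpoint named elsewhere in this README.

python train_net.py --num-gpus 8 --config-file configs/DECOLA_PHASE2_LI_CLIP_SwinB_4x_ft4x.yaml --eval-only MODEL.WEIGHTS weights/DECOLA_PHASE2_LI_CLIP_SwinB_4x_ft4x.pth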

Here are the highlight results:

Open-vocabulary LVIS with Deformable DETR

| name | backbone | box AP_novel | box mAP |
|---|---|---|---|
| baseline | ResNet-50 | 9.4 | 32.2 |
| + self-train | ResNet-50 | 23.2 | 36.2 |
| DECOLA (ours) | ResNet-50 | 27.6 | 38.3 |
| baseline | Swin-B | 16.2 | 41.1 |
| + self-train | Swin-B | 30.8 | 42.3 |
| DECOLA (ours) | Swin-B | 35.7 | 46.3 |
| baseline | Swin-L | 21.9 | 49.6 |
| + self-train | Swin-L | 36.5 | 51.8 |
| DECOLA (ours) | Swin-L | 46.9 | 55.2 |

Direct zero-shot transfer to LVIS minival

| name | backbone | data | AP_r | AP_c | AP_f | mAP |
|---|---|---|---|---|---|---|
| DECOLA | Swin-T | O365, IN21K | 32.8 | 32.0 | 31.8 | 32.0 |
| DECOLA | Swin-L | O365, OID, IN21K | 41.5 | 38.0 | 34.9 | 36.8 |

Direct zero-shot transfer to LVIS v1.0

| name | backbone | data | AP_r | AP_c | AP_f | mAP |
|---|---|---|---|---|---|---|
| DECOLA | Swin-T | O365, IN21K | 27.2 | 24.9 | 28.0 | 26.6 |
| DECOLA | Swin-L | O365, OID, IN21K | 32.9 | 29.1 | 30.3 | 30.2 |

Open-vocabulary LVIS with CenterNet2

| name | backbone | box AP_novel | box mAP | mask AP_novel | mask mAP |
|---|---|---|---|---|---|
| DECOLA | ResNet-50 | 29.5 | 37.7 | 27.0 | 33.7 |
| DECOLA | Swin-B | 38.4 | 46.7 | 35.3 | 42.0 |

Standard LVIS with Deformable DETR

| name | backbone | box AP_rare | box mAP |
|---|---|---|---|
| baseline | ResNet-50 | 26.3 | 35.6 |
| + self-train | ResNet-50 | 30.0 | 36.6 |
| DECOLA (ours) | ResNet-50 | 35.9 | 39.4 |
| baseline | Swin-B | 38.3 | 44.5 |
| + self-train | Swin-B | 42.0 | 45.2 |
| DECOLA (ours) | Swin-B | 47.4 | 48.3 |
| baseline | Swin-L | 49.3 | 54.4 |
| + self-train | Swin-L | 48.7 | 53.4 |
| DECOLA (ours) | Swin-L | 54.9 | 56.4 |

Standard LVIS with CenterNet2

| name | backbone | box AP_rare | box mAP | mask AP_rare | mask mAP |
|---|---|---|---|---|---|
| DECOLA (ours) | ResNet-50 | 35.6 | 38.6 | 32.1 | 34.4 |
| DECOLA (ours) | Swin-B | 47.6 | 48.5 | 43.7 | 43.6 |

Analyzing DECOLA

Here we provide code for analyzing our model as well as the baselines.

License

The majority of DECOLA is licensed under the Apache 2.0 license. However, this work largely builds on Detic, Deformable DETR, and Detectron2. We also provide optional integration with the Segment Anything Model. Please refer to their original licenses for more details.

Citation

If you find this project useful for your research, please cite our paper using the following BibTeX entry:

@article{cho2023language,
  title={Language-conditioned Detection Transformer},
  author={Cho, Jang Hyun and Kr{\"a}henb{\"u}hl, Philipp},
  journal={arXiv preprint arXiv:2311.17902},
  year={2023}
}
