
Learning Open-Vocabulary Semantic Segmentation Models From Natural Language Supervision

This repository is the official implementation of Learning Open-Vocabulary Semantic Segmentation Models From Natural Language Supervision at CVPR 2023. Our transformer-based model, termed OVSegmentor, is pre-trained on image-text pairs without using any mask annotations. After training, it can segment objects of arbitrary categories via zero-shot transfer.
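
As a rough illustration of how zero-shot transfer works in models of this kind (not the repository's exact API; `text_encoder`, the feature shapes, and the function name below are assumptions), class names are embedded with the text encoder and each visual embedding is assigned to the class with the highest cosine similarity:

```python
import torch
import torch.nn.functional as F

def zero_shot_segment(image_feats, class_names, text_encoder):
    """Illustrative sketch: assign each pixel/group embedding to the closest class name.

    image_feats: (H, W, D) per-pixel (or per-group) embeddings from the visual encoder.
    class_names: list of open-vocabulary class strings, e.g. ["cat", "dog"].
    text_encoder: callable mapping a list of strings to a (C, D) tensor (hypothetical).
    """
    text_feats = F.normalize(text_encoder(class_names), dim=-1)    # (C, D)
    pixel_feats = F.normalize(image_feats, dim=-1)                 # (H, W, D)
    scores = torch.einsum("hwd,cd->hwc", pixel_feats, text_feats)  # cosine similarities
    return scores.argmax(dim=-1)                                   # (H, W) class indices
```

Because classification is done against arbitrary text embeddings rather than a fixed classifier head, the vocabulary can be changed freely at inference time.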

Prepare environment

Please refer to docs/ENV_README.md

Prepare datasets

Please refer to docs/DATA_README.md

Prepare model

Please refer to docs/MODEL_README.md

Demo

To directly use the model for open-vocabulary segmentation:

(1) Download the pretrained model.

(2) Prepare your images and place them in the same folder, for example:

- OVSegmentor
  - visualization
    - input
      - 1.jpg
      - 2.jpg
      ...
    - output

(3) Option 1: To perform segmentation using the vocabulary of an existing dataset (e.g., voc/coco/ade), simply run:

python -u -m main_demo \
    --cfg configs/test_voc12.yml \
    --resume /path/to/the/pretrained_checkpoint.pth \
    --vis input_pred_label \
    --vocab voc \
    --image_folder ./visualization/input/ \
    --output_folder ./visualization/output/

(3) Option 2: To perform segmentation with custom classes, list your own open-vocabulary class names after --vocab and run:

python -u -m main_demo \
    --cfg configs/test_voc12.yml \
    --resume /path/to/the/pretrained_checkpoint.pth \
    --vis input_pred_label \
    --vocab cat dog train bus \
    --image_folder ./visualization/input/ \
    --output_folder ./visualization/output/

You can also save the segmentation masks (e.g., for evaluation) by running:

python -u -m main_demo \
    --cfg configs/test_voc12.yml \
    --resume /path/to/the/pretrained_checkpoint.pth \
    --vis mask \
    --vocab voc \
    --image_folder ./visualization/input/ \
    --output_folder ./visualization/output/
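
If you later want to score the saved masks against ground-truth annotations, a minimal sketch of a per-image IoU computation is shown below; it assumes the masks are single-channel images of class indices, which may not match the exact format written by main_demo:

```python
import numpy as np
from PIL import Image

def per_image_iou(pred_path, gt_path, num_classes, ignore_index=255):
    """Hedged sketch: mean per-class IoU between one predicted mask and its ground truth.

    Assumes both files are single-channel images whose pixel values are class indices;
    the actual mask format produced by the demo script may differ.
    """
    pred = np.array(Image.open(pred_path))
    gt = np.array(Image.open(gt_path))
    valid = gt != ignore_index                      # skip ignored pixels
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c)[valid].sum()
        union = np.logical_or(pred == c, gt == c)[valid].sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```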

Training

To train the model(s) in the paper, we separate the training process into a two-stage pipeline. The first stage is a 30-epoch training with the image-caption contrastive loss and the masked entity completion loss; the second stage is a 10-epoch training that further adds the cross-image mask consistency loss.
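
For reference, the image-caption contrastive loss is, in spirit, a symmetric InfoNCE objective over matched image and caption embeddings; the sketch below is a generic version, with the temperature and feature shapes chosen for illustration rather than taken from the paper's settings:

```python
import torch
import torch.nn.functional as F

def image_caption_contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (image, caption) embedding pairs.

    img_feats, txt_feats: (B, D) tensors; pair i is the positive for row/column i.
    """
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> caption direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # caption -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```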

For the first-stage training on a single node with 8 A100 (80G) GPUs, we recommend using the SLURM script:

./scripts/run_slurm.sh

Alternatively, simply use torch.distributed.launch:

./scripts/run.sh

Stage-2 training: set the path to the stage-1 checkpoint in the stage-2 config file. We also provide our pre-trained stage-1 checkpoint here.

stage1_checkpoint: /path/to/your/stage1_best_miou.pth

Then, perform the second-stage training:

./scripts/run_slurm_stage2.sh
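
Internally, the stage-2 run is expected to restore the stage-1 weights from the stage1_checkpoint path; a hedged sketch of what such a loading step typically looks like in PyTorch is shown below (the 'model' key is an assumption, and the repository's actual checkpoint layout may differ):

```python
import torch

def load_stage1_weights(model, ckpt_path):
    """Hedged sketch: initialize the stage-2 model from a stage-1 checkpoint.

    Assumes the checkpoint is a dict with the state dict stored under the 'model' key;
    falls back to treating the whole file as a state dict otherwise.
    """
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state_dict = ckpt.get("model", ckpt)
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
    return model
```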

Evaluation

To evaluate the model on PASCAL VOC, please specify the resume checkpoint path in tools/test_voc12.sh, and run:

./scripts/test_voc12.sh

For PASCAL Context, COCO Object, and ADE20K, please refer to ./scripts/.

The performance may vary by 3%~4% due to different cross-image sampling.

Model Zoo

The pre-trained models can be downloaded from the links below (segmentation performance reported as mIoU):

| Model name | Visual enc | Text enc | Group tokens | PASCAL VOC | PASCAL Context | COCO Object | ADE20K | Checkpoint |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OVSegmentor | ViT-B | BERT-Base | 8 | 53.8 | 20.4 | 25.1 | 5.6 | download |
| OVSegmentor | ViT-S | Roberta-Base | 8 | 44.5 | 18.3 | 19.0 | 4.3 | download |
| OVSegmentor | ViT-B | BERT-Base | 16 | Todo | Todo | Todo | Todo | Todo |

Citation

If this work is helpful for your research, please consider citing us.

@inproceedings{xu2023learning,
  title={Learning open-vocabulary semantic segmentation models from natural language supervision},
  author={Xu, Jilan and Hou, Junlin and Zhang, Yuejie and Feng, Rui and Wang, Yi and Qiao, Yu and Xie, Weidi},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={2935--2944},
  year={2023}
}

Acknowledgements

This project is built upon GroupViT. Thanks to the contributors of the great codebase.
