Learning Mask-aware CLIP Representations for Zero-Shot Segmentation

This is the official implementation of our conference paper: "Learning Mask-aware CLIP Representations for Zero-Shot Segmentation" (NeurIPS 2023).

Introduction

Recently, pre-trained vision-language models have been increasingly used to tackle the challenging zero-shot segmentation task. To maintain CLIP's zero-shot transferability, previous methods tend to freeze CLIP during training. However, in the paper, we reveal that CLIP is insensitive to different mask proposals and tends to produce similar predictions for various mask proposals of the same image. This issue mainly stems from the fact that CLIP is trained with image-level supervision. To alleviate this issue, we propose a simple yet effective method, named Mask-aware Fine-tuning (MAFT). Specifically, we propose the Image-Proposals CLIP Encoder (IP-CLIP Encoder) to handle arbitrary numbers of images and mask proposals simultaneously. Then, a mask-aware loss and a self-distillation loss are designed to fine-tune the IP-CLIP Encoder, ensuring that CLIP is responsive to different mask proposals without sacrificing transferability. In this way, mask-aware representations can be easily learned to make the true positives stand out. Notably, our solution can be seamlessly plugged into most existing methods without introducing any new parameters during the fine-tuning process.
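
To make the insensitivity problem concrete, the sketch below compares three toy mask proposals for one image against a ground-truth region. It is not the MAFT loss or the IP-CLIP Encoder: the use of IoU as the quality measure, the toy masks, and the confidence values are all assumptions chosen for illustration. It only shows the symptom described above, where proposals of very different quality receive nearly identical scores.

```python
# Illustrative only: masks and confidences are made up, not taken from the paper.

def mask_iou(proposal, target):
    """Intersection-over-union of two binary masks given as nested lists of 0/1."""
    inter = union = 0
    for row_p, row_t in zip(proposal, target):
        for p, t in zip(row_p, row_t):
            inter += p & t
            union += p | t
    return inter / union if union else 0.0

ground_truth = [[1, 1, 0, 0],
                [1, 1, 0, 0]]

proposals = {
    "good":    [[1, 1, 0, 0], [1, 1, 0, 0]],  # matches the region exactly
    "partial": [[1, 1, 1, 1], [0, 0, 0, 0]],  # overlaps on the top row only
    "bad":     [[0, 0, 1, 1], [0, 0, 1, 1]],  # no overlap at all
}

# Hypothetical confidences from a classifier that ignores the mask.
insensitive_scores = {"good": 0.81, "partial": 0.80, "bad": 0.79}

ious = {name: mask_iou(mask, ground_truth) for name, mask in proposals.items()}
for name in proposals:
    print(f"{name:8s} IoU={ious[name]:.2f} score={insensitive_scores[name]:.2f}")
```

A mask-aware classifier would be expected to separate these three proposals rather than score them almost equally.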

Installation

Clone the repository

git clone https://github.com/jiaosiyu1999/MAFT.git

Navigate to the project directory
```
cd MAFT
```

Install the dependencies

bash install.sh
cd freeseg/modeling/heads/ops
sh make.sh

Data Preparation

See Preparing Datasets for MAFT. The data should be organized as follows:

datasets/
  ade/
    ADEChallengeData2016/
      images/
      annotations_detectron2/
    ADE20K_2021_17_01/
      images/
      annotations_detectron2/
  coco/
    train2017/
    val2017/
    stuffthingmaps_detectron2/
  VOCdevkit/
    VOC2012/
      images_detectron2/
      annotations_ovs/
    VOC2010/
      images/
      annotations_detectron2_ovs/
        pc59_val/
        pc459_val/
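
Before training or evaluating, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_dataset_dirs` is hypothetical, and the default root `datasets` assumes the script is run from the repository root.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
EXPECTED_DIRS = [
    "ade/ADEChallengeData2016/images",
    "ade/ADEChallengeData2016/annotations_detectron2",
    "ade/ADE20K_2021_17_01/images",
    "ade/ADE20K_2021_17_01/annotations_detectron2",
    "coco/train2017",
    "coco/val2017",
    "coco/stuffthingmaps_detectron2",
    "VOCdevkit/VOC2012/images_detectron2",
    "VOCdevkit/VOC2012/annotations_ovs",
    "VOCdevkit/VOC2010/images",
    "VOCdevkit/VOC2010/annotations_detectron2_ovs/pc59_val",
    "VOCdevkit/VOC2010/annotations_detectron2_ovs/pc459_val",
]

def missing_dataset_dirs(root="datasets"):
    """Return the expected sub-directories that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_DIRS if not (root / rel).is_dir()]

# Report anything that still needs to be prepared.
for rel in missing_dataset_dirs():
    print("missing:", rel)
```

An empty result means every directory listed above was found; it does not check the contents of those directories.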

Usage

Pretrained Weights

| Model | A-847 | A-150 | PC-459 | PC-59 | PAS-20 | Weights | Logs |
|---|---|---|---|---|---|---|---|
| FreeSeg | 7.1 | 17.9 | 6.4 | 34.4 | 85.6 | freeseg_model.pt | - |
| MAFT-ViT-B | 10.25 | 29.14 | 12.85 | 53.33 | 90.44 | MAFT_Vitb.pt | Log |
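
To read the table as per-benchmark differences, the short sketch below subtracts the FreeSeg row from the MAFT-ViT-B row. The numbers are copied from the table; the variable names are illustrative only.

```python
# Scores copied from the table above (FreeSeg vs. MAFT-ViT-B).
benchmarks = ["A-847", "A-150", "PC-459", "PC-59", "PAS-20"]
freeseg = [7.1, 17.9, 6.4, 34.4, 85.6]
maft_vit_b = [10.25, 29.14, 12.85, 53.33, 90.44]

# Absolute difference per benchmark, rounded to the table's precision.
gains = {
    name: round(maft - base, 2)
    for name, base, maft in zip(benchmarks, freeseg, maft_vit_b)
}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
```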

Evaluation

Evaluate a trained model on the validation sets of all datasets.

python train_net.py --eval-only --config-file <CONFIG_FILE> --num-gpus <NUM_GPU> OUTPUT_DIR <OUTPUT_PATH> MODEL.WEIGHTS <TRAINED_MODEL_PATH>

For example, evaluate our pre-trained model:

# 1. Download MAFT-ViT-B.
# 2. Put it at `out/model.pt`.
# 3. Run the evaluation.
python train_net.py --config-file configs/coco-stuff-164k-156/eval.yaml --num-gpus 8 --eval-only
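
When evaluation is launched from a script, the command above can be assembled as an argument list. This is a minimal sketch: `build_eval_command` is a hypothetical helper, not something shipped with the repository, and it only uses the flags and `KEY VALUE` overrides that appear in the commands above.

```python
import shlex

def build_eval_command(config_file, num_gpus, output_dir=None, weights=None):
    """Assemble the evaluation command shown above as an argument list."""
    cmd = [
        "python", "train_net.py", "--eval-only",
        "--config-file", config_file,
        "--num-gpus", str(num_gpus),
    ]
    # Optional trailing overrides, in the same KEY VALUE form as the generic command.
    if output_dir is not None:
        cmd += ["OUTPUT_DIR", str(output_dir)]
    if weights is not None:
        cmd += ["MODEL.WEIGHTS", str(weights)]
    return cmd

# Mirrors the example above: config file and GPU count only.
cmd = build_eval_command("configs/coco-stuff-164k-156/eval.yaml", 8)
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` from the repository root to start the run.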

Training

Step 1: Train an existing "frozen CLIP" network, e.g., FreeSeg:

    python train_net.py --config-file configs/coco-stuff-164k-156/mask2former_freeseg.yaml --num-gpus 4

Step 2: Fine-tune the CLIP image encoder with MAFT (note: the model from step 1 must be loaded):

    python train_net.py --config-file configs/coco-stuff-164k-156/mask2former_maft.yaml --num-gpus 4
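
Because step 2 depends on the step 1 model, a launcher script may want to refuse to start fine-tuning until that checkpoint exists. The sketch below is one way to do this under stated assumptions: the helper names are hypothetical, and passing the checkpoint through a `MODEL.WEIGHTS` override is assumed from the evaluation command; the repository's config file may already set the path instead.

```python
from pathlib import Path

STEP1_CONFIG = "configs/coco-stuff-164k-156/mask2former_freeseg.yaml"
STEP2_CONFIG = "configs/coco-stuff-164k-156/mask2former_maft.yaml"

def train_command(config_file, num_gpus=4, weights=None):
    """Assemble a training command in the form used in the two steps above."""
    cmd = ["python", "train_net.py", "--config-file", config_file,
           "--num-gpus", str(num_gpus)]
    if weights is not None:
        # Assumption: same KEY VALUE override style as the evaluation command.
        cmd += ["MODEL.WEIGHTS", str(weights)]
    return cmd

def step2_command(step1_checkpoint, num_gpus=4):
    """Build the fine-tuning command only if the step 1 checkpoint exists."""
    ckpt = Path(step1_checkpoint)
    if not ckpt.is_file():
        raise FileNotFoundError(f"Step 1 checkpoint not found: {ckpt}")
    return train_command(STEP2_CONFIG, num_gpus, weights=ckpt)

# Step 1 needs no prior checkpoint.
print(" ".join(train_command(STEP1_CONFIG)))
```

Calling `step2_command` with the path of the checkpoint written by step 1 returns the fine-tuning command, or raises if that file is missing.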

Cite

If you find this work helpful, please cite our paper.

@inproceedings{jiao2023learning,
  title={Learning Mask-aware CLIP Representations for Zero-Shot Segmentation},
  author={Jiao, Siyu and Wei, Yunchao and Wang, Yaowei and Zhao, Yao and Shi, Humphrey},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
  year={2023}
}
