Code implementation for the CVPR 2024 paper "Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models" (PnP-OVSS).
❗ Only the code for BLIP with Pascal Context is provided here.
Requirements:
- CUDA version: 11.7
- GPU memory: 49140 MiB
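A quick way to confirm the driver and toolkit (assuming the NVIDIA command-line tools are on your PATH):

```bash
nvidia-smi       # driver version and total GPU memory
nvcc --version   # CUDA toolkit version; should report 11.7
```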
Build the LAVIS environment following the instructions from the LAVIS repository:

```bash
conda create -n lavis python=3.8
conda activate lavis
pip install salesforce-lavis
git clone https://github.com/salesforce/LAVIS.git
cd LAVIS
pip install -e .
```
This installs the latest torch; you may need to pin a torch build that matches your CUDA version.
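For example, for CUDA 11.7 a matching build can be pulled from the PyTorch wheel index (the exact version pin here is only an illustration, not a requirement of this repo):

```bash
pip install torch==2.0.1 torchvision==0.15.2 --index-url https://download.pytorch.org/whl/cu117
```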
You might also need to downgrade transformers:

```bash
pip install transformers==4.25
```
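A minimal sanity check that the environment resolves:

```bash
python -c "import torch, transformers, lavis; print(torch.__version__, transformers.__version__, torch.cuda.is_available())"
```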
Download Gradient_Free_Optimizers_master and put it under LAVIS (this is for random search; you can ignore it for now).

Git clone pydensecrf and put it under LAVIS.
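For example, assuming the standard pydensecrf repository (building its extension needs cython and a C++ compiler):

```bash
cd LAVIS
git clone https://github.com/lucasb-eyer/pydensecrf.git
# if the compiled extension is needed rather than the plain source tree:
pip install cython && pip install -e pydensecrf
```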
Datasets:

- Pascal VOC
- Pascal Context: download the dataset following the instructions from mmsegmentation (a conversion sketch follows the directory tree below)
- COCO Object
- COCO Stuff
- ADE20K
- Cityscapes: download the dataset following the instructions from mmsegmentation

Expected layout for Pascal Context:

```
LAVIS
├── mmsegmentation
│   ├── VOCdevkit
│   │   ├── VOC2010
│   │   │   ├── JPEGImages
│   │   │   ├── SegmentationClassContext
```
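A sketch of the mmsegmentation route for Pascal Context. Assumptions: trainval_merged.json (the Pascal Context annotations) sits inside VOC2010, the Detail API is installed, and your mmsegmentation version still ships tools/convert_datasets/pascal_context.py (newer releases have moved this script):

```bash
cd mmsegmentation
# generates SegmentationClassContext from the raw annotations
python tools/convert_datasets/pascal_context.py VOCdevkit VOCdevkit/VOC2010/trainval_merged.json
```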
Download all the files in this repository and put them under LAVIS, then replace the following files in your LAVIS installation with the versions from this repository:

- /home/user/LAVIS/lavis/models/blip_models/blip_image_text_matching.py
- /home/user/LAVIS/lavis/configs/models/blip_itm_large.yaml
- /home/user/LAVIS/lavis/models/med.py
- /home/user/LAVIS/lavis/models/vit.py
- /home/user/LAVIS/lavis/models/base_model.py
- /home/user/LAVIS/lavis/processors/blip_processors.py
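A sketch of the replacement step, assuming this repository was cloned next to LAVIS as PnP-OVSS (a hypothetical path) with the replacement files at its top level:

```bash
cd /home/user/LAVIS
cp ../PnP-OVSS/blip_image_text_matching.py lavis/models/blip_models/
cp ../PnP-OVSS/blip_itm_large.yaml         lavis/configs/models/
cp ../PnP-OVSS/med.py                      lavis/models/
cp ../PnP-OVSS/vit.py                      lavis/models/
cp ../PnP-OVSS/base_model.py               lavis/models/
cp ../PnP-OVSS/blip_processors.py          lavis/processors/
```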
For Pascal Context:

```bash
bash PSC_halving.sh
```

For COCO Object and COCO Stuff:

```bash
bash New_eval_cam_PSC.sh
```
The output will have the following structure:

```
LAVIS
├── New_Cbatch_Eval_test_ddp_0126_768_flickrfinetune_zeroshot_halvingdrop_Cityscapes
│   ├── gradcam
│   │   ├── max_att_block_num8_del_patch_numsort_thresh005
│   │   │   ├── drop_iter0
│   │   │   │   ├── img_att_forclasses   (attention maps)
│   │   │   │   ├── Union_check0928      (visualizations of the attention maps)
│   │   │   │   ├── highest_att_save     (indices of patches to be dropped)
│   │   │   ├── drop_iter1
│   │   │   ├── drop_iter2
│   │   │   ├── drop_iter3
│   │   │   ├── drop_iter4
```
An example run command (here for Cityscapes, matching the output tree above):

```bash
CUDA_VISIBLE_DEVICES=3 python pnp_get_attention_textloc_weaklysupervised_search_Cityscapes.py \
    --save_path New_Cbatch_Eval_test_ddp_0126_448_flickrfinetune_zeroshot_halvingdrop_Cityscapes \
    --master_port 10990 --gen_multiplecap_withpnpvqa label --world_size 1 \
    --del_patch_num sort_thresh005 \
    --img_size 768 \
    --batch_size 2 \
    --max_att_block_num 8 --drop_iter 5 --prune_att_head 9 --sort_threshold 0.05
```
To change the image size, you may also need to modify the image size in /home/user/LAVIS/lavis/configs/models/blip_itm_large.yaml.
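A hedged one-liner for that edit, assuming the resolution is stored under image_size keys in the config (verify against the file shipped in this repository before running):

```bash
sed -i 's/image_size: [0-9]\+/image_size: 768/' /home/user/LAVIS/lavis/configs/models/blip_itm_large.yaml
```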
Remember to match the save_path in {xxx}_halving.sh with the cam_out_dir in New_eval_cam_{xx}.sh