
📷 EVF-SAM

Early Vision-Language Fusion for Text-Prompted Segment Anything Model

Yuxuan Zhang1,*, Tianheng Cheng1,*, Lei Liu2, Heng Liu2, Longjin Ran2, Xiaoxin Chen2, Wenyu Liu1, Xinggang Wang1,📧

1 Huazhong University of Science and Technology, 2 vivo AI Lab

(* equal contribution, 📧 corresponding author)

arXiv paper · 🤗 HuggingFace models

Highlights

  • EVF-SAM extends SAM's capabilities with text-prompted segmentation, achieving high accuracy in Referring Expression Segmentation.
  • EVF-SAM is designed for efficient computation, enabling rapid inference within a few seconds per image on a T4 GPU.

Updates

  • Release code
  • Release weights
  • Release demo

Visualization

Example text prompts (each paired with an input image and the predicted segmentation mask in the original table):
  • "zebra top left"
  • "a pizza with a yellow sign on top of it"
  • "the broccoli closest to the ketchup bottle"
  • "bus going to south common"
  • "3carrots in center with ice and greenn leaves"

Installation

  1. Clone this repository
  2. Install PyTorch matching your CUDA version
  3. pip install -r requirements.txt
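
A minimal shell sketch of these steps (the repository URL follows the project page; the cu121 wheel index is only an example, pick the one that matches your driver):

git clone https://github.com/hustvl/EVF-SAM.git
cd EVF-SAM
# install PyTorch first; the CUDA 12.1 index below is an assumption
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt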

Weights

Name             SAM              BEIT-3     Params   Reference Score
EVF-SAM          SAM-H            BEIT-3-L   1.32B    83.7
EVF-Effi-SAM-L   EfficientSAM-S   BEIT-3-L   700M     83.5
EVF-Effi-SAM-B   EfficientSAM-T   BEIT-3-B   232M     80.0
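
The checkpoints are distributed via HuggingFace (see the link above); a hedged download sketch using huggingface-cli, with the repository id left as a placeholder:

# <repo-id> is a placeholder -- substitute the model id from the HuggingFace link above
huggingface-cli download <repo-id> --local-dir ./evf-sam
# pass the resulting directory to --version in the commands below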

Inference

python inference.py  \
  --version <path to evf-sam> \
  --precision='fp16' \
  --vis_save_path "<path to your output directory>" \
  --model_type <"ori" or "effi", depending on the checkpoint you load>   \
  --image_path <path to your input image> \
  --prompt <customized text prompt>

--load_in_8bit and --load_in_4bit are optional.
For example:

python inference.py  \
  --version evf-sam-21 \
  --precision='fp16' \
  --vis_save_path "infer" \
  --model_type ori   \
  --image_path "assets/zebra.jpg" \
  --prompt "zebra top left"
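
A sketch of the same command with optional 8-bit loading enabled (the flag is the one noted above; how it interacts with --precision is worth checking in inference.py):

python inference.py  \
  --version evf-sam-21 \
  --precision='fp16' \
  --load_in_8bit \
  --vis_save_path "infer" \
  --model_type ori   \
  --image_path "assets/zebra.jpg" \
  --prompt "zebra top left"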

Demo

python demo.py <path to evf-sam>
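
For example, with a locally downloaded checkpoint (the path below is a hypothetical placeholder):

# point this at your downloaded EVF-SAM weights
python demo.py ./evf-sam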

Data preparation

Referring segmentation datasets: refCOCO, refCOCO+, refCOCOg, refCLEF (saiapr_tc-12) and COCO2014train

├── dataset
│   ├── refer_seg
│   │   ├── images
│   │   │   ├── saiapr_tc-12
│   │   │   └── mscoco
│   │   │       └── images
│   │   │           └── train2014
│   │   ├── refclef
│   │   ├── refcoco
│   │   ├── refcoco+
│   │   └── refcocog
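
A shell sketch of creating this layout (directory names follow the tree above; the COCO source path is a hypothetical placeholder):

mkdir -p dataset/refer_seg/images/saiapr_tc-12
mkdir -p dataset/refer_seg/images/mscoco/images
mkdir -p dataset/refer_seg/{refclef,refcoco,refcoco+,refcocog}
# link existing COCO 2014 train images into place (substitute your own path)
ln -s /path/to/coco/train2014 dataset/refer_seg/images/mscoco/images/train2014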

Evaluation

torchrun --standalone --nproc_per_node <num_gpus> eval.py   \
    --version <path to evf-sam> \
    --dataset_dir <path to your data root>   \
    --val_dataset "refcoco|unc|val"
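
For example, evaluating on the RefCOCO val split with 4 GPUs (the GPU count and paths are placeholders; the dataset string follows the format shown above):

torchrun --standalone --nproc_per_node 4 eval.py   \
    --version ./evf-sam \
    --dataset_dir ./dataset   \
    --val_dataset "refcoco|unc|val"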

Acknowledgement

We borrow some code from LISA, unilm, SAM, and EfficientSAM.

Citation

@article{zhang2024evfsamearlyvisionlanguagefusion,
      title={EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model}, 
      author={Yuxuan Zhang and Tianheng Cheng and Rui Hu and Lei Liu and Heng Liu and Longjin Ran and Xiaoxin Chen and Wenyu Liu and Xinggang Wang},
      year={2024},
      eprint={2406.20076},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.20076}, 
}
