
📷 EVF-SAM

Early Vision-Language Fusion for Text-Prompted Segment Anything Model

Yuxuan Zhang1,*, Tianheng Cheng1,*, Lei Liu2, Heng Liu2, Longjin Ran2, Xiaoxin Chen2, Wenyu Liu1, Xinggang Wang1,📧

1 Huazhong University of Science and Technology, 2 vivo AI Lab

(* equal contribution, 📧 corresponding author)

arXiv paper · 🤗 HuggingFace models

Highlights

  • EVF-SAM extends SAM's capabilities with text-prompted segmentation, achieving high accuracy in Referring Expression Segmentation.
  • EVF-SAM is designed for efficient computation, enabling rapid inference within a few seconds per image on a T4 GPU.

Updates

  • Release code
  • Release weights
  • Release demo

Visualization

Example text prompts (each paired with an input image and the predicted segmentation mask in the original table):
  • "zebra top left"
  • "a pizza with a yellow sign on top of it"
  • "the broccoli closest to the ketchup bottle"
  • "bus going to south common"
  • "3carrots in center with ice and greenn leaves"

Installation

  1. Clone this repository
  2. Install PyTorch matching your CUDA version
  3. pip install -r requirements.txt
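
A minimal shell sketch of these steps (the repository URL follows the project page; the cu121 wheel index is only an example, pick the one that matches your driver):

git clone https://github.com/hustvl/EVF-SAM.git
cd EVF-SAM
# install PyTorch first; the CUDA 12.1 index below is an assumption
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt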

Weights

Name             SAM              BEIT-3     Params   Reference Score
EVF-SAM          SAM-H            BEIT-3-L   1.32B    83.7
EVF-Effi-SAM-L   EfficientSAM-S   BEIT-3-L   700M     83.5
EVF-Effi-SAM-B   EfficientSAM-T   BEIT-3-B   232M     80.0
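
The checkpoints are distributed via HuggingFace (see the link above); a hedged download sketch using huggingface-cli, with the repository id left as a placeholder:

# <repo-id> is a placeholder -- substitute the model id from the HuggingFace link above
huggingface-cli download <repo-id> --local-dir ./evf-sam
# pass the resulting directory to --version in the commands below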

Inference

python inference.py  \
  --version <path to evf-sam> \
  --precision='fp16' \
  --vis_save_path "<path to your output directory>" \
  --model_type <"ori" or "effi", depending on the checkpoint you load>   \
  --image_path <path to your input image> \
  --prompt <customized text prompt>

--load_in_8bit and --load_in_4bit are optional.
For example:

python inference.py  \
  --version evf-sam-21 \
  --precision='fp16' \
  --vis_save_path "infer" \
  --model_type ori   \
  --image_path "assets/zebra.jpg" \
  --prompt "zebra top left"
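
A sketch of the same command with optional 8-bit loading enabled (the flag is the one noted above; how it interacts with --precision is worth checking in inference.py):

python inference.py  \
  --version evf-sam-21 \
  --precision='fp16' \
  --load_in_8bit \
  --vis_save_path "infer" \
  --model_type ori   \
  --image_path "assets/zebra.jpg" \
  --prompt "zebra top left"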

Demo

python demo.py <path to evf-sam>
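
For example, with a locally downloaded checkpoint (the path below is a hypothetical placeholder):

# point this at your downloaded EVF-SAM weights
python demo.py ./evf-sam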

Data preparation

Referring segmentation datasets: refCOCO, refCOCO+, refCOCOg, refCLEF (saiapr_tc-12) and COCO2014train

├── dataset
│   ├── refer_seg
│   │   ├── images
│   │   │   ├── saiapr_tc-12
│   │   │   └── mscoco
│   │   │       └── images
│   │   │           └── train2014
│   │   ├── refclef
│   │   ├── refcoco
│   │   ├── refcoco+
│   │   └── refcocog
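
A shell sketch of creating this layout (directory names follow the tree above; the COCO source path is a hypothetical placeholder):

mkdir -p dataset/refer_seg/images/saiapr_tc-12
mkdir -p dataset/refer_seg/images/mscoco/images
mkdir -p dataset/refer_seg/{refclef,refcoco,refcoco+,refcocog}
# link existing COCO 2014 train images into place (substitute your own path)
ln -s /path/to/coco/train2014 dataset/refer_seg/images/mscoco/images/train2014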

Evaluation

torchrun --standalone --nproc_per_node <num_gpus> eval.py   \
    --version <path to evf-sam> \
    --dataset_dir <path to your data root>   \
    --val_dataset "refcoco|unc|val"
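
For example, evaluating on the RefCOCO val split with 4 GPUs (the GPU count and paths are placeholders; the dataset string follows the format shown above):

torchrun --standalone --nproc_per_node 4 eval.py   \
    --version ./evf-sam \
    --dataset_dir ./dataset   \
    --val_dataset "refcoco|unc|val"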

Acknowledgement

We borrow some code from LISA, unilm, SAM, and EfficientSAM.

Citation

@article{zhang2024evfsamearlyvisionlanguagefusion,
      title={EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model}, 
      author={Yuxuan Zhang and Tianheng Cheng and Rui Hu and Lei Liu and Heng Liu and Longjin Ran and Xiaoxin Chen and Wenyu Liu and Xinggang Wang},
      year={2024},
      eprint={2406.20076},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.20076}, 
}
