Release YOLOE — Open-Vocabulary Detection + Segmentation + Visual Prompts · john-rocky/CoreML-Models

YOLOE converted to Core ML: real-time open-vocabulary detection + instance segmentation, with text and visual prompts. Sizes S (fast) and L (accurate).

Region-embedding design: the detector is text-free and emits per-anchor region embeddings. The region-query similarity (BNContrastiveHead, folded into a 513-d augmented dot product) runs in Swift, so changing the vocabulary only costs new query vectors and dot products; the image branch does not have to be re-run.
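
A minimal Swift sketch of that scoring step follows. It assumes region_embeddings has been copied out of the Core ML output into a contiguous, channel-major [Float] (index c * numAnchors + a, matching the [1,513,8400] shape). It also assumes each query is 513-d: the L2-normalized 512-d prompt embedding plus one extra component, `tail`, that pairs with the detector's 513th channel to carry the head's bias. Which side actually holds the bias (so whether `tail` is 1 or the bias value) is not stated in these notes. The helper names `augmentedQuery` and `classScores` are hypothetical, not part of the demo app.

```swift
import Foundation

/// Hypothetical helper: turns a 512-d prompt embedding (text `tpe` or visual
/// `vpe`) into a 513-d augmented query. The embedding is L2-normalized and
/// `tail` is appended as the 513th component (its correct value depends on
/// how the export folded the head's bias; it is an assumption here).
func augmentedQuery(from embedding: [Float], tail: Float) -> [Float] {
    let norm = embedding.reduce(Float(0)) { $0 + $1 * $1 }.squareRoot()
    let scale: Float = norm > 0 ? 1 / norm : 0
    return embedding.map { $0 * scale } + [tail]
}

/// Hypothetical helper: scores every anchor against every query.
/// - regions: channel-major buffer, regions[c * numAnchors + a]
///   (e.g. a contiguous copy of region_embeddings[1, 513, 8400]).
/// - queries: K augmented queries, each with one entry per channel.
/// Returns sigmoid confidences as scores[k][a].
func classScores(regions: [Float], numAnchors: Int, queries: [[Float]]) -> [[Float]] {
    precondition(numAnchors > 0 && regions.count % numAnchors == 0,
                 "buffer size must be channels * numAnchors")
    let channels = regions.count / numAnchors
    return queries.map { (q: [Float]) -> [Float] in
        precondition(q.count == channels, "query length must match channel count")
        var logits = [Float](repeating: 0, count: numAnchors)
        // Walk one channel at a time so the inner loop reads contiguous memory.
        for c in 0..<channels {
            let w = q[c]
            let base = c * numAnchors
            for a in 0..<numAnchors {
                logits[a] += w * regions[base + a]
            }
        }
        // Logits -> confidences in (0, 1).
        return logits.map { 1 / (1 + exp(-$0)) }
    }
}
```

Because the image branch runs once per frame, `classScores` can be called again on the same `regions` buffer with a different query list. On device the double loop is a natural candidate for an Accelerate matrix multiply, but the plain loop keeps the sketch portable.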

yoloe_detector_s/l — region-embedding detector + seg (image -> boxes, region_embeddings[1,513,8400], mask_coeffs, mask_protos)
reprta_s/l — RepRTA text-prompt refinement MLP (raw_tpe -> tpe)
visual_prompt_encoder_s/l — SAVPE visual prompt (image + 80x80 box mask -> vpe[1,1,512]); a drop-in for the text query
mobileclip_blt_text — Apple MobileCLIP B-LT text encoder (shared)
clip_vocab.json — CLIP BPE vocabulary (shared)
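
SAVPE takes the image together with an 80x80 box mask. That grid size is consistent with the stride-8 level of a 640x640 input (8400 = 80*80 + 40*40 + 20*20 anchors in the detector output), although these notes do not state the input resolution. A minimal sketch of rasterizing one prompt box, assuming the box is normalized to [0, 1] over the model input and that a cell is switched on when its integer column index lies in [x1*80, x2*80) and its row index in [y1*80, y2*80). The function name `boxMask`, the coordinate convention, and the row-major [Float] output are assumptions; the exact layout of the model's mask input is not specified here.

```swift
import Foundation

/// Hypothetical helper: rasterizes one box into a row-major side x side mask
/// (1 inside, 0 outside). `box` is (x1, y1, x2, y2) normalized to [0, 1]
/// over the model input, which is an assumed convention.
func boxMask(box: (x1: Float, y1: Float, x2: Float, y2: Float), side: Int = 80) -> [Float] {
    let s = Float(side)
    // Scale the box from normalized coordinates to grid units.
    let x1 = box.x1 * s, y1 = box.y1 * s
    let x2 = box.x2 * s, y2 = box.y2 * s
    var mask = [Float](repeating: 0, count: side * side)
    for row in 0..<side {
        let r = Float(row)
        // Skip rows outside [y1, y2).
        guard r >= y1 && r < y2 else { continue }
        for col in 0..<side {
            let c = Float(col)
            // Mark columns inside [x1, x2).
            if c >= x1 && c < x2 {
                mask[row * side + col] = 1
            }
        }
    }
    return mask
}
```

Under this rule a box narrower than one grid cell can rasterize to an empty mask, so an app may want to force at least one cell on before running the visual prompt encoder.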

Demo: sample_apps/YOLOEDemo (camera/photo/video + Visual tab). License: AGPL-3.0 (THU-MIG/yoloe); MobileCLIP is Apple's export.
