YOLOE — Open-Vocabulary Detection + Segmentation + Visual Prompts

@john-rocky released this 01 Jun 05:55
· 10 commits to master since this release

YOLOE converted to Core ML: real-time open-vocabulary detection + instance segmentation, with text and visual prompts. Sizes S (fast) and L (accurate).

Region-embedding design: the detector is text-free and emits per-anchor region embeddings. The region-query similarity (BNContrastiveHead, folded into a 513-d augmented dot product) runs in Swift, so changing the vocabulary only means recomputing the query vectors and the dot products; the image branch does not have to run again.
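To illustrate what an "augmented dot product" can look like, here is a minimal Python sketch (the shipped app does this step in Swift). It assumes the usual contrastive-head form, score = scale * <BN(region), normalized query> + bias, with the BatchNorm and scale already baked into the first 512 region channels, a constant 1 as the 513th region channel, and the bias as the 513th query component. That split, and the helper names `augment_region`, `augment_query` and `score`, are assumptions for illustration; the release does not spell out the exact layout.

```python
import math

def augment_region(scaled_embedding):
    """Append a constant 1 so a bias can ride along in the dot product.

    Assumption: BatchNorm and logit scale are already applied to the values.
    """
    return list(scaled_embedding) + [1.0]

def augment_query(prompt_embedding, bias):
    """L2-normalize a prompt embedding and append the bias as the last component."""
    norm = math.sqrt(sum(v * v for v in prompt_embedding)) or 1.0
    return [v / norm for v in prompt_embedding] + [bias]

def score(region_aug, query_aug):
    """Plain dot product over the augmented vectors (512 + 1 in the real model)."""
    return sum(r * q for r, q in zip(region_aug, query_aug))

def score_all(regions_aug, queries_aug):
    """Scores for every (region, query) pair; regions can be cached per image."""
    return [[score(r, q) for q in queries_aug] for r in regions_aug]

# Toy 4-d example instead of 512-d, to keep the arithmetic readable.
cached_regions = [augment_region([1.0, 2.0, 3.0, 4.0])]
vocab_a = [augment_query([3.0, 0.0, 4.0, 0.0], bias=-2.0)]
vocab_b = [augment_query([0.0, 1.0, 0.0, 0.0], bias=0.5)]

# Swapping the vocabulary reuses cached_regions; only the queries change.
scores_a = score_all(cached_regions, vocab_a)
scores_b = score_all(cached_regions, vocab_b)
```

Because the bias travels inside the vector, scoring a new vocabulary reduces to one matrix product between the cached region embeddings and the new query vectors.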

  • yoloe_detector_s/l — region-embedding detector + seg (image -> boxes, region_embeddings[1,513,8400], mask_coeffs, mask_protos)
  • reprta_s/l — RepRTA text-prompt refinement MLP (raw_tpe -> tpe)
  • visual_prompt_encoder_s/l — SAVPE visual prompt (image + 80x80 box mask -> vpe[1,1,512]); a drop-in for the text query
  • mobileclip_blt_text — Apple MobileCLIP B-LT text encoder (shared)
  • clip_vocab.json — CLIP BPE vocabulary (shared)
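The visual prompt encoder above takes an 80x80 box mask next to the image. A minimal Python sketch of how a pixel-space box could be rasterized into such a mask follows. It assumes a binary mask (1 inside the box, 0 outside) and a simple stretch from image coordinates to the 80x80 grid with no letterboxing; the helper name `box_to_mask` is hypothetical, and the exact preprocessing used by the demo app may differ.

```python
import math

def box_to_mask(box, image_size, grid=80):
    """Rasterize a box into a grid x grid binary mask (list of rows).

    box:        (x_min, y_min, x_max, y_max) in pixels
    image_size: (width, height) in pixels
    Assumption: cells touched by the box are set to 1.0, all others stay 0.0.
    """
    x0, y0, x1, y1 = box
    width, height = image_size

    # Map pixel coordinates to grid cells and clamp to the grid.
    c0 = max(0, min(grid - 1, int(x0 / width * grid)))
    r0 = max(0, min(grid - 1, int(y0 / height * grid)))
    # Round the far edge up and keep at least one cell per axis.
    c1 = max(c0 + 1, min(grid, math.ceil(x1 / width * grid)))
    r1 = max(r0 + 1, min(grid, math.ceil(y1 / height * grid)))

    mask = [[0.0] * grid for _ in range(grid)]
    for r in range(r0, r1):
        for c in range(c0, c1):
            mask[r][c] = 1.0
    return mask

# A box covering x 160..320 and y 160..480 of a 640x640 image.
mask = box_to_mask((160, 160, 320, 480), (640, 640))
filled_cells = sum(sum(row) for row in mask)
```

The resulting mask would be passed with the image to visual_prompt_encoder_s/l, whose vpe output then stands in for the text query.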

Demo: sample_apps/YOLOEDemo (camera/photo/video + Visual tab). License: AGPL-3.0 (THU-MIG/yoloe); MobileCLIP is Apple's export.