YOLOE converted to Core ML: real-time open-vocabulary detection + instance segmentation, with text and visual prompts. Sizes S (fast) and L (accurate).
Region-embedding design: the detector is text-free and emits per-anchor region embeddings. The region-query similarity (BNContrastiveHead, folded into a 513-d augmented dot product) runs in Swift, so changing the vocabulary only requires new query vectors; the image branch does not need to run again.
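The augmented dot product can be sketched as follows. This is a minimal, language-neutral illustration in Python, not the shipped Swift code. It assumes that the first 512 channels of each region embedding already include the folded BatchNorm and logit scale, that the 513th channel carries the folded bias, that the prompt embedding is L2-normalized and extended with a constant 1, and that a sigmoid turns the logit into a confidence. The function names and the toy 3-channel data are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def augment_query(query):
    """Assumed query layout: unit-normalized prompt embedding plus a constant 1."""
    return l2_normalize(query) + [1.0]

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def score_regions(region_embeddings, queries):
    """Score every anchor against every query.

    region_embeddings: channel-major list, C+1 channels of N anchors each
                       (the same layout as [513, 8400] with the batch axis dropped).
    queries:           K augmented query vectors of length C+1.
    Returns a K x N list of confidences.
    """
    num_anchors = len(region_embeddings[0])
    scores = []
    for q in queries:
        row = []
        for a in range(num_anchors):
            logit = sum(q[c] * region_embeddings[c][a] for c in range(len(q)))
            row.append(sigmoid(logit))
        scores.append(row)
    return scores

# Toy data: 2 embedding channels + 1 bias channel, 2 anchors.
region_embeddings = [
    [2.0, 0.0],    # channel 0
    [0.0, 2.0],    # channel 1
    [-1.0, -1.0],  # assumed bias channel
]
cat_query = augment_query([3.0, 0.0])  # stands in for a text or visual prompt
dog_query = augment_query([0.0, 5.0])
scores = score_regions(region_embeddings, [cat_query, dog_query])
```

Because the image side is fixed once the detector has run, swapping the vocabulary in this sketch only rebuilds `queries` and repeats `score_regions`.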
- yoloe_detector_s/l — region-embedding detector + seg (image -> boxes, region_embeddings[1,513,8400], mask_coeffs, mask_protos)
- reprta_s/l — RepRTA text-prompt refinement MLP (raw_tpe -> tpe)
- visual_prompt_encoder_s/l — SAVPE visual prompt (image + 80x80 box mask -> vpe[1,1,512]); a drop-in for the text query
- mobileclip_blt_text — Apple MobileCLIP B-LT text encoder (shared)
- clip_vocab.json — CLIP BPE vocabulary (shared)
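For the segmentation outputs, a minimal sketch of how `mask_coeffs` and `mask_protos` might be combined. The tensor layouts are not stated here, so this assumes the usual prototype-mask scheme of YOLO segmentation heads: each detection's mask logits are a linear combination of the prototypes weighted by that detection's coefficients. The function name and the toy 2x2 prototypes are hypothetical.

```python
def assemble_mask(coeffs, protos):
    """Combine prototype masks into one binary instance mask.

    coeffs: P coefficients for a single detection.
    protos: P prototype masks, each an H x W nested list.
    Returns an H x W list of 0/1 values.
    """
    height, width = len(protos[0]), len(protos[0][0])
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            logit = sum(c * p[y][x] for c, p in zip(coeffs, protos))
            # sigmoid(logit) > 0.5 is equivalent to logit > 0
            row.append(1 if logit > 0.0 else 0)
        mask.append(row)
    return mask

# Toy data: 2 prototypes of size 2x2 and one detection.
protos = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 1.0]],
]
mask = assemble_mask([2.0, -1.0], protos)
```

Pipelines of this kind usually also crop the mask to the detection box and upsample it to the input resolution; both steps are omitted here.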
Demo: sample_apps/YOLOEDemo (camera/photo/video + Visual tab). License: AGPL-3.0 (THU-MIG/yoloe); MobileCLIP is Apple's export.