Description
Is your feature request related to a problem? Please describe.
Set-of-Mark Prompting unlocks better control of multi-modal models like GPT-4V. The authors present Set-of-Mark (SoM), a new visual prompting method, to enhance the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. This method involves partitioning an image into regions at different levels of granularity and overlaying these regions with marks like alphanumerics, masks, and boxes. This enables GPT-4V to answer questions requiring visual grounding more effectively. The authors' experiments show significant improvements in tasks like referring expression comprehension and segmentation.
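The idea can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the real SoM pipeline partitions the image with segmentation models, whereas this hypothetical `overlay_marks` helper just uses a coarse grid to show the marking step — overlaying a numeric mark on each region so the model can answer with "region 3" instead of pixel coordinates.

```python
# Hedged sketch of Set-of-Mark-style annotation (hypothetical helper,
# grid partition instead of real segmentation).
from PIL import Image, ImageDraw

def overlay_marks(image: Image.Image, rows: int = 2, cols: int = 2):
    """Return a marked copy of `image` plus a mark-id -> region-box mapping."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    w, h = image.size
    marks = {}
    mark_id = 1
    for r in range(rows):
        for c in range(cols):
            # Region box for this grid cell (left, top, right, bottom).
            box = (c * w // cols, r * h // rows,
                   (c + 1) * w // cols, (r + 1) * h // rows)
            draw.rectangle(box, outline="red", width=2)
            # The alphanumeric mark the model can refer to in its answer.
            draw.text((box[0] + 4, box[1] + 4), str(mark_id), fill="red")
            marks[mark_id] = box
            mark_id += 1
    return annotated, marks

img = Image.new("RGB", (200, 100), "white")
annotated, marks = overlay_marks(img, rows=2, cols=2)
print(len(marks))   # 4 marked regions
print(marks[1])     # (0, 0, 100, 50)
```

The `marks` mapping is the key artifact: when the model replies with a mark id, the caller can translate it back to concrete coordinates (e.g. for a click in OS mode).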
Describe the solution you'd like
I propose integrating Set-of-Mark Prompting into Open Interpreter's OS mode. Overlaying marks on screenshots would give the model unambiguous references to on-screen regions, improving its visual grounding on multi-modal tasks and making Open Interpreter's existing OS-control capabilities more reliable.
Describe alternatives you've considered
No response
Additional context
- Research paper: SoM Method
- Source code by @microsoft: SoM Repo
- Implementation example by @joshbickett: GitHub PR