Description
Is your feature request related to a problem? Please describe.
Set-of-Mark Prompting unlocks better control of multi-modal models like GPT-4V. The authors present Set-of-Mark (SoM), a new visual prompting method, to enhance the visual grounding abilities of large multimodal models (LMMs), such as GPT-4V. This method involves partitioning an image into regions at different levels of granularity and overlaying these regions with marks like alphanumerics, masks, and boxes. This enables GPT-4V to answer questions requiring visual grounding more effectively. The authors' experiments show significant improvements in tasks like referring expression comprehension and segmentation.
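The idea can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the real SoM pipeline partitions the image with segmentation models, whereas this hypothetical `overlay_marks` helper just uses a coarse grid to show the marking step — overlaying a numeric mark on each region so the model can answer with "region 3" instead of pixel coordinates.

```python
# Hedged sketch of Set-of-Mark-style annotation (hypothetical helper,
# grid partition instead of real segmentation).
from PIL import Image, ImageDraw

def overlay_marks(image: Image.Image, rows: int = 2, cols: int = 2):
    """Return a marked copy of `image` plus a mark-id -> region-box mapping."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    w, h = image.size
    marks = {}
    mark_id = 1
    for r in range(rows):
        for c in range(cols):
            # Region box for this grid cell (left, top, right, bottom).
            box = (c * w // cols, r * h // rows,
                   (c + 1) * w // cols, (r + 1) * h // rows)
            draw.rectangle(box, outline="red", width=2)
            # The alphanumeric mark the model can refer to in its answer.
            draw.text((box[0] + 4, box[1] + 4), str(mark_id), fill="red")
            marks[mark_id] = box
            mark_id += 1
    return annotated, marks

img = Image.new("RGB", (200, 100), "white")
annotated, marks = overlay_marks(img, rows=2, cols=2)
print(len(marks))   # 4 marked regions
print(marks[1])     # (0, 0, 100, 50)
```

The `marks` mapping is the key artifact: when the model replies with a mark id, the caller can translate it back to concrete coordinates (e.g. for a click in OS mode).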
Describe the solution you'd like
I propose integrating Set-of-Mark Prompting into Open Interpreter's OS mode. Overlaying marks on screenshots would give the model unambiguous references to on-screen regions, improving its visual grounding on multi-modal tasks and making Open Interpreter's existing OS-control capabilities more reliable.
Describe alternatives you've considered
No response
Additional context
- Research paper: SoM Method
- Source code by @microsoft: SoM Repo
- Implementation example by @joshbickett: GitHub PR