
[Issue]: Autogen with vision models like GPT-4o creates HUGE spike in usage and bill #2827

@daniel-counto

Description

Describe the issue

I created three agents to read document images: black-and-white financial documents that are not very large (around 1000 x 2000 px or smaller). All of them use GPT-4o.

The flow is mostly linear, i.e. agent 1 -> agent 2 -> a final agent that summarizes the output.
However, for only the 400 images I uploaded, it has already cost me over USD 200, and the context usage is about 28+ million tokens!

I wonder whether this is because AutoGen inserts the image bytes into the prompt itself. If so, wouldn't the better approach be to upload the images somewhere and insert only the image URL into the prompts?
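For scale, OpenAI's published vision pricing formula gives a rough per-image token cost. The sketch below is an estimate based on that formula, not AutoGen code; the function name is mine:

```python
import math

def gpt4o_image_tokens(width, height, detail="high"):
    """Rough per-image token estimate based on OpenAI's published
    vision pricing formula."""
    if detail == "low":
        return 85  # flat cost regardless of resolution
    # 1. Scale the image to fit within a 2048 x 2048 square.
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # 2. Scale so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    # 3. Count 512 px tiles: 170 tokens each, plus 85 base tokens.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return tiles * 170 + 85

print(gpt4o_image_tokens(1000, 2000))         # 1105 tokens at detail="high"
print(gpt4o_image_tokens(1000, 2000, "low"))  # 85 tokens at detail="low"
```

So a single 1000 x 2000 image costs roughly 1,100 tokens at the default "high" detail, versus a flat 85 at "low".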

Steps to reproduce

Step 1 - The agents are constructed as follows:


image_agent = MultimodalConversableAgent(
    name="image-content-extracter",
    max_consecutive_auto_reply=10,
    llm_config={"config_list": config_list_gpt4, "temperature": 0.05, "max_tokens": 1024, "cache_seed": None},
    human_input_mode="NEVER",
)

agent_1 = MultimodalConversableAgent(
    name="agent_1",
    system_message='''You are a helpful agent.
Look at the image, compare the extraction results against those extracted by image-content-extracter, and correct any mistakes found.''',
    max_consecutive_auto_reply=4,
    llm_config={"config_list": config_list_gpt4, "temperature": 0, "max_tokens": 1024, "cache_seed": None},
    human_input_mode="NEVER",
)

agent_2 = MultimodalConversableAgent(
    name="agent_2",
    system_message='''You are agent_1's assistant. You put the finalized results in JSON format.''',
    max_consecutive_auto_reply=2,
    llm_config={"config_list": config_list_gpt4, "temperature": 0, "max_tokens": 800, "response_format": {"type": "json_object"}, "cache_seed": None},
    human_input_mode="NEVER",
)

coder = autogen.AssistantAgent(
    name="coding_assistant",
    system_message="Helpful coding assistant.",
    llm_config={"config_list": config_list_gpt4, "temperature": 0.1, "max_tokens": 2048},
)

groupchat = autogen.GroupChat(agents=[user_proxy, image_agent, agent_1, agent_2], messages=[], max_round=5)
group_chat_manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=gpt4_llm_config)

Step 2

user_prompt = "<this is a detailed prompt of about 1700 tokens>"

session = user_proxy.initiate_chat(
    group_chat_manager,
    message= user_prompt
)

Step 3

Execute the above multi-agent flow with about 500 images. Each is a standard invoice image.
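One likely driver of the bill is that every group-chat turn resends the entire conversation so far, including the image, so context usage per image grows roughly quadratically with the number of turns. A minimal sketch; the 2,800-tokens-per-turn figure (prompt plus image at high detail) and the round count are illustrative assumptions, not measured values:

```python
def context_tokens_per_image(per_turn_tokens, turns):
    """Each new turn resends the full history so far, so total
    context usage grows roughly quadratically with turn count."""
    total, history = 0, 0
    for _ in range(turns):
        history += per_turn_tokens  # the history gets one turn longer
        total += history            # and the whole history is billed again
    return total

# Assumed: ~1700-token prompt + ~1100-token image per turn, 5 rounds.
per_image = context_tokens_per_image(2800, 5)
print(per_image)        # 42000 context tokens for one image's chat
print(per_image * 400)  # 16800000 tokens across 400 images
```

Under these assumptions a single image's chat consumes about 42,000 context tokens, and 400 images land in the tens of millions of tokens, the same order of magnitude as the 28+ million reported above.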

Screenshots and logs

[Screenshot 2024-05-28 211431: single-day token usage, although only about 400 images were uploaded.]

Additional Information

The right way to send an image to the OpenAI API is not to embed it as a plain string, but to use this message shape:

{
    "role": "user",
    "content": [
        {"type": "text", "text": "How many bananas?"},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/x-png;base64,{base64_image}", "detail": "low"},
        },
    ],
}

Please make the changes.


Labels

    0.2 (issues which are related to the pre-0.4 codebase)
    multimodal (language + vision, speech etc.)
    needs-triage
