- Helpful guide from AutoGen: https://github.com/microsoft/autogen/blob/main/notebook/agentchat_lmm_gpt-4v.ipynb
- And the blog post: https://microsoft.github.io/autogen/blog/2023/11/06/LMM-Agent/
- From AI Jason: https://github.com/JayZeeDesign/vision-agent-with-llava/blob/main/app.py

In [15]:
import os
import json
import autogen

from dotenv import load_dotenv
load_dotenv()



True

In [27]:
config_list_4v = autogen.config_list_from_json(
    "OAI_CONFIG_LIST",
    filter_dict={
        "model": ["gpt-4-vision-preview"],
    },
)


config_list_gpt4 = autogen.config_list_from_json(
    "OAI_CONFIG_LIST",
    filter_dict={
        "model": ["gpt-4-1106-preview", "gpt-4", "gpt4", "gpt-4-32k", "gpt-4-32k-0314", "gpt-4-32k-v0314"],
    },
)

gpt4_llm_config = {"config_list": config_list_gpt4}
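
For reference, `OAI_CONFIG_LIST` is expected to hold a JSON list of model entries, loaded from a file or environment variable of that name. The sketch below is a plain-Python approximation of what `filter_dict` does, not AutoGen's own implementation. The `filter_configs` helper and the three entries are hypothetical, and the API keys are placeholders.

```python
import json

# Hypothetical contents of OAI_CONFIG_LIST (placeholder keys, not real ones).
raw = """
[
    {"model": "gpt-4-vision-preview", "api_key": "sk-placeholder"},
    {"model": "gpt-4-1106-preview", "api_key": "sk-placeholder"},
    {"model": "gpt-3.5-turbo", "api_key": "sk-placeholder"}
]
"""

def filter_configs(configs, filter_dict):
    """Keep entries whose value for every filter key is in the allowed list."""
    return [
        cfg for cfg in configs
        if all(cfg.get(key) in allowed for key, allowed in filter_dict.items())
    ]

configs = json.loads(raw)
vision_only = filter_configs(configs, {"model": ["gpt-4-vision-preview"]})
print([cfg["model"] for cfg in vision_only])
```

If a filter matches nothing, the resulting list is empty, so it is worth checking the length of each config list before passing it to an agent.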

### Describe an Image

Requires installing the `Pillow` package.

In [11]:
from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent

image_agent = MultimodalConversableAgent(
    name="image-explainer",
    max_consecutive_auto_reply=10,
    llm_config={"config_list": config_list_4v, "temperature": 0.5, "max_tokens": 300}
)

user_proxy = autogen.UserProxyAgent(
    name="User_proxy",
    system_message="A human admin.",
    human_input_mode="NEVER",  # Try "ALWAYS" or "NEVER"
    max_consecutive_auto_reply=0
)

# Ask the question with an image
user_proxy.initiate_chat(image_agent, 
                         message="""What's on the image? 
<img images/RabbitCartoonHD.png>.""")

User_proxy (to image-explainer):

What's on the image? 
<image>.

--------------------------------------------------------------------------------

>>>>>>>> USING AUTO REPLY...
image-explainer (to User_proxy):

The image shows an adorable, stylized cartoon rabbit with large, expressive eyes and a playful smile. The rabbit has soft-looking fur, large ears, and is depicted in mid-leap against a vibrant, colorful background that suggests a serene, grassy meadow during sunrise or sunset. The sky is filled with warm colors and a few birds can be seen in the distance. There are also some flowers and plants in the foreground, adding to the peaceful and cheerful setting of the image.

--------------------------------------------------------------------------------
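
The log above shows the `<img path>` tag from the prompt replaced by an `<image>` placeholder. The sketch below illustrates that tag convention with a regular expression. It is an illustration only, not the parser AutoGen uses, and `extract_image_refs` is a hypothetical helper.

```python
import re

# Matches tags of the form <img path-or-url>, as written in the prompts here.
IMG_TAG = re.compile(r"<img\s+([^>]+)>")

def extract_image_refs(message):
    """Return the message with <image> placeholders, plus the image references."""
    refs = [ref.strip() for ref in IMG_TAG.findall(message)]
    text = IMG_TAG.sub("<image>", message)
    return text, refs

text, refs = extract_image_refs(
    "What's on the image?\n<img images/RabbitCartoonHD.png>."
)
print(text)
print(refs)
```

A helper like this can also be used to confirm that every referenced file exists before starting a chat.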


### Compare images

In [12]:
# Ask a question about two images
user_proxy.initiate_chat(image_agent, 
                         message="""Which image will be more appealing to a 3yo child? Why?
<img images/AIRobot.jpeg>                          
<img images/RabbitCartoonHD.png>.""")

User_proxy (to image-explainer):

Which image will be more appealing to a 3yo child? Why?
<image>                          
<image>.

--------------------------------------------------------------------------------

>>>>>>>> USING AUTO REPLY...
image-explainer (to User_proxy):

The second image, which features a cute, cartoon-style bunny in a colorful and whimsical setting, would likely be more appealing to a 3-year-old child. Young children are often attracted to bright colors, simple shapes, and friendly, anthropomorphic animal characters that are common in children's media. The bunny's exaggerated features, such as large eyes and a cheerful expression, are designed to be engaging and endearing to young viewers.

The first image, which depicts a highly detailed and realistic robot holding a smartphone, might not be as immediately appealing to a very young child. The complexity and mechanical nature of the robot, along with the cooler color palette, may not 

In [13]:
# Ask the question with an image
user_proxy.initiate_chat(image_agent, 
                         message="""What's the facial expression of the person on the image? WHat can you tell about the person?
<img images/KO.png>.""")

User_proxy (to image-explainer):

What's the facial expression of the person on the image? WHat can you tell about the person?
<image>.

--------------------------------------------------------------------------------

>>>>>>>> USING AUTO REPLY...
image-explainer (to User_proxy):

The person in the image is smiling and appears to be in a positive or happy mood. His arms are crossed, which can sometimes indicate confidence or self-assurance. He is wearing a light blue, button-up shirt, which suggests a business casual or professional dress code. The background is black, indicating that the photo might have been taken in a studio setting or edited to have a plain background, which is common for professional headshots or promotional materials.

--------------------------------------------------------------------------------


### Generate Images with DALL-E 3

In [51]:
from openai import OpenAI
client = OpenAI()

def generate_image(prompt):
    print("Calling the function with DALL-E 3")
    response = client.images.generate(
        model="dall-e-3",
        prompt=f"{prompt}",
        size="1024x1024",
        style="vivid",
        n=1,
    )
    image_url = response.data[0].url
    print(image_url)
    return f"<img {image_url}> TERMINATE"
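
The URL returned by `generate_image` is a signed link. In the URL logged later in this notebook, the `st` and `se` query parameters look like start and expiry timestamps about two hours apart, so the image should be downloaded promptly. The sketch below reads the `se` value from such a URL. Treating `se` as the expiry is an assumption based on that log, `link_expiry` is a hypothetical helper, and the example URL is made up.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse, parse_qs

def link_expiry(image_url):
    """Return the `se` timestamp of a signed URL as an aware datetime, or None."""
    query = parse_qs(urlparse(image_url).query)  # parse_qs also decodes %3A to ':'
    values = query.get("se")
    if not values:
        return None
    parsed = datetime.strptime(values[0], "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)

# Made-up URL with the same parameter layout as the logged one.
url = (
    "https://example.com/img.png"
    "?st=2023-11-09T10%3A45%3A03Z&se=2023-11-09T12%3A45%3A03Z&sp=r"
)
expiry = link_expiry(url)
print(expiry.isoformat())
```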

Function-calling configuration for the DALL-E assistant:

In [52]:
# Create llm config
llm_config_dalle = {
    "functions": [
        {
            "name": "generate_image",
            "description": "Use DALL-E 3 model to generate image based on a prompt, return the URL of the generated image",
            "parameters": {
                "type": "object",
                "properties": {
                    "prompt": {
                        "type": "string",
                        "description": "a descriptive prompt that describes the image in detail",
                    }
                },
                "required": ["prompt"],
            },
        },
    ],
    "config_list": config_list_gpt4,
    # "request_timeout": 120,
}
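
With this schema, the model does not run anything itself. It replies with a function name and a JSON string of arguments, and an agent that holds a matching entry in its function map executes the Python function. The sketch below shows that dispatch step in isolation. It is a simplified illustration, not AutoGen's internal code. `dispatch` and `generate_image_stub` are hypothetical names, and the stub returns a made-up URL so that no API call is needed.

```python
import json

def generate_image_stub(prompt):
    """Stand-in for generate_image so the sketch runs without an API call."""
    return f"<img https://example.com/{prompt.replace(' ', '-')}.png> TERMINATE"

function_map = {"generate_image": generate_image_stub}

def dispatch(function_call, function_map):
    """Look up the named function and call it with the decoded JSON arguments."""
    name = function_call["name"]
    if name not in function_map:
        return f"Error: function {name} not found."
    # The arguments arrive as a JSON string, not as a dict.
    kwargs = json.loads(function_call["arguments"])
    return function_map[name](**kwargs)

call = {"name": "generate_image", "arguments": '{"prompt":"a baby sheep"}'}
print(dispatch(call, function_map))
```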

In [55]:
from autogen import AssistantAgent, UserProxyAgent
from autogen.agentchat.contrib.multimodal_conversable_agent import MultimodalConversableAgent

image_agent = MultimodalConversableAgent(
    name="image-explainer",
    max_consecutive_auto_reply=10,
    llm_config={"config_list": config_list_4v, "temperature": 0.5, "max_tokens": 300}
)

# Create assistant agent
img_gen_assistant = AssistantAgent(
    name="text_to_img_prompt_expert",
    system_message="""You are a text to image AI model expert, 
    you will use generate_image function to generate image with prompt provided, 
    and also improve prompt based on feedback provided until it is 10/10.""",
    llm_config=llm_config_dalle,
    function_map={
        "generate_image": generate_image,
    },
)

img_critic_assistant = MultimodalConversableAgent(
    name="img_critic",
    system_message="""You are an AI image critic. You will review the image created with generate_image.
     You will rate the image on a scale of 1-10 based on 2 criteria:
     1. Does the image meet the initial requirement of the user?
     2. Is the image suitable for 3-year-old kids?
     You will provide feedback on how to improve the prompt to perfectly match both criteria.""",
    llm_config={"config_list": config_list_4v},
    # function_map={
    #     "generate_image": generate_image,
    # },
)

# Create user proxy agent
# NOTE: `admin` is never added to the group chat below, which reuses the earlier `user_proxy`.
admin = UserProxyAgent(
    name="user_proxy",
    human_input_mode="ALWAYS",
)

# Create groupchat
groupchat = autogen.GroupChat(
    agents=[user_proxy, img_gen_assistant, img_critic_assistant],
    messages=[],
    max_round=5,
)

manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=gpt4_llm_config)

In [56]:
user_proxy.initiate_chat(manager, message="Generate an image of a baby sheep.")

User_proxy (to chat_manager):

Generate an image of a baby sheep.

--------------------------------------------------------------------------------
text_to_img_prompt_expert (to chat_manager):

***** Suggested function Call: generate_image *****
Arguments: 
{"prompt":"a baby sheep"}
***************************************************

--------------------------------------------------------------------------------

>>>>>>>> EXECUTING FUNCTION generate_image...
Calling the function with DALL-E 3
https://oaidalleapiprodscus.blob.core.windows.net/private/org-Xzj1yaisXt5cDNveMyg1I7wI/user-9YZozn3HQ3tv70cLgw6ILiKX/img-mgR5rYwPQebSJCnRM9681MOJ.png?st=2023-11-09T10%3A45%3A03Z&se=2023-11-09T12%3A45%3A03Z&sp=r&sv=2021-08-06&sr=b&rscd=inline&rsct=image/png&skoid=6aaadede-4fb3-4698-a8f6-684d7786b067&sktid=a48cca56-e6da-484e-a814-9c849652bcb3&skt=2023-11-09T11%3A45%3A03Z&ske=2023-11-10T11%3A45%3A03Z&sks=b&skv=2021-08-06&sig=sZnVXl7KOKMtQKUxmeKdRj5Y6Z8Xl

TypeError: string indices must be integers
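
The traceback is cut off here, so the exact cause is not shown. In general, Python raises this `TypeError` when a string is indexed with a string key, for example when code expects a dict such as `{"type": "text", ...}` but receives a plain string. One possible reading, and this is only a guess, is that the multimodal agents received message content in a form they did not expect. The sketch below reproduces the error message and shows a hypothetical `content_to_text` helper that accepts either form. It is not part of AutoGen.

```python
def content_to_text(content):
    """Flatten message content that may be a plain string or a list of parts."""
    if isinstance(content, str):
        return content
    parts = []
    for item in content:
        if isinstance(item, str):
            parts.append(item)
        elif item.get("type") == "text":
            parts.append(item["text"])
        else:
            parts.append("<image>")  # placeholder for any non-text part
    return "".join(parts)

# Reproduce the error message: indexing a str with a str key.
try:
    "<img https://example.com/a.png> TERMINATE"["type"]
except TypeError as exc:
    error_text = str(exc)
print(error_text)

# Both content shapes are handled without raising.
print(content_to_text("plain text"))
print(content_to_text([{"type": "text", "text": "A sheep. "},
                       {"type": "image_url", "image_url": {"url": "x"}}]))
```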