# Agent Chat with Multi-Modality Models

Here, we use LLaVA as an example.


## LLaVA Setup
Please follow the LLaVA GitHub [page](https://github.com/haotian-liu/LLaVA/) to install LLaVA, download the weights, and start the server.

For instance, here are some important steps:
```bash
# Download the package
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA

# Install the inference package
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

# Download and serve the model (the controller from the Launch section below must already be running)
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-13b
```

Some helpful packages and dependencies:
```bash
conda install -c nvidia cuda-toolkit
```


### Launch

In one terminal, start the controller first:
```bash
python -m llava.serve.controller --host 0.0.0.0 --port 10000
```
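
Before starting the worker, it can help to confirm that the controller port is accepting connections. The following is a minimal sketch, not part of LLaVA: `wait_for_port` is a hypothetical helper, and it only checks that something is listening on the TCP port, not that the listener is actually the LLaVA controller.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

For the setup above, `wait_for_port("localhost", 10000)` would poll the controller port used in the command line.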


Then, in another terminal, start the worker, which will load the model to the GPU:
```bash
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-13b
```

**Note: make sure the `llava` package is also installed (via `pip install -e .`) in the environment that runs this notebook.**
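
One quick way to check this from inside the notebook kernel is to ask the interpreter whether it can locate the package. This is a small sketch using only the standard library; `is_installed` is a hypothetical helper name, and `llava` is assumed to be the top-level package name, as in the imports used below.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level `package` can be found on the current import path."""
    return importlib.util.find_spec(package) is not None


# Warn early instead of failing later on `from llava.conversation import ...`.
if not is_installed("llava"):
    print("llava is not importable from this kernel; activate the right environment first.")
```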

In [1]:
import requests
import json
from llava.conversation import default_conversation as conv
from llava.conversation import Conversation

[2023-10-16 10:06:11,529] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)


In [2]:
# Setup some global constants for convenience
WORKER_ADDR = "http://0.0.0.0:40000"
CONTROLLER_ADDR = "http://0.0.0.0:10000"
SEP = conv.sep
ret = requests.post(CONTROLLER_ADDR + "/list_models")
print(ret.json())
MODEL_NAME = ret.json()["models"][0]
print("Model Name:", MODEL_NAME)

{'models': ['llava-v1.5-13b']}
Model Name: llava-v1.5-13b


In [3]:
import base64
import re
from io import BytesIO

from PIL import Image


def extract_img_paths(paragraph: str) -> list:
    """
    Extract image paths (URLs or local paths) from a text paragraph.
    
    Parameters:
        paragraph (str): The input text paragraph.
        
    Returns:
        list: A list of extracted image paths.
    """
    # Regular expression to match image URLs and file paths
    img_path_pattern = re.compile(r'\b(?:http[s]?://\S+\.(?:jpg|jpeg|png|gif|bmp)|\S+\.(?:jpg|jpeg|png|gif|bmp))\b', 
                                  re.IGNORECASE)
    
    # Find all matches in the paragraph
    img_paths = re.findall(img_path_pattern, paragraph)
    return img_paths


def get_image_data(image_file):
    if image_file.startswith('http://') or image_file.startswith('https://'):
        response = requests.get(image_file)
        content = response.content
    elif image_file.startswith("data:image/png;base64,"):
        return image_file.replace("data:image/png;base64,", "")
    else:
        # Local file: read the raw bytes; the context manager closes the handle
        with open(image_file, "rb") as f:
            content = f.read()
    return base64.b64encode(content).decode('utf-8')
    
def _to_pil(data):
    return Image.open(BytesIO(base64.b64decode(data)))


def llava_call(prompt:str, model_name: str=MODEL_NAME, images: list=[], max_new_tokens:int=1000) -> str:
    """
    Makes a call to the LLaVA service to generate text based on a given prompt and optionally provided images.

    Args:
        - prompt (str): The input text for the model. Any image paths or placeholders in the text should be replaced with "<image>".
        - model_name (str, optional): The name of the model to use for the text generation. Defaults to the global constant MODEL_NAME.
        - images (list, optional): A list of image paths or URLs. If not provided, they will be extracted from the prompt.
            If provided, they will be appended to the prompt with the "<image>" placeholder.
        - max_new_tokens (int, optional): Maximum number of new tokens to generate. Defaults to 1000.

    Returns:
        - str: Generated text from the model.

    Raises:
        - AssertionError: If the prompt contains more "<image>" tokens than there are provided images.
        - RuntimeError: If any of the provided images is empty.

    Notes:
    - The function uses global constants: WORKER_ADDR and SEP.
    - Any image paths or URLs in the prompt are automatically replaced with the "<image>" token.
    - If more images are provided than there are "<image>" tokens in the prompt, the extra tokens are appended to the end of the prompt.
    """
    if len(images) == 0:
        images = extract_img_paths(prompt)
        for im in images:
            prompt = prompt.replace(im, "<image>")
    else:
        # Append the <image> token if missing
        assert prompt.count("<image>") <= len(images), (
            "the prompt must not contain more <image> tokens than there are images in the list!"
        )
        num_token_missing = len(images) - prompt.count("<image>")
        prompt += " <image> " * num_token_missing

    
    images = [get_image_data(x) for x in images]
    
    for im in images:
        if len(im) == 0:
            raise RuntimeError("An image is empty!")
            
    headers = {"User-Agent": "LLaVA Client"}
    pload = {
        "model": model_name,
        "prompt": prompt,
        "max_new_tokens": max_new_tokens,
        "temperature": 0.5,
        "stop": SEP,
        "images": images,
    }

    response = requests.post(WORKER_ADDR + "/worker_generate_stream", headers=headers,
            json=pload, stream=False)

    output = ""  # stays empty if the worker returns no chunks
    for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False, delimiter=b"\0"):
        if chunk:
            data = json.loads(chunk.decode("utf-8"))
            output = data["text"].split(SEP)[-1]

    # Remove the echoed prompt and surrounding whitespace.
    output = output.replace(prompt, "").strip()
    
    return output
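
The placeholder handling in `llava_call` can be tried without a running server. The sketch below reuses the same regular expression as `extract_img_paths`; `replace_paths_with_placeholder` is a hypothetical helper written only for this illustration, and the paths in the example prompt are made up.

```python
import re

# Same pattern as extract_img_paths: URLs or file paths ending in an image extension.
IMG_PATH_PATTERN = re.compile(
    r'\b(?:http[s]?://\S+\.(?:jpg|jpeg|png|gif|bmp)|\S+\.(?:jpg|jpeg|png|gif|bmp))\b',
    re.IGNORECASE,
)


def replace_paths_with_placeholder(prompt: str) -> tuple:
    """Hypothetical helper: return (rewritten prompt, extracted image paths)."""
    paths = IMG_PATH_PATTERN.findall(prompt)
    for path in paths:
        prompt = prompt.replace(path, "<image>")
    return prompt, paths


new_prompt, paths = replace_paths_with_placeholder(
    "Compare photos/cat.png with https://example.com/dog.jpg please"
)
print(new_prompt)  # Compare <image> with <image> please
print(paths)       # ['photos/cat.png', 'https://example.com/dog.jpg']
```

Because the pattern requires an image file extension, URLs without one are not detected in the prompt text and need to be passed through the `images` argument instead.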


Here is the image that we are going to use.

![Image](https://github.com/haotian-liu/LLaVA/raw/main/images/llava_logo.png)

In [4]:
out = llava_call("Describe this image: <image>", 
                 images=["https://github.com/haotian-liu/LLaVA/raw/main/images/llava_logo.png"])
print(out)

This is a figurine of a red, fire-breathing, spiky-haired animal, possibly a lizard or a dragon. The figurine is made of plastic and has some orange flames coming from the top of its head. The figurine is wearing glasses and has a fire-breathing effect added to it.


In [5]:
out = llava_call("Describe this image in one sentence: https://github.com/haotian-liu/LLaVA/raw/main/images/llava_logo.png")
print(out)

This is a small red toy animal that is shaped like a llama. The toy is wearing glasses and has flames on its body. The toy is standing on a grey surface, possibly a table.


In [6]:
out = llava_call("Here is a LaTeX formula. Can you type it out for me? <image>", 
                 images=["https://th.bing.com/th/id/OIP.koxFBp0VFEzqeiNL9diaUwHaBY?pid=ImgDet&rs=1"])

print(out)

A math equation with variables

This image shows a math equation with variables and numbers in a black and white format. The equation appears to be a combination of logarithmic and trigonometric functions, with variables such as x, y, and z. The equation is written on a white background, which emphasizes the mathematical symbols and numbers. The presence of the variables and the complex function suggests that the equation might be used in scientific or technical applications.


## AutoGen Integration: Garden Helper


Here we demonstrate a very simple multi-agent collaboration for diagnosing a garden problem from a photo.

The user uploads an image of their garden. The image agent (with a LLaVA backend) reads the image and describes the problem. Then the suggestion agent (an AssistantAgent with a GPT model) suggests how to treat the problem.
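
Stripped of the agent framework, the data flow is a two-step pipeline. The sketch below is only an illustration of that flow under stated assumptions: `garden_helper`, `fake_describe`, and `fake_suggest` are hypothetical names, the stand-in functions return canned text, and none of this reflects how AutoGen's group chat actually routes messages.

```python
from typing import Callable


def garden_helper(image_ref: str,
                  describe_image: Callable[[str], str],
                  suggest_treatment: Callable[[str], str]) -> dict:
    """Run the two-step flow: describe the image, then suggest a treatment."""
    description = describe_image(image_ref)      # role of the LLaVA-backed image agent
    suggestion = suggest_treatment(description)  # role of the GPT-backed suggestion agent
    return {"description": description, "suggestion": suggestion}


# Stand-in callables so the sketch runs without any model or server.
def fake_describe(image_ref: str) -> str:
    return f"Leaves in {image_ref} show yellow spots."


def fake_suggest(description: str) -> str:
    return "Suggested treatment for: " + description


result = garden_helper("garden.jpg", fake_describe, fake_suggest)
print(result["suggestion"])
```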


Here, we found a problem in our garden and took a photo:
![](http://th.bing.com/th/id/R.105d684e5df7d540e61f6300d0bd374e?rik=PR8LCyvpe93DZA&pid=ImgRaw&r=0)

In [7]:
import logging

import autogen
from autogen import AssistantAgent, Agent

logger = logging.getLogger(__name__)  # used by ImageAgent for error reporting

config_list_gpt4 = autogen.config_list_from_json(
    "OAI_CONFIG_LIST",
    filter_dict={
        "model": ["gpt-4", "gpt-4-0314", "gpt4", "gpt-4-32k", "gpt-4-32k-0314", "gpt-4-32k-v0314"],
    },
)

llm_config = {"config_list": config_list_gpt4, "seed": 42}

In [8]:
class ImageAgent(AssistantAgent):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        
        self.register_reply([Agent, None], reply_func=ImageAgent._image_reply)
        
    def _image_reply(
        self,
        messages=None,
        sender=None, config=None
    ):
        # Note: we did not use "llm_config" yet.
        # TODO: make the LLaVA design compatible with llm_config
        if all((messages is None, sender is None)):
            error_msg = f"Either {messages=} or {sender=} must be provided."
            logger.error(error_msg)
            raise AssertionError(error_msg)

        if messages is None:
            messages = self._oai_messages[sender]

        image_name = messages[-1]["content"]
        prompt = "For the image: <image>\n\n" + self.system_message
        
        out = ""
        retry = 5
        while len(out) == 0 and retry > 0:
            out = llava_call(prompt=prompt,
                             images=[image_name, ])
            retry -= 1
            
        assert out != "", "Empty response from LLaVA."
        
        
        return True, out


image_agent = ImageAgent(
    name="image-explainer",
    system_message="What is in the image?\nHighlight the problems with the plants!\nDescribe in as many details as possible."
)

user_proxy = autogen.UserProxyAgent(
    name="User_proxy",
    system_message="A human admin.",
    code_execution_config={
        "last_n_messages": 3,
        "work_dir": "groupchat"
    },
    human_input_mode="NEVER",
    llm_config=llm_config,
)
suggestion_giver = autogen.AssistantAgent(
    name="Suggestion-Giver",
    system_message="Give me treatment suggestions for my garden! You can find the description of my image from the image-explainer agent. Keep the answer concise and short.",
    llm_config=llm_config,
)

groupchat = autogen.GroupChat(agents=[user_proxy, image_agent, suggestion_giver],
                              messages=[],
                              max_round=3)
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)


# Ask the question with an image
user_proxy.initiate_chat(manager, 
                         message="http://th.bing.com/th/id/R.105d684e5df7d540e61f6300d0bd374e?rik=PR8LCyvpe93DZA&pid=ImgRaw&r=0")


User_proxy (to chat_manager):

http://th.bing.com/th/id/R.105d684e5df7d540e61f6300d0bd374e?rik=PR8LCyvpe93DZA&pid=ImgRaw&r=0

--------------------------------------------------------------------------------
image-explainer (to chat_manager):

In the image, there is a bunch of strawberries on a tarp. The strawberries are fresh and ripe, with some of them appearing to be overripe. They are surrounded by several green leaves, which are likely part of the strawberry plant. The tarp appears to be covering the strawberries and leaves, possibly for protection or to keep them organized for transportation or sale.

--------------------------------------------------------------------------------
Suggestion-Giver (to chat_manager):

1. Regularly water your plants, especially during dry seasons, but avoid over-watering.
2. Use organic mulch to regulate soil temperature, retain moisture, and prevent weed growth.
3. Prune old leaves and remove overripe fruit to encourage n