# Agent Chat with Multi-Modality Models

Here, we use LLaVA as an example.


## LLaVA Setup
Please follow the LLaVA GitHub [page](https://github.com/haotian-liu/LLaVA/) to install LLaVA, download the weights, and start the server.

For instance, here are some important steps:
```bash
# Download package
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA

# Install inference package
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .

# Download and serve the model (needs a running controller; see Launch below)
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-7b
```

Some helpful packages and dependencies:
```bash
conda install -c nvidia cuda-toolkit
```


### Launch

In one terminal, start the controller:
```bash
python -m llava.serve.controller --host 0.0.0.0 --port 10000
```


In another terminal, start the worker, which loads the model onto the GPU:
```bash
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path liuhaotian/llava-v1.5-13b
```
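
Before moving on, it can help to confirm that both processes are listening. The following is a minimal sketch using only the standard library; the helper name `is_port_open` is hypothetical, and the ports are the ones used in the commands above. A successful TCP connection only shows that something is listening on the port, not that the model has finished loading.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Ports taken from the controller and worker commands above.
    for name, port in [("controller", 10000), ("worker", 40000)]:
        status = "up" if is_port_open("localhost", port) else "down"
        print(f"{name} (port {port}): {status}")
```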

**Note: make sure the environment running this notebook also has the `llava` package installed via `pip install -e .`.**

In [1]:
import requests
import json
from llava.conversation import default_conversation as conv
from llava.conversation import Conversation

[2023-10-13 17:31:15,781] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)


In [2]:
# Set up some global constants for convenience.
# The servers bind to 0.0.0.0; clients connect through localhost.
WORKER_ADDR = "http://localhost:40000"
CONTROLLER_ADDR = "http://localhost:10000"
SEP = conv.sep
ret = requests.post(CONTROLLER_ADDR + "/list_models")
MODEL_NAME = ret.json()["models"][0]
print("Model Name:", MODEL_NAME)

Model Name: llava-v1.5-13b


In [3]:
import base64
from io import BytesIO

from PIL import Image

def get_image_data(image_file):
    """Return the image as a base64-encoded string (without a data-URI prefix)."""
    if image_file.startswith('http://') or image_file.startswith('https://'):
        response = requests.get(image_file)
        response.raise_for_status()  # fail early instead of encoding an error page
        content = response.content
    elif image_file.startswith("data:image/png;base64,"):
        return image_file.replace("data:image/png;base64,", "")
    else:
        # Open with PIL first to check that the file is a readable image.
        Image.open(image_file).convert('RGB')
        with open(image_file, "rb") as f:
            content = f.read()
    return base64.b64encode(content).decode('utf-8')

def _to_pil(data):
    return Image.open(BytesIO(base64.b64decode(data)))


def llava_call(prompt: str, model_name: str = MODEL_NAME, images: list = None, max_new_tokens: int = 1000) -> str:
    images = images or []  # avoid a mutable default argument
    assert prompt.count("<image>") == len(images), (
        "the number of image tokens in the prompt and in the images list should be the same!"
    )

    images = [get_image_data(x) for x in images]

    for im in images:
        if len(im) == 0:
            raise RuntimeError("An image is empty!")
            
    headers = {"User-Agent": "LLaVA Client"}
    pload = {
        "model": model_name,
        "prompt": prompt,
        "max_new_tokens": max_new_tokens,
        "temperature": 0.5,
        "stop": SEP,
        "images": images,
    }

    response = requests.post(WORKER_ADDR + "/worker_generate_stream", headers=headers,
            json=pload, stream=False)

    # The response is a series of null-delimited JSON chunks; keep the text from the last one.
    output = ""
    for chunk in response.iter_lines(chunk_size=8192, decode_unicode=False, delimiter=b"\0"):
        if chunk:
            data = json.loads(chunk.decode("utf-8"))
            output = data["text"].split(SEP)[-1]

    # Remove the prompt and surrounding whitespace.
    output = output.replace(prompt, "").strip()

    return output
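
To make the response handling in `llava_call` easier to follow, here is an offline sketch of the same parsing logic applied to a hand-written byte string instead of a live HTTP response. The payload and the `"###"` separator are made up for illustration (the real separator comes from `conv.sep`), and `last_text_from_stream` is a hypothetical helper, not part of LLaVA.

```python
import json


def last_text_from_stream(raw: bytes, sep: str, prompt: str) -> str:
    """Mimic llava_call's parsing: keep the last null-delimited JSON chunk."""
    output = ""
    for chunk in raw.split(b"\0"):
        if chunk:  # skip the empty piece after the trailing delimiter
            data = json.loads(chunk.decode("utf-8"))
            output = data["text"].split(sep)[-1]
    # Drop the echoed prompt and surrounding whitespace.
    return output.replace(prompt, "").strip()


# Illustrative payload: two chunks, each a JSON object followed by a null byte.
prompt = "Describe this image: <image>"
raw = (
    json.dumps({"text": prompt + " A small"}).encode() + b"\0"
    + json.dumps({"text": prompt + " A small toy."}).encode() + b"\0"
)
print(last_text_from_stream(raw, sep="###", prompt=prompt))
```

Only the final chunk matters here because each iteration overwrites `output`, mirroring the loop in `llava_call`.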

Here is the image that we are going to use.

![Image](https://github.com/haotian-liu/LLaVA/raw/main/images/llava_logo.png)

In [4]:
out = llava_call("Describe this image: <image>", 
                 images=["https://github.com/haotian-liu/LLaVA/raw/main/images/llava_logo.png"])
print(out)

In this image, a small toy is on display. The toy is an orange, flame-covered animal, resembling a lizard or a small horse. It is sitting on a table, and its fur is set on fire. The toy has a pair of red glasses on its face, which adds to its unique appearance. The overall scene is quite captivating and fun, with the toy's flames giving it a distinctive and eye-catching look.


## AutoGen Integration: Garden Helper


Here we demonstrate a very simple multi-agent collaboration for diagnosing a garden problem from a photo.

The user uploads an image of their garden; the image agent (with a LLaVA backend) reads the image and describes the problem. Then the suggestion agent (an AssistantAgent with a GPT model) suggests how to treat the problem.
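
Stripped of the agent machinery, the flow described above is a two-step pipeline: an image goes in, a description comes out, and the description is turned into suggestions. The sketch below uses stand-in functions so it runs without LLaVA or GPT; `garden_helper`, `fake_describe`, and `fake_suggest` are hypothetical names for illustration only and return canned text.

```python
from typing import Callable


def garden_helper(
    image_url: str,
    describe_image: Callable[[str], str],
    suggest_treatment: Callable[[str], str],
) -> str:
    """Chain the two steps: describe the image, then suggest a treatment."""
    description = describe_image(image_url)
    return suggest_treatment(description)


# Stand-ins for the LLaVA-backed image agent and the GPT-backed suggestion agent.
def fake_describe(image_url: str) -> str:
    return f"(description of {image_url})"


def fake_suggest(description: str) -> str:
    return f"Suggestions based on: {description}"


print(garden_helper("garden.jpg", fake_describe, fake_suggest))
```

In the notebook itself, the group chat manager decides which agent speaks next, so the ordering is not hard-coded as it is in this sketch.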


Here, we found a problem in our garden and took a photo:
![](http://th.bing.com/th/id/R.105d684e5df7d540e61f6300d0bd374e?rik=PR8LCyvpe93DZA&pid=ImgRaw&r=0)

In [5]:
import logging

import autogen
from autogen import AssistantAgent, Agent

logger = logging.getLogger(__name__)  # used by ImageAgent for error reporting

config_list_gpt4 = autogen.config_list_from_json(
    "OAI_CONFIG_LIST",
    filter_dict={
        "model": ["gpt-4", "gpt-4-0314", "gpt4", "gpt-4-32k", "gpt-4-32k-0314", "gpt-4-32k-v0314"],
    },
)

llm_config = {"config_list": config_list_gpt4, "seed": 42}
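
As a rough illustration of what `filter_dict` does, the sketch below filters a hand-written list of configurations in plain Python: an entry is kept when, for every key in the filter, its value is one of the accepted values. This is an approximation for intuition, not AutoGen's actual implementation; `filter_configs` is a hypothetical helper and the API keys are placeholders.

```python
from typing import Any, Dict, List


def filter_configs(
    configs: List[Dict[str, Any]], filter_dict: Dict[str, List[Any]]
) -> List[Dict[str, Any]]:
    """Keep configs whose value for each filter key is in the accepted list."""
    return [
        cfg
        for cfg in configs
        if all(cfg.get(key) in accepted for key, accepted in filter_dict.items())
    ]


# Placeholder entries shaped like items in an OAI_CONFIG_LIST file.
configs = [
    {"model": "gpt-4", "api_key": "<your key>"},
    {"model": "gpt-3.5-turbo", "api_key": "<your key>"},
]
selected = filter_configs(configs, {"model": ["gpt-4", "gpt-4-32k"]})
print([cfg["model"] for cfg in selected])
```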

In [None]:
class ImageAgent(AssistantAgent):

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        
        self.register_reply([Agent, None], reply_func=ImageAgent._image_reply)
        
    def _image_reply(
        self,
        messages=None,
        sender=None, config=None
    ):
        # Note: we did not use "llm_config" yet.
        # TODO: make the LLaVA design compatible with llm_config
        if all((messages is None, sender is None)):
            error_msg = f"Either {messages=} or {sender=} must be provided."
            logger.error(error_msg)
            raise AssertionError(error_msg)

        if messages is None:
            messages = self._oai_messages[sender]

        image_name = messages[-1]["content"]
        prompt = "For the image: <image>\n\n" + self.system_message
        out = llava_call(prompt=prompt,
                         images=[image_name, ])

        print(out)
        assert out != "", "Empty response from LLaVA."
        
        
        return True, out


image_agent = ImageAgent(
    name="image-explainer",
    system_message="What is in the image?\nHighlight the problems with the plants!\nDescribe in as many details as possible."
)

user_proxy = autogen.UserProxyAgent(
    name="User_proxy",
    system_message="A human admin.",
    code_execution_config={
        "last_n_messages": 3,
        "work_dir": "groupchat"
    },
    human_input_mode="NEVER",
    llm_config=llm_config,
)
suggestion_giver = autogen.AssistantAgent(
    name="Suggestion-Giver",
    system_message="Give me treatment suggestions for my garden! You can find the description of my image from the image-explainer agent. Keep the answer concise and short.",
    llm_config=llm_config,
)

groupchat = autogen.GroupChat(agents=[user_proxy, image_agent, suggestion_giver],
                              messages=[],
                              max_round=3)
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)


# Ask the question with an image
user_proxy.initiate_chat(manager, 
                         message="http://th.bing.com/th/id/R.105d684e5df7d540e61f6300d0bd374e?rik=PR8LCyvpe93DZA&pid=ImgRaw&r=0")


User_proxy (to chat_manager):

http://th.bing.com/th/id/R.105d684e5df7d540e61f6300d0bd374e?rik=PR8LCyvpe93DZA&pid=ImgRaw&r=0

--------------------------------------------------------------------------------
