In [2]:
# Just some tricks to make it look like we are running commands from the root directory

import os
import sys

# Add the project root (2 levels up from the current directory) to the system path
current_dir = os.getcwd()
parent_dir = os.path.abspath(os.path.join(current_dir, "../.."))
sys.path.append(parent_dir)

# Change the working directory to the project root (assuming the root is 2 levels up)
os.chdir(os.path.abspath(os.path.join(os.getcwd(), "../..")))
print("Working directory set to:", os.getcwd())

Working directory set to: /home/mashalimay/webarena/modular_agent


In [11]:
# Enable the autoreload extension
%load_ext autoreload
%autoreload 2

# Other Imports
from PIL import Image

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## 1) Overview

The idea of the `llms` package is to make it very simple to send inference requests to LLMs. 
In summary: it allows inference calls to any of the supported providers and engines with something as simple as `messages=list[images, text, video, function_call, ...]` and `generation_configs = {k:v, k:v,..}`. 


And behind the scenes, the package handles:
- Formatting of prompts for providers and models.
- Setting up clients and, for HuggingFace, inference engines
- Routing the inference call to the appropriate provider and engine
- Load balancing API keys
- Retries with exponential backoff and customized error logic
- Logging of inputs and outputs, including: HTML visualization of the prompt for debugging; token counts; conversation logs
- Return of multimodal outputs in a unified format independent of the provider, model, or provider mode.
- Type validation and handling of multiple media types
- Etc, etc
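
To make the "retry with exponential backoff" item concrete, here is a minimal, generic sketch of the technique. It is not the package's implementation: the function name `retry_with_backoff` and all of its parameters are assumptions made for illustration.

```python
import random
import time


def retry_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=30.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call `fn()` and retry on failure, doubling the wait after each attempt.

    Hypothetical helper for illustration only; not part of `llms`.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            # Wait 1x, 2x, 4x, ... the base delay, capped at `max_delay`
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% random jitter so concurrent callers do not retry in lockstep
            sleep(delay + random.uniform(0, 0.1 * delay))
```

A "customized error logic" would typically narrow `retry_on` to the errors worth retrying (for example rate limit errors) and let everything else fail immediately.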

## 2) Setting API keys

Two alternatives:

#### 1) Set the API keys in the environment variables.
```bash
export OPENAI_API_KEY="<key>"
export GOOGLE_API_KEY="<key>"
export HF_TOKEN="<key>"
```
With this option, **no load balancing** of the keys is performed when multiple calls are made.

#### 2) (Recommended) Create an `api_keys.json` file as follows:

api_keys.json
```json
{
    "google": ["key1", "key2", "..."],
    "openai": ["key1", "key2", "..."],
    "huggingface": ["key1", "key2", "..."]
}
```

With this option, the keys are load balanced and, in case of quota limit errors, other keys are tried.
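
As a rough illustration of what rotating over several keys can look like, the sketch below reads a file with the layout shown above and cycles through one provider's keys in round-robin order. This is a simplified, single-process assumption; the helper name `make_key_cycler` is hypothetical and the package's actual load balancing may work differently.

```python
import itertools
import json


def make_key_cycler(api_keys_path, provider):
    """Return an iterator that yields the provider's keys in round-robin order.

    Hypothetical helper for illustration only; not part of `llms`.
    """
    with open(api_keys_path, "r") as f:
        keys = json.load(f)[provider]
    if not keys:
        raise ValueError(f"No API keys listed for provider '{provider}'")
    return itertools.cycle(keys)
```

Each call to `next(cycler)` returns the next key, wrapping around to the first one after the last.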

**NOTE**: Also keep an `api_keys_repo.json` file as a backup. Why: concurrent processes fetch keys from `api_keys.json` and remove them for load balancing purposes; if something unexpected happens, the keys may not be returned to the file.

(The code has fallbacks to return the keys if the processes are killed, but many things can go wrong.)
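
If `api_keys.json` ends up missing keys, one way to recover is to copy the backup over it. The helper below is a minimal sketch under the assumption that both files live in the working directory; `restore_api_keys` is a hypothetical name, not a function of the package.

```python
import json
import shutil


def restore_api_keys(backup_path="api_keys_repo.json", target_path="api_keys.json"):
    """Overwrite the working key file with the backup copy and return its contents.

    Hypothetical helper for illustration only; not part of `llms`.
    """
    with open(backup_path, "r") as f:
        keys = json.load(f)  # fail early if the backup is not valid JSON
    shutil.copyfile(backup_path, target_path)
    return keys
```

Run it only when no process is currently using the key file, otherwise keys that are checked out may be duplicated when they are returned.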

## 3) Call LLM

`call_llm` is the main function of interest. It receives the "messages" to send to the model and a "generation config" that specifies (i) the model's behavior and (ii) the inference providers/engines.

Below is a series of examples of how to use it.

In [2]:
from llms.llm_utils import call_llm, get_gen_config_fields



The command below lists all the possible parameters for controlling model behavior and setting up providers/engines. 

Obs.: The output is not pretty yet and some parameters are provider/engine specific. Please check `llms.generation_config.py` for more details on each one. The examples that follow illustrate the main ones.



In [None]:
get_gen_config_fields()

### Call OpenAI and Google Models

Below is one example of how the inputs can be provided for an inference call. 

Many variations are possible. This file will include more in the future.

In [4]:
inputs=[
        {"role": "system", "text": "You are an intelligent and helpful assistant."}, # dict with a role and text
        "Describe **all** the below items.", # raw string
        ["Item (1):", "llms/examples/cat.png"], # A list with a prefix text and a path to an image file (both are sent in the same message)
        ["Item (2):", Image.open("llms/examples/dog.png")], # A list with a prefix text and a PIL image (both are sent in the same message)
        ["Item (3):", "Once upon a time, there was a princess who lived in a castle."], # A list with only text inputs
        "Provide your response as follows: <Title for Item 1> <Description for Item 1> <Title for Item 2> <Description for Item 2> <Title for Item 3> <Description for Item 3>"
  ]

# **NOTE:** Each of the `inputs` entries is a `message`. If that sounds confusing or ambiguous, check the file `llms.types.py` or continue reading.
# In short: LLMs receive a series of `message` objects, where
# a `message` object contains multiple raw inputs (such as images, text and video).
# The full prompt to the LLM is a list of those messages.

We now define a minimum set of generation configs and call Gemini:

In [5]:
gen_args = {
    "model": "gemini-2.0-flash-001",
    "temperature": 0.5,
    "max_tokens": 1000,
    "top_p": 0.95,
    "top_k": 40,
    "num_generations": 1,
}
conversation_dir = "llms/examples/conversation"
usage_dir = "llms/examples/usage"
response, model_generations = call_llm(gen_args, inputs, conversation_dir=conversation_dir, usage_dir=usage_dir)    

CALLING MODEL: `gemini-2.0-flash-001`: generating 1 outputs...


Created new executor for llms.providers.google.google_utils.sync_api_call


After this call, we have
- A list of `response` objects; these are dictionaries with data about the API request
- A list of `model_generations`; these are `Message` objects containing the model's raw outputs (text, images, etc).
- `html` and `txt` logs of the conversation round in the `llms/examples/conversation` directory
- `csv` files with token usage information in the `llms/examples/usage` directory
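
To quickly check which log files a call produced, a small directory listing is enough. The sketch below makes no assumption about file names or CSV columns; it only filters by the extensions mentioned above. `list_logs` is a hypothetical helper, not part of the package.

```python
from pathlib import Path


def list_logs(directory, suffixes=(".html", ".txt", ".csv")):
    """Return the names of log files in `directory`, sorted alphabetically.

    Hypothetical helper for illustration only; not part of `llms`.
    """
    root = Path(directory)
    if not root.is_dir():
        return []  # nothing has been logged to this directory yet
    return sorted(
        p.name for p in root.iterdir() if p.is_file() and p.suffix in suffixes
    )
```

For example, `list_logs("llms/examples/conversation")` and `list_logs("llms/examples/usage")` would list the files written by the call above.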

In [6]:
# Just to see what the returned objects look like
print(model_generations)
print(response)

[Message(role='assistant', contents=[ContentItem(type='text', data='**Item 1: Tabby Cat**\nThe image shows a tabby cat sitting upright. It has brown and black stripes, green eyes, and a pink nose. The cat is sitting on a ledge or wall, and there are bare branches in the background.\n\n**Item 2: Golden Retriever Puppy**\nThe image shows a golden retriever puppy sitting in a grassy field. The puppy has a light golden coat and an open mouth, as if panting or smiling. There are orange flowers scattered in the background.\n\n**Item 3: Story Starter**\nThe text is the beginning of a fairy tale: "Once upon a time, there was a princess who lived in a castle." This is a classic opening line for a children\'s story, setting the scene for a narrative about royalty and potentially adventure.\n', meta_data={}, id=None)], name='', meta_data={})]
[{'candidates': [{'content': {'parts': [{'video_metadata': None, 'thought': None, 'code_execution_result': None, 'executable_code': None, 'file_data': None,

The `Message` object is a unified format for both **inputs** and **outputs** of a user-model conversation. More details are in `llms.types`, but in summary:
- A single message contains: (i) a `role` that identifies the entity sending the information; (ii) data ('text', 'images', etc) sent by the entity
- A conversation is a list of `Message` items.

**NOTE**: We didn't specify the role for many of the entries in `inputs` above. In this case, the role is assumed to be `user`. There are a couple of ways to change this behavior, which we'll see below.
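
The default-role rule can be pictured with a few lines of plain Python. This is only a conceptual sketch, not the package's code: the helper `with_default_role` is hypothetical and it only inspects dict entries, whereas the real inputs can also be `Message` objects that carry their own role.

```python
def with_default_role(entries, default_role="user"):
    """Pair each entry with a role, using `default_role` when none is given.

    Hypothetical helper for illustration only; not part of `llms`.
    """
    pairs = []
    for entry in entries:
        if isinstance(entry, dict) and "role" in entry:
            pairs.append((entry["role"], entry))  # explicit role wins
        else:
            pairs.append((default_role, entry))  # raw strings, lists, etc.
    return pairs


example = [
    {"role": "system", "text": "You are an intelligent and helpful assistant."},
    "Describe **all** the below items.",
]
print([role for role, _ in with_default_role(example)])
```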

Below are some methods to access the raw data of a `Message`:

In [7]:
# Get all text content within a message
model_generations[0].text()

'**Item 1: Tabby Cat**\nThe image shows a tabby cat sitting upright. It has brown and black stripes, green eyes, and a pink nose. The cat is sitting on a ledge or wall, and there are bare branches in the background.\n\n**Item 2: Golden Retriever Puppy**\nThe image shows a golden retriever puppy sitting in a grassy field. The puppy has a light golden coat and an open mouth, as if panting or smiling. There are orange flowers scattered in the background.\n\n**Item 3: Story Starter**\nThe text is the beginning of a fairy tale: "Once upon a time, there was a princess who lived in a castle." This is a classic opening line for a children\'s story, setting the scene for a narrative about royalty and potentially adventure.\n'

In [8]:
# Get all images within a message (there are none in this case because this model outputs only text)
model_generations[0].images()

[]

In [9]:
# A list with interleaved text, image, video, etc.
model_generations[0].raw_data()

['**Item 1: Tabby Cat**\nThe image shows a tabby cat sitting upright. It has brown and black stripes, green eyes, and a pink nose. The cat is sitting on a ledge or wall, and there are bare branches in the background.\n\n**Item 2: Golden Retriever Puppy**\nThe image shows a golden retriever puppy sitting in a grassy field. The puppy has a light golden coat and an open mouth, as if panting or smiling. There are orange flowers scattered in the background.\n\n**Item 3: Story Starter**\nThe text is the beginning of a fairy tale: "Once upon a time, there was a princess who lived in a castle." This is a classic opening line for a children\'s story, setting the scene for a narrative about royalty and potentially adventure.\n']

In [10]:
# A dict with format similar to OpenAI chat completion format
model_generations[0].to_dict()


{'contents': [{'type': 'text',
   'data': '**Item 1: Tabby Cat**\nThe image shows a tabby cat sitting upright. It has brown and black stripes, green eyes, and a pink nose. The cat is sitting on a ledge or wall, and there are bare branches in the background.\n\n**Item 2: Golden Retriever Puppy**\nThe image shows a golden retriever puppy sitting in a grassy field. The puppy has a light golden coat and an open mouth, as if panting or smiling. There are orange flowers scattered in the background.\n\n**Item 3: Story Starter**\nThe text is the beginning of a fairy tale: "Once upon a time, there was a princess who lived in a castle." This is a classic opening line for a children\'s story, setting the scene for a narrative about royalty and potentially adventure.\n',
   'meta_data': {},
   'id': None}],
 'role': 'assistant',
 'name': '',
 'meta_data': {}}
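
Because `to_dict()` returns plain Python data, it can be processed without the package. The sketch below relies only on the structure shown in the output above (`contents` items with `type` and `data` keys); the helper name `text_from_message_dict` is an assumption made for illustration.

```python
def text_from_message_dict(message_dict):
    """Join the `data` of every `text` item in a `Message.to_dict()`-style dict.

    Hypothetical helper for illustration only; not part of `llms`.
    """
    return "".join(
        item["data"]
        for item in message_dict.get("contents", [])
        if item.get("type") == "text"
    )


# A shortened dict with the same layout as the output above
sample = {
    "contents": [{"type": "text", "data": "**Item 1: Tabby Cat**\n", "meta_data": {}, "id": None}],
    "role": "assistant",
    "name": "",
    "meta_data": {},
}
print(text_from_message_dict(sample))
```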

Now suppose we want to send another query with the previous inputs + the model response + a new request.

To make things more interesting, let's send this to **GPT-4o** now.

Below we construct this new input from the previous list of `inputs`, showing some new ways of providing inputs.

In [11]:
# Construct the new inputs
new_inputs = inputs + [
    model_generations[0], # The Message object can be sent directly as input too; notice it contains the ROLE of the entity!
    {"role": "user", "text": "Please give an opinion of the above conversation. How do you evaluate the assistant's performance?"}
]

You can visualize the prompt before sending for a sanity check by using the `visualize_prompt` tool.

In [12]:
from llms.llm_utils import visualize_prompt
output_path = "llms/examples/vis.html"
visualize_prompt(new_inputs, output_path)

This command saves an `.html` file with the messages as they will be received by the model. 

Open it in a browser to visualize it and to sanity check that the order of messages, roles, entity names, etc. is correct. 

Run the cell below to check what it looks like.

In [None]:
from IPython.display import display, HTML

# Read the HTML content from the file
with open("llms/examples/vis.html", "r") as file:
    html_content = file.read()

# Display the content inline in the notebook
display(HTML(html_content))

After making sure the prompt is correct, we can send it to GPT-4o. 
- For that, we only need to change the `model` parameter in the previous generation arguments.
- There is no need to adjust parameter names or values to conform to the new provider. 
- The same goes for the prompt formats!
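
Since only the `model` entry differs between providers, an alternative to editing `gen_args` in place is to derive one config per model from a shared base. The snippet below is a plain-Python sketch; the variable names `base_args`, `models` and `configs` are illustrative, and each resulting dict could then be passed to `call_llm` in the same way as `gen_args`.

```python
# Base generation config (values copied from the Gemini example above).
base_args = {
    "model": "gemini-2.0-flash-001",
    "temperature": 0.5,
    "max_tokens": 1000,
    "top_p": 0.95,
    "top_k": 40,
    "num_generations": 1,
}

# Illustrative list of models to compare.
models = ["gemini-2.0-flash-001", "gpt-4o-2024-08-06"]

# One independent config per model; `base_args` itself is left untouched.
configs = [{**base_args, "model": model} for model in models]

for config in configs:
    print(config["model"], config["temperature"])
```

Copying the dict keeps the original config available for later calls, which is convenient when comparing several models on the same prompt.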


In [None]:
# We only change the model name in the generation config.
gen_args["model"] = "gpt-4o-2024-08-06"

# You can add a `call_id` to save the conversation and usage logs with a specific name.
response, model_generations = call_llm(gen_args, new_inputs, conversation_dir=conversation_dir, usage_dir=usage_dir, call_id="gpt4o_call")

In [17]:
model_generations[0].text()

"The assistant's performance was accurate and concise. It provided clear descriptions of the images and text, adhering to the requested format. The descriptions for the cat and puppy were detailed, capturing key features and setting. The story starter was correctly identified as a classic fairy tale opening. Overall, the response was well-structured and informative."

**TODO**: add more examples:
- OpenAI's `response` API

### Call HuggingFace

The process for calling HuggingFace models is the same, except that:

- (i) We need to specify some more arguments, such as the "engine" used to deploy the model.
    - Supported engines are: `automodel`, `server`, and `vllm`. Details below.
- (ii) There is a higher likelihood of bugs; many models on HuggingFace have model-specific quirks and it is impossible to foresee all of them.
    - The code makes a best effort to process the inputs and generate the outputs. But, for instance, `Qwen-2.5-VL` was not supported by the `Automodel` class, so there is specific handling of model loading and generation that is hard to automate. 
    - Moreover, some models have specific prompts that are not always covered by `apply_chat_template`. 
    - etc
- We can also specify other args such as which resources to use (e.g.: CPU, GPU, etc.), whether to quantize or not, etc. See `llms.generation_config.py` for all HF-specific args.

The examples below make an inference call to `Qwen-2.5-VL-3B` using the three engines. 

Below are the same `inputs` as above, but with other ways of sending each input.

In [3]:
inputs=[
        {"role": "system", "text": "You are an intelligent and helpful assistant."}, 
        "Describe **all** the below items.",
        {"role": "user", "text": "Item (1):", "image": "llms/examples/cat.png"}, # Another way to send an input
        {"role": "user", "contents":[{"type": "text", "text": "Item (2):"}, {"type": "image", "image": "llms/examples/dog.png"}]}, #OpenAI chat completion format
        "Item (3): Once upon a time, there was a princess who lived in a castle.",
        "Provide your response as follows: <Title for Item 1> <Description for Item 1> <Title for Item 2> <Description for Item 2> <Title for Item 3> <Description for Item 3>"
  ]


In [None]:
# Run this cell to visualize the prompt

visualize_prompt(inputs, "llms/examples/vis_hf.html")
# Read the HTML content from the file written above
with open("llms/examples/vis_hf.html", "r") as file:
    html_content = file.read()

# Display the content inline in the notebook
display(HTML(html_content))

#### Hugging Face - Automodel Engine


This mode is the same as vanilla Hugging Face usage; the model is available only to the current process.

In [None]:
gen_args = {
    "model": "Qwen/Qwen2.5-VL-3B-Instruct",
    "engine": "automodel",
    "num_generations": 1,
    "temperature": 0.5,
    "max_tokens": 1000,
    "top_p": 0.95,
    "top_k": 40,
    "repetition_penalty": 1.05,
}
conversation_dir = "llms/examples/conversation"
usage_dir = "llms/examples/usage"
responses, model_generations = call_llm(gen_args, inputs, conversation_dir=conversation_dir, usage_dir=usage_dir)

[/home/mashalimay/webarena/modular_agent/llms/providers/hugging_face/hf_utils.py] CALLING MODEL: `Qwen/Qwen2.5-VL-3B-Instruct` with engine `automodel`: generating 1 output(s)...


NOTES:
- By default: 
    - `device_map=auto`. Set `device:<device>` to override. #TODO: allow dict with `device_map`;
    - `flash_attn` is used if it is available. To disable it, set `flash_attn:False`.
    - `dtype` is set based on the model information; if not found, it is set to `auto`. Set `dtype` to override.
- Behind the scenes, the prompts are converted to the OpenAI chat completions format that HF uses. Check them via `responses[idx]["prompt"]`.

In [14]:
model_generations[0].text()

"<Title for Item 1>: Tabby Cat\n\n<Description for Item 1>: The image shows a tabby cat with a striped coat sitting on a ledge. The cat has a mix of dark and light fur patterns, with green eyes. It appears to be looking directly at the camera with a calm and alert expression.\n\n<Title for Item 2>: Golden Retriever\n\n<Description for Item 2>: The image features a golden retriever dog standing on a grassy field. The dog is looking up, possibly at something interesting in the sky or in the distance. The background is filled with orange flowers, creating a vibrant and colorful scene.\n\n<Title for Item 3>: Princess in a Castle\n\n<Description for Item 3>: This item is a story setting that describes a princess living in a castle. The princess is the central character in this narrative, and her life within the castle is the focus of the story. The castle provides a backdrop of grandeur and mystery, with its tall walls and intricate architecture. The princess's daily life, adventures, and r

#### Hugging Face - Local Server Engine

The `server` engine makes the model available at an `endpoint`, so multiple processes can send inference requests without using multiple GPUs.

There are two ways to deploy in this mode:

**Option 1: (Recommended) Host the model first, then send inference calls with `call_llm`**


1. Run:

 ```bash
 python -m llms.providers.hugging_face.host_model_hf "Qwen/Qwen2.5-VL-3B-Instruct" --host <host> --port <port>
 ```

2. Add `engine:server` and `endpoint:<host>:<port>` to `gen_args`

**NOTE**: If hosting on `machineA` and accessing the model from `machineB`: execute step 1 on `machineA`; to `call_llm` from `machineB`, set `host` to the IP of `machineA`.


**Option 2: Directly call `call_llm` with `engine:server` and `localhost:<port>` in `gen_args`.**
- This will automatically host the model if possible, using the same script `llms.providers.hugging_face.host_model_hf`.
- It is less recommended because:
    - The process hosting the model will die if the first process that calls `call_llm` ends.
    - For new models, weights will be downloaded; the code waits for the server to start, but this can take a while and you may get spurious errors saying the server was unable to start.
    - All kinds of problems can arise if concurrent processes need to wait for the same server to start.
- Use this mostly for prototyping with a single process. Do not use it for concurrent execution.
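
When the model is hosted separately, it can help to confirm that the port is reachable before sending requests. The helper below is a generic standard-library sketch, not something provided by the package; `wait_for_server` and its parameters are assumptions. Note that a successful TCP connection only shows that something is listening, not that the model has finished loading.

```python
import socket
import time


def wait_for_server(host, port, timeout=60.0, poll_interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper for illustration only; not part of `llms`.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Open and immediately close a connection to probe the port
            with socket.create_connection((host, port), timeout=poll_interval):
                return True
        except OSError:
            time.sleep(poll_interval)  # not up yet: wait and try again
    return False
```

For example, `wait_for_server("localhost", 8000)` could be run before the first `call_llm` that targets a server started on port 8000.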


`call_llm` example:

In [None]:
# Suppose we ran:
# python -m llms.providers.hugging_face.host_model_hf "Qwen/Qwen2.5-VL-3B-Instruct" --host localhost --port 8000

# Then we can send inference to this server by adding these args in `gen_args`:
gen_args = {
    "model": "Qwen/Qwen2.5-VL-3B-Instruct",
    "num_generations": 1,
    "temperature": 0.5,
    "max_tokens": 1000,
    "top_p": 0.95,
    "top_k": 40,
    "repetition_penalty": 1.05,
    "engine": "server",  # <--------- CHANGED `automodel` to `server`
    "endpoint": "localhost:8000"  # <--------- ADDED
}

# No need for any change in the inputs.

response, model_generations = call_llm(gen_args, inputs, conversation_dir=conversation_dir, usage_dir=usage_dir)

#### Hugging Face - VLLM Engine

The `vllm` engine makes the model available to receive requests at an `endpoint`, so multiple processes can send inference requests without using multiple GPUs.

NOTES:
- The idea is the same as `server`, but in this case the server is handled by `vllm`.
- `vllm` has non-trivial optimizations for handling concurrent calls. It may be a better option when demand on the server is high.
- Issue: `vllm` tends to consume a lot of GPU memory to achieve its optimizations. 
    - You may run out of memory even for models that can typically be loaded with vanilla automodel.
    - In these cases, try to increase `--gpu-mem` (between 0 and 1), do not pass `--enforce-eager` (set to false), and reduce `--max-model-len`.


There are two ways to deploy in this mode:



**Option 1: Host the model first, then send inference calls with `call_llm`**

1. Run 

```bash
python -m llms.providers.hugging_face.host_model_vllm <model_id> --host <host> --port <port> --num-gpus <num_gpus> --max-model-len <max_model_len>
# (check all params using -h)
```

2. Add `engine:vllm` and `endpoint:<host>:<port>` to `gen_args`

**NOTE**: If hosting on `machineA` and accessing the model from `machineB`: execute step 1 on `machineA`; to `call_llm` from `machineB`, set `host` to the IP of `machineA`.

**Option 2: Directly call `call_llm` with `engine:vllm` and `<host>:<port>` in `gen_args`.**
- All the warnings from the `server` case apply here too.

`call_llm` example:

In [None]:
# Suppose we ran:
# python -m llms.providers.hugging_face.host_model_vllm "Qwen/Qwen2.5-VL-3B-Instruct" --host localhost --port 8000

# Then we can send inference to this server by adding these args in `gen_args`:
gen_args = {
    "model": "Qwen/Qwen2.5-VL-3B-Instruct",
    "num_generations": 1,
    "temperature": 0.5,
    "max_tokens": 1000,
    "top_p": 0.95,
    "top_k": 40,
    "repetition_penalty": 1.05,
    "engine": "vllm",  # <--------- CHANGED `automodel` to `vllm`
    "endpoint": "localhost:8000"  # <--------- ADDED
}

# No need for any change in the inputs.

response, model_generations = call_llm(gen_args, inputs, conversation_dir=conversation_dir, usage_dir=usage_dir)

## 4) Prompting and get_messages

In [4]:
from llms.prompt_utils import get_messages, get_message

The functions `get_messages` and `get_message` give more fine-grained control over how prompts are sent. 
- Obs.: Everything can also be done via the flexible list of inputs explained in (3).

The function `get_message` creates a single `Message` object given:
- `inputs`: a list of raw data in flexible format (the same way it is given to `call_llm`, as explained in (3))
- `role`: the role of the entity responsible for the message
- `name`: the name of the entity responsible for the message
- `img_detail`: for providers that support it, defines the level of detail applied to the image

`get_messages` does the same thing, but gives you a list of `Message` objects instead. It also allows you to:
- give the `sys_prompt` via an argument
- concatenate consecutive texts into one `Message` by setting `concatenate_text=True`
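
To picture what concatenating consecutive texts means, the sketch below merges runs of adjacent strings in a flat list and leaves everything else untouched. It is a conceptual illustration only: `merge_consecutive_text` is a hypothetical helper, and the separator used here is an assumption, not necessarily what `get_messages` uses.

```python
def merge_consecutive_text(items, sep="\n"):
    """Merge runs of adjacent strings into one string; leave other items as-is.

    Hypothetical helper for illustration only; not part of `llms`.
    The separator `sep` is an assumption.
    """
    merged = []
    for item in items:
        if isinstance(item, str) and merged and isinstance(merged[-1], str):
            merged[-1] = merged[-1] + sep + item  # extend the current text run
        else:
            merged.append(item)  # first text of a run, or a non-text item
    return merged


print(merge_consecutive_text(["Item (3): ...", "Please describe the new items."]))
```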

Consider the same `inputs` as before. We can create a list of Message objects from it as below. This is exactly what `call_llm` does behind the scenes.

In [7]:
inputs=[
        {"role": "system", "text": "You are an intelligent and helpful assistant."}, 
        "Describe **all** the below items.",
        {"role": "user", "text": "Item (1):", "image": "llms/examples/cat.png"}, # Another way to send an input
        {"role": "user", "contents":[{"type": "text", "text": "Item (2):"}, {"type": "image", "image": "llms/examples/dog.png"}]}, #OpenAI chat completion format
        "Item (3): Once upon a time, there was a princess who lived in a castle."
        "Provide your response as follows: <Title for Item 1> <Description for Item 1> <Title for Item 2> <Description for Item 2> <Title for Item 3> <Description for Item 3>"
  ]

# Create a list of Message objects from the inputs
messages = get_messages(inputs)

messages

[Message(role='system', contents=[ContentItem(type='text', data='You are an intelligent and helpful assistant.', meta_data={}, id=None)], name='', meta_data={}),
 Message(role='user', contents=[ContentItem(type='text', data='Describe **all** the below items.', meta_data={}, id=None)], name='', meta_data={}),
 Message(role='user', contents=[ContentItem(type='text', data='Item (1):', meta_data={}, id=None), ContentItem(type='image', data=<PIL.PngImagePlugin.PngImageFile image mode=RGBA size=439x582 at 0x7FF2F964FD10>, meta_data={}, id=None)], name='', meta_data={}),
 Message(role='user', contents=[ContentItem(type='text', data='Item (2):', meta_data={}, id=None), ContentItem(type='image', data=<PIL.PngImagePlugin.PngImageFile image mode=RGB size=234x148 at 0x7FF2F964FFD0>, meta_data={}, id=None)], name='', meta_data={}),
 Message(role='user', contents=[ContentItem(type='text', data='Item (3): Once upon a time, there was a princess who lived in a castle.Provide your response as follows: <

Examples:

In [12]:
# Create a message object with higher image detail; note we can also give a `name` to the user (some providers support it)
msg_ex_user = get_message(["Item (1):", "llms/examples/cat.png"], role="user", name="example_user", img_detail="high")

# Create an ASSISTANT message; note we can also give a `name` to the assistant (some providers support it)
msg_ex_assistant = get_message(["This is a cat"], role="assistant", name="example_assistant")

# Create a SYSTEM message
msg_system = get_message("You are an intelligent and helpful assistant.", role="system")


# get a full prompt to send to the model
get_messages(
    [
        msg_system,
        msg_ex_user,
        msg_ex_assistant,
        {"role": "user", "contents":[{"type": "text", "text": "Item (2):"}, {"type": "image", "image": "llms/examples/dog.png"}]}, #OpenAI chat completion format
        "Item (3): Once upon a time, there was a princess who lived in a castle."
        "Please describe the new items in the conversation."
    ],
    concatenate_text=True, # Concatenate consecutive texts into one `Message`. Note the last two are all in the same message.
    role="user", # This role is applied to all messages without a role. (e.g.: last two)
    name="user", # This name is applied to all messages without a name. (e.g.: last two)
)


[Message(role='system', contents=[ContentItem(type='text', data='You are an intelligent and helpful assistant.', meta_data={}, id=None)], name='', meta_data={}),
 Message(role='user', contents=[ContentItem(type='text', data='Item (1):', meta_data={}, id=None), ContentItem(type='image', data=<PIL.PngImagePlugin.PngImageFile image mode=RGBA size=439x582 at 0x7FF2E9E7C810>, meta_data={'img_detail': 'high'}, id=None)], name='example_user', meta_data={}),
 Message(role='assistant', contents=[ContentItem(type='text', data='This is a cat', meta_data={}, id=None)], name='example_assistant', meta_data={}),
 Message(role='user', contents=[ContentItem(type='text', data='Item (2):', meta_data={}, id=None), ContentItem(type='image', data=<PIL.PngImagePlugin.PngImageFile image mode=RGB size=234x148 at 0x7FF2E9E7D890>, meta_data={}, id=None)], name='', meta_data={}),
 Message(role='user', contents=[ContentItem(type='text', data='Item (3): Once upon a time, there was a princess who lived in a castle.P