## Agent Framework: Multimodal AI Assistants

*[Coding along with the Udemy online course [LLM Engineering: Master AI & Large Language Models](https://www.udemy.com/course/llm-engineering-master-ai-and-large-language-models/) by Ed Donner; GitHub repo can be found at [github.com/ed-donner/llm_engineering](https://github.com/ed-donner/llm_engineering)]*

### The Agent Framework

'Agentic AI' (or 'agentization') is an umbrella term for a number of techniques, such as:

1. Breaking a complex problem into smaller steps, with multiple LLMs carrying out specialized tasks
2. Giving LLMs Tools that provide additional capabilities
3. Providing an 'Agent Environment' in which Agents can collaborate
4. Using an LLM as the Planner, which divides bigger tasks into smaller ones for the specialists
5. Giving an Agent autonomy / agency beyond just responding to a prompt, for example through Memory

(Source: [Build a Multimodal AI Agent](https://www.udemy.com/course/llm-engineering-master-ai-and-large-language-models/learn/lecture/45775535))
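
The planner and specialist idea from the list above can be illustrated with a small sketch. This is only a toy under stated assumptions: the names `plan`, `SPECIALISTS` and `run_agents` are hypothetical, and plain Python functions stand in for what would be LLM calls in a real agent system.

```python
# Hypothetical sketch: plain functions stand in for LLM calls.
from typing import Callable


def plan(task: str) -> list[tuple[str, str]]:
    """Stand-in for a planner LLM: split a task into (specialist, subtask) steps."""
    return [
        ("pricing", f"Find the ticket price for: {task}"),
        ("artist", f"Create an image for: {task}"),
    ]


# In a real system each specialist would be its own LLM or tool.
SPECIALISTS: dict[str, Callable[[str], str]] = {
    "pricing": lambda subtask: f"[pricing agent] handled '{subtask}'",
    "artist": lambda subtask: f"[artist agent] handled '{subtask}'",
}


def run_agents(task: str) -> list[str]:
    """Dispatch every planned step to the matching specialist, in order."""
    return [SPECIALISTS[name](subtask) for name, subtask in plan(task)]


results = run_agents("a trip to London")
for line in results:
    print(line)
```

The project below follows the same shape on a small scale: one chat model decides when a price lookup is needed, and separate image and audio models act as specialists.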

## Project: Creating a Multimodal AI Assistant Using Agents and Tools

In [1]:
from openai import OpenAI
import pandas as pd
# some imports for handling images
import base64
from io import BytesIO
from PIL import Image
import datetime
import gradio as gr
import json

In [2]:
openai_api_key = pd.read_csv("~/tmp/chat_gpt/agentic-design-1.txt", sep=" ", header=None)[0][0]

# connect to openai
openai = OpenAI(api_key=openai_api_key)
MODEL = "gpt-4o-mini"
print("Don't be a fool and send your API key to GitHub")

Don't be a fool and send your API key to GitHub


In [3]:
system_message = "You are a helpful assistant for an Airline called FlightAI. "
system_message += "Give short, courteous answers, no more than 1 sentence. "
system_message += "Always be accurate. If you don't know the answer, say so."

#### __Image Generation with DALL-E-3__

In [4]:
def artist(city):
    # call to images.generate() to generate images
    image_response = openai.images.generate(
            model="dall-e-3",
            prompt=f"An image representing a vacation in {city}, showing tourist spots and everything unique about {city}, in a vibrant pop-art style",
            size="1024x1024", # smallest size in dall-e-3
            n=1, # we want one image back
            response_format="b64_json",
        )
    image_base64 = image_response.data[0].b64_json # base64 encoded image
    image_data = base64.b64decode(image_base64) # decoding image data

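    # save a copy of the image to disk (the target directory must already exist)
    img = Image.open(BytesIO(image_data)).convert("RGB") # JPEG cannot store an alpha channel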
    image_name = "../../assets/dall-e-images/" + city + "_" + datetime.datetime.now().strftime("%Y%m%d%H%M%S") + ".jpg"
    img.save(image_name, "JPEG")
    
    return Image.open(BytesIO(image_data)) # return image with Image.open function
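
The file name in `artist` is built by string concatenation, so a city such as "New York" ends up with a space in the path. A minimal sketch of a safer alternative follows; `image_path` and its `folder` default are hypothetical names chosen for illustration, not part of the course code.

```python
import datetime
import re
from pathlib import Path
from typing import Optional


def image_path(city: str, folder: str = "dall-e-images",
               now: Optional[datetime.datetime] = None) -> Path:
    """Build '<folder>/<city>_<timestamp>.jpg' with a filesystem-friendly city name."""
    now = now or datetime.datetime.now()
    # Replace every run of characters that is not a letter, digit, '_' or '-'
    safe_city = re.sub(r"[^\w-]+", "_", city.strip())
    return Path(folder) / f"{safe_city}_{now.strftime('%Y%m%d%H%M%S')}.jpg"


example = image_path("New York", now=datetime.datetime(2025, 1, 2, 3, 4, 5))
print(example.name)
```

Before saving, `example.parent.mkdir(parents=True, exist_ok=True)` would create the folder if it is missing.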

#### __OpenAI Audio Generation__

In [5]:
from pydub import AudioSegment
from pydub.playback import play

def talker(message):
    response = openai.audio.speech.create(
      model="tts-1", # text-to-speech model
      voice="onyx", # use the "onyx" voice; "alloy" is an alternative
      input=message
    )
    
    audio_stream = BytesIO(response.content) # wrap the returned audio bytes in an in-memory stream
    audio = AudioSegment.from_file(audio_stream, format="mp3")
    play(audio)

#### __Handling the Tool Call__

In [6]:
ticket_prices = {"london": "$799", "paris": "$899", "tokyo": "$1400", "berlin": "$499"}

def get_ticket_price(destination_city):
    print(f"Tool get_ticket_price called for {destination_city}")
    city = destination_city.lower()
    return ticket_prices.get(city, "Unknown")
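
The lookup lowercases the city, so "London" and "LONDON" both match, while anything outside the dictionary falls back to "Unknown". The sketch below shows that behavior with a hypothetical variant, `lookup_price`, which additionally trims surrounding whitespace (an assumption about what the model might send, not something the original function does).

```python
ticket_prices = {"london": "$799", "paris": "$899", "tokyo": "$1400", "berlin": "$499"}


def lookup_price(destination_city: str) -> str:
    """Hypothetical variant of get_ticket_price that also trims whitespace."""
    return ticket_prices.get(destination_city.strip().lower(), "Unknown")


examples = {city: lookup_price(city) for city in ["London", " PARIS ", "Rome"]}
print(examples)
```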

In [7]:
price_function = {
    "name": "get_ticket_price",
    "description": "Get the price of a return ticket to the destination city. Call this whenever you need to know the ticket price, for example when a customer asks 'How much is a ticket to this city'",
    "parameters": {
        "type": "object",
        "properties": {
            "destination_city": {
                "type": "string",
                "description": "The city that the customer wants to travel to",
            },
        },
        "required": ["destination_city"],
        "additionalProperties": False
    }
}

In [8]:
tools = [{"type": "function", "function": price_function}]
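
When the model decides to use the tool, it sends the arguments back as a JSON string that should follow the `parameters` schema. It can be worth checking that string before acting on it. The sketch below uses a hypothetical helper, `check_arguments`, that applies only two basic checks (required keys present, no unexpected keys); it is not a full JSON Schema validator.

```python
import json

# Trimmed copy of the parameter schema from price_function
parameters = {
    "type": "object",
    "properties": {"destination_city": {"type": "string"}},
    "required": ["destination_city"],
    "additionalProperties": False,
}


def check_arguments(raw_arguments: str, schema: dict) -> dict:
    """Parse the JSON argument string and apply two basic schema checks."""
    arguments = json.loads(raw_arguments)
    missing = [key for key in schema["required"] if key not in arguments]
    extra = [key for key in arguments if key not in schema["properties"]]
    if missing or (extra and not schema.get("additionalProperties", True)):
        raise ValueError(f"bad tool arguments: missing={missing}, extra={extra}")
    return arguments


args = check_arguments('{"destination_city": "Berlin"}', parameters)
print(args)
```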

In [9]:
def handle_tool_call(message):
    tool_call = message.tool_calls[0]
    arguments = json.loads(tool_call.function.arguments)
    city = arguments.get('destination_city')
    price = get_ticket_price(city)
    response = {
        "role": "tool",
        "content": json.dumps({"destination_city": city,"price": price}),
        "tool_call_id": message.tool_calls[0].id
    }
    return response, city
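
`handle_tool_call` only looks at `tool_calls[0]`. If the model requests several tool calls in one turn, each one needs its own `tool` message with the matching `tool_call_id`. A sketch of a loop-based variant follows; `handle_tool_calls` and `fake_call` are hypothetical names, and `SimpleNamespace` objects merely imitate the attributes of the real response message so the example runs without an API call.

```python
import json
from types import SimpleNamespace

ticket_prices = {"london": "$799", "paris": "$899"}


def handle_tool_calls(message):
    """Return one 'tool' message per requested call, plus the cities seen."""
    responses, cities = [], []
    for tool_call in message.tool_calls:
        arguments = json.loads(tool_call.function.arguments)
        city = arguments.get("destination_city")
        price = ticket_prices.get((city or "").lower(), "Unknown")
        responses.append({
            "role": "tool",
            "content": json.dumps({"destination_city": city, "price": price}),
            "tool_call_id": tool_call.id,
        })
        cities.append(city)
    return responses, cities


def fake_call(call_id, city):
    """Imitate a tool call object (not a real API object)."""
    function = SimpleNamespace(
        name="get_ticket_price",
        arguments=json.dumps({"destination_city": city}),
    )
    return SimpleNamespace(id=call_id, function=function)


fake_message = SimpleNamespace(
    tool_calls=[fake_call("call_1", "London"), fake_call("call_2", "Paris")]
)
responses, cities = handle_tool_calls(fake_message)
print(cities, [r["tool_call_id"] for r in responses])
```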

#### __The Chat Method__

In [10]:
# defining the chat method that gradio will need
def chat(message, history):
    image = None
    conversation = [{"role": "system", "content": system_message}]
    for human, assistant in history:
        conversation.append({"role": "user", "content": human})
        conversation.append({"role": "assistant", "content": assistant})
    conversation.append({"role": "user", "content": message})
    response = openai.chat.completions.create(model=MODEL, messages=conversation, tools=tools)

    # check whether the model wants to call a tool
    if response.choices[0].finish_reason == "tool_calls":
        tool_message = response.choices[0].message
        tool_response, city = handle_tool_call(tool_message)
        conversation.append(tool_message)
        conversation.append(tool_response)
        # whenever the price tool is called, also have the artist generate an image of the city
        image = artist(city)
        response = openai.chat.completions.create(model=MODEL, messages=conversation)

    reply = response.choices[0].message.content
    # once the reply is available, have talker read it aloud
    talker(reply)
    return reply, image
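
The first part of `chat` converts Gradio's list of `[user, assistant]` pairs into the list of role-tagged messages the chat completions API expects. Pulled out into a standalone function (hypothetical name `build_conversation`), that step can be tried without any API call:

```python
def build_conversation(system_message, history, message):
    """Turn [user, assistant] pairs plus a new message into role-tagged messages."""
    conversation = [{"role": "system", "content": system_message}]
    for human, assistant in history:
        conversation.append({"role": "user", "content": human})
        conversation.append({"role": "assistant", "content": assistant})
    conversation.append({"role": "user", "content": message})
    return conversation


conversation = build_conversation(
    "You are a helpful assistant.",
    [["Hi", "Hello, how can I help?"]],
    "How much is a ticket to Tokyo?",
)
print([m["role"] for m in conversation])
```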

#### __The Gradio Interface__
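
The interface relies on a two-step callback pattern: `user` clears the textbox and appends the new message with an empty reply slot, then `bot` fills that slot. The sketch below runs the same two functions without Gradio; `fake_chat` is a hypothetical stub standing in for `chat`, so no model, image or audio call is made.

```python
def fake_chat(message, history):
    """Stub standing in for chat(): returns a reply and no image."""
    return f"Echo: {message}", None


def user(user_message, history):
    # Clear the textbox ("") and add the message with an empty reply slot
    return "", history + [[user_message, None]]


def bot(history):
    user_message = history[-1][0]
    bot_message, image = fake_chat(user_message, history[:-1])
    history[-1][1] = bot_message  # fill the reply slot of the latest turn
    return history, image


textbox, history = user("Hello", [])
history, image = bot(history)
print(textbox, history, image)
```

Splitting the work this way lets the user's message appear in the chat window immediately, while the slower model, image and audio calls run in the second step.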

In [11]:
# More involved Gradio code as we're not using the preset Chat interface

with gr.Blocks() as ui:
    with gr.Row():
        chatbot = gr.Chatbot(height=500)
        image_output = gr.Image(height=500)
    with gr.Row():
        msg = gr.Textbox(label="Chat with our AI Assistant:")
    with gr.Row():
        clear = gr.Button("Clear")

    def user(user_message, history):
        return "", history + [[user_message, None]]

    def bot(history):
        user_message = history[-1][0]
        bot_message, image = chat(user_message, history[:-1])
        history[-1][1] = bot_message
        return history, image

    msg.submit(user, [msg, chatbot], [msg, chatbot], queue=False).then(
        bot, chatbot, [chatbot, image_output]
    )
    clear.click(lambda: None, None, chatbot, queue=False)

ui.launch()



* Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.




Input #0, wav, from '/var/folders/7c/6tn50bjd30l3zb0p8_7mr94m0000gn/T/tmpqx7qevw7.wav':
  Duration: 00:00:01.97, bitrate: 384 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 24000 Hz, 1 channels, s16, 384 kb/s
   1.93 M-A:  0.000 fd=   0 aq=    0KB vq=    0KB sq=    0B 




Tool get_ticket_price called for London
Tool get_ticket_price called for Tokyo
Tool get_ticket_price called for Paris
Tool get_ticket_price called for Berlin