# 🤖 Sakila Data Agent Guide 📊

In this notebook we will develop the necessary functions to build a Sakila Data AI Agent based on the OpenAI API.

- We will cover the following topics:

    1. ChatGPT API call
    2. System Prompts
    3. Tool Calls
    4. Tool Call Parsing
    5. SQL Data Querying Tool
    6. Create the Sakila Data Agent

- To run the OpenAI API you will need to add an API key to a `.env` file, along with the database credentials. Use the `.env.sample` file as a template and rename it to `.env` after cloning the repository.

- Relevant Docs: [OpenAI API Documentation](https://platform.openai.com/docs/overview)
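
Before running the cells that follow, it can help to confirm that the required variables are actually set. The snippet below is a minimal sketch, assuming the `.env` file defines `OPENAI_API_KEY` (the variable the OpenAI client reads by default) and `DB_PASSWORD` (used later in this notebook). `missing_env_vars` is a hypothetical helper, not part of any library.

```python
import os

# Variable names assumed from this notebook: the OpenAI client reads
# OPENAI_API_KEY by default, and DB_PASSWORD is used for the database connection.
REQUIRED_VARS = ["OPENAI_API_KEY", "DB_PASSWORD"]

def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

Run it after the `.env` file has been loaded (for example with `load_dotenv()`, as done later in this notebook); otherwise the variables will be reported as missing.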

### 0. Install the OpenAI library
- Let's start by installing the OpenAI library

In [1]:
!pip install openai



### 0.1 Import Libraries
- Now let's import the necessary libraries for the notebook.

In [2]:
import json # for parsing the JSON responses from the OpenAI API, we'll see more about this later
from openai import OpenAI

In [3]:
# Libraries & functions to pretty print text and JSON responses, just needed for the notebook
import textwrap # Text wrapping
from pprint import pprint # Pretty printing json

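def pretty_print(response):
    if isinstance(response, str):
        # Wrap plain text to 80 characters
        print(textwrap.fill(response, width=80))
    elif isinstance(response, dict):
        # textwrap only accepts strings, so dicts are pretty-printed instead
        pprint(response, width=80)
    else:
        # OpenAI response objects are pydantic models; convert to a dict first
        response_dict = response.model_dump()
        pprint(response_dict, width=80)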


# 1. ChatGPT API call

+ [OpenAI Quickstart](https://platform.openai.com/docs/quickstart)

Find below a basic ChatGPT Chat Completion API call. We use 3 of its parameters:
1. `model`: the model to use, in this case `'gpt-4o-mini'`
2. `messages`: the list of messages to send to the model; this is where the conversation history is passed. The first message is the system prompt, and the rest is the alternating conversation between the user and the assistant.
    - The system prompt: the initial instructions for the model
    - The user prompt: the question or task for the model
3. `tools`: the available tools to use
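
To make the structure of `messages` concrete, here is a minimal sketch of how the list could be assembled from a system prompt, earlier turns, and a new user prompt. `build_messages` is a hypothetical helper introduced for illustration, and the history content is a placeholder.

```python
def build_messages(system_prompt, history, user_prompt):
    """Assemble `messages`: system prompt first, then prior turns, then the new user prompt."""
    messages = [{"role": "system", "content": system_prompt}]
    messages.extend(history)  # earlier user/assistant turns, oldest first
    messages.append({"role": "user", "content": user_prompt})
    return messages

# Placeholder history: one earlier user question and one assistant reply
history = [
    {"role": "user", "content": "Which country has the most bordering countries?"},
    {"role": "assistant", "content": "(previous assistant reply)"},
]

messages = build_messages("You are a helpful assistant.", history, "And which has the fewest?")
for message in messages:
    print(message["role"], "->", message["content"])
```

The helper builds a new list each time, so the stored history is not modified by adding the system prompt.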

In [4]:
# Initialize the OpenAI client. It reads the API key from the OPENAI_API_KEY environment variable, so the .env file must be loaded first (e.g. with load_dotenv())
client = OpenAI()

# Make a basic ChatGPT API call
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    tools=None,
    messages=[
        {   "role": "system", 
            "content": "You are a helpful assistant."
        }, # The system prompt is always the first message
        {
            "role": "user",
            "content": "Which country has the most bordering countries?" # User prompt
        }
    ]
)

# Access the response from the LLM
response = completion.choices[0].message # The response from the LLM

In [5]:
# Print the chat completion response
pretty_print(response)

{'annotations': [],
 'audio': None,
 'content': 'The country with the most bordering countries is China, which '
            'shares its borders with 14 different countries. These countries '
            'are India, Russia, Mongolia, Pakistan, Nepal, Bhutan, '
            'Afghanistan, Tajikistan, Kyrgyzstan, and Kazakhstan, as well as '
            'the Special Administrative Regions of Hong Kong and Macau, which '
            'have borders with neighboring regions of countries. \n'
            '\n'
            'In addition, Brazil also shares borders with 10 countries and is '
            'notable for having the most neighboring countries in South '
            'America. However, the record for the most total bordering '
            'countries is held by China.',
 'function_call': None,
 'refusal': None,
 'role': 'assistant',
 'tool_calls': None}


The chat completion `response` JSON object above has 2 keys we will be working with:
1. `content`: the text response from the LLM
2. `tool_calls`: the tool calls made by the LLM

# 2. System Prompt

[OpenAI Prompt Engineering Guide](https://platform.openai.com/docs/guides/prompt-engineering)

+ System prompts are a way to <u>guide the AI's behavior</u>. They are a part of the prompt that is sent to the AI.
+ The system prompt is sent to the AI at the beginning of the conversation.

In [12]:
SYSTEM_PROMPT = """
You are a helpful assistant and an expert data analyst that can answer questions about the sakila database. Respond in a very formal and concise way.
"""

In [13]:
client = OpenAI()

# Make a basic ChatGPT API call
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {   "role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": "Who are you?"
        }
    ]
)

response = completion.choices[0].message

In [14]:
pretty_print(response)

{'annotations': [],
 'audio': None,
 'content': 'I am an expert data analyst with specialized knowledge of the '
            'Sakila database. I am here to assist you by providing insights '
            'and answering questions related to this database. How may I '
            'assist you today?',
 'function_call': None,
 'refusal': None,
 'role': 'assistant',
 'tool_calls': None}


# 3. Tool Calling

[OpenAI Function Calling Docs](https://platform.openai.com/docs/guides/function-calling)

### What are tool calls?
+ Tool calling is a way to <u>extend the capabilities of the AI</u>.
+ It allows the AI to request that a function be executed; our code runs the function and returns the result.
+ It is a way to <u>integrate external tools</u> into the AI's behavior.

To provide a tool to the LLM, we need to define two things: 
1. Tool function schema definition
2. A python function to be called when the LLM invokes the tool

### 3.1 Tool Function Definition

The tool function definition is a dictionary that describes the tool to the LLM.
It has the following keys:
1. **type**: the type of the function, it must be `"function"`
2. **function**: a dictionary with the following keys:
    - **name**: the name of the function
    - **description**: the description of the function
    - **parameters**: the parameters of the function
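
As an illustration of the keys listed above, the following sketch builds a tool definition from its parts. `make_tool` is a hypothetical helper, not part of the OpenAI library; the `description` inside the property is an optional JSON Schema field added here as an example.

```python
def make_tool(name, description, properties, required=None):
    """Build one tool definition with the keys described above."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": list(required or []),
            },
        },
    }

weather_tool = make_tool(
    name="get_weather",
    description="Get the weather in a given location",
    properties={"location": {"type": "string", "description": "City name, e.g. Barcelona"}},
    required=["location"],
)
print(weather_tool["function"]["name"])
```

A clear `description` for the tool and for each parameter generally helps the model decide when to call the tool and with which arguments.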
    

In [16]:
# Create a list of tools following OpenAI's function definition schema
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the weather in a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"}
                },
                "required": ["location"]
            },
        },
    },
]

### 3.2 Python Function to be Called when the LLM invokes the tool

This is simply a python function that will be called when the LLM invokes the tool.

In [17]:
# Create a python function to be called when the LLM invokes the tool
def get_weather(location):
    # Here we simulate the weather function with hardcoded values, just for testing
    return f"The weather in {location} is sunny and 22°C"

# 4. Tool Call Parsing

How do tools work?
+ When the user asks something that should be answered using one or more of the predefined tools, the LLM `response` object will contain a `tool_calls` key with a list of tool calls. The `content` key will usually be `None`.
+ Each tool call has an `id`, a `type`, and a `function` entry with the following keys:
    - `name`: the name of the tool
    - `arguments`: the arguments for the tool, as a JSON-formatted string
+ Then we **parse out the tool call arguments** and execute the Python function with the arguments.

Now we are ready to test the tool calling and parse the response to get the tool call arguments.

What does parsing mean?
+ Parsing is the process of extracting the desired information from a string or a data structure. In this case we want to take the JSON string in the `arguments` key and convert it to a Python dictionary.
+ JSON string -> Python dictionary
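
The parsing step can be sketched on its own, without an API call. Since the arguments are generated by the model, it is reasonable to guard against malformed JSON. `parse_tool_arguments` is a hypothetical helper introduced for this example.

```python
import json

def parse_tool_arguments(raw_arguments):
    """Convert the JSON string from `arguments` into a dict; return None if it is invalid."""
    try:
        parsed = json.loads(raw_arguments)
    except (json.JSONDecodeError, TypeError):
        return None
    # Tool arguments are expected to be a JSON object, i.e. a Python dict
    return parsed if isinstance(parsed, dict) else None

print(parse_tool_arguments('{"location": "Barcelona"}'))  # {'location': 'Barcelona'}
print(parse_tool_arguments('{"location": '))              # None (truncated JSON)
```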

In [18]:
client = OpenAI()

# Make a ChatGPT API call with tool calling
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    tools=tools, # here we pass the tools to the LLM
    messages=[
        {   "role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": "What is the current weather in Barcelona?"
        }
    ]
)

response = completion.choices[0].message

In [19]:
pretty_print(response)

{'annotations': [],
 'audio': None,
 'content': None,
 'function_call': None,
 'refusal': None,
 'role': 'assistant',
 'tool_calls': [{'function': {'arguments': '{"location":"Barcelona"}',
                              'name': 'get_weather'},
                 'id': 'call_pZDSv3QizYltiBSy1MtN72Eg',
                 'type': 'function'}]}


If we examine the response `'tool_calls'` key we can see that the LLM has called the tool `get_weather` with the argument `{"location": "Barcelona"}`.

We can now parse the response to get the tool call arguments and execute the python function to get the actual weather in Barcelona.

In [20]:
# Check if the response contains a tool call
if response.tool_calls:
    # Process each tool call
    for tool_call in response.tool_calls:
        # Get the tool call arguments
        tool_call_arguments = json.loads(tool_call.function.arguments) # Parse the JSON object to a python dictionary
        if tool_call.function.name == "get_weather":
            # Here we call the predefined get_weather function and pass the tool call arguments the LLM provided
            print(f"Tool arguments: {tool_call_arguments}")
            print(get_weather(tool_call_arguments["location"]))
else:
    print(response.content) # No tool was called, so show the assistant's text response

Tool arguments: {'location': 'Barcelona'}
The weather in Barcelona is sunny and 22°C
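
With more than one tool, the chain of `if` checks on the tool name can be replaced by a lookup table. The following is a sketch of that idea, not the notebook's own approach: `TOOL_REGISTRY` and `run_tool_call` are hypothetical names, and `fake_tool_call` is a stand-in that mimics one entry of `response.tool_calls` so the code runs without calling the API.

```python
import json
from types import SimpleNamespace

def get_weather(location):
    # Hardcoded value, as in the example above
    return f"The weather in {location} is sunny and 22°C"

# Hypothetical registry: maps tool names to the Python functions that implement them
TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool_call(tool_call, registry=TOOL_REGISTRY):
    """Look up the requested tool by name and call it with the parsed arguments."""
    name = tool_call.function.name
    if name not in registry:
        return f"Unknown tool: {name}"
    arguments = json.loads(tool_call.function.arguments)
    return registry[name](**arguments)

# Stand-in for one entry of response.tool_calls
fake_tool_call = SimpleNamespace(
    id="call_example",
    type="function",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Barcelona"}'),
)
print(run_tool_call(fake_tool_call))
```

Adding a new tool then only requires a new schema entry in `tools` and a new entry in the registry.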


# 5. SQL Data Querying Tool

Now let's create a new tool to get data from the database. It will work like this:
1. When the LLM needs to call the tool to get data from the database, it will call the tool and pass the SQL query to be executed.
2. The tool will execute the SQL query and return the results.
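
The two steps above can be tried in isolation before connecting to MySQL. The sketch below uses Python's built-in `sqlite3` module with a tiny in-memory table as a stand-in for the Sakila database; `run_sql_tool` is a hypothetical name, and the real tool built in this section uses SQLAlchemy instead.

```python
import sqlite3

def run_sql_tool(connection, sql_query):
    """Execute the SQL text supplied by the LLM and return the rows as a list of dicts."""
    cursor = connection.execute(sql_query)
    if cursor.description is None:
        # Statements that return no result set (e.g. INSERT) have no column metadata
        return []
    columns = [column[0] for column in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]

# In-memory stand-in table; the real tool queries the Sakila MySQL database
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE actor (actor_id INTEGER, first_name TEXT, last_name TEXT)")
connection.executemany(
    "INSERT INTO actor VALUES (?, ?, ?)",
    [(1, "PENELOPE", "GUINESS"), (2, "NICK", "WAHLBERG")],
)

rows = run_sql_tool(connection, "SELECT first_name FROM actor ORDER BY actor_id")
print(rows)
```

Because the query text comes from the model, it is worth considering a database user with read-only permissions for the real connection.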

### 5.1 Connection to the database using sqlalchemy

In [21]:
!pip install pymysql sqlalchemy pandas python-dotenv



In [22]:
# Import the necessary libraries
import os
import pymysql
from sqlalchemy import create_engine, text
import pandas as pd
from dotenv import load_dotenv

# Load the environment variables
load_dotenv()

True

In [23]:
# Create the connection string
password = os.getenv("DB_PASSWORD")
connection_string = 'mysql+pymysql://root:' + password + '@localhost/sakila'
# Create the engine
engine = create_engine(connection_string)
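
Plain string concatenation fails with a `TypeError` if `DB_PASSWORD` is not set, and a password containing characters such as `@` or `/` will break the URL. A possible hardening, sketched under the assumption that the user, host, and database stay as above, is to URL-encode the password with the standard library. `build_connection_string` is a hypothetical helper.

```python
from urllib.parse import quote_plus

def build_connection_string(password, user="root", host="localhost", database="sakila"):
    """Build a mysql+pymysql URL, URL-encoding the password."""
    if not password:
        raise ValueError("DB_PASSWORD is not set; check the .env file")
    return f"mysql+pymysql://{user}:{quote_plus(password)}@{host}/{database}"

# Example with a made-up password that contains special characters
print(build_connection_string("p@ss/word"))
```

In the notebook it would be called as `build_connection_string(os.getenv("DB_PASSWORD"))` and the result passed to `create_engine`.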

In [24]:
# Testing the connection
query = "SELECT * FROM actor"

with engine.connect() as connection:    
    query = text(query)
    result = connection.execute(query)
    df = pd.DataFrame(result.all())

df.head()

| | actor_id | first_name | last_name | last_update |
|---|---|---|---|---|
| 0 | 1 | PENELOPE | GUINESS | 2006-02-15 04:34:33 |
| 1 | 2 | NICK | WAHLBERG | 2006-02-15 04:34:33 |
| 2 | 3 | ED | CHASE | 2006-02-15 04:34:33 |
| 3 | 4 | JENNIFER | DAVIS | 2006-02-15 04:34:33 |
| 4 | 5 | JOHNNY | LOLLOBRIGIDA | 2006-02-15 04:34:33 |


### 5.2 Create a function to get data from the database

In [25]:
# Create a function to get data from the database
def get_data_df(query):
    # Create the connection string
    password = os.getenv("DB_PASSWORD")
    connection_string = 'mysql+pymysql://root:' + password + '@localhost/sakila'
    # Create the engine
    engine = create_engine(connection_string)
    with engine.connect() as connection:    
        query = text(query)
        result = connection.execute(query)
        df = pd.DataFrame(result.all())
    return df

In [27]:
# Test the function
query = "SELECT * FROM city"
df = get_data_df(query)
df.head()

| | city_id | city | country_id | last_update |
|---|---|---|---|---|
| 0 | 1 | A Coruña (La Coruña) | 87 | 2006-02-15 04:45:25 |
| 1 | 2 | Abha | 82 | 2006-02-15 04:45:25 |
| 2 | 3 | Abu Dhabi | 101 | 2006-02-15 04:45:25 |
| 3 | 4 | Acuña | 60 | 2006-02-15 04:45:25 |
| 4 | 5 | Adana | 97 | 2006-02-15 04:45:25 |


# 6. Create the Sakila Data Agent

Now we can put all the pieces together and create the agent function that will be used in the Streamlit app. Our simple AI agent is composed of 3 elements:
1. System prompt
2. Tools
3. Agent

### 6.1 System Prompt

In [28]:
from agent.sakila_schema import SAKILA_SCHEMA

SYSTEM_PROMPT = f"""
- You are a helpful assistant and an expert data analyst that can answer questions about the sakila database. 
- Use the get_data_df tool to get the data from the database. 
- Generate SQL queries following the schema of the database. 

DATABASE SCHEMA:
{SAKILA_SCHEMA}
"""

In [29]:
pretty_print(SYSTEM_PROMPT)

 - You are a helpful assistant and an expert data analyst that can answer
questions about the sakila database.  - Use the get_data_df tool to get the data
from the database.  - Generate SQL queries following the schema of the database.
actor_info - address - category - city - country - customer - customer_list -
customer_rental_info - customer_rental_payment1 - events - film - film_actor -
film_category - film_list - film_text - inventory - language -
nicer_but_slower_film_list - payment - rental - sales_by_film_category -
sales_by_store - staff - staff_list - store
------------   actor_id:     Type: smallint unsigned     Nullable: NO     Key:
PRI   first_name:     Type: varchar(45)     Nullable: NO   last_name:     Type:
varchar(45)     Nullable: NO     Key: MUL   last_update:     Type: timestamp
Nullable: NO  Table: actor_info -----------------   actor_id:     Type: smallint
unsigned     Nullable: NO   first_name:     Type: varchar(45)     Nullable: NO
last_name:     Type: varchar(45

### 6.2 Tools

In [30]:
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_data_df",
            "description": "Get the data from the database",
            "parameters": {
                "type": "object",
                "properties": {
                    "sql_query": {"type": "string"}
                },
                "required": ["sql_query"]
            },
        },
    },
]

### 6.3 Agent

In [31]:
# Let's create a function to use the OpenAI API with tool calling, and let's include the get_data_df tool

def agent(messages):
    
    # Initialize the OpenAI client
    client = OpenAI()

    # Make a ChatGPT API call with tool calling
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        tools=tools, # here we pass the tools to the LLM
        messages=messages
    )

    response = completion.choices[0].message

    # Check if the response contains a tool call
    if response.tool_calls:
        # Process each tool call
        for tool_call in response.tool_calls:
            # Get the tool call arguments
            tool_call_arguments = json.loads(tool_call.function.arguments) # Parse the JSON object to a python dictionary
            if tool_call.function.name == "get_data_df":
                # Here we call the predefined get_data_df function and pass the SQL query the LLM provided
                print(f"Tool arguments: {tool_call_arguments}")
                return get_data_df(tool_call_arguments["sql_query"])
    else:
        return response.content
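
The `agent` function above returns the raw DataFrame as soon as the tool is called. One possible extension, not part of this notebook, is to send the query result back to the model so it can phrase a natural-language answer. In the Chat Completions API this is done by appending the assistant message that contains the tool call, followed by a message with role `tool` whose `tool_call_id` matches the call. The sketch below only builds those two messages with plain dicts; `tool_result_messages` is a hypothetical helper and the id and result values are made up.

```python
import json

def tool_result_messages(tool_call_id, tool_name, raw_arguments, result_text):
    """Build the two messages that report a tool result back to the model."""
    assistant_message = {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": tool_call_id,
                "type": "function",
                "function": {"name": tool_name, "arguments": raw_arguments},
            }
        ],
    }
    tool_message = {
        "role": "tool",
        "tool_call_id": tool_call_id,  # must match the id of the tool call being answered
        "content": result_text,        # tool output as text, e.g. a DataFrame converted to a string
    }
    return [assistant_message, tool_message]

# Made-up values standing in for a real tool call and its query result
follow_up = tool_result_messages(
    tool_call_id="call_example",
    tool_name="get_data_df",
    raw_arguments='{"sql_query": "SELECT COUNT(*) AS n FROM actor"}',
    result_text="n\n200",
)
print(json.dumps(follow_up, indent=2))
```

These messages would be appended to `messages` before calling `client.chat.completions.create` a second time, so the model sees both its request and the data returned.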

In [33]:
# Test the function
user_prompt = "What is your name?" # Try a data question too, e.g. "How many actors are in the database?"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": user_prompt}
]

response = agent(messages)

response

'I am an AI assistant designed to help you with data analysis and queries related to the Sakila database. You can call me Assistant! How can I assist you today?'

# 7. Streamlit App

Okay, we have now developed all the functions needed to create the Sakila Data Agent. Next, we have to use these functions in a Streamlit app to build the UI for the agent.

Use the empty `app.py` file to build the UI. Copy the contents of the `Chatbot.py` file from [this repository](https://github.com/streamlit/llm-examples/blob/main/Chatbot.py), paste them into `app.py`, and work from there.
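
When wiring the agent into the app, the core of each chat turn is independent of the UI: store the user prompt, call the agent with the full conversation, and store the reply. The sketch below shows that logic with a stand-in agent so it runs on its own. `handle_turn` and `fake_agent` are hypothetical names; in Streamlit the `history` list would typically be kept in `st.session_state`.

```python
def handle_turn(history, user_prompt, agent_fn, system_prompt):
    """Append the user prompt, call the agent with the full conversation, store and return the reply."""
    history.append({"role": "user", "content": user_prompt})
    messages = [{"role": "system", "content": system_prompt}] + history
    reply = agent_fn(messages)
    # The agent may return text or a DataFrame; store a text version in the history
    history.append({"role": "assistant", "content": str(reply)})
    return reply

def fake_agent(messages):
    # Stand-in for the agent function defined in section 6.3
    return f"Received {len(messages)} messages"

history = []
print(handle_turn(history, "How many films are there?", fake_agent, "You are a helpful assistant."))
```

Keeping the system prompt out of `history` means the UI can render every stored message without filtering.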

🍀 **GOOD LUCK!**