# Integrating Refuel Autolabel into EvaDB

## 1. Setup

In [1]:
!pip install 'refuel-autolabel'
!pip install evadb

Collecting refuel-autolabel
  Obtaining dependency information for refuel-autolabel from https://files.pythonhosted.org/packages/5c/6c/08eb43748e8b9a4af05dd10346c852f3cfb8feeb1dc66407794d2350cbf9/refuel_autolabel-0.0.16-py3-none-any.whl.metadata
  Using cached refuel_autolabel-0.0.16-py3-none-any.whl.metadata (13 kB)
Collecting loguru>=0.5.0 (from refuel-autolabel)
  Obtaining dependency information for loguru>=0.5.0 from https://files.pythonhosted.org/packages/03/0a/4f6fed21aa246c6b49b561ca55facacc2a44b87d65b8b92362a8e99ba202/loguru-0.7.2-py3-none-any.whl.metadata
  Using cached loguru-0.7.2-py3-none-any.whl.metadata (23 kB)
Collecting numpy>=1.23.0 (from refuel-autolabel)
  Obtaining dependency information for numpy>=1.23.0 from https://files.pythonhosted.org/packages/2e/54/218ce51bb571a70975f223671b2a86aa951e83abfd2a416a3d540f35115c/numpy-1.26.2-cp311-cp311-macosx_11_0_arm64.whl.metadata
  Downloading numpy-1.26.2-cp311-cp311-macosx_11_0_arm64.whl.metadata (115 kB)

In [2]:
import os
import evadb
from my_secrets import OPENAI_KEY
# provide your own OpenAI API key here
os.environ['OPENAI_API_KEY'] = OPENAI_KEY

In [3]:
cursor = evadb.connect().cursor()
print("Connected to EvaDB")

Connected to EvaDB


In [4]:
from autolabel import get_data
get_data('walmart_amazon')

## 2. RefuelAutolabel Function

### 2.1 Create Function in EvaDB

I implemented the function that supports autolabeling via Refuel in functions/refuel_autolabel.py. The cell below registers that implementation as a function in EvaDB.

I'm supporting the following actions:

- plan - plan the autolabeling task by giving a cost estimate and providing an example prompt which will be used to label an example
- run - run the autolabeling task and insert the generated labels into the dataset
- explain - explain the label generated for each example in the dataset

In Refuel, these actions are all implemented in the LabelingAgent class, which is the class used to label datasets through Refuel's Python API.

In [7]:
create_function_query = f"""CREATE FUNCTION IF NOT EXISTS RefuelAutolabel
            IMPL  '../functions/refuel_autolabel.py';
            """
cursor.query("DROP FUNCTION IF EXISTS RefuelAutolabel;").execute()
cursor.query(create_function_query).execute()
print("Created Function")



Created Function


### 2.2 Plan Autolabeling

The following cell uses the autolabeling configuration in config_banking.json and the size of seed.csv to estimate the cost of labeling seed.csv. It also shows an example of the prompt that would be generated and sent to ChatGPT for this autolabeling task.

The prompt asks the LLM to classify a complaint into one of several predetermined categories and gives it several examples to learn from.
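To make the shape of such a prompt concrete, here is a minimal sketch of a few-shot classification prompt builder. It is not Refuel's actual template: `build_few_shot_prompt`, the wording of the instructions, and the `category_a`/`category_b` labels are all placeholders invented for illustration.

```python
def build_few_shot_prompt(labels, examples, query):
    """Assemble a few-shot classification prompt (illustrative, not Refuel's template)."""
    lines = [
        "Classify the customer complaint into exactly one of these categories:",
        ", ".join(labels),
        "",
        "Examples:",
    ]
    # Each labeled example shows the model an input and its expected output
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    # The item to label comes last, with the output left open for the model
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Placeholder category names and label assignment, NOT taken from config_banking.json
labels = ["category_a", "category_b"]
examples = [("I want to close my account", "category_a")]
prompt = build_few_shot_prompt(labels, examples, "There is an extra $1 charge")
print(prompt)
```

The real prompt printed by the plan action depends on the task guidelines and example selection settings in the config file.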

In [8]:
query= f""" SELECT RefuelAutolabel("plan", '../configs/config_banking.json', 'seed.csv');"""
result = cursor.query(query).execute()

Output()

### 2.3 Run Autolabeling

Simple example with no optional arguments passed

In [9]:
"""
use the function arguments to specify
    1. the mode to execute in (plan, run, explain)
    2. config file
    3. dataset file
    4. (optional) output file
    5. (optional) max items
    6. (optional) start index
    7. (optional) skip eval
"""

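# variant that also passes the optional output file (defined for reference, not executed in this cell)
query_with_output = f""" SELECT RefuelAutolabel("run", '../configs/config_banking.json', 'seed.csv', 'output.csv');"""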

query= f""" SELECT RefuelAutolabel("run", '../configs/config_banking.json', 'seed.csv');"""
result = cursor.query(query).execute()

Output()

Using all of the optional arguments to label 10 examples, with a start index of 1 and skip eval set to false

In [10]:
query= f""" SELECT RefuelAutolabel("run", '../configs/config_walmart.json', 'seed.csv', 'output.csv', '10', '1', 'false');"""
result = cursor.query(query).execute()

Output()
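Since every argument is passed as a string literal in a fixed order, the query strings used in this section could be assembled by a small helper. The sketch below is only a convenience wrapper around the string format shown above; `build_autolabel_query` is a hypothetical name and is not part of EvaDB or Refuel.

```python
def build_autolabel_query(mode, config_path, dataset_path, *optional_args):
    """Build a SELECT RefuelAutolabel(...) query string (hypothetical helper).

    optional_args follow the order used in this notebook:
    output file, max items, start index, skip eval.
    """
    if mode not in ("plan", "run", "explain"):
        raise ValueError(f"unsupported mode: {mode!r}")
    if len(optional_args) > 4:
        raise ValueError("at most four optional arguments are supported")
    # Every argument after the mode is passed as a single-quoted string literal
    rest = ", ".join(
        f"'{arg}'" for arg in [config_path, dataset_path, *map(str, optional_args)]
    )
    return f'SELECT RefuelAutolabel("{mode}", {rest});'


query = build_autolabel_query(
    "run", "../configs/config_walmart.json", "seed.csv", "output.csv", 10, 1, "false"
)
print(query)
```

The resulting string can be passed to `cursor.query(...)` in the same way as the hand-written queries.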

### 2.4 Explain Autolabeling

Although the following cell threw an error, it did generate an explanation for every example and label in seed.csv. The error is raised by EvaDB's executor, which expected a pandas.DataFrame but received a list.

In [13]:
query= f"""SELECT RefuelAutolabel("explain", '../configs/config_banking.json', 'seed.csv');"""
result = cursor.query(query).execute()

Output()



11-25-2023 21:32:38 ERROR [plan_executor:plan_executor.py:execute_plan:0182] Batch constructor not properly called.
Expected pandas.DataFrame, got <class 'list'>
Traceback (most recent call last):
  File "/Users/krishnathan/Dropbox (GaTech)/Documents/College/Fall 23/CS 6422/Project/evadb-project-2/evadb_env/lib/python3.11/site-packages/evadb/executor/plan_executor.py", line 178, in execute_plan
    yield from output
  File "/Users/krishnathan/Dropbox (GaTech)/Documents/College/Fall 23/CS 6422/Project/evadb-project-2/evadb_env/lib/python3.11/site-packages/evadb/executor/project_executor.py", line 42, in exec
    batch = apply_project(dummy_batch, self.target_list)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/krishnathan/Dropbox (GaTech)/Documents/College/Fall 23/CS 6422/Project/evadb-project-2/evadb_env/lib/python3.11/site-packages/evadb/executor/executor_utils.py", line 69, in apply_project
    batches = [expr.evaluate(batch) for expr in project_list]
       

ExecutorError: Batch constructor not properly called.
Expected pandas.DataFrame, got <class 'list'>
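One possible way to avoid the `Expected pandas.DataFrame, got <class 'list'>` failure above is to wrap the explanations in a DataFrame before they are returned to EvaDB. The sketch below assumes the explain path currently returns a plain list of strings, which I have not verified against functions/refuel_autolabel.py; `to_result_frame` and the `explanation` column name are hypothetical.

```python
import pandas as pd


def to_result_frame(explanations, column="explanation"):
    """Wrap a plain list of strings in a single-column DataFrame (hypothetical helper)."""
    # Pass DataFrames through untouched so the helper is safe to call twice
    if isinstance(explanations, pd.DataFrame):
        return explanations
    return pd.DataFrame({column: list(explanations)})


frame = to_result_frame(["first explanation", "second explanation"])
print(frame.shape)
```

Whether this is enough depends on how the function declares its output columns in EvaDB, so treat it as a starting point for debugging rather than a confirmed fix.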

### 2.5 Bringing the labeled dataset into EvaDB

Here, I'm pulling the labeled banking dataset into EvaDB. I had to rename its files to seed_banking.csv and test_banking.csv so that I could run the Walmart example, which labels only 10 examples. I'm not sure why the dataset currently being labeled has to be named seed.csv and test.csv. Unfortunately, Refuel's API does not allow me to specify the dataset names in a more flexible way. This is worth investigating further if we want to work with several datasets.
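Until the naming constraint is better understood, switching between datasets could be scripted rather than done by hand. The following is a minimal sketch that copies seed_<name>.csv and test_<name>.csv over seed.csv and test.csv; `stage_dataset` is a hypothetical helper, and it assumes each dataset's files sit in one directory under that naming pattern.

```python
import shutil
from pathlib import Path


def stage_dataset(name, data_dir="."):
    """Copy seed_<name>.csv and test_<name>.csv over seed.csv and test.csv."""
    data_dir = Path(data_dir)
    staged = []
    for split in ("seed", "test"):
        source = data_dir / f"{split}_{name}.csv"
        target = data_dir / f"{split}.csv"
        # Fail early instead of leaving a half-switched dataset behind
        if not source.exists():
            raise FileNotFoundError(source)
        shutil.copyfile(source, target)
        staged.append(target)
    return staged
```

Calling `stage_dataset("banking")` before a run would make the banking files the active seed.csv and test.csv while leaving the named copies untouched. Note that it overwrites any existing seed.csv and test.csv.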

In [75]:
drop_query = "DROP TABLE IF EXISTS MyCSV"
cursor.query(drop_query).execute()

query1 = """CREATE TABLE IF NOT EXISTS MyCSV (
                example TEXT(1000),
                label TEXT(100),
                explanation TEXT(1000)
            );"""

query2 = "LOAD CSV 'test_banking.csv' INTO MyCSV;"
query3 = "SELECT * FROM MyCSV;"

result = cursor.query(query1).execute()
result = cursor.query(query2).execute()
result = cursor.query(query3).execute()

print(result)

      _row_id  \
0           1   
1           2   
2           3   
3           4   
4           5   
...       ...   
1993     1994   
1994     1995   
1995     1996   
1996     1997   
1997     1998   

                                                                                                  example  \
0                                                                              I want to close my account   
1     It seems I was overcharged when I used an ATM while on vacation. If I knew about your fees in ad...   
2                                I have a direct debit transaction I have not set up, but would like to .   
3                                                         How much does it cost in fees to use your card?   
4                                                                             There is an extra $1 charge   
...                                                                                                   ...   
1993                            