# Setup

The Llama setup was moved to __experiment/setupExperiment.py__, inside the `__init__()` function.

## Load Data

Loading and handling of the datasets is done by the external DataSet class. It provides access to all datasets as DataFrames via its subclasses, such as the Flights class. Each subclass exposes its data through the .get(Boolean) function; the Boolean controls whether the dirty or the clean set is returned.

Each DataSet Subclass also provides the following functionalities:
* random_sample(amount) generates two DataFrames, a dirty set and a clean set, of the given size with random (non-duplicate) sample rows.
* generate_examples(column_id, amount) generates an example string of 'amount' random sample rows. Whether or not there is an error in the column of the given id is appended as well. 
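
As a rough illustration of what a paired dirty/clean sample looks like, the sketch below draws the same randomly chosen rows from two DataFrames with pandas. It is not the DataSet implementation: the helper name `paired_sample`, the assumption that both frames share an index, and the toy data are all made up for this example.

```python
import pandas as pd


def paired_sample(dirty: pd.DataFrame, clean: pd.DataFrame, amount: int, seed=None):
    """Return (dirty_sample, clean_sample) covering the same randomly chosen rows.

    Hypothetical helper for illustration; it assumes both frames share an index.
    """
    # DataFrame.sample draws without replacement by default, so no row repeats.
    dirty_sample = dirty.sample(n=amount, random_state=seed)
    # Select the matching rows of the clean frame by index label.
    clean_sample = clean.loc[dirty_sample.index]
    return dirty_sample, clean_sample


# Toy data, not taken from the Flights dataset.
toy_dirty = pd.DataFrame({"City": ["Kyoto", "Miami", "Paris", "xyz"]})
toy_clean = pd.DataFrame({"City": ["Greenwich", "Miami", "Barcelona", "Istanbul"]})

dirty_rows, clean_rows = paired_sample(toy_dirty, toy_clean, amount=2, seed=0)
print(dirty_rows)
print(clean_rows)
```

Because both samples carry the same index labels, the two frames can be compared cell by cell to find out which values were corrupted.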


In [8]:
from experiment.errorDetection import Flights, Food, Hospital
data = Flights().get(True)
print(data[:10])

from_string grammar:



root ::= root_1 
root_1 ::= [Y] [e] [s] | [N] [o] 
  src           flight sched_dep_time act_dep_time        sched_arr_time  \
0  aa  AA-3859-IAH-ORD      7:10 a.m.    7:16 a.m.             9:40 a.m.   
1  aa  AA-1733-ORD-PHX      7:45 p.m.    7:58 p.m.            10:30 p.m.   
2  aa  AA-1640-MIA-MCO      6:30 p.m.          NaN             7:25 p.m.   
3  aa   AA-518-MIA-JFK      6:40 a.m.    6:54 a.m.             9:25 a.m.   
4  aa  AA-3756-ORD-SLC     12:15 p.m.   12:41 p.m.             2:45 p.m.   
5  aa   AA-204-LAX-MCO     11:25 p.m.          NaN  12/02/2011 6:55 a.m.   
6  aa  AA-3468-CVG-MIA      7:00 a.m.    7:25 a.m.             9:55 a.m.   
7  aa   AA-484-DFW-MIA      4:15 p.m.    4:29 p.m.             7:55 p.m.   
8  aa   AA-446-DFW-PHL     11:50 a.m.   12:12 p.m.             3:50 p.m.   
9  aa   AA-466-IAH-MIA      6:00 a.m.    6:08 a.m.             9:20 a.m.   

  act_arr_time  
0    9:32 a.m.  
1          NaN  
2          NaN  
3    9:28 a.m.  
4    2:50 p.m.  
5         

In [9]:
dirty, clean = Flights().random_sample_with_quota(10, dirty_amount=2)
print(dirty)
print("----")
print(clean)

                src           flight sched_dep_time     act_dep_time  \
584     flightstats  CO-1193-EWR-MCO      9:15 a.m.        9:14 a.m.   
891         flights  AA-2268-PHX-ORD      7:15 a.m.        7:22 a.m.   
1788  flylouisville  AA-1664-MIA-ATL     10:15 a.m.       10:18 a.m.   
1456           ifly  AA-3756-ORD-SLC     12:15 p.m.       12:41 p.m.   
28               aa  AA-1221-MCO-ORD      8:00 p.m.        8:23 p.m.   
1753             ua  UA-2314-ATL-PHL      2:55 p.m.        2:55 p.m.   
457      flightview  AA-1886-BOS-MIA            NaN       10:55 a.m.   
616     flightstats  AA-4277-CVG-JFK     12:10 p.m.       12:10 p.m.   
1435             CO    CO-62-IAH-EWR      2:30 p.m.        2:48 p.m.   
1063    travelocity  UA-3099-PHX-PHL     11:55 a.m.  Contact Airline   

     sched_arr_time     act_arr_time  
584      12:18 p.m.       12:09 p.m.  
891      11:35 a.m.       11:06 a.m.  
1788     12:10 p.m.       11:56 a.m.  
1456      2:45 p.m.        2:50 p.m.  
28        9:

In [10]:
from experiment.duplicateDetection import DuplicateDetection, Affiliation
import os 

def test_duplicateDetection(logging_path: str, n_iterations=3):
    N_ROWS = 5
    N_DUPLICATES = 2
    CHANCE_MULTIPLE_DUPLICATES = 0.3
    if os.path.exists(logging_path):
        raise FileExistsError(f"The directory '{logging_path}' already exists.")
    dd = DuplicateDetection(Affiliation(), logging_path=logging_path)
    for iteration in range(n_iterations):
        dd.few_shot(
            n_samples=N_ROWS,
            rows_with_duplicates=N_DUPLICATES,
            multiple_duplicate_chance=CHANCE_MULTIPLE_DUPLICATES,
            experiment_name=f"NoGrammar-{iteration}",
        )
test_duplicateDetection("keep_logs/duplicateDetection_fewshot")

FileExistsError: The directory 'keep_logs/duplicateDetection_fewshot' already exists.

In [None]:
print(Food().generate_examples(0, 3))

## Prompt Table Zero Shot
The following examples show how to initialise the experiment and then prompt Llama for error detection using our dedicated Python ErrorDetection class. The Llama model itself is created when the ErrorDetection object is constructed.
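
The grammar printed when the module is imported (`root_1 ::= [Y] [e] [s] | [N] [o]`) only admits the answers "Yes" and "No". The following minimal sketch shows how such a constrained answer could be turned into a boolean. The helper `parse_answer` is hypothetical, and reading "Yes" as "an error is present" is an assumption based on the default prompt wording.

```python
import re

# Mirrors the printed rule `root_1 ::= [Y] [e] [s] | [N] [o]`: exactly "Yes" or "No".
YES_OR_NO = re.compile(r"Yes|No")


def parse_answer(response: str) -> bool:
    """Map a grammar-constrained answer to True (error reported) or False.

    Hypothetical helper; raises ValueError for anything the grammar would reject.
    """
    if YES_OR_NO.fullmatch(response) is None:
        raise ValueError(f"Not a valid Yes/No answer: {response!r}")
    return response == "Yes"


print(parse_answer("Yes"))
print(parse_answer("No"))
```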

In [None]:
from experiment.errorDetection import ErrorDetection
ed = ErrorDetection(dataset=Flights())

In [None]:
%%capture --no-stdout --no-display

ed.zero_shot(n_samples=2, grammar=ErrorDetection.GRAMMAR_YES_OR_NO)

In [None]:
%%capture --no-stdout --no-display

ed.few_shot(n_samples=2, grammar=ErrorDetection.GRAMMAR_YES_OR_NO)

# Experiments (Setup)

This is how we conduct the time and score performance experiments for error detection on the three main datasets: Flights, Food and Hospital.

In [None]:
from experiment.errorDetection import ErrorDetection

# maximum number of rows that will be evaluated
MAXIMUM_ROW_COUNT = 10 
# print debug messages such as the prompts and responses
DEBUG_MESSAGES = True 

error_detection_flights = ErrorDetection(dataset=Flights())
error_detection_food = ErrorDetection(dataset=Food())
error_detection_hospital = ErrorDetection(dataset=Hospital())

## Flight Test
Let's first compute the F1 score of Llama's responses by zero-shot prompting a random sample of `MAXIMUM_ROW_COUNT` rows from the Flights table.

For perspective: prompting a single row (6 attributes in this case) took around 32.192 seconds in one run and around 5.017 seconds in another, so the runtime varies considerably.
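
To make the reported score concrete, here is a small worked example of the standard F1 definition applied to per-cell error flags. How ErrorDetection computes its score internally is not shown in this notebook, so treat this as a sketch of the metric only. The toy frames and the `predicted` answers are invented for illustration.

```python
import pandas as pd

# Toy data, not taken from the Flights dataset.
dirty = pd.DataFrame({"Country": ["Spain", "Japan"], "City": ["Paris", "Kyoto"]})
clean = pd.DataFrame({"Country": ["Spain", "Japan"], "City": ["Barcelona", "Kyoto"]})

# Ground truth: a cell is erroneous when the dirty value differs from the clean one.
truth = (dirty != clean).to_numpy().ravel().tolist()

# Hypothetical model answers per cell (True = "Yes, there is an error"),
# in the same row-major order as `truth`.
predicted = [False, True, False, True]

tp = sum(p and t for p, t in zip(predicted, truth))      # errors correctly flagged
fp = sum(p and not t for p, t in zip(predicted, truth))  # clean cells wrongly flagged
fn = sum(t and not p for p, t in zip(predicted, truth))  # errors that were missed

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

In this toy case one real error is found and one clean cell is wrongly flagged, so precision is 0.5, recall is 1.0 and F1 is 2/3.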

In [None]:
flights_zero_shot_time, flights_zero_shot_score = error_detection_flights.zero_shot(n_samples=MAXIMUM_ROW_COUNT, grammar=ErrorDetection.GRAMMAR_YES_OR_NO)
print(f"Time: {flights_zero_shot_time}, f1-score: {flights_zero_shot_score}")

Now the same experiment with few-shotting.

In [None]:
flights_few_shot_time, flights_few_shot_score = error_detection_flights.few_shot(n_samples=MAXIMUM_ROW_COUNT, example_count=2, grammar=ErrorDetection.GRAMMAR_YES_OR_NO)
print(f"Time: {flights_few_shot_time}, f1-score: {flights_few_shot_score}")

## Food Test

Next we conduct the same experiment on the Food dataset and we again start with zero-shot. 

For perspective: A single row (16 attributes) took around 137.798 seconds.

In [None]:
food_zero_shot_time, food_zero_shot_score = error_detection_food.zero_shot(n_samples=MAXIMUM_ROW_COUNT, grammar=ErrorDetection.GRAMMAR_YES_OR_NO)
print(f"Time: {food_zero_shot_time}, f1-score: {food_zero_shot_score}")

Next up is few shot on this dataset.

In [None]:
food_few_shot_time, food_few_shot_score = error_detection_food.few_shot(n_samples=MAXIMUM_ROW_COUNT, example_count=2, grammar=ErrorDetection.GRAMMAR_YES_OR_NO)
print(f"Time: {food_few_shot_time}, f1-score: {food_few_shot_score}")

## Hospital Test
Finally, we conduct the exact same experiment again, but this time on the Hospital dataset.

For perspective: A single row (19 attributes) took around 221.656 seconds.
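
The single-row timings quoted in this notebook can be turned into a rough per-attribute cost and a runtime estimate for a full run. The sketch below does this arithmetic. It assumes the runtime scales linearly with the number of rows, which is only an approximation given how much the Flights timings vary.

```python
# Single-row timings quoted in this notebook: (seconds, number of attributes).
row_timings = {
    "Flights": (32.192, 6),   # slower of the two Flights measurements
    "Food": (137.798, 16),
    "Hospital": (221.656, 19),
}

row_count = 10  # same value as MAXIMUM_ROW_COUNT above

estimates = {}
for name, (seconds, attributes) in row_timings.items():
    per_attribute = seconds / attributes
    # Linear extrapolation; an assumption, since timings vary between runs.
    estimated_minutes = seconds * row_count / 60
    estimates[name] = (per_attribute, estimated_minutes)
    print(f"{name}: {per_attribute:.2f} s per attribute, "
          f"about {estimated_minutes:.1f} min for {row_count} rows")
```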

In [None]:
hospital_zero_shot_time, hospital_zero_shot_score = error_detection_hospital.zero_shot(n_samples=MAXIMUM_ROW_COUNT, grammar=ErrorDetection.GRAMMAR_YES_OR_NO)
print(f"Time: {hospital_zero_shot_time}, f1-score: {hospital_zero_shot_score}")

And now few-shot.

In [None]:
hospital_few_shot_time, hospital_few_shot_score = error_detection_hospital.few_shot(n_samples=MAXIMUM_ROW_COUNT, example_count=2, grammar=ErrorDetection.GRAMMAR_YES_OR_NO)
print(f"Time: {hospital_few_shot_time}, f1-score: {hospital_few_shot_score}")

# Custom Datasets

It is also possible to run error detection prompts on custom data. To do so we first need to create the appropriate datasets for the queries and the ground truth as dataframes. The following two code cells highlight one way to achieve this first step.

In [None]:
import pandas as pd

dirty_data = [["England", "Kyoto"], ["USA", "Miami"], ["Spain", "Paris"], ["Japan", "Kyoto"], ["Germany", "Hanover"], ["Netherlands", "Groningen"], ["Turkey", "xyz"], ["South Africa", "Queenstown"], ["USA", "Queens"], ["China", "Dubai"], ["Egypt", "New Cairo"], ["Mali", "Casablanca"]] 
dirty_dataframe = pd.DataFrame(dirty_data, columns=["Country", "City"]) 
dirty_dataframe

In [None]:
clean_data = [["England", "Greenwich"], ["USA", "Miami"], ["Spain", "Barcelona"], ["Japan", "Kyoto"], ["Germany", "Hannover"], ["Netherlands", "Groningen"], ["Turkey", "Istanbul"], ["South Africa", "Queenstown"], ["USA", "Queens"], ["China", "Shanghai"], ["Egypt", "New Cairo"], ["Mali", "Bamako"]] 
clean_dataframe = pd.DataFrame(clean_data, columns=["Country", "City"]) 
clean_dataframe

With the dataframes ready it is time to create the dataset object used to run the experiment.

In [None]:
from experiment.errorDetection import CustomDataSet

custom_data_set = CustomDataSet(dirty_dataframe, clean_dataframe, "Cities")

from experiment.errorDetection import ErrorDetection

error_detection_cities = ErrorDetection(dataset=custom_data_set)

Finally, let's try our custom error detection out! The prompts used, the examples generated and Llama's responses are all logged: the results are written to logs/[timestamp]/experiments-results.csv, and the prompts and responses to logs/[timestamp]/responses/[id].json. Pro tip: format the JSON file (using alt+shift+f) to make it readable.
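
If no editor shortcut is at hand, a logged JSON file can also be pretty-printed from Python with the standard `json` module. The sketch below writes a stand-in file to a temporary directory so that it runs on its own. The keys `prompt` and `response` are invented for this example and may not match the real log schema.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for a file such as logs/[timestamp]/responses/[id].json.
# The keys below are made up for this example; the real log schema may differ.
with tempfile.TemporaryDirectory() as tmp:
    response_file = Path(tmp) / "example.json"
    response_file.write_text('{"prompt": "Is there an error in City?", "response": "Yes"}')

    # Load the compact JSON and re-serialise it with indentation for reading.
    record = json.loads(response_file.read_text())
    pretty = json.dumps(record, indent=2, ensure_ascii=False)

print(pretty)
```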

In [None]:
# zero shot
cities_zero_shot_time, cities_zero_shot_score = error_detection_cities.zero_shot(n_samples=1)
print(f"Time: {cities_zero_shot_time}, f1-score: {cities_zero_shot_score}")

In [None]:
# few shot
cities_few_shot_time, cities_few_shot_score = error_detection_cities.few_shot(n_samples=1, example_count=2)
print(f"Time: {cities_few_shot_time}, f1-score: {cities_few_shot_score}")

# Our Custom Dataset

We also provide our own expanded custom dataset, together with a separate set of custom examples from which the few-shot example strings are generated.

In [1]:
import pandas as pd

clean_data = pd.read_csv("data/error_detection/custom/clean_dataframe.csv")
clean_data

Unnamed: 0,Country,City,Population,ISO 3166-1 Alpha3 Code
0,England,Greenwich,63500,GBR
1,United States of America,Miami,436000,USA
2,Spain,Barcelona,5850000,ESP
3,Japan,Kyoto,1460000,JPN
4,Philippines,Manila,13500000,PHL
5,Germany,Hannover,538000,DEU
6,Netherlands,Groningen,235000,NLD
7,Türkiye,Istanbul,15800000,TUR
8,South Africa,Queenstown,68900,ZAF
9,United States of America,Phoenix,1650000,USA


In [2]:
dirty_data = pd.read_csv("data/error_detection/custom/dirty_dataframe_typos.csv")
dirty_data

Unnamed: 0,Country,City,Population,ISO 3166-1 Alpha3 Code
0,England,Grenwich,63500,GBR
1,United States of America,Miami,436000,USA
2,Spain,Barcellona,5850000,ESP
3,Japan,Kyoto,1460000,JPN
4,Philippines,Manila,13500000,PHL
5,Germany,Hanover,538000,DEU
6,Netherlands,Groningen,235000,NLD
7,Türkiye,Istanbul,15800000,TUR
8,South Africa,Queenstown,68900,ZAF
9,United States of America,Phoenix,1650000,USA


As for the few-shot examples:

In [3]:
examples_clean_data = pd.read_csv("data/error_detection/custom/examples_clean_dataframe.csv")
examples_clean_data

Unnamed: 0,Country,City,Population,ISO 3166-1 Alpha3 Code
0,Vietnam,Hanoi,8050000,VNM
1,New Zealand,Wellington,441000,NZL
2,Saint Lucia,Castries,20000,LCA
3,Uruguay,Montevideo,1950000,URY
4,Australia,Canberra,396000,AUS


In [4]:
examples_dirty_data = pd.read_csv("data/error_detection/custom/examples_dataframe_typos.csv")
examples_dirty_data

Unnamed: 0,Country,City,Population,ISO 3166-1 Alpha3 Code
0,Vietnan,Hanoi,8050000,VNM
1,New Zealand,Wellington,441000,NZL
2,Saint Lucia,Castryes,20000,LCA
3,Uruguay,Montevideo,1950000,URY
4,Australia,Canberra,396000,AUS


So let us start a small experiment!

In [5]:
from experiment.errorDetection import CustomDataSet

custom_data_set = CustomDataSet(dirty_data, clean_data, "Cities")
examples_custom_data_set = CustomDataSet(examples_dirty_data, examples_clean_data, "Examples for Cities")

from experiment.errorDetection import ErrorDetection

error_detection_expanded_cities = ErrorDetection(dataset=custom_data_set)

from_string grammar:



root ::= root_1 
root_1 ::= [Y] [e] [s] | [N] [o] 


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [7]:
import warnings
warnings.filterwarnings('ignore')

# few shot
expanded_cities_few_shot_time, expanded_cities_few_shot_score = error_detection_expanded_cities.few_shot(
    n_samples=20, custom_examples_dataset=examples_custom_data_set)
print(f"Time: {expanded_cities_few_shot_time}, f1-score: {expanded_cities_few_shot_score}")

ErrorDetection (INFO):	Started few shot for 20 rows with 2 examples


IntProgress(value=0, description='Attributes Prompted ', max=80)

ErrorDetection (INFO):	Finished  in 00:02:21


Time: 141.85657858848572, f1-score: 0.16438356164383564


In [6]:
import warnings
warnings.filterwarnings('ignore')

# zero shot
expanded_cities_zero_shot_time, expanded_cities_zero_shot_score = error_detection_expanded_cities.zero_shot(
    n_samples=20)
print(f"Time: {expanded_cities_zero_shot_time}, f1-score: {expanded_cities_zero_shot_score}")

ErrorDetection (INFO):	Started zero shot for 20 rows


IntProgress(value=0, description='Attributes Prompted ', max=80)

ErrorDetection (INFO):	Finished  in 00:06:42


Time: 402.8859815597534, f1-score: 0.1904761904761905


# Custom Prompt

You can also create custom prompts to test out on any datasets that you want. Make sure you include the following placeholders for each inference type:

1. **attr** : The attribute you want to prompt
2. **context** : The whole row
3. **example** (only for Few-Shot) : The example rows for the model

**!! Due to formatting specifications on our side, the placeholders must follow the names given exactly !!**

## Defaults
Zero-Shot: "Is there an error in {attr}?\n{context}?" \
Few-Shot : "Is there an error in {attr}?\n\n{example}\n\n{context}?"
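
Since the placeholder names must match exactly, it can help to check a custom template before running an experiment. The sketch below does so with `string.Formatter` from the standard library. It assumes the templates are filled in `str.format` style, which the brace syntax suggests but this notebook does not confirm. The helper names `placeholders` and `check_template` are hypothetical.

```python
from string import Formatter

REQUIRED_ZERO_SHOT = {"attr", "context"}
REQUIRED_FEW_SHOT = {"attr", "context", "example"}


def placeholders(template: str) -> set:
    """Return the set of {name} fields that appear in a format-style template."""
    return {field for _, field, _, _ in Formatter().parse(template) if field}


def check_template(template: str, required: set) -> None:
    """Raise ValueError if the template's placeholders differ from `required`."""
    found = placeholders(template)
    if found != required:
        raise ValueError(f"expected {sorted(required)}, found {sorted(found)}")


zero_shot_default = "Is there an error in {attr}?\n{context}?"
check_template(zero_shot_default, REQUIRED_ZERO_SHOT)

# Filling the template with str.format; the values are toy data.
print(zero_shot_default.format(attr="City", context="Country: Spain, City: Paris"))
```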

In [None]:
custom_prompt_zero_shot = "First take a deep breath. Is there an error in {attr}?\n{context}?"
custom_prompt_few_shot = "First take a deep breath. Is there an error in {attr}?\n\n{example}\n\n{context}"

# zero shot
cities_zero_shot_time, cities_zero_shot_score = error_detection_cities.zero_shot(n_samples=1, prompt_template=custom_prompt_zero_shot)
# few shot
cities_few_shot_time, cities_few_shot_score = error_detection_cities.few_shot(n_samples=1, example_count=1, prompt_template=custom_prompt_few_shot)

In [None]:
# zero shot
print(f'Time: %s, f1-score: %s'%(cities_zero_shot_time, cities_zero_shot_score))
# few shot
print(f'Time: %s, f1-score: %s'%(cities_few_shot_time, cities_few_shot_score))

# Evaluation


To evaluate the results, we ran each inference type multiple times to obtain a stable set of data points. You can change the parameters if you want:

1. **ITERATION_AMOUNT** : How many times the experiment is repeated
2. **MAXIMUM_ROW_COUNT** : The number of rows sampled for each experiment

In [None]:
ITERATION_AMOUNT = 10
MAXIMUM_ROW_COUNT = 2

# Collect one row per run and build the DataFrame once at the end.
rows = []
for i in range(ITERATION_AMOUNT):
    runtime_zeroshot, f1_zeroshot = error_detection_cities.zero_shot(
        prompt_template=custom_prompt_zero_shot,
        n_samples=MAXIMUM_ROW_COUNT,
        log_id=f"_p1_{i}",
    )

    runtime_fewshot, f1_fewshot = error_detection_cities.few_shot(
        prompt_template=custom_prompt_few_shot,
        n_samples=MAXIMUM_ROW_COUNT,
        log_id=f"_p1_{i}",
    )

    rows.append(["Cities", "ZS", runtime_zeroshot, f1_zeroshot])
    rows.append(["Cities", "FS", runtime_fewshot, f1_fewshot])

results = pd.DataFrame(rows, columns=["Dataset", "Type", "Time", "F1-Score"])

## Graphs

We then present the results as box plots. The first graph shows the time required for each experiment (in minutes) and the second shows the F1 score achieved.
Both graphs are interactive (e.g. you can zoom in).
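
A numeric summary can complement the box plots. The sketch below aggregates a results table of the same shape as the one built in the evaluation loop with `groupby`. The values are placeholders for illustration only, not measured results, and the frame is named `example_results` so that it does not overwrite the real `results`.

```python
import pandas as pd

# Placeholder values for illustration only; these are not measured results.
example_results = pd.DataFrame(
    [
        ["Cities", "ZS", 60.0, 0.2],
        ["Cities", "FS", 30.0, 0.4],
        ["Cities", "ZS", 80.0, 0.4],
        ["Cities", "FS", 50.0, 0.2],
    ],
    columns=["Dataset", "Type", "Time", "F1-Score"],
)

# Mean and median per inference type; the median is the centre line of a box plot.
summary = example_results.groupby("Type")[["Time", "F1-Score"]].agg(["mean", "median"])
print(summary)
```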

In [None]:
import pandas as pd
import plotly.express as px

flightData = pd.read_csv('./analysis/data/flight_10_5.csv')
flightData = flightData.sort_values(['Dataset', 'Type'], ascending=[True, True])
# Convert the runtime from seconds to minutes
flightData.Time = flightData.Time / 60

# Box plot of the runtime per inference type
fig = px.box(flightData, y="Time", color="Type")
fig.show()
# Box plot of the F1 score per inference type
fig = px.box(flightData, y="F1-Score", color="Type")
fig.show()