# Setup

The llama setup was moved to __experiment/setupExperiment.py__ inside the __init__() function.

## Load Data

Loading and handling of the data sets is done by the external DataSet class. It provides access to all datasets via its subclasses, such as the Flights class. Each subclass returns its data as a DataFrame via the .get(Boolean) function; the Boolean controls whether the dirty or the clean set is returned.

Each DataSet Subclass also provides the following functionalities:
* random_sample(amount) generates two DataFrames, a dirty set and a clean set, of the given size with random (non-duplicate) sample rows.
* generate_examples(column_id, amount) generates an example string of 'amount' random sample rows. Whether or not there is an error in the column of the given id is appended as well. 
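
To make the example format concrete, here is a minimal sketch of how such an example string could be assembled. It is not the DataSet implementation: `format_example` and `format_examples` are hypothetical helpers, rows are plain dicts instead of DataFrame rows, and it assumes that a cell counts as an error whenever the dirty and clean values differ.

```python
from typing import Dict, List


def format_example(dirty_row: Dict[str, object],
                   clean_row: Dict[str, object],
                   target: str) -> str:
    """Render one row as 'col: value, ..., target: value? Yes/No'."""
    # Put the target column last so the label refers to the final attribute.
    columns = [c for c in dirty_row if c != target] + [target]
    attributes = ", ".join(f"{c}: {dirty_row[c]}" for c in columns)
    # Assumption: the cell is erroneous when dirty and clean values differ.
    label = "Yes" if dirty_row[target] != clean_row[target] else "No"
    return f"{attributes}? {label}"


def format_examples(dirty_rows: List[Dict[str, object]],
                    clean_rows: List[Dict[str, object]],
                    target: str) -> str:
    """Join one formatted example per row, separated by newlines."""
    return "\n".join(
        format_example(d, c, target) for d, c in zip(dirty_rows, clean_rows)
    )


dirty_rows = [
    {"Country": "England", "City": "Kyoto"},
    {"Country": "USA", "City": "Miami"},
]
clean_rows = [
    {"Country": "England", "City": "Greenwich"},
    {"Country": "USA", "City": "Miami"},
]
examples = format_examples(dirty_rows, clean_rows, "City")
print(examples)
```

The sketch only mirrors the shape of the strings that generate_examples prints; the real class may order columns and derive labels differently.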


In [9]:
from DataSet import Flights, Food, Hospital
Flights().get(True)

Unnamed: 0,src,flight,sched_dep_time,act_dep_time,sched_arr_time,act_arr_time
0,aa,AA-3859-IAH-ORD,7:10 a.m.,7:16 a.m.,9:40 a.m.,9:32 a.m.
1,aa,AA-1733-ORD-PHX,7:45 p.m.,7:58 p.m.,10:30 p.m.,
2,aa,AA-1640-MIA-MCO,6:30 p.m.,,7:25 p.m.,
3,aa,AA-518-MIA-JFK,6:40 a.m.,6:54 a.m.,9:25 a.m.,9:28 a.m.
4,aa,AA-3756-ORD-SLC,12:15 p.m.,12:41 p.m.,2:45 p.m.,2:50 p.m.
...,...,...,...,...,...,...
2371,world-flight-tracker,UA-3099-PHX-PHL,11:55 a.m.,11:43 a.m.,6:17 p.m.,5:38 p.m.
2372,world-flight-tracker,AA-4198-ORD-CLE,10:40 a.m.,10:54 a.m.,12:55 p.m.,12:50 p.m.
2373,world-flight-tracker,CO-45-EWR-MIA,4:00 p.m.,3:58 p.m.,7:05 p.m.,6:36 p.m.
2374,world-flight-tracker,AA-3809-PHX-LAX,6:00 a.m.,6:10 a.m.,6:40 a.m.,6:19 a.m.


In [10]:
Hospital().random_sample(5)[0]

Unnamed: 0,provider number,hospital name,address1,address2,address3,city,state,zip code,county name,phone number,hospital type,hospital owner,emergency service,condition,measure code,measure name,score,sample,stateavg
35,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,,,DOTHAN,AL,36302,HOUSTON,3347938701,Acute Care Hospitals,Government - Hospital District or Authority,Yes,Pneumonia,PN-7,Pneumonia Patients Assessed and Given Influenz...,84%,160 patients,AL_PN-7
991,10050,ST VINCENTS BLOUNT,150 GILBREATH DRIVE,,,ONEONTA,AL,35121,BLOUNT,2052743000,Acute Care Hospitals,Voluntary non-profit - Private,Yes,Pneumonia,PN-2,Pneumonia Patients Assessed and Given Pneumoco...,74%,92 patients,AL_PN-2
249,10015,SOUTHWEST ALABAMA MEDICAL CENTER,33700 HIGHWAY 43,,,THOMASVILLE,AL,36784,CLARKE,3346366221,Acute Care Hospitals,Government - Federal,Yes,Heart Attack,AMI-7A,Heart Attack Patients Given Fibrinolytic Medic...,,0 patients,AL_AMI-7A
669,10033,UNIVERSITY OF ALABAMA HOSPITAL,619 SOUTH 19TH STREET,,,BIRMINGHAM,AL,35233,JEFFERSON,2059344011,Acute Care Hospitals,Government - State,Yes,Heart Attack,AMI-3,Heart Attack Patients Given ACE Inhibitor or A...,91%,87 paxienxs,AL_AMI-3
194,10010,MARSHALL MEDICAL CENTER NORTH,8000 ALABAMA HIGHWAY 69,,,GUNTERSVILLE,AL,35976,MARSHALL,2565718000,Acute Care Hospitals,Government - Hospital District or Authority,Yes,Surgical Infection Prevention,SCIP-VTE-2,patients who got treatment at the right time ...,84%,56 patients,AL_SCIx-VTE-2


In [11]:
print(Food().generate_examples(0, 3))

akaname: VOLARE, inspectionid: 569451, city: CHICAGO, state: IL, results: Pass, longitude: -87.62260381, latitude: 41.89165221, inspectiondate: 20110714, risk: Risk 1 (High), location: (41.89165221441017, -87.62260381408842), license: 57820.0, facilitytype: Restaurant, address: 201 E GRAND AVE , inspectiontype: Canvass, dbaname: VOLARE, zip: 60611.0? No
akaname: MR. PHILLY, inspectionid: 54254, city: CHICAGO, state: IL, results: Pass, longitude: -87.75783481, latitude: 41.89502331, inspectiondate: 20100302, risk: Risk 2 (Medium), location: (41.89502331496054, -87.75783480830066), license: 1649619.0, facilitytype: Restaurant, address: 5254 W CHICAGO AVE , inspectiontype: Complaint Re-Inspection, dbaname: MR. PHILLY, zip: 60651.0? No
akaname: Earle, inspectionid: 543350, city: CHICAGO, state: IL, results: Pass, longitude: -87.66777712, latitude: 41.78230211, inspectiondate: 20110504, risk: Risk 1 (High), location: (41.782302105880994, -87.66777711762356), license: 23031.0, facilitytype: 

## Prompt Table Zero Shot
The following examples show how to first initialise the experiment and then how to prompt llama for error detection using our external, dedicated Python ErrorDetection class. The whole creation of the llama model happens inside the construction of the ErrorDetectionModel.

In [None]:
from experiment.errorDetection import ErrorDetection
ed = ErrorDetection(dataset=Flights())

In [13]:
%%capture --no-stdout --no-display

ed.zero_shot(n_samples=2)

ErrorDetection (INFO):	Started zero shot for 2 rows


IntProgress(value=0, description='Attributes Prompted Error Detection Zero Shot', max=12)

ErrorDetection (INFO):	Finished Error Detection Zero Shot in 00:00:05


(5.0234503746032715, 0.0)

In [14]:
%%capture --no-stdout --no-display

ed.few_shot(n_samples=2)

ErrorDetection (INFO):	Started few shot for 2 rows with 2 examples


IntProgress(value=0, description='Attributes Prompted Error Detection Few Shot', max=12)

ErrorDetection (INFO):	Finished Error Detection Few Shot in 00:01:32


(92.51563501358032, 0.2222222222222222)

# Experiments (Setup)

This is how we conduct the time and score performance experiments for error detection on the three main datasets: Flights, Food and Hospital.

In [None]:
from typing import Tuple
import numpy as np
import pandas as pd
from ipywidgets import IntProgress
from IPython.display import display
import time

from DataSet import Flights, Food, Hospital
from experiment.errorDetection import ErrorDetection

# maximum number of rows that will be evaluated
MAXIMUM_ROW_COUNT = 2 
# print debug messages such as the prompts and responses
DEBUG_MESSAGES = True 

error_detection_flights = ErrorDetection(dataset=Flights())
error_detection_food = ErrorDetection(dataset=Food())
error_detection_hospital = ErrorDetection(dataset=Hospital())

## Flight Test
Let's first compute the F1 score of the Llama responses by zero-shotting a random sample of `MAXIMUM_ROW_COUNT` rows from the Flights table.

For perspective: prompting a single row (6 attributes in this case) took around 32.192 seconds in one run and around 5.017 seconds in another, so it varies quite a bit.
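
As a reminder of what the reported score measures, the following is a minimal sketch of the standard F1 computation over per-cell error flags. How ErrorDetection computes its score internally is not shown in this notebook, so the function name `f1_score` and the boolean-list inputs are assumptions made for illustration.

```python
from typing import Sequence


def f1_score(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Standard F1 over boolean flags, where True means 'cell is erroneous'."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    true_pos = sum(p and a for p, a in zip(predicted, actual))
    false_pos = sum(p and not a for p, a in zip(predicted, actual))
    false_neg = sum(a and not p for p, a in zip(predicted, actual))
    # Without any true positive, precision and recall are zero or undefined;
    # report 0.0 in that case instead of dividing by zero.
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Six cells of one hypothetical row: two real errors, one found, one false alarm.
actual = [False, True, False, False, True, False]
predicted = [False, True, True, False, False, False]
score = f1_score(predicted, actual)
print(score)
```

Under this definition a score of 0.0 means that not a single real error was flagged, which is easy to hit when only a couple of rows are sampled.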

In [8]:
flights_zero_shot_time, flights_zero_shot_score = error_detection_flights.zero_shot(n_samples=MAXIMUM_ROW_COUNT)
print(f'Time: {flights_zero_shot_time}, f1-score: {flights_zero_shot_score}')

ErrorDetection (INFO):	Started zero shot for 2 rows


IntProgress(value=0, description='Attributes Prompted Error Detection Zero Shot', max=12)

ErrorDetection (INFO):	Finished Error Detection Zero Shot in 00:00:05


Time: 5.443749666213989, f1-score: 0.0


Now the same experiment with few-shotting.

In [9]:
flights_few_shot_time, flights_few_shot_score = error_detection_flights.few_shot(n_samples=MAXIMUM_ROW_COUNT, example_count=2)
print(f'Time: {flights_few_shot_time}, f1-score: {flights_few_shot_score}')

ErrorDetection (INFO):	Started few shot for 2 rows with 2 examples


IntProgress(value=0, description='Attributes Prompted Error Detection Few Shot', max=12)

ErrorDetection (INFO):	Finished Error Detection Few Shot in 00:02:06


Time: 126.298668384552, f1-score: 0.6666666666666666


## Food Test

Next we conduct the same experiment on the Food dataset and we again start with zero-shot. 

For perspective: A single row (16 attributes) took around 137.798 seconds.

In [10]:
food_zero_shot_time, food_zero_shot_score = error_detection_food.zero_shot(n_samples=MAXIMUM_ROW_COUNT)
print(f'Time: {food_zero_shot_time}, f1-score: {food_zero_shot_score}')

ErrorDetection (INFO):	Started zero shot for 2 rows


IntProgress(value=0, description='Attributes Prompted Error Detection Zero Shot', max=32)

ErrorDetection (INFO):	Finished Error Detection Zero Shot in 00:05:13


Time: 313.8261339664459, f1-score: 0.0


Next up is few shot on this dataset.

In [11]:
food_few_shot_time, food_few_shot_score = error_detection_food.few_shot(n_samples=MAXIMUM_ROW_COUNT, example_count=2)
print(f'Time: {food_few_shot_time}, f1-score: {food_few_shot_score}')

ErrorDetection (INFO):	Started few shot for 2 rows with 2 examples


IntProgress(value=0, description='Attributes Prompted Error Detection Few Shot', max=32)

ErrorDetection (INFO):	Finished Error Detection Few Shot in 00:08:09


Time: 489.44751358032227, f1-score: 0.0


## Hospital Test
Finally, we conduct the exact same experiment again, but this time on the Hospital dataset.

For perspective: A single row (19 attributes) took around 221.656 seconds.

In [12]:
hospital_zero_shot_time, hospital_zero_shot_score = error_detection_hospital.zero_shot(n_samples=MAXIMUM_ROW_COUNT)
print(f'Time: {hospital_zero_shot_time}, f1-score: {hospital_zero_shot_score}')

ErrorDetection (INFO):	Started zero shot for 2 rows


IntProgress(value=0, description='Attributes Prompted Error Detection Zero Shot', max=38)

ErrorDetection (INFO):	Finished Error Detection Zero Shot in 00:06:19


Time: 379.8894290924072, f1-score: 0.13333333333333336


And now few-shot.

In [13]:
hospital_few_shot_time, hospital_few_shot_score = error_detection_hospital.few_shot(n_samples=MAXIMUM_ROW_COUNT, example_count=2)
print(f'Time: {hospital_few_shot_time}, f1-score: {hospital_few_shot_score}')

ErrorDetection (INFO):	Started few shot for 2 rows with 2 examples


IntProgress(value=0, description='Attributes Prompted Error Detection Few Shot', max=38)

ErrorDetection (INFO):	Finished Error Detection Few Shot in 00:09:32


Time: 572.1538238525391, f1-score: 0.37499999999999994


# Custom Datasets

It is also possible to run error detection prompts on custom data. To do so, we first need to create the appropriate datasets for the queries and the ground truth as DataFrames. The following two code cells show one way to achieve this first step.

In [4]:
import pandas as pd

dirty_data = [["England", "Kyoto"], ["USA", "Miami"], ["Spain", "Paris"]] 
dirty_dataframe = pd.DataFrame(dirty_data, columns=["Country", "City"]) 
dirty_dataframe

Unnamed: 0,Country,City
0,England,Kyoto
1,USA,Miami
2,Spain,Paris


In [5]:
clean_data = [["England", "Greenwich"], ["USA", "Miami"], ["Spain", "Barcelona"]] 
clean_dataframe = pd.DataFrame(clean_data, columns=["Country", "City"]) 
clean_dataframe

Unnamed: 0,Country,City
0,England,Greenwich
1,USA,Miami
2,Spain,Barcelona
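
The ground truth implied by these two tables can be made explicit by comparing them cell by cell. The sketch below does this on the plain lists rather than on the DataFrames, and it assumes that an error is simply a cell whose dirty value differs from its clean value; `error_mask` is a name introduced here for illustration only.

```python
# Same toy data as in the two cells above.
dirty_data = [["England", "Kyoto"], ["USA", "Miami"], ["Spain", "Paris"]]
clean_data = [["England", "Greenwich"], ["USA", "Miami"], ["Spain", "Barcelona"]]

# True marks a cell whose dirty value differs from the clean value.
error_mask = [
    [dirty_cell != clean_cell for dirty_cell, clean_cell in zip(dirty_row, clean_row)]
    for dirty_row, clean_row in zip(dirty_data, clean_data)
]

for row, flags in zip(dirty_data, error_mask):
    print(row, flags)
```

Here only the City values of the first and third rows are flagged, so these are the cells a perfect detector would report.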


With the dataframes ready it is time to create the dataset object used to run the experiment.

In [None]:
from DataSet import CustomDataSet

custom_data_set = CustomDataSet(dirty_dataframe, clean_dataframe, "Cities")

from experiment.errorDetection import ErrorDetection

error_detection_cities = ErrorDetection(dataset=custom_data_set)

Finally, let's try out our custom error detection! The prompts used, the examples generated and llama's responses are all logged. The results can be viewed in logs/[timestamp]/promt-results.csv, the prompts and responses in logs/[timestamp]/responses/[id].json. Pro tip: format the JSON file (using alt+shift+f) to make it readable.
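
If no editor shortcut is at hand, a response file can also be formatted with the standard library. This is a minimal sketch: `pretty_print_json` is a hypothetical helper, and the keys in the sample file are placeholders, since the real layout of the logged JSON is not shown here.

```python
import json
import tempfile
from pathlib import Path


def pretty_print_json(path: Path) -> str:
    """Load a JSON file and return it as an indented, readable string."""
    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    return json.dumps(data, indent=2, ensure_ascii=False)


# Demo on a throwaway file; the keys below are placeholders, not the real schema.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "0.json"
    sample.write_text(
        '{"prompt": "Country: England, City: Kyoto?", "response": "Yes"}',
        encoding="utf-8",
    )
    formatted = pretty_print_json(sample)

print(formatted)
```

To use it on a real run, pass the path of one of the files under logs/[timestamp]/responses/ instead of the temporary sample.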

In [7]:
# zero shot
cities_zero_shot_time, cities_zero_shot_score = error_detection_cities.zero_shot(n_samples=1)
print(f'Time: {cities_zero_shot_time}, f1-score: {cities_zero_shot_score}')


ErrorDetection (INFO):	Started zero shot for 1 rows


IntProgress(value=0, description='Attributes Prompted Error Detection Zero Shot', max=2)

ErrorDetection (INFO):	Finished Error Detection Zero Shot in 00:00:01


Time: 1.8703770637512207, f1-score: 0.0


In [8]:
# few shot
cities_few_shot_time, cities_few_shot_score = error_detection_cities.few_shot(n_samples=1, example_count=1)
print(f'Time: {cities_few_shot_time}, f1-score: {cities_few_shot_score}')

ErrorDetection (INFO):	Started few shot for 1 rows with 1 examples


IntProgress(value=0, description='Attributes Prompted Error Detection Few Shot', max=2)

ErrorDetection (INFO):	Finished Error Detection Few Shot in 00:00:00


Time: 0.3987538814544678, f1-score: 0.0


# TODO:

- retrieve the errors from the answer in an automatic way
- compute recall/f1 etc. from clean data
- test different prompts
- test different hyperparameters (?)
- automate experiments for different parameters
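
For the first item, one possible starting point is to look for a standalone "yes" or "no" in the model's answer. This is only a sketch under the assumption that the prompts ask a yes/no question per attribute, as the generated examples suggest; `parse_error_flag` is a hypothetical helper and not part of the ErrorDetection class.

```python
import re
from typing import Optional

# Match 'yes' or 'no' as whole words, ignoring case.
_ANSWER = re.compile(r"\b(yes|no)\b", re.IGNORECASE)


def parse_error_flag(response: str) -> Optional[bool]:
    """Return True for 'yes', False for 'no', None if neither word appears."""
    match = _ANSWER.search(response)
    if match is None:
        return None
    return match.group(1).lower() == "yes"


print(parse_error_flag("Yes, the city does not match the country."))
print(parse_error_flag("No."))
print(parse_error_flag("I am not sure."))
```

Returning None for unparseable answers keeps them separate from real "no" answers, so they can be counted or re-prompted instead of silently lowering the score.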

