# Setup

The Llama setup has been moved to __experiment/setupExperiment.py__, inside the `__init__()` function.

## Load Data

Loading and handling of the datasets is done by the external DataSet class. It exposes each dataset through a subclass, such as the Flights class. Each subclass returns the data as a DataFrame via the .get(Boolean) function, where the Boolean selects whether the dirty or the clean set is returned.

Each DataSet Subclass also provides the following functionalities:
* random_sample(amount) generates two DataFrames, a dirty set and a clean set, of the given size with random (non-duplicate) sample rows.
* generate_examples(column_id, amount) generates an example string of 'amount' random sample rows. Whether or not there is an error in the column of the given id is appended as well. 
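
The interface described above can be pictured with a small stand-in. The following is a minimal sketch and not the project's DataSet code: the class name `ToyDataSet`, the `seed` parameter, and the choice that `get(True)` returns the dirty set are assumptions made purely for illustration.

```python
import pandas as pd


class ToyDataSet:
    """Hypothetical stand-in for a DataSet subclass (illustration only)."""

    def __init__(self, dirty: pd.DataFrame, clean: pd.DataFrame):
        # Both frames are assumed to share the same shape and row order.
        self.dirty = dirty.reset_index(drop=True)
        self.clean = clean.reset_index(drop=True)

    def get(self, dirty: bool) -> pd.DataFrame:
        # Assumption: True selects the dirty set, False the clean set.
        return self.dirty if dirty else self.clean

    def random_sample(self, amount: int, seed=None):
        # Sample without replacement, then pick the same rows from both sets.
        rows = self.dirty.sample(n=amount, random_state=seed).index
        return self.dirty.loc[rows], self.clean.loc[rows]

    def generate_examples(self, column_id: int, amount: int, seed=None) -> str:
        dirty, clean = self.random_sample(amount, seed)
        column = dirty.columns[column_id]
        lines = []
        for idx in dirty.index:
            context = ", ".join(f"{k}: {v}" for k, v in dirty.loc[idx].items())
            # A cell counts as erroneous when it differs from the clean value.
            # (Missing values would need extra care, since NaN != NaN.)
            has_error = dirty.at[idx, column] != clean.at[idx, column]
            lines.append(f"{context}? {'Yes' if has_error else 'No'}")
        return "\n".join(lines)


columns = ["Country", "City"]
toy = ToyDataSet(
    pd.DataFrame([["England", "Kyoto"], ["USA", "Miami"], ["Spain", "Paris"]], columns=columns),
    pd.DataFrame([["England", "Greenwich"], ["USA", "Miami"], ["Spain", "Barcelona"]], columns=columns),
)
sample_dirty, sample_clean = toy.random_sample(2, seed=0)
examples = toy.generate_examples(column_id=1, amount=3, seed=0)
print(examples)
```

Sampling the row index once and reusing it for both frames is what keeps the dirty and clean samples aligned row by row.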


In [4]:
from DataSet import Flights, Food, Hospital
Flights().get(True)

| | src | flight | sched_dep_time | act_dep_time | sched_arr_time | act_arr_time |
|---|---|---|---|---|---|---|
| 0 | aa | AA-3859-IAH-ORD | 7:10 a.m. | 7:16 a.m. | 9:40 a.m. | 9:32 a.m. |
| 1 | aa | AA-1733-ORD-PHX | 7:45 p.m. | 7:58 p.m. | 10:30 p.m. | |
| 2 | aa | AA-1640-MIA-MCO | 6:30 p.m. | | 7:25 p.m. | |
| 3 | aa | AA-518-MIA-JFK | 6:40 a.m. | 6:54 a.m. | 9:25 a.m. | 9:28 a.m. |
| 4 | aa | AA-3756-ORD-SLC | 12:15 p.m. | 12:41 p.m. | 2:45 p.m. | 2:50 p.m. |
| ... | ... | ... | ... | ... | ... | ... |
| 2371 | world-flight-tracker | UA-3099-PHX-PHL | 11:55 a.m. | 11:43 a.m. | 6:17 p.m. | 5:38 p.m. |
| 2372 | world-flight-tracker | AA-4198-ORD-CLE | 10:40 a.m. | 10:54 a.m. | 12:55 p.m. | 12:50 p.m. |
| 2373 | world-flight-tracker | CO-45-EWR-MIA | 4:00 p.m. | 3:58 p.m. | 7:05 p.m. | 6:36 p.m. |
| 2374 | world-flight-tracker | AA-3809-PHX-LAX | 6:00 a.m. | 6:10 a.m. | 6:40 a.m. | 6:19 a.m. |


In [2]:
Hospital().random_sample(5)[0]

| | provider number | hospital name | address1 | address2 | address3 | city | state | zip code | county name | phone number | hospital type | hospital owner | emergency service | condition | measure code | measure name | score | sample | stateavg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 188 | 10010 | MARSHALL MEDICAL CENTER NORTH | 8000 ALABAMA HIGHWAY 69 | | | GUNTERSVILLE | AL | 35976 | MARSHALL | 2565718000 | Acute Care Hospitals | Government - Hospital District or Authority | Yes | Surgical Infection Prevention | SCIP-INF-1 | surgery patients who were given an antibiotic ... | 95x | 66 patients | AL_SCIP-INF-1 |
| 879 | 1xx45 | FAYETTE MEDICAL CENTER | 1653 TEMPLE AVENUE NORTH | | | FAYETTE | AL | 35555 | FAYETTE | 2059325966 | Acute Care Hospitals | Voluntary non-profit - Other | Yes | Heart Attack | AMI-2 | Heart Attack Patients Given Aspirin at Discharge | 100% | 12 patients | AL_AMI-2 |
| 883 | 10045 | FAYETTE MEDICAL CENTER | 1653 TEMPLE AVENUE NORTH | | | FAYETTE | AL | 35555 | FAYETTE | 2059325966 | Acute Care Hospitals | Voluntary non-profit - Other | Yes | Heart Attack | AMI-7A | Heart Attack Patients Given Fibrinolytic Medic... | | 0 patients | AL_AMI-7A |
| 502 | 10022 | CHEROKEE MEDICAL CENTER | 400 NORTHWOOD DR | | | CENTRE | AL | 35960 | CHxROKxx | 2569275531 | Acute Care Hospitals | Voluntary non-profit - Private | Yes | Pneumonia | PN-2 | Pneumonia Patients Assessed and Given Pneumoco... | 93% | 44 paxienxs | AL_PN-2 |
| 362 | 10086 | NORTHWEST MEDICAL CENTER | 1530 U S HIGHWAY 43 | | | WINFIELD | AL | 35594 | MARION | 2054877736 | Acute Care Hospitals | Proprietary | Yes | Heart Failure | HF-2 | Heart Failure Patients Given an Evaluation of ... | 100% | 59 patients | AL_HF-2 |


In [3]:
print(Food().generate_examples(0, 3))

akaname: CASA YARI, inspectionid: 1566833, city: CHICAGO, state: IL, results: Pass, longitude: -87.71012068, latitude: 41.92479632, inspectiondate: 20150827, risk: Risk 1 (High), location: (41.92479632473597, -87.71012067917576), license: 2271391.0, facilitytype: Restaurant, address: 3268 W FULLERTON AVE , inspectiontype: Canvass, dbaname: CASA YARI, zip: 60647.0? No
akaname: nan, inspectionid: 1447907, city: CHICAGO, state: IL, results: Pass, longitude: -87.62040431, latitude: 41.89408538, inspectiondate: 20150721, risk: Risk 3 (Low), location: (41.894085380663164, -87.62040431002175), license: 2397605.0, facilitytype: Restaurant, address: 259 E ERIE ST , inspectiontype: License, dbaname: GREEN RIVER CHICAGO, zip: 60611.0? Yes
akaname: CARLTON AT THE LAKE, INC, inspectionid: 1947037, city: CHICAGO, state: IL, results: Pass, longitude: -87.64854372, latitude: 41.9617548, inspectiondate: 20160722, risk: Risk 1 (High), location: (41.961754803527974, -87.64854372497662), license: 2204338.

## Prompt Table Zero Shot
The following examples show how to initialise the experiment and then prompt Llama for error detection using our dedicated external Python class, ErrorDetection. The Llama model itself is created when the ErrorDetection object is constructed.

In [5]:
from experiment.errorDetection import ErrorDetection
ed = ErrorDetection(dataset=Flights())

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 8 CUDA devices:
  Device 0: NVIDIA A100-SXM4-40GB, compute capability 8.0
  Device 1: NVIDIA A100-SXM4-40GB, compute capability 8.0
  Device 2: NVIDIA A100-SXM4-40GB, compute capability 8.0
  Device 3: NVIDIA A100-SXM4-40GB, compute capability 8.0
  Device 4: NVIDIA A100-SXM4-40GB, compute capability 8.0
  Device 5: NVIDIA A100-SXM4-40GB, compute capability 8.0
  Device 6: NVIDIA A100-SXM4-40GB, compute capability 8.0
  Device 7: NVIDIA A100-SXM4-40GB, compute capability 8.0
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /home/group-1-23/.cache/huggingface/hub/models--TheBloke--Llama-2-13B-chat-GGUF/snapshots/4458acc949de0a9914c3eab623904d4fe999050a/llama-2-13b-chat.Q5_K_M.gguf (version GGUF V2)
llama_model_loader: - tensor    0:                token_embd.weight q5_K     [  5120, 32000,     1,     1 ]
llama_model_loader: - tensor    1:     

In [5]:
%%capture --no-stdout --no-display

ed.zero_shot(n_samples=2)

ErrorDetection (INFO):	Started zero shot for 2 rows


IntProgress(value=0, description='Attributes Prompted Error Detection Zero Shot', max=12)

ErrorDetection (INFO):	Finished Error Detection Zero Shot in 00:00:49


(49.63357424736023, 0.2222222222222222)

In [6]:
%%capture --no-stdout --no-display

ed.few_shot(n_samples=2)

ErrorDetection (INFO):	Started few shot for 2 rows with 2 examples


IntProgress(value=0, description='Attributes Prompted Error Detection Few Shot', max=12)

ErrorDetection (INFO):	Finished Error Detection Few Shot in 00:01:55


(115.48432683944702, 0.8571428571428571)

# Experiments (Setup)

This is how we conduct the time and score performance experiments for error detection on the three main datasets: Flights, Food and Hospital.

In [6]:
from typing import Tuple
import numpy as np
import pandas as pd
from ipywidgets import IntProgress
from IPython.display import display
import time

from experiment.errorDetection import ErrorDetection

# maximum number of rows that will be evaluated
MAXIMUM_ROW_COUNT = 2 
# print debug messages such as the prompts and responses
DEBUG_MESSAGES = True 

error_detection_flights = ErrorDetection(dataset=Flights())
error_detection_food = ErrorDetection(dataset=Food())
error_detection_hospital = ErrorDetection(dataset=Hospital())

llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /home/group-1-23/.cache/huggingface/hub/models--TheBloke--Llama-2-13B-chat-GGUF/snapshots/4458acc949de0a9914c3eab623904d4fe999050a/llama-2-13b-chat.Q5_K_M.gguf (version GGUF V2)
llama_model_loader: - tensor    0:                token_embd.weight q5_K     [  5120, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q5_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q5_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q5_K     [  5120,  5120,

## Flight Test
Let's first compute the F1 score of Llama's responses by zero-shotting a random sample of `MAXIMUM_ROW_COUNT` rows from the Flights table.

For perspective: Prompting a single row (6 attributes in this case) took around 32.192 seconds in one case and around 5.017 in another. It varies quite a bit.
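
How ErrorDetection derives its F1 score is not shown in this notebook. As a rough orientation, the sketch below uses the standard definition with "cell contains an error" as the positive class. The function name `f1_score` and the convention of returning 0.0 when there are no true positives are assumptions for illustration.

```python
from typing import Sequence


def f1_score(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """F1 over per-attribute error flags (True = 'this cell is erroneous')."""
    true_pos = sum(p and a for p, a in zip(predicted, actual))
    false_pos = sum(p and not a for p, a in zip(predicted, actual))
    false_neg = sum(a and not p for p, a in zip(predicted, actual))
    if true_pos == 0:
        # Assumed convention: no correctly flagged error means a score of 0.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# One hit, one false alarm, one missed error.
predicted = [True, False, True, False]
actual = [True, True, False, False]
score = f1_score(predicted, actual)
print(score)
```

Under this definition, a sample of only a few rows often contains very few erroneous cells, so a single missed error is enough to push the score to 0.0.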

In [8]:
flights_zero_shot_time, flights_zero_shot_score = error_detection_flights.zero_shot(n_samples=MAXIMUM_ROW_COUNT)
print(f'Time: {flights_zero_shot_time}, f1-score: {flights_zero_shot_score}')

ErrorDetection (INFO):	Started zero shot for 2 rows


IntProgress(value=0, description='Attributes Prompted Error Detection Zero Shot', max=12)

ErrorDetection (INFO):	Finished Error Detection Zero Shot in 00:00:40


Time: 40.957926988601685, f1-score: 0.0


Now the same experiment with few-shotting.

In [9]:
flights_few_shot_time, flights_few_shot_score = error_detection_flights.few_shot(n_samples=MAXIMUM_ROW_COUNT, example_count=2)
print(f'Time: {flights_few_shot_time}, f1-score: {flights_few_shot_score}')

ErrorDetection (INFO):	Started few shot for 2 rows with 2 examples


IntProgress(value=0, description='Attributes Prompted Error Detection Few Shot', max=12)

ErrorDetection (INFO):	Finished Error Detection Few Shot in 00:02:51


Time: 171.1856119632721, f1-score: 0.2222222222222222


## Food Test

Next we conduct the same experiment on the Food dataset and we again start with zero-shot. 

For perspective: A single row (16 attributes) took around 137.798 seconds.

In [10]:
food_zero_shot_time, food_zero_shot_score = error_detection_food.zero_shot(n_samples=MAXIMUM_ROW_COUNT)
print(f'Time: {food_zero_shot_time}, f1-score: {food_zero_shot_score}')

ErrorDetection (INFO):	Started zero shot for 2 rows


IntProgress(value=0, description='Attributes Prompted Error Detection Zero Shot', max=32)

ErrorDetection (INFO):	Finished Error Detection Zero Shot in 00:05:21


Time: 321.21536803245544, f1-score: 0.0


Next up is few shot on this dataset.

In [11]:
food_few_shot_time, food_few_shot_score = error_detection_food.few_shot(n_samples=MAXIMUM_ROW_COUNT, example_count=2)
print(f'Time: {food_few_shot_time}, f1-score: {food_few_shot_score}')

ErrorDetection (INFO):	Started few shot for 2 rows with 2 examples


IntProgress(value=0, description='Attributes Prompted Error Detection Few Shot', max=32)

ErrorDetection (INFO):	Finished Error Detection Few Shot in 00:08:11


Time: 491.1071527004242, f1-score: 0.0


## Hospital Test
Finally, we conduct the exact same experiment again, but this time on the Hospital dataset.

For perspective: A single row (19 attributes) took around 221.656 seconds.

In [12]:
hospital_zero_shot_time, hospital_zero_shot_score = error_detection_hospital.zero_shot(n_samples=MAXIMUM_ROW_COUNT)
print(f'Time: {hospital_zero_shot_time}, f1-score: {hospital_zero_shot_score}')

ErrorDetection (INFO):	Started zero shot for 2 rows


IntProgress(value=0, description='Attributes Prompted Error Detection Zero Shot', max=38)

ErrorDetection (INFO):	Finished Error Detection Zero Shot in 00:06:15


Time: 375.07872676849365, f1-score: 0.0


And now few-shot.

In [13]:
hospital_few_shot_time, hospital_few_shot_score = error_detection_hospital.few_shot(n_samples=MAXIMUM_ROW_COUNT, example_count=2)
print(f'Time: {hospital_few_shot_time}, f1-score: {hospital_few_shot_score}')

ErrorDetection (INFO):	Started few shot for 2 rows with 2 examples


IntProgress(value=0, description='Attributes Prompted Error Detection Few Shot', max=38)

ErrorDetection (INFO):	Finished Error Detection Few Shot in 00:09:36


Time: 576.151734828949, f1-score: 0.4
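
To compare the runs above on equal footing, the reported wall-clock times can be normalised by the number of prompted attributes (the `max` value of each progress bar). The sketch below only re-uses the numbers printed in the cells above. The dictionary layout and variable names are illustrative choices.

```python
# (total seconds, prompted attributes) as printed by the cells above
runs = {
    ("Flights", "zero-shot"): (40.957926988601685, 12),
    ("Flights", "few-shot"): (171.1856119632721, 12),
    ("Food", "zero-shot"): (321.21536803245544, 32),
    ("Food", "few-shot"): (491.1071527004242, 32),
    ("Hospital", "zero-shot"): (375.07872676849365, 38),
    ("Hospital", "few-shot"): (576.151734828949, 38),
}

# Average time spent on a single attribute prompt in each run
seconds_per_attribute = {
    key: total / attributes for key, (total, attributes) in runs.items()
}

for (dataset, mode), value in seconds_per_attribute.items():
    print(f"{dataset:8s} {mode:9s} {value:6.2f} s/attribute")
```

With only two rows per run these figures are single measurements, so they indicate a rough order of magnitude rather than a stable average.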


# Custom Datasets

It is also possible to run error detection prompts on custom data. To do so we first need to create the appropriate datasets for the queries and the ground truth as dataframes. The following two code cells highlight one way to achieve this first step.

In [10]:
import pandas as pd

dirty_data = [["England", "Kyoto"], ["USA", "Miami"], ["Spain", "Paris"]] 
dirty_dataframe = pd.DataFrame(dirty_data, columns=["Country", "City"]) 
dirty_dataframe

| | Country | City |
|---|---|---|
| 0 | England | Kyoto |
| 1 | USA | Miami |
| 2 | Spain | Paris |


In [11]:
clean_data = [["England", "Greenwich"], ["USA", "Miami"], ["Spain", "Barcelona"]] 
clean_dataframe = pd.DataFrame(clean_data, columns=["Country", "City"]) 
clean_dataframe

| | Country | City |
|---|---|---|
| 0 | England | Greenwich |
| 1 | USA | Miami |
| 2 | Spain | Barcelona |
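
The ground truth is implied by the difference between the two frames, so the erroneous cells can be previewed with pandas alone before running any prompts. This is a small sketch under the assumption that both frames share the same shape, column names and row order. The names `error_mask` and `error_cells` are illustrative and not part of the project.

```python
import pandas as pd

columns = ["Country", "City"]
dirty_dataframe = pd.DataFrame(
    [["England", "Kyoto"], ["USA", "Miami"], ["Spain", "Paris"]], columns=columns
)
clean_dataframe = pd.DataFrame(
    [["England", "Greenwich"], ["USA", "Miami"], ["Spain", "Barcelona"]], columns=columns
)

# True wherever the dirty value differs from the clean value
error_mask = dirty_dataframe.ne(clean_dataframe)

# List the erroneous cells as (row label, column name) pairs
row_positions, column_positions = error_mask.to_numpy().nonzero()
error_cells = [
    (error_mask.index[r], error_mask.columns[c])
    for r, c in zip(row_positions, column_positions)
]
print(error_cells)
```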


With the dataframes ready it is time to create the dataset object used to run the experiment.

In [12]:
from DataSet import CustomDataSet

custom_data_set = CustomDataSet(dirty_dataframe, clean_dataframe, "Cities")

from experiment.errorDetection import ErrorDetection

error_detection_cities = ErrorDetection(dataset=custom_data_set)

llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /home/group-1-23/.cache/huggingface/hub/models--TheBloke--Llama-2-13B-chat-GGUF/snapshots/4458acc949de0a9914c3eab623904d4fe999050a/llama-2-13b-chat.Q5_K_M.gguf (version GGUF V2)
llama_model_loader: - tensor    0:                token_embd.weight q5_K     [  5120, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 13824,  5120,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q5_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q5_K     [  5120, 13824,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  5120,     1,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.attn_k.weight q5_K     [  5120,  5120,

Finally, let's try out error detection on our custom dataset. The prompts used, the generated examples and Llama's responses are all logged: the results are written to logs/[timestamp]/experiments-results.csv, and the prompts and responses to logs/[timestamp]/responses/[id].json. Pro tip: format the JSON file (using alt+shift+f) to make it readable.

In [13]:
# zero shot
cities_zero_shot_time, cities_zero_shot_score = error_detection_cities.zero_shot(n_samples=1)
print(f'Time: {cities_zero_shot_time}, f1-score: {cities_zero_shot_score}')

ErrorDetection (INFO):	Started zero shot for 1 rows


IntProgress(value=0, description='Attributes Prompted Error Detection Zero Shot', max=2)

ErrorDetection (INFO):	Finished Error Detection Zero Shot in 00:00:03


Time: 3.6455185413360596, f1-score: 0.0


In [14]:
# few shot
cities_few_shot_time, cities_few_shot_score = error_detection_cities.few_shot(n_samples=1, example_count=1)
print(f'Time: {cities_few_shot_time}, f1-score: {cities_few_shot_score}')

ErrorDetection (INFO):	Started few shot for 1 rows with 1 examples


IntProgress(value=0, description='Attributes Prompted Error Detection Few Shot', max=2)

ErrorDetection (INFO):	Finished Error Detection Few Shot in 00:00:01


Time: 1.1666619777679443, f1-score: 0.0


# Custom Prompt

You can also supply custom prompts for any dataset. Make sure each prompt includes the following placeholders for its inference type:

1. **attr** : The attribute you want to prompt
2. **context** : The whole row
3. **example** (only for Few-Shot) : The example rows for the model

**!! Due to formatting specifications on our side, the placeholders must follow the names given exactly !!**

## Defaults
Zero-Shot: "Is there an error in {attr}?\n{context}?" \
Few-Shot : "Is there an error in {attr}?\n\n{example}\n\n{context}?"
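
Because the placeholder names must match exactly, it can help to check a template before starting a long run. The sketch below uses only the standard library. The helper `missing_placeholders` and the `REQUIRED` mapping are hypothetical and not part of ErrorDetection, and filling the template via `str.format` is an assumption about how the placeholders are substituted.

```python
from string import Formatter

# Placeholders required per inference type, as listed above
REQUIRED = {
    "zero_shot": {"attr", "context"},
    "few_shot": {"attr", "context", "example"},
}


def missing_placeholders(template: str, inference_type: str) -> set:
    """Return the required placeholder names that the template lacks."""
    found = {name for _, name, _, _ in Formatter().parse(template) if name}
    return REQUIRED[inference_type] - found


default_zero_shot = "Is there an error in {attr}?\n{context}?"
misspelled_few_shot = "Is there an error in {attribute}?\n\n{context}?"

print(missing_placeholders(default_zero_shot, "zero_shot"))
print(missing_placeholders(misspelled_few_shot, "few_shot"))

# Assumed substitution step: plain str.format with the placeholder names
filled = default_zero_shot.format(attr="City", context="Country: Spain, City: Paris")
print(filled)
```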

In [17]:
custom_prompt_zero_shot = "First take a deep breath. Is there an error in {attr}?\n{context}?"
custom_prompt_few_shot = "First take a deep breath. Is there an error in {attr}?\n\n{example}\n\n{context}?"

# zero shot
cities_zero_shot_time, cities_zero_shot_score = error_detection_cities.zero_shot(n_samples=1, prompt_template=custom_prompt_zero_shot)
# few shot
cities_few_shot_time, cities_few_shot_score = error_detection_cities.few_shot(n_samples=1, example_count=1, prompt_template=custom_prompt_few_shot)

ErrorDetection (INFO):	Started zero shot for 1 rows


IntProgress(value=0, description='Attributes Prompted Error Detection Zero Shot', max=2)

ErrorDetection (INFO):	Finished Error Detection Zero Shot in 00:00:06
ErrorDetection (INFO):	Started few shot for 1 rows with 1 examples


IntProgress(value=0, description='Attributes Prompted Error Detection Few Shot', max=2)

ErrorDetection (INFO):	Finished Error Detection Few Shot in 00:00:02


In [16]:
# zero shot
print(f'Time: {cities_zero_shot_time}, f1-score: {cities_zero_shot_score}')
# few shot
print(f'Time: {cities_few_shot_time}, f1-score: {cities_few_shot_score}')

Time: 4.02879524230957, f1-score: 0.0
Time: 2.4927690029144287, f1-score: 0.0


# Evaluation


To evaluate the results, we ran each inference type multiple times to obtain a stable set of data points. You can change the parameters if you want:

1. **ITERATION_AMOUNT** : How many times the experiment is repeated
2. **MAXIMUM_ROW_COUNT** : The number of rows sampled for each experiment
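
With several iterations per inference type, a natural summary is the mean and spread per type. The sketch below shows one way to aggregate a results table with the columns Dataset, Type, Time and F1-Score. The numbers in it are placeholders chosen for illustration, not measured results, and the variable names are assumptions.

```python
import pandas as pd

# Placeholder values only, NOT measurements from the experiments
example_results = pd.DataFrame(
    [
        ["Cities", "ZS", 7.0, 0.0],
        ["Cities", "FS", 2.0, 0.0],
        ["Cities", "ZS", 9.0, 0.5],
        ["Cities", "FS", 4.0, 0.5],
    ],
    columns=["Dataset", "Type", "Time", "F1-Score"],
)

# Mean and standard deviation of runtime and score per dataset and type
summary = example_results.groupby(["Dataset", "Type"])[["Time", "F1-Score"]].agg(
    ["mean", "std"]
)
print(summary)
```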

In [18]:
ITERATION_AMOUNT = 10
MAXIMUM_ROW_COUNT = 2

# Collect one row per run and build the DataFrame once at the end; this
# avoids concatenating onto an empty DataFrame in every iteration.
rows = []
for i in range(ITERATION_AMOUNT):
    runtime_zeroshot, f1_zeroshot = error_detection_cities.zero_shot(
        prompt_template=custom_prompt_zero_shot,
        n_samples=MAXIMUM_ROW_COUNT,
        id=f"_p1_{i}",
    )

    runtime_fewshot, f1_fewshot = error_detection_cities.few_shot(
        prompt_template=custom_prompt_few_shot,
        n_samples=MAXIMUM_ROW_COUNT,
        id=f"_p1_{i}",
    )

    rows.append(["Cities", "ZS", runtime_zeroshot, f1_zeroshot])
    rows.append(["Cities", "FS", runtime_fewshot, f1_fewshot])

results = pd.DataFrame(rows, columns=["Dataset", "Type", "Time", "F1-Score"])

ErrorDetection (INFO):	Started zero shot for 2 rows


IntProgress(value=0, description='Attributes Prompted Error Detection Zero Shot', max=4)

ErrorDetection (INFO):	Finished Error Detection Zero Shot in 00:00:08
ErrorDetection (INFO):	Started few shot for 2 rows with 2 examples


IntProgress(value=0, description='Attributes Prompted Error Detection Few Shot', max=4)

ErrorDetection (INFO):	Finished Error Detection Few Shot in 00:00:06
ErrorDetection (INFO):	Started zero shot for 2 rows


IntProgress(value=0, description='Attributes Prompted Error Detection Zero Shot', max=4)

ErrorDetection (INFO):	Finished Error Detection Zero Shot in 00:00:08
ErrorDetection (INFO):	Started few shot for 2 rows with 2 examples


IntProgress(value=0, description='Attributes Prompted Error Detection Few Shot', max=4)

ErrorDetection (INFO):	Finished Error Detection Few Shot in 00:00:02
ErrorDetection (INFO):	Started zero shot for 2 rows


IntProgress(value=0, description='Attributes Prompted Error Detection Zero Shot', max=4)

ErrorDetection (INFO):	Finished Error Detection Zero Shot in 00:00:07
ErrorDetection (INFO):	Started few shot for 2 rows with 2 examples


IntProgress(value=0, description='Attributes Prompted Error Detection Few Shot', max=4)

ErrorDetection (INFO):	Finished Error Detection Few Shot in 00:00:03
ErrorDetection (INFO):	Started zero shot for 2 rows


IntProgress(value=0, description='Attributes Prompted Error Detection Zero Shot', max=4)

ErrorDetection (INFO):	Finished Error Detection Zero Shot in 00:00:14
ErrorDetection (INFO):	Started few shot for 2 rows with 2 examples


IntProgress(value=0, description='Attributes Prompted Error Detection Few Shot', max=4)

ErrorDetection (INFO):	Finished Error Detection Few Shot in 00:00:02
ErrorDetection (INFO):	Started zero shot for 2 rows


IntProgress(value=0, description='Attributes Prompted Error Detection Zero Shot', max=4)

ErrorDetection (INFO):	Finished Error Detection Zero Shot in 00:00:07
ErrorDetection (INFO):	Started few shot for 2 rows with 2 examples


IntProgress(value=0, description='Attributes Prompted Error Detection Few Shot', max=4)

ErrorDetection (INFO):	Finished Error Detection Few Shot in 00:00:03
ErrorDetection (INFO):	Started zero shot for 2 rows


IntProgress(value=0, description='Attributes Prompted Error Detection Zero Shot', max=4)

ErrorDetection (INFO):	Finished Error Detection Zero Shot in 00:00:10
ErrorDetection (INFO):	Started few shot for 2 rows with 2 examples


IntProgress(value=0, description='Attributes Prompted Error Detection Few Shot', max=4)

ErrorDetection (INFO):	Finished Error Detection Few Shot in 00:00:02
ErrorDetection (INFO):	Started zero shot for 2 rows


IntProgress(value=0, description='Attributes Prompted Error Detection Zero Shot', max=4)

ErrorDetection (INFO):	Finished Error Detection Zero Shot in 00:00:07
ErrorDetection (INFO):	Started few shot for 2 rows with 2 examples


IntProgress(value=0, description='Attributes Prompted Error Detection Few Shot', max=4)

ErrorDetection (INFO):	Finished Error Detection Few Shot in 00:00:02
ErrorDetection (INFO):	Started zero shot for 2 rows


IntProgress(value=0, description='Attributes Prompted Error Detection Zero Shot', max=4)

ErrorDetection (INFO):	Finished Error Detection Zero Shot in 00:00:06
ErrorDetection (INFO):	Started few shot for 2 rows with 2 examples


IntProgress(value=0, description='Attributes Prompted Error Detection Few Shot', max=4)

ErrorDetection (INFO):	Finished Error Detection Few Shot in 00:00:01
ErrorDetection (INFO):	Started zero shot for 2 rows


IntProgress(value=0, description='Attributes Prompted Error Detection Zero Shot', max=4)

ErrorDetection (INFO):	Finished Error Detection Zero Shot in 00:00:03
ErrorDetection (INFO):	Started few shot for 2 rows with 2 examples


IntProgress(value=0, description='Attributes Prompted Error Detection Few Shot', max=4)

ErrorDetection (INFO):	Finished Error Detection Few Shot in 00:00:05
ErrorDetection (INFO):	Started zero shot for 2 rows


IntProgress(value=0, description='Attributes Prompted Error Detection Zero Shot', max=4)

ErrorDetection (INFO):	Finished Error Detection Zero Shot in 00:00:10
ErrorDetection (INFO):	Started few shot for 2 rows with 2 examples


IntProgress(value=0, description='Attributes Prompted Error Detection Few Shot', max=4)

ErrorDetection (INFO):	Finished Error Detection Few Shot in 00:00:01


## Graphs

The results are presented as box plots. The top plot shows the time required for each experiment, and the bottom plot shows the F1 score achieved in each experiment.
Both plots are interactive (e.g., you can zoom in).

In [8]:
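import pandas as pd
import plotly.express as px

# Load the experiment results and order them by dataset and experiment type.
flightData = pd.read_csv('./analysis/data/flight_restrict_answer_test_10_5.csv')
flightData = flightData.sort_values(['Dataset', 'Type'], ascending=[True, True])
# Divide the recorded time by 60 to rescale it for plotting.
flightData['Time'] = flightData['Time'] / 60

# Top plot: time required, one box per experiment type.
fig = px.box(flightData, y="Time", color="Type")
fig.show()
# Bottom plot: F1 score achieved, one box per experiment type.
fig = px.box(flightData, y="F1-Score", color="Type")
fig.show()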
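
To read the box plots numerically, the quantities each box encodes (minimum, quartiles, maximum) can also be computed per experiment type. The following is a minimal sketch, not part of the original analysis: the `results` DataFrame is a made-up stand-in for the CSV, the `Type` values are hypothetical labels, and `box_summary` is an illustrative helper. Only the column names `Type`, `Time` and `F1-Score` are taken from the cell above.

```python
import pandas as pd

# Hypothetical stand-in for the results CSV; the numbers are invented for illustration.
results = pd.DataFrame({
    "Type": ["Zero Shot", "Zero Shot", "Zero Shot", "Few Shot", "Few Shot", "Few Shot"],
    "Time": [10.0, 7.0, 6.0, 2.0, 2.0, 1.0],
    "F1-Score": [0.40, 0.50, 0.60, 0.60, 0.70, 0.80],
})


def box_summary(df, column):
    """Return min, quartiles and max of `column`, one row per experiment type."""
    quantiles = [0.0, 0.25, 0.5, 0.75, 1.0]
    # groupby(...).quantile(list) yields a (Type, quantile) MultiIndex; unstack
    # turns the quantile level into columns.
    return df.groupby("Type")[column].quantile(quantiles).unstack()


time_summary = box_summary(results, "Time")
f1_summary = box_summary(results, "F1-Score")
print(time_summary)
print(f1_summary)
```

The quartile values may differ slightly from the ones drawn in the plots, since plotting libraries can use a different quartile interpolation method than pandas.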