# Results interpreter
During evaluation, we store the results for each question in JSON files in the results folder, so that we can review specific questions and see how well the system performs and what kinds of mistakes it makes. To understand these files, let's first look at the structure of the stored results:
We work with a filtered and reorganized version of QALD9 that keeps only questions whose SPARQL query consists of a single triple. The questions are grouped by the type of expected result and the operations required to obtain it, which yields 4 subsets: singular, multiple, boolean and aggregation. Each subset stores the results for the test and train datasets of our QALD9 subset.
The results map each question id to an entry containing:
- TP : number of true positive values in the question
- FN : number of false negative values in the question
- FP : number of false positive values in the question
- correct: whether the answer to the question can be considered correct (implies that FN=FP=0)
- notes : notes about the question; if an error occurred while testing the question, the exception is detailed here
- error : flag indicating that the question contained an error. There are only two error cases: the question has no English translation, or the system returned an error (usually due to token-limit restrictions). Questions with this flag set to true are not taken into account.
- actual_answers: list of answers returned by the system
- expected_answers: list of answers expected by QALD9 (generated from the answers key of the question)
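The counts listed above can be derived by comparing the two answer lists. The following is a minimal sketch, not the evaluation code itself: the helper name `count_question`, the exact-match set comparison, and the rule that a question is correct exactly when FP and FN are both zero are assumptions made for illustration. The real system may normalise answers or treat boolean questions differently.

```python
def count_question(actual_answers, expected_answers):
    """Hypothetical helper: derive TP, FP, FN and `correct` for one question.

    Assumes answers are compared by exact match and duplicates are ignored.
    """
    actual = set(actual_answers)
    expected = set(expected_answers)
    tp = len(actual & expected)  # returned and expected
    fp = len(actual - expected)  # returned but not expected
    fn = len(expected - actual)  # expected but not returned
    # Assumption: a question counts as correct only when nothing is missing or extra.
    return {"TP": tp, "FP": fp, "FN": fn, "correct": fp == 0 and fn == 0}


example = count_question(["Obama"], ["Obama", "Bush"])
print(example)  # {'TP': 1, 'FP': 0, 'FN': 1, 'correct': False}
```

One expected answer is missing here, so the question is counted as incorrect even though the returned answer is right.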
Here is an example of a results file structure:

{
    prompt: 'Optional, the prompt related to the evaluation',
    system: 'Optional, the system related to the evaluation',
    multiple: {
        train_results: {
            'question id': {
                TP: 1,
                FP: 0,
                FN: 1,
                correct: False,
                notes: None,
                error: False,
                actual_answers: ['Obama'],
                expected_answers: ['Obama', 'Bush']
            }
        },
        test_results: {...}
    },
    boolean: {...},
    singular: {...},
    aggregation: {...}
}
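The structure above is written in a Python-like shorthand. On disk, a JSON file uses double-quoted keys and `false`/`null` instead of `False`/`None`. As a rough sketch, with a hypothetical inline string standing in for the file content, the same structure can be parsed like this:

```python
import json

# Hypothetical file content that mirrors the structure described above.
raw = """
{
  "prompt": "Optional, the prompt related to the evaluation",
  "multiple": {
    "train_results": {
      "question id": {
        "TP": 1, "FP": 0, "FN": 1,
        "correct": false, "notes": null, "error": false,
        "actual_answers": ["Obama"],
        "expected_answers": ["Obama", "Bush"]
      }
    },
    "test_results": {}
  }
}
"""

data = json.loads(raw)
# Metadata entries such as "prompt" are strings; subsets are dicts.
subsets = [name for name, value in data.items() if isinstance(value, dict)]
question = data["multiple"]["train_results"]["question id"]
print(subsets)                                 # ['multiple']
print(question["correct"], question["notes"])  # False None
```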
To interpret these results, we generate the following statistics for each subset: precision, recall, F1 and percentage of correct answers, computed for the test results, the train results and both combined (test and train).
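The metric helpers come from `utils.Metrics_utils`, whose source is not shown in this notebook. The sketch below uses the standard definitions and is only an assumption about how those helpers behave. The function names `precision`, `recall` and `f1` are hypothetical, and returning 0 for an empty denominator is a choice made here for illustration.

```python
def precision(tp, fp):
    # Share of returned answers that were expected.
    return tp / (tp + fp) if (tp + fp) else 0


def recall(tp, fn):
    # Share of expected answers that were returned.
    return tp / (tp + fn) if (tp + fn) else 0


def f1(tp, fp, fn):
    # Harmonic mean of precision and recall, written directly in terms of counts.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0


print(precision(12, 4))  # 0.75
print(recall(12, 9))
print(f1(12, 4, 9))
```

Because the counts are summed over all questions before the ratios are taken, these are micro-averaged metrics: questions with many answers weigh more than questions with a single answer.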

First let's define functions to achieve this...

In [1]:
# imports
import sys
import os
  
current = os.path.dirname(os.path.abspath(''))
parent_directory = os.path.dirname(current)

sys.path.append(parent_directory)

from utils.Metrics_utils import get_f1, get_precision, get_recall
from utils.Json_utils import read_json

In [2]:
def obtain_count_results(results:dict):
    # expected input is the train_results or test_results of a given subset;
    # counts are accumulated over every question not flagged as an error
    TP = FP = FN = correct = incorrect = 0
    
    for question_id, stats in results.items():
        if not stats.get('error'):
            TP = TP + stats.get('TP')
            FP = FP + stats.get('FP')
            FN = FN + stats.get('FN')
            if stats.get('correct'):
                correct = correct + 1
            else:
                incorrect = incorrect + 1
    
    return TP, FP, FN, correct, incorrect

def print_metrics(TP, FP, FN, correct, incorrect):
    print('TP: ', TP)
    print('FP: ', FP)
    print('FN: ', FN)
    print('number correct: ', correct)
    print('number incorrect: ', incorrect)
    total = correct + incorrect
    # avoid a ZeroDivisionError when a split has no valid questions
    print('Correct answers (%): ', (correct / total) * 100 if total else 0.0)
    print('Precision: ', get_precision(TP,FP))
    print('Recall: ', get_recall(TP,FN))
    print('F1: ', get_f1(TP, FP, FN))

def obtain_subset_results(subset:dict, subset_name:str):
    print('Obtaining metrics for the subset: ', subset_name)
    train_TP, train_FP, train_FN, train_correct, train_incorrect = obtain_count_results(subset.get('train_results'))
    print('Train set results:')
    print_metrics(train_TP, train_FP, train_FN, train_correct, train_incorrect)
    test_TP, test_FP, test_FN, test_correct, test_incorrect = obtain_count_results(subset.get('test_results'))
    print('Test set results:')
    print_metrics(test_TP, test_FP, test_FN, test_correct, test_incorrect)
    print('TOTAL results:')
    print_metrics(train_TP + test_TP, train_FP + test_FP, train_FN + test_FN, train_correct + test_correct, train_incorrect + test_incorrect)
    
def interpret_results(results_filename):
    print('Interpreting results...')
    results_data = read_json(results_filename)
    for key, data in results_data.items():
        if isinstance(data, dict):
            print('##############################################################')
            obtain_subset_results(data, key)
            print('##############################################################')
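The functions above print their metrics, which makes it hard to compare prompts programmatically. As an illustrative alternative, the sketch below aggregates one `train_results` or `test_results` dict into a returned dict. The name `summarize`, the output keys and the sample data are all hypothetical, and the metric formulas are the standard ones rather than the project's own helpers.

```python
def summarize(results):
    """Hypothetical variant: aggregate one results dict into a metrics dict."""
    tp = fp = fn = correct = total = 0
    for stats in results.values():
        if stats.get("error"):
            continue  # flagged questions are excluded from every count
        tp += stats.get("TP", 0)
        fp += stats.get("FP", 0)
        fn += stats.get("FN", 0)
        correct += 1 if stats.get("correct") else 0
        total += 1
    f1_denom = 2 * tp + fp + fn
    return {
        "questions": total,
        "correct_pct": 100 * correct / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "f1": 2 * tp / f1_denom if f1_denom else 0.0,
    }


# Made-up sample data: two valid questions and one flagged as an error.
sample = {
    "q1": {"TP": 1, "FP": 0, "FN": 1, "correct": False, "error": False},
    "q2": {"TP": 2, "FP": 0, "FN": 0, "correct": True, "error": False},
    "q3": {"TP": 0, "FP": 0, "FN": 0, "correct": False, "error": True},
}
summary = summarize(sample)
print(summary)
```

Returning a dict keeps the aggregation separate from the presentation, so the same numbers can be printed, tabulated or compared across results files.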

Now we are ready; let's look at the results.
## Prompting experiment
### Prompt 1
Prompt in which GPT is asked to perform the aggregation operations.

In [3]:
interpret_results('../results/prompt_1_gpt_operations_results.json')

Interpreting results...
##############################################################
Obtaining metrics for the subset:  boolean
Train set results:
TP:  12
FP:  4
FN:  9
number correct:  12
number incorrect:  9
Correct answers (%):  57.14285714285714
Precision:  0.75
Recall:  0.5714285714285714
F1:  0.6486486486486487
Test set results:
TP:  1
FP:  0
FN:  0
number correct:  1
number incorrect:  0
Correct answers (%):  100.0
Precision:  1.0
Recall:  1.0
F1:  1.0
TOTAL results:
TP:  13
FP:  4
FN:  9
number correct:  13
number incorrect:  9
Correct answers (%):  59.09090909090909
Precision:  0.7647058823529411
Recall:  0.5909090909090909
F1:  0.6666666666666666
##############################################################
##############################################################
Obtaining metrics for the subset:  aggregation
Train set results:
TP:  3
FP:  36
FN:  9
number correct:  3
number incorrect:  9
Correct answers (%):  25.0
Precision:  0.07692307692307693
Recall:  0.25
F1:  0

### Prompt 2
Prompt in which the aggregation operations are performed by the system; GPT identifies the elements to use in the operation.

In [4]:
interpret_results('../results/prompt_2_manual_operations_results.json')

Interpreting results...
##############################################################
Obtaining metrics for the subset:  boolean
Train set results:
TP:  15
FP:  0
FN:  5
number correct:  15
number incorrect:  5
Correct answers (%):  75.0
Precision:  1.0
Recall:  0.75
F1:  0.8571428571428571
Test set results:
TP:  0
FP:  0
FN:  1
number correct:  0
number incorrect:  1
Correct answers (%):  0.0
Precision:  0
Recall:  0
F1:  0
TOTAL results:
TP:  15
FP:  0
FN:  6
number correct:  15
number incorrect:  6
Correct answers (%):  71.42857142857143
Precision:  1.0
Recall:  0.7142857142857143
F1:  0.8333333333333334
##############################################################
##############################################################
Obtaining metrics for the subset:  aggregation
Train set results:
TP:  4
FP:  13
FN:  8
number correct:  4
number incorrect:  8
Correct answers (%):  33.33333333333333
Precision:  0.23529411764705882
Recall:  0.3333333333333333
F1:  0.27586206896551724
Test 