In this notebook, we will:
- load the JSON data of a single trial from a directory
- load the matching evaluation response from GPT-4 Omni for any of the 4 generation models
- calculate the precision, recall, and F1 score for that matched set

In [1]:
import pandas as pd 
import numpy as np 
import json 
import os 

## Step 1: Load JSON data of a trial 
Let's load a CT-Pub trial.

In [13]:
import os

# Specify the directory path
directory_path = 'results/ct_pub/'

# List all files in the directory
file_list = os.listdir(directory_path)

# We only look at the first trial file as an example.
# Note: os.listdir returns entries in arbitrary order, so "first" is not guaranteed to be stable.
first_file = file_list[0]

# Read the JSON file
with open(os.path.join(directory_path, first_file)) as f:
    data = json.load(f)

In [15]:
# content of the first file 
print(json.dumps(data, indent=4))

{
    "trial_id": "NCT02003963",
    "trial_group": "obesity",
    "generated-by-gpt4-omni-zs": {
        "gen-response": "Age, Gender, Menstrual Status, BMI Percentile, Language Proficiency, Willingness to Accept Randomization, Pregnancy Status, Mental Health Hospitalization History, Ambulation Ability, Cardiac Health, Cardiovascular Disease History, Musculoskeletal Health, Presence of Implanted Medical Devices, Ability to Complete Baseline Testing, Medical Conditions Affecting Video Game Play, Family History of Epileptic Seizures, Commitment to Study Sessions",
        "evaluated-by-gpt4-omni": {
            "matched_features": [
                [
                    "Age",
                    "Age"
                ],
                [
                    "BMI (percentile)",
                    "BMI Percentile"
                ]
            ],
            "remaining_reference_features": [
                "Race",
                "Weight",
                "Height",
                "BMI

## Step 2: Extract the matching results for a generation model
Let's look at the GPT-4 Omni evaluation of gpt4-omni-ts, that is, baseline measures generated by GPT-4 Omni in the three-shot setting and then evaluated by GPT-4 Omni.

In [5]:
# Look up the evaluation block once, then pull out its three lists
evaluation = data['generated-by-gpt4-omni-ts']['evaluated-by-gpt4-omni']
matched_features = evaluation['matched_features']
remaining_reference_features = evaluation['remaining_reference_features']
remaining_candidate_features = evaluation['remaining_candidate_features']
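
The same lookup works for the other generation models once you know their keys. The sketch below lists the generation-model keys present in a trial record. It assumes that every such key starts with `generated-by-`, which holds for the two keys seen in this notebook but is not confirmed for the others. `sample` and `generation_keys` are hypothetical names used only for illustration.

```python
# Hypothetical excerpt of a trial record; only key names seen in this notebook are used.
sample = {
    "trial_id": "NCT02003963",
    "trial_group": "obesity",
    "generated-by-gpt4-omni-zs": {"evaluated-by-gpt4-omni": {}},
    "generated-by-gpt4-omni-ts": {"evaluated-by-gpt4-omni": {}},
}


def generation_keys(trial, prefix="generated-by-"):
    """Return the keys of `trial` that start with `prefix`, sorted alphabetically.

    Assumption: generation-model entries share the 'generated-by-' prefix.
    """
    return sorted(key for key in trial if key.startswith(prefix))


print(generation_keys(sample))
```

On the real `data` object, `generation_keys(data)` would show which model keys can be substituted into the lookup above.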

In [16]:
# NOTE: in each pair, the first item comes from the reference (API/paper)
# and the second item comes from the candidate (LLM-generated features)
print('Matched Features: ', matched_features)

Matched Features:  [['Age', 'Age'], ['BMI (percentile)', 'BMI percentile'], ['Total body fat', 'Body fat percentage'], ['MRI-Visceral adipose tissue', 'Visceral fat'], ['SBP', 'Blood pressure']]
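
Since each pair is ordered as (reference, candidate), it can be handy to separate the two sides. This is a minimal sketch that uses the pairs printed above. `matched_example` and `split_pairs` are hypothetical names introduced here, not part of the original notebook.

```python
# The matched pairs printed above, copied here so the example is self-contained.
matched_example = [
    ["Age", "Age"],
    ["BMI (percentile)", "BMI percentile"],
    ["Total body fat", "Body fat percentage"],
    ["MRI-Visceral adipose tissue", "Visceral fat"],
    ["SBP", "Blood pressure"],
]


def split_pairs(pairs):
    """Split [reference, candidate] pairs into two parallel lists."""
    reference = [ref for ref, _ in pairs]
    candidate = [cand for _, cand in pairs]
    return reference, candidate


reference_names, candidate_names = split_pairs(matched_example)
print("Reference side:", reference_names)
print("Candidate side:", candidate_names)
```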


In [7]:
print('Remaining Reference Features: ', remaining_reference_features)


Remaining Reference Features:  ['Race', 'Weight', 'Height', 'BMI (z-score)', 'Waist circumference', 'Leg fat', 'Gynoid fat', 'Android fat', 'Trunk fat', 'BMC', 'BMD trunk', 'BMD spine', 'BMD leg', 'MRI-Subcutaneous adipose tissue', 'MRI-Total adipose tissue', 'DBP', 'Cholesterol', 'HDL-C', 'LDL-C', 'Triglycerides', 'Glucose', 'Insulin']


In [8]:
print('Remaining Candidate Features: ', remaining_candidate_features)

Remaining Candidate Features:  ['Sex', 'BMI', 'Ethnicity', 'Socioeconomic status', 'Physical activity level', 'Dietary habits', 'Heart rate', 'Self-confidence', 'Quality of life', 'Screen time', 'Family history of cardiovascular disease', 'Mental health status', 'Menstrual status']


## Step 3: Calculate Precision, Recall and F1 score for this matching

In [17]:
def match_to_score(matched_pairs, remaining_reference_features, remaining_candidate_features):
    """
    Calculates precision, recall, and F1 score based on the given matched pairs and remaining features.

    Parameters:
    matched_pairs (list): A list of matched feature pairs.
    remaining_reference_features (list): A list of remaining reference features.
    remaining_candidate_features (list): A list of remaining candidate features.

    Returns:
    dict: A dictionary containing the precision, recall, and F1 score.
    """
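    tp = len(matched_pairs)                 # true positives
    fp = len(remaining_candidate_features)  # false positives
    fn = len(remaining_reference_features)  # false negatives

    # Guard against empty inputs to avoid division by zero
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0  # TP / (TP + FP)
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0  # TP / (TP + FN)

    if precision == 0 or recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * (precision * recall) / (precision + recall)  # harmonic mean of precision and recall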

    return {"precision": precision, "recall": recall, "f1": f1}

In [18]:
# Calculate the precision, recall, and F1 score
result = match_to_score(matched_features, remaining_reference_features, remaining_candidate_features)

In [19]:
# Print the results
print('Precision: ', result['precision'])
print('Recall: ', result['recall'])
print('F1 Score: ', result['f1'])

Precision:  0.2777777777777778
Recall:  0.18518518518518517
F1 Score:  0.22222222222222224
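
As a sanity check, the printed scores can be reproduced from the list sizes alone: 5 matched pairs, 13 remaining candidate features, and 22 remaining reference features. The sketch below uses exact fractions and also checks the equivalent count-based form F1 = 2TP / (2TP + FP + FN). The variable names are chosen here for illustration.

```python
from fractions import Fraction

# Counts taken from the lists printed in Step 2
tp = 5   # matched pairs
fp = 13  # remaining candidate features
fn = 22  # remaining reference features

precision = Fraction(tp, tp + fp)  # 5/18
recall = Fraction(tp, tp + fn)     # 5/27

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Equivalent form written directly in terms of the counts
f1_from_counts = Fraction(2 * tp, 2 * tp + fp + fn)

print(precision, recall, f1, f1_from_counts)
print(float(precision), float(recall), float(f1))
```

Both forms give 2/9, which agrees with the F1 score printed above up to floating-point rounding.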
