# 03 - Evaluating data loss

Data present in the Synthea records can be lost during processing, before the records reach scoring. Loss can occur at two steps:

* The large language model (LLM) that generates the free-text notes might fail to include all the identifiers.
* The named entity recognition (NER) step applied to the free-text notes might fail to extract all identifiers accurately.

This notebook evaluates the loss of identifiers in the Synthea-LLM-NER pipeline.
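
As a rough illustration of what "loss of identifiers" means for a single record, the sketch below compares a source record with the record recovered at the end of the pipeline and reports which fields survived. The function `recovery_rate`, the field names, and the case-insensitive exact-match rule are all assumptions made for this example. It does not reproduce the scoring used by `compare_common_records`, which may weight or match fields differently.

```python
def recovery_rate(source: dict, extracted: dict) -> dict:
    """Return 1.0 for each source identifier found in `extracted`, else 0.0.

    Hypothetical helper: matching is a case-insensitive exact comparison,
    which is an assumption and not the library's scoring rule.
    """
    result = {}
    for key, value in source.items():
        found = extracted.get(key)
        matched = (
            found is not None
            and str(found).strip().lower() == str(value).strip().lower()
        )
        result[key] = 1.0 if matched else 0.0
    return result


# Toy records with made-up values; the real records come from notebook 2.
source_record = {
    "name": "Jane Doe",
    "date_of_birth": "1980-01-01",
    "nhs_number": "1234567890",
}
extracted_record = {
    "name": "jane doe",
    "date_of_birth": "1980-01-01",
    # "nhs_number" is missing: lost by the LLM or by the NER step
}

per_field = recovery_rate(source_record, extracted_record)
overall = sum(per_field.values()) / len(per_field)
print(per_field)
print(f"Overall recovery: {overall:.0%}")
```

A per-field breakdown of this kind shows which identifiers are dropped most often, which is the question the rest of this notebook addresses across the whole dataset.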

In [None]:
import json
import os

import pandas as pd

In [None]:
import privacy_fingerprint.generate.synthea as synthea
import privacy_fingerprint.extract.aws_comprehend as aws
from privacy_fingerprint.common import compare_common_records

In [None]:
# The dataset will be loaded from the directory created in notebook 2.
output_dir = "../experiments/02_generate_dataset"

with open(os.path.join(output_dir, "synthea_dataset.json")) as fp:
    synthea_records = json.load(fp)

with open(os.path.join(output_dir, "llm_dataset.json")) as fp:
    llm_results = json.load(fp)

with open(os.path.join(output_dir, "ner_dataset.json")) as fp:
    ner_records = json.load(fp)

In [None]:
# The formats of the Synthea and NER records differ
# and must be standardised to enable comparison

common_results = synthea.prepare_common_records(
    synthea.DEFAULT_IDENTIFIERS, synthea_records
)
common_ner_results = aws.prepare_common_records(
    aws.DEFAULT_IDENTIFIERS, ner_records
)

In [None]:
# Iterate across all records, comparing the data generated by Synthea
# with the data recovered by the NER for each record

record_comparison_summary = []
for s, n in zip(common_results, common_ner_results):
    overall_score, max_score, summary = compare_common_records(s, n)
    record_comparison_summary.append(summary)

record_comparison_summary = pd.DataFrame(record_comparison_summary)

In [None]:
record_comparison_summary.info()

In [None]:
record_comparison_summary.describe()

In [None]:
record_comparison_summary.plot.box(rot=90, ylabel="Data recovery (%)")