# 01 - Demonstrating packages in a simple experiment

This notebook demonstrates the end-to-end process of using the privacy fingerprint package to conduct an experiment. Before running this notebook, complete all the installation and setup steps in the README. The steps covered are:

* Generate structured dummy data with Synthea
* Convert the structured records to unstructured clinical notes using a large language model (LLM)
* Extract identifiers from the clinical notes using named entity recognition (NER)
* Standardise the NER output
* Run pycorrectmatch to assess the privacy risk in the dataset

In [None]:
import os

import privacy_fingerprint.extract.aws_comprehend as aws
import privacy_fingerprint.generate.language_model as llm
import privacy_fingerprint.generate.synthea as synthea
from privacy_fingerprint.common.config import (
    load_experiment_config,
    load_experiment_config_from_file,
    load_global_config_from_file,
)

Options within the process are controlled by two config files: one for global settings such as API keys, and one for a particular experiment. Example config files are available in the `configs` directory. They contain default settings but must be modified to reflect your set-up of Julia, Synthea, and AWS.

In [None]:
# Move and modify the config files

load_global_config_from_file("../configs/global_config.yaml")
load_experiment_config_from_file("../configs/experiment_config.yaml")

# Config options can also be modified in-line. To keep this notebook/experiment small, the number
# of patient records generated is changed to 10.
expt_config = load_experiment_config()
expt_config.synthea.encounter_type = "Encounter for symptom"
expt_config.synthea.num_records = 10
# Note that to apply in-line changes you must reload the settings.
load_experiment_config(expt_config.dict())

In [None]:
from privacy_fingerprint.score import PrivacyRiskScorer, preprocess

Synthea writes its output, a directory full of events, to the location passed to `generate_records`. Here that is a `synthea` subdirectory of `output_dir`.

In [None]:
output_dir = "../experiments/01_simple_experiment/"
# Uncomment on the first run if the directory does not exist yet.
# os.mkdir(output_dir)
export_directory = os.path.join(output_dir, "synthea")

With the directory set up, we can generate records. This may take some time, especially when generating a large number of records.

In [None]:
synthea_records = synthea.generate_records(export_directory)
print(f"Generated {len(synthea_records)} records.")

Despite requesting 10 records, we might not get exactly 10. There are two reasons:

* Synthea generates individuals and tracks them through time. If an individual dies, they do not count towards `num_records` and Synthea continues generating. The result is `num_records` living individuals plus any that have died.
* It is also possible for fewer than `num_records` to be returned if some generated individuals did not have the medical encounter type specified in the config file.
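
The two effects above can be illustrated with a toy simulation. This is only a sketch and does not call Synthea: the function `simulate_generation`, the `death_rate` and `encounter_rate` parameters, and the way the filter is applied are all hypothetical choices made for illustration.

```python
import random


def simulate_generation(num_records, death_rate, encounter_rate, seed=0):
    """Toy model: keep generating until `num_records` individuals are alive."""
    rng = random.Random(seed)
    living, deceased = 0, 0
    while living < num_records:
        if rng.random() < death_rate:
            deceased += 1  # a deceased individual does not count towards num_records
        else:
            living += 1
    generated = living + deceased
    # Assumption: only individuals with the requested encounter type yield a record.
    returned = sum(rng.random() < encounter_rate for _ in range(generated))
    return generated, returned


generated, returned = simulate_generation(
    num_records=10, death_rate=0.2, encounter_rate=0.7
)
print(f"Generated {generated} individuals, {returned} with a matching encounter.")
```

The number generated is never below `num_records`, while the number returned can fall on either side of it.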

We can now use the structured Synthea records to generate general medical notes for some encounters. This step calls the LLM API and returns unstructured text.

In [None]:
clinical_note_generator = llm.LMGenerator()
llm_results = clinical_note_generator.generate_text(synthea_records)

We can print a sample of the generated notes to read the model outputs. Note that `generate_text` returns a generator, not a list, so we convert it to a list before slicing.

In [None]:
llm_results = list(llm_results)
print(*llm_results[:5], sep="\n\n------------------")
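
Converting to a list also matters because a Python generator can only be iterated once, and the notes are reused in later steps. A minimal illustration, where `fake_notes` is a hypothetical stand-in for the real note generator:

```python
def fake_notes():
    """Stand-in for a note generator; yields strings lazily."""
    for i in range(3):
        yield f"note {i}"


notes = fake_notes()
first_pass = list(notes)   # consumes the generator
second_pass = list(notes)  # nothing left to yield

print(len(first_pass), len(second_pass))  # 3 0
```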

We then perform the "reverse" step by using an NER service (AWS ComprehendMedical) to recover the information we injected into the unstructured records. This is the most expensive step of the process, so a helper function is provided that estimates the cost based on prices as of 10 March 2023. Current prices can be found in the AWS documentation.

In [None]:
print("Estimated cost is $", aws.calculate_ner_cost(llm_results))
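
To give a feel for how such an estimate can be built, here is a minimal sketch that assumes billing in units of 100 characters at a flat price per unit. It is not the implementation of `aws.calculate_ner_cost`; the function `estimate_ner_cost`, both constants, and the per-note rounding are assumptions that should be checked against the current AWS pricing page.

```python
import math

# Both values are assumptions for illustration; check the AWS pricing page.
CHARS_PER_UNIT = 100
PRICE_PER_UNIT_USD = 0.01


def estimate_ner_cost(
    notes, price_per_unit=PRICE_PER_UNIT_USD, chars_per_unit=CHARS_PER_UNIT
):
    """Sum the billable units per note (rounded up) and multiply by the unit price."""
    units = sum(math.ceil(len(note) / chars_per_unit) for note in notes)
    return units * price_per_unit


sample_notes = ["a" * 250, "b" * 100]  # 3 units + 1 unit
print(f"Estimated cost is ${estimate_ner_cost(sample_notes):.2f}")
```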

We commit to this cost and begin the extraction process.

In [None]:
aws_extract = aws.ComprehendExtractor()
ner_records = [
    aws_extract.extract_record(medical_note) for medical_note in llm_results
]

The result is a list of dictionaries of extracted entities. Each entity records its text span and the NER service's confidence score, as the first entity of the first record shows.

In [None]:
ner_records[0]["Entities"][0]
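
The confidence score can be used to discard uncertain extractions. The sketch below assumes each entity carries `Text`, `Type`, `Category`, `Score`, `BeginOffset` and `EndOffset` keys, as in the documented Comprehend Medical response. The sample entities are hand-written for illustration and are not real output, and `confident_entities` is a hypothetical helper, not part of the package.

```python
# Hand-written example in the shape of a Comprehend Medical entity; not real output.
sample_entities = [
    {
        "Text": "John Smith",
        "Category": "PROTECTED_HEALTH_INFORMATION",
        "Type": "NAME",
        "Score": 0.98,
        "BeginOffset": 9,
        "EndOffset": 19,
    },
    {
        "Text": "asthma",
        "Category": "MEDICAL_CONDITION",
        "Type": "DX_NAME",
        "Score": 0.42,
        "BeginOffset": 40,
        "EndOffset": 46,
    },
]


def confident_entities(entities, min_score=0.5):
    """Return (type, text) pairs for entities at or above the score threshold."""
    return [(e["Type"], e["Text"]) for e in entities if e["Score"] >= min_score]


print(confident_entities(sample_entities))
```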

The raw output of the NER is converted to a standardised format used within this package prior to scoring. The default set of identifiers (`aws.DEFAULT_IDENTIFIERS`) is used here, but your own identifiers can be included if necessary.

In [None]:
common_results = aws.prepare_common_records(
    aws.DEFAULT_IDENTIFIERS, ner_records
)
print(common_results[0])

We then need to convert these results into a table. Because some identifiers may have multiple values (for example disease, or prescription when a person receives multiple medications), we must encode them in a particular manner. By default we use rarest encoding, as it generates fewer columns and runs more quickly. One-hot encoding is also available, but the maximum number of columns should be tightly constrained to avoid excessive run times.

In [None]:
df_records = preprocess(common_results).fillna({"nhs_number": ""})
df_records.head()
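
As a rough intuition for the encoding choice, the sketch below keeps only the least frequent value for each multi-valued field, so the field still occupies a single column. This is one assumed reading of "rarest encoding" and not the package's implementation; `rarest_encode` and the sample data are hypothetical.

```python
from collections import Counter


def rarest_encode(records):
    """Replace each list of values with its least frequent value in the dataset."""
    counts = Counter(value for values in records for value in set(values))
    # Ties are broken alphabetically so the result is deterministic.
    return [
        min(values, key=lambda v: (counts[v], v)) if values else None
        for values in records
    ]


diseases = [["asthma", "flu"], ["flu"], ["flu", "gout"], []]
print(rarest_encode(diseases))  # ['asthma', 'flu', 'gout', None]
```

One-hot encoding the same data would need one column per distinct value (three here), which is why the column count grows quickly on larger datasets.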

Finally, we build a risk scorer. This class is a wrapper around the pycorrectmatch project that allows us to use meaningful column names. Since this demo set contains only 10 records, which are likely all unique, we limit ourselves to two columns: gender and nhs_number.

The scorer is fitted to the provided dataset of genders and NHS numbers, and the individual uniqueness is calculated for each row. This represents the likelihood of correctly re-identifying that individual. For example, a score of 0.5 means two people share these features, so there is a 50% chance of picking the right one.

In [None]:
risk_scorer = PrivacyRiskScorer()
print("Fitting")
risk_scorer.fit(df_records[["gender", "nhs_number"]])
print("Predicting")
risk_scorer.predict(df_records[["gender", "nhs_number"]])
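
The interpretation of the score can be illustrated by counting directly. The sketch below is not how the scorer computes its values (it fits a model to the data rather than counting rows); it only shows the intuition that a record shared by `k` people has a `1/k` chance of being matched to the right person. The rows are made up.

```python
from collections import Counter

# Made-up (gender, nhs_number) rows for illustration.
rows = [("female", "123"), ("male", "456"), ("male", "456"), ("female", "789")]

group_sizes = Counter(rows)
scores = [1 / group_sizes[row] for row in rows]
print(scores)  # [1.0, 0.5, 0.5, 1.0]
```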

At this stage we know which records most compromise privacy and might require action to de-identify.

The analysis can be extended one step further to inform the best de-identification steps. The relative contribution of the different identifiers can be calculated using the explain module.

In [None]:
from privacy_fingerprint.explain import PrivacyRiskExplainer

In [None]:
transformed_dataset = risk_scorer.map_records_to_copula(
    df_records[["gender", "nhs_number"]]
)
N_FEATURES = df_records[["gender", "nhs_number"]].shape[1]
print(N_FEATURES)

In [None]:
# SHAP takes a while to run - a progress bar appears when running SHAP
explainer = PrivacyRiskExplainer(risk_scorer.predict_transformed, N_FEATURES)
# Calculate Shapley values using the transformed dataset
local_shapley_df, global_shap, exp_obj = explainer.explain(transformed_dataset)
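
For intuition about what a Shapley value measures, the sketch below computes exact values for two identifiers by averaging each one's marginal contribution over every ordering. SHAP approximates this for real models; here the `toy_risk` table is invented and `shapley_values` is a hypothetical helper, not part of the package.

```python
from itertools import permutations


def shapley_values(features, value):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        seen = frozenset()
        for f in order:
            totals[f] += value(seen | {f}) - value(seen)
            seen = seen | {f}
    return {f: total / len(orderings) for f, total in totals.items()}


# Hypothetical risk score for each subset of known identifiers.
toy_risk = {
    frozenset(): 0.0,
    frozenset({"gender"}): 0.1,
    frozenset({"nhs_number"}): 0.8,
    frozenset({"gender", "nhs_number"}): 1.0,
}

contributions = shapley_values(["gender", "nhs_number"], toy_risk.__getitem__)
print(contributions)
```

The contributions sum to the difference between the risk with all identifiers known and the risk with none, which is what lets them be read as each identifier's share of the total.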

The results can then be visualised. First, we can look at the contribution of each identifier across the entire dataset.

In [None]:
explainer.plot_global_explanation(exp_obj)

The contribution of each identifier for each record can also be visualised.

In [None]:
explainer.plot_local_explanation(exp_obj, 5)