In [None]:
import os
import sys
from spacy import displacy

path_root = os.path.dirname(os.getcwd())

if path_root not in sys.path:
    sys.path.append(path_root)

In [None]:
from src.config import (
    path_output_synthea,
    path_output_llm,
    path_output_extraction,
    path_output_standardisation,
)
from src.generate.synthea import GenerateSynthea
from src.generate.llm import GenerateLLM
from src.extraction.extraction import Extraction
from src.standardise_extraction.standardise_extraction import (
    StandardiseExtraction,
)

# Privacy Fingerprint End-to-End Overview

The pipeline is broken down into four components:
1. **GenerateSynthea**: Generates a list of dictionaries of synthetic patient records.
2. **GenerateLLM**: Generates medical notes from the outputs created by **GenerateSynthea**.
3. **Extraction**: Currently uses an LLM that is specialised to extract given entities from the synthetic medical notes produced by **GenerateLLM**.
4. **StandardiseExtraction**: Standardises the entities extracted from the medical notes.

Each of these classes takes a path_output. When save_output is set to True, the class saves its output to that path.
These paths are defined in the src/config.py file:
- path_output_synthea = data_folder + "/synthea.json"
- path_output_llm = data_folder + "/llm.json"
- path_output_extraction = data_folder + "/generative.json"
- path_output_standardisation = data_folder + "/standardisation.json"

Additionally, each class takes a path to the input it needs to create its output. This allows the user to break up the pipeline and run it from a specific point.
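
The shared save/load pattern described above can be sketched as follows. This is only an illustration under stated assumptions, not the repository's implementation: the class name PipelineStage, its _process hook, the stage_input and path_input argument names, and the use of JSON files are all hypothetical stand-ins for the real classes in src.

```python
import json
import os
import tempfile


class PipelineStage:
    """Hypothetical class illustrating the path_output / save_output pattern."""

    def __init__(self, stage_input=None, path_input=None, path_output=None, save_output=False):
        self.stage_input = stage_input
        self.path_input = path_input
        self.path_output = path_output
        self.save_output = save_output

    def _process(self, data):
        # Placeholder for the stage's real work (assumption: identity here).
        return data

    def run(self):
        # Prefer an in-memory input; otherwise read the previous stage's saved file.
        data = self.stage_input
        if data is None:
            with open(self.path_input) as f:
                data = json.load(f)
        output = self._process(data)
        if self.save_output:
            with open(self.path_output, "w") as f:
                json.dump(output, f)
        return output

    def load(self):
        # Read a previously saved output without re-running the stage.
        with open(self.path_output) as f:
            return json.load(f)


# Usage: run once with saving enabled, then reload from disk.
tmp_path = os.path.join(tempfile.mkdtemp(), "stage.json")
records = [{"GIVEN_NAME": "Jane"}]
saved = PipelineStage(stage_input=records, path_output=tmp_path, save_output=True).run()
reloaded = PipelineStage(path_output=tmp_path).load()
```

Accepting either an in-memory object or a path is what lets a later stage start from a file written by an earlier session.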

## 1. GenerateSynthea: Generating Synthetic Patient Data using Synthea 

Synthea-international is an expansion of Synthea, which is an open-source synthetic patient generator that produces de-identified health records for synthetic patients.

GenerateSynthea is a class used to run Synthea. You will need to follow the instructions in the README to ensure Synthea is installed. The arguments passed to run are:
- "./run_synthea" is the command-line script that runs Synthea.
- "-p" is the population flag; the value that follows it sets the number of patients.
- "5" is the number of patients you want to generate. (Alter this to generate more records.)
- "West Yorkshire" is the region. Synthea only works on a regional basis, so you have to give county information for it to generate address data.
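
As a rough sketch of how these arguments fit together, the snippet below assembles the same command as a list and shows one way it could be executed with subprocess. The helper names build_synthea_command and run_synthea are hypothetical, and running from the Synthea folder via cwd is an assumption about how GenerateSynthea invokes the script.

```python
import subprocess


def build_synthea_command(n_patients, region):
    """Assemble the argument list passed to Synthea (hypothetical helper)."""
    return ["./run_synthea", "-p", str(n_patients), region]


def run_synthea(path_synthea, n_patients, region):
    """Run Synthea from its install folder; raises CalledProcessError on failure."""
    command = build_synthea_command(n_patients, region)
    return subprocess.run(command, cwd=path_synthea, check=True)


# Only the command is built here; run_synthea needs a local Synthea install.
command = build_synthea_command(5, "West Yorkshire")
print(command)
```

Passing the region as a single list element keeps "West Yorkshire" together as one argument without any shell quoting.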

The src/config.py file defines the following config values:
- path_synthea = "../../synthea" - This defines the location of Synthea relative to the src folder.
- path_csv = path_synthea + "/output/csv" - This defines the location where outputs are saved when Synthea is run.
- path_patients = path_csv + "/patients.csv" - This is a .csv file that holds each patient's synthetic demographic information, among other fields.
- path_encounters = path_csv + "/encounters.csv" - This is a .csv file that holds encounters, i.e. each occasion on which a patient attended for medical assessment or treatment.
- cols_patients = ["Id", "BIRTHDATE", "FIRST", "LAST"] - This defines the columns we extract from path_patients.
- cols_encounters = ["PATIENT", "ENCOUNTERCLASS", "REASONDESCRIPTION"] - This defines the columns we extract from path_encounters.
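
To illustrate how the two column lists could be used, the sketch below reads two tiny made-up CSV snippets, keeps only the configured columns, and attaches each encounter to its patient. Matching PATIENT against Id follows the Synthea CSV layout, but the select_columns helper and the way the records are combined are assumptions rather than the code in GenerateSynthea.

```python
import csv
import io

cols_patients = ["Id", "BIRTHDATE", "FIRST", "LAST"]
cols_encounters = ["PATIENT", "ENCOUNTERCLASS", "REASONDESCRIPTION"]

# Tiny made-up stand-ins for patients.csv and encounters.csv.
patients_csv = io.StringIO(
    "Id,BIRTHDATE,FIRST,LAST,CITY\n"
    "p1,1980-02-01,Jane,Doe,Leeds\n"
)
encounters_csv = io.StringIO(
    "Id,PATIENT,ENCOUNTERCLASS,REASONDESCRIPTION\n"
    "e1,p1,ambulatory,Asthma\n"
    "e2,p1,emergency,Sprain of ankle\n"
)


def select_columns(file_obj, columns):
    """Read a CSV and keep only the requested columns from each row."""
    return [{col: row[col] for col in columns} for row in csv.DictReader(file_obj)]


patients = {row["Id"]: row for row in select_columns(patients_csv, cols_patients)}
encounters = select_columns(encounters_csv, cols_encounters)

# Attach each encounter to its patient by matching PATIENT against Id.
records = [
    {**patients[enc["PATIENT"]], **enc}
    for enc in encounters
    if enc["PATIENT"] in patients
]
```

Each resulting record is a dictionary, which matches the list-of-dictionaries shape that GenerateSynthea is described as producing.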

In [None]:
output_synthea = GenerateSynthea(
    path_output=path_output_synthea, save_output=True
).run("./run_synthea", "-p", "5", "West Yorkshire")
output_synthea

This loads the saved Synthea output from path_output_synthea.

In [None]:
output_synthea = GenerateSynthea(path_output=path_output_synthea).load()
output_synthea

## 2. GenerateLLM: Generating Synthetic Patient Medical Notes 

Currently GenerateLLM uses Ollama, which can run a range of pre-trained models.
- model - This determines the model you want to use.
- template - This defines the prompt template you want to give to the LLM to generate each patient's medical note.
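
As a minimal sketch of how a template with a {data} slot can be filled for one patient, the snippet below uses Python's built-in str.format. The shortened template, the example record, and the choice to serialise the record as JSON are assumptions for illustration; GenerateLLM may build its prompts differently.

```python
import json

# Shortened stand-in for a prompt template; only the {data} slot matters here.
example_template = "[INST]\nWrite a clinical note using:\n\n{data}\n[/INST]\n"

# Made-up patient record.
patient = {"GIVEN_NAME": "Jane", "FAMILY_NAME": "Doe", "DIAGNOSIS": "Asthma"}

# Assumption: the record is serialised to JSON before being placed in the prompt.
prompt = example_template.format(data=json.dumps(patient))
print(prompt)
```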

In the src/config.py file, there is a *cols* parameter. This parameter currently maps Synthea column names to names used in the template to generate these medical notes.

```
cols = {
    "NHS_NUMBER": "NHS_NUMBER",
    "BIRTHDATE": "DATE_OF_BIRTH",
    "FIRST": "GIVEN_NAME",
    "LAST": "FAMILY_NAME",
    "REASONDESCRIPTION": "DIAGNOSIS",
}
```
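
The effect of this mapping on a single record can be sketched as below. The rename_keys helper and the example record are made up for illustration, and dropping keys that are absent from the mapping is an assumption.

```python
cols = {
    "NHS_NUMBER": "NHS_NUMBER",
    "BIRTHDATE": "DATE_OF_BIRTH",
    "FIRST": "GIVEN_NAME",
    "LAST": "FAMILY_NAME",
    "REASONDESCRIPTION": "DIAGNOSIS",
}

# Made-up record using Synthea-style column names.
synthea_record = {
    "NHS_NUMBER": "0000000000",
    "BIRTHDATE": "1980-02-01",
    "FIRST": "Jane",
    "LAST": "Doe",
    "REASONDESCRIPTION": "Asthma",
    "ENCOUNTERCLASS": "ambulatory",
}


def rename_keys(record, mapping):
    """Rename keys found in the mapping; keys outside the mapping are dropped."""
    return {mapping[key]: value for key, value in record.items() if key in mapping}


template_record = rename_keys(synthea_record, cols)
print(template_record)
```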

In [None]:
model = "llama2"
template = """[INST]
<<SYS>>
You are a medical student answering an exam question about writing clinical notes for patients.
<</SYS>>

Keep in mind that your answer will be assessed based on incorporating all the provided information and the quality of prose.

1. Use prose to write an example clinical note for this patient's doctor.
2. Use less than three sentences.
3. Do not provide any recommendations.
4. Use the following information:

{data}
[/INST]
"""

This runs GenerateLLM using the Synthea output from the previous run and saves the LLM output to path_output_llm.

In [None]:
output_llm = GenerateLLM(
    synthea_input=output_synthea, path_output=path_output_llm, save_output=True
).run(model, template)
output_llm

This runs GenerateLLM using a pre-saved Synthea output stored at path_output_synthea and returns output_llm without saving it. The results will differ slightly from the run above, typically because LLM text generation involves random sampling.

In [None]:
output_llm = GenerateLLM(
    synthea_path=path_output_synthea,
    path_output=path_output_llm,
    save_output=False,
).run(model, template)
output_llm

This loads the current saved output at path_output_llm.

In [None]:
output_llm = GenerateLLM(path_output=path_output_llm).load()
output_llm

## 3. Extraction: Re-extracting Entities from the Patient Medical Notes

This uses a local quantised UniversalNER model to extract entities from the synthetic medical notes. You will need to follow the README to host the UniversalNER model locally.

In the src/config.py file:
- entity_list = ["person", "nhs number", "date of birth", "diagnosis"] - This is the list of entities you want to extract from the synthetic medical notes.
- universal_ner_path = "../models/quantized_q4_1.gguf" - This is the path to the quantised UniversalNER model, located in a models folder at the top level of this repo.

This runs the Extraction class on the LLM output generated in this notebook and saves the extraction output to the path given.

In [None]:
output_extraction = Extraction(
    llm_input=output_llm, path_output=path_output_extraction, save_output=True
).run()
output_extraction

This runs the Extraction class on a pre-saved LLM output and returns output_extraction without saving it. In comparison to the run above, this will produce slightly different results.

In [None]:
output_extraction = Extraction(
    llm_path=path_output_llm,
    path_output=path_output_extraction,
    save_output=False,
).run()
output_extraction

This loads the extraction output at the given path.

In [None]:
output_extraction = Extraction(path_output=path_output_extraction).load()
output_extraction

This visualises the entities in an example clinical note using displaCy.

We format the note text and its extracted entities into a dictionary compatible with displaCy, and then render it.
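
For reference, displaCy's manual entity mode expects a dictionary with a "text" string and an "ents" list of dictionaries holding "start", "end" and "label". The sketch below shows one way such a dictionary could be built from entity strings and labels. The to_displacy_ents helper and the example note are hypothetical, and locating each entity with str.find is a simplification that only finds its first occurrence.

```python
def to_displacy_ents(text, entities):
    """Build displaCy's manual 'ent' input from (surface string, label) pairs."""
    ents = []
    for surface, label in entities:
        start = text.find(surface)
        if start == -1:
            continue  # Skip entities that do not appear verbatim in the text.
        ents.append({"start": start, "end": start + len(surface), "label": label})
    # Sort by start offset so the spans are in document order.
    return {"text": text, "ents": sorted(ents, key=lambda ent: ent["start"])}


note = "Jane Doe, born 1980-02-01, was seen for asthma."
example = to_displacy_ents(note, [("asthma", "diagnosis"), ("Jane Doe", "person")])
print(example["ents"])
```

A dictionary of this shape can be passed to displacy.render with manual=True and style="ent".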


In [None]:
string_id = 1

ents_dict = {
    "text": output_llm[string_id],
    "ents": output_extraction[string_id]["Entities"],
}

In [None]:
displacy.render(ents_dict, manual=True, style="ent")

## 4. StandardiseExtraction: Normalising Entities Extracted for Scoring

This takes the list of entity dictionaries produced above and normalises the responses into a dataframe format.

The standardisation process is broken down into several steps:
1. Entities are extracted from the object created by **Extraction**, and a set of functions can be applied to clean them during this process.
2. This creates a list of cleaned entities. Multiple entities of the same type, for example diagnosis, can be extracted for the same person. Currently the codebase only takes the first entity given.
3. Next, the outputs are normalised, e.g. dates can be written in multiple formats but have the same meaning.
4. Lastly, the data is encoded and formatted as a numpy array ready for PyCorrectMatch.
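
Steps 2 to 4 can be sketched on a single column of dates as follows. This is an illustration under stated assumptions: the accepted date formats, the normalise_date and encode_column helpers, and the integer encoding are hypothetical, and plain lists are used here where the pipeline would produce a numpy array.

```python
from datetime import datetime

# Assumed input formats, for illustration only.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d %B %Y"]


def normalise_date(value):
    """Return the date as ISO YYYY-MM-DD, or None if no assumed format matches."""
    if value is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def encode_column(values):
    """Map each distinct value to an integer code, in order of first appearance."""
    codes = {}
    return [codes.setdefault(value, len(codes)) for value in values]


# Made-up extracted dates of birth; each inner list is one note's entities.
extracted = [["1980-02-01", "1980"], ["01/02/1980"], ["1 February 1980"], ["1975-12-25"]]

first_entities = [entities[0] if entities else None for entities in extracted]
normalised = [normalise_date(value) for value in first_entities]
encoded = encode_column(normalised)
```

After normalisation the first three notes share one date, so they receive the same code even though the date was written three different ways.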

In the src/config.py file:

extra_preprocess_functions_per_entity defines how entities are cleaned while they are being extracted from the Extraction output.

```
extra_preprocess_functions_per_entity = {"person": [clean_name.remove_titles]}
```

standardise_functions_per_entity defines how entities are extracted, and defines any normalisation process you may want on a column of entities.
```
standardise_functions_per_entity = {
    "person": [extract_first_entity_from_list],
    "nhs number": [extract_first_entity_from_list],
    "date of birth": [
        extract_first_entity_from_list,
        normalise_columns.normalise_date_column,
    ],
    "diagnosis": [extract_first_entity_from_list],
}
```
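
One way to read these dictionaries is as an ordered list of functions per entity type, each applied to the result of the previous one. The sketch below shows that idea with stand-in functions. The function bodies, the lowercase normaliser, and apply_functions are hypothetical, and the real functions may operate on whole columns rather than single values.

```python
def extract_first_entity_from_list(entities):
    """Hypothetical stand-in: keep the first extracted entity, or None if empty."""
    return entities[0] if entities else None


def lowercase(value):
    """Hypothetical normaliser used only for this illustration."""
    return value.lower() if value is not None else None


# Same shape as standardise_functions_per_entity, with stand-in functions.
functions_per_entity = {
    "person": [extract_first_entity_from_list],
    "diagnosis": [extract_first_entity_from_list, lowercase],
}


def apply_functions(value, functions):
    """Apply each function in order, feeding the result of one into the next."""
    for function in functions:
        value = function(value)
    return value


# Made-up extraction result for one note.
extracted = {"person": ["Jane Doe"], "diagnosis": ["Asthma", "Hay fever"]}
standardised = {
    entity: apply_functions(values, functions_per_entity[entity])
    for entity, values in extracted.items()
}
print(standardised)
```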

This uses the output_extraction value created by the **Extraction** class and saves the outputs of the normalisation process as a .csv to the given path.

In [None]:
output_standards = StandardiseExtraction(
    extraction_input=output_extraction,
    path_output=path_output_standardisation,
    save_output=True,
).run()
output_standards

This loads an extraction input from the extraction_path provided, and creates the output_standards.

In [None]:
output_standards = StandardiseExtraction(
    extraction_path=path_output_extraction,
    path_output=path_output_standardisation,
    save_output=False,
).run()
output_standards

This loads a pre-saved output_standards from the given path.

In [None]:
output_standards = StandardiseExtraction(
    path_output=path_output_standardisation
).load()
output_standards