# Example PrivacyFingerprint Experiment Pipeline

This notebook walks through an example experimental workflow to show how the ExperimentalConfigHandler can be used to set up and run experiments.

This pipeline compares the extraction outputs produced by:
- GLiNER
- UniversalNER hosted locally
- UniversalNER hosted via Ollama

In [1]:
import os
import sys

path_root = os.path.dirname(os.getcwd())

if path_root not in sys.path:
    sys.path.append(path_root)

### Importing necessary functions from ./src folder

In [2]:
# Functions needed for the experimental config handler.
from src.config.experimental_config_handler import ExperimentalConfigHandler
from src.config.global_config import load_global_config

# Functions needed to create prompt templates and save them for the experiments.
from src.config.prompt_template_handler import (
    save_extraction_template_to_json,
    save_generate_template_to_json,
    load_and_validate_extraction_prompt_template,
    load_and_validate_generate_prompt_template,
)

### Importing and Loading Global and Experimental Config

**global_config_path** is the location of the global config file. After it is loaded, the output folder is redefined so that the outputs of this example are written to a separate, easy-to-find folder. (For your own experiments you would normally use the default output folder.)

**default_config_path** points to the default experimental config values. The pipeline currently copies this experimental config into the experiment folder; if a copy already exists there, only the copy in that folder is used.
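
The copy-then-reuse behaviour described above can be pictured with a small sketch. It is only an illustration under stated assumptions, not the handler's actual implementation: `ensure_experiment_config` is a hypothetical helper, the YAML content is made up, and the file name `experimental_config.yaml` is taken from the log message printed later in this notebook.

```python
import shutil
import tempfile
from pathlib import Path


def ensure_experiment_config(default_config_path, experiment_dir):
    """Hypothetical helper: copy the default config once, then reuse the copy."""
    experiment_dir = Path(experiment_dir)
    target = experiment_dir / "experimental_config.yaml"
    if target.exists():
        # An earlier run already copied a config here, so it takes precedence.
        print(f"Configuration file already exists at {target}")
        return target
    experiment_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(default_config_path, target)
    return target


# Demonstration in a temporary directory so nothing in the repo is touched.
with tempfile.TemporaryDirectory() as tmp:
    default = Path(tmp) / "default.yaml"
    default.write_text("population_num: '50'\n")  # made-up config content
    exp_dir = Path(tmp) / "example_experiment"

    first = ensure_experiment_config(default, exp_dir)
    default.write_text("population_num: '99'\n")  # later edit to the default
    second = ensure_experiment_config(default, exp_dir)

    copied_text = second.read_text()  # still the values from the first copy
```

In this sketch, a later change to the default file does not reach an experiment whose folder already holds a copy, which mirrors the behaviour described above.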

In [8]:
# Defines the location of the global config file and loads it.
global_config_path = "../config/global_config.yaml"
global_config = load_global_config(global_config_path)
# Redirects the outputs of this example to a separate folder.
global_config.output_paths.output_folder = "../example_output"

# Defines the location of the default experimental config file, which supplies the default values for the experiment.
default_config_path = "../config/experimental_config.yaml"

Next, define your overrides. Both are optional: if you do not define any, the pipeline runs with the defaults provided in the experimental config.


* **iter_overrides**: These are parameters whose lists of values are combined so that every unique combination is run.

    * For example: if "var1" = ["a", "b"] and "var2" = ["c", "d"], it would produce combinations = [["a", "c"], ["a", "d"], ["b", "c"], ["b", "d"]]



* **combine_overrides**: These are parameters whose values are paired up by position rather than fully combined. They only affect extraction.gliner_features, extraction.ollama_features, and extraction.local_features.

    * For example: if "model_name" = ["model1", "model2"] and "prompt_template" = ["prompt_template1", "prompt_template2"], it would produce combinations = [["model1", "prompt_template1"], ["model2", "prompt_template2"]]
    * The lists must all be the same length, otherwise you will get an error.
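
The difference between the two override types can be sketched with the standard library. This is a minimal illustration of the combination rules described above, assuming a Cartesian product for `iter_overrides` and positional pairing for `combine_overrides`; it is not the handler's own code.

```python
from itertools import product

iter_overrides = {"var1": ["a", "b"], "var2": ["c", "d"]}
combine_overrides = {
    "model_name": ["model1", "model2"],
    "prompt_template": ["prompt_template1", "prompt_template2"],
}

# iter_overrides: every value of each parameter is paired with every value of the others.
iter_combinations = [list(combo) for combo in product(*iter_overrides.values())]

# combine_overrides: values are paired by position, so the lists must be equal length.
lengths = {len(values) for values in combine_overrides.values()}
if len(lengths) > 1:
    raise ValueError("combine_overrides lists must all have the same length")
combine_combinations = [list(combo) for combo in zip(*combine_overrides.values())]

print(iter_combinations)
# [['a', 'c'], ['a', 'd'], ['b', 'c'], ['b', 'd']]
print(combine_combinations)
# [['model1', 'prompt_template1'], ['model2', 'prompt_template2']]
```

The explicit length check is there because `zip` silently truncates to the shortest list, which would hide a misconfigured override.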



You **must** define your experiment name, as this determines where your experiment folder is created.



In [4]:
# Define your iter overrides
iter_overrides = {
    "outputs.experiment_name": "example_experiment_14_05_24",
    "extraction.server_model_type": ["gliner", "ollama", "local"],
}

# Define your combine overrides
combine_overrides = {}

# This initialises the experimental config handler.
config_handler = ExperimentalConfigHandler(
    default_config_path=default_config_path,
    iter_overrides=iter_overrides,
    combine_overrides=combine_overrides,
    global_config=global_config,
)

# Prints the resulting config structures for each component.
print("---- SyntheaConfig ----")
for config in config_handler.load_component_experimental_config("synthea"):
    print(config)

print("\n---- GenerateConfig ----")
for config in config_handler.load_component_experimental_config("generate"):
    print(config)

print("\n---- ExtractionConfig ----")
for config in config_handler.load_component_experimental_config("extraction"):
    print(config)

Configuration file already exists at ../example_output/example_experiment_14_05_24/experimental_config.yaml
---- SyntheaConfig ----
population_num='50' county='West Yorkshire' path_output='../example_output/example_experiment_14_05_24/synthea/synthea_0.json'

---- GenerateConfig ----
llm_model_features=GenerateModelFeaturesConfig(llm_model_name='llama2', prompt_template_path='llama2_template.json') synthea_path='../example_output/example_experiment_14_05_24/synthea/synthea_0.json' path_output='../example_output/example_experiment_14_05_24/generate/generate_0.json'

---- ExtractionConfig ----
server_model_type='gliner' gliner_features=GlinerFeaturesConfig(gliner_model='urchade/gliner_medium-v2.1') local_features=LocalFeaturesConfig(hf_repo_id=None, hf_filename=None, prompt_template_path=None) ollama_features=OllamaFeaturesConfig(ollama_ner_model=None, prompt_template_path=None) entity_list=['person', 'date of birth', 'nhs number', 'diagnosis'] llm_path='../example_output/example_experim

## 1. Create Prompt Templates used in the Experimental Pipeline.

This pipeline requires two prompt templates:
* A prompt template for Llama 2, which is used to generate the medical notes.
* A prompt template for UniversalNER, which is used to extract the entities via both the local and Ollama models.

The two cells below define these templates and save them to a templates folder.

In [5]:
# Defines the path of where llama2 template lives in the generate folder.
generate_template_path = (
    f"{global_config.output_paths.generate_template}llama2_template.json"
)

# This defines a template used by Llama 2
generate_template = """[INST]
<<SYS>>
You are a medical student answering an exam question about writing clinical notes for patients.
<</SYS>>

Keep in mind that your answer will be accessed based on incorporating all the provided information and the quality of prose.

1. Use prose to write an example clinical note for this patient's doctor.
2. Use less than three sentences.
3. Do not provide recommendations.
4. Use the following information:

{data}
[/INST]
"""

# Saves the template to the path defined.
save_generate_template_to_json(
    template_str=generate_template, file_path=generate_template_path
)

# Loads the template so the user can inspect the template saved.
loaded_generate_template = load_and_validate_generate_prompt_template(
    filename=generate_template_path
)
print(loaded_generate_template)

Template saved to '../config/templates/generate/llama2_template.json'
[INST]
<<SYS>>
You are a medical student answering an exam question about writing clinical notes for patients.
<</SYS>>

Keep in mind that your answer will be accessed based on incorporating all the provided information and the quality of prose.

1. Use prose to write an example clinical note for this patient's doctor.
2. Use less than three sentences.
3. Do not provide recommendations.
4. Use the following information:

{data}
[/INST]



In [6]:
# Defines where the UniversalNER template is saved in the extraction templates folder.
extraction_template_path = f"{global_config.output_paths.extraction_template}universal_ner_template.json"

# This defines the conversation-style template used by UniversalNER.
universalner_prompt_template = """
    USER: Text: {input_text}
    ASSISTANT: I’ve read this text.
    USER: What describes {entity_name} in the text?
    ASSISTANT: (model's predictions in JSON format)
    """

# Saves the template to the path defined.
save_extraction_template_to_json(
    template_str=universalner_prompt_template,
    file_path=extraction_template_path,
)

# Loads the template so the user can inspect the template saved.
loaded_extraction_template = load_and_validate_extraction_prompt_template(
    filename=extraction_template_path
)
print(loaded_extraction_template)

Template saved to '../config/templates/extraction/universal_ner_template.json'

    USER: Text: {input_text}
    ASSISTANT: I’ve read this text.
    USER: What describes {entity_name} in the text?
    ASSISTANT: (model's predictions in JSON format)
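
To see what the two placeholders in this template do, the sketch below fills them with plain `str.format`, once per entity type. This is an assumption made for illustration: the pipeline may fill the template through a different mechanism, and the shortened note text here is only sample input.

```python
universalner_prompt_template = """
    USER: Text: {input_text}
    ASSISTANT: I’ve read this text.
    USER: What describes {entity_name} in the text?
    ASSISTANT: (model's predictions in JSON format)
    """

# Entity types taken from the ExtractionConfig printed earlier in this notebook.
entity_list = ["person", "date of birth", "nhs number", "diagnosis"]
# Shortened sample note, used only for this illustration.
note = "Patient Gerda Jacobson, NHS number 8295299905, presents with a fracture of the right rib."

# One prompt per entity type: the same note is queried once for each label.
prompts = [
    universalner_prompt_template.format(input_text=note, entity_name=entity)
    for entity in entity_list
]

print(prompts[0])
```

Under this assumption, four entity types produce four prompts per note, each asking about a single label.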
    


## 2. GenerateSynthea: Generating Synthetic Patient Data using Synthea 

This loads every Synthea configuration defined for the experiment, runs each one through the pipeline, and saves the data to the experiment folder (here ../example_output/example_experiment_14_05_24/synthea/).

In [5]:
config_handler.run_component_experiment_config(component_type="synthea")

synthea run 0 with config population_num='50' county='West Yorkshire' path_output='../example_output/example_experiment_14_05_24/synthea/synthea_0.json'
[{'NHS_NUMBER': '8295299905', 'DATE_OF_BIRTH': '2022-08-16', 'GIVEN_NAME': 'Gerda', 'FAMILY_NAME': 'Jacobson', 'DIAGNOSIS': 'Fracture of rib'}, {'NHS_NUMBER': '5714931319', 'DATE_OF_BIRTH': '2020-10-09', 'GIVEN_NAME': 'Grisel', 'FAMILY_NAME': 'Turner', 'DIAGNOSIS': 'Otitis media'}, {'NHS_NUMBER': '4231818843', 'DATE_OF_BIRTH': '2012-05-15', 'GIVEN_NAME': 'Cruz', 'FAMILY_NAME': 'Predovic', 'DIAGNOSIS': 'Streptococcal sore throat (disorder)'}, {'NHS_NUMBER': '6337722493', 'DATE_OF_BIRTH': '2022-12-11', 'GIVEN_NAME': 'Pierre', 'FAMILY_NAME': 'Turcotte', 'DIAGNOSIS': 'Otitis media'}, {'NHS_NUMBER': '2696311943', 'DATE_OF_BIRTH': '2002-05-28', 'GIVEN_NAME': 'Benjamin', 'FAMILY_NAME': 'Volkman', 'DIAGNOSIS': 'Viral sinusitis (disorder)'}, {'NHS_NUMBER': '8437514932', 'DATE_OF_BIRTH': '1994-09-21', 'GIVEN_NAME': 'Myra', 'FAMILY_NAME': 'Smitha

## 3. GenerateLLM: Generating Synthetic Patient Medical Notes 

This loads every generate configuration defined for the experiment, runs each one through the pipeline, and saves the data to the experiment folder (here ../example_output/example_experiment_14_05_24/generate/).

In [6]:
config_handler.run_component_experiment_config(component_type="generate")

generate run 0 with config llm_model_features=GenerateModelFeaturesConfig(llm_model_name='llama2', prompt_template_path='llama2_template.json') synthea_path='../example_output/example_experiment_14_05_24/synthea/synthea_0.json' path_output='../example_output/example_experiment_14_05_24/generate/generate_0.json'
["Clinical Note:\nPatient Gerda Jacobson, NHS number 8295299905, presents with a fracture of the right rib. The patient was involved in a motor vehicle accident on August 16th, 2022. Patient's date of birth is August 16th, 1995.", "Clinical Note:\nPatient Grisel Turner, NHS number 5714931319, presents today with a 3-day history of right-sided otitis media. The patient is a 20-year-old female and was born on October 9, 2020. The diagnosis of otitis media is consistent with the patient's symptoms of ear pain, fever, and difficulty hearing. Further assessment and treatment are necessary to manage this condition effectively.", " Clinical Note:\n\nPatient: Cruz Predovic\nNHS Number: 
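
The generated notes do not always repeat the Synthea values word for word. The sketch below checks this for the first record and first note shown above, using a simple case-insensitive substring test. The check itself is an illustration added here, not part of the pipeline.

```python
# First Synthea record and first generated note, copied from the outputs above.
record = {
    "NHS_NUMBER": "8295299905",
    "DATE_OF_BIRTH": "2022-08-16",
    "GIVEN_NAME": "Gerda",
    "FAMILY_NAME": "Jacobson",
    "DIAGNOSIS": "Fracture of rib",
}
note = (
    "Clinical Note:\nPatient Gerda Jacobson, NHS number 8295299905, presents with "
    "a fracture of the right rib. The patient was involved in a motor vehicle "
    "accident on August 16th, 2022. Patient's date of birth is August 16th, 1995."
)

# Case-insensitive verbatim check: True only if the exact source string appears.
verbatim = {field: value.lower() in note.lower() for field, value in record.items()}

for field, found in verbatim.items():
    print(f"{field}: {'found' if found else 'not found verbatim'}")
```

Here the NHS number and names appear verbatim, while the date of birth and diagnosis are rephrased, so a plain string match would miss them.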

## 4. Extraction: Re-extracting Entities from the Patient Medical Notes

This loads every extraction configuration defined for the experiment, runs each one through the pipeline, and saves the data to the experiment folder (here ../example_output/example_experiment_14_05_24/extraction/).

In [7]:
config_handler.run_component_experiment_config(component_type="extraction")

extraction run 0 with config server_model_type='gliner' gliner_features=GlinerFeaturesConfig(gliner_model='urchade/gliner_medium-v2.1') local_features=LocalFeaturesConfig(hf_repo_id=None, hf_filename=None, prompt_template_path=None) ollama_features=OllamaFeaturesConfig(ollama_ner_model=None, prompt_template_path=None) entity_list=['person', 'date of birth', 'nhs number', 'diagnosis'] llm_path='../example_output/example_experiment_14_05_24/generate/generate_0.json' path_output='../example_output/example_experiment_14_05_24/extraction/extraction_0.json'




[{'Entities': [{'start': 15, 'end': 37, 'text': 'Patient Gerda Jacobson', 'label': 'person', 'score': 0.9321390986442566}, {'start': 39, 'end': 60, 'text': 'NHS number 8295299905', 'label': 'nhs number', 'score': 0.9702428579330444}, {'start': 78, 'end': 103, 'text': 'fracture of the right rib', 'label': 'diagnosis', 'score': 0.6524916887283325}, {'start': 161, 'end': 178, 'text': 'August 16th, 2022', 'label': 'date of birth', 'score': 0.7675861120223999}, {'start': 207, 'end': 224, 'text': 'August 16th, 1995', 'label': 'date of birth', 'score': 0.9273738265037537}]}, {'Entities': [{'start': 15, 'end': 36, 'text': 'Patient Grisel Turner', 'label': 'person', 'score': 0.9105566740036011}, {'start': 38, 'end': 59, 'text': 'NHS number 5714931319', 'label': 'nhs number', 'score': 0.9622470736503601}, {'start': 100, 'end': 124, 'text': 'right-sided otitis media', 'label': 'diagnosis', 'score': 0.6138395667076111}, {'start': 178, 'end': 193, 'text': 'October 9, 2020', 'label': 'date of birth'



[{'Entities': [{'start': 23, 'end': 37, 'text': 'Gerda Jacobson', 'label': 'person', 'score': 1}, {'start': 207, 'end': 224, 'text': 'August 16th, 1995', 'label': 'date of birth', 'score': 1}, {'start': 39, 'end': 60, 'text': 'NHS number 8295299905', 'label': 'nhs number', 'score': 1}]}, {'Entities': [{'start': 23, 'end': 36, 'text': 'Grisel Turner', 'label': 'person', 'score': 1}, {'start': 155, 'end': 161, 'text': 'female', 'label': 'person', 'score': 1}, {'start': 178, 'end': 193, 'text': 'October 9, 2020', 'label': 'date of birth', 'score': 1}, {'start': 38, 'end': 59, 'text': 'NHS number 5714931319', 'label': 'nhs number', 'score': 1}, {'start': 112, 'end': 124, 'text': 'otitis media', 'label': 'diagnosis', 'score': 1}, {'start': 212, 'end': 224, 'text': 'otitis media', 'label': 'diagnosis', 'score': 1}]}, {'Entities': [{'start': 26, 'end': 39, 'text': 'Cruz Predovic', 'label': 'person', 'score': 1}, {'start': 78, 'end': 90, 'text': 'May 15, 2012', 'label': 'date of birth', 'score

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from ../models/UniversalNER-7B-all-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    2:              blk.0.attn_k.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    3:              blk.0.attn_v.weight q6_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    4:         blk.0.attn_output.weight q4_K     [  4096,  4096,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    6:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    7:            blk.0.ffn_down.weight q6_K     [ 11008,  4096, 

[{'Entities': [{'start': 23, 'end': 37, 'text': 'Gerda Jacobson', 'label': 'person', 'score': 1}, {'start': 207, 'end': 224, 'text': 'August 16th, 1995', 'label': 'date of birth', 'score': 1}, {'start': 50, 'end': 60, 'text': '8295299905', 'label': 'nhs number', 'score': 1}]}, {'Entities': [{'start': 23, 'end': 36, 'text': 'Grisel Turner', 'label': 'person', 'score': 1}, {'start': 178, 'end': 193, 'text': 'October 9, 2020', 'label': 'date of birth', 'score': 1}, {'start': 49, 'end': 59, 'text': '5714931319', 'label': 'nhs number', 'score': 1}]}, {'Entities': [{'start': 26, 'end': 39, 'text': 'Cruz Predovic', 'label': 'person', 'score': 1}, {'start': 78, 'end': 90, 'text': 'May 15, 2012', 'label': 'date of birth', 'score': 1}, {'start': 195, 'end': 220, 'text': 'Streptococcal sore throat', 'label': 'diagnosis', 'score': 1}]}, {'Entities': [{'start': 72, 'end': 84, 'text': 'otitis media', 'label': 'diagnosis', 'score': 1}, {'start': 96, 'end': 104, 'text': 'ear pain', 'label': 'diagnosis


llama_print_timings:        load time =    7455.61 ms
llama_print_timings:      sample time =       4.55 ms /    20 runs   (    0.23 ms per token,  4399.47 tokens per second)
llama_print_timings: prompt eval time =     294.50 ms /    24 tokens (   12.27 ms per token,    81.49 tokens per second)
llama_print_timings:        eval time =    1454.68 ms /    19 runs   (   76.56 ms per token,    13.06 tokens per second)
llama_print_timings:       total time =    1827.53 ms
ggml_metal_free: deallocating
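
A quick way to compare the three runs is to group the extracted text by label. The sketch below does this for the first note only, using the entities copied from the three outputs above. Run 0 is confirmed as GLiNER by its log line; the keys `run_1` and `run_2` are placeholder names, since the outputs do not state which of the remaining model types produced which list.

```python
from collections import defaultdict

# First generated note, copied from the generate output above.
note = (
    "Clinical Note:\nPatient Gerda Jacobson, NHS number 8295299905, presents with "
    "a fracture of the right rib. The patient was involved in a motor vehicle "
    "accident on August 16th, 2022. Patient's date of birth is August 16th, 1995."
)

# Entities for the first note as (start, end, text, label), in output order.
runs = {
    "run_0_gliner": [
        (15, 37, "Patient Gerda Jacobson", "person"),
        (39, 60, "NHS number 8295299905", "nhs number"),
        (78, 103, "fracture of the right rib", "diagnosis"),
        (161, 178, "August 16th, 2022", "date of birth"),
        (207, 224, "August 16th, 1995", "date of birth"),
    ],
    "run_1": [
        (23, 37, "Gerda Jacobson", "person"),
        (207, 224, "August 16th, 1995", "date of birth"),
        (39, 60, "NHS number 8295299905", "nhs number"),
    ],
    "run_2": [
        (23, 37, "Gerda Jacobson", "person"),
        (207, 224, "August 16th, 1995", "date of birth"),
        (50, 60, "8295299905", "nhs number"),
    ],
}


def texts_by_label(entities):
    """Group the extracted strings by their label."""
    grouped = defaultdict(set)
    for start, end, text, label in entities:
        # The offsets index into the note, so the slice should reproduce the text.
        assert note[start:end] == text
        grouped[label].add(text)
    return grouped


grouped = {name: texts_by_label(entities) for name, entities in runs.items()}

for label in ["person", "date of birth", "nhs number", "diagnosis"]:
    print(label)
    for name in runs:
        print(f"  {name}: {sorted(grouped[name][label])}")
```

For this note, run 0 also labels the accident date as a date of birth and is the only run that returns a diagnosis, while the other two runs return tighter spans for the name.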


## 5. Visualising the Experiment Workflow

This method on the config handler lets you inspect the data flow of your workflow. It gives an overview of which configuration feeds into which output.

In [9]:
config_handler.load_pipeline_visualisation()

## 6. Reloading the Data

With the workflow above as a guide, you can specify by name which data file you would like to reload into the notebook.

In [13]:
config_handler.load_specified_data_file(filename="extraction_0")

[{'Entities': [{'start': 15,
    'end': 37,
    'text': 'Patient Gerda Jacobson',
    'label': 'person',
    'score': 0.9321390986442566},
   {'start': 39,
    'end': 60,
    'text': 'NHS number 8295299905',
    'label': 'nhs number',
    'score': 0.9702428579330444},
   {'start': 78,
    'end': 103,
    'text': 'fracture of the right rib',
    'label': 'diagnosis',
    'score': 0.6524916887283325},
   {'start': 161,
    'end': 178,
    'text': 'August 16th, 2022',
    'label': 'date of birth',
    'score': 0.7675861120223999},
   {'start': 207,
    'end': 224,
    'text': 'August 16th, 1995',
    'label': 'date of birth',
    'score': 0.9273738265037537}]},
 {'Entities': [{'start': 15,
    'end': 36,
    'text': 'Patient Grisel Turner',
    'label': 'person',
    'score': 0.9105566740036011},
   {'start': 38,
    'end': 59,
    'text': 'NHS number 5714931319',
    'label': 'nhs number',
    'score': 0.9622470736503601},
   {'start': 100,
    'end': 124,
    'text': 'right-sided otitis