## Data Generation 

In this notebook, we generate our dataset from scratch. 

We follow the steps in the figure. 

![title](figures/data_generation.png)

## Step 1: Expert-defined ground-truth Bayesian network 

In [1]:
from utils.data import DiseaseModelGT

data_model = DiseaseModelGT()

In [2]:
print(data_model.model.get_cpds("season"))
print(data_model.model.get_cpds("pneu"))
print(data_model.model.get_cpds("inf"))
print(data_model.model.get_cpds("dysp"))
print(data_model.model.get_cpds("nasal"))
print(data_model.model.get_cpds("cough"))
print(data_model.model.get_cpds("fever"))
print(data_model.model.get_cpds("pain"))

+--------------+-----+
| season(cold) | 0.4 |
+--------------+-----+
| season(warm) | 0.6 |
+--------------+-----+
+-----------+--------------+--------------+
| season    | season(cold) | season(warm) |
+-----------+--------------+--------------+
| pneu(yes) | 0.015        | 0.005        |
+-----------+--------------+--------------+
| pneu(no)  | 0.985        | 0.995        |
+-----------+--------------+--------------+
+----------+--------------+--------------+
| season   | season(cold) | season(warm) |
+----------+--------------+--------------+
| inf(yes) | 0.5          | 0.05         |
+----------+--------------+--------------+
| inf(no)  | 0.5          | 0.95         |
+----------+--------------+--------------+
+-----------+-----------+----------+
| pneu      | pneu(yes) | pneu(no) |
+-----------+-----------+----------+
| dysp(yes) | 0.3       | 0.15     |
+-----------+-----------+----------+
| dysp(no)  | 0.7       | 0.85     |
+-----------+-----------+----------+
+------------+---
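
As a quick sanity check on the tables above, the marginal probability of each diagnosis can be obtained by summing out `season`. The sketch below uses plain Python with the CPD values copied from the printed output; the helper name `marginal_yes` is ours and is not part of `utils.data`.

```python
# CPD values copied from the printed tables above
p_season = {"cold": 0.4, "warm": 0.6}
p_pneu_given_season = {"cold": 0.015, "warm": 0.005}  # P(pneu = yes | season)
p_inf_given_season = {"cold": 0.5, "warm": 0.05}      # P(inf = yes | season)


def marginal_yes(p_parent, p_yes_given_parent):
    """Sum out the parent: P(child = yes) = sum_s P(s) * P(child = yes | s)."""
    return sum(p_parent[s] * p_yes_given_parent[s] for s in p_parent)


p_pneu = marginal_yes(p_season, p_pneu_given_season)
p_inf = marginal_yes(p_season, p_inf_given_season)
print(f"P(pneu = yes) = {p_pneu:.3f}")
print(f"P(inf = yes)  = {p_inf:.3f}")
```

Under these CPDs, pneumonia is rare (about 0.9% of patients) while an upper respiratory tract infection is common (about 23%), so the sampled dataset is expected to be strongly imbalanced for `pneu`.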

## Step 2: Sample tabular variables

In [3]:
from sklearn.model_selection import train_test_split

N = 5000
seed = 547

samples = data_model.sample(N, seed=seed)
train, test = train_test_split(samples, train_size=0.8, random_state=seed)

Generating for node: pain: 100%|██████████| 8/8 [00:00<00:00, 129.99it/s]
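
To illustrate what sampling from the network involves, here is a minimal sketch of ancestral (forward) sampling for three of the nodes, followed by an 80/20 split. It is not the implementation behind `DiseaseModelGT.sample`, which is not shown in this notebook; the function `sample_patient` and the toy variables are assumptions made for illustration, and only the CPD values come from Step 1.

```python
import random


def sample_patient(rng):
    """Draw season first, then each child conditioned on the sampled season."""
    season = "cold" if rng.random() < 0.4 else "warm"
    p_pneu = 0.015 if season == "cold" else 0.005
    p_inf = 0.5 if season == "cold" else 0.05
    return {
        "season": season,
        "pneu": "yes" if rng.random() < p_pneu else "no",
        "inf": "yes" if rng.random() < p_inf else "no",
    }


rng = random.Random(547)  # a dedicated generator keeps the draw reproducible
toy_samples = [sample_patient(rng) for _ in range(5000)]

# shuffle, then split 80/20 by position (similar in spirit to train_test_split)
rng.shuffle(toy_samples)
n_train = len(toy_samples) * 4 // 5
toy_train, toy_test = toy_samples[:n_train], toy_samples[n_train:]
print(len(toy_train), len(toy_test))
```

Parents are always sampled before their children, which is why the progress bar above walks through the nodes one at a time.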


## Step 3: Prompt large language model to create clinical notes

This part is not fully reproducible, since ChatGPT will provide different responses every time it's prompted. However, we still include a demonstration of how we obtained the text descriptions for clarity. 

Only 2/3 of the training set needs an associated text, so first we randomly split off this portion of the training samples. 

In [4]:
from sklearn.model_selection import train_test_split

seed = 547
train_text, _ = train_test_split(train, train_size=2667, random_state=seed)

We generate all the symptom combinations and the number of texts we need to generate for each. 

In [5]:
def create_count_df(df):
    # keep only the symptom columns and give them prompt-friendly names
    prompt_data = df[["dysp", "cough", "fever", "pain", "nasal"]]
    prompt_data = prompt_data.rename(columns={"dysp": "dyspnea", "pain": "chest pain / pain attributed to airways", "nasal": "sneezing / blocked nose"})
    column_names = prompt_data.columns.tolist()
    # join the symptom values of each row into one comma-separated key
    combos = prompt_data.agg(",".join, axis=1)
    # name the columns explicitly, since the defaults of value_counts().reset_index() differ between pandas versions
    count_df = combos.value_counts().rename_axis("combo").reset_index(name="count")
    # split the key back into one column per symptom
    count_df[column_names] = count_df["combo"].str.split(",", expand=True)
    return count_df.drop(columns=["combo"])

count_combinations = create_count_df(train_text)

In [6]:
count_combinations.head()

Unnamed: 0,count,dyspnea,cough,fever,chest pain / pain attributed to airways,sneezing / blocked nose
0,1032,no,no,none,no,no
1,317,no,no,none,no,yes
2,229,no,yes,none,no,yes
3,186,no,no,low,no,no
4,150,yes,no,none,no,no
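
The same counting idea can be shown without pandas. The sketch below uses `collections.Counter` on a few made-up rows (the rows are illustrative, not taken from the dataset) and keys each combination by a tuple of symptom values.

```python
from collections import Counter

SYMPTOMS = ["dysp", "cough", "fever", "pain", "nasal"]

# toy rows standing in for the sampled training data
toy_rows = [
    {"dysp": "no", "cough": "no", "fever": "none", "pain": "no", "nasal": "no"},
    {"dysp": "no", "cough": "no", "fever": "none", "pain": "no", "nasal": "yes"},
    {"dysp": "no", "cough": "no", "fever": "none", "pain": "no", "nasal": "no"},
    {"dysp": "no", "cough": "yes", "fever": "none", "pain": "no", "nasal": "yes"},
]

# a tuple of symptom values serves as the key for one combination
combo_counts = Counter(tuple(row[s] for s in SYMPTOMS) for row in toy_rows)

# most_common() orders combinations from most to least frequent
for combo, count in combo_counts.most_common():
    print(count, dict(zip(SYMPTOMS, combo)))
```

Each count is the number of clinical notes that has to be generated for that combination.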


To generate our data, we used 20 example notes on real patient encounters from a GP's clinical practice. For privacy reasons, we cannot publish this data. To illustrate our prompting method, we create some fake examples that stand in for the real patient encounter notes. Every note is annotated with the symptoms experienced by the patient. Not all symptoms are necessarily described in the text. 

In [11]:
import pandas as pd
dummy_example_notes = [
                       {"dyspnea": "no", "cough": "no", "fever": "none", "chest pain / pain attributed to airways": "no", "sneezing / blocked nose": "no",
                       "text": "The patient presents with shoulder pain. They have trouble raising their right arm all the way. This has been going on for a week."},
                       {"dyspnea": "no", "cough": "no", "fever": "none", "chest pain / pain attributed to airways": "no", "sneezing / blocked nose": "no",
                        "text": "Patient presents with long-term fatigue that is not solved by a good night's sleep. They also show a loss of appetite"},
                       {"dyspnea": "no", "cough": "yes", "fever": "none", "chest pain / pain attributed to airways": "no", "sneezing / blocked nose": "yes",
                        "text": "Young child presents with a heavy cough with phlegm. He shows no signs of fever, though has been sneezing for the past couple of days."},
                       {"dyspnea": "yes", "cough": "yes", "fever": "none", "chest pain / pain attributed to airways": "yes", "sneezing / blocked nose": "no",
                        "text": "Patient experiences trouble breathing which intensifies during exercise. Sometimes they have to pause when walking up the stairs. \
                        They also worry about their painful scraping cough."},
                       {"dyspnea": "no", "cough": "no", "fever": "none", "chest pain / pain attributed to airways": "yes", "sneezing / blocked nose": "yes",
                        "text": "Young man feels a stabbing pain in the chest when taking deep breaths. He has been experiencing cold symptoms, such as a stuffy nose, for 2 days now."},
                       {"dyspnea": "no", "cough": "no", "fever": "high", "chest pain / pain attributed to airways": "no", "sneezing / blocked nose": "no",
                        "text": "Newborn is suffering from reduced appetite and high fever of up to 40°C. Mother is growing increasingly worried."}
                       ]
df_examples = pd.DataFrame(dummy_example_notes)

In [12]:
df_examples.head()

Unnamed: 0,dyspnea,cough,fever,chest pain / pain attributed to airways,sneezing / blocked nose,text
0,no,no,none,no,no,The patient presents with shoulder pain. They ...
1,no,no,none,no,no,Patient presents with long-term fatigue that i...
2,no,yes,none,no,yes,Young child presents with a heavy cough with p...
3,yes,yes,none,yes,no,Patient experiences trouble breathing which in...
4,no,no,none,yes,yes,Young man feels a stabbing pain in the chest w...


The most common symptom combination is the one where all symptoms are absent from the patient (all features are "no"/"none"). We first generate clinical text notes for all *other* combinations, then return to the most frequent one. 

We use the following two templates, of which we provide an example in the OpenAI playground: 
- `prompt_known_combo`: The symptom combination is available in the list of examples: https://platform.openai.com/playground/p/poSdvoy9dipYIwVPXepRrUyL?model=gpt-3.5-turbo&mode=chat 
- `prompt_unknown_combo`: The symptom combination is not available in the list of examples: https://platform.openai.com/playground/p/6KOm6pP6DXmMwDxUJWdGGHP1?model=gpt-3.5-turbo&mode=chat 

The code for creating these templates for an arbitrary symptom combination is provided in utils/prompt.py. 
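
Which template applies depends on whether any example note carries exactly the requested symptom combination. The following is a minimal sketch of that lookup using plain dictionaries instead of the dataframe. The functions `matching_texts` and `choose_prompt` are hypothetical helpers written for this illustration; they do not reproduce the code in utils/prompt.py, and the toy notes use only two symptoms to keep the example short.

```python
import random


def matching_texts(example_notes, req_symptoms):
    """Return the texts of all example notes whose symptoms equal req_symptoms."""
    return [
        note["text"]
        for note in example_notes
        if all(note[symptom] == value for symptom, value in req_symptoms.items())
    ]


def choose_prompt(example_notes, req_symptoms, n_random, rng):
    """Pick the template name and the example texts to show in the prompt."""
    texts = matching_texts(example_notes, req_symptoms)
    if texts:
        return "prompt_known_combo", texts
    # no exact match: fall back to a few randomly chosen notes
    fallback = rng.sample(example_notes, n_random)
    return "prompt_unknown_combo", [note["text"] for note in fallback]


# toy notes with two symptoms only (illustrative, not the real examples)
notes = [
    {"cough": "yes", "fever": "none", "text": "Heavy cough, no fever."},
    {"cough": "no", "fever": "high", "text": "High fever for two days."},
    {"cough": "no", "fever": "none", "text": "Shoulder pain for a week."},
]
rng = random.Random(547)
print(choose_prompt(notes, {"cough": "yes", "fever": "none"}, 2, rng))
print(choose_prompt(notes, {"cough": "yes", "fever": "high"}, 2, rng))
```

The first request finds an exact match and uses it as the in-context example; the second has no match, so two random notes are shown instead.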

Below, we iterate over all symptom combinations, except the first one, and generate the necessary number of clinical texts. 

In [None]:
import openai
import random 

from utils.prompt import prompt_known_combo, prompt_unknown_combo, get_random_examples

openai.api_key = "fill-in-your-openai-key"
openai.organization = "fill-in-your-organization"

count_combinations["text"] = ""
max_req = 5
temp = 1
freq_pen = 0.5
n_rand_ex = 2
debug = False
random.seed(547)

# iterate over all possible symptom combinations
for i in range(1, len(count_combinations)): # skip first one (all symptoms not present, gets special prompt)

    # request set of symptoms
    req_combo = count_combinations.iloc[i]
    n = req_combo["count"]
    req_symptoms = req_combo.drop(labels=["count", "text"]).to_dict()

    # get symptom combination from examples 
    sel_examples = df_examples
    for symp, val in req_symptoms.items(): 
        sel_examples = sel_examples[sel_examples[symp] == val]
    examples = sel_examples["text"].to_list()
    
    # if example exists, prompt known combo 
    if len(examples) > 0:
        response = []
        if n > max_req:
            mask_neg = False # mask out the negative symptoms in the prompt half of the time
            for _ in range(n//max_req): # process request in groups of max_req
                messages, resp, full_resp = prompt_known_combo(req_symptoms, examples, max_req, mask_neg, temp, freq_pen, debug=debug) 
                response += resp
                mask_neg = not mask_neg # alternate masking between consecutive prompts
            n_remaining = n%max_req
            if n_remaining > 0:
                messages, resp, full_resp = prompt_known_combo(req_symptoms, examples, n_remaining, mask_neg, temp, freq_pen, debug=debug) 
                response += resp
        else: 
            mask_neg = False # don't mask out negative symptoms when only using one prompt
            messages, resp, full_resp = prompt_known_combo(req_symptoms, examples, n, mask_neg, temp, freq_pen, debug=debug)
            response += resp

        count_combinations.at[i, "text"] = response
    
    # if no example exists, prompt unknown combo
    else: 

        response = []
        if n > max_req:
            
            mask_neg = False # mask out the negative symptoms in the prompt half of the time
            for _ in range(n//max_req): # process request in groups of max_req
                examples = get_random_examples(df_examples, n_rand_ex)
                messages, resp, full_resp = prompt_unknown_combo(req_symptoms, examples, max_req, mask_neg, temp, freq_pen, debug=debug) 
                response += resp
                mask_neg = not mask_neg # alternate masking between consecutive prompts
            n_remaining = n%max_req
            if n_remaining > 0:
                examples = get_random_examples(df_examples, n_rand_ex)
                messages, resp, full_resp = prompt_unknown_combo(req_symptoms, examples, n_remaining, mask_neg, temp, freq_pen, debug=debug) 
                response += resp
        else: 
            mask_neg = False # don't mask out negative symptoms when only using one prompt
            examples = get_random_examples(df_examples, n_rand_ex)
            messages, resp, full_resp = prompt_unknown_combo(req_symptoms, examples, n, mask_neg, temp, freq_pen, debug=debug)
            response += resp

        count_combinations.at[i, "text"] = response

Afterwards, we checked whether the appropriate number of texts were generated for each symptom combination and checked for obvious mistakes. 
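
The batching used in the loop above, together with the count check, can be summarised in a small stand-alone sketch. The helper names `plan_batches` and `has_expected_count` are ours and do not appear in the original code; the logic mirrors the loop: requests larger than `max_req` are split into groups, with `mask_neg` alternating between consecutive prompts.

```python
def plan_batches(n, max_req):
    """Split n requested notes into (batch_size, mask_neg) pairs."""
    if n <= max_req:
        return [(n, False)]  # a single prompt never masks the negative symptoms
    batches = []
    mask_neg = False
    for _ in range(n // max_req):
        batches.append((max_req, mask_neg))
        mask_neg = not mask_neg  # alternate masking between consecutive prompts
    if n % max_req > 0:
        batches.append((n % max_req, mask_neg))
    return batches


def has_expected_count(texts, expected):
    """Check that exactly as many notes were returned as were requested."""
    return len(texts) == expected


plan = plan_batches(12, max_req=5)
print(plan)
print(sum(size for size, _ in plan) == 12)
```

For 12 notes with `max_req = 5`, this yields two full batches of 5 (unmasked, then masked) and one remainder batch of 2 (unmasked). Since the model does not always return exactly the requested number of notes, a check like `has_expected_count` is still needed on the responses.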

Since we need a large number of examples for the symptom combination where none of the symptoms are observed, we use a variety of prompting strategies to get sufficient variation in the generated clinical notes. The following templates are used as prompting strategies. For each, we mention the number of clinical notes obtained this way for both the train and test set. 
1. `prompt_unrelated_strat13`: Prompt with 2 random in-context examples from the set of clinical note examples where all symptoms are "no": https://platform.openai.com/playground/p/Zw9y4EZ8RRfGaZTgBCPu5DPT?model=gpt-3.5-turbo&mode=chat (400 in train set, 151 in test set)
2. `prompt_unknown_combo` (from before): Prompt with 2 random out-of-context examples from the full set of clinical note examples, where symptom combination can be anything: https://platform.openai.com/playground/p/jYJeFfzXVZ5akndHGkB6cmPk?model=gpt-3.5-turbo&mode=chat (50 in train set, 19 in test set)
3. `prompt_unrelated_strat13`: Prompt with 2 random in-context examples from the set of clinical note examples where all symptoms are "no", but don't mention in the prompt that these symptoms are absent. Afterwards, we filter for mentions of our undesired symptoms manually: https://platform.openai.com/playground/p/Ikszv18Zbqr162kF0iSErbCT?model=gpt-3.5-turbo&mode=chat (107 in train set, 39 in test set)
4. `prompt_unrelated_strat45`: Prompt without any examples, without mentioning the symptoms that should not be present. Afterwards, we filter for mentions of our undesired symptoms manually: https://platform.openai.com/playground/p/8KaVpr7chxLMHyeKz6Y2mMVE?model=gpt-3.5-turbo&mode=chat (111 in train set, 43 in test set)
5. `prompt_unrelated_strat45`: Prompt without any examples, but mention the symptoms that should not be present: https://platform.openai.com/playground/p/eXDqpwMkqcvUA8wWiW1H06et?model=gpt-3.5-turbo&mode=chat (364 in train set, 136 in test set)
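
As a consistency check, the per-strategy counts listed above should add up to the number of notes needed for the all-absent combination. A short sketch, with the counts copied from the list:

```python
# notes kept per strategy, copied from the list above
train_counts = {1: 400, 2: 50, 3: 107, 4: 111, 5: 364}
test_counts = {1: 151, 2: 19, 3: 39, 4: 43, 5: 136}

train_total = sum(train_counts.values())
test_total = sum(test_counts.values())
print(train_total, test_total)

# 1032 is the count of the all-absent combination in count_combinations.head()
assert train_total == 1032
```

The train total matches the 1032 occurrences of the all-absent combination shown earlier; the test counts sum to 388.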

In [None]:
from utils.prompt import prompt_unrelated_strat13

# Strategy 1
temp = 1 
freq_pen = 0.5
mention_sympt = True
random.seed(547)
n_req = 10
n_ex = 2
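# assumes the all-"no" notes occupy the first four rows of the example set
# (with the dummy examples above, only rows 0 and 1 qualify)
df_examples_no = df_examples.iloc[0:4]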
responses = []
full_resp = []

for i in range(40):
    messages, resp_split, response = prompt_unrelated_strat13(n_req, df_examples_no, n_ex, mention_sympt, temp, freq_pen, debug=False)
    full_resp.append(response)
    responses += resp_split
    print(i)

In [None]:
# Strategy 2

n_rand_ex = 2
n_req = 10
mask_neg = False
temp = 1
freq_pen = 0.5

req_combo = count_combinations.iloc[0]
req_symptoms = req_combo.drop(labels=["count", "text"]).to_dict()

responses = []
full_resp = []

for i in range(5):
    examples = get_random_examples(df_examples, n_rand_ex)
    messages, resp_split, response = prompt_unknown_combo(req_symptoms, examples, n_req, mask_neg, temp, freq_pen, debug=False)
    full_resp.append(response)
    responses += resp_split
    print(i)

In [None]:
# Strategy 3
temp = 1 
freq_pen = 0.5
mention_sympt = False
random.seed(547)
n_req = 10
n_ex = 2
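# assumes the all-"no" notes occupy the first four rows of the example set
# (with the dummy examples above, only rows 0 and 1 qualify)
df_examples_no = df_examples.iloc[0:4]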
responses = []
full_resp = []

for i in range(11):
    messages, resp_split, response = prompt_unrelated_strat13(n_req, df_examples_no, n_ex, mention_sympt, temp, freq_pen, debug=False)
    full_resp.append(response)
    responses += resp_split
    print(i)

In [None]:
from utils.prompt import prompt_unrelated_strat45

# Strategy 4
temp = 1 
freq_pen = 0.5
mention_sympt = False
responses = []
full_resp = []
for i in range(12):
    messages, resp_split, response = prompt_unrelated_strat45(10, temp, freq_pen, mention_sympt, debug=False)
    full_resp.append(response)
    responses += resp_split
    print(i)

In [None]:
# Strategy 5
temp = 1 
freq_pen = 0.5
mention_sympt = True
responses = []
full_resp = []
for i in range(35):
    messages, resp_split, response = prompt_unrelated_strat45(10, temp, freq_pen, mention_sympt, debug=False)
    full_resp.append(response)
    responses += resp_split
    print(i)

After generating the text according to the strategies outlined above, we end up with a dataframe where every symptom combination is associated with a list of n possible text descriptions, with n being the number of times this combination occurs in the training set. 

We then added these clinical notes to the 2/3 of the training set in need of text. We repeat the same process for the test set. 
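
The notebook does not show how the generated notes were attached to the individual samples. One way this could be done is sketched below: every row takes one unused note from the pool generated for its symptom combination. The function `attach_texts`, the short symptom names used as keys, and the toy data are all assumptions made for this illustration.

```python
SYMPTOMS = ["dysp", "cough", "fever", "pain", "nasal"]


def attach_texts(rows, texts_by_combo):
    """Give every row one unused note generated for its symptom combination."""
    # copy the lists so the caller's pool is left untouched
    pool = {combo: list(texts) for combo, texts in texts_by_combo.items()}
    out = []
    for row in rows:
        combo = tuple(row[s] for s in SYMPTOMS)
        # pop() raises IndexError if fewer notes exist than rows need them
        out.append({**row, "text": pool[combo].pop()})
    return out


all_no = ("no", "no", "none", "no", "no")
cough = ("no", "yes", "none", "no", "no")
toy_rows = [dict(zip(SYMPTOMS, all_no)), dict(zip(SYMPTOMS, cough)), dict(zip(SYMPTOMS, all_no))]
toy_texts = {all_no: ["Shoulder pain.", "Long-term fatigue."], cough: ["Heavy cough."]}

for row in attach_texts(toy_rows, toy_texts):
    print(row["cough"], "|", row["text"])
```

Because each combination has exactly as many notes as it has rows, every note is used once and no row is left without text.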

The resulting dataframes are loaded below.

In [52]:
import pickle 
with open("data/suppl/train_2667_text.p", "rb") as file: 
    train_text = pickle.load(file)
with open("data/suppl/train_1333_notext.p", "rb") as file: 
    train_no_text = pickle.load(file)
with open("data/suppl/test_1000_text.p", "rb") as file: 
    test = pickle.load(file)

In [26]:
train_text.head()

Unnamed: 0,season,pneu,inf,dysp,cough,fever,pain,nasal,text
0,warm,no,no,no,no,high,no,no,Patient presenting with a high fever (39.4°C) ...
1,warm,no,no,no,no,high,no,no,High-grade fever (39.6°C) persisting for two d...
2,warm,no,no,no,no,high,no,no,Patient complaining of a high fever (38.7°C) t...
3,warm,no,no,no,no,high,no,no,Fever of 39°C reported since yesterday with as...
4,warm,no,no,no,no,high,no,no,Significant fever (40°C) for the past 24 hours...


In [27]:
train_no_text.head()

Unnamed: 0,season,pneu,inf,dysp,cough,fever,pain,nasal,text
4504,cold,no,yes,no,yes,none,no,yes,
2491,cold,no,yes,no,yes,none,no,yes,
825,warm,no,no,no,no,low,no,no,
2766,cold,no,no,no,no,low,no,no,
2681,cold,no,yes,yes,yes,none,no,yes,


In [28]:
len(train_text), len(train_no_text)

(2667, 1333)

In [29]:
test.head()

Unnamed: 0,season,pneu,inf,dysp,cough,fever,pain,nasal,text
0,cold,no,no,no,no,high,no,no,Patient presents with high-grade fever of 39.5...
1,warm,no,no,no,no,high,no,no,Chief complaint of persistent high fever (up t...
2,warm,no,no,no,no,high,no,no,Febrile patient with history of high-grade fev...
3,cold,no,no,no,no,high,no,no,Patient reports ongoing high fever up to 38.9°...
4,warm,no,no,no,no,high,no,no,Examined patient exhibiting high fever spikes ...


In [31]:
len(test)

1000

## Step 4: Make data partially observed

### Train set

For the training set, we have some additional steps to do. We already masked out the text in 1/3 of the training data (step 4c) by not generating a clinical note for those samples. 

Of the remaining 2/3 of the training data (2667 samples), we mask the symptoms in about half, i.e. 1/3 of the full training set (step 4b).

In [42]:
from sklearn.model_selection import train_test_split

seed = 547
train_obs, train_no_obs = train_test_split(train_text, train_size=1333, random_state=seed)

In [43]:
len(train_obs), len(train_no_obs)

(1333, 1334)

In [44]:
import numpy as np
train_no_obs.loc[:, ["dysp", "cough", "pain", "fever", "nasal"]] = np.nan
train_no_obs.head()

Unnamed: 0,season,pneu,inf,dysp,cough,fever,pain,nasal,text
281,warm,no,no,,,,,,Patient complains of low-grade fever for the p...
1466,warm,no,no,,,,,,Patient complaining of nasal congestion for th...
543,warm,no,no,,,,,,"Patient seeking advice on smoking cessation, r..."
2322,cold,no,no,,,,,,"Patient presents with mild dyspnea, mainly wit..."
2113,cold,no,yes,,,,,,"Complains of coughing fits, along with constan..."


We now combine the three parts, making one big training set of 4000 samples.
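
The sizes of the three parts follow directly from the splits above. A short arithmetic sketch, using the numbers printed earlier in this notebook:

```python
n_train = 4000
n_no_text = 1333              # step 4c: symptoms observed, text missing
n_text = n_train - n_no_text  # samples that received a clinical note
n_obs = 1333                  # both symptoms and text observed
n_no_obs = n_text - n_obs     # step 4b: text observed, symptoms masked

parts = [("no text", n_no_text), ("no symptoms", n_no_obs), ("fully observed", n_obs)]
for name, size in parts:
    print(f"{name:>14}: {size:4d} ({size / n_train:.1%})")

assert n_no_text + n_no_obs + n_obs == n_train
```

Each part covers roughly one third of the training set, with the symptom-masked part holding the one extra sample (1334).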

In [46]:
import pandas as pd
train_full = pd.concat([train_no_text, train_no_obs, train_obs])
train_full = train_full.reset_index(drop=True)

In [47]:
train_full.head()

Unnamed: 0,season,pneu,inf,dysp,cough,fever,pain,nasal,text
0,cold,no,yes,no,yes,none,no,yes,
1,cold,no,yes,no,yes,none,no,yes,
2,warm,no,no,no,no,low,no,no,
3,cold,no,no,no,no,low,no,no,
4,cold,no,yes,yes,yes,none,no,yes,


Finally, we remove the symptoms "fever" and "pain" from the training set (step 4a).

In [48]:
train_full_masked = train_full[["season", "pneu", "inf", "dysp", "cough", "nasal", "text"]]

This is the training set found in "data/train_4000_final.p". We also save the version where the symptoms "fever" and "pain" are still observed, in "data/train_4000_final_fever_pain.p", so we can use it to train the BN++ baseline. 

In [49]:
train_full_masked.head()

Unnamed: 0,season,pneu,inf,dysp,cough,nasal,text
0,cold,no,yes,no,yes,yes,
1,cold,no,yes,no,yes,yes,
2,warm,no,no,no,no,no,
3,cold,no,no,no,no,no,
4,cold,no,yes,yes,yes,yes,


In [50]:
train_full.head()

Unnamed: 0,season,pneu,inf,dysp,cough,fever,pain,nasal,text
0,cold,no,yes,no,yes,none,no,yes,
1,cold,no,yes,no,yes,none,no,yes,
2,warm,no,no,no,no,low,no,no,
3,cold,no,no,no,no,low,no,no,
4,cold,no,yes,yes,yes,none,no,yes,


In [51]:
import pickle
with open("data/train_4000_final.p", "wb") as file: 
    pickle.dump(train_full_masked, file)
with open("data/train_4000_final_fever_pain.p", "wb") as file: 
    pickle.dump(train_full, file)

### Test set

For the test set, we only remove the symptoms "fever" and "pain" (step 4a). We don't do steps 4b and 4c. 

In [54]:
test_masked = test[["season", "pneu", "inf", "dysp", "cough", "nasal", "text"]]

In [55]:
test_masked.head()

Unnamed: 0,season,pneu,inf,dysp,cough,nasal,text
0,cold,no,no,no,no,no,Patient presents with high-grade fever of 39.5...
1,warm,no,no,no,no,no,Chief complaint of persistent high fever (up t...
2,warm,no,no,no,no,no,Febrile patient with history of high-grade fev...
3,cold,no,no,no,no,no,Patient reports ongoing high fever up to 38.9°...
4,warm,no,no,no,no,no,Examined patient exhibiting high fever spikes ...


Again, we save both this final version of the test set, and the version where fever and pain are observed, which will be used to evaluate the BN++ model. 

In [56]:
import pickle
with open("data/test_1000_final.p", "wb") as file: 
    pickle.dump(test_masked, file)
with open("data/test_1000_final_fever_pain.p", "wb") as file: 
    pickle.dump(test, file)

## Add BioLORD embeddings

We use the SentenceTransformer class from the sentence-transformers library to load the BioLORD model (FremyCompany/BioLORD-2023) from the Hugging Face Hub and generate embeddings for the clinical notes in both our train and test set. We calculate these upfront and store them in the train and test dataframes for easier use in our models. 

In [5]:
import pickle
with open("data/train_4000_final.p", "rb") as file: 
    train_df = pickle.load(file)
with open("data/test_1000_final.p", "rb") as file: 
    test_df = pickle.load(file)

In [4]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("FremyCompany/BioLORD-2023")


In [5]:
def encode_text_BioLORD(row, model):
    """Return the BioLORD embedding of the clinical note in row["text"]."""
    if row.name % 25 == 0: # print the row index every 25 rows to show progress
        print(row.name)

    return model.encode(row["text"])

In [None]:
train_df["BioLORD emb"] = train_df.apply(encode_text_BioLORD, args=(model,), axis=1)

In [None]:
test_df["BioLORD emb"] = test_df.apply(encode_text_BioLORD, args=(model,), axis=1)
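
Encoding row by row is simple but slow for thousands of notes. As an alternative, the sketch below encodes all non-empty notes in a single batch and leaves rows without a note as `None`. Note that this differs from the cells above, which encode every row, including those with empty text. The function `encode_texts` is a hypothetical helper, and `fake_encode_batch` is a stand-in encoder so that the example runs without downloading the model.

```python
def encode_texts(texts, encode_batch):
    """Encode non-empty texts in one batch; rows without a note get None."""
    # remember where each non-empty note sits so results can be put back in place
    positions = [i for i, t in enumerate(texts) if isinstance(t, str) and t.strip()]
    vectors = encode_batch([texts[i] for i in positions]) if positions else []
    out = [None] * len(texts)
    for i, vec in zip(positions, vectors):
        out[i] = vec
    return out


def fake_encode_batch(batch):
    """Stand-in encoder (hypothetical): maps each text to a 2-d vector."""
    return [[float(len(t)), float(t.count(" "))] for t in batch]


embeddings = encode_texts(["Patient has a cough.", "", "High fever."], fake_encode_batch)
print(embeddings)
```

With the real model, `model.encode` accepts a list of sentences and returns one embedding per sentence, so it could be passed as `encode_batch` in place of the stand-in. Whether downstream models expect an embedding for empty notes is not stated here, so the `None` placeholder is an assumption.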

In [8]:
import pickle
with open("data/test_1000_final.p", "wb") as file: 
    pickle.dump(test_df, file)
with open("data/train_4000_final.p", "wb") as file: 
    pickle.dump(train_df, file)