## This notebook loads the annotated dataset and prepares it for keyword extraction.

Input: an EHR note (MIMIC)

Instruction: Please identify 5~10 word tokens from the EHR note. Those 5~10 word tokens should be most important for a patient to understand their clinical conditions, procedures, and treatment plans.

Output: keywords. Use either the human annotations or the MIMIC outputs (keywords that overlap between the notes and the discharge instructions; advantage: llama and GPT4).
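The overlap idea mentioned above can be sketched as follows. This is a minimal illustration, not the notebook's method: the function names, the tiny stop-word list, and the two example sentences are all made up for the sketch.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on",
             "for", "with", "your", "you", "was", "were", "is"}

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def overlapping_keywords(note, instructions, max_keywords=10):
    """Return up to max_keywords tokens found in both texts, in note order."""
    instruction_tokens = set(tokenize(instructions)) - STOPWORDS
    seen = set()
    keywords = []
    for token in tokenize(note):
        if token in instruction_tokens and token not in seen:
            seen.add(token)
            keywords.append(token)
    return keywords[:max_keywords]

# Invented example texts, not taken from MIMIC.
note = "Patient admitted with pneumonia and started on antibiotics."
instructions = "You were treated for pneumonia. Finish your antibiotics."
print(overlapping_keywords(note, instructions))
```

A real pipeline would likely need phrase-level matching and a proper stop-word list; single-token overlap is only the simplest version of the idea.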

## Process annotation notes

## Process with rankings

In [1]:
import pandas as pd
import random

pages = ['victoria', 'jiaping', 'jinying-3A', 'jinying-2B-part1', 'jinying_2B_part2','jinying_3C']

datas = []
for page in pages :
    # df = pd.read_excel("../data/raw/all_20160531.xlsx", page)
    # read one sheet per entry in `pages`
    df = pd.read_excel("../data/processed/all_20160531_changed.xlsx", sheet_name=page)
    datas.append(df)

In [2]:
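def preprocess_data(data, topn, ids=()):
    # Clean one annotation sheet and keep the top-`topn` ranked phrases per file.
    # `ids` holds fileids already taken from earlier sheets; those files are skipped.
    # The default is an immutable tuple to avoid the shared mutable-default pitfall.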
    data = data.dropna(subset=['ranking']).copy()
    # data = data.drop(columns = ['packet'])

    data['fileid'] = data['fileid'].str.lower()
    data['fileid'] = data['fileid'].str.strip()
    data['fileid'] = data.fileid.str.replace('heart failure', 'heart_failure')
    data['fileid'] = data.fileid.str.replace('liver failure', 'liver_failure')
    data['fileid'] = data['fileid'].str.replace(" ", "")
    data['fileid'] = data['fileid'].apply(lambda x : x + '.txt' if not x.endswith('.txt') else x)

    # process phrases
    data['phrase'] = data['phrase'].str.lower()
    data['phrase'] = data['phrase'].str.strip()

    # process colors: yellow ('y') marks symptoms, green ('g') marks tests
    data['color'] = data['color'].str.lower()
    data['color'] = data['color'].replace({"yellow" : "y", "green" : "g"})

    data = data.dropna()

    # symptoms
    symptoms = data[data.color == 'y'].sort_values(by=['fileid', 'ranking']).groupby(['fileid','color']).head(topn)
    symptoms = symptoms.drop_duplicates(subset=['fileid', 'phrase'])

    # tests
    labtests = data[data.color == 'g'].sort_values(by=['fileid', 'ranking'])
    labtests = labtests[labtests.ranking < (topn + 1)].copy()

    result = pd.concat([symptoms, labtests], ignore_index=True)
    result = result.sort_values(by=['fileid','ranking'])

    # grouped = data.sort_values(by=['fileid','ranking'])
    # result = grouped.groupby(['fileid']).head(topn)

    result = result.drop_duplicates(subset=['fileid', 'ranking'])
    # skip files that were already processed in an earlier sheet
    result = result[~result.fileid.isin(ids)].reset_index(drop=True)
    return result, result['fileid'].unique().tolist()
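The `fileid` clean-up inside `preprocess_data` is easier to check on plain strings. The sketch below mirrors the same steps in the same order without pandas; `normalize_fileid` is a hypothetical helper name and the sample inputs are invented.

```python
def normalize_fileid(raw):
    """Apply the same clean-up steps as preprocess_data to one fileid string."""
    fileid = raw.lower().strip()
    # Category names with a space are joined with an underscore.
    fileid = fileid.replace("heart failure", "heart_failure")
    fileid = fileid.replace("liver failure", "liver_failure")
    # Any remaining spaces are removed entirely.
    fileid = fileid.replace(" ", "")
    if not fileid.endswith(".txt"):
        fileid += ".txt"
    return fileid

# Invented inputs showing the kinds of inconsistencies the steps handle.
for raw in [" Heart Failure.report123 ", "cancer.report24.txt", "Liver failure.report5"]:
    print(repr(raw), "->", normalize_fileid(raw))
```

The order matters: the category replacements must run before spaces are stripped, otherwise "heart failure" would collapse to "heartfailure".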

In [3]:
# don't erase this!
# These are the examples that we will use

example_text = "cancer.report24.txt"
example_texts = ["cancer.report24.txt", "diabetes.report310966.txt"]

In [4]:
processed_datasets_top3 = []
ids = []

# hold out 20 notes for fine-tuning; the fixed seed keeps the sample reproducible
random.seed(52)
sample_size = 20


for data in datas :
    processed, data_ids = preprocess_data(data, 3, ids)
    processed_datasets_top3.append(processed)
    ids.extend(data_ids)


sample_ids = random.sample(ids, sample_size)
top3_dataset = pd.concat(processed_datasets_top3, ignore_index=True)
top3_dataset = top3_dataset[~(top3_dataset.fileid.isin(example_texts) | (top3_dataset.fileid.isin(sample_ids)))].reset_index(drop=True)
# top3_samples = top3_dataset[top3_dataset.fileid.isin(sample_ids)]


processed_datasets_top5 = []
ids = []
for data in datas :
    processed, data_ids = preprocess_data(data, 5, ids)
    processed_datasets_top5.append(processed)
    ids.extend(data_ids)


top5_dataset = pd.concat(processed_datasets_top5, ignore_index=True)
top5_dataset = top5_dataset[~(top5_dataset.fileid.isin(example_texts) | (top5_dataset.fileid.isin(sample_ids)))].reset_index(drop=True)
# top5_samples = top5_dataset[top5_dataset.fileid.isin(sample_ids)]

processed_datasets_top10 = []
ids = []
for data in datas :
    processed, data_ids = preprocess_data(data, 10, ids)
    processed_datasets_top10.append(processed)
    ids.extend(data_ids)

top10_dataset = pd.concat(processed_datasets_top10, ignore_index=True)
top10_dataset = top10_dataset[~(top10_dataset.fileid.isin(example_texts) | (top10_dataset.fileid.isin(sample_ids)))].reset_index(drop=True)
# top10_samples = top10_dataset[top10_dataset.fileid.isin(sample_ids)]

processed_datasets_top20 = []
ids = []
for data in datas :
    processed, data_ids = preprocess_data(data, 20, ids)
    processed_datasets_top20.append(processed)
    ids.extend(data_ids)
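The hold-out logic in the cell above can be sketched in isolation. This is a simplified illustration under assumptions: `split_holdout` is a hypothetical helper, the file ids are invented, and a local `random.Random` instance is used instead of the global `random.seed` call.

```python
import random

def split_holdout(file_ids, holdout_size, seed, excluded=()):
    """Sample a reproducible hold-out set and return (holdout, remaining).

    Files listed in `excluded` (e.g. few-shot examples) are dropped from
    `remaining`, as are the sampled hold-out files.
    """
    rng = random.Random(seed)                    # local RNG, global state untouched
    unique_ids = list(dict.fromkeys(file_ids))   # de-duplicate, keep order
    holdout = set(rng.sample(unique_ids, holdout_size))
    remaining = [f for f in unique_ids
                 if f not in holdout and f not in excluded]
    return sorted(holdout), remaining

# Invented ids for illustration.
all_ids = [f"note{i}.txt" for i in range(10)]
holdout, remaining = split_holdout(all_ids, 3, seed=52, excluded=["note0.txt"])
print(holdout)
print(remaining)
```

Sampling once and reusing the same hold-out ids for every `topn` variant, as the notebook does with `sample_ids`, keeps the top-3, top-5 and top-10 datasets aligned on the same files.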

In [5]:
sample_df = pd.concat(processed_datasets_top20)
sample_df = sample_df[sample_df.fileid.isin(sample_ids)].copy()

Now load the notes and match them to the annotations by file id.

In [6]:
import pandas as pd

notes = pd.read_pickle("../data/processed/notes.pkl")
notes

Unnamed: 0,category,noteid,text
0,liver_failure,liver_failure.report37286.txt,This is a 50-year-old male with a history of d...
1,liver_failure,liver_failure.report37775.txt,Dr. name has discussed these results with you....
2,liver_failure,liver_failure.report38874.txt,"F/u on Osteoarthritis, chronic pain, HTN, Depr..."
3,liver_failure,liver_failure.report41972.txt,Very high a1c and glucose please follow up in ...
4,liver_failure,liver_failure.report51432.txt,name is a lovely just turned 65-year-old gentl...
...,...,...,...
281,heart_failure,heart_failure.report80980.txt,1. Multifactorial anemia secondary to both ren...
282,heart_failure,heart_failure.report85881.txt,The patient is being seen for an initial evalu...
283,heart_failure,heart_failure.report9402.txt,This 86-year-old coming in for a complete chec...
284,heart_failure,heart_failure.report94858.txt,The patient presents today for evaluation for ...


In [7]:
# use .iloc[0] rather than a hard-coded index label
print(notes[notes.noteid == 'diabetes.report310966.txt']['text'].iloc[0])

name name is a 58-year-old man referred for evaluation of thrombocytosis. Prior CBCs from xxxx through xxxx revealed platelet counts ranging from 400,000 to 500,000, but no more recent studies are available. He has long-standing Crohn disease and although he says he has not had gastrointestinal bleeding in the past, he has been given iron, which he is taking twice daily. He has black stool, but notes no blood and he has not had hematemesis. He notes no blood in his urine or sputum and he has no epistaxis. He discontinued the use of iron yesterday because he thought that might alleviate his gastrointestinal complaints, but he does not feel different today. He is cared for by Dr. name name at St. hospital Medical Center in location. He has no history of prior cancers, tuberculosis or other infectious diseases. He has been taking budesonide for his Crohn disease. He has no unexplained fevers, although he states he often feels hot. He has no soaking sweats and has not had unexplained weigh

In [8]:
notes['noteid'].nunique()

286

In [9]:
import pickle

# the dataset contains 105 entries
file_top3 = top3_dataset['fileid'].unique()
# file_top5 = top3_dataset['fileid'].unique()

filtered_notes = notes[notes.noteid.isin(file_top3)].reset_index(drop=True)

sample_notes = notes[notes.noteid.isin(sample_ids)].reset_index(drop=True)

processed_ranking_datasets = (top3_dataset, top5_dataset, top10_dataset, filtered_notes, sample_df, sample_notes)

with open("../data/processed/processed_ranking_datasets.pkl", 'wb') as f :
    pickle.dump(processed_ranking_datasets, f)

In [10]:
print(notes[~notes.noteid.isin(example_texts)]['text'].values[0])

This is a 50-year-old male with a history of diarrhea-predominant irritable bowel syndrome, who is coming in complaining of a one-month history of abdominal cramping in the epigastric region. The patient reports that his IBS was diagnosed two to three years ago when he presented with a history of alternating constipation and diarrhea post stressful events. He reports that his usual IBS flares last for 24-48 hours after a clear stressor and then spontaneously resolve. He reports that he had a sigmoidoscopy at that time, which was unremarkable, to rule out other pathologies such as IBD. He also has in the intervening time had a screening colonoscopy, where he underwent a polypectomy about two to three years ago, during which time he was also again found to not have any lesions consistent with inflammatory bowel disease. The patient reports that his IBS has been well controlled. About a month ago, he purchased a plane with the intent of renewing his pilot's license and has been stressed f

In [11]:
text = notes[notes.noteid == 'liver_failure.report38874.txt'].values[0][2]
print(text[:100])

F/u on Osteoarthritis, chronic pain, HTN, Depression 

The pt has generalized OA, in his shoulders, 


In [12]:
sample_df[(sample_df.fileid == 'cancer.report24.txt') & (sample_df.ranking < 4)]

Unnamed: 0,packet,fileid,phrase,color,ranking
0,1B,cancer.report24.txt,rheumatoid arthritis,y,1.0
1,1B,cancer.report24.txt,methotrexate,g,1.1
2,1B,cancer.report24.txt,synovitis,g,1.2
3,1B,cancer.report24.txt,osteoporosis,y,2.0
4,1B,cancer.report24.txt,anormal bone density study,g,2.1
5,1B,cancer.report24.txt,biophosphonates,g,2.2
6,1B,cancer.report24.txt,niacin,g,3.1
7,1B,cancer.report24.txt,statins wichich caused myalgias,g,3.2
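Judging from the table above and the output template in the prompt, the integer part of `ranking` appears to identify a condition and the decimal part orders the items attached to it. Under that assumption, the `ranking < topN + 1` filter keeps whole groups together. The sketch below uses the rows shown above; `group_by_condition` is a hypothetical helper.

```python
from collections import defaultdict

# (ranking, phrase, color) rows copied from the table above.
rows = [
    (1.0, "rheumatoid arthritis", "y"),
    (1.1, "methotrexate", "g"),
    (1.2, "synovitis", "g"),
    (2.0, "osteoporosis", "y"),
    (2.1, "anormal bone density study", "g"),
    (2.2, "biophosphonates", "g"),
    (3.1, "niacin", "g"),
    (3.2, "statins wichich caused myalgias", "g"),
]

def group_by_condition(rows, top_n):
    """Keep rows with ranking < top_n + 1 and group them by integer part."""
    groups = defaultdict(list)
    for ranking, phrase, color in rows:
        if ranking < top_n + 1:
            groups[int(ranking)].append((ranking, phrase, color))
    return dict(groups)

for condition, items in group_by_condition(rows, top_n=2).items():
    print(condition, [phrase for _, phrase, _ in items])
```

Note that group 3 in this table contains only `g` rows, so a consumer of the data should not assume every group starts with a `y` row.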


In [13]:
def format_prompt(fileid, topN) :
    # get the EHR note
    ehr_note = notes[notes.noteid == fileid]
    input_text = ehr_note.values[0][2]

    # format the outputs 
    df = sample_df[(sample_df.fileid == fileid) & (sample_df.ranking < (topN+1))].copy()
    output_text = ""

    for _, row in df.iterrows():
        output_text += str(row.ranking) + " " + row.phrase + "\n"

    instruction = f""" 
    You are a helpful assistant and an expert in the medical domain.
    Extract the top {topN} main diagnoses/symptoms or conditions mentioned in the medical note.
    After each diagnosis/symptom or condition, identify the medical tests related to it.
    If there aren't any medical tests related to it, just start listing the next important diagnosis/symptom or condition.
    If there are no additional diagnoses/symptoms or conditions that you can identify, just list the existing ones and finalize the output.
    Don't write "no symptoms" or give any other indication that there are no further diagnoses/symptoms or conditions.
    Do not modify or abbreviate what is written in the notes. Just extract them as they are.
    Make sure the highest priority is assigned the smallest number.
    We give you an example; follow it as shown below.
    The format should be as follows:


    1. key symptom or condition
    1.1 medical test related to 1
    1.2 medical test related to 1

    2. key symptom or condition
    2.1 medical test related to 2

    3. key symptom or condition
    3.1 medical test related to 3
    3.2 medical test related to 3
    """

    return {"input" : input_text,
            "output" : output_text, 
            "instruction" : instruction}
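One detail worth noting: `str(row.ranking)` renders a float ranking of 1.0 as `1.0`, whereas the template in the instruction writes top-level items as `1.`. If the target text should match the template exactly, the labels could be formatted as in the sketch below. Both helpers are hypothetical, and the sketch assumes at most nine sub-items per condition (a single decimal digit).

```python
def format_ranking(ranking):
    """Render 1.0 as '1.' and 1.2 as '1.2' (assumes one decimal digit)."""
    major = int(ranking)
    minor = round((ranking - major) * 10)
    return f"{major}." if minor == 0 else f"{major}.{minor}"

def render_output(rows):
    """Build the target text from (ranking, phrase) pairs, one item per line."""
    return "".join(f"{format_ranking(r)} {phrase}\n" for r, phrase in rows)

print(render_output([
    (1.0, "rheumatoid arthritis"),
    (1.1, "methotrexate"),
    (2.0, "osteoporosis"),
]))
```

Whether to change the label format is a design choice: keeping `1.0` is harmless as long as the fine-tuning targets and the evaluation parser agree on it.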
    

In [14]:
parsed_dataset = []
for fileid in sample_notes['noteid'].tolist() :
    for topn in [3, 5, 10] :
        output = format_prompt(fileid, topn)
        parsed_dataset.append(output)

In [15]:
parsed_dataset[0]

{'input': "name is a lovely 53-year-old gentleman who I have not seen in a long time coming in today for followup.1. Nonischemic cardiomyopathy: The patient, in date, had a Persantine thallium for other reasons that noted an ejection fraction of 47% and there was possibly an LAD ischemia. \nHis xxx, heart catheterization showed a completely clean coronary system, but a low normal EF. \nChest x-rays did not suggest sarcoidosis and iron saturation was unremarkable. \nHis last echo was in xxx showing a normal ejection fraction. He has been maintained on beta-blockers and ACE inhibitors for blood pressure, but it is unclear if he needs as much medicine as he is on. He denies paroxysmal nocturnal dyspnea, orthopnea or pedal edema. No chest pain or palpitations. \n2. History of GERD: The patient had an EGD in date that showed no evidence of name esophagus. \nOutlet type rectal bleeding in date, colonoscopy showed just hemorrhoids. He has no abdominal pain, vomiting or nausea or bright red bl

In [16]:
import json
from utils import load_config  # project-local helper module

config = load_config()
PROJECT_PATH = config.project_path
DATA_PATH = PROJECT_PATH.joinpath("data/processed")

with open(DATA_PATH.joinpath("jinying_sample_dataset.json"), "w") as f : 
    json.dump(parsed_dataset, f)
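After writing the file, a quick round-trip check can confirm that every record carries the three keys produced by `format_prompt`. The sketch below is self-contained: it writes a toy record to a temporary directory rather than reading the real file, and `load_and_check` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"input", "output", "instruction"}

def load_and_check(path):
    """Load a JSON list of prompt records and verify the required keys."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    for i, record in enumerate(records):
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"record {i} is missing keys: {sorted(missing)}")
    return records

# Toy record standing in for one entry of parsed_dataset.
toy_dataset = [{"input": "note text",
                "output": "1.0 condition\n",
                "instruction": "..."}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_sample_dataset.json"
    with open(path, "w", encoding="utf-8") as f:
        json.dump(toy_dataset, f)
    records = load_and_check(path)

print(len(records))
```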