## Load the annotated dataset and work with it

Input: an EHR note (MIMIC)

Instruction: Please identify 5~10 word tokens from the EHR note. Those 5~10 word tokens should be most important for a patient to understand their clinical conditions, procedures, and treatment plans.

Output: keywords [use either the human annotations, or just use MIMIC outputs]
[overlapping keywords between notes and discharge instructions; advantage: Llama and GPT-4]
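
The input/instruction/output layout above could be packed into one record per note. The following is only a minimal sketch: the helper name `build_example`, the dictionary keys, and the toy note text are assumptions made for illustration and do not come from the notebook.

```python
# Instruction text copied from the task description above.
INSTRUCTION = (
    "Please identify 5~10 word tokens from the EHR note. Those 5~10 word tokens "
    "should be most important for a patient to understand their clinical "
    "conditions, procedures, and treatment plans."
)


def build_example(note_text, keywords):
    """Pack one note and its keywords into a single record (hypothetical layout)."""
    cleaned = [k.strip() for k in keywords if k.strip()]
    return {
        "instruction": INSTRUCTION,
        "input": note_text.strip(),
        "output": ", ".join(cleaned),
    }


# Toy values, not taken from the dataset.
example = build_example("  Toy note text about glucose control.  ", ["a1c", " diabetes ", ""])
print(example["output"])
```

Joining the keywords with `", "` mirrors the comma-separated `Phrase` column built later in this notebook.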

In [1]:
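import os
from pathlib import Path

from utils import *  # project helper module; load_config is expected to come from here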

config = load_config()
projectPath = config.project_path
dataPath = config.get_data_path("annotated_examples_path")

# lfPath = dataPath.joinpath("LiverFailure")
notesPath = dataPath[0].joinpath("cancer")
rawannPath = dataPath[1].joinpath("rankings_300notes_20151217_cancer.txt")

# notesPath
dataPath[0], dataPath[1]

there are ...
['cancer', 'copd', 'diabetes', 'heart_failure', 'hypertension', 'liver_failure']


(PosixPath('/data/data_user/annotations/Jinying/notesAid_286notes/clean_raw_286notes'),
 PosixPath('/data/data_user/annotations/Jinying/notesAid_286notes/rawAnn'))

In [13]:
import pandas as pd

annotated_dicts = []
# Read every note file in each disease category folder.
for category in ['liver_failure', 'copd', 'hypertension', 'cancer', 'diabetes', 'heart_failure']:
    path = dataPath[0].joinpath(category)
    for file in os.listdir(path):
        with open(path.joinpath(file), 'r') as f:
            text = f.read()

        annotated_dicts.append({"category": category, "noteid": file, "text": text})

notes = pd.DataFrame(annotated_dicts)
notes.to_pickle(Path(projectPath).joinpath("data/processed/notes.pkl"))
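
As a possible variant of the loading cell above, the same table can be built with `pathlib` only. This is a sketch under assumptions: the function name `load_notes`, the `*.txt` filter, the sorted file order, and the UTF-8 encoding are choices made here, and the temporary folder stands in for the real data path.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_notes(root, categories):
    """Read every .txt note under root/<category> into one DataFrame."""
    records = []
    for category in categories:
        # sorted() makes the row order reproducible across machines.
        for path in sorted(Path(root, category).glob("*.txt")):
            records.append({
                "category": category,
                "noteid": path.name,
                "text": path.read_text(encoding="utf-8"),
            })
    return pd.DataFrame(records, columns=["category", "noteid", "text"])


# Demo on a throwaway folder with one toy note (not real data).
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp, "cancer")
    demo_dir.mkdir()
    (demo_dir / "cancer.report1.txt").write_text("toy note text", encoding="utf-8")
    demo_notes = load_notes(tmp, ["cancer"])

print(demo_notes)
```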

In [14]:
notes

Unnamed: 0,category,noteid,text
0,liver_failure,liver_failure.report37286.txt,This is a 50-year-old male with a history of d...
1,liver_failure,liver_failure.report37775.txt,Dr. name has discussed these results with you....
2,liver_failure,liver_failure.report38874.txt,"F/u on Osteoarthritis, chronic pain, HTN, Depr..."
3,liver_failure,liver_failure.report41972.txt,Very high a1c and glucose please follow up in ...
4,liver_failure,liver_failure.report51432.txt,name is a lovely just turned 65-year-old gentl...
...,...,...,...
281,heart_failure,heart_failure.report80980.txt,1. Multifactorial anemia secondary to both ren...
282,heart_failure,heart_failure.report85881.txt,The patient is being seen for an initial evalu...
283,heart_failure,heart_failure.report9402.txt,This 86-year-old coming in for a complete chec...
284,heart_failure,heart_failure.report94858.txt,The patient presents today for evaluation for ...


In [41]:
# Read every tab-separated annotation file and stack them into one table.
frames = [pd.read_csv(file, sep='\t') for file in dataPath[1].iterdir()]
annotation_info_table = pd.concat(frames)

In [43]:
# Note names in the annotation files lack the ".txt" suffix used by the note files.
annotation_info_table['noteid'] = annotation_info_table['note'] + ".txt"

In [47]:
reports = set(annotation_info_table['noteid'].unique())

Now keep only the notes that appear in the annotation table.

In [51]:
filtered_notes = notes[notes.noteid.isin(reports)]
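
When filtering like this, it can help to also check which annotated ids have no matching note file. A minimal sketch on toy data; the frame, the id set, and the variable names are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for `notes` and `reports` (hypothetical values).
notes_demo = pd.DataFrame({
    "noteid": ["a.txt", "b.txt", "c.txt"],
    "text": ["x", "y", "z"],
})
annotated_ids = {"a.txt", "c.txt", "d.txt"}

# Keep only the notes whose id appears in the annotation table.
mask = notes_demo["noteid"].isin(annotated_ids)
filtered_demo = notes_demo[mask].reset_index(drop=True)

# Annotated ids with no note on disk; worth inspecting before going further.
missing_notes = sorted(annotated_ids - set(notes_demo["noteid"]))
print(filtered_demo["noteid"].tolist(), missing_notes)
```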

Now we save the processed notes and the annotation information.

In [53]:
filtered_notes.to_pickle("../data/processed/filtered_notes.pkl")
annotation_info_table.to_pickle("../data/processed/annotation_info_table.pkl")

## Process annotation notes

In [3]:
import pandas as pd
pages = ['victoria', 'jiaping', 'john', 'jinying-3A', 'jinying-2B-part1', 'jinying_2B_part2','jinying_3C']

datas = []
for page in pages:
    # Keep only the note name and phrase columns from each sheet.
    df = pd.read_excel("../data/raw/all_20160531.xlsx", sheet_name=page)
    datas.append(df[['File N', 'Phrase']].copy())

In [4]:
datas = pd.concat(datas, ignore_index=True).dropna()
datas = datas.reset_index(drop=True)

In [5]:
datas.rename(columns = {'File N':'note'}, inplace=True)

In [72]:
# Normalize note names: drop spaces, lowercase, and use the "heart_failure" spelling.
datas['note'] = datas['note'].str.replace(" ", "")
datas['note'] = datas['note'].str.lower()
datas['note'] = datas['note'].apply(lambda x : x.replace("heartfailure","heart_failure"))

# Add the ".txt" suffix where it is missing, except for "(lab)" entries.
msk = datas['note'].str.endswith("txt")
d1 = datas[msk].copy()
d2 = datas[~msk].copy()
d2['note'] = d2['note'].apply(lambda x : x + ".txt" if '(lab)' not in x else x)

combinedData = pd.concat([d1,d2], ignore_index=True).copy()

# Group by note and concatenate the phrases into one comma-separated string.
# Each phrase gets ", " appended, so a trailing comma remains after strip().
combinedData['Phrase'] = combinedData['Phrase'].apply(lambda x : x + ", ")
combinedData = combinedData.groupby('note', as_index=False)['Phrase'].sum()

combinedData['Phrase'] = combinedData['Phrase'].str.strip()
combinedData = combinedData.rename(columns = {'note' : 'noteid'})
combinedData.head()

Unnamed: 0,noteid,Phrase
0,cancer.report11.txt,"Large B-cell Lymphoma, chemo, DM2, diet-contro..."
1,cancer.report13.txt,"autoimmune hemolytic anemia, Marginal zone lym..."
2,cancer.report14.txt,"arthritis, DM, Depression, low grade follicula..."
3,cancer.report15.txt,"malignant large B cell diffuse lymphoma, CHOP ..."
4,cancer.report18.txt,"mediastinal mass, stage III, T-cell lymphoblas..."
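
The normalization and grouping above can also be written as one small function plus a `", ".join` aggregation, which avoids the trailing comma left by string summation. This is only a sketch on toy rows: `normalize_note_id` is a hypothetical helper name, and the toy notes and phrases are not from the spreadsheet.

```python
import pandas as pd


def normalize_note_id(raw):
    """Apply the same clean-up rules as the cell above to one note name."""
    note = raw.replace(" ", "").lower().replace("heartfailure", "heart_failure")
    # Add the suffix only when it is missing and the entry is not a "(lab)" row.
    if not note.endswith("txt") and "(lab)" not in note:
        note += ".txt"
    return note


# Toy rows (hypothetical values).
toy = pd.DataFrame({
    "note": ["Cancer.report11", "cancer.report11.txt", "HeartFailure.report5"],
    "Phrase": ["chemo", "DM2", "edema"],
})
toy["noteid"] = toy["note"].map(normalize_note_id)

# Join phrases per note; no trailing comma is produced.
joined = toy.groupby("noteid", as_index=False)["Phrase"].agg(", ".join)
print(joined)
```

Note that switching the notebook to this aggregation would change the stored `Phrase` strings slightly, since the current output keeps a trailing comma.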


In [65]:
notes = pd.read_pickle("../data/processed/notes.pkl")
notes.head()

Unnamed: 0,category,noteid,text
0,liver_failure,liver_failure.report37286.txt,This is a 50-year-old male with a history of d...
1,liver_failure,liver_failure.report37775.txt,Dr. name has discussed these results with you....
2,liver_failure,liver_failure.report38874.txt,"F/u on Osteoarthritis, chronic pain, HTN, Depr..."
3,liver_failure,liver_failure.report41972.txt,Very high a1c and glucose please follow up in ...
4,liver_failure,liver_failure.report51432.txt,name is a lovely just turned 65-year-old gentl...


In [75]:
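# Inner join on the shared noteid column: keeps only notes that have annotated phrases.
mergedData = notes.merge(combinedData, on='noteid')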

In [77]:
mergedData.to_pickle("../data/processed/mergedData.pkl")
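
Because an inner merge silently drops unmatched rows, a quick audit of the join can be useful. A minimal sketch on toy frames, assuming `noteid` is unique on both sides; the frames and variable names here are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for `notes` and `combinedData` (hypothetical values).
notes_demo = pd.DataFrame({"noteid": ["a.txt", "b.txt"], "text": ["x", "y"]})
phrases_demo = pd.DataFrame({"noteid": ["a.txt", "c.txt"], "Phrase": ["p1", "p2"]})

# Explicit key and join type; validate raises if noteid is duplicated on either side.
merged_demo = notes_demo.merge(
    phrases_demo, on="noteid", how="inner", validate="one_to_one"
)

# An outer merge with indicator=True shows how many rows matched or were dropped.
audit = notes_demo.merge(phrases_demo, on="noteid", how="outer", indicator=True)
counts = audit["_merge"].value_counts().to_dict()
print(merged_demo.shape, counts)
```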

In [1]:
import pandas as pd
mergedData = pd.read_pickle("../data/processed/mergedData.pkl")

In [2]:
print(mergedData.shape)
mergedData.head()

(106, 4)


Unnamed: 0,category,noteid,text,Phrase
0,liver_failure,liver_failure.report37286.txt,This is a 50-year-old male with a history of d...,"Diarrhea-predominant, irritable bowel syndrome..."
1,liver_failure,liver_failure.report41972.txt,Very high a1c and glucose please follow up in ...,"a1c, diabetes, CLYCOHEMOGLOBIN A1C, HGBA1C,"
2,liver_failure,liver_failure.report51432.txt,name is a lovely just turned 65-year-old gentl...,"patellofemoral syndrome, physical therapy, Act..."
3,liver_failure,liver_failure.report55225.txt,name is a lovely 53-year-old gentleman who I h...,"Nonischemic cardiomyopathy, Persantine thalliu..."
4,liver_failure,liver_failure.report60517.txt,"NPOV \n\n67 year old male with HTN, Chronic Al...","HTN, Chronic Alcohol Abuse, remission, Chronic..."


In [7]:
print(mergedData['text'][0], "\n\n", mergedData['Phrase'][0])

This is a 50-year-old male with a history of diarrhea-predominant irritable bowel syndrome, who is coming in complaining of a one-month history of abdominal cramping in the epigastric region. The patient reports that his IBS was diagnosed two to three years ago when he presented with a history of alternating constipation and diarrhea post stressful events. He reports that his usual IBS flares last for 24-48 hours after a clear stressor and then spontaneously resolve. He reports that he had a sigmoidoscopy at that time, which was unremarkable, to rule out other pathologies such as IBD. He also has in the intervening time had a screening colonoscopy, where he underwent a polypectomy about two to three years ago, during which time he was also again found to not have any lesions consistent with inflammatory bowel disease. The patient reports that his IBS has been well controlled. About a month ago, he purchased a plane with the intent of renewing his pilot's license and has been stressed f