# Goal of this notebook:
 - Filter notes by ICD code indicating SAH:
   - ICD +
   - ICD -
 - Plan to do so:
   - For ICD +, filter using regex to include only ICD codes that I have listed
   - For ICD -, do the opposite
   - Remember that there are more rows than unique patients: some patients have multiple ICD codes linked to the same note.
   - To fix that, condense the rows into one per note by making the ICD column hold a list of ICD codes (not tidy, but it works here).
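
The condensing step could look roughly like the sketch below. It runs on a toy table, not the real data, and it assumes the merged table carries `BDSPPatientID`, `DeidentifiedName` and `ICD` columns; the values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the merged notes/ICD table (values are invented).
toy = pd.DataFrame({
    "BDSPPatientID": [1, 1, 2],
    "DeidentifiedName": ["note_a.txt", "note_a.txt", "note_b.txt"],
    "ICD": ["I60.9", "I10", "430"],
})

# One row per note: collect every ICD code linked to that note into a list.
per_note = (
    toy.groupby(["BDSPPatientID", "DeidentifiedName"], as_index=False)
       .agg(ICD=("ICD", list))
)
print(per_note)
```

The list column is not tidy, but it keeps the note as the unit of analysis, which is what the later labelling needs.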

### ICD +
 - Filter through the icd data by the 'ICDCD' column with codes that match ^(I60|430).*
 - I will be looking at all ICD codes assigned within +/- 1 month of the date of the discharge summary
 - Check the format of the dates
 - Merge based on patient ID and date
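
A minimal sketch of the date check and merge, on toy data. Several things here are assumptions: "1 month" is approximated as 30 days, the ICD dates are truncated to whole seconds before parsing (the 7-digit fractional part is assumed to carry no information), and the format of the notes' `CreateDate` column is a guess.

```python
import pandas as pd

# Toy ICD rows in the same string format as 'ShiftedContactDTS'.
icd = pd.DataFrame({
    "BDSPPatientID": [1, 1, 2],
    "DateICD": ["2018-03-09 00:00:00.0000000",
                "2018-07-01 00:00:00.0000000",
                "2019-10-20 00:00:00.0000000"],
    "ICD": ["430", "430", "I60.9"],
})
# Toy notes table; the CreateDate format is an assumption.
notes = pd.DataFrame({
    "BDSPPatientID": [1, 2],
    "CreateDate": ["2018-03-20", "2019-10-01"],
    "DeidentifiedName": ["note_a.txt", "note_b.txt"],
})

# Parse both date columns; keep only 'YYYY-MM-DD HH:MM:SS' from the ICD strings.
icd["DateICD"] = pd.to_datetime(icd["DateICD"].str.slice(0, 19))
notes["CreateDate"] = pd.to_datetime(notes["CreateDate"])

# Merge on patient ID, then keep ICD rows within +/- 30 days of the note date.
merged = notes.merge(icd, on="BDSPPatientID", how="inner")
window = pd.Timedelta(days=30)
in_window = (merged["DateICD"] - merged["CreateDate"]).abs() <= window
linked = merged[in_window].reset_index(drop=True)
print(linked)
```

Merging on patient ID first and filtering by date afterwards is simple but builds every patient-level pairing, so on the full data it may be worth filtering the ICD table before merging.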

# Overall Plan
1 - All ICD+/- assignments separated.
- Relevant info in the ICD csv: BDSPPatientID, DateAssigned ('ShiftedContactDTS'), ICD ('ICDCD')
- Relevant info in the notes df: BDSPPatientID, NoteType ('NoteTypeFull'), DateWritten ('CreateDate'), NoteTextFileName ('DeidentifiedName')
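
Since ICD + and ICD - are complementary subsets of the same partitions, one possible shortcut is to split both in a single pass rather than reading every partition twice. The sketch below is only an illustration: `split_by_sah` is a hypothetical helper, and a plain list of toy DataFrames stands in for the thunderpack partitions.

```python
import re
import pandas as pd

SAH_REGEX = r"^(I60|430)"  # ICD-10 I60.* and ICD-9 430

def split_by_sah(partitions):
    """Hypothetical helper: split each partition into SAH / non-SAH rows."""
    pos, neg = [], []
    for df in partitions:
        is_sah = df["ICDCD"].astype(str).str.match(SAH_REGEX, flags=re.I)
        pos.append(df[is_sah])
        neg.append(df[~is_sah])
    return (pd.concat(pos, ignore_index=True),
            pd.concat(neg, ignore_index=True))

# Toy partitions standing in for reader['ICD_partition_i'].
toy_partitions = [
    pd.DataFrame({"BDSPPatientID": [1, 2], "ICDCD": ["430", "I10"]}),
    pd.DataFrame({"BDSPPatientID": [3, 4], "ICDCD": ["i60.9", "E11.9"]}),
]
icd_plus, icd_minus = split_by_sah(toy_partitions)
print(len(icd_plus), len(icd_minus))
```

The non-SAH half will be far larger than the SAH half on real data, so holding it in memory may not be practical; writing each partition's negative rows to disk as they are produced would be one way around that.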

In [24]:
# imports
import pandas as pd
from thunderpack import ThunderReader
from tqdm import tqdm
import re

In [2]:
reader = ThunderReader('/home/jsearle/bigDrive/Dropbox/zz_EHR_Thunderpacks/MGB/thunderpack_icd_9_10_1m_MGB')
key_length = len(list(reader.keys()))
print(key_length)

511


In [3]:
# keep only rows whose ICD code starts with I60 (ICD-10) or 430 (ICD-9)
code_regex = r'^(I60|430)'
dfs = []
for i in tqdm(range(1, key_length + 1)):
    df = reader[f'ICD_partition_{i}']
    df = df[df['ICDCD'].astype(str).str.match(code_regex, flags = re.I)]
    dfs.append(df)

100%|██████████| 511/511 [23:58<00:00,  2.81s/it]


In [4]:
# show the total number of SAH ICD code assignments
# show an example of the df
filtered_icd_df = pd.concat(dfs, axis=0, ignore_index=True)
print(len(filtered_icd_df))
filtered_icd_df.head()

41796


Unnamed: 0,BDSPEncounterID,EncounterLineNBR,BDSPPatientID,ShiftedContactDTS,ICDLineNBR,ICDCD,ICDDSC,DiagnosisNM,DiagnosisDSC,PrimaryDiagnosisFLG,DiagnosisChronicFLG,ShiftedUpdateDTS,DiagnosisLinkedProblemID,BDSPLastModifiedDTS,code_type
0,13437640000.0,2,116398048.0,2018-03-09 00:00:00.0000000,1.0,430,Subarachnoid hemorrhage,Subarachnoid hemorrhage,,N,N,2019-07-26 09:49:00.0000000,52214845.0,2022-04-27 13:27:03.6830000,ICD9
1,13394370000.0,2,119744866.0,2019-10-20 00:00:00.0000000,1.0,430,Subarachnoid hemorrhage,SAH (subarachnoid hemorrhage),,N,N,2023-04-28 12:05:00.0000000,81182497.0,2023-08-16 01:27:07.9010000,ICD9
2,13584730000.0,1,116790672.0,2022-05-13 00:00:00.0000000,1.0,430,Subarachnoid hemorrhage,SAH (subarachnoid hemorrhage),,N,N,2022-05-24 09:42:00.0000000,,2022-04-27 15:51:06.4400000,ICD9
3,13556260000.0,1,122243491.0,2020-06-12 00:00:00.0000000,1.0,430,Subarachnoid hemorrhage,Subarachnoid hemorrhage,,Y,N,2020-06-12 10:21:00.0000000,96911003.0,2022-04-27 13:21:20.6900000,ICD9
4,13544450000.0,5,119133865.0,2021-03-07 00:00:00.0000000,1.0,430,Subarachnoid hemorrhage,Subarachnoid hemorrhage,,N,N,2021-03-07 14:15:00.0000000,54423881.0,2022-04-27 14:08:48.7770000,ICD9


In [14]:
# clean up df, keep only relevant info
keepColumns = ['BDSPPatientID', 'ShiftedContactDTS', 'ICDCD']
clean_icd_df = filtered_icd_df[keepColumns]
clean_icd_df.head()

Unnamed: 0,BDSPPatientID,ShiftedContactDTS,ICDCD
0,116398048.0,2018-03-09 00:00:00.0000000,430
1,119744866.0,2019-10-20 00:00:00.0000000,430
2,116790672.0,2022-05-13 00:00:00.0000000,430
3,122243491.0,2020-06-12 00:00:00.0000000,430
4,119133865.0,2021-03-07 00:00:00.0000000,430


In [22]:
# rename columns
rename_dict = { 
    'ShiftedContactDTS': 'DateICD', 
    'ICDCD': 'ICD', 
}

clean_icd_df = clean_icd_df.rename(columns=rename_dict)

clean_icd_df.head()


<class 'pandas._libs.tslibs.timestamps.Timestamp'>


In [23]:
# save df as csv
clean_icd_df.to_csv('1_icd_plus_df.csv', index=False)

### ICD -
 - Filter through the icd data by the 'ICDCD' column with codes that *don't* match ^(I60|430).*
 - I will be looking at all ICD codes assigned within +/- 1 month of the date of the discharge summary
 - Check the format of the dates
 - Merge based on patient ID and date
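
As a sanity check on the "don't match" filter, the toy example below compares a negative-lookahead pattern with simply negating the positive match. The codes are made up; both approaches should select the same rows.

```python
import re
import pandas as pd

# Toy codes: three SAH codes (one lower-case) and two unrelated codes.
codes = pd.Series(["430", "I60.9", "i60.1", "I10", "E11.9"])

# Option 1: negative lookahead, usable directly with str.match.
via_lookahead = codes.str.match(r"^(?!I60|430)", flags=re.I)

# Option 2: negate the positive SAH match.
via_negation = ~codes.str.match(r"^(I60|430)", flags=re.I)

print(codes[via_negation].tolist())
```

Negating the positive match keeps a single regex as the source of truth for what counts as SAH, which makes it harder for the two filters to drift apart.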

In [25]:
# keep only rows whose ICD code does NOT start with I60 or 430
code_regex = r'^(?!I60|430)'
dfs = []
for i in tqdm(range(1, key_length + 1)):
    df = reader[f'ICD_partition_{i}']
    df = df[df['ICDCD'].astype(str).str.match(code_regex, flags = re.I)]
    dfs.append(df)

 10%|█         | 53/511 [02:44<24:18,  3.18s/it]


In [None]:
# show the total number of non-SAH ICD code assignments
# show an example of the df
filtered_icd_df = pd.concat(dfs, axis=0, ignore_index=True)
print(len(filtered_icd_df))
filtered_icd_df.head()

In [None]:
# clean up df, keep only relevant info
keepColumns = ['BDSPPatientID', 'ShiftedContactDTS', 'ICDCD']
clean_icd_df = filtered_icd_df[keepColumns]
clean_icd_df.head()

In [None]:
# rename columns
rename_dict = { 
    'ShiftedContactDTS': 'DateICD', 
    'ICDCD': 'ICD', 
}

clean_icd_df = clean_icd_df.rename(columns=rename_dict)

clean_icd_df.head()