# Analysis of Diagnosis Data
This notebook explores the diagnosis data available through MIMIC-IV. Having determined which rows of *d_icd_diagnoses.csv* relate to kidney disease, our first task is to identify the `subject_id` of every patient who has been diagnosed with kidney disease. Our second task is to determine which other diagnosis codes are associated with these patients, that is, which other medical conditions the patients with kidney disease have. This information will be used to finalize the dataset used in our study.


In [69]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import json

We begin by loading *d_icd_diagnoses.csv* into a pandas dataframe and viewing the first few rows. Notice that the index value $n$ corresponds to row $n+2$ in the csv file: the header occupies row $1$, and the pandas index starts at $0$. This offset must be kept in mind when accessing the rows of the csv file that Amelia identified as being related to kidney disease. For those rows, we then retrieve the corresponding `icd_code`, `icd_version`, and `long_title`.

In [70]:
diagnoses_codes_df = pd.read_csv('../Data/d_icd_diagnoses.csv')
diagnoses_codes_df.head()

Unnamed: 0,icd_code,icd_version,long_title
0,10,9,Cholera due to vibrio cholerae
1,11,9,Cholera due to vibrio cholerae el tor
2,19,9,"Cholera, unspecified"
3,20,9,Typhoid fever
4,21,9,Paratyphoid fever A
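The preview above shows codes such as `10` without leading zeros. Assuming the csv file stores ICD codes as zero-padded text (an assumption about the raw file, not something verified here), the sketch below shows on a toy in-memory csv how `read_csv` parses purely numeric codes as integers by default, and how passing `dtype` keeps them as text. The toy values and variable names are ours.

```python
import io
import pandas as pd

# Toy csv with zero-padded, purely numeric codes (illustrative values only)
numeric_csv = "icd_code,icd_version\n0010,9\n0011,9\n"

# Default parsing infers integers, so the leading zeros are dropped
default_codes = pd.read_csv(io.StringIO(numeric_csv))['icd_code'].tolist()

# Forcing a string dtype keeps the codes exactly as written in the file
text_codes = pd.read_csv(io.StringIO(numeric_csv), dtype={'icd_code': str})['icd_code'].tolist()

print(default_codes)
print(text_codes)
```

Whichever representation is used, the same choice should be applied to every file whose `icd_code` column is compared, so that equality tests between the files remain meaningful.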


In [71]:
# Row numbers, as they appear in the csv file, of the kidney disease entries
kidney_disease_indices = [2668, 2670, 6001, 6002]
kidney_disease_indices += list(range(4660, 4678))
kidney_disease_indices += list(range(5956, 6000))
kidney_disease_indices += list(range(6026, 6031))
kidney_disease_indices += list(range(6086, 6089))
kidney_disease_indices += list(range(6609, 6614))
# Account for the difference between the pandas index and the row number in the csv file
kidney_disease_indices = [n - 2 for n in kidney_disease_indices]
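As a quick sanity check of the $n+2$ offset, the following sketch applies the same shift to a toy csv held in memory. The file contents and the names `toy_df`, `csv_rows`, and `positions` are made up for illustration.

```python
import io
import pandas as pd

# Toy csv: the header is row 1, so the first record sits on row 2
csv_text = "icd_code,icd_version,long_title\nA,9,first\nB,9,second\nC,10,third\n"
toy_df = pd.read_csv(io.StringIO(csv_text))

csv_rows = [2, 4]                          # 1-based row numbers as seen in the file
positions = [row - 2 for row in csv_rows]  # matching 0-based pandas positions
selected_titles = toy_df['long_title'].iloc[positions].tolist()
print(selected_titles)  # ['first', 'third']
```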

In [72]:
# Look up the code, version, and title of each kidney disease row by position
icd_codes = diagnoses_codes_df['icd_code'].iloc[kidney_disease_indices].tolist()
icd_versions = diagnoses_codes_df['icd_version'].iloc[kidney_disease_indices].tolist()
diagnoses_title = diagnoses_codes_df['long_title'].iloc[kidney_disease_indices].tolist()

With the relevant `icd_code` and `icd_version` values at hand, we now turn to the file *diagnoses_icd.csv*. Each row of this file records one diagnosis made for a given `subject_id`. We want to extract the `subject_id` of every patient who was diagnosed with kidney disease.

In [73]:
diagnoses_df = pd.read_csv('../Data/diagnoses_icd.csv')
diagnoses_df.head()

Unnamed: 0,subject_id,hadm_id,seq_num,icd_code,icd_version
0,10000032,22595853,1,5723,9
1,10000032,22595853,2,78959,9
2,10000032,22595853,3,5715,9
3,10000032,22595853,4,7070,9
4,10000032,22595853,5,496,9


In [74]:
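# Collect the subject_id of every diagnosis row that matches a kidney disease code
subject_ids = []
for icd_code, icd_version in zip(icd_codes, icd_versions):
    is_match = (diagnoses_df['icd_code'] == icd_code) & (diagnoses_df['icd_version'] == icd_version)
    subject_ids += diagnoses_df.loc[is_match, 'subject_id'].tolist()
unique_subject_ids = list(set(subject_ids))
print(len(subject_ids))         # number of matching diagnosis rows
print(len(unique_subject_ids))  # number of distinct patients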

20727
4763
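The loop above scans the full diagnosis table once per code. One possible alternative, sketched here on toy data, is a single inner merge on the pair (`icd_code`, `icd_version`). The frames `toy_diagnoses` and `toy_kidney_codes` and their values are illustrative stand-ins, not rows taken from MIMIC-IV.

```python
import pandas as pd

# Toy stand-in for diagnoses_df
toy_diagnoses = pd.DataFrame({
    'subject_id': [1, 1, 2, 3, 3, 4],
    'icd_code': ['5859', '4019', 'N189', '5859', '5859', '4019'],
    'icd_version': [9, 9, 10, 9, 9, 9],
})

# Toy stand-in for the selected (icd_code, icd_version) pairs
toy_kidney_codes = pd.DataFrame({
    'icd_code': ['5859', 'N189'],
    'icd_version': [9, 10],
}).drop_duplicates()

# An inner merge keeps only diagnosis rows whose (code, version) pair was selected
matches = toy_diagnoses.merge(toy_kidney_codes, on=['icd_code', 'icd_version'], how='inner')
n_matching_rows = len(matches)
toy_unique_ids = sorted(matches['subject_id'].unique().tolist())
print(n_matching_rows, toy_unique_ids)  # 4 [1, 2, 3]
```

Dropping duplicate code pairs before merging matters: a repeated pair would otherwise duplicate the matching diagnosis rows and inflate the row count.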


Based on the output of the cell above, a kidney disease code was recorded $20{,}727$ times, but only $4{,}763$ unique patients received such a diagnosis. For these $4{,}763$ patients, we now collect all diagnoses that were made, including those unrelated to kidney disease. The result is stored in the dictionary `kidney_disease_patients`. Its keys are the `subject_id` values, and each value is a tuple of the form (`seq_num`, `icd_code`, `icd_version`). Each entry of the tuple is a list, since a patient can have more than one diagnosis.

In [75]:
kidney_disease_patients = {}
for subject_id in unique_subject_ids:
    # All diagnosis rows for this patient, not only the kidney disease ones
    df_now = diagnoses_df.loc[diagnoses_df['subject_id'] == subject_id, ['seq_num', 'icd_code', 'icd_version']]
    kidney_disease_patients[subject_id] = (df_now['seq_num'].tolist(),
                                           df_now['icd_code'].tolist(),
                                           df_now['icd_version'].tolist())
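Filtering the full table once per patient can be slow for several thousand patients. A minimal sketch of an alternative, assuming the same dictionary layout is wanted, filters once with `isin` and then groups by `subject_id`. The toy frame and the names `toy_unique_ids` and `toy_patients` are hypothetical.

```python
import pandas as pd

# Toy stand-in for diagnoses_df
toy_diagnoses = pd.DataFrame({
    'subject_id': [1, 1, 2, 3],
    'seq_num': [1, 2, 1, 1],
    'icd_code': ['5859', '4019', 'N189', '4019'],
    'icd_version': [9, 9, 10, 9],
})
toy_unique_ids = [1, 2]  # patients of interest

# Keep only the rows of the selected patients, then split them by patient
subset = toy_diagnoses[toy_diagnoses['subject_id'].isin(toy_unique_ids)]
toy_patients = {
    int(subject_id): (group['seq_num'].tolist(),
                      group['icd_code'].tolist(),
                      group['icd_version'].tolist())
    for subject_id, group in subset.groupby('subject_id')
}
print(toy_patients)
```

The `int(...)` conversion turns the NumPy integer group keys into plain Python integers, which keeps the dictionary friendly to `json.dump`.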




In [77]:
with open('../Data/kidney_disease_patients.json', 'w') as file:
    json.dump(kidney_disease_patients, file)

with open('../Data/unique_subject_ids.json', 'w') as file:
    json.dump(unique_subject_ids, file)
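When these files are read back, keep in mind that JSON object keys are always strings and that JSON has no tuple type, so the integer `subject_id` keys come back as strings and the tuples come back as lists. The sketch below shows the round trip on a one-entry toy dictionary and one way to restore the original types; the name `restored` is ours.

```python
import json

# One-entry toy dictionary with the same layout as kidney_disease_patients
toy_patients = {10000032: ([1, 2], ['5723', '78959'], [9, 9])}

encoded = json.dumps(toy_patients)
decoded = json.loads(encoded)
print(decoded)  # keys are now strings, tuples are now lists

# Convert keys back to int and values back to tuples of lists
restored = {int(key): tuple(value) for key, value in decoded.items()}
print(restored == toy_patients)  # True
```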