# Trimming the Lab Event Data
One of the files available through MIMIC-IV that we expect to be important for our analysis is *labevents.csv*. As its name indicates, this file contains laboratory measurements for the patients in the dataset. Unfortunately, *labevents.csv* is quite large, which prevents us from uploading it to the GitHub repository or loading it into a single pandas dataframe. For these reasons, in this notebook we reduce *labevents.csv* to a more manageable size in two steps. First, we eliminate all rows pertaining to patients who **do not** have kidney disease. Second, by sorting through *d_labitems.csv*, Amelia found that only $332$ of the laboratory measurements are relevant to our study; these are listed in *Reduced_labitems.csv*, and we use their `itemid` values to reduce *labevents.csv* further. Importantly, neither of these two steps eliminates information that is useful for our study.

In [14]:
import pandas as pd
import json

The next cell executes the first step in the reduction. Specifically, we process *labevents.csv* in chunks of $10^6$ rows and, in each chunk, look for the `subject_id` of every patient who was diagnosed with kidney disease. The rows corresponding to these patients are saved in a dataframe called `labevents_df`.

In [15]:
with open('../Data/unique_subject_ids.json', 'r') as file:
    unique_subject_ids = json.load(file)

labevents_cols = ['labevent_id',
                 'subject_id',
                 'hadm_id',
                 'specimen_id',
                 'itemid',
                 'order_provider_id',
                 'charttime',
                 'storetime',
                 'value',
                 'valuenum',
                 'valueuom',
                 'ref_range_lower',
                 'ref_range_upper',
                 'flag',
                 'priority',
                 'comments']
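# Start from an empty dataframe and append the matching rows chunk by chunk.
labevents_df = pd.DataFrame(columns = labevents_cols)
chunk_size = 10**6  # number of rows read into memory at a time
for chunk in pd.read_csv('../Data/labevents.csv', chunksize = chunk_size):
    for n in unique_subject_ids:
        # Rows of this chunk that belong to patient n.
        chunk_reduced = chunk.loc[chunk['subject_id'] == n]
        if not chunk_reduced.empty:
            labevents_df = pd.concat([labevents_df, chunk_reduced])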
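
The per-patient loop above scans every chunk once for each `subject_id`. As a possible alternative, the same selection can be sketched with one vectorized membership test per chunk. This is a minimal illustration rather than the code used to produce the results in this notebook: the toy CSV, the IDs in `toy_subject_ids`, and the helper `filter_csv_by_subject` are all hypothetical stand-ins.

```python
import io

import pandas as pd

# Toy stand-in for labevents.csv (the real file has 16 columns).
toy_csv = io.StringIO(
    "labevent_id,subject_id,itemid,valuenum\n"
    "1,100,50912,1.2\n"
    "2,101,50912,0.9\n"
    "3,100,51006,30.0\n"
    "4,102,50971,4.1\n"
    "5,103,50912,2.4\n"
)

# Hypothetical IDs playing the role of unique_subject_ids.
toy_subject_ids = [100, 103]


def filter_csv_by_subject(source, subject_ids, chunk_size):
    """Return the rows of a CSV whose subject_id appears in subject_ids."""
    kept = []
    for chunk in pd.read_csv(source, chunksize=chunk_size):
        # One membership test per chunk instead of one pass per patient.
        kept.append(chunk[chunk['subject_id'].isin(subject_ids)])
    # Concatenate once at the end rather than growing a dataframe in the loop.
    return pd.concat(kept, ignore_index=True)


toy_labevents_df = filter_csv_by_subject(toy_csv, toy_subject_ids, chunk_size=2)
print(toy_labevents_df)
```

Unlike the loop, this version keeps rows in the order they appear in the file instead of grouping them by patient within each chunk; the set of selected rows is the same as long as the ID list contains no duplicates.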


The remainder of this notebook focuses on the second step in the reduction. This entails reading the `itemid` values from *Reduced_labitems.csv*, then dropping every row of `labevents_df` whose `itemid` is not relevant to our study. The result is written to a CSV file called *kidney_disease_patients_labs.csv*.

In [42]:
reduced_labitems_df = pd.read_csv('../Data/Reduced_labitems.csv')
# itemid values of the laboratory measurements relevant to our study.
labs_itemid = list(reduced_labitems_df['itemid'])
kidney_disease_patients_labs_df = pd.DataFrame(columns = labevents_cols)
for n in labs_itemid:
    # Keep the rows of labevents_df that record measurement n.
    lab_n_df = labevents_df.loc[labevents_df['itemid'] == n]
    kidney_disease_patients_labs_df = pd.concat([kidney_disease_patients_labs_df, lab_n_df])
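
As a sanity check on the approach above, the following sketch compares the per-`itemid` loop with a single boolean mask on a toy dataframe. The data and the item IDs in `toy_itemids` are hypothetical placeholders, and the comparison assumes the list of IDs contains no duplicates.

```python
import pandas as pd

# Toy stand-in for labevents_df; all values are hypothetical.
toy_labevents_df = pd.DataFrame({
    'subject_id': [100, 100, 103, 103],
    'itemid': [50912, 51006, 50971, 50912],
    'valuenum': [1.2, 30.0, 4.1, 2.4],
})
toy_itemids = [50912, 50971]

# Loop version: one pass over the dataframe per itemid.
pieces = [toy_labevents_df.loc[toy_labevents_df['itemid'] == n] for n in toy_itemids]
looped_df = pd.concat(pieces)

# Mask version: a single membership test over the itemid column.
masked_df = toy_labevents_df[toy_labevents_df['itemid'].isin(toy_itemids)]

# Both select the same rows; only the row order differs.
assert looped_df.sort_index().equals(masked_df.sort_index())
print(masked_df.shape)
```

The loop groups its output by `itemid`, whereas the mask preserves the original row order, so any downstream step that relies on ordering would need an explicit sort.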


In [43]:
kidney_disease_patients_labs_df.to_csv('../Data/kidney_disease_patients_labs.csv', index = False)
print(labevents_df.shape)
print(kidney_disease_patients_labs_df.shape)

(7453719, 16)
(6050074, 16)


Based on the output of the cell above, the second step in the reduction eliminated $1,403,645$ rows from the dataframe. 
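
The row count quoted above follows directly from the two printed shapes. A short check, using only the numbers from the output (the variable names are ours):

```python
rows_after_step_one = 7453719  # first value of labevents_df.shape
rows_after_step_two = 6050074  # first value of kidney_disease_patients_labs_df.shape

rows_removed = rows_after_step_one - rows_after_step_two
fraction_removed = rows_removed / rows_after_step_one

print(f"{rows_removed:,} rows removed ({fraction_removed:.1%} of the rows left after step one)")
```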