# **EMR-LIP Framework: Preprocessing Irregular Longitudinal Data Across Databases**

This document guides users on how to preprocess irregular longitudinal data in various databases using the APIs provided by the EMR-LIP framework.

<u>**Important Note**: EMR-LIP does not include data cleaning or the generation of clinical concepts. Variable definitions, measurement conventions, names, and other aspects differ across databases, which makes it difficult to offer a universal module for automated data cleaning and clinical concept generation in each database.</u>

<img src="./image.png" width="800">

## **Chapter 1: Example of the use of EMR-LIP and Universality of EMR-LIP**

EMR-LIP's universality lies in its ability to process data from any database. Users can provide their original irregular longitudinal data in either a long table or wide table format, along with a variable dictionary that meets EMR-LIP requirements. EMR-LIP can then swiftly convert this raw data into structured multivariate time series data. This transformation facilitates modeling longitudinal dependencies, for example with LSTM networks.
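
To make the two input layouts concrete, here is a minimal sketch with made-up values. The column names `id`, `time`, `item`, and `value` and the variables `hr` and `sbp` are hypothetical. Only the dictionary column names follow those used later in this document.

```r
# Toy data (hypothetical values): the same observations in long and wide form.
long_tbl <- data.frame(
  id    = c(1, 1, 1),
  time  = c(0.5, 0.5, 2.0),          # hours since admission
  item  = c("hr", "sbp", "hr"),      # variable identifier
  value = c("94", "120", "101"),     # stored as character, as in many EMR tables
  stringsAsFactors = FALSE
)

wide_tbl <- data.frame(
  id   = c(1, 1),
  time = c(0.5, 2.0),
  hr   = c(94, 101),
  sbp  = c(120, NA)                  # sbp was not measured at 2.0 h
)

# A variable dictionary with the columns used later in this document.
dict <- data.frame(
  itemid         = c("hr", "sbp"),
  value_type     = c("num", "num"),
  time_attribute = c("single", "single"),
  acqu_type      = c("obs", "obs"),
  agg_f          = c("last", "last"),
  stringsAsFactors = FALSE
)

# Every variable in either table should have a dictionary row.
stopifnot(all(unique(long_tbl$item) %in% dict$itemid))
stopifnot(all(setdiff(names(wide_tbl), c("id", "time")) %in% dict$itemid))
```

In the long layout the variable identifier lives in a column, whereas in the wide layout it is the column name. In both cases the dictionary is keyed on the same identifiers.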

### **Case Studies: MIMIC-IV and eICU-CRD**

Let's illustrate EMR-LIP's functionality and its universality across different databases using two major public databases: MIMIC-IV and eICU-CRD. We'll demonstrate how EMR-LIP rapidly processes data according to the user-defined variable dictionary and showcase its adaptability across various databases.

In [2]:
## Load the EMR-LIP framework
source("/home/luojiawei/Benchmark_project/EMR_APIs.R")

### **MIMIC-IV**

The MIMIC-IV database provides what can be described as a "semi-finished" variable dictionary. This dictionary includes information such as item IDs, labels, and abbreviations for various variables. However, this information alone is insufficient for the EMR-LIP framework to function effectively. Therefore, additional effort is required to construct a comprehensive variable dictionary.

#### **Building the Variable Dictionary**

The process of building this dictionary, though labor-intensive, is a critical step. Once the dictionary is complete, subsequent data processing becomes significantly easier. To illustrate this, let's consider the 'chartevents' table from MIMIC-IV. This table is a prime example of an irregular longitudinal data table presented in a long table format.

In [328]:
# First, read the "semi-finished" variable dictionary
d_items<-fread("/home/luojiawei/mimiciv/mimic-iv-2.2/icu/d_items.csv.gz",header=T,fill=T)

In [4]:
# Let's look at its structure
d_items[1:5,]

itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<dbl>
220001,Problem List,Problem List,chartevents,General,,Text,,
220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,
220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,


In [5]:
# Now we build the dictionary with variables from the "Routine Vital Signs" category in the category column
d_custom <- d_items[d_items$category == "Routine Vital Signs", c("itemid", "param_type")]

In [6]:
# See value types of the variables
unique(d_custom$param_type)

In [7]:
# Because EMR-LIP cannot handle free text, we must discard the variables of type "Text"
d_custom <- d_custom[d_custom$param_type != "Text",]

In [319]:
# Next, we read the raw data stored in the long table format and the admissions table that holds the basic information for admission
chartevents<-fread("/home/luojiawei/mimiciv/mimic-iv-2.2/icu/chartevents.csv.gz",header=T,fill=T)
admissions<-fread("/home/luojiawei/mimiciv/mimic-iv-2.2/hosp/admissions.csv.gz",header=T,fill=T)

In [9]:
# Since chartevents is large, we only keep rows whose itemid appears in the variable dictionary in order to speed up the calculation
ds_vital <- chartevents[which(chartevents$itemid %in% d_custom$itemid),]

Examining the structure of the 'ds_vital' dataset reveals multiple columns, but only a few are essential for data preprocessing. These columns are 'hadm_id', 'charttime', 'itemid', and 'value'. 'hadm_id' represents the ID of each sample, 'charttime' is the timestamp when each observation was recorded, 'itemid' is the ID for each observed variable, and 'value' is the observed value for that variable.

It's important to note that we opt for the 'value' column instead of 'valuenum'. This choice is made because we might handle both numerical and discrete variables, and some discrete variables are stored as strings. These would appear as NA in the 'valuenum' column.
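
The effect described above can be reproduced on a toy vector. This is only a sketch with made-up values; the string `"Sinus Rhythm"` is a hypothetical discrete reading, not taken from the table.

```r
# Hypothetical raw values: two numeric readings and one discrete (string) reading.
value <- c("82", "59", "Sinus Rhythm")

# Coercing to numeric mimics what a numeric-only column can hold.
valuenum <- suppressWarnings(as.numeric(value))

print(valuenum)          # the discrete entry cannot be represented
print(is.na(valuenum))   # flags which entries were lost
```

Keeping the character column preserves the discrete reading, and numeric variables can still be converted later, once their value type is known from the dictionary.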

In [10]:
ds_vital[1:5,]

subject_id,hadm_id,stay_id,caregiver_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
<int>,<int>,<int>,<int>,<dttm>,<dttm>,<int>,<chr>,<dbl>,<chr>,<int>
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220179,82,82,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220180,59,59,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220181,63,63,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220045,94,94,bpm,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220179,85,85,mmHg,0


In [11]:
# Modify the value type of the variable dictionary to satisfy the EMR-LIP
d_custom$param_type[d_custom$param_type == "Numeric"] <- "num"
d_custom$param_type[d_custom$param_type == "Checkbox"] <- "bin"

In [12]:
# Since the measurement times for these variables are all at single points, the time attribute for these variables should be set to "single".
d_custom$time_attribute <- "single"

In [13]:
# Since these variables are not a result of any medical intervention, their acquisition type should be categorized as "obs" (observational).
d_custom$acqu_type <- "obs"

In [14]:
# Specify their aggregation method, here for simplicity we specify the "last" aggregation method for all of them
d_custom$agg_f <- "last"

In [15]:
# Specifies the name of each column in the variable dictionary
names(d_custom) <- c("itemid","value_type","time_attribute", "acqu_type","agg_f")

In [16]:
d_custom[1:6,]

itemid,value_type,time_attribute,acqu_type,agg_f
<int>,<chr>,<chr>,<chr>,<chr>
220045,num,single,obs,last
220050,num,single,obs,last
220051,num,single,obs,last
220052,num,single,obs,last
220179,num,single,obs,last
220180,num,single,obs,last


With this, a valid and effective variable dictionary has been constructed, providing all the information needed for subsequent processing. Strictly speaking, the EMR-LIP program actually begins from this point. Next, we will demonstrate how to quickly preprocess samples for irregular longitudinal data.

#### **Generate the statistical information**

In [17]:
# Calculate statistics for each variable based on its value type for subsequent processing
stat_vital <- get_stat_long(ds_vital, d_custom$itemid, d_custom$value_type, "itemid", "value")

In [18]:
# Without loss of generality, we choose the sample with hadm_id of 29079034 as our example (we take one admission as a sample)
ds_vital_k <- ds_vital[ds_vital$hadm_id == 29079034,]

In [19]:
# Select a time starting point that will serve as a reference time point for relative time generation, usually the patient's time of admission
time_ref <- admissions$admittime[admissions$hadm_id==29079034]

In [20]:
# Convert the time of each observation in the long table to a relative time with reference to time_ref
ds_vital_k$charttime_r <- as.numeric(difftime(ds_vital_k$charttime, time_ref, units = "hours"))

In [21]:
# Next we can remove some extreme values from the full table, so that the statistics recomputed below are not distorted by them
ds_vital <- remove_extreme_value_long(ds_vital, d_custom$itemid, d_custom$value_type, "itemid", "value", neg_valid = F)

In [22]:
# Recalculate the statistics of the variables
stat_vital <- get_stat_long(ds_vital, d_custom$itemid, d_custom$value_type, "itemid", "value")

In [23]:
stat_vital[1:3]

In [24]:
# Construct an end-time column. It has no effect on single-point variables, but it keeps the function interface uniform
ds_vital_k$timecol <- "NA"
# Give a set of times you want to resample, e.g. 1 to 48 hours, with 1 hour intervals
time_range <- 1:48
# Remember, you need to sort the data by time before resampling
ds_vital_k <- ds_vital_k[order(ds_vital_k$charttime_r, decreasing = F), ]

#### **Sequence resampling**

In [25]:
# resample the long table
ds_vital_k1 <- resample_data_long(ds_vital_k, # long table that needs resampling
                                  d_custom$itemid, # id list of the variable
                                  d_custom$value_type,  # The value type list of the variable
                                  d_custom$agg_f, # aggregation method list of variables
                                  time_range,  # set of time points to resample onto
                                  "itemid",  # the column name for variable names in the long table
                                  "value",  # the column name for variable value in the long table
                                  "charttime_r",  # the column name for relative start times in the long table
                                  "timecol",  # the column name for relative end times in the long table
                                  1 # time interval/time window
                                  )

After resampling, each variable has been aligned to the given set of timestamps. Each variable occupies a separate column, and the column names correspond to the variable IDs in the variable dictionary. The second column, 'keep', is an indicator: a value of 1 means that at least one variable has a non-NA observation at that timestamp.
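
The idea behind resampling with the "last" aggregation can be sketched in base R for a single variable. This is not the implementation of `resample_data_long`. The helper `resample_last` is hypothetical, and the window convention `(t - time_window, t]` is an assumption made for the sketch.

```r
# Toy long table for one variable (hypothetical values), sorted by time.
obs <- data.frame(time = c(0.2, 0.9, 1.5, 3.1), value = c(90, 94, 105, 97))
time_range <- 1:4
time_window <- 1

# For each grid point t, keep the last value observed in (t - time_window, t].
# The window convention is an assumption made for this sketch.
resample_last <- function(obs, time_range, time_window) {
  sapply(time_range, function(t) {
    in_win <- obs$time > (t - time_window) & obs$time <= t
    if (any(in_win)) tail(obs$value[in_win], 1) else NA_real_
  })
}

resampled <- data.frame(time = time_range,
                        value = resample_last(obs, time_range, time_window))
# With a single variable, 'keep' reduces to "was anything observed here?"
resampled$keep <- as.integer(!is.na(resampled$value))
print(resampled)
```

Because the last value in each window is taken by position, the input must already be sorted by time, which is why the sorting step precedes resampling.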

In [26]:
ds_vital_k1[1:10,1:10]

time,keep,220045,220050,220051,220052,220179,220180,220181,223761
1,1,,,,,,,,98.7
2,1,94.0,,,,88.0,56.0,64.0,98.7
3,1,105.0,,,,91.0,55.0,64.0,98.7
4,1,97.0,,,,95.0,58.0,67.0,98.7
5,1,100.0,,,,86.0,53.0,60.0,98.7
6,1,97.0,,,,93.0,41.0,56.0,98.7
7,1,100.0,,,,90.0,57.0,64.0,99.5
8,1,94.0,,,,82.0,59.0,63.0,99.5
9,1,94.0,,,,85.0,55.0,62.0,99.5
10,1,94.0,,,,85.0,55.0,62.0,99.5


The 'get_fill_method' function takes the variable dictionary as both input and output, but it adds two columns to indicate the methods for filling missing values for each variable. The "fill" column represents the method for filling gaps when there is at least one observation of the variable within the entire time range, while the "fillall" column represents the method for filling when there are no observations of the variable across the entire time range.

In [27]:
d_custom <- get_fill_method(d_custom)

In [28]:
d_custom[1:4,]

itemid,value_type,time_attribute,acqu_type,agg_f,fill,fillall
<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
220045,num,single,obs,last,locf,mean
220050,num,single,obs,last,locf,mean
220051,num,single,obs,last,locf,mean
220052,num,single,obs,last,locf,mean


#### **Data imputation**

In [29]:
ds_vital_k2 <- fill(ds_vital_k1,  # wide table to fill
                    3:ncol(ds_vital_k1),  # The index of the column to be filled
                    1,  # Index of the time column
                    d_custom$value_type,  # value type list
                    d_custom$fill,  # fill method list
                    d_custom$fillall, # fillall method list 
                    stat_vital # List of statistics (this was previously generated using the get_stat_* function)
                    )

In [30]:
ds_vital_k2[1:10,1:10]

time,220045,220050,220051,220052,220179,220180,220181,223761,223762
1,94,119.068415226851,58.7217116589301,78.7468714221984,88,56,64,98.7,36.9328261615236
2,94,119.068415226851,58.7217116589301,78.7468714221984,88,56,64,98.7,36.9328261615236
3,105,119.068415226851,58.7217116589301,78.7468714221984,91,55,64,98.7,36.9328261615236
4,97,119.068415226851,58.7217116589301,78.7468714221984,95,58,67,98.7,36.9328261615236
5,100,119.068415226851,58.7217116589301,78.7468714221984,86,53,60,98.7,36.9328261615236
6,97,119.068415226851,58.7217116589301,78.7468714221984,93,41,56,98.7,36.9328261615236
7,100,119.068415226851,58.7217116589301,78.7468714221984,90,57,64,99.5,36.9328261615236
8,94,119.068415226851,58.7217116589301,78.7468714221984,82,59,63,99.5,36.9328261615236
9,94,119.068415226851,58.7217116589301,78.7468714221984,85,55,62,99.5,36.9328261615236
10,94,119.068415226851,58.7217116589301,78.7468714221984,85,55,62,99.5,36.9328261615236
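
The two fill stages can be illustrated with a small base-R sketch. It is not the code behind `fill`. The helper `fill_locf` and the value passed as `global_avg` are hypothetical. The backward fill of leading gaps is inferred from the output above, where hour 1 of variable 220045 was missing after resampling but equals 94 after filling.

```r
# Fill one resampled column.
#   x          numeric vector with NA for missing grid points
#   global_avg fallback used when x has no observation at all
#              (standing in for the "mean" fillall method)
fill_locf <- function(x, global_avg) {
  if (all(is.na(x))) return(rep(global_avg, length(x)))
  # Carry the last observation forward.
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- x[i - 1]
  }
  # Leading gaps have nothing to carry forward, so use the first observation.
  first_obs <- which(!is.na(x))[1]
  x[seq_len(first_obs - 1)] <- x[first_obs]
  x
}

print(fill_locf(c(NA, 94, NA, 97, NA), global_avg = 80))
print(fill_locf(c(NA, NA, NA), global_avg = 80))
```

The second call corresponds to the "fillall" case: with no observation in the whole range, every grid point receives the population-level statistic.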


#### **Variable transformation**

In [31]:
ds_vital_k2 <- norm_num(ds_vital_k2,  # wide table that needs to be normalized
                        2:ncol(ds_vital_k2), # Index of the column that needs to be processed
                        1, # Index of the time column
                        d_custom$value_type,  # value type list
                        stat_vital # List of statistics (this was previously generated using the get_stat_* function)
                        )

In [32]:
ds_vital_k2[1:10,1:10]

time,220045,220050,220051,220052,220179,220180,220181,223761,223762
1,0.6689118132,0,0,0,-2.142094516,-0.8735569127,-1.4086559815,0.2689092536,0
2,0.6689118132,0,0,0,-2.142094516,-0.8735569127,-1.4086559815,0.2689092536,0
3,1.5908006392,0,0,0,-1.9382515751,-0.9736575371,-1.4086559815,0.2689092536,0
4,0.9203360385,0,0,0,-1.6664609872,-0.6733556641,-1.1115263477,0.2689092536,0
5,1.1717602637,0,0,0,-2.27798981,-1.1738587858,-1.8048288266,0.2689092536,0
6,0.9203360385,0,0,0,-1.8023562811,-2.3750662779,-2.2010016718,0.2689092536,0
7,1.1717602637,0,0,0,-2.0061992221,-0.7734562884,-1.4086559815,1.5609787465,0
8,0.6689118132,0,0,0,-2.5497803978,-0.5732550397,-1.5076991928,1.5609787465,0
9,0.6689118132,0,0,0,-2.3459374569,-0.9736575371,-1.6067424041,1.5609787465,0
10,0.6689118132,0,0,0,-2.3459374569,-0.9736575371,-1.6067424041,1.5609787465,0
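
In the output above, columns that were filled entirely with the population mean become 0, which is what a z-score transformation would produce. The sketch below shows that transformation on made-up numbers. Whether `norm_num` uses exactly the mean and standard deviation stored in the statistics list is an assumption, and the `stat` list and `zscore` helper are hypothetical.

```r
# Hypothetical statistics for one numeric variable.
stat <- list(mean = 80, sd = 10)

# Standardize: subtract the population mean, divide by the population sd.
zscore <- function(x, stat) (x - stat$mean) / stat$sd

x <- c(94, 80, 70)   # 80 plays the role of a mean-imputed value
z <- zscore(x, stat)
print(z)
```

Using statistics computed over the whole dataset, rather than per sample, keeps the scale comparable across admissions.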


In [33]:
ds_vital_k2 <- to_onehot(ds_vital_k2, # wide table that requires one_hot transformation
                         2:ncol(ds_vital_k2), # Index of the column that needs to be processed
                         1, # Index of the time column
                         d_custom$value_type,  # value type list
                         stat_vital # List of statistics (this was previously generated using the get_stat_* function)
                         )

$\mathbf{X}$

In [34]:
ds_vital_k2[1:4,1:10]

time,220045,220050,220051,220052,220179,220180,220181,223761,223762
1,0.6689118132,0,0,0,-2.142094516,-0.8735569127,-1.4086559815,0.2689092536,0
2,0.6689118132,0,0,0,-2.142094516,-0.8735569127,-1.4086559815,0.2689092536,0
3,1.5908006392,0,0,0,-1.9382515751,-0.9736575371,-1.4086559815,0.2689092536,0
4,0.9203360385,0,0,0,-1.6664609872,-0.6733556641,-1.1115263477,0.2689092536,0


#### **Feature engineering**

$\mathbf{M}$

In [35]:
mask_k1 <- get_mask(ds_vital_k1, 3:ncol(ds_vital_k1), 1)
mask_k1 <- shape_as_onehot(mask_k1, 2:ncol(mask_k1), 1, get_type(stat_vital), stat_vital)

In [36]:
mask_k1[1:10,1:10]

time,220045,220050,220051,220052,220179,220180,220181,223761,223762
1,0,0,0,0,0,0,0,1,0
2,1,0,0,0,1,1,1,1,0
3,1,0,0,0,1,1,1,1,0
4,1,0,0,0,1,1,1,1,0
5,1,0,0,0,1,1,1,1,0
6,1,0,0,0,1,1,1,1,0
7,1,0,0,0,1,1,1,1,0
8,1,0,0,0,1,1,1,1,0
9,1,0,0,0,1,1,1,1,0
10,1,0,0,0,1,1,1,1,0


$\mathbf{\Delta}$

In [37]:
delta_k1 <- get_deltamat(mask_k1, 2:ncol(mask_k1),1)

In [38]:
delta_k1[1:10,1:10]

time,220045,220050,220051,220052,220179,220180,220181,223761,223762
1,0,0,0,0,0,0,0,0,0
2,1,1,1,1,1,1,1,1,1
3,1,2,2,2,1,1,1,1,2
4,1,3,3,3,1,1,1,1,3
5,1,4,4,4,1,1,1,1,4
6,1,5,5,5,1,1,1,1,5
7,1,6,6,6,1,1,1,1,6
8,1,7,7,7,1,1,1,1,7
9,1,8,8,8,1,1,1,1,8
10,1,9,9,9,1,1,1,1,9
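
The mask and time-interval matrices shown above are consistent with the recurrence commonly used for GRU-D-style inputs: the interval restarts after an observed step and accumulates otherwise. The following sketch reproduces that recurrence for one column. It uses made-up values and is not the code of `get_mask` or `get_deltamat`.

```r
# Toy resampled column on an hourly grid (NA = no observation in that window).
time <- 1:6
x <- c(NA, 94, NA, NA, 97, 100)

# Mask: 1 where a value was observed, 0 otherwise.
mask <- as.integer(!is.na(x))

# Delta: time elapsed since the previous observed grid point.
delta <- numeric(length(x))   # delta[1] is 0 by definition
for (t in seq_along(x)[-1]) {
  gap <- time[t] - time[t - 1]
  delta[t] <- if (mask[t - 1] == 1) gap else gap + delta[t - 1]
}

print(data.frame(time, x, mask, delta))
```

A column that is never observed yields a delta that grows by one grid step per row, matching the columns above whose mask is all zeros.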


#### **Output**

In [39]:
ds_vital_k2 <- as.data.frame(ds_vital_k2) %>% lapply(., as.numeric) %>% as.data.frame
mask_k1 <- as.data.frame(mask_k1) %>% lapply(., as.numeric) %>% as.data.frame
delta_k1 <- as.data.frame(delta_k1) %>% lapply(., as.numeric) %>% as.data.frame

### **eICU-CRD**

The eICU-CRD database does not provide a variable dictionary. Therefore, users need to invest more effort in constructing it when working with the eICU-CRD database.

#### **Building the Variable Dictionary**

Transitioning from the MIMIC-IV database to eICU-CRD, the primary additional task for users is the creation of a new variable dictionary. To illustrate this, let's consider the 'vitalPeriodic' table from eICU-CRD, loaded below as 'vitalsign'. This table serves as an excellent example of an irregular longitudinal data table in a wide table format. Here, users need not worry about the transition from a long table to a wide table format, as EMR-LIP offers equivalent solutions for both formats.

In [40]:
# First, we read the raw data stored in the wide table format and the patient table that holds the basic information for each unit stay
patients<-fread("/home/luojiawei/eicu/eicu-collaborative-research-database-2.0/patient.csv.gz",header=T,fill=T)
vitalsign <- fread("/home/luojiawei/eicu/eicu-collaborative-research-database-2.0/vitalPeriodic.csv.gz", header=T,fill=T)

In [45]:
# As can be seen, the biggest difference between raw data in wide table format and long table is that wide table has been initially aligned
# For wide tables, there are no columns dedicated to storing variable names and values.
vitalsign[1:3,]

vitalperiodicid,patientunitstayid,observationoffset,temperature,sao2,heartrate,respiration,cvp,etco2,systemicsystolic,systemicdiastolic,systemicmean,pasystolic,padiastolic,pamean,st1,st2,st3,icp
<int64>,<int>,<int>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<int>
37376747,141168,2059,,,92,,30.0,,,,,,,,,,,
37404957,141168,1289,,,118,,,,,,,,,,,,,
37385871,141168,1794,,91.0,78,,,,,,,,,,,,,


For wide tables, we can start by examining the types of the various columns we need to process. From the table, it's evident that these variables are all of a numerical type, and their time attributes are all "single", with an acquisition type of "observational". Therefore, we can manually construct the variable dictionary based on this information.
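
One way to check column types before writing the dictionary by hand is to inspect the class of each variable column. The sketch below does this on a toy wide table. The table contents are made up, and mapping integer and numeric columns to `"num"` is a simplifying assumption for illustration.

```r
# Toy wide table (hypothetical values).
wide_tbl <- data.frame(observationoffset = c(5, 10),
                       heartrate = c(92L, 118L),
                       temperature = c(36.8, NA))

# Variable columns are everything except the time column.
var_cols <- setdiff(names(wide_tbl), "observationoffset")

# Record the storage class of each variable column.
col_class <- vapply(wide_tbl[var_cols], function(col) class(col)[1], character(1))

# Map numeric storage classes to the "num" value type; leave others undecided.
value_type <- ifelse(col_class %in% c("integer", "numeric"), "num", NA_character_)
print(data.frame(itemid = var_cols, class = unname(col_class), value_type = value_type))
```

Columns left as `NA` by this check would need a manual decision, for example whether they hold binary or categorical values.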

In [42]:
d_custom <- data.frame(itemid = names(vitalsign)[4:ncol(vitalsign)],
                       value_type = rep("num", ncol(vitalsign) - 4 + 1),
                       time_attribute = rep("single", ncol(vitalsign) - 4 + 1),
                       acqu_type = rep("obs", ncol(vitalsign) - 4 + 1),
                       agg_f = "last")

#### **Get the statistical information**

In [43]:
stat_vital <- get_stat_wide(vitalsign, d_custom$itemid, d_custom$value_type)

In [46]:
stat_vital[1:3]

In [47]:
# Without loss of generality, we take the stay with patientunitstayid 141168 (patienthealthsystemstayid 128919) as our sample
patients[1:4,]

patientunitstayid,patienthealthsystemstayid,gender,age,ethnicity,hospitalid,wardid,apacheadmissiondx,admissionheight,hospitaladmittime24,⋯,unitadmitsource,unitvisitnumber,unitstaytype,admissionweight,dischargeweight,unitdischargetime24,unitdischargeoffset,unitdischargelocation,unitdischargestatus,uniquepid
<int>,<int>,<chr>,<chr>,<chr>,<int>,<int>,<chr>,<dbl>,<chr>,⋯,<chr>,<int>,<chr>,<dbl>,<dbl>,<chr>,<int>,<chr>,<chr>,<chr>
141168,128919,Female,70,Caucasian,59,91,"Rhythm disturbance (atrial, supraventricular)",152.4,15:54:00,⋯,Direct Admit,1,admit,84.3,85.8,03:50:00,3596,Death,Expired,002-34851
141178,128927,Female,52,Caucasian,60,83,,162.6,08:56:00,⋯,Emergency Department,1,admit,54.4,54.4,09:18:00,8,Step-Down Unit (SDU),Alive,002-33870
141179,128927,Female,52,Caucasian,60,83,,162.6,08:56:00,⋯,ICU to SDU,2,stepdown/other,,60.4,19:20:00,2042,Home,Alive,002-33870
141194,128941,Male,68,Caucasian,73,92,"Sepsis, renal/UTI (including bladder)",180.3,18:18:40,⋯,Floor,1,admit,73.9,76.7,15:31:00,4813,Floor,Alive,002-5276


In [48]:
vitalsign_k <- vitalsign[which(vitalsign$patientunitstayid == 141168), ]

In [49]:
vitalsign_k[1:4,]

vitalperiodicid,patientunitstayid,observationoffset,temperature,sao2,heartrate,respiration,cvp,etco2,systemicsystolic,systemicdiastolic,systemicmean,pasystolic,padiastolic,pamean,st1,st2,st3,icp
<int64>,<int>,<int>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<int>
37376747,141168,2059,,,92,,30.0,,,,,,,,,,,
37404957,141168,1289,,,118,,,,,,,,,,,,,
37385871,141168,1794,,91.0,78,,,,,,,,,,,,,
37401664,141168,1374,,90.0,118,,,,,,,,,,,,,


Since the time in the table is already in relative terms, there's no need for further conversion to relative time. However, it's important to note that this time is in minutes, and we need to convert it to hours.

In [50]:
vitalsign_k$observationoffset <- vitalsign_k$observationoffset / 60

In [51]:
# Construct an end-time column. It has no effect on single-point variables, but it keeps the function interface uniform
vitalsign_k$timecol <- "NA"
# Give a set of times you want to resample, e.g. 1 to 48 hours, with 1 hour intervals
time_range <- 1:48
# Remember, you need to sort the data by time before resampling
vitalsign_k <- vitalsign_k[order(vitalsign_k$observationoffset, decreasing = F), ]

In [52]:
vitalsign_k[1:10,]

vitalperiodicid,patientunitstayid,observationoffset,temperature,sao2,heartrate,respiration,cvp,etco2,systemicsystolic,systemicdiastolic,systemicmean,pasystolic,padiastolic,pamean,st1,st2,st3,icp,timecol
<int64>,<int>,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<int>,<chr>
37449174,141168,1.983333,,93.0,140,,,,,,,,,,,,,,
37448979,141168,2.066667,,,140,,,,,,,,,,,,,,
37448785,141168,2.15,,,140,,,,,,,,,,,,,,
37448594,141168,2.233333,,,140,,,,,,,,,,,,,,
37448402,141168,2.316667,,,140,,,,,,,,,,,,,,
37448210,141168,2.4,,,140,,,,,,,,,,,,,,
37448020,141168,2.483333,,,140,,,,,,,,,,,,,,
37447831,141168,2.566667,,,136,,,,,,,,,,,,,,
37447641,141168,2.65,,,132,,,,,,,,,,,,,,
37447450,141168,2.733333,,,132,,,,,,,,,,,,,,


#### **Sequence resampling**

In [53]:
# resample the wide table
vitalsign_k1 <- resample_data_wide(vitalsign_k, # wide table that needs resampling
                                  d_custom$itemid, # id list of the variable
                                  d_custom$value_type,  # The value type list of the variable
                                  d_custom$agg_f, # aggregation method list of variables
                                  time_range,  # set of time points to resample onto
                                  "observationoffset",  # the column name for relative start times in the wide table
                                  "timecol",  # the column name for relative end times in the wide table
                                  1 # time interval/time window
                                  )

In [54]:
vitalsign_k1[1:10,1:10]

time,keep,temperature,sao2,heartrate,respiration,cvp,etco2,systemicsystolic,systemicdiastolic
1,1,,,,,,,,
2,1,,93.0,140.0,,,,,
3,1,,93.0,134.0,,,,,
4,1,,75.0,134.0,,,,,
5,1,,79.0,134.0,,,,,
6,1,,84.0,134.0,,,,,
7,1,,84.0,132.0,,,,,
8,1,,84.0,132.0,,,,,
9,1,,84.0,132.0,,,,,
10,1,,84.0,130.0,,,,,


In [55]:
d_custom <- get_fill_method(d_custom)

In [56]:
d_custom[1:4,]

Unnamed: 0_level_0,itemid,value_type,acqu_type,agg_f,fill,fillall
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,temperature,num,obs,last,locf,mean
2,sao2,num,obs,last,locf,mean
3,heartrate,num,obs,last,locf,mean
4,respiration,num,obs,last,locf,mean


#### **Data imputation**

In [57]:
vitalsign_k2 <- fill(vitalsign_k1,  # wide table to fill
                    3:ncol(vitalsign_k1),  # The index of the column to be filled
                    1,  # Index of the time column
                    d_custom$value_type,  # value type list
                    d_custom$fill,  # fill method list
                    d_custom$fillall, # fillall method list 
                    stat_vital # List of statistics (this was previously generated using the get_stat_* function)
                    )

In [58]:
vitalsign_k2[1:10,1:10]

time,temperature,sao2,heartrate,respiration,cvp,etco2,systemicsystolic,systemicdiastolic,systemicmean
1,39.0070477083854,93,140,20.1089271251924,33,31.4004136699633,46,20,31
2,39.0070477083854,93,140,20.1089271251924,33,31.4004136699633,46,20,31
3,39.0070477083854,93,134,20.1089271251924,33,31.4004136699633,46,20,31
4,39.0070477083854,75,134,20.1089271251924,33,31.4004136699633,46,20,31
5,39.0070477083854,79,134,20.1089271251924,33,31.4004136699633,46,20,31
6,39.0070477083854,84,134,20.1089271251924,33,31.4004136699633,46,20,31
7,39.0070477083854,84,132,20.1089271251924,33,31.4004136699633,46,20,31
8,39.0070477083854,84,132,20.1089271251924,33,31.4004136699633,46,20,31
9,39.0070477083854,84,132,20.1089271251924,33,31.4004136699633,46,20,31
10,39.0070477083854,84,130,20.1089271251924,33,31.4004136699633,46,20,31


#### **Variable transformation**

In [59]:
vitalsign_k2 <- norm_num(vitalsign_k2,  # wide table that needs to be normalized
                        2:ncol(vitalsign_k2), # Index of the column that needs to be processed
                        1, # Index of the time column
                        d_custom$value_type,  # value type list
                        stat_vital # List of statistics (this was previously generated using the get_stat_* function)
                        )

In [60]:
vitalsign_k2[1:10,1:10]

time,temperature,sao2,heartrate,respiration,cvp,etco2,systemicsystolic,systemicdiastolic,systemicmean
1,0,-1.7119770393,4.427116222,0,-0.4452239288,0,-4.687323389,-4.6346846104,-4.8961439279
2,0,-1.7119770393,4.427116222,0,-0.4452239288,0,-4.687323389,-4.6346846104,-4.8961439279
3,0,-1.7119770393,3.9367135076,0,-0.4452239288,0,-4.687323389,-4.6346846104,-4.8961439279
4,0,-10.0797568505,3.9367135076,0,-0.4452239288,0,-4.687323389,-4.6346846104,-4.8961439279
5,0,-8.2202502258,3.9367135076,0,-0.4452239288,0,-4.687323389,-4.6346846104,-4.8961439279
6,0,-5.8958669449,3.9367135076,0,-0.4452239288,0,-4.687323389,-4.6346846104,-4.8961439279
7,0,-5.8958669449,3.7732459362,0,-0.4452239288,0,-4.687323389,-4.6346846104,-4.8961439279
8,0,-5.8958669449,3.7732459362,0,-0.4452239288,0,-4.687323389,-4.6346846104,-4.8961439279
9,0,-5.8958669449,3.7732459362,0,-0.4452239288,0,-4.687323389,-4.6346846104,-4.8961439279
10,0,-5.8958669449,3.6097783647,0,-0.4452239288,0,-4.687323389,-4.6346846104,-4.8961439279


In [336]:
vitalsign_k2 <- to_onehot(vitalsign_k2, # wide table that requires one_hot transformation
                         2:ncol(vitalsign_k2), # Index of the column that needs to be processed
                         1, # Index of the time column
                         d_custom$value_type,  # value type list
                         stat_vital # List of statistics (this was previously generated using the get_stat_* function)
                         )

#### **Feature engineering**

In [61]:
mask_k1 <- get_mask(vitalsign_k1, 3:ncol(vitalsign_k1), 1)
mask_k1 <- shape_as_onehot(mask_k1, 2:ncol(mask_k1), 1, get_type(stat_vital), stat_vital)

In [62]:
mask_k1[1:10,1:10]

time,temperature,sao2,heartrate,respiration,cvp,etco2,systemicsystolic,systemicdiastolic,systemicmean
1,0,0,0,0,0,0,0,0,0
2,0,1,1,0,0,0,0,0,0
3,0,1,1,0,0,0,0,0,0
4,0,1,1,0,0,0,0,0,0
5,0,1,1,0,0,0,0,0,0
6,0,1,1,0,0,0,0,0,0
7,0,1,1,0,0,0,0,0,0
8,0,1,1,0,0,0,0,0,0
9,0,1,1,0,0,0,0,0,0
10,0,1,1,0,0,0,0,0,0


In [63]:
delta_k1 <- get_deltamat(mask_k1, 2:ncol(mask_k1),1)

In [64]:
delta_k1[1:10,1:10]

time,temperature,sao2,heartrate,respiration,cvp,etco2,systemicsystolic,systemicdiastolic,systemicmean
1,0,0,0,0,0,0,0,0,0
2,1,1,1,1,1,1,1,1,1
3,2,1,1,2,2,2,2,2,2
4,3,1,1,3,3,3,3,3,3
5,4,1,1,4,4,4,4,4,4
6,5,1,1,5,5,5,5,5,5
7,6,1,1,6,6,6,6,6,6
8,7,1,1,7,7,7,7,7,7
9,8,1,1,8,8,8,8,8,8
10,9,1,1,9,9,9,9,9,9


#### **Output**

In [65]:
vitalsign_k2 <- as.data.frame(vitalsign_k2) %>% lapply(., as.numeric) %>% as.data.frame
mask_k1 <- as.data.frame(mask_k1) %>% lapply(., as.numeric) %>% as.data.frame
delta_k1 <- as.data.frame(delta_k1) %>% lapply(., as.numeric) %>% as.data.frame

## **Chapter 2: Seamless Integration of EMR-LIP with Other Frameworks**

If the output of other frameworks remains in the form of long or wide tables, EMR-LIP can further process the data by constructing an appropriate variable dictionary. This is a significant advantage of EMR-LIP and the reason why it does not provide standalone tools for data cleanup and clinical concept generation. These aspects tend to be highly heterogeneous across studies, making it challenging to offer a one-size-fits-all tool. However, starting from resampling, EMR-LIP provides a highly unified and encapsulated set of tools, promoting standardization in the preprocessing process.

To illustrate our approach, let's consider the processing of urine data. First, we utilize clinical concepts generated from https://github.com/MIT-LCP/mimic-code as a basis for our subsequent preprocessing steps.

In EMR systems, due to the presence of multiple variables representing urine output, it is sometimes necessary to aggregate these variables to create a single variable that represents the total urine volume. However, this can be a challenging process to standardize due to differences in databases, naming conventions, and other factors. Fortunately, many projects have already completed this step for us, allowing us to directly use their preprocessed results.
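
The kind of aggregation described above can be sketched on toy data. The item names `foley` and `void` and all values are hypothetical. The derived concept used below comes from the mimic-code project rather than from code like this.

```r
# Hypothetical long table: two different urine-related items per stay.
uo_raw <- data.frame(
  stay_id   = c(1, 1, 1, 1),
  charttime = c(1.0, 1.0, 2.5, 2.5),   # hours, for simplicity
  item      = c("foley", "void", "foley", "void"),
  value     = c(30, 5, 20, 0)
)

# Collapse all urine items recorded at the same time into one total.
uo_total <- aggregate(value ~ stay_id + charttime, data = uo_raw, FUN = sum)
names(uo_total)[names(uo_total) == "value"] <- "urineoutput"
print(uo_total)
```

The result has one value column per timestamp, so it can be treated as a wide table with a single variable, which is how the derived table is handled in the cells that follow.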

In [194]:
library(RPostgreSQL)
library(dplyr)

In [195]:
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "mimiciv", 
                 host = "127.0.0.1", port = 5432, 
                 user = "ljw", 
                 password = "123456")

In [196]:
icustays<-fread("/home/luojiawei/mimiciv/mimic-iv-2.2/icu/icustays.csv.gz",header=T,fill=T)

In [197]:
schema_name <- "mimiciv_derived"  # schema that holds the mimic-code derived concepts
table_name <- "urine_output"
query <- paste0("SELECT * FROM ", schema_name, ".", table_name)
urine_output <- dbGetQuery(con, query)

In [198]:
urine_output <- merge(urine_output, icustays[,c("subject_id","hadm_id","stay_id")], by=c("stay_id"), all.x=T)
urine_output <- urine_output[urine_output$hadm_id %in% unique(admissions$hadm_id),]
urine_output <- merge(urine_output, admissions[,c("hadm_id","admittime")], by="hadm_id", all.x=T)

urine_output <- urine_output %>% filter(hadm_id %in% admissions$hadm_id,
                                                charttime >= admittime) %>% 
                                                mutate(charttime_r = as.numeric(difftime(charttime, admittime, units = "hours")))

In [199]:
urine_output[1:3,]

Unnamed: 0_level_0,hadm_id,stay_id,charttime,urineoutput,subject_id,admittime,charttime_r
Unnamed: 0_level_1,<int>,<int>,<dttm>,<dbl>,<int>,<dttm>,<dbl>
1,20000094,35605481,2150-03-02 15:33:00,0,14046553,2150-03-02,15.55
2,20000094,35605481,2150-03-02 15:46:00,25,14046553,2150-03-02,15.76667
3,20000094,35605481,2150-03-02 20:00:00,0,14046553,2150-03-02,20.0


In [209]:
d_custom <- data.frame(itemid = names(urine_output)[4], 
                       value_type=rep("num", 1), 
                       time_attribute=rep("single", 1),
                       acqu_type = rep("oper", 1),
                       agg_f = rep("sum", 1))
d_custom <- get_fill_method(d_custom)

In [210]:
d_custom

itemid,value_type,time_attribute,acqu_type,agg_f,fill,fillall
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
urineoutput,num,single,oper,sum,zero,zero


In [211]:
stat_uo <- get_stat_wide(urine_output, d_custom$itemid, d_custom$value_type)

In [212]:
# Without loss of generality, we chose admission with hadm_id 20000094 as our demo object
uo_k <- urine_output[which(urine_output$hadm_id == 20000094), ]

In [213]:
uo_k[1:3,]

Unnamed: 0_level_0,hadm_id,stay_id,charttime,urineoutput,subject_id,admittime,charttime_r
Unnamed: 0_level_1,<int>,<int>,<dttm>,<dbl>,<int>,<dttm>,<dbl>
1,20000094,35605481,2150-03-02 15:33:00,0,14046553,2150-03-02,15.55
2,20000094,35605481,2150-03-02 15:46:00,25,14046553,2150-03-02,15.76667
3,20000094,35605481,2150-03-02 20:00:00,0,14046553,2150-03-02,20.0


In [228]:
time_range <- 15:20

In [229]:
uo_k$timecol <- rep(NA, nrow(uo_k))
uo_k1 <- resample_process_wide(uo_k, 
                                d_custom$itemid,
                                d_custom$value_type,
                                d_custom$agg_f,
                                time_range, 
                                time_col1 = "charttime_r", 
                                time_col2 = "timecol",
                                time_window = 1,
                                keepNArow = T)

In [230]:
uo_k1

time,keep,urineoutput
15,1,
16,1,35.0
17,1,25.0
18,1,5.0
19,0,
20,1,0.0
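
The windowed sum above can be illustrated with a small base-R sketch. This is not the implementation of `resample_process_wide`; `toy_resample_sum` is a hypothetical helper, and it assumes that the window labelled `t` covers `(t - time_window, t]`, which is consistent with the output above but is not confirmed here. Only the three records displayed earlier are used, so hour 16 gives 25 rather than the 35 obtained from the full admission.

```r
# Hypothetical helper: sum the values whose time stamp falls in (t - time_window, t].
# Windows without any record return NA.
toy_resample_sum <- function(times, values, time_range, time_window = 1) {
  sapply(time_range, function(t) {
    in_window <- times > (t - time_window) & times <= t
    if (!any(in_window)) NA_real_ else sum(values[in_window], na.rm = TRUE)
  })
}

# The three urine output records shown for hadm_id 20000094
times  <- c(15.55, 15.76667, 20.0)   # hours since admission
values <- c(0, 25, 0)                # urineoutput

res <- toy_resample_sum(times, values, time_range = 15:20)
res
# NA 25 NA NA NA 0
```

Hours without a record stay `NA` here; in the pipeline they are handled later by the fill method stored in the variable dictionary (`zero` for `urineoutput`).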


## **Chapter 3: Focusing on the Time Attribute of the Variable**

Some variables record a total or average amount over a time interval (with a start and an end time) rather than a value at a single time point. If a resampling window only partially overlaps that interval, assigning the full value to the window biases the result. The correction is to weight each value by the length of the intersection between its interval and the resampling window. Let's first examine this type of bias.

We'll use the dosage of vasoactive agents as an example to explain this.

In [193]:
library(RPostgreSQL)
library(dplyr)
library(data.table)  # provides fread(), used below to read icustays

In [72]:
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname = "mimiciv", 
                 host = "127.0.0.1", port = 5432, 
                 user = "ljw", 
                 password = "123456")

In [None]:
icustays<-fread("/home/luojiawei/mimiciv/mimic-iv-2.2/icu/icustays.csv.gz",header=T,fill=T)

In [154]:
schema_name <- "mimiciv_derived"
table_name <- "vasoactive_agent"
query <- paste0("SELECT * FROM ", schema_name, ".", table_name)
vasoactive_agent <- dbGetQuery(con, query)
vasoactive_agent <- merge(vasoactive_agent, icustays[,c("subject_id","hadm_id","stay_id")], by=c("stay_id"), all.x=T)
vasoactive_agent <- vasoactive_agent[vasoactive_agent$hadm_id %in% unique(admissions$hadm_id),]
vasoactive_agent <- merge(vasoactive_agent, admissions[,c("hadm_id","admittime")], by="hadm_id", all.x=T)

In [155]:
vasoactive_agent <- vasoactive_agent %>% 
    filter(hadm_id %in% admissions$hadm_id, starttime >= admittime) %>% 
    mutate(starttime_r = as.numeric(difftime(starttime, admittime, units = "hours")),
           endtime_r = as.numeric(difftime(endtime, admittime, units = "hours")))

In [156]:
vasoactive_agent[1:3,]

Unnamed: 0_level_0,hadm_id,stay_id,starttime,endtime,dopamine,epinephrine,norepinephrine,phenylephrine,vasopressin,dobutamine,milrinone,subject_id,admittime,starttime_r,endtime_r
Unnamed: 0_level_1,<int>,<int>,<dttm>,<dttm>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>,<dttm>,<dbl>,<dbl>
1,20000094,35605481,2150-03-02 22:12:00,2150-03-02 22:20:00,,,0.3040906,,,7.520021,,14046553,2150-03-02,22.2,22.33333
2,20000094,35605481,2150-03-03 04:49:00,2150-03-03 05:13:00,,,,,,,,14046553,2150-03-02,28.81667,29.21667
3,20000094,35605481,2150-03-02 16:34:00,2150-03-02 16:48:00,,,0.1801096,,,5.004403,,14046553,2150-03-02,16.56667,16.8


In [157]:
names(vasoactive_agent)

In [183]:
d_custom <- data.frame(itemid = names(vasoactive_agent)[5:11], 
                       value_type=rep("num", 11-5+1), 
                       time_attribute=rep("interval", 11-5+1),
                       acqu_type = rep("oper", 11-5+1),
                       agg_f = rep("sum_w", 11-5+1))
d_custom <- get_fill_method(d_custom)

In [184]:
d_custom

itemid,value_type,time_attribute,acqu_type,agg_f,fill,fillall
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
dopamine,num,interval,oper,sum_w,zero,zero
epinephrine,num,interval,oper,sum_w,zero,zero
norepinephrine,num,interval,oper,sum_w,zero,zero
phenylephrine,num,interval,oper,sum_w,zero,zero
vasopressin,num,interval,oper,sum_w,zero,zero
dobutamine,num,interval,oper,sum_w,zero,zero
milrinone,num,interval,oper,sum_w,zero,zero


In [160]:
stat_vaso <- get_stat_wide(vasoactive_agent, d_custom$itemid, d_custom$value_type)

In [185]:
# For illustration, we use the admission with hadm_id 20000094
vaso_k <- vasoactive_agent[which(vasoactive_agent$hadm_id == 20000094), ]

In [186]:
time_range <- 25:30

In [187]:
vaso_k1 <- resample_process_wide(vaso_k, 
                                d_custom$itemid,
                                d_custom$value_type,
                                d_custom$agg_f,
                                time_range, 
                                time_col1 = "starttime_r", 
                                time_col2 = "endtime_r",
                                time_window = 1,
                                keepNArow = T)

In [188]:
vaso_k1

time,keep,dopamine,epinephrine,norepinephrine,phenylephrine,vasopressin,dobutamine,milrinone
25,1,0,0,0.42102027752593,0,0.0,8.35164206007293,0
26,1,0,0,0.815914641362705,0,0.0,11.9549792926466,0
27,1,0,0,0.918775301554478,0,0.0,19.0245394311919,0
28,1,0,0,1.58536047781416,0,6.36279031842253,30.5360068182337,0
29,1,0,0,1.03336327543208,0,4.83720901400544,19.900854821353,0
30,1,0,0,0.170902387859921,0,0.799999952316283,3.29129522045453,0


In [189]:
# Now, let's change agg_f in d_custom a little bit and see the difference.
d_custom$agg_f <- "sum"

In [191]:
vaso_k1 <- resample_process_wide(vaso_k, 
                                d_custom$itemid,
                                d_custom$value_type,
                                d_custom$agg_f,
                                time_range, 
                                time_col1 = "starttime_r", 
                                time_col2 = "endtime_r",
                                time_window = 1,
                                keepNArow = T)

In [192]:
vaso_k1

time,keep,dopamine,epinephrine,norepinephrine,phenylephrine,vasopressin,dobutamine,milrinone
25,1,0,0,0.867078400915488,0,0.0,15.0358555838466,0
26,1,0,0,1.02615176001564,0,0.0,15.0358555838466,0
27,1,0,0,1.53799366671592,0,0.0,29.1452067904175,0
28,1,0,0,2.05070083029568,0,7.19999957084655,39.4991543143988,0
29,1,0,0,1.53812149073929,0,7.19999957084655,29.6216569840908,0
30,1,0,0,0.512707163579762,0,2.39999985694885,9.8738856613636,0


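When the intersection of the time intervals is ignored (`agg_f = "sum"`), resampling overestimates the amount of medication used: every non-zero value in the second table is larger than its `sum_w` counterpart in the first. For example, norepinephrine at hour 25 is about 0.87 with `sum` versus about 0.42 with `sum_w`.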
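
The effect can be reproduced on a toy record. The sketch below is an illustration of overlap weighting in general, not the code behind `sum_w`: the helpers `overlap_length`, `toy_sum_weighted` and `toy_sum_naive` are hypothetical, the infusion record is made up, and it is an assumption that the unweighted sum counts the full value in every window the interval touches.

```r
# Length of the intersection between [start, end] and the window [w_start, w_end]
overlap_length <- function(start, end, w_start, w_end) {
  pmax(0, pmin(end, w_end) - pmax(start, w_start))
}

# Weighted sum: each value contributes in proportion to the share of its
# interval that lies inside the window (assumes end > start)
toy_sum_weighted <- function(start, end, value, w_start, w_end) {
  frac <- overlap_length(start, end, w_start, w_end) / (end - start)
  sum(value * frac)
}

# Unweighted sum: the full value is counted whenever the interval touches the window
toy_sum_naive <- function(start, end, value, w_start, w_end) {
  touches <- overlap_length(start, end, w_start, w_end) > 0
  sum(value[touches])
}

# Made-up record: a total of 10 units given between hour 27.5 and hour 29.5
start <- 27.5; end <- 29.5; value <- 10

# One-hour windows ending at hours 28, 29 and 30
weighted <- sapply(28:30, function(t) toy_sum_weighted(start, end, value, t - 1, t))
naive    <- sapply(28:30, function(t) toy_sum_naive(start, end, value, t - 1, t))

weighted  # 2.5 5.0 2.5, which adds up to the recorded 10 units
naive     # 10 10 10, which adds up to 30 units
```

With weighting, the amounts across windows add up to the recorded total; without it, the same dose is counted once per window it touches.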

## **Chapter 4: Runtime Testing of EMR-LIP**

In [3]:
# Use the parallel package to generate the data for multiple samples in parallel.
library(parallel)

In [17]:
chartevents<-fread("/home/luojiawei/mimiciv/mimic-iv-2.2/icu/chartevents.csv.gz",header=T,fill=T)
d_items<-fread("/home/luojiawei/mimiciv/mimic-iv-2.2/icu/d_items.csv.gz",header=T,fill=T)
admissions<-fread("/home/luojiawei/mimiciv/mimic-iv-2.2/hosp/admissions.csv.gz",header=T,fill=T)

In [5]:
chartevents[1:4,]

subject_id,hadm_id,stay_id,caregiver_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
<int>,<int>,<int>,<int>,<dttm>,<dttm>,<int>,<chr>,<dbl>,<chr>,<int>
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220179,82,82,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220180,59,59,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220181,63,63,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220045,94,94,bpm,0


In [6]:
d_items[1:4,]

itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
<int>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<dbl>
220001,Problem List,Problem List,chartevents,General,,Text,,
220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,


In [7]:
d_items <- d_items[d_items$category %in% c('Routine Vital Signs','Respiratory', 'Labs','Alarms') & d_items$linksto %in% c("chartevents") & !(d_items$param_type %in% c("Text", "Date and time")), ]

In [8]:
d_items <- d_items[d_items$itemid %in% unique(chartevents$itemid),]

In [9]:
d_custom_all <- data.frame(itemid = d_items$itemid, value_type = d_items$param_type)

In [10]:
unique(d_custom_all$value_type)

In [11]:
d_custom_all$value_type[d_custom_all$value_type %in% c("Numeric","Numeric with tag")] <- "num"
d_custom_all$value_type[d_custom_all$value_type %in% c("Checkbox")] <- "bin"

In [12]:
d_custom_all$time_attribute <- "single"
d_custom_all$acqu_type <- "obs"
d_custom_all$agg_f <- "median"
d_custom_all <- get_fill_method(d_custom_all)

In [95]:
d_custom <- d_custom_all[1:100,]

In [96]:
dim(d_custom)

In [97]:
chart <- chartevents[which(chartevents$itemid %in% d_custom$itemid), ]

In [98]:
chart <- merge(chart, admissions[,c("hadm_id", "admittime")], by="hadm_id", all.x=T)
chart$admittime <- as.POSIXct(chart$admittime)
chart$charttime_r <- as.numeric(difftime(chart$charttime, chart$admittime, units = "hours"))

In [99]:
chart[1:4,]

hadm_id,subject_id,stay_id,caregiver_id,charttime,storetime,itemid,value,valuenum,valueuom,warning,admittime,charttime_r
<int>,<int>,<int>,<int>,<dttm>,<dttm>,<int>,<chr>,<dbl>,<chr>,<int>,<dttm>,<dbl>
20000094,14046553,35605481,2314,2150-03-02 15:19:00,2150-03-02 15:31:00,220045,107,107,bpm,0,2150-03-02,15.31667
20000094,14046553,35605481,2314,2150-03-02 15:19:00,2150-03-02 15:31:00,220210,19,19,insp/min,0,2150-03-02,15.31667
20000094,14046553,35605481,2314,2150-03-02 15:25:00,2150-03-02 15:25:00,220046,130,130,bpm,0,2150-03-02,15.41667
20000094,14046553,35605481,2314,2150-03-02 15:25:00,2150-03-02 15:25:00,220047,50,50,bpm,0,2150-03-02,15.41667


In [100]:
chart <- remove_extreme_value_long(chart, d_custom$itemid, d_custom$value_type, "itemid", "value")

In [101]:
stat_chart <- get_stat_long(chart, d_custom$itemid, d_custom$value_type, "itemid", "value")

In [102]:
all_hadmid <- unique(chart$hadm_id)

In [103]:
length(all_hadmid)

We can encapsulate the components of the EMR-LIP framework in a single function that takes the index of a sample in `all_hadmid` as input. This generates the data needed for modeling end to end for each sample.

In [106]:
emr_lip_fun <- function(k) {
    id_k <- all_hadmid[k]
    chart_k <- chart[chart$hadm_id == id_k,]
    chart_k$timecol <- NA
    # resample the sequence
    chart_k1 <- resample_data_long(chart_k,
                                d_custom$itemid,
                                d_custom$value_type, 
                                d_custom$agg_f, 
                                1:48,
                                "itemid",
                                "value",
                                "charttime_r",
                                "timecol",
                                1,keepNArow = T)
    # data imputation
    chart_k2 <- fill(chart_k1, 3:ncol(chart_k1), 1, d_custom$value_type, d_custom$fill, d_custom$fillall, stat_chart)
    # normalization
    chart_k2 <- norm_num(chart_k2, 2:ncol(chart_k2), 1, d_custom$value_type, stat_chart)
    # feature engineering
    mask_k <- get_mask(chart_k1, 3:ncol(chart_k1), 1)
    delta_k <- get_deltamat(mask_k, 2:ncol(mask_k), 1)
    # output
    chart_k2 <- chart_k2 %>% as.data.frame %>% lapply(., as.numeric) %>% as.data.frame
    mask_k <- mask_k %>% as.data.frame %>% lapply(., as.numeric) %>% as.data.frame
    delta_k <- delta_k %>% as.data.frame %>% lapply(., as.numeric) %>% as.data.frame
    return(list(chart_k2, mask_k, delta_k))
}
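
To make the feature engineering step more concrete, the sketch below computes a missingness mask and a time-since-last-observation vector for one variable. The internals of `get_mask` and `get_deltamat` are not shown in this document, so this is only an assumption about what they produce: it follows the convention commonly used for GRU-D-style inputs (mask is 1 where a value was observed; delta is the time elapsed since the previous observed value). `toy_mask_delta` is a hypothetical helper, and the input reuses the hourly urine output series from Chapter 2 purely as toy values.

```r
# Hypothetical helper following the GRU-D convention:
#   mask[t]  = 1 if x[t] is observed, 0 otherwise
#   delta[1] = 0
#   delta[t] = gap                 if x[t-1] was observed
#            = gap + delta[t-1]    if x[t-1] was missing
# where gap = times[t] - times[t-1]
toy_mask_delta <- function(x, times) {
  mask <- as.integer(!is.na(x))
  delta <- numeric(length(x))
  for (t in seq_along(x)[-1]) {
    gap <- times[t] - times[t - 1]
    delta[t] <- if (mask[t - 1] == 1) gap else gap + delta[t - 1]
  }
  list(mask = mask, delta = delta)
}

x     <- c(NA, 35, 25, 5, NA, 0)   # resampled values before imputation
times <- 15:20                     # resampling grid in hours

res <- toy_mask_delta(x, times)
res$mask   # 0 1 1 1 0 1
res$delta  # 0 1 1 1 1 2
```

Note that the mask has to be derived from the resampled data before imputation (`chart_k1` in `emr_lip_fun`), because the filled data no longer contains missing values.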


In [107]:
process_data <- function(k, root_path) {
    id_k<-all_hadmid[k]
    folder_path<-file.path(root_path, id_k)
    create_dir(folder_path, F)
    datas <- emr_lip_fun(k)
    fwrite(datas[[1]], file=file.path(folder_path, "X.csv"), row.names=F)
    fwrite(datas[[2]], file=file.path(folder_path, "M.csv"), row.names=F)
    fwrite(datas[[3]], file=file.path(folder_path, "D.csv"), row.names=F)
}

In [120]:
root_path <- "/home/luojiawei/Benchmark_project_data/mimiciv_data/patient_folders_test/"
create_dir(root_path, T)

[1] "/home/luojiawei/Benchmark_project_data/mimiciv_data/patient_folders_test/ removed"
[1] "/home/luojiawei/Benchmark_project_data/mimiciv_data/patient_folders_test/ created"


In [121]:
start_time <- Sys.time()
results <- mclapply(1:20000, 
                function(x) {
                  result <- tryCatch({
                    process_data(x, root_path = root_path)
                  }, error = function(e) {
                    print(e)
                    print(x)
                  })
                  return(result)
                }, mc.cores = detectCores())
end_time <- Sys.time()

In [122]:
end_time - start_time

Time difference of 5.606911 mins