# Examples of how to prepare predictors and targets

This notebook explains the data preparation process used in my project on the MIMIC III database.

At the end of the notebook, I demonstrate the workflow of training an encoder and a model and applying them to unseen test data.

To run the notebook, place the csv files of the MIMIC III demo in '../data/demo_data/all'. The last part additionally requires the full database, split with train_test_split.py.

## Data Preparation

Get the names of the csv-tables in a given folder.

In [1]:
from prep import PrepareDataV2
pre = PrepareDataV2('demo_data', '../data/demo_data/all')
pre.get_table_names()

['admissions', 'prescriptions', 'patients']

Read the csv-files and store them to pickle files. (Loading the preprocessed pickle files is much faster than reading csv-files.)

In [2]:
pre.read_table('patients', store_pickle=True)
pre.read_tables(['prescriptions', 'admissions'], store_pickle=True)
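
The caching idea behind `store_pickle=True` can be illustrated with plain pandas. This is only a minimal sketch under assumptions: `load_table_cached` is a hypothetical helper, not a method of PrepareDataV2, and the table is a tiny synthetic stand-in for a MIMIC csv file.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_table_cached(csv_path: Path) -> pd.DataFrame:
    """Hypothetical helper: parse a csv once, then reuse a pickle stored next to it."""
    pickle_path = csv_path.with_suffix(".pkl")
    if pickle_path.exists():
        return pd.read_pickle(pickle_path)  # fast path: no csv parsing
    table = pd.read_csv(csv_path)
    table.to_pickle(pickle_path)            # cache for the next call
    return table


# Tiny synthetic table standing in for a MIMIC csv file.
tmp_dir = Path(tempfile.mkdtemp())
csv_file = tmp_dir / "admissions.csv"
pd.DataFrame(
    {"HADM_ID": [1001, 1002], "ADMISSION_TYPE": ["EMERGENCY", "ELECTIVE"]}
).to_csv(csv_file, index=False)

first = load_table_cached(csv_file)   # parses the csv and writes the pickle
second = load_table_cached(csv_file)  # loads the pickle
```

Only load pickle files that you created yourself, since unpickling untrusted files can execute arbitrary code.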

In [3]:
type(pre.admissions)

pandas.core.frame.DataFrame

In [4]:
pre.admissions.head(0)

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,ADMITTIME,DISCHTIME,DEATHTIME,ADMISSION_TYPE,ADMISSION_LOCATION,DISCHARGE_LOCATION,INSURANCE,LANGUAGE,RELIGION,MARITAL_STATUS,ETHNICITY,EDREGTIME,EDOUTTIME,DIAGNOSIS,HOSPITAL_EXPIRE_FLAG,HAS_CHARTEVENTS_DATA


The data is kept in pandas DataFrames. The rows are not shown here for reasons of data protection.
Tables that were stored as pickle files can be loaded into a new instance:

In [5]:
pre2 = PrepareDataV2('demo_data_2', '../data/demo_data/all')
pre2.load_table_pickle('admissions')
pre2.load_table_pickle_list(['prescriptions', 'patients'])

In [6]:
pre2.prescriptions.head(0)

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,ICUSTAY_ID,STARTDATE,ENDDATE,DRUG_TYPE,DRUG,DRUG_NAME_POE,DRUG_NAME_GENERIC,FORMULARY_DRUG_CD,GSN,NDC,PROD_STRENGTH,DOSE_VAL_RX,DOSE_UNIT_RX,FORM_VAL_DISP,FORM_UNIT_DISP,ROUTE


## How to prepare predictors

In [7]:
from prep import PatientEncoder
pat_enc = PatientEncoder()

In [8]:
pat_enc.fit(pre.prescriptions, pre.admissions, pre.patients)

In [9]:
x, hadm_id_map = pat_enc.transform(pre.prescriptions, pre.admissions, pre.patients)

In [10]:
x.shape

(129, 839)

The predictor matrix contains the feature vectors of 129 hospital admissions.
A single patient can be admitted several times and then contributes several rows. With its default settings, the PatientEncoder produces 839 features. The feature vector of each admission includes information on
- prescribed drugs within the first two days of stay
- diagnoses at admission
- admission type
- admission location
- health insurance
- marital status
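
The first item of the list, restricting prescriptions to a time window after admission, might look roughly like the sketch below. It is not the PatientEncoder's actual code: `prescriptions_in_window` is a hypothetical function, the data is synthetic, and only the column names (HADM_ID, ADMITTIME, STARTDATE, DRUG) are taken from the tables shown above.

```python
import pandas as pd

# Synthetic stand-ins for the admissions and prescriptions tables.
admissions = pd.DataFrame({
    "HADM_ID": [1001, 1002],
    "ADMITTIME": pd.to_datetime(["2150-01-01 08:00", "2150-03-10 22:00"]),
})
prescriptions = pd.DataFrame({
    "HADM_ID": [1001, 1001, 1002, 1002],
    "STARTDATE": pd.to_datetime(
        ["2150-01-01", "2150-01-05", "2150-03-11", "2150-03-12"]
    ),
    "DRUG": ["Aspirin", "Heparin", "Insulin", "Furosemide"],
})


def prescriptions_in_window(prescriptions, admissions, hours=48):
    """Hypothetical helper: keep prescriptions that start within `hours` of admission."""
    merged = prescriptions.merge(
        admissions[["HADM_ID", "ADMITTIME"]], on="HADM_ID", how="inner"
    )
    offset = merged["STARTDATE"] - merged["ADMITTIME"]
    # STARTDATE carries no time of day here, so a drug started on the admission
    # day has a negative offset; such rows are kept on purpose.
    in_window = offset <= pd.Timedelta(hours=hours)
    return merged.loc[in_window, ["HADM_ID", "DRUG"]].reset_index(drop=True)


early = prescriptions_in_window(prescriptions, admissions, hours=48)
```

With these synthetic rows, only the prescription that starts four days after admission is dropped.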

In [11]:
print(x.min())
print(x.max())
print(type(x))

0.0
1.0
<class 'numpy.ndarray'>


The data is already normalized to the range [0, 1], and it is a NumPy array, so it can be passed directly to scikit-learn.

In [12]:
type(hadm_id_map)

dict

The hadm_id_map is a dictionary that connects the "hospital admission IDs" to the rows of the predictor matrix. This relation is needed to prepare a target vector in the correct order, and to interpret the result of a prediction for a batch of patients.
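
To illustrate both uses, here is a small sketch. It assumes that the dictionary maps each admission ID to its row index, which the notebook does not state explicitly. `order_by_row` is a hypothetical helper and the IDs are made up.

```python
import numpy as np
import pandas as pd

# Assumed layout: HADM_ID -> row index in the predictor matrix.
hadm_id_map = {1002: 0, 1003: 1, 1001: 2}

# Synthetic admissions table, deliberately in a different order than the rows of x.
admissions = pd.DataFrame({
    "HADM_ID": [1001, 1002, 1003],
    "HOSPITAL_EXPIRE_FLAG": [1, 0, 0],
})


def order_by_row(values: pd.Series, hadm_id_map: dict) -> np.ndarray:
    """Hypothetical helper: arrange values (indexed by HADM_ID) in predictor row order."""
    out = np.full(len(hadm_id_map), np.nan)
    for hadm_id, row in hadm_id_map.items():
        out[row] = values.loc[hadm_id]
    return out


flags = admissions.set_index("HADM_ID")["HOSPITAL_EXPIRE_FLAG"]
y = order_by_row(flags, hadm_id_map)

# The inverse mapping tells us which admission a given row (or prediction) belongs to.
row_to_hadm_id = {row: hadm_id for hadm_id, row in hadm_id_map.items()}
```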

### A few more details on the PatientEncoder

The comments in the cell below describe the parameters of the PatientEncoder in more detail.

In [13]:
pat_enc2 = PatientEncoder(duration_from_admission_in_hours=24,  # Use only prescriptions from the first 24 hours
                 max_number_of_drug_features=None,                # One-hot encode all drugs
                 max_num_of_diagnoses=1000,                       # Encode at most the 1000 most frequent diagnoses
                 min_frequency_of_diag=3,                         # Encode only diagnoses that appear >= 3 times
                 include_age_at_admission=True,          
                 max_number_admission_type_features=100,         
                 min_frq_admission_type_features=3,             
                 max_number_admission_location_features=100,
                 min_frq_admission_location_features=3,
                 max_number_insurance_features=100,
                 min_frq_insurance_features=3,
                 min_frq_marital_features=3,
                 max_number_marital_features=100)

The PatientEncoder is based on
- DrugEncoderV2
- FeatureEncoder
- utils.get_age_at_admission

The DrugEncoderV2 is a class dedicated to encoding prescribed drugs. Each prescription is stored as one row of the prescriptions table (about 4 million rows in the full MIMIC III database).
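
One way to turn such a row-per-prescription table into one 0/1 vector per admission is sketched below. This is an illustration under assumptions, not the project's drug encoder: `fit_drug_vocabulary` and `encode_drugs` are hypothetical names, the data is synthetic, and the map is assumed to go from HADM_ID to row index.

```python
import numpy as np
import pandas as pd

# Synthetic prescriptions: one row per prescribed drug.
prescriptions = pd.DataFrame({
    "HADM_ID": [1001, 1001, 1002, 1003, 1003, 1003],
    "DRUG": ["Aspirin", "Heparin", "Aspirin", "Insulin", "Aspirin", "Insulin"],
})


def fit_drug_vocabulary(prescriptions, max_features=None):
    """Hypothetical helper: map the most frequent drugs to column indices."""
    counts = prescriptions["DRUG"].value_counts()  # sorted by frequency, descending
    drugs = counts.index[:max_features]            # None keeps all drugs
    return {drug: col for col, drug in enumerate(drugs)}


def encode_drugs(prescriptions, vocabulary, hadm_id_map):
    """Hypothetical helper: one row per admission, 1.0 where a drug was prescribed."""
    x = np.zeros((len(hadm_id_map), len(vocabulary)))
    for hadm_id, drug in zip(prescriptions["HADM_ID"], prescriptions["DRUG"]):
        if hadm_id in hadm_id_map and drug in vocabulary:
            x[hadm_id_map[hadm_id], vocabulary[drug]] = 1.0
    return x


hadm_id_map = {1001: 0, 1002: 1, 1003: 2}  # assumed layout: HADM_ID -> row index
vocabulary = fit_drug_vocabulary(prescriptions, max_features=2)
x_drugs = encode_drugs(prescriptions, vocabulary, hadm_id_map)
```

Repeated prescriptions of the same drug within one admission collapse to a single 1.0, so the values stay in the range [0, 1].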

The FeatureEncoder is meant to encode a single column from a table, e.g. diagnosis or admission_type.
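
The following sketch shows how a single column could be one-hot encoded with a frequency threshold and a cap on the number of features. `SimpleColumnEncoder` is a hypothetical stand-in written for this illustration. Its parameter names only mirror the ones used above, and the real FeatureEncoder may work differently.

```python
import numpy as np
import pandas as pd


class SimpleColumnEncoder:
    """Hypothetical stand-in for illustration, not the project's FeatureEncoder."""

    def __init__(self, max_number_of_features=100, min_frequency=3):
        self.max_number_of_features = max_number_of_features
        self.min_frequency = min_frequency

    def fit(self, column: pd.Series):
        counts = column.value_counts()                 # most frequent first
        counts = counts[counts >= self.min_frequency]  # drop rare categories
        self.categories_ = list(counts.index[: self.max_number_of_features])
        return self

    def transform(self, column: pd.Series) -> np.ndarray:
        x = np.zeros((len(column), len(self.categories_)))
        for col, category in enumerate(self.categories_):
            x[:, col] = (column == category).to_numpy()
        return x


# Synthetic column: "URGENT" appears only once and falls below the threshold.
admission_type = pd.Series(["EMERGENCY"] * 4 + ["ELECTIVE"] * 3 + ["URGENT"])

encoder = SimpleColumnEncoder(max_number_of_features=100, min_frequency=3)
x_type = encoder.fit(admission_type).transform(admission_type)
```

Rows with a category that was not learned, such as the single "URGENT" row here, are encoded as all zeros.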

utils.get_age_at_admission encodes the age at admission. In the provided MIMIC data, the date of birth of patients older than 89 years at their first admission has been shifted to 300 years before that admission. This is meant to prevent re-identification of very old patients. For these patients, the utils.get_age_at_admission function sets the "age at admission" to a reasonable value.
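
A possible way to handle the shifted dates of birth is sketched below. The function `age_at_admission`, the replacement age of 90 years and the threshold of 150 years are assumptions made for this example. The notebook does not say which value utils.get_age_at_admission actually uses.

```python
import numpy as np
import pandas as pd


def age_at_admission(dob: pd.Series, admittime: pd.Series,
                     capped_age: float = 90.0) -> np.ndarray:
    """Hypothetical helper: age in whole years, with shifted birth dates replaced."""
    # Work with calendar fields instead of subtracting timestamps: a gap of
    # 300 years can overflow a nanosecond-resolution pandas Timedelta.
    years = admittime.dt.year - dob.dt.year
    # Subtract one year if the birthday has not yet occurred in the admission year.
    birthday_pending = (
        (admittime.dt.month * 100 + admittime.dt.day)
        < (dob.dt.month * 100 + dob.dt.day)
    )
    age = (years - birthday_pending.astype(int)).to_numpy(dtype=float)
    # Implausible ages mark shifted birth dates; replace them with the assumed value.
    return np.where(age > 150, capped_age, age)


# Synthetic example: one ordinary patient and one with a shifted date of birth.
dob = pd.Series(pd.to_datetime(["2080-05-20", "1850-01-01"]))
admittime = pd.Series(pd.to_datetime(["2150-03-01", "2150-01-01"]))

ages = age_at_admission(dob, admittime)
```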

## How to prepare targets

For organizational planning, it is helpful to know how long a patient will stay. The "hospital admission ID" is needed here to order the target elements correctly with respect to the predictors.

In [14]:
import prep.utils as utils
y1 = utils.get_duration_of_stay_in_days(pre.admissions, hadm_id_map)

In [15]:
print(type(y1))
print(y1[:3])

<class 'numpy.ndarray'>
[ 8.8375     13.85208333  2.65069444]


The prepared targets are of a data type that is compatible with scikit-learn, and easy to cast to a torch tensor.
The first three entries correspond to stays of about 8.8, 13.9 and 2.7 days in hospital.
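
Such a target can be derived from the ADMITTIME and DISCHTIME columns of the admissions table. The sketch below is a guess at the general approach, not the code of utils.get_duration_of_stay_in_days. `duration_of_stay_in_days` is a hypothetical function, the data is synthetic, and the map is assumed to go from HADM_ID to contiguous row indices starting at 0.

```python
import pandas as pd

# Synthetic admissions with stays of 2.5 and 1.25 days.
admissions = pd.DataFrame({
    "HADM_ID": [1001, 1002],
    "ADMITTIME": pd.to_datetime(["2150-01-01 08:00", "2150-02-01 12:00"]),
    "DISCHTIME": pd.to_datetime(["2150-01-03 20:00", "2150-02-02 18:00"]),
})
hadm_id_map = {1002: 0, 1001: 1}  # assumed layout: HADM_ID -> row index


def duration_of_stay_in_days(admissions, hadm_id_map):
    """Hypothetical helper: length of stay in days, ordered like the predictor rows."""
    by_id = admissions.set_index("HADM_ID")
    days = (by_id["DISCHTIME"] - by_id["ADMITTIME"]).dt.total_seconds() / 86400.0
    # Sort the admission IDs by their row index, then look the durations up in that order.
    ordered_ids = sorted(hadm_id_map, key=hadm_id_map.get)
    return days.loc[ordered_ids].to_numpy()


y_days = duration_of_stay_in_days(admissions, hadm_id_map)
```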

It is also useful to know how critical the status of a patient is.
We can assess this status by predicting whether the patient will be alive 14 days after admission.

In [16]:
y2 = utils.patient_alive_after_duration(pre.admissions, pre.patients, hadm_id_map, 14)

In [17]:
print(type(y2))
y2[5:12]

<class 'numpy.ndarray'>


array([1., 1., 1., 1., 1., 1., 0.])

Six of the seven selected entries are 1, meaning the patient was alive 14 days after admission. One entry is 0, meaning the patient passed away within that period.
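
A label of this kind could be computed as in the sketch below. It assumes that the date of death is taken from the DOD column of the patients table and that a missing DOD means the patient is alive. `alive_after_days` is a hypothetical function and may differ from utils.patient_alive_after_duration. The data is synthetic.

```python
import pandas as pd

# Synthetic tables; a missing DOD is interpreted as "no recorded death".
admissions = pd.DataFrame({
    "HADM_ID": [1001, 1002, 1003],
    "SUBJECT_ID": [1, 2, 3],
    "ADMITTIME": pd.to_datetime(["2150-01-01", "2150-02-01", "2150-03-01"]),
})
patients = pd.DataFrame({
    "SUBJECT_ID": [1, 2, 3],
    "DOD": pd.to_datetime([None, "2150-02-05", "2150-06-01"]),
})
hadm_id_map = {1001: 0, 1002: 1, 1003: 2}  # assumed layout: HADM_ID -> row index


def alive_after_days(admissions, patients, hadm_id_map, days):
    """Hypothetical helper: 1.0 if the patient is alive `days` after admission, else 0.0."""
    merged = admissions.merge(
        patients[["SUBJECT_ID", "DOD"]], on="SUBJECT_ID", how="left"
    ).set_index("HADM_ID")
    deadline = merged["ADMITTIME"] + pd.Timedelta(days=days)
    alive = merged["DOD"].isna() | (merged["DOD"] > deadline)
    ordered_ids = sorted(hadm_id_map, key=hadm_id_map.get)
    return alive.loc[ordered_ids].to_numpy(dtype=float)


y_alive = alive_after_days(admissions, patients, hadm_id_map, 14)
```

In this synthetic example, only the second patient dies within 14 days of admission.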

## Make Predictions with scikit-learn

I will now show how to train a model and make predictions with scikit-learn. To do so, I will use training and test data stored as csv files in two separate folders. How the data was split is explained elsewhere.

In [18]:
tables = ['admissions', 'patients', 'prescriptions']
data_train = PrepareDataV2(name='train', root='../data/train') 
data_train.load_table_pickle_list(tables)

Now, a PatientEncoder is fitted to the training data.
The encoder learns the most frequent prescriptions and diagnoses from the training data.
Next, the training data is transformed.

In [19]:
pat_enc = PatientEncoder()
pat_enc.fit(data_train.prescriptions,
            data_train.admissions,
            data_train.patients)

x_train, hadm_id_map_train = pat_enc.transform(data_train.prescriptions,
                                               data_train.admissions,
                                               data_train.patients)
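
The reason for fitting on the training data only can be shown with a small, self-contained example. It uses scikit-learn's OneHotEncoder in place of the PatientEncoder, so it illustrates the general fit/transform pattern and not the project's implementation. The category values are synthetic.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Synthetic single-column data; "NEWBORN" occurs only in the test set.
train = np.array([["EMERGENCY"], ["ELECTIVE"], ["EMERGENCY"]])
test = np.array([["ELECTIVE"], ["NEWBORN"]])

# The categories are learned from the training data only.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Both sets are transformed with the same fitted encoder,
# so they end up with identical feature columns.
x_train_demo = encoder.transform(train).toarray()
x_test_demo = encoder.transform(test).toarray()
```

The category that appears only in the test set is encoded as a row of zeros. The feature layout stays fixed, and no information from the test data leaks into the encoder.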

The target for the training data is prepared as well:

In [20]:
y_train = utils.get_duration_of_stay_in_days(data_train.admissions, hadm_id_map_train)

Now, we can train a model.

In [21]:
from sklearn.linear_model import LinearRegression
regr = LinearRegression()
regr.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

The encoder and the model are now ready, and we can make predictions on unseen test data.

In [22]:
data_test = PrepareDataV2(name='test', root='../data/test') 
data_test.load_table_pickle_list(tables)

x_test, hadm_id_map_test = pat_enc.transform(data_test.prescriptions,
                                             data_test.admissions,
                                             data_test.patients)
y_test = utils.get_duration_of_stay_in_days(data_test.admissions, hadm_id_map_test)

In [23]:
y_test_predict = regr.predict(x_test)
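
A natural next step is to compare the predictions with the true durations and to relate them back to individual admissions. The sketch below uses small synthetic arrays in place of y_test and y_test_predict, and it assumes that the map goes from HADM_ID to row index. The choice of the mean absolute error as metric is mine, not part of the project.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Synthetic stand-ins for the true and predicted durations of stay (in days).
y_true_demo = np.array([2.0, 5.5, 10.0])
y_pred_demo = np.array([3.0, 5.0, 7.0])
hadm_id_map_demo = {1001: 0, 1002: 1, 1003: 2}  # assumed layout: HADM_ID -> row index

# Average absolute deviation between prediction and truth, in days.
mae = mean_absolute_error(y_true_demo, y_pred_demo)

# Look up the predicted duration of stay for each admission ID.
predicted_days = {
    hadm_id: float(y_pred_demo[row]) for hadm_id, row in hadm_id_map_demo.items()
}
```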