# Preprocessing

This notebook applies some preprocessing steps, but not the ones described in Patrick Rockenschaub's paper.

In [4]:
import pandas as pd
import numpy as np
from pathlib import Path

%load_ext autoreload
%autoreload 2


---
## Setting the dtype of `time` and downcasting to `float32`
The `time` column is automatically given a timedelta dtype when the data is loaded (its values are `Timedelta` objects).  
Hence, when the DataFrame is converted to a NumPy array, the columns do not share a single dtype and the resulting array has dtype `object`.  
This complicates downstream numerical work, so we convert `time` to a float holding the number of hours.
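The effect can be reproduced on a small frame. The following is a minimal sketch with made-up values and a hypothetical two-column layout (`time`, `resp`); it does not read the actual data, and the variable names `toy`, `mixed`, and `numeric` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): a timedelta column next to a float32 column
toy = pd.DataFrame({
    "time": pd.to_timedelta([0, 1, 2], unit="h"),
    "resp": np.array([14.0, 16.0, 20.0], dtype="float32"),
})

# Timedelta and float columns have no common numeric dtype,
# so the combined array falls back to dtype object
mixed = toy.to_numpy()
print(mixed.dtype)

# Express time as hours since the start, stored as float32
toy["time"] = (toy["time"].dt.total_seconds() / 3600).astype("float32")

# Now every column is float32, and so is the array
numeric = toy.to_numpy()
print(numeric.dtype)
```

The vectorized `.dt.total_seconds()` accessor is used here instead of a row-wise `apply`; both yield the same hours.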

In addition, `float32` uses half the memory of `float64`, and its precision is sufficient here, so every `float64` column is converted to `float32`. Parquet preserves these dtypes because it stores the schema alongside the data; CSV, being plain text, cannot.
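Both claims can be checked on a toy column. This sketch uses made-up values and a hypothetical column name (`sbp` is borrowed only as a label); it compares the memory footprint before and after downcasting and shows that a CSV round trip loses the `float32` dtype. The Parquet round trip is left out so that the example does not depend on a Parquet engine being installed.

```python
import io

import numpy as np
import pandas as pd

# Toy frame (made-up values) with a single float64 column of 1000 values
toy = pd.DataFrame({"sbp": np.linspace(90.0, 140.0, 1000)})

# Downcast every float64 column to float32
float64_columns = toy.select_dtypes(include="float64").columns
small = toy.astype({col: "float32" for col in float64_columns})

# float64 needs 8 bytes per value, float32 needs 4
bytes_before = toy["sbp"].to_numpy().nbytes
bytes_after = small["sbp"].to_numpy().nbytes
print(bytes_before, bytes_after)

# CSV stores text only, so the dtype is inferred again on reload
buffer = io.StringIO()
small.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(small["sbp"].dtype, reloaded["sbp"].dtype)
```

`float32` carries roughly seven significant decimal digits, which is the basis for the "good enough" judgment above; whether that holds for a given variable is an assumption worth checking per column.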

In [17]:
# path to data
miiv_path_p = '~/Documents/data/ts/miiv/fully_observed/miiv_ts_wide.parquet'
# read in data
df = pd.read_parquet(miiv_path_p)
# convert the time column from Timedelta to hours (float)
df['time'] = df['time'].dt.total_seconds() / 3600
# downcast all float64 columns to float32
float64_columns = df.select_dtypes(include='float64').columns
df[float64_columns] = df[float64_columns].astype('float32')
# save data (overwrites the input file)
df.to_parquet(miiv_path_p)
print(df.shape)
df.head()

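| | id | time | label | alb | alp | alt | ast | be | bicar | bili | ... | phos | plt | po2 | ptt | resp | sbp | temp | tnt | urine | wbc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30000153 | 0.0 | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 14.0 | 124.5 | 36.0 | NaN | 280.0 | NaN |
| 1 | 30000153 | 1.0 | False | NaN | NaN | NaN | NaN | -3.0 | NaN | NaN | ... | NaN | NaN | 242.0 | NaN | 16.0 | 141.0 | 37.277779 | NaN | 45.0 | NaN |
| 2 | 30000153 | 2.0 | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 50.0 | NaN |
| 3 | 30000153 | 3.0 | False | NaN | NaN | NaN | NaN | -4.0 | 19.0 | NaN | ... | 3.1 | 173.0 | 215.0 | 25.299999 | 14.0 | 116.0 | 37.5 | NaN | 50.0 | 17.0 |
| 4 | 30000153 | 4.0 | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | 20.0 | 111.0 | NaN | NaN | 45.0 | NaN |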
