<img src="figure/clairvoyance_logo.png">

# Clairvoyance: Time-series prediction

## ML-AIM (http://vanderschaar-lab.com/)

This notebook is a user guide to time-series prediction with the Clairvoyance framework. Time-series prediction is defined as follows: use both static and temporal features to predict certain labels in the future. For instance, using temporal data (vitals, lab tests) and static data (demographic information), we predict 'whether the patient will die at the end of the hospital stay' or 'whether the patient will get a ventilator after 4 hours'.
- One-shot prediction: At a certain time point, predict the patient state at the end of the time-series.
  - Example: Predict patient mortality (at the end of the hospital stay) 24 hours after admission.
- Rolling window (online) prediction: At every time point, predict the patient state a fixed window ahead.
  - Example: Predict ventilator use 24 hours after the current time point.
 
<img src="figure/time-series-prediction-definition.png">
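
The difference between the two settings can be illustrated with a small sketch. This is not Clairvoyance code: the function names and the toy `ventilator` sequence are hypothetical, and the sketch only shows which target each setting would pair with each time step under the definitions above.

```python
def one_shot_target(labels):
    """One-shot: a single target, the label at the end of the series."""
    return labels[-1]


def online_targets(labels, window):
    """Online: at each time step t, the target is the label `window` steps ahead.

    Time steps that have no label `window` steps ahead get None.
    """
    n = len(labels)
    return [labels[t + window] if t + window < n else None for t in range(n)]


# Hypothetical hourly ventilator indicator for one patient (8 hours).
ventilator = [0, 0, 0, 0, 0, 1, 1, 1]

print(one_shot_target(ventilator))           # one target for the whole series
print(online_targets(ventilator, window=4))  # one target per time step
```

In the online case, the last `window` time steps have no target because the series ends before the prediction horizon is reached.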

To run this tutorial, you need:
### Temporal and static datasets for training and testing

To use your own temporal and static datasets for training and testing, save the files as 'data_name_temporal_train_data_eav.csv.gz', 'data_name_static_train_data.csv.gz', 'data_name_temporal_test_data_eav.csv.gz', and 'data_name_static_test_data.csv.gz' in the '../datasets/data/data_name/' directory.
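
As a quick check before running the notebook, the expected file names can be built from the data name and tested for existence. The helper `expected_files` below is a hypothetical convenience function, not part of Clairvoyance; it only encodes the naming convention described above.

```python
import os


def expected_files(data_name, root="../datasets/data"):
    """Return the four file paths this tutorial expects for `data_name`."""
    prefix = f"{root}/{data_name}/{data_name}_"
    suffixes = {
        "temporal_train": "temporal_train_data_eav.csv.gz",
        "static_train": "static_train_data.csv.gz",
        "temporal_test": "temporal_test_data_eav.csv.gz",
        "static_test": "static_test_data.csv.gz",
    }
    return {key: prefix + suffix for key, suffix in suffixes.items()}


paths = expected_files("mimic")
missing = [path for path in paths.values() if not os.path.exists(path)]
if missing:
    print("Missing files:")
    for path in missing:
        print("  " + path)
else:
    print("All four dataset files were found.")
```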


### Prerequisite
Clone https://github.com/jsyoon0823/time-series-automl.git to the current directory.

## Time-series prediction pipeline summary

<img src="figure/time-series-prediction-block-diagram.png">

### Step 1: Load dataset
  - Load the csv files of the raw datasets from the ../datasets/data/data_name/ directory.
  
### Step 2: Preprocess dataset
  - Preprocess the raw data using filters such as (1) replacing negative values with NaN, (2) one-hot encoding certain features, and (3) normalization.
  
### Step 3: Define problem
  - Set the time-series prediction problem that we want to solve. Set the problem (whether it is one-shot or online prediction), set the label, set the maximum sequence length, and set the treatment features. We also define the metric for evaluation and the task itself (whether classification or regression).

### Step 4: Impute dataset
  - Impute missing values in the preprocessed static and temporal datasets and return complete datasets.
  
### Step 5: Feature selection
  - Select the static and temporal features that are relevant to the labels. You can skip feature selection (set the feature selection method to None).
  
### Step 6: Time-series model fit and predict
  - After finishing the data preparation, we define the predictive models and train the model using the training dataset. After training, we use the trained model to predict the labels of the testing dataset.
  
### Step 7: Estimate uncertainty
  - Estimate the uncertainty of the predictor model and return the uncertainty of its predictions.

### Step 8: Interpret predictions
  - Interpret the trained predictor model and return the instance-wise feature and temporal importance.

### Step 9: Visualize results
  - Visualize the various results such as performance, predictions, uncertainties, and interpretations.

## Step 0: Import necessary packages

Import the packages needed for the entire tutorial.

In [1]:
# Necessary packages
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import warnings
warnings.filterwarnings('ignore')
import sys
sys.path.append('../')

from utils import PipelineComposer

## Step 1: Load dataset

Extract temporal and static datasets from 'data_name_temporal_train_data_eav.csv.gz', 'data_name_static_train_data.csv.gz', 'data_name_temporal_test_data_eav.csv.gz', and 'data_name_static_test_data.csv.gz' in the '../datasets/data/data_name/' directory.

- CSVLoader: Load csv files of the raw datasets from the ../datasets/data/data_name/ directory.
- data_name: 'mimic' in this tutorial.

In [2]:
from datasets import CSVLoader

# Define data name
data_name = 'mimic'
# Define data directory and file-name prefix
data_directory = '../datasets/data/'+data_name + '/' + data_name + '_'

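# Load train and test datasets
# NOTE: the train-file loader below is commented out, so the test files are
# loaded as the training set as well. Uncomment it (and remove the loader that
# follows) to train on the train files.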
# data_loader_training = CSVLoader(static_file=data_directory + 'static_train_data.csv.gz',
#                                  temporal_file=data_directory + 'temporal_train_data_eav.csv.gz')

data_loader_training = CSVLoader(static_file=data_directory + 'static_test_data.csv.gz',
                                 temporal_file=data_directory + 'temporal_test_data_eav.csv.gz')


data_loader_testing = CSVLoader(static_file=data_directory + 'static_test_data.csv.gz',
                                temporal_file=data_directory + 'temporal_test_data_eav.csv.gz')

dataset_training = data_loader_training.load()
dataset_testing = data_loader_testing.load()

print('Finish data loading.')

Finish data loading.
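
The `_eav` suffix in the temporal file names suggests an entity-attribute-value layout, where each row holds one measurement rather than one full time step. The sketch below shows how such rows could be pivoted into one record per (patient, time) pair. The column order (id, time, variable, value) and the sample rows are assumptions for illustration; how `CSVLoader` handles the files internally is not shown in this notebook.

```python
from collections import defaultdict

# Hypothetical EAV rows: (patient id, time, variable, value).
eav_rows = [
    (1, 0, "heart_rate", 80.0),
    (1, 0, "temperature", 36.6),
    (1, 1, "heart_rate", 85.0),
    (2, 0, "heart_rate", 70.0),
]


def eav_to_wide(rows):
    """Pivot EAV rows into {(id, time): {variable: value}}.

    Variables not measured at a given (id, time) are filled with None.
    """
    grouped = defaultdict(dict)
    for patient_id, time, variable, value in rows:
        grouped[(patient_id, time)][variable] = value
    variables = sorted({row[2] for row in rows})
    return {
        key: {name: record.get(name) for name in variables}
        for key, record in sorted(grouped.items())
    }


wide = eav_to_wide(eav_rows)
for key, record in wide.items():
    print(key, record)
```

The None entries produced here correspond to the missing values that the imputation step of the pipeline is meant to fill.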


## Step 2: Preprocess dataset

Preprocess the raw data using multiple filters. In this tutorial, we replace all negative values with NaN (using FilterNegative), one-hot encode the 'admission_type' feature (using OneHotEncoder), and apply MinMax normalization (using Normalizer). The filter pipeline is fit on the training dataset and then applied to both the training and testing datasets.
  - FilterNegative: Replace negative values with NaN
  - OneHotEncoder: One-hot encode certain features
    - one_hot_encoding: input features that need to be one-hot encoded
  - Normalizer (3 options): MinMax, Standard, None

In [3]:
from preprocessing import FilterNegative, OneHotEncoder, Normalizer

# (1) filter out negative values
negative_filter = FilterNegative()
# (2) one-hot encode categorical features
one_hot_encoding = 'admission_type'
onehot_encoder = OneHotEncoder(one_hot_encoding_features=[one_hot_encoding])
# (3) Normalize features: 3 options (minmax, standard, none)
normalization = 'minmax'
normalizer = Normalizer(normalization)

# Data preprocessing
filter_pipeline = PipelineComposer(negative_filter, onehot_encoder, normalizer)

dataset_training = filter_pipeline.fit_transform(dataset_training)
dataset_testing = filter_pipeline.transform(dataset_testing)

print('Finish preprocessing.')

Finish preprocessing.
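
The cell above calls `fit_transform` on the training set and `transform` on the testing set. The toy sketch below illustrates why that distinction matters: stages such as min-max scaling learn their parameters from the training data only and then reuse them on the test data. The `Toy*` classes are hypothetical stand-ins that operate on a single list of floats; they are not the Clairvoyance implementations of `FilterNegative`, `Normalizer`, or `PipelineComposer`.

```python
import math


class ToyFilterNegative:
    """Stateless stage: replace negative values with NaN."""

    def fit(self, values):
        return self

    def transform(self, values):
        return [math.nan if v < 0 else v for v in values]


class ToyMinMax:
    """Stateful stage: learn min and max in fit, rescale in transform."""

    def fit(self, values):
        observed = [v for v in values if not math.isnan(v)]
        self.low, self.high = min(observed), max(observed)
        return self

    def transform(self, values):
        span = (self.high - self.low) or 1.0  # avoid division by zero
        return [(v - self.low) / span for v in values]


class ToyComposer:
    """Chain stages; each stage sees the output of the previous one."""

    def __init__(self, *stages):
        self.stages = stages

    def fit_transform(self, values):
        for stage in self.stages:
            values = stage.fit(values).transform(values)
        return values

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values


train = [10.0, -1.0, 20.0, 30.0]
test = [15.0, 40.0, -5.0]

pipeline = ToyComposer(ToyFilterNegative(), ToyMinMax())
train_out = pipeline.fit_transform(train)  # learns min=10, max=30
test_out = pipeline.transform(test)        # reuses the training min and max

print(train_out)
print(test_out)
```

Because the scaler is fit on the training data only, test values outside the training range can fall outside [0, 1], as the second test value does here.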


## Step 3: Define problem   

Set the time-series prediction problem that we want to solve. Set the problem (whether it is one-shot or online prediction), set the label, set the maximum sequence length, and set the treatment features. We also define the metric for evaluation and the task itself (whether classification or regression). In this tutorial, we predict whether the patient will get a ventilator after 4 hours (online setting).
  - problem: 'one-shot' (one-time prediction) or 'online' (rolling window prediction)
    - 'one-shot': one-time prediction at the end of the time-series
    - 'online': prediction at every time stamp of the time-series
  - max_seq_len: maximum sequence length of the time-series
  - label_name: the column name for the label(s)
  - treatment: the column name for treatments
  - window: prediction horizon (x-hour-ahead prediction)
  
  - other parameters:
    - metric_name: auc, apr, mse, mae
    - task: classification or regression

In [4]:
from preprocessing import ProblemMaker

# Define parameters
problem = 'online'
max_seq_len = 24
label_name = 'ventilator'
treatment = None
window = 4

# Define problem 
problem_maker = ProblemMaker(problem=problem, label=[label_name],
                             max_seq_len=max_seq_len, treatment=treatment, window=window)

dataset_training = problem_maker.fit_transform(dataset_training)
dataset_testing = problem_maker.fit_transform(dataset_testing)

# Set other parameters
metric_name = 'auc'
task = 'classification'

metric_sets = [metric_name]
metric_parameters =  {'problem': problem, 'label_name': [label_name]}

print('Finish defining problem.')

100%|██████████| 4610/4610 [00:07<00:00, 585.39it/s]
100%|██████████| 4610/4610 [00:07<00:00, 628.99it/s]
100%|██████████| 4610/4610 [00:08<00:00, 570.11it/s]
100%|██████████| 4610/4610 [00:07<00:00, 581.33it/s]
100%|██████████| 4610/4610 [00:07<00:00, 628.70it/s]
100%|██████████| 4610/4610 [00:08<00:00, 569.97it/s]


Finish defining problem.
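
The metric chosen above is `'auc'`. As a reminder of what that number means, the sketch below computes the area under the ROC curve directly from its pairwise definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function `pairwise_auc` and the toy data are illustrative only; this is not the metric code that Clairvoyance uses.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Toy labels and predicted scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(pairwise_auc(labels, scores))  # 3 of 4 pairs are ranked correctly: 0.75
```

The double loop is quadratic in the number of examples, which is fine for a demonstration but not for large datasets.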


## Pipeline optimization

In [6]:
import automl.pipeline_opt

In [7]:
static_imputation_model_list = ['mean', 'median', 'mice', 'missforest', 'knn']
temporal_imputation_model_list = ['mean', 'median', 'linear', 'quadratic', 'cubic', 'spline']
static_feature_selection_model_list = [(None, None)]
temporal_feature_selection_model_list = [(None, None)]
model_name_list = ['gru', 'lstm', 'rnn']
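
To get a feel for the size of the search space these lists define, the candidate pipelines can be enumerated with `itertools.product`. This is only a counting sketch: it does not reproduce how `PipelineOpt` explores the space, and the variable `search_space` is introduced here for illustration.

```python
from itertools import product

static_imputation_model_list = ['mean', 'median', 'mice', 'missforest', 'knn']
temporal_imputation_model_list = ['mean', 'median', 'linear', 'quadratic', 'cubic', 'spline']
static_feature_selection_model_list = [(None, None)]
temporal_feature_selection_model_list = [(None, None)]
model_name_list = ['gru', 'lstm', 'rnn']

# Every combination of one choice per pipeline stage.
search_space = list(product(
    static_imputation_model_list,
    temporal_imputation_model_list,
    static_feature_selection_model_list,
    temporal_feature_selection_model_list,
    model_name_list,
))

print(len(search_space))  # 5 * 6 * 1 * 1 * 3 = 90
print(search_space[0])
```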


In [8]:
auto_opt = automl.pipeline_opt.PipelineOpt(
    static_imputation_model_list, temporal_imputation_model_list,
    static_feature_selection_model_list, temporal_feature_selection_model_list,
    model_name_list, dataset_training, dataset_testing, task, metric_name, metric_parameters)

100%|██████████| 4610/4610 [00:56<00:00, 81.14it/s]
100%|██████████| 4610/4610 [00:56<00:00, 81.94it/s]
100%|██████████| 4610/4610 [02:33<00:00, 30.00it/s]
100%|██████████| 4610/4610 [02:33<00:00, 30.06it/s]
100%|██████████| 4610/4610 [02:33<00:00, 29.96it/s]
100%|██████████| 4610/4610 [02:33<00:00, 30.02it/s]
100%|██████████| 4610/4610 [02:35<00:00, 29.69it/s]
100%|██████████| 4610/4610 [02:34<00:00, 29.92it/s]
100%|██████████| 4610/4610 [02:34<00:00, 29.91it/s]
100%|██████████| 4610/4610 [02:33<00:00, 30.03it/s]


In [9]:
best_model, best_obj = auto_opt.run_opt(steps=5)

100%|██████████| 4610/4610 [00:05<00:00, 837.58it/s]
100%|██████████| 4610/4610 [00:05<00:00, 838.92it/s]
100%|██████████| 4610/4610 [00:05<00:00, 825.65it/s]
100%|██████████| 4610/4610 [00:05<00:00, 832.64it/s]


In [10]:
best_model

['mice', 'linear', (None, None), (None, None), 'lstm']

In [11]:
best_obj

-0.9405225519712068
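
The returned list is easier to read when each entry is paired with a stage name. The sketch below assumes that the entries of `best_model` follow the same order as the model lists passed to `PipelineOpt`, and that `best_obj` is the negated metric (so that a lower objective is better). Neither assumption is confirmed by this notebook, and the `stage_names` labels are hypothetical.

```python
# Values copied from the outputs above.
best_model = ['mice', 'linear', (None, None), (None, None), 'lstm']
best_obj = -0.9405225519712068

# Hypothetical labels, assuming the same order as the PipelineOpt arguments.
stage_names = [
    'static_imputation',
    'temporal_imputation',
    'static_feature_selection',
    'temporal_feature_selection',
    'predictor',
]

summary = dict(zip(stage_names, best_model))
for stage, choice in summary.items():
    print(f"{stage:28s} {choice}")

# Assumption: the objective is the negated AUC.
print(f"AUC (if objective = -AUC): {-best_obj:.4f}")
```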