# Data-Derived Contextual Motif Example

### Introduction
This notebook serves as a walkthrough for the data-derived contextual motif tools described in https://arxiv.org/abs/1703.02144

In [1]:
import numpy as np
import sklearn
import sklearn.cluster
import joblib  # formerly sklearn.externals.joblib, which was removed in scikit-learn 0.23
from tqdm import tqdm

from lib import glucose_processing as GP
from lib import data_derived_motifs as ddm

In [24]:
# Your data here
# Random data to demonstrate how the pipeline runs
patients = {}
for i in range(1,10):
    patients[i] = {'glucose':[np.random.randint(0,3,1152)*60+69 for j in range(13)]}

### Step 1. Discover Motifs
To discover motifs, we use a simplified version of the MDLats pipeline, explained in our paper. The full MDLats approach is presented here: http://ieeexplore.ieee.org/document/7056438/
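
A motif-discovery pipeline of this kind usually starts by cutting each session into fixed-length candidate windows and, optionally, centering and scaling each window. The sketch below illustrates that step with NumPy alone, using the window length and stride set in the next cell (8 and 1). The helpers `sliding_windows` and `center_and_scale` are hypothetical names introduced for illustration; they are not the `ddm` functions, which may handle gaps and session boundaries differently.

```python
import numpy as np

def sliding_windows(series, length, stride):
    """Return every window of `length` samples, taken `stride` samples apart."""
    series = np.asarray(series, dtype=float)
    n_windows = (len(series) - length) // stride + 1
    if n_windows <= 0:
        return np.empty((0, length))
    starts = np.arange(n_windows) * stride
    return np.stack([series[s:s + length] for s in starts])

def center_and_scale(windows, center=True, scale=True, eps=1e-8):
    """Normalize each window (row) independently."""
    out = np.asarray(windows, dtype=float)
    if center:
        out = out - out.mean(axis=1, keepdims=True)
    if scale:
        std = out.std(axis=1, keepdims=True)
        # Flat windows have zero spread; leave them unscaled instead of dividing by zero.
        out = out / np.where(std < eps, 1.0, std)
    return out

# A short toy session (assumed values, not real glucose data)
session = np.array([69, 69, 129, 189, 129, 69, 69, 129, 189, 189], dtype=float)
windows = sliding_windows(session, length=8, stride=1)
normalized = center_and_scale(windows)
print(windows.shape)
```

With a stride of 1, a session of `n` samples yields `n - motif_length + 1` candidates, which is why the candidate count grows quickly with the amount of data.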

In [3]:
# We must predefine relevant motif discovery parameters
motif_length = 8
stride_length = 1 
motif_min_count = 5
n_motifs = 15

# These parameters inform the data preprocessing and SAX discretization
center = True
scale = True
n_letters = 30

# problem specific parameters
hypo_thresh = 1
hyper_thresh = 3

# first, we divide data into proto-motif candidates
cands, chunkdiv = ddm.sess_to_cand(patients, motif_length, stride_length)
# second, we preprocess the candidates to center and scale if set
cands = ddm.cand_preprocessing(cands, center, scale)
# we represent continuous waveform data with a variant of SAX, explained in our paper
sax_dat, div = ddm.balancing_SAX(cands, n_letters)
# the discretized candidates are transformed into motif prototypes
motif_proto, motif_proto_set = ddm.protomotifs(sax_dat, motif_min_count)

# we cluster the prototypes to get n_motifs maximally distinct motifs
kmini = sklearn.cluster.MiniBatchKMeans(n_clusters=n_motifs, batch_size=400)
klabels = kmini.fit_predict(motif_proto)

100%|██████████| 133848/133848 [00:09<00:00, 14474.10it/s]


35692.8
21 30
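
Classic SAX places its breakpoints at Gaussian quantiles. One plausible way to "balance" the alphabet is to take the breakpoints from the empirical quantiles of the data, so that every letter is used about equally often. The sketch below shows that idea. It is an assumption about the general technique, not a description of what `ddm.balancing_SAX` does, and `equal_frequency_symbols` is a hypothetical helper.

```python
import numpy as np

def equal_frequency_symbols(values, n_letters):
    """Map real values to integer symbols 0..n_letters-1 using quantile breakpoints."""
    values = np.asarray(values, dtype=float)
    # n_letters - 1 interior quantiles split the data into bins of roughly equal mass.
    quantiles = np.linspace(0.0, 1.0, n_letters + 1)[1:-1]
    breakpoints = np.quantile(values.ravel(), quantiles)
    # Each value's symbol is the number of breakpoints at or below it.
    symbols = np.searchsorted(breakpoints, values, side="right")
    return symbols, breakpoints

rng = np.random.default_rng(0)
toy_windows = rng.normal(size=(1000, 8))      # stand-in for preprocessed candidates
symbols, breakpoints = equal_frequency_symbols(toy_windows, n_letters=4)
letter_counts = np.bincount(symbols.ravel(), minlength=4)
print(letter_counts)
```

Each row of `symbols` can then be treated as a discrete word, and words that recur often enough become candidates for motif prototypes.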


### Convert data into contextual motif representation
Now that we have our baseline motifs, we look for contextual motifs. We do this by appending to each motif representation an indicator variable for the context in which it occurred.
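
To make the idea of a context indicator concrete, here is a minimal sketch of a trend-style context: the readings that precede a motif are summarized by their average slope, and the slope is binned into one of four ordered categories. The function `trend_context`, its thresholds, and the example values are all hypothetical; the paper's trend context and the `ddm` context functions may be defined differently.

```python
import numpy as np

def trend_context(history, thresholds=(-1.0, 0.0, 1.0)):
    """Return a context id in 0..len(thresholds) based on the mean slope of `history`."""
    history = np.asarray(history, dtype=float)
    if history.size < 2:
        raise ValueError("need at least two samples to estimate a trend")
    slope = (history[-1] - history[0]) / (history.size - 1)
    # searchsorted places the slope into one of len(thresholds) + 1 ordered bins.
    return int(np.searchsorted(thresholds, slope, side="right"))

# Pair a (hypothetical) motif id with the context in which it occurred.
motif_id = 7
context_id = trend_context([140.0, 131.0, 120.0, 112.0])  # steadily falling readings
contextual_motif = (motif_id, context_id)
print(contextual_motif)
```

Three thresholds give four contexts, which matches the `context_size = 4` used in the next cell, although that correspondence is only illustrative.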

In [42]:
# Each call to this cell generates a set of contextual motifs
# context_func determines the context; the contexts used in the paper are no_context, trend, and hmm
# note that hmm relies on an HMM trained beforehand with pomegranate
max_interp_length = 2
max_missing_data_x = 144 # allow half of a day to be missing for input
max_missing_data_y = 144 # allow half of a day to be missing for label
context_func = ddm.dummy_context
precompute_func = None
context_size = 4
X, labels, y_hypo, y_hyper, y_event = ddm.get_days_and_events(patients, 
                                                              max_interp_length, 
                                                              max_missing_data_x, 
                                                              max_missing_data_y)
day_motif = []
for i in tqdm(range(len(X))):
    try:
        day_motif.append(ddm.day_to_motif_vec(X[i],
                                              motif_length,
                                              stride_length,
                                              div,
                                              motif_proto_set,
                                              n_motifs,
                                              kmini,
                                              (context_func, context_size, precompute_func),
                                              center,
                                              scale))
    except Exception:
        # report the index of the day that failed, then re-raise
        print(i)
        raise


  0%|          | 0/351 [00:00<?, ?it/s]

In [43]:
# example of one day's motif representation
day_motif[0]

array([  7.,   8.,   8.,   8.,   4.,  11.,   1.,  18.,  13.,   7.,   5.,
        17.,   8.,  16.,   9.,   5.,   0.,  15.,   0.,  10.,   9.,   5.,
         0.,  13.,   0.,   2.,   4.,   7.,   0.,  13.,   0.,   5.,   4.,
         0.,   0.,   5.,   0.,   0.,   6.,   5.,   0.,   5.,   0.,   4.,
         3.,   0.,   0.,   0.,   0.,   0.,   0.,   7.,   0.,   0.,   0.,
         8.,   0.,   0.,   0.,   5.])
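
The vector above has 60 entries, which equals `n_motifs * context_size` (15 × 4). That is consistent with one count per (context, motif) pair. The sketch below builds such a vector from per-window motif and context ids. The ordering of the entries (one block of motif counts per context) is an assumption, and `contextual_motif_counts` is a hypothetical helper rather than the logic inside `ddm.day_to_motif_vec`.

```python
import numpy as np

def contextual_motif_counts(motif_ids, context_ids, n_motifs, n_contexts):
    """Count how often each (context, motif) pair occurs in one day."""
    motif_ids = np.asarray(motif_ids)
    context_ids = np.asarray(context_ids)
    # Assumed layout: one block of n_motifs counts per context, concatenated.
    flat_index = context_ids * n_motifs + motif_ids
    counts = np.bincount(flat_index, minlength=n_motifs * n_contexts)
    return counts.astype(float)

# Six windows in a toy day, each with a motif id and a context id
motif_ids = np.array([0, 2, 2, 1, 0, 2])
context_ids = np.array([0, 0, 1, 1, 1, 0])
vec = contextual_motif_counts(motif_ids, context_ids, n_motifs=3, n_contexts=2)
print(vec)  # [1. 0. 2. 1. 1. 1.]
```

With a single context the vector reduces to a plain motif histogram, so the contextual representation strictly extends the baseline one.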

### Evaluation Scheme
The following demonstrates how we evaluate the quality of our motif representation. We use the learned motif representation as input to a logistic regression model and test predictive performance. We tune hyperparameters for the ML model using random search over a specified number of splits.

In [48]:
# for the paper experiments, we performed random search over the following parameter space with
# budget == 200, num_split == 100
params = {'C':10.**np.arange(-10, 10, .01), 
          'penalty':['l1','l2'], 
          'class_weight':[None, 'balanced']}
budget = 10
num_split = 5

# with random data, performs randomly (as we would expect)
ddm.test_motif_rep(day_motif, 
                   y_hypo, 
                   y_hyper, 
                   labels, 
                   params, 
                   budget, 
                   num_split)


5it [00:05,  1.07s/it]

{'hyper': [0.5, 0.5, 0.5, 0.49025974025974028, 0.4759036144578313],
 'hypo': [0.57662337662337659, 0.43860946745562129, 0.5, 0.5, 0.5]}
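
For readers who want to reproduce this style of evaluation without `ddm`, the following is a minimal sketch using scikit-learn only. It rests on several assumptions: that the reported scores are AUROC values, that `labels` identifies the patient for each day so that a patient's days stay on one side of each split, and that the `liblinear` solver is acceptable (it supports both `l1` and `l2` penalties). The function `random_search_auc` is hypothetical and is not the implementation of `ddm.test_motif_rep`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupShuffleSplit, ParameterSampler

def random_search_auc(X, y, groups, param_space, budget, num_split, seed=0):
    """Return (best_params, best_mean_auc) over `budget` randomly sampled settings."""
    splitter = GroupShuffleSplit(n_splits=num_split, test_size=0.25, random_state=seed)
    splits = list(splitter.split(X, y, groups))
    best_params, best_auc = None, -np.inf
    for params in ParameterSampler(param_space, n_iter=budget, random_state=seed):
        aucs = []
        for train_idx, test_idx in splits:
            # AUROC is undefined when either side of the split has a single class.
            if len(np.unique(y[train_idx])) < 2 or len(np.unique(y[test_idx])) < 2:
                continue
            model = LogisticRegression(solver="liblinear", **params)
            model.fit(X[train_idx], y[train_idx])
            scores = model.predict_proba(X[test_idx])[:, 1]
            aucs.append(roc_auc_score(y[test_idx], scores))
        if aucs and np.mean(aucs) > best_auc:
            best_params, best_auc = params, float(np.mean(aucs))
    return best_params, best_auc

# Toy data (assumed): count-like features with a weak signal in the first column
rng = np.random.default_rng(0)
X_toy = rng.poisson(3.0, size=(200, 12)).astype(float)
y_toy = (X_toy[:, 0] + rng.normal(size=200) > 3.0).astype(int)
groups_toy = rng.integers(0, 10, size=200)  # hypothetical patient ids
param_space = {"C": (10.0 ** np.arange(-3, 4)).tolist(),
               "penalty": ["l1", "l2"],
               "class_weight": [None, "balanced"]}
best_params, best_auc = random_search_auc(X_toy, y_toy, groups_toy, param_space,
                                          budget=5, num_split=3)
print(best_params, best_auc)
```

Splitting by group rather than by day avoids leaking a patient's data between training and test sets, which would otherwise inflate the scores.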