In [5]:
import pandas as pd
import json
import os

#### Rollup of results from the LSTM and RTP experiments

In [25]:
# collect LSTM metrics
metrics = []
folders = os.listdir('./LSTM/results')
for folder in folders:
    json_path = os.path.join('./LSTM/results', folder, 'metrics.json')
    with open(json_path, 'r') as FP:
        m = json.load(FP)
    m['result set'] = folder
    metrics.append(m)
df_lstm = pd.DataFrame(metrics, columns=['result set', 'accuracy', 'precision', 
                                    'recall', 'f1', 'auc'])
df_lstm.sort_values(by='f1', ascending=False, inplace=True)

# collect RTP metrics
rtp_results_path = os.path.join('./RTP/results', 'test_results.csv')
df_rtp = pd.read_csv(rtp_results_path)
# keep only the held-out test rows; chaining avoids an in-place drop on a filtered slice
df_rtp = df_rtp[df_rtp['train/test'] == 'test'].drop(columns=['train/test'])

### Overall

The original positive (shock) and negative (non-shock) datasets were first split into training and test sets. Visits were selected at random by the 'VisitIdentifier' column and pulled out as test examples; the remaining visits became the training examples. The test sets were set to 10% of the total data. The same splits were stored and reused in all experiments, with both the RTP and LSTM models, to keep the comparison consistent. The training and test sets were also checked for overlapping visits to avoid data leakage; none were found.
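
A visit-level split of this kind could be sketched as follows. This is only an illustration under stated assumptions, not the code used for the experiments: the function name `split_by_visit`, the seed, and the toy data are hypothetical; only the 'VisitIdentifier' column and the 10% test fraction come from the description above.

```python
import numpy as np
import pandas as pd

def split_by_visit(df, id_col="VisitIdentifier", test_frac=0.10, seed=0):
    """Hold out a random fraction of visits (not rows) as the test set."""
    rng = np.random.default_rng(seed)
    visit_ids = df[id_col].unique()
    n_test = max(1, int(round(test_frac * len(visit_ids))))
    test_ids = rng.choice(visit_ids, size=n_test, replace=False)
    is_test = df[id_col].isin(test_ids)
    train, test = df[~is_test], df[is_test]
    # Leakage check: no visit may appear on both sides of the split.
    assert set(train[id_col]).isdisjoint(test[id_col])
    return train, test

# Toy data: 20 visits with 3 observations each (values are made up).
toy = pd.DataFrame({
    "VisitIdentifier": np.repeat(np.arange(20), 3),
    "systolicBP": np.arange(60),
})
train_df, test_df = split_by_visit(toy)
```

Splitting on the visit identifier rather than on rows keeps all observations of a visit on the same side of the split, which is what prevents leakage between training and test data.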

This way it was possible to evaluate the resulting models (within the limits of the available data) on "unseen" data, which ideally gives a better idea of how the models might perform in clinical practice.

### RTP results

Experiments were run with both Support Vector Machine (SVM) and Logistic Regression classifiers. Each was run over a set of parameters for the RTP mining, as follows.

1. Maximum gap: explore between 4 and 10 hours, in increments of 1;
2. Support for positive cases (shock): explore between 0.1 and 0.3, in increments of 0.05;
3. Support for negative cases (non-shock): explore between 0.1 and 0.3, in increments of 0.05.

In addition, the SVM model was trained with different kernels: 'linear', 'poly', 'rbf', 'sigmoid'.
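
The size of this search space can be sketched with a simple enumeration. The key names mirror the columns of the RTP results table (`max_gap`, `min_support_pos`, `min_support_neg`); treating the search as a full Cartesian product of the three ranges is an assumption, since the text does not say how the ranges were combined.

```python
from itertools import product

max_gaps = range(4, 11)                      # 4..10 hours, step 1
supports = [0.10, 0.15, 0.20, 0.25, 0.30]    # 0.1..0.3, step 0.05
kernels = ["linear", "poly", "rbf", "sigmoid"]

# One dict per RTP mining configuration (assumed full grid).
rtp_grid = [
    {"max_gap": g, "min_support_pos": sp, "min_support_neg": sn}
    for g, sp, sn in product(max_gaps, supports, supports)
]
# The SVM additionally varies the kernel for every mining configuration.
svm_grid = [dict(params, kernel=k) for params, k in product(rtp_grid, kernels)]

print(len(rtp_grid), len(svm_grid))  # 175 700
```

Under that assumption there are 175 mining configurations for Logistic Regression and 700 configurations for the SVM.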

The best configuration for each classifier was then selected, and the corresponding model was retrained and used to predict the held-out test data set. The results are shown below:

In [26]:
df_rtp

Unnamed: 0,model,kernel,max_gap,min_support_pos,min_support_neg,accuracy,precision,recall,f1,auc
1,"SVC(gamma=0.1, kernel='poly')",poly,9.0,0.1,0.1,0.837838,0.755102,1.0,0.860465,0.837838
3,LogisticRegression(),,10.0,0.3,0.3,0.824324,0.772727,0.918919,0.839506,0.824324


The best model here seems to be the Support Vector Machine with the 'poly' kernel. On the test data it showed a recall (sensitivity) of 100%: every shock case was identified, with no false negatives. Its specificity is lower, at 67.6%; in other words, the false positive rate is 32.4%. It *is* appropriate to err on the side of false positives, as it may still be worthwhile for a patient to receive closer observation over the next 24 hours. However, too high a false positive rate may drain resources unnecessarily. One possibility is to use a threshold on the predicted probability of shock in the prediction window, rather than a binary prediction, for triage purposes.
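
The specificity is not a column of the results table, but it can be recovered from the reported accuracy, precision, and recall. The helper below is a sketch written for this note (the name `specificity_from_metrics` is hypothetical); it relies only on the standard definitions of the three metrics.

```python
def specificity_from_metrics(accuracy, precision, recall):
    """Recover specificity (true negative rate) from accuracy, precision and recall.

    Counts are expressed per positive example:
      TP = recall, FP = recall * (1 / precision - 1).
    Solving the accuracy definition for the negative/positive ratio then
    gives the false positive rate FP / N.
    """
    if accuracy >= 1.0:
        return 1.0  # perfect accuracy implies no false positives
    fp_per_pos = recall * (1.0 / precision - 1.0)
    neg_per_pos = (accuracy + recall / precision - 2.0 * recall) / (1.0 - accuracy)
    return 1.0 - fp_per_pos / neg_per_pos

# Metrics of the SVM row in the table above.
svm_specificity = specificity_from_metrics(0.837838, 0.755102, 1.0)
print(round(svm_specificity, 3))  # 0.676
```

The intermediate ratio `neg_per_pos` comes out at about 1 for this row, which suggests that the test set holds equal numbers of shock and non-shock visits.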

### LSTM results

Several experiments were conducted with LSTM and bidirectional LSTM models. The data was preprocessed in two different ways: time-unaware sequences (ignoring the varying intervals between observations), and "time reconstructed" sequences, in which an attempt was made to reconstitute the observations at consistent time intervals (30 minutes). In the latter case, which amounts to feature engineering, the values were either discretized (as integers) into categorical features or left as is (no scaling applied).

In the "naive" approach the data set was reshaped into a 3-dimensional array of shape (visits, observations, values) to prepare it for the LSTM model. Padding was applied to bring the second dimension up to a consistent maximum number of time steps.
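
The reshaping and padding step might look roughly like this in NumPy. It is a sketch under assumptions: the helper name `pad_visits` is hypothetical, padding is appended after the real observations, and the padding value is left as a parameter because the text does not name one for the naive approach.

```python
import numpy as np

def pad_visits(visits, pad_value=0.0):
    """Stack variable-length (observations, values) arrays into one 3-D array."""
    max_steps = max(v.shape[0] for v in visits)
    n_values = visits[0].shape[1]
    out = np.full((len(visits), max_steps, n_values), pad_value, dtype=float)
    for i, v in enumerate(visits):
        out[i, : v.shape[0], :] = v  # real observations first, padding after
    return out

# Two toy visits with 2 and 5 observations of 3 values each.
batch = pad_visits([np.ones((2, 3)), np.ones((5, 3))])
print(batch.shape)  # (2, 5, 3)
```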

The reconstruction method used logic parallel to that of the RTP mining: the observations were first transformed into Multivariate State Sequences (MSS), and from those a consistent timeline of events was reconstructed. E.g., given an MSS:

[('temperature', N, 0, 55), ('systolicBP', H, 0, 26), ('systolicBP', N, 27, 63), ('temperature', H, 56, 123), ('systolicBP', L, 90, 122)]

It would become (conceptually):

Legend:

| time (min) | temperature | systolicBP |
| ---------- | ----------- | ---------- |
| 0          | N           | H          |
| 30         | N           | N          |
| 60         | N           | N          |
| 90         | H           | L          |
| 120        | H           | L          |
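
One simple way to build such a timeline is to sample, every 30 minutes, the state whose interval covers that minute. The sketch below assumes this point-sampling rule and inclusive interval bounds; the function name `mss_to_timeline` is hypothetical, and the rule used in the actual experiments may treat interval boundaries differently. With point sampling, for instance, temperature at minute 60 comes out as H, because that state begins at minute 56.

```python
def mss_to_timeline(mss, step=30):
    """Sample each feature's state every `step` minutes."""
    # Feature names in order of first appearance.
    features = list(dict.fromkeys(name for name, _, _, _ in mss))
    end = max(stop for _, _, _, stop in mss)
    rows = []
    for t in range(0, end + 1, step):
        row = {"time": t}
        for feature in features:
            # State whose [start, stop] interval covers t; None if there is a gap.
            row[feature] = next(
                (state for name, state, start, stop in mss
                 if name == feature and start <= t <= stop),
                None,
            )
        rows.append(row)
    return rows

mss = [("temperature", "N", 0, 55), ("systolicBP", "H", 0, 26),
       ("systolicBP", "N", 27, 63), ("temperature", "H", 56, 123),
       ("systolicBP", "L", 90, 122)]
timeline = mss_to_timeline(mss)
for row in timeline:
    print(row)
```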

This was stored as a 3-dimensional array of visits, time steps, and values. The second dimension was the maximum number of time steps over all the visits, with the array padded with -10 as needed. A Keras masking layer then preceded the LSTM layer in the model architecture so that the model would ignore the padding values.
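
What the masking step achieves can be illustrated in plain NumPy. The sketch below is not the Keras layer itself; it only mirrors the documented rule that a time step is skipped when all of its values equal the mask value. The helper name `timestep_mask` and the toy batch are hypothetical.

```python
import numpy as np

PAD = -10.0  # padding value named in the text

def timestep_mask(batch, mask_value=PAD):
    """True where a time step holds real data, False where every value is padding."""
    return np.any(batch != mask_value, axis=-1)

# Toy batch: 2 visits, 4 time steps, 2 values; visit 0 has 3 real steps, visit 1 has 1.
batch = np.full((2, 4, 2), PAD)
batch[0, :3] = 1.0
batch[1, :1] = 2.0

mask = timestep_mask(batch)
print(mask.sum(axis=1))  # [3 1]
```

A padding value such as -10 only works if it cannot occur as a complete row of real observations, since such a row would be masked as well.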

One issue in either case was computational resources. A handful of visits (3-6) had very long time spans, in some cases more than 30,000 observations. Complete runs would have taken days or even weeks, which was not feasible within the timeframe of this project. To make the experiments more tractable (especially on a CPU without a GPU), this small minority of visits was dropped as "temporal outliers." Even then, the reconstructed timelines were run on Google Colab to take advantage of its GPU support.
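
Dropping such visits could be done with a per-visit observation count, as in the sketch below. The helper name `drop_long_visits` and the cutoff `max_obs` are hypothetical; the text does not state which threshold defined a "temporal outlier".

```python
import pandas as pd

def drop_long_visits(df, max_obs, id_col="VisitIdentifier"):
    """Remove every visit that has more than `max_obs` observations."""
    # Number of observations of the visit each row belongs to.
    obs_per_row = df[id_col].map(df[id_col].value_counts())
    return df[obs_per_row <= max_obs]

# Toy data: visits 1 and 3 are short, visit 2 is a long outlier.
toy = pd.DataFrame({"VisitIdentifier": [1] * 3 + [2] * 50 + [3] * 4})
kept = drop_long_visits(toy, max_obs=10)
print(sorted(kept["VisitIdentifier"].unique()))  # [1, 3]
```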

In each case, the best results were selected and the corresponding model was used to predict on the test data set.

The result sets are described in the table below; the metrics that follow are sorted from best to worst by F1 score:

| result set | description |
| ----- | ----------- |
| naive_results | Vanilla LSTM, unaware of varying time intervals |
| bi_results | Bidirectional LSTM, unaware of varying time intervals |
| tr_results | Vanilla LSTM, reconstructed timeline with discretized values |
| tr_noencode_results | Vanilla LSTM, reconstructed timeline with continuous values |
| tr_bi_lstm_results | Bidirectional LSTM, reconstructed timeline with discretized values |
| tr_noencode_bi_results | Bidirectional LSTM, reconstructed timeline with continuous values |

In [27]:
df_lstm

Unnamed: 0,result set,accuracy,precision,recall,f1,auc
1,naive_results,0.891892,0.837209,0.972973,0.9,0.891892
0,bi_results,0.878378,0.833333,0.945946,0.886076,0.878378
3,tr_noencode_bi_results,0.864865,0.846154,0.891892,0.868421,0.864865
4,tr_noencode_results,0.837838,0.857143,0.810811,0.833333,0.837838
5,tr_results,0.72973,0.717949,0.756757,0.736842,0.72973
2,tr_bi_lstm_results,0.554054,0.529412,0.972973,0.685714,0.554054


Interestingly, the most naive approach (ignoring uneven intervals between time steps) produces the highest accuracy, F1 score, and AUC, even though such irregular time sequences are usually not optimal for plain LSTM layers. The bidirectional LSTM scored only slightly lower.

Also, for the time-reconstructed models, the unscaled raw feature values (such as systolicBP) gave better results than discretizing the values along quantile lines. The discretized values with a bidirectional LSTM performed far worse, with an accuracy of only 55% (not much better than pure chance).