# Citation to Original Paper
Xiong Liu, Yu Chen, Jay Bae, Hu Li, Joseph Johnston, and Todd Sanger. 2019. Predicting heart failure read-mission from clinical notes using deep learning.

# Reproducibility summary
"Predicting Heart Failure Readmission from Clinical Notes Using Deep Learning" The authors aim to prove that deep learning models are more accurate at the readmission prediction task as compared to other machine learning models. Specifically, the authors hypothesize that a CNN model will perform better than a regular machine learning model based on random forest [1]. We will test these same two models to compare performance.

We found the the CNN model that uses the Word2vec embeddings pre-trained on PubMed and PubMed centeral text does outperfom the RNN TF-IDF modal. With the CNN having an accuracy of 67% and an f1-score of .47, and the RF model having an accuracy of 61% and and f1-score of .47 for the 30-day readmission prediction.


# SETUP

Dependencies:

In [None]:
# pip install tensorflow==2.12.*
# pip install keras
# pip install nltk
# pip install -U scikit-learn
# pip install pandas
# pip install numpy
# pip install -U scikit-learn

In [1]:
import numpy as np
import pandas as pd
from gensim.models.word2vec import Word2Vec
from sklearn.model_selection import train_test_split
from keras.utils import to_categorical
from keras.layers import Dense, Dropout, Conv1D, MaxPool1D, GlobalMaxPool1D, Embedding, Activation
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from keras.models import Sequential
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.snowball import PorterStemmer
from sklearn import preprocessing
from time import time
from sklearn.metrics import classification_report
from tensorflow.keras.layers import Embedding
from tensorflow.keras import layers

import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

# Data Download Instructions:

MIMIC-III Clinical Database [3]

Link to dataset: https://physionet.org/content/mimiciii/1.4/

1.   Using the link to the dataset, create a physinet account.
2.   Navigate to he bottom webapge of the link to the "Files" section
3.   Complete the listed required training: "CITI Data or Specimens Only Research"
4.  Submit your training to physinet
5.  You'll recive an email that your application is approved, you can login in to phsionet and access the data.

# Download word2Vec and glove Embeddigns
Word2Vec PubMed-and-PMC-w2v.bin 3.08GB http://bio.nlplab.org/
Glove Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB) https://github.com/stanfordnlp/GloVe



Paths to files and models

In [2]:
notes_path = '/content/drive/MyDrive/DL_Final_Project/NOTEEVENTS.csv'
admissions_path = '/content/drive/MyDrive/DL_Final_Project/ADMISSIONS.csv'
icd_codes_path = '/content/drive/MyDrive/DL_Final_Project/DIAGNOSES_ICD.csv'

#Word2Vec model used to generate embeddigns for CNN
biowordvecpath = '/content/drive/MyDrive/DL_Final_Project/PubMed-and-PMC-w2v.bin'

#glove model
glovepath = '/content/drive/MyDrive/DL_Final_Project/glove.42B.300d.txt'

In [3]:
# remove any HTML tags, non-word characters, numbers; convert all text to lowercase; remove stopwords
def preprocessor(text):
    text = re.sub('<[^>]*>', '', text)
    text = re.sub('[\W]+', ' ', text.lower()) 
    text = re.sub(" \d+", " ", text)
    
    #remove stop words
    tokens = [token for token in word_tokenize(text) if token not in punctuation and token not in stop_words]
    return ' '.join(tokens)

def create_train_and_val_sets(df_train, df_test):
  # x-train and y-train
    train_samples = df_train['text2'].tolist()
    train_labels = df_train['OUTPUT_LABEL'].tolist()

  # x-test and y-test
    val_samples = df_test['text2'].tolist()
    val_labels = df_test['OUTPUT_LABEL'].tolist()

    # get vocab list & assign an index to each vocab word in x-train
    vectorizer = TextVectorization(max_tokens=20000, output_sequence_length=200)
    text_ds = tf.data.Dataset.from_tensor_slices(train_samples).batch(128)
    vectorizer.adapt(text_ds)
    voc = vectorizer.get_vocabulary()
    word_index = dict(zip(voc, range(len(voc))))

    # finally vectorize x-train, x-test, y-train, y-test according to the above vectorizer
    x_train = vectorizer(np.array([[s] for s in train_samples])).numpy()
    x_val = vectorizer(np.array([[s] for s in val_samples])).numpy()

    y_train = np.array(train_labels)
    y_val = np.array(val_labels)

    return x_train, y_train, x_val, y_val, voc, word_index

def evaluate_model(model, x_test, y_test):
    Y_pred = np.argmax(model.predict(x_test),axis=1)
    y_true = y_test

    cnn_report1=classification_report(y_true,Y_pred,output_dict=True)
    df_cnn1=pd.DataFrame(cnn_report1).transpose()
    return df_cnn1

def subsample(df_train):
    pos_rows = df_train.OUTPUT_LABEL == 1
    df_train_pos = df_train.loc[pos_rows]
    df_train_neg = df_train.loc[~pos_rows]

    # # Merge the balanced data
    df_train_balanced = pd.concat([df_train_pos, df_train_neg.sample(n = len(df_train_pos), random_state=42)], axis = 0)

    # Shuffle the order of training samples
    df_train_balanced = df_train_balanced.sample(n = len(df_train_balanced), random_state = 42).reset_index(drop=True)

    return df_train_balanced

def modify_sentence(sentence, p=0.5):
    for i in range(len(sentence)):
        if np.random.random() > p:
            try:
                syns = reloaded_word_vectors.most_similar([sentence[i]])
                syns = [x for (x, y) in syns]
                sentence[i] = np.random.choice(syns)
            except KeyError:
                pass
            
    return sentence

# Loading MIMIC-III Raw Data & Creating Dataset

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
import pandas as pd

df_add = pd.read_csv(admissions_path) # 58976 rows of data
df_notes = pd.read_csv(notes_path, dtype='unicode') # 2083180 rows of data
df_codes = pd.read_csv(icd_codes_path) # 651047 rows of data

Diagnosis dataframe

In [7]:
# including only rows of data with heart failure ICD-9 codes
hf_codes = ['39891', '40201', '40211', '40291', '40401', '40403', '40411', '40413', '40491', '40493', '4280', '4281', '42820','42821', '42822', '42823', '42830', '42831', '42832', '42833', '42840', '42841', '42842', '42843','4289']
df_codes = df_codes.loc[df_codes.ICD9_CODE.isin(hf_codes)] # 651047 -> 21274 rows of data

# list of subject_ids asociated with hf_codes
hf_pid_list = df_codes["SUBJECT_ID"].tolist() 

Admissions dataframe

In [12]:
# change to standard datetime format
df_add.ADMITTIME = pd.to_datetime(df_add.ADMITTIME)
df_add.DISCHTIME = pd.to_datetime(df_add.DISCHTIME)

# remove elective admissions- we only want urgent and emergency
df_adm = df_add.loc[df_add.ADMISSION_TYPE != 'ELECTIVE']

# sort by subject id and admittime
df_adm = df_add.sort_values(['SUBJECT_ID','ADMITTIME'])
df_adm = df_adm.reset_index(drop = True)

# add a column for next admit_time (readmissions) and readmission id
df_adm['NEXT_ADMITTIME'] = df_adm.groupby('SUBJECT_ID').ADMITTIME.shift(-1)
df_adm['NEXT_HADM_ID'] = df_adm.groupby('SUBJECT_ID').HADM_ID.shift(-1)
df_adm = df_adm.sort_values(['SUBJECT_ID','ADMITTIME'])

# Back fill. This will take a little while.
df_adm[['NEXT_ADMITTIME','NEXT_HADM_ID']] = df_adm.groupby(['SUBJECT_ID'])[['NEXT_ADMITTIME','NEXT_HADM_ID']].fillna(method = 'bfill')
df_adm['DAYS_TIL_NEXT_ADMIT'] = (df_adm.NEXT_ADMITTIME - df_adm.DISCHTIME).dt.total_seconds()/(24*60*60)

Clinical Notes dataframe

In [13]:
# Choosing only discharge summary clinical notes
df_notes_dis_sum = df_notes.loc[df_notes.CATEGORY == 'Discharge summary'] # 2083180 -> 59652; 

# changing type to ints so it aligns with the datatype of the other dataframes
df_notes_dis_sum['SUBJECT_ID'] = df_notes_dis_sum['SUBJECT_ID'].astype(int)
df_notes_dis_sum['HADM_ID'] = df_notes_dis_sum['HADM_ID'].astype(int)

# selecting the last discharge summary for each admission, if there are multiple
df_notes_dis_sum_last = (df_notes_dis_sum.groupby(['SUBJECT_ID','HADM_ID']).nth(-1)).reset_index()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_notes_dis_sum['SUBJECT_ID'] = df_notes_dis_sum['SUBJECT_ID'].astype(int)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_notes_dis_sum['HADM_ID'] = df_notes_dis_sum['HADM_ID'].astype(int)


Merging Clinical Notes, Admissions, and Diagnoses Codes

In [14]:
# first selecting admissions for subjects that have hf
df_hf_adm = df_adm.loc[df_adm.SUBJECT_ID.isin(hf_pid_list)] # now 58976 -> 51113 -> 45321 -> 14746 rows of data

# concatenating ICD-9 codes for patient admissions with multiple hf diagnoses
df_subj_concat_icd_codes = df_codes[['SUBJECT_ID', 'HADM_ID', 'ICD9_CODE']].copy()
df_subj_concat_icd_codes = df_subj_concat_icd_codes.groupby(['SUBJECT_ID', 'HADM_ID'])['ICD9_CODE'].agg(' '.join).reset_index() # # 651047 -> 21274 -> 14040 rows of data

# merge the admissions and icd9-codes tables to get admissions involving hf diagnoses
df_hf_admissions = pd.merge(df_adm[['SUBJECT_ID','HADM_ID','ADMITTIME','DISCHTIME','ADMISSION_TYPE','DEATHTIME', 'NEXT_ADMITTIME', 'NEXT_HADM_ID', 'DAYS_TIL_NEXT_ADMIT']],
                        df_subj_concat_icd_codes, 
                        on = ['SUBJECT_ID', 'HADM_ID'],
                        how = 'inner')
# merge the admissions+icd-9 codes table with the discharge sumaries
df_hf_adm_notes = pd.merge(df_hf_admissions[['SUBJECT_ID','HADM_ID', 'NEXT_ADMITTIME', 'NEXT_HADM_ID', 'DAYS_TIL_NEXT_ADMIT']], df_notes_dis_sum_last, 
                        on = ['SUBJECT_ID', 'HADM_ID'],
                        how = 'inner')

# finally, create output labels for 30-day readmission; 0 for no 30-day readmission, 1 for 30-day readmission
df_hf_adm_notes['OUTPUT_LABEL'] = (df_hf_adm_notes.DAYS_TIL_NEXT_ADMIT < 30).astype('int') # consists of ____ 30-day readmission rows, and ___ without 30-day readmission rows


df_hf_adm_notes['id'] = df_hf_adm_notes.index

# shuffle input
df_adm_notes_merged = df_hf_adm_notes.sample(n=len(df_hf_adm_notes), random_state=42)
df_adm_notes_merged = df_adm_notes_merged.reset_index(drop=True)

# finalized dataset
df_final = df_adm_notes_merged.copy(deep=False) # 13755 records



# Methodology explanation/General Pipeline

Setup
1. imported NOTEEVENTS.csv, ADMISSIONS.csv, DIAGNOSES_ICD.csv
2. Filtred DIAGNOSES_ICD table to only include rows of data with heart failure ICD-9 codes:'39891', '40201', '40211', '40291', '40401', '40403', '40411', '40413', '40491', '40493', '4280', '4281', '42820','42821', '42822', '42823', '42830', '42831', '42832', '42833', '42840', '42841', '42842', '42843','4289'
3. used ADMISSIONS.csv to obatin emergency addmissions, which have readmission within 30 days and have diagnosis as heart failure
4. from NOTEEVENTS.csv, the discharge summaries for the addmission obtained from the previous step were extracted <br>
5. Merged notes, admissions and diagonses to preprocess and generate embeddings

Preprocessing <br>
6. preprocessed discharge summaries to remove special chacters, extra spacing and lower case all text <br>
7. split dataset into train and test sets

CNN <br>
8. used keras library to convert text corpus of notes to sequence of integers and padded sequences to length of  [2]
9. generate embeddigns using pre-trained Word2vec embeddings <br>
    1. if word is found in the embeddings, add to embedding matrix <br>
10. used keras to run CNN achitecure, and evalute performance <br>
11. For ablation test, Glove is compared to  pre-tranined Word2Vec and max pooling layer is compared to global average pooling layer of CNN. 

RF <br>
12. Random forset model used to compare performance


# Preprocessing

Preprocessing text to remove any HTML tags, non-word characters, numbers; convert all text to lowercase; remove stopwords

In [18]:
import re
from nltk import word_tokenize
from nltk.corpus import stopwords
from string import punctuation

stop_words = set(stopwords.words('english'))

t0 = time()
# Create new column with processed text
df_final['text2'] = df_final['TEXT'].apply(preprocessor)
elapsed=time() - t0

print("Time taken for preprocessing: ", elapsed, "seconds.")


# Split into test and train data
df_test = df_final.sample(frac=0.2, random_state=42) # 2751
df_train = df_final.drop(df_test.index) # 1104

Time taken for preprocessing:  73.49632239341736 seconds.


In [19]:
df_test = df_final.sample(frac=0.1, random_state=42) # 2751
df_train = df_final.drop(df_test.index) # 1104

# CNN

Preprocessing

In [20]:
# vectorize text to input through Word2Vec
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import TextVectorization

# splitting data 
x_train, y_train, x_val, y_val, voc,  word_index = create_train_and_val_sets(df_train, df_test)

In [21]:
# import pretrained embeddings  
from gensim.models import KeyedVectors
from gensim.test.utils import datapath
import numpy as np  # Make sure that numpy is imported
from nltk.corpus import stopwords

#loading pre-trained BioWordVec
reloaded_word_vectors = KeyedVectors.load_word2vec_format(biowordvecpath, binary=True)

# remove stopwords from vectors since we removed them from our data
STOPWORDS_WORD2VEC = stopwords.words('english') 
keys_updated = [word for word in reloaded_word_vectors.key_to_index if word not in STOPWORDS_WORD2VEC]
index2word_set=set(keys_updated)

In [22]:
num_tokens = len(voc) + 2
embedding_dim = 200
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    if word in index2word_set: 
        embedding_matrix[i] =  reloaded_word_vectors[word]
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

Converted 19697 words (303 misses)


Training

In [23]:
embedding_layer = Embedding(
      num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,)
  
int_sequences_input = keras.Input(shape=(None,), dtype="int64")
embedded_sequences = embedding_layer(int_sequences_input)
x = layers.Conv1D(128, 5, activation="relu")(embedded_sequences)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
preds = layers.Dense(2, activation="softmax")(x)
model = keras.Model(int_sequences_input, preds)

In [25]:
from tensorflow.keras.optimizers import Adam
model.compile(
    loss="sparse_categorical_crossentropy", optimizer=Adam(lr=0.0001), metrics=["acc"]
)
model.fit(x_train, y_train, batch_size=128, epochs=15, validation_data=(x_val, y_val))



Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7f05804070d0>

Evaluation

In [26]:
evaluate_model(model, x_val, y_val)



Unnamed: 0,precision,recall,f1-score,support
0,0.920163,0.980222,0.949243,2528.0
1,0.137931,0.035874,0.05694,223.0
accuracy,0.903671,0.903671,0.903671,0.903671
macro avg,0.529047,0.508048,0.503091,2751.0
weighted avg,0.856755,0.903671,0.876912,2751.0


# Random Forest

In [27]:
# for processing
import re
import nltk
from sklearn import feature_extraction, model_selection, naive_bayes, pipeline, manifold, preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# get target values
y_train = df_train["OUTPUT_LABEL"].values
y_test = df_test["OUTPUT_LABEL"].values
X_train = df_train["text2"]
X_test = df_test["text2"]

t0 = time()

# transform training data into tf-idf vector - takes 1 minute to run
vectorizer = feature_extraction.text.TfidfVectorizer(max_features=10000, ngram_range=(1,2))
corpus = X_train # make sure using the right train data
vectorizer.fit(corpus)
X_train = vectorizer.transform(corpus)
x_test = vectorizer.transform(X_test)
dic_vocabulary = vectorizer.vocabulary_

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
predicted = rf.predict(x_test)


elapsed=time() - t0
print("Time taken for RF training: ", elapsed, "seconds.")

report = classification_report(y_test, predicted, output_dict=True)
df = pd.DataFrame(report).transpose()
df.to_latex()
df


Time taken for RF training:  117.63636875152588 seconds.


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  df.to_latex()


Unnamed: 0,precision,recall,f1-score,support
0,0.918939,1.0,0.957757,2528.0
1,0.0,0.0,0.0,223.0
accuracy,0.918939,0.918939,0.918939,0.918939
macro avg,0.459469,0.5,0.478879,2751.0
weighted avg,0.844448,0.918939,0.88012,2751.0


# CNN - with data balancing

In [28]:
# resplitting our test and train data so that train data isn't too small when we subsample
df_test = df_final.sample(frac=0.1, random_state=42) # 1376
df_train = df_final.drop(df_test.index) # 12379

In [29]:
print("Training Data set prevalence (n = {:d}):".format(len(df_train)), "{:.2f}%".format((df_train.OUTPUT_LABEL.sum()/len(df_train))*100))

Training Data set prevalence (n = 12379): 8.43%


In [30]:
# subsampling the negatives

print("Before subsampling: ")

print("Data set prevalence (n = {:d}):".format(len(df_final)), "{:.2f}%".format((df_final.OUTPUT_LABEL.sum()/len(df_final))*100))

print("Training Data set prevalence (n = {:d}):".format(len(df_train)), "{:.2f}%".format((df_train.OUTPUT_LABEL.sum()/len(df_train))*100))

print("Test Data set prevalence (n = {:d}):".format(len(df_test)), "{:.2f}%".format((df_test.OUTPUT_LABEL.sum()/len(df_test))*100))

df_train_balanced = subsample(df_train)

print("After subsampling: ")

print("Training Data set prevalence (n = {:d}):".format(len(df_train_balanced)), "{:.2f}%".format((df_train_balanced.OUTPUT_LABEL.sum()/len(df_train_balanced))*100))

Before subsampling: 
Data set prevalence (n = 13755): 8.32%
Training Data set prevalence (n = 12379): 8.43%
Test Data set prevalence (n = 1376): 7.27%
After subsampling: 
Training Data set prevalence (n = 2088): 50.00%


In [31]:
x_train, y_train, x_val, y_val, voc,  word_index = create_train_and_val_sets(df_train_balanced, df_test)

In [32]:
num_tokens = len(voc) + 2
embedding_dim = 200
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    if word in index2word_set: 
        embedding_matrix[i] =  reloaded_word_vectors[word]
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

Converted 18495 words (1505 misses)


In [33]:

from tensorflow.keras.optimizers import Adam

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
)

int_sequences_input = keras.Input(shape=(None,), dtype="int64")
embedded_sequences = embedding_layer(int_sequences_input)
x = layers.Conv1D(128, 5, activation="relu")(embedded_sequences)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
preds = layers.Dense(2, activation="softmax")(x)
model = keras.Model(int_sequences_input, preds)

model.compile(
    loss="sparse_categorical_crossentropy", optimizer=Adam(lr=0.0001), metrics=["acc"]
)
model.fit(x_train, y_train, batch_size=128, epochs=15, validation_data=(x_val, y_val))

evaluate_model(model, x_val, y_val)



Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


Unnamed: 0,precision,recall,f1-score,support
0,0.935857,0.697492,0.799282,1276.0
1,0.091765,0.39,0.148571,100.0
accuracy,0.675145,0.675145,0.675145,0.675145
macro avg,0.513811,0.543746,0.473926,1376.0
weighted avg,0.874513,0.675145,0.751992,1376.0


# Random Forest - with data balancing

In [34]:
# for processing
import re
import nltk
from sklearn import feature_extraction, model_selection, naive_bayes, pipeline, manifold, preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# get target values
y_train = df_train_balanced["OUTPUT_LABEL"].values
y_test = df_test["OUTPUT_LABEL"].values
X_train = df_train_balanced["text2"]
X_test = df_test["text2"]


# transform training data into tf-idf vector - takes 1 minute to run
vectorizer = feature_extraction.text.TfidfVectorizer(max_features=10000, ngram_range=(1,2))
corpus = X_train # make sure using the right train data
vectorizer.fit(corpus)
X_train = vectorizer.transform(corpus)
x_test = vectorizer.transform(X_test)
dic_vocabulary = vectorizer.vocabulary_

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
predicted = rf.predict(x_test)

report = classification_report(y_test, predicted, output_dict=True)
df = pd.DataFrame(report).transpose()
df.to_latex()
df


  df.to_latex()


Unnamed: 0,precision,recall,f1-score,support
0,0.958128,0.609718,0.745211,1276.0
1,0.117021,0.66,0.198795,100.0
accuracy,0.613372,0.613372,0.613372,0.613372
macro avg,0.537575,0.634859,0.472003,1376.0
weighted avg,0.897001,0.613372,0.7055,1376.0


# CNN - with data augmentation and balancing

In [35]:
print("Training Data set prevalence (n = {:d}):".format(len(df_train)), "{:.2f}%".format((df_train.OUTPUT_LABEL.sum()/len(df_train))*100)) # should have gone up from 8% to 11% after data aug

Training Data set prevalence (n = 12379): 8.43%


In [36]:

pos_rows = df_train.OUTPUT_LABEL == 1
df_pos = df_train.loc[pos_rows]
df_neg = df_train.loc[~pos_rows]

df_pos = df_pos.reset_index(drop=True)

df_pos['id2'] = df_pos.index

indexes = np.random.randint(0, df_pos.shape[0], 1000)

for num, i in enumerate(indexes):
    x = df_pos.loc[df_pos.id2==i]['text2']
    sample =  str(np.trim_zeros(x)).split()
    if str(i) in sample:
        sample.remove(str(i))

    modified = modify_sentence(sample)
    sentence_m = ' '.join([x for x in modified])

    df2 = pd.DataFrame({'text2': [sentence_m], 'OUTPUT_LABEL': [1]})
    df_pos = pd.concat([df2, df_pos], ignore_index = True)
    df_pos.reset_index()

df_train_aug = pd.concat([df_pos, df_neg], axis = 0)

In [37]:
print("Before subsampling: ")

print("Training Data set prevalence (n = {:d}):".format(len(df_train_aug)), "{:.2f}%".format((df_train_aug.OUTPUT_LABEL.sum()/len(df_train_aug))*100))

df_train_balanced_and_aug = subsample(df_train_aug)

print("After subsampling: ")

print("Training Data set prevalence (n = {:d}):".format(len(df_train_balanced_and_aug)), "{:.2f}%".format((df_train_balanced_and_aug.OUTPUT_LABEL.sum()/len(df_train_balanced_and_aug))*100))

Before subsampling: 
Training Data set prevalence (n = 13379): 15.28%
After subsampling: 
Training Data set prevalence (n = 4088): 50.00%


In [40]:
x_train, y_train, x_val, y_val, voc,  word_index = create_train_and_val_sets(df_train_balanced_and_aug, df_test)

In [41]:
num_tokens = len(voc) + 2
embedding_dim = 200
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    if word in index2word_set: 
        embedding_matrix[i] =  reloaded_word_vectors[word]
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

Converted 19221 words (779 misses)


In [42]:
embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
)

int_sequences_input = keras.Input(shape=(None,), dtype="int64")
embedded_sequences = embedding_layer(int_sequences_input)
x = layers.Conv1D(128, 5, activation="relu")(embedded_sequences)

x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)

x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)

x = layers.GlobalAveragePooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
preds = layers.Dense(2, activation="softmax")(x)
model = keras.Model(int_sequences_input, preds)
# model.summary()
model.compile(
    loss="sparse_categorical_crossentropy", optimizer=Adam(lr=0.0001), metrics=["acc"]
)
model.fit(x_train, y_train, batch_size=128, epochs=15, validation_data=(x_val, y_val))

evaluate_model(model, x_val, y_val)



Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


Unnamed: 0,precision,recall,f1-score,support
0,0.932261,0.862853,0.896215,1276.0
1,0.102564,0.2,0.135593,100.0
accuracy,0.81468,0.81468,0.81468,0.81468
macro avg,0.517412,0.531426,0.515904,1376.0
weighted avg,0.871963,0.81468,0.840937,1376.0


# Random Forest - with data augmentation and balancing

In [43]:
# # ## get target values
Y_train = df_train_balanced_and_aug["OUTPUT_LABEL"].values
Y_test = df_test["OUTPUT_LABEL"].values
X_train = df_train_balanced_and_aug["text2"]
X_test = df_test["text2"]


# transform training data into tf-idf vector - takes 1 minute to run
vectorizer = feature_extraction.text.TfidfVectorizer(max_features=10000, ngram_range=(1,2))
corpus = X_train # make sure using the right train data
vectorizer.fit(corpus)
X_train = vectorizer.transform(corpus)
X_test = vectorizer.transform(X_test)
dic_vocabulary = vectorizer.vocabulary_

rf = RandomForestClassifier()
rf.fit(X_train, Y_train)
predicted = rf.predict(X_test)

report = classification_report(Y_test, predicted, output_dict=True)
df = pd.DataFrame(report).transpose()
df.to_latex()
df

  df.to_latex()


Unnamed: 0,precision,recall,f1-score,support
0,0.930268,0.951411,0.940721,1276.0
1,0.126761,0.09,0.105263,100.0
accuracy,0.888808,0.888808,0.888808,0.888808
macro avg,0.528514,0.520705,0.522992,1376.0
weighted avg,0.871874,0.888808,0.880004,1376.0


# Ablation 1: CNN + Global average pooling vs Max pooling

In [44]:
# from keras.layers import GlobalAveragePooling1D
embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,
)

int_sequences_input = keras.Input(shape=(None,), dtype="int64")
embedded_sequences = embedding_layer(int_sequences_input)
x = layers.Conv1D(128, 5, activation="relu")(embedded_sequences)

x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)

x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)

x = layers.GlobalAveragePooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
preds = layers.Dense(2, activation="softmax")(x)
model = keras.Model(int_sequences_input, preds)
model.summary()

Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding_3 (Embedding)     (None, None, 200)         4000400   
                                                                 
 conv1d_9 (Conv1D)           (None, None, 128)         128128    
                                                                 
 max_pooling1d_6 (MaxPooling  (None, None, 128)        0         
 1D)                                                             
                                                                 
 conv1d_10 (Conv1D)          (None, None, 128)         82048     
                                                                 
 max_pooling1d_7 (MaxPooling  (None, None, 128)        0         
 1D)                                                       

In [46]:
x_train, y_train, x_val, y_val, voc,  word_index = create_train_and_val_sets(df_train_balanced_and_aug, df_test)

model.compile(
    loss="sparse_categorical_crossentropy", optimizer=Adam(lr=0.0001), metrics=["acc"]
)
model.fit(x_train, y_train, batch_size=128, epochs=15, validation_data=(x_val, y_val))

evaluate_model(model, x_val, y_val)



Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


Unnamed: 0,precision,recall,f1-score,support
0,0.930958,0.655172,0.769089,1276.0
1,0.079498,0.38,0.131488,100.0
accuracy,0.635174,0.635174,0.635174,0.635174
macro avg,0.505228,0.517586,0.450289,1376.0
weighted avg,0.869078,0.635174,0.722752,1376.0


# Ablation 2: Comparing GloVe to pre-trained Word2Vec

In [47]:
#loading glove model
embeddings_dict = {}

with open(glovepath, 'r', encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        embeddings_dict[word] = vector


In [48]:
num_tokens = len(voc) + 2
embedding_dim = 300
hits = 0
misses = 0

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    if word in embeddings_dict:
        embedding_matrix[i] =  embeddings_dict[word]
        hits += 1
    else:
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))

Converted 17296 words (2704 misses)


In [49]:
embedding_layer = Embedding(
      num_tokens,
    embedding_dim,
    embeddings_initializer=keras.initializers.Constant(embedding_matrix),
    trainable=False,)
  
int_sequences_input = keras.Input(shape=(None,), dtype="int64")
embedded_sequences = embedding_layer(int_sequences_input)
x = layers.Conv1D(128, 5, activation="relu")(embedded_sequences)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.MaxPooling1D(5)(x)
x = layers.Conv1D(128, 5, activation="relu")(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dense(128, activation="relu")(x)
x = layers.Dropout(0.5)(x)
preds = layers.Dense(2, activation="softmax")(x)
model = keras.Model(int_sequences_input, preds)

In [50]:
model.compile(
    loss="sparse_categorical_crossentropy", optimizer=Adam(lr=0.0001), metrics=["acc"]
)
model.fit(x_train, y_train, batch_size=128, epochs=15, validation_data=(x_val, y_val))

evaluate_model(model, x_val, y_val)



Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


Unnamed: 0,precision,recall,f1-score,support
0,0.927495,0.771944,0.842601,1276.0
1,0.073248,0.23,0.111111,100.0
accuracy,0.732558,0.732558,0.732558,0.732558
macro avg,0.500372,0.500972,0.476856,1376.0
weighted avg,0.865413,0.732558,0.78944,1376.0


# Referneces
[1] Xiong Liu, Yu Chen, Jay Bae, Hu Li, Joseph Johnston, and Todd Sanger. 2019. Predicting heart failure read-mission from clinical notes using deep learning.
[2] Franc ̧ ois Chollet. 2020. Using pre-trained word embeddings. 
[3] Johnson AE, Pollard TJ, Shen L, Lehman LW, Feng M, Ghassemi M, and Moody B. 2016b. MIMIC-III, a freely accessible critical care database - Scientific Data — nature.com.
