# DL Model
This notebook defines, trains, and validates the DL model.

It first loads the data files generated by the pre-processing steps and builds the input and output tensors, which are moved to GPU memory (when a GPU is available) batch by batch. Then the models are defined, trained, and validated.

In [None]:
from gensim.models.keyedvectors import KeyedVectors
from gensim.parsing.preprocessing import preprocess_string
import random
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import os
import math
import csv
import pickle
import time
import bz2

In [None]:
# Set seed for random generators
seed = 24
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
os.environ["PYTHONHASHSEED"] = str(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

In [None]:
# Set the device type as cpu or cuda depending upon the execution environment.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

In [None]:
# Clone git repo containing all the pre-processed data files. THIS IS ONLY NEEDED IN CLOUD ENVIRONMENT.

!git clone https://github.com/manuv3/cs598-dl-project.git

In [None]:
DATA_DIR = './cs598-dl-project/data'

# USE BELOW FOR LOCAL TESTING
# DATA_DIR = '../data'

In [None]:
'''
Import the "codes" DataFrame, with HADM_ID (Hospitalization ID) as index and multi-hot encoding of 
ICD-9 codes (booleans) as columns. Its dimensions are 52691 rows × 6984 columns, so we have 6984 ICD-9 codes.
'''

codes = pd.read_pickle(DATA_DIR + '/diagnoses.pkl.gz')
codes

In [None]:
'''
Split the HADM_IDs (hospitalization IDs) into train-test in 90:10 ratio. Two lists are generated:
    - hadm_ids_train
    - hadm_ids_test
'''

hadm_ids_train, hadm_ids_test = train_test_split(codes.index.tolist(), test_size = 0.10, random_state=seed)
# Print the split sizes instead of the full list of IDs
print(len(hadm_ids_train), len(hadm_ids_test))

In [None]:
'''
Load the Doc2Vec embeddings of the discharge summary reports, generated during the pre-processing step. The data is in 
Gensim KeyedVectors format.
'''

dv = KeyedVectors.load(DATA_DIR + '/dv.kv')

In [None]:
'''
Load the Word2Vec embeddings of all the words in vocabulary of the whole corpus, generated during the 
pre-processing step. This data is in Gensim KeyedVector format.
'''

wv = KeyedVectors.load(DATA_DIR + '/wv.kv')

In [None]:
'''
Load the dictionary mapping each HADM_ID to its tokenized document. The tokens are the list of words 
in the corresponding discharge summary report.
'''

with bz2.open(DATA_DIR + '/tokens_map.pkl.bz2', 'rb') as handle:
  tokens_dict = pickle.load(handle)

In [None]:
# how many samples per batch to load
batch_size = 50

In [None]:
'''
Define the implementation of PyTorch Dataset. Each sample is looked up by HADM_ID and built on demand in the 
__getitem__() method, which returns:
    - x_d2v: the Doc2Vec embedding of the discharge summary
    - x_cnn: the Word2Vec vectors of the document tokens, zero-padded to a 700 x 100 matrix
    - y: the multi-hot vector of ICD-9 codes
The batches are moved to GPU memory later, in the training and evaluation loops.
'''

class DocumentsDataset(Dataset):
    def __init__(self, hadm_ids):
        super().__init__()
        self.hadm_ids = hadm_ids
    def __len__(self):
        return len(self.hadm_ids)
    def __getitem__(self, idx):
        hadm_id = self.hadm_ids[idx]
        tokens = tokens_dict[hadm_id]
        # Look up the word vectors of all tokens: shape (number of tokens, 100)
        word_vecs = torch.Tensor(wv[tokens])
        # Zero-pad to a fixed size; this assumes a document has at most 700 tokens
        x_cnn = torch.zeros((700, 100))
        x_cnn[0:word_vecs.shape[0]] = word_vecs
        x_d2v = torch.Tensor(dv[str(hadm_id)].tolist())
        y = torch.tensor(codes.loc[hadm_id].to_numpy())
        return x_d2v, x_cnn, y

In [None]:
# prepare dataloaders
train_loader = DataLoader(DocumentsDataset(hadm_ids_train), batch_size = batch_size, shuffle = True)
test_loader = DataLoader(DocumentsDataset(hadm_ids_test), batch_size = batch_size)

print("# of train batches:", len(train_loader))
print("# of val batches:", len(test_loader))
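
`DocumentsDataset.__getitem__` copies the word vectors of a document into a fixed 700 × 100 zero matrix, which only works if no document has more than 700 tokens. That limit is presumably enforced during pre-processing; this notebook does not check it. The sketch below illustrates the same pad-to-fixed-length idea in plain Python, with an explicit truncation guard added. `pad_or_truncate` is a hypothetical helper used for illustration only, and the toy sizes are made up.

```python
def pad_or_truncate(rows, max_len, dim):
    """Return exactly max_len rows of width dim: extra rows are cut, missing rows are zeros."""
    kept = [list(row) for row in rows[:max_len]]
    padding = [[0.0] * dim for _ in range(max_len - len(kept))]
    return kept + padding


# Toy document with 2 word vectors of dimension 3, padded to 4 rows.
short_doc = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
padded = pad_or_truncate(short_doc, max_len=4, dim=3)
print(len(padded), "rows; last row:", padded[-1])

# A document longer than max_len is truncated instead of raising an error.
long_doc = [[1.0, 1.0, 1.0]] * 6
print(len(pad_or_truncate(long_doc, max_len=4, dim=3)), "rows")
```

In the notebook itself, the equivalent guard would be to slice the token list to the first 700 entries before looking up the word vectors.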

## Create DL Model

The model is a DL network with two "logical" components:
- **Encoder** to generate document embeddings: This component generates an effective fixed-length embedding for a given discharge summary document. It consists of two "logical" sub-components:
    - D2V: As a pre-processing step, a Doc2Vec model is trained in an unsupervised way to learn document vectors of length `128`. This sub-component then fine-tunes that vector with a fully connected layer of `64` neurons, followed by a non-linear activation (ReLU in the code below). The fine-tuning layer is trained in a supervised way.
    - CNN: As a pre-processing step, a Word2Vec model is trained to build word vectors for the whole vocabulary of the corpus. For each document, the vectors of its words are stacked in order to represent the document, and this matrix is the input to the CNN sub-component. The sub-component comprises 3 single-layer multi-channel CNN models, one per kernel size (3, 4, and 5 words), with 64 output channels each. The CNN layer in each model is followed by a MaxPool layer that performs temporal pooling. The outputs of the three CNN models are concatenated to generate an output vector per document of size `192 (3 models * 64 channels each)`.

The output vectors from the two sub-components (D2V and CNN) are concatenated to produce the final vector for each document in the batch. This final vector has size `256 (64 from D2V + 192 from CNN)`.

- **Classifier** to perform multi-label classification of ICD-9 codes. This component consists of:
    - Dropout layer: The document vector generated by the encoder component is regularized by stochastically dropping different dimensions.
    - Fully connected layer with sigmoid activation: This layer generates the final output of size `6984` (the total number of ICD-9 codes). Each dimension (representing an ICD-9 code) is assigned a probability by the sigmoid activation.
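
The sizes quoted above follow from simple shape arithmetic. The sketch below is a plain-Python illustration that needs no PyTorch. The helper names `conv_output_len` and `encoder_output_size` are hypothetical; the constants (700 token rows per document, 64 channels, kernel sizes 3, 4, and 5, a 64-unit D2V layer) are taken from the description and from the model code in this notebook.

```python
# Shape arithmetic for the encoder (illustrative helpers, not part of the notebook).
MAX_TOKENS = 700          # rows of the padded word-vector matrix per document
OUT_CHANNELS = 64         # output channels of each CNN branch
KERNEL_SIZES = (3, 4, 5)  # one CNN branch per kernel size
D2V_HIDDEN = 64           # width of the D2V fine-tuning layer


def conv_output_len(seq_len, kernel_size):
    """Number of window positions for a stride-1 convolution without padding."""
    return seq_len - kernel_size + 1


def encoder_output_size(kernel_sizes=KERNEL_SIZES):
    """Width of the document vector: D2V output plus one pooled vector per CNN branch."""
    # Temporal max-pooling keeps a single value per channel in each branch.
    cnn_width = len(kernel_sizes) * OUT_CHANNELS
    return D2V_HIDDEN + cnn_width


for k in KERNEL_SIZES:
    print(f"kernel {k}: {conv_output_len(MAX_TOKENS, k)} positions, pooled to 1 per channel")
print("document vector size:", encoder_output_size())
```

The number of window positions per kernel size is also the height of the pooling window that each CNN branch needs in order to reduce its output to one value per channel.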


<img src="https://drive.google.com/uc?id=1oKffyBVaQxrDIqXc-ulP-AYSkBh_FnoO">


In [None]:
'''
Define the model class which represents the D2V sub-component. It uses a single fully connected layer to fine-tune 
the Doc2Vec embeddings. This representation is passed through a Dropout layer to perform regularization, and the output is the 
vector representing the input document.
'''

class D2V(nn.Module):
    def __init__(self, dropout = 0.20):
        super(D2V, self).__init__()
        self.fc = nn.Linear(128, 64)
        self.dropout = nn.Dropout(p = dropout)
    def forward(self, x):
        y = self.dropout(F.relu(self.fc(x)))
        return y

In [None]:
'''
Define the model class which is the building block of the CNN sub-component, containing a single layer of 
multi-channel CNN kernels, an activation function, and a max-pooling layer.
'''

class CNN(nn.Module):
    def __init__(self, kernel_size, in_channels = 1, out_channels = 64):
        super(CNN, self).__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, (kernel_size, 100))
        self.pool = nn.MaxPool2d((700 - kernel_size + 1,1))
    def forward(self, x):
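        # (batch, channels, 1, 1) -> (batch, channels); unlike squeeze(), this keeps the batch dimension for a batch of size 1
        y = self.pool(F.relu(self.conv(x))).flatten(start_dim = 1)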
        return y

In [None]:
'''
Define the model class for the CNN sub-component. This component concatenates the outputs of 3 CNN components (described 
previously), each corresponding to a different kernel size (3, 4, and 5 words). Its input is the "concatenated" Word2Vec 
embeddings of all words within a document, which it passes to the three CNN components in parallel, and then combines 
their outputs. This representation is passed through a Dropout layer to perform regularization, and the output is the 
vector representing the input document.
'''

class CNN_COMBINED(nn.Module):
    def __init__(self, dropout = 0.20):
        super(CNN_COMBINED, self).__init__()
        self.conv_3 = CNN(3)
        self.conv_4 = CNN(4)
        self.conv_5 = CNN(5)
        self.dropout = nn.Dropout(p = dropout)
    def forward(self, x):
        x = x.unsqueeze(1)
        return self.dropout(torch.cat((self.conv_3(x), self.conv_4(x), self.conv_5(x)), dim = 1))

In [None]:
'''
Define the class for the main model, which uses both the D2V and CNN_COMBINED components in the Encoder layer.
'''

class Net(nn.Module):
    def __init__(self, dropout_d2v, dropout_cnn):
        super(Net, self).__init__()
        self.d2v = D2V(dropout = dropout_d2v)
        self.cnn = CNN_COMBINED(dropout = dropout_cnn)
        self.fc2 = nn.Linear(256, 6984)

    def forward(self, x_d2v, x_cnn):
        y_d2v = self.d2v(x_d2v)
        y_cnn = self.cnn(x_cnn)
        y = torch.sigmoid(self.fc2(torch.cat((y_d2v, y_cnn), dim = 1)))
        return y

In [None]:
'''
Define the class for the cnn-only ablation model, which uses only the CNN_COMBINED component in the Encoder layer. 
'''

class Net_CNN(nn.Module):
    def __init__(self, dropout):
        super(Net_CNN, self).__init__()
        self.cnn = CNN_COMBINED(dropout)
        self.fc2 = nn.Linear(192, 6984)

    def forward(self, x):
        return torch.sigmoid(self.fc2(self.cnn(x)))

In [None]:
'''
Define the class for the d2v-only ablation model, which uses only the D2V component in the Encoder layer. 
'''

class Net_D2V(nn.Module):
    def __init__(self, dropout):
        super(Net_D2V, self).__init__()
        self.d2v = D2V(dropout = dropout)
        self.fc = nn.Linear(64, 6984)

    def forward(self, x):
        return torch.sigmoid(self.fc(self.d2v(x)))

## CNN + Attention Model (not part of original paper)

This model studies the impact of attention over a plain CNN, to test how effective it is to use attention instead of max-pooling over the CNN hidden states.

The architecture of this model consists of a single-layer multi-channel 1D CNN, an attention mechanism, and finally the classification component, consisting of a fully connected layer with a sigmoid classifier.

The intuition behind this is that the originally proposed model uses convolution with max-pooling, which ignores the contribution of every group of tokens except the one with the maximum value. We may be able to do better by adding up the weighted contributions of all groups of tokens (generated by the convolution).
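
The difference between the two pooling strategies can be seen on a single convolution channel. The sketch below is a plain-Python illustration under made-up numbers: `features` and `scores` are hypothetical values, and the helpers `softmax`, `max_pool`, and `attention_pool` are defined here for illustration only. In `Net_CNN_Attention`, the scores are computed per ICD-9 code from the product of the attention weights with the convolution outputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def max_pool(features):
    """Keep only the strongest window; all other windows are ignored."""
    return max(features)


def attention_pool(features, scores):
    """Softmax-weighted sum: every window contributes in proportion to its weight."""
    weights = softmax(scores)
    return sum(w * f for w, f in zip(weights, features))


# Hypothetical activations of one convolution channel over four token windows.
features = [0.2, 0.9, 0.8, 0.1]
# Hypothetical attention scores of one ICD-9 code over the same windows.
scores = [0.0, 2.0, 2.0, 0.0]

print("max-pool:", max_pool(features))
print("attention:", round(attention_pool(features, scores), 3))
```

Max-pooling returns only the activation of the second window, while the attention-weighted sum also reflects the third window, which is almost as strong and receives the same weight.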

<img src = "https://drive.google.com/uc?id=1pF_ZXTsraff0yXY5pYzww5FMnEtbeilk" />

In [None]:
'''
Define the class for the cnn + attention model.
'''

class Net_CNN_Attention(nn.Module):
    def __init__(self, kernel_size, dropout):
        super(Net_CNN_Attention, self).__init__()
        self.conv = nn.Conv1d(100, 50, kernel_size)
        self.att = nn.Linear(50, 6984, bias = False)
        self.dropout = nn.Dropout(p = dropout)
        self.fc = nn.Linear(50, 1)
    def forward(self, x):
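        # (batch, 700, 100) -> (batch, 100, 700): Conv1d expects the channel dimension first
        x = x.permute((0, 2, 1))
        # hidden state of each window of kernel_size words: (batch, 50, 700 - kernel_size + 1)
        y = torch.relu(self.conv(x))
        # apply attention: one weight distribution over the windows per ICD-9 code
        alpha = F.softmax(self.att.weight.matmul(y), dim=2)
        # compute output by matrix multiplication of attention matrix with internal state: (batch, 6984, 50)
        y = alpha.matmul(y.permute(0, 2, 1))
        y = self.dropout(y)
        # one probability per code; squeeze only the last dimension so a batch of size 1 keeps its batch dimension
        y = torch.sigmoid(self.fc(y)).squeeze(-1)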
        return y

In [None]:
# Model Evaluation

from sklearn.metrics import f1_score, precision_score, recall_score


def classification_metrics(Y_pred, Y_true):
    """
    Calculate performance metrics using scikit-learn.
    
    Arguments:
        Y_pred: Array of 0/1 labels for the test set, as predicted by the model.
        Y_true: Array of true 0/1 labels for the test set.
        
    Outputs:
        precision: overall micro-averaged precision score
        recall: overall micro-averaged recall score
        f1: overall micro-averaged f1 score
        
    REFERENCE: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics
    """
    
    precision = precision_score(Y_true, Y_pred, average = 'micro')
    recall = recall_score(Y_true, Y_pred, average = 'micro')
    f1score = f1_score(Y_true, Y_pred, average = 'micro')
    return precision, recall, f1score


def evaluate(model, loader, threshold, only_cnn = False, only_d2v = False):
    """
    Evaluate the model.
    
    Arguments:
        model: Trained model of type nn.Module
        loader: Test DataLoader
        threshold: Probability above which an ICD-9 code is predicted as present
        only_cnn: True if the model takes only the CNN (Word2Vec) input
        only_d2v: True if the model takes only the Doc2Vec input
        
    Outputs:
        precision: overall micro-averaged precision score
        recall: overall micro-averaged recall score
        f1: overall micro-averaged f1 score
    """
    
    model.eval()
    all_y_true = []
    all_y_pred = []
    # Gradients are not needed during evaluation
    with torch.no_grad():
        for x_d2v, x_cnn, y in loader:
            if only_d2v:
                y_hat = model(x_d2v.to(device))
            elif only_cnn:
                y_hat = model(x_cnn.to(device))
            else:
                y_hat = model(x_d2v.to(device), x_cnn.to(device))
            # Binarize the predicted probabilities with the classification threshold
            y_pred = (y_hat > threshold).long().to('cpu')
            all_y_true.append(y.long())
            all_y_pred.append(y_pred)
    all_y_true = torch.cat(all_y_true, dim=0)
    all_y_pred = torch.cat(all_y_pred, dim=0)
        
    precision, recall, f1 = classification_metrics(all_y_pred.numpy(), all_y_true.numpy())
    print(f"precision: {precision:.3f}, recall: {recall:.3f}, f1: {f1:.3f}")
    return precision, recall, f1

In [None]:
def train_and_evaluate(model, name, train_loader, max_epochs, threshold, only_cnn = False, only_d2v = False, checkpoint = False):
  """
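  Train the model and evaluate it on the test set every second epoch.

  Arguments:
      model: Model of type nn.Module to train
      name: Prefix of the checkpoint file names
      train_loader: Training dataloader of type torch.utils.data.DataLoader
      max_epochs: Number of training epochs
      threshold: Classification threshold used during evaluation
      only_cnn: True if the model takes only the CNN (Word2Vec) input
      only_d2v: True if the model takes only the Doc2Vec input
      checkpoint: If True, save the model state every second epoch

  Note: this function uses the global variables criterion, optimizer, and test_loader.
      
  Outputs:
      No output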
  """
  
  epoch = 0
  while epoch < max_epochs:
    # prep model for training
    model.train()
    train_loss = 0
    for x_d2v, x_cnn, y in train_loader:
        if not only_cnn:
          x_d2v = x_d2v.to(device)
        if not only_d2v:
          x_cnn = x_cnn.to(device)
        y = y.float().to(device)
        optimizer.zero_grad()
        if only_d2v:
          y_hat = model(x_d2v)
        elif only_cnn:
          y_hat = model(x_cnn)
        else:
          y_hat = model(x_d2v, x_cnn)
        loss = criterion(y_hat, y)
        loss.backward()
        optimizer.step()
    print('Epoch: {} done!'.format(epoch))
    if epoch % 2 == 0:
        if checkpoint:
            save_checkpoint(model, name + '-' + str(epoch) + '.pth')
        # Evaluate the model
        evaluate(model, test_loader, threshold, only_cnn, only_d2v)
    epoch += 1

In [None]:
# Save model internal state as intermediate checkpoint

# Mount the google drive at 'drive' directory in the colab virtual machine.

from google.colab import drive
drive.mount('drive')

# Define variable to point to the project directory in google drive.
CHECKPOINT_DIR = 'drive/My Drive/cs598-dl/checkpoints/'

# For Local testing
# CHECKPOINT_DIR = '../checkpoints/'

def save_checkpoint(model, name):
    torch.save(model.state_dict(), CHECKPOINT_DIR + name)
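
`save_checkpoint` stores only the model's `state_dict`, so restoring a checkpoint means rebuilding the same architecture first and then loading the saved parameters into it. The notebook does not include a loading step; the sketch below shows one way it could look, using a small `nn.Linear` as a stand-in model and a temporary directory with a hypothetical file name instead of `CHECKPOINT_DIR`.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in model; in the notebook this would be Net, Net_CNN, Net_D2V, or Net_CNN_Attention.
model = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example-0.pth")  # hypothetical file name
    torch.save(model.state_dict(), path)           # same call as in save_checkpoint

    # Rebuild the architecture, then load the saved parameters into it.
    restored = nn.Linear(4, 2)
    state = torch.load(path, map_location="cpu")   # map_location lets a GPU checkpoint load on CPU
    restored.load_state_dict(state)

restored.eval()  # switch layers such as dropout to inference mode before evaluating
print(torch.equal(model.weight, restored.weight))
```

For the models in this notebook, the restored instance would have to be constructed with the same constructor arguments (for example the same kernel size) as the one that was saved.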

## Train and evaluate main model

### Hyperparameters

- MAX_EPOCHS
- CLASSIFICATION_THRESHOLD
- DROPOUT_D2V
- DROPOUT_CNN
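
`CLASSIFICATION_THRESHOLD` is the probability above which a code counts as predicted, so it trades precision against recall. The sketch below is a plain-Python illustration on made-up probabilities (not real model output); `binarize` and `micro_scores` are hypothetical helpers that mirror what `evaluate` and the micro-averaged scikit-learn scores compute.

```python
def binarize(probabilities, threshold):
    """Turn per-code probabilities into 0/1 predictions (strictly above the threshold)."""
    return [[1 if p > threshold else 0 for p in row] for row in probabilities]


def micro_scores(y_true, y_pred):
    """Micro-averaged precision, recall, and F1: counts are pooled over all documents and codes."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += 1 if (t == 1 and p == 1) else 0
            fp += 1 if (t == 0 and p == 1) else 0
            fn += 1 if (t == 1 and p == 0) else 0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up probabilities for two documents and four codes.
probs = [[0.90, 0.25, 0.05, 0.40],
         [0.15, 0.60, 0.22, 0.02]]
y_true = [[1, 0, 0, 1],
          [0, 1, 1, 0]]

for threshold in (0.20, 0.50):
    p, r, f1 = micro_scores(y_true, binarize(probs, threshold))
    print(f"threshold {threshold:.2f}: precision {p:.2f}, recall {r:.2f}, f1 {f1:.2f}")
```

In this toy example the lower threshold finds every true code at the cost of one false positive, while the higher threshold makes no false predictions but misses half of the true codes.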

In [None]:
'''
Train and evaluate main model
'''
# Hyperparameters
MAX_EPOCHS = 20
CLASSIFICATION_THRESHOLD = 0.20
DROPOUT_D2V = 0.30
DROPOUT_CNN = 0.20

# Define the model
main_model = Net(DROPOUT_D2V, DROPOUT_CNN)

# Move to GPU
main_model.to(device)

# Define the loss function and optimizer for back-propagation.
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(main_model.parameters(), lr=0.001)

# Train and evaluate
sta = time.time()
train_and_evaluate(model = main_model, 
                   name = 'main_model', 
                   train_loader = train_loader, 
                   max_epochs = MAX_EPOCHS, 
                   threshold = CLASSIFICATION_THRESHOLD, 
                   checkpoint = False)
end = time.time()
print('Time taken in training and validating main model:' + str(end - sta))

## Train and evaluate cnn-only model (ablation)

### Hyperparameters

- MAX_EPOCHS
- CLASSIFICATION_THRESHOLD
- DROPOUT_CNN

In [None]:
'''
Train and evaluate cnn-only model (ablation)
'''

# Hyperparameters
MAX_EPOCHS = 15
CLASSIFICATION_THRESHOLD = 0.20
DROPOUT_CNN = 0.20

# Define the model
cnn_only_model = Net_CNN(DROPOUT_CNN)

# Move to GPU
cnn_only_model.to(device)

# Define the loss function and optimizer for back-propagation.
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(cnn_only_model.parameters(), lr=0.001)

# Train and evaluate
sta = time.time()
train_and_evaluate(model = cnn_only_model, 
                   name = 'cnn_only_model', 
                   train_loader = train_loader, 
                   max_epochs = MAX_EPOCHS, 
                   threshold = CLASSIFICATION_THRESHOLD, 
                   only_cnn = True, 
                   checkpoint = False)
end = time.time()
print('Time taken in training and validating cnn-only model:' + str(end - sta))

## Train and evaluate d2v-only model (ablation)

### Hyperparameters

- MAX_EPOCHS
- CLASSIFICATION_THRESHOLD
- DROPOUT_D2V

In [None]:
'''
Train and evaluate d2v-only model (ablation)
'''

# Hyperparameters
MAX_EPOCHS = 15
CLASSIFICATION_THRESHOLD = 0.18
DROPOUT_D2V = 0.10

# Define the model
d2v_only_model = Net_D2V(DROPOUT_D2V)

# Move to GPU
d2v_only_model.to(device)

# Define the loss function and optimizer for back-propagation.
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(d2v_only_model.parameters(), lr=0.001)

# Train and evaluate
sta = time.time()
train_and_evaluate(model = d2v_only_model, 
                   name = 'd2v_only_model', 
                   train_loader = train_loader, 
                   max_epochs = MAX_EPOCHS, 
                   threshold = CLASSIFICATION_THRESHOLD, 
                   only_d2v = True, 
                   checkpoint = False)
end = time.time()
print('Time taken in training and validating d2v-only model:' + str(end - sta))

## Train and evaluate cnn + attention model (additional model)

### Hyperparameters

- MAX_EPOCHS
- CLASSIFICATION_THRESHOLD
- DROPOUT_CNN
- KERNEL_SIZE

In [None]:
'''
Train and evaluate cnn + attention model (additional model)
'''

# Hyperparameters
MAX_EPOCHS = 20
CLASSIFICATION_THRESHOLD = 0.30
DROPOUT_CNN = 0.10
KERNEL_SIZE = 5

# Define the model
cnn_attention_model = Net_CNN_Attention(kernel_size = KERNEL_SIZE, dropout = DROPOUT_CNN)

# Move to GPU
cnn_attention_model.to(device)

# Define the loss function and optimizer for back-propagation.
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(cnn_attention_model.parameters(), lr=0.001)

# Train and evaluate
sta = time.time()
train_and_evaluate(model = cnn_attention_model, 
                   name = 'cnn_attention_model', 
                   train_loader = train_loader, 
                   max_epochs = MAX_EPOCHS, 
                   threshold = CLASSIFICATION_THRESHOLD, 
                   only_cnn = True, 
                   checkpoint = False)
end = time.time()
print('Time taken in training and validating cnn-with-attention model:' + str(end - sta))