# Before you use this template

This is a recommended template for the project Draft/Final Report. It covers only the general type of research in our paper pool:

1. Define a predictive problem in healthcare.
2. Propose a data/feature engineering method to process (MIMIC III or other) datasets.
3. Propose a neural network and train it with supervised learning.
4. Compare the model with other deep learning models.
5. Mostly use Python as the coding language. *(No worries if your coding language is not Python: Colab is backed by a Linux server and can easily be set up to support many programming languages. There are even code templates; you just need to search online.)*

---

# FAQ and Important Notes
* Copy this template to your Google Drive and name your notebook with your team ID (upper-left corner). Don't edit this original file.
* This template covers most of the questions we want to ask about your reproduction experiment. You don't need to follow the template exactly, but you should address the questions. Feel free to customize your report accordingly.
* Every report must have runnable code and the necessary annotations (in text and code comments).
* The notebook is a demo and only uses small data (a subset of the original or processed data). The entire runtime of the notebook, including data reading, data processing, model training, printing, figure plotting, etc., must be within 8 minutes; otherwise, your grade may be penalized.
  * If the raw dataset is too large to load, select a subset of the data, pre-process it, upload the subset or processed data to Google Drive, and load it in this notebook.
  * If full training takes too long, set the number of training epochs to a small number, e.g., 3, just to show that the training is runnable.
  * For model validation, you can train the model outside this notebook in advance, then load the pretrained model and use it for validation (display the figures, print the metrics).
* Post-processing is important! Present your results with plots/figures. The code to summarize results and plot figures may be tedious, but it won't be a waste of time since these figures can be reused in your presentation. Figures plotted in code should have titles or captions where necessary (e.g., title your figure "Figure 1. xxxx").
* There is no page limit for your notebook report. You can also use separate notebooks for the report; just make sure your grader can access and run/test them.
* If you use outside resources, please cite them (in any format). Include links to the resources if necessary.
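
One way to keep an eye on the 8-minute limit is to time each stage of the notebook and compare the total against the budget. The sketch below is only an illustration: `StageTimer`, `RUNTIME_BUDGET_SECONDS`, and the stage names are hypothetical helpers, not part of the template, and the two stages contain stand-in work.

```python
import time
from contextlib import contextmanager

RUNTIME_BUDGET_SECONDS = 8 * 60  # the 8-minute limit stated above


class StageTimer:
    """Collects wall-clock time per named stage (hypothetical helper)."""

    def __init__(self):
        self.durations = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def total(self):
        return sum(self.durations.values())

    def within_budget(self, budget=RUNTIME_BUDGET_SECONDS):
        return self.total() <= budget


timer = StageTimer()
with timer.stage("data loading"):
    data = [i * i for i in range(1000)]  # stand-in for real work
with timer.stage("training"):
    total = sum(data)  # stand-in for real work

for name, seconds in timer.durations.items():
    print(f"{name}: {seconds:.4f} s")
print(f"total: {timer.total():.4f} s, within budget: {timer.within_budget()}")
```

Wrapping each major cell in `timer.stage(...)` makes it easier to see which step to shrink (fewer epochs, a smaller subset) if the total gets close to the limit.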

# Mount Notebook to Google Drive
Upload the data, pretrained model, figures, etc. to your Google Drive, then mount Google Drive in this notebook. After that, you can access the resources freely.

Instruction: https://colab.research.google.com/notebooks/io.ipynb

Example: https://colab.research.google.com/drive/1srw_HFWQ2SMgmWIawucXfusGzrj1_U0q

Video: https://www.youtube.com/watch?v=zc8g8lGcwQU

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# Introduction
This is an introduction to your report; edit this text/markdown section to compose it. In this text/markdown, you should introduce:

*   Background of the problem
  * what type of problem: disease/readmission/mortality prediction, feature engineering, data processing, etc.
  * what is the importance/meaning of solving the problem
  * what is the difficulty of the problem
  * the state-of-the-art methods and their effectiveness
*   Paper explanation
  * what the paper proposed
  * what the innovations of the method are
  * how well the proposed method works (on its own metrics)
  * what the contribution to the research field is (referring to the Background above, how important the paper is to the problem)


In [None]:
# code comments serve as inline annotations for your code

# Scope of Reproducibility:

List hypotheses from the paper you will test and the corresponding experiments you will run.


1.   Hypothesis 1: xxxxxxx
2.   Hypothesis 2: xxxxxxx

You can insert images in this notebook text, [see this link](https://stackoverflow.com/questions/50670920/how-to-insert-an-inline-image-in-google-colaboratory-from-google-drive) and example below:

![sample_image.png](https://drive.google.com/uc?export=view&id=1g2efvsRJDxTxKz-OY3loMhihrEUdBxbc)
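
The image link above follows the pattern `https://drive.google.com/uc?export=view&id=<file id>`. As a small convenience, the sketch below builds that URL and the matching Markdown tag from a file ID. The function names are hypothetical, `"FILE_ID"` is a placeholder, and the file is assumed to be shared so that your grader can view it.

```python
def drive_image_url(file_id):
    """Build a direct-view URL for a file shared on Google Drive."""
    return f"https://drive.google.com/uc?export=view&id={file_id}"


def drive_image_markdown(file_id, alt_text="image"):
    """Return a Markdown image tag pointing at the Drive file."""
    return f"![{alt_text}]({drive_image_url(file_id)})"


# "FILE_ID" is a placeholder; use the ID from your own share link
print(drive_image_markdown("FILE_ID", "sample_image.png"))
```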



You can also use code to display images, see the code below.

The images must be saved in Google Drive first.


In [None]:
import os
os.chdir('/content/drive/My Drive/multimodal-clinical-outcome')


# Methodology

The methodology is the core of your project. It consists of runnable code with the necessary annotations to show the experiments you executed to test the hypotheses.

The methodology contains at least two subsections, **data** and **model**.

In [3]:
# import the packages you need
import os
import time
from operator import itemgetter

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score
from torch.nn import functional as F
from torch.utils.data import DataLoader, TensorDataset, Dataset

# plotting helpers from plots.py (assumed to be importable from the working directory)
from plots import plot_learning_curves, plot_confusion_matrix

In [4]:
MODE = "BOTH"  # Options: 'BOTH', 'TRAIN', 'TEST'
TARGET_VARIABLES = ["los_3", "los_7"]  # Options: 'mort_hosp', 'mort_icu', 'los_3', 'los_7'
NUMBER_OF_WORKERS = 0
EPOCHS = 1
BATCH_SIZE = 32
LEARNING_RATE = 0.0001
SIGMOID_THRESHOLD = 0.5

DATASET_FILE_PATH = "../output"
PATH_OUTPUT = "../output/model/"
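
The training code later saves its best checkpoint into `PATH_OUTPUT`, and saving fails if that directory does not exist yet. A minimal sketch of creating the directory up front is shown here; it uses a temporary directory as a stand-in for the project root, so the paths are assumptions rather than the notebook's real layout.

```python
import os
import tempfile

# A temporary directory stands in for the project root in this sketch.
project_root = tempfile.mkdtemp()
path_output = os.path.join(project_root, "output", "model")

# exist_ok=True makes the call safe to repeat when the cell is re-run
os.makedirs(path_output, exist_ok=True)
os.makedirs(path_output, exist_ok=True)

print(os.path.isdir(path_output))
```

In the notebook itself, the same call could be applied to `PATH_OUTPUT` right after it is defined.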

##  Data
Data includes raw data (MIMIC III tables), descriptive statistics (our homework questions), and data processing (feature engineering).
  * Source of the data: where the data was collected from; if the data is synthetic or self-generated, explain how. If possible, please provide a link to the raw datasets.
  * Statistics: include basic descriptive statistics of the dataset such as size, cross-validation split, label distribution, etc.
  * Data processing: how you manipulate the data, e.g., changing the class labels, splitting the dataset into train/valid/test, refining the dataset.
  * Illustration: print results and plot figures for illustration.
  * You can upload your raw dataset to Google Drive and mount this Colab to the same directory. If your raw dataset is too large, you can upload the processed dataset and write code to load it.
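
The statistics bullet above can be covered with very little code. The sketch below reports the size and label distribution of each split. The labels are toy values made up for illustration, and `label_distribution` is a hypothetical helper, not part of the template.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: (count, fraction)} for a sequence of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}


# toy labels, made up for illustration only
splits = {
    "train": [0, 0, 0, 1, 0, 1, 0, 0],
    "validation": [0, 1],
    "test": [0, 0],
}

for name, labels in splits.items():
    summary = ", ".join(
        f"class {label}: {n} ({frac:.0%})"
        for label, (n, frac) in label_distribution(labels).items()
    )
    print(f"{name:<10} n={len(labels):<3} {summary}")
```

Reporting the distribution per split also shows at a glance whether a class is missing from the validation or test set, which would make some metrics undefined.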

In [5]:
DATASET_FILE_PATH = "../output"
PATH_OUTPUT = "../output/model/"

def read_data():
	train_X = pd.read_pickle(f"{DATASET_FILE_PATH}/train_features.pkl")
	train_Y = pd.read_pickle(f"{DATASET_FILE_PATH}/train_labels.pkl")

	validation_X = pd.read_pickle(f"{DATASET_FILE_PATH}/validation_features.pkl")
	validation_Y = pd.read_pickle(f"{DATASET_FILE_PATH}/validation_labels.pkl")

	test_X = pd.read_pickle(f"{DATASET_FILE_PATH}/test_features.pkl")
	test_Y = pd.read_pickle(f"{DATASET_FILE_PATH}/test_labels.pkl")

	patient_embed = pd.read_pickle(f"{DATASET_FILE_PATH}/patient_embeddings.pkl")
  

	return (train_X, train_Y), (validation_X, validation_Y), (test_X, test_Y), patient_embed

## Dataset and Collate function

In [6]:

class TimeSeriesDataset(Dataset):
	def __init__(self,seqs,embeddings,labels):

		if len(seqs) != len(labels):
			raise ValueError("Seqs and Labels have different lengths")
		self.labels = labels
		self.seqs = seqs  # tensor of shape (num_patients, 24, 104)
		self.embed = embeddings
	
	def __len__(self):
		return len(self.labels)
	
	def __getitem__(self,index):
		return self.embed[index],self.seqs[index], self.labels[index]

def collate_embeddings(batch):
	"""
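	'batch' is a list [(embed_1, labs_1, label_1), (embed_2, labs_2, label_2), ... , (embed_N, labs_N, label_N)]
	:returns
		(embed_tensor, labs_tensor), labels_tensor, sorted by descending embedding length:
		embed_tensor (FloatTensor) - 3D: batch size x max_length x num_features (100 = embedding dim), zero-padded
		labs_tensor (FloatTensor) - 3D: batch size x 24 x 104 time series
		labels_tensor (LongTensor) - 1D of batch size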
	"""
	max_len = max([seq[0].shape[0] for seq in batch])
	new_seqs=[]
	for seq in batch:
		labs = seq[1]
		label = seq[2]
		length_ = seq[0].shape[0]
		pad = np.zeros((max_len-seq[0].shape[0],seq[0].shape[1]))
		new_seq = np.concatenate((seq[0], pad), axis=0)

		new_seqs.append((new_seq,length_,labs,label))
		
	#sorted list for the batch		
	new_sorted_seqs = sorted(new_seqs, key=itemgetter(1),reverse=True)
	

	embed_tensor = torch.FloatTensor(np.array([seq[0] for seq in new_sorted_seqs]))
	labs_tensor = torch.stack([seq[2] for seq in new_sorted_seqs])
	labels_tensor = torch.LongTensor(np.array([seq[3] for seq in new_sorted_seqs]))

	return (embed_tensor,labs_tensor),labels_tensor
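
The collate function zero-pads every embedding matrix to the longest one in the batch and sorts the batch by descending original length. The NumPy sketch below shows the same idea on toy arrays. `pad_and_sort` is a hypothetical helper, and the toy feature size of 3 stands in for the 100-dimensional embeddings.

```python
import numpy as np


def pad_and_sort(sequences):
    """Zero-pad 2D arrays along axis 0 to a common length.

    Returns the padded batch (sorted by descending original length)
    and the original lengths in the same order.
    """
    max_len = max(seq.shape[0] for seq in sequences)
    ordered = sorted(sequences, key=lambda seq: seq.shape[0], reverse=True)
    lengths = [seq.shape[0] for seq in ordered]
    padded = np.stack([
        np.pad(seq, ((0, max_len - seq.shape[0]), (0, 0)))
        for seq in ordered
    ])
    return padded, lengths


# three toy sequences with 2, 4, and 1 time steps and 3 features each
toy = [np.ones((2, 3)), np.ones((4, 3)), np.ones((1, 3))]
batch, lengths = pad_and_sort(toy)
print(batch.shape)
print(lengths)
```

Keeping the original lengths alongside the padded batch is useful if a later step needs to ignore the padded rows.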


In [7]:


def dataframe_to_tensorDataset(train, validation, test, targetVariable,embeddings):

	patient_ids_with_embeddings = [id for id in embeddings.SUBJECT_ID]

	#print("# patients with ids:",len(patient_ids_with_embeddings))

	train_X, train_Y = train
	validation_X, validation_Y = validation
	test_X, test_Y = test

	# Only keeping patients with embeddings
	train_X = train_X[train_X.index.get_level_values("subject_id").isin(patient_ids_with_embeddings)]
	train_Y = train_Y[train_Y.index.get_level_values("subject_id").isin(patient_ids_with_embeddings)]

	validation_X = validation_X[validation_X.index.get_level_values("subject_id").isin(patient_ids_with_embeddings)]
	validation_Y = validation_Y[validation_Y.index.get_level_values("subject_id").isin(patient_ids_with_embeddings)]

	test_X = test_X[test_X.index.get_level_values("subject_id").isin(patient_ids_with_embeddings)]
	test_Y = test_Y[test_Y.index.get_level_values("subject_id").isin(patient_ids_with_embeddings)]

	train_embed = embeddings[embeddings['SUBJECT_ID'].isin(train_X.index.get_level_values("subject_id"))]
	validation_embed = embeddings[embeddings['SUBJECT_ID'].isin(validation_X.index.get_level_values("subject_id"))]
	test_embed = embeddings[embeddings['SUBJECT_ID'].isin(test_X.index.get_level_values("subject_id"))]
	
	# data_train = torch.tensor(train_X.values, dtype=torch.float32).view(-1, 24, 104)
	#print("train shape:",data_train.shape)
	#print("train index=",data_train[0].shape)
	#print(len(set(train_X.index.get_level_values("subject_id"))))

	train = TimeSeriesDataset(seqs = torch.tensor(train_X.values, dtype=torch.float32).view(-1, 24, 104),
							  embeddings = train_embed['word2vec'].tolist(),
								labels = torch.tensor(train_Y[targetVariable].values))
	validation = TimeSeriesDataset(seqs = torch.tensor(validation_X.values, dtype=torch.float32).view(-1, 24, 104),
							  embeddings= validation_embed['word2vec'].tolist(),
								labels = torch.tensor(validation_Y[targetVariable].values))
	test = TimeSeriesDataset(seqs = torch.tensor(test_X.values, dtype=torch.float32).view(-1, 24, 104),
							  embeddings= test_embed['word2vec'].tolist(),
								labels = torch.tensor(test_Y[targetVariable].values))


	# train = TensorDataset(torch.tensor(train_X.values, dtype=torch.float32).view(-1, 24, 104),
	#                       torch.tensor(train_Y[targetVariable].values))
	# validation = TensorDataset(torch.tensor(validation_X.values, dtype=torch.float32).view(-1, 24, 104),
	#                            torch.tensor(validation_Y[targetVariable].values))
	# test = TensorDataset(torch.tensor(test_X.values, dtype=torch.float32).view(-1, 24, 104),
	#                      torch.tensor(test_Y[targetVariable].values))

	return train, validation, test
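
The `.view(-1, 24, 104)` calls above only give a correct result if the feature table stores exactly 24 consecutive rows per patient and 104 columns. The NumPy sketch below illustrates that assumption with toy sizes. The variable names are hypothetical and `reshape` plays the role of `view`.

```python
import numpy as np

n_patients, n_steps, n_features = 3, 4, 2  # toy sizes (the notebook uses 24 and 104)

# flat table: one row per (patient, time step), patients stored contiguously
flat = np.arange(n_patients * n_steps * n_features, dtype=np.float32)
flat = flat.reshape(n_patients * n_steps, n_features)

# a ragged table would make the reshape silently mix patients or fail
assert flat.shape[0] % n_steps == 0, "every patient needs exactly n_steps rows"

cube = flat.reshape(-1, n_steps, n_features)
print(cube.shape)

# patient 1 owns rows 4..7 of the flat table
print(np.array_equal(cube[1], flat[n_steps:2 * n_steps]))
```

A check like the divisibility assertion is cheap to run on the real table before reshaping, since a missing row for one patient would shift every patient after it.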


##   Model
The model includes the model definition, which is usually a class, model training, and other necessary parts.
  * Model architecture: number/size/type of layers, activation function, etc.
  * Training objectives: loss function, optimizer, weight of each loss term, etc.
  * Others: whether the model is pretrained, Monte Carlo simulation for uncertainty analysis, etc.
  * The model code should include the model classes and the functions for model training, model validation, etc.
  * If your model training is done outside of this notebook, please upload the trained model here and write a function to load and test it.

In [8]:
# My Model
class ConvolutionalNERGRU(nn.Module):
	def __init__(self):
		super(ConvolutionalNERGRU, self).__init__()

		self.conv1 = nn.Conv1d(in_channels=100,out_channels=32,kernel_size=3,padding=2,stride=1)
		self.conv2 = nn.Conv1d(in_channels=32,out_channels=64,kernel_size=3,padding=2,stride=1)
		self.conv3 = nn.Conv1d(in_channels=64,out_channels=96,kernel_size=3,padding=2,stride=1)
		self.globalpooling= nn.AdaptiveMaxPool1d(1)
		self.flatten = nn.Flatten()
		self.dropout1= nn.Dropout(0.2)

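		# single-layer GRU: PyTorch ignores `dropout` (and warns) when num_layers=1
		self.gru = nn.GRU(104, 256, dropout=0.2, batch_first=True)
		# NOTE: despite its name, this attribute is a ReLU, not a sigmoid
		self.sigmoid = nn.ReLU()
		# 96 CNN features + 256 GRU features = 352 inputs, 2 output classes
		self.hiddenLayer = nn.Linear(352, 2)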

	def forward(self, input):
		embed,seqs = input
		out_gru, _ = self.gru(seqs)


		#cnn takes input of shape (batch_size, channels, seq_len)
		out_embed = embed.permute(0,2,1)
		out_embed = F.relu(self.conv1(out_embed))
		out_embed = F.relu(self.conv2(out_embed))
		out_embed = F.relu(self.conv3(out_embed))
		out_embed = self.globalpooling(out_embed)
		out_embed = self.flatten(out_embed)

		output = torch.concat((out_embed,out_gru[:,-1,:]),dim=1)

		output = self.hiddenLayer(output)
		output = self.sigmoid(output)

		return output
	
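def sigmoid_predict(output):
	# Predict class 1 when its score exceeds SIGMOID_THRESHOLD.
	# Vectorized so the result stays on the same device as `output`,
	# which avoids a CPU/GPU mismatch when it is compared with `target`.
	with torch.no_grad():
		return (output[:, 1] > SIGMOID_THRESHOLD).long()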
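
The layer sizes in the model can be checked by hand. Each convolution uses `kernel_size=3`, `padding=2`, and `stride=1`, so by the standard 1D convolution length formula it lengthens the sequence by 2, and the final linear layer takes 96 + 256 = 352 inputs. The sketch below walks through that arithmetic. The helper name and the input length of 10 are assumptions for illustration.

```python
def conv1d_out_len(length, kernel_size=3, padding=2, stride=1, dilation=1):
    """Output length of a 1D convolution (standard formula)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


length = 10  # hypothetical number of embedding rows for one patient
for layer in ("conv1", "conv2", "conv3"):
    length = conv1d_out_len(length)
    print(layer, length)

cnn_features = 96   # out_channels of conv3, kept after global max pooling
gru_features = 256  # hidden size of the GRU
print(cnn_features + gru_features)
```

Because global max pooling collapses the time axis to 1, the sequence length does not affect the size of the final linear layer; only the channel counts do.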

### Training Loop

In [9]:

#
# Citation: Jimeng Sun, (2024). CSE6250: Big Data Analytics in Healthcare Homework 4
# Used code from utils.py
#
class Metrics:
	def __init__(self):
		self.value = 0
		self.average = 0
		self.sum = 0
		self.count = 0

	def update(self, value, n=1):
		self.value = value
		self.sum += self.value * n
		self.count += n
		self.average = self.sum / self.count


def calculate_accuracy(predicted, target):
	with torch.no_grad():
		batchSize = target.size(0)
		correct = predicted.eq(target).sum().item()

		return (correct / batchSize) * 100.0


#
# Citation: Jimeng Sun, (2024). CSE6250: Big Data Analytics in Healthcare Homework 4
# Used code from utils.py and train_seizure.py
#
def train_model(model, device, dataLoader, criterion, optimizer, epoch):
	batchTime = Metrics()
	dataTime = Metrics()
	losses = Metrics()
	accuracy = Metrics()

	model.train()

	end = time.time()

	for i, (input, target) in enumerate(dataLoader):
		dataTime.update(time.time() - end)

		#input = input.to(device)
		if isinstance(input, tuple):
			input = tuple([e.to(device) if type(e) == torch.Tensor else e for e in input])
		else:
			input = input.to(device)
		target = target.to(device)

		optimizer.zero_grad()
		output = model(input)
		loss = criterion(output, target)

		loss.backward()
		optimizer.step()

		batchTime.update(time.time() - end)
		end = time.time()

		losses.update(loss.item(), target.size(0))
		accuracy.update(calculate_accuracy(sigmoid_predict(output), target), target.size(0))

		print(f'Epoch: [{epoch}][{i}/{len(dataLoader)}]\t'
		      f'Time {batchTime.value:.3f} ({batchTime.average:.3f})\t'
		      f'Data {dataTime.value:.3f} ({dataTime.average:.3f})\t'
		      f'Loss {losses.value:.4f} ({losses.average:.4f})\t'
		      f'Accuracy {accuracy.value:.3f} ({accuracy.average:.3f})')

	return losses.average, accuracy.average


#
# Citation: Jimeng Sun, (2024). CSE6250: Big Data Analytics in Healthcare Homework 4
# Used code from utils.py and train_seizure.py
#
def test_model(model, device, dataLoader, criterion):
	batchTime = Metrics()
	losses = Metrics()
	accuracy = Metrics()

	results = []

	model.eval()

	with torch.no_grad():
		end = time.time()

		for i, (input, target) in enumerate(dataLoader):
			#input = input.to(device)

			if isinstance(input, tuple):
				input = tuple([e.to(device) if type(e) == torch.Tensor else e for e in input])
			else:
				input = input.to(device)
			target = target.to(device)

			output = model(input)
			loss = criterion(output, target)

			batchTime.update(time.time() - end)
			end = time.time()

			losses.update(loss.item(), target.size(0))

			predicted = sigmoid_predict(output)
			accuracy.update(calculate_accuracy(predicted, target), target.size(0))

			true = target.detach().cpu().numpy().tolist()
			predicted = predicted.detach().cpu().numpy().tolist()
			results.extend(list(zip(true, predicted)))

			print(f'Test: [{i}/{len(dataLoader)}]\t'
			      f'Time {batchTime.value:.3f} ({batchTime.average:.3f})\t'
			      f'Loss {losses.value:.4f} ({losses.average:.4f})\t'
			      f'Accuracy {accuracy.value:.3f} ({accuracy.average:.3f})')

	return losses.average, accuracy.average, results
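
The `Metrics` class weights every update by the batch size, which matters when the last batch is smaller than the others. The sketch below mirrors that update rule in a hypothetical `RunningAverage` class and compares it with a plain mean over batches. The per-batch accuracies are made-up numbers.

```python
class RunningAverage:
    """Sample-weighted running mean, mirroring the update rule of Metrics."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value, n=1):
        self.total += value * n
        self.count += n

    @property
    def average(self):
        return self.total / self.count


# hypothetical per-batch accuracies: two full batches of 32 and a final batch of 4
batches = [(50.0, 32), (75.0, 32), (100.0, 4)]

weighted = RunningAverage()
for accuracy, batch_size in batches:
    weighted.update(accuracy, batch_size)

unweighted = sum(acc for acc, _ in batches) / len(batches)
print(f"weighted by batch size: {weighted.average:.2f}")
print(f"plain mean of batches:  {unweighted:.2f}")
```

The weighted value equals the accuracy over all samples, while the plain mean lets the small final batch count as much as a full one.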


### Model fitting

In [10]:
#
# Citation: Jimeng Sun, (2024). CSE6250: Big Data Analytics in Healthcare Homework 4
# Used code from train_seizure.py
#
def model_runner(train, validation, test, targetVariable):
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	# model = GRU()
	model = ConvolutionalNERGRU()
	model.to(device)

	criterion = nn.CrossEntropyLoss()
	#criterion = nn.BCEWithLogitsLoss()
	criterion.to(device)

	if MODE.upper() == "BOTH" or MODE.upper() == "TRAIN":
		trainLoader = DataLoader(train, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_embeddings,num_workers=NUMBER_OF_WORKERS)
		validationLoader = DataLoader(validation, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_embeddings,num_workers=NUMBER_OF_WORKERS)

		optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

		bestValidationAccuracy = 0.0
		trainLosses, trainAccuracies = [], []
		validationLosses, validationAccuracies = [], []

		for epoch in range(EPOCHS):
			trainLoss, trainAccuracy = train_model(model, device, trainLoader, criterion, optimizer, epoch)
			validationLoss, validationAccuracy, _ = test_model(model, device, validationLoader, criterion)

			trainLosses.append(trainLoss)
			validationLosses.append(validationLoss)

			trainAccuracies.append(trainAccuracy)
			validationAccuracies.append(validationAccuracy)

			if validationAccuracy > bestValidationAccuracy:
				bestValidationAccuracy = validationAccuracy
				torch.save(model, os.path.join(PATH_OUTPUT, f"{targetVariable}_GRU.pth"))

		plot_learning_curves(trainLosses, validationLosses, trainAccuracies, validationAccuracies,
							 f"{targetVariable.upper()} GRU")

	if MODE.upper() == "BOTH" or MODE.upper() == "TEST":
		testLoader = DataLoader(test, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_embeddings,num_workers=NUMBER_OF_WORKERS)

		bestModel = torch.load(os.path.join(PATH_OUTPUT, f"{targetVariable}_GRU.pth"))
		_, _, testResults = test_model(bestModel, device, testLoader, criterion)

		y_true, y_pred = zip(*testResults)

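		# NOTE: y_pred holds thresholded 0/1 predictions, so AUROC and AUPRC are
		# computed from hard labels here rather than from predicted probabilities.
		aurocScore = roc_auc_score(y_true, y_pred)
		auprcScore = average_precision_score(y_true, y_pred)
		f1Score = f1_score(y_true, y_pred)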

		print('\nFinal Test scores: \n'
			  f'{targetVariable} AUROC Score: {aurocScore}\n'
			  f'{targetVariable} AUPRC Score: {auprcScore}\n'
			  f'{targetVariable} F1 Score: {f1Score}')

		plot_confusion_matrix(y_true, y_pred, ["No", "Yes"], f"{targetVariable.upper()} GRU")
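
The test results are stored as `(true, predicted)` pairs, which is all that the confusion matrix and the F1 score need. As a cross-check on the library call, the sketch below computes both by hand. The pairs are toy data and the helper names are hypothetical.

```python
def binary_confusion(pairs):
    """Count (tp, fp, fn, tn) from (true, predicted) pairs of 0/1 labels."""
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 here when the denominator is 0."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# toy (true, predicted) pairs, made up for illustration
toy_results = [(1, 1), (1, 0), (0, 0), (0, 1), (1, 1), (0, 0)]
tp, fp, fn, tn = binary_confusion(toy_results)
print(tp, fp, fn, tn)
print(f1_from_counts(tp, fp, fn))
```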


In [12]:
trainSet, validationSet, testSet, embeddings = read_data()
for targetVariable in TARGET_VARIABLES:
	newTrainSet, newValidationSet, newTestSet = dataframe_to_tensorDataset(trainSet, validationSet, testSet,
																			   targetVariable,embeddings)
		
	model_runner(newTrainSet, newValidationSet, newTestSet, targetVariable)



Epoch: [0][0/20]	Time 0.251 (0.251)	Data 0.011 (0.011)	Loss 0.7196 (0.7196)	Accuracy 46.875 (46.875)
Epoch: [0][1/20]	Time 0.120 (0.186)	Data 0.005 (0.008)	Loss 0.6915 (0.7055)	Accuracy 53.125 (50.000)
Epoch: [0][2/20]	Time 0.152 (0.174)	Data 0.015 (0.010)	Loss 0.6989 (0.7033)	Accuracy 50.000 (50.000)
Epoch: [0][3/20]	Time 0.179 (0.175)	Data 0.004 (0.008)	Loss 0.7106 (0.7052)	Accuracy 71.875 (55.469)
Epoch: [0][4/20]	Time 0.057 (0.152)	Data 0.004 (0.008)	Loss 0.6768 (0.6995)	Accuracy 59.375 (56.250)
Epoch: [0][5/20]	Time 0.079 (0.140)	Data 0.002 (0.007)	Loss 0.6766 (0.6957)	Accuracy 65.625 (57.812)
Epoch: [0][6/20]	Time 0.101 (0.134)	Data 0.003 (0.006)	Loss 0.6930 (0.6953)	Accuracy 59.375 (58.036)
Epoch: [0][7/20]	Time 0.123 (0.133)	Data 0.003 (0.006)	Loss 0.6823 (0.6937)	Accuracy 68.750 (59.375)
Epoch: [0][8/20]	Time 0.076 (0.126)	Data 0.002 (0.005)	Loss 0.7328 (0.6980)	Accuracy 62.500 (59.722)
Epoch: [0][9/20]	Time 0.087 (0.123)	Data 0.003 (0.005)	Loss 0.7277 (0.7010)	Accuracy 46.875



Epoch: [0][0/20]	Time 0.081 (0.081)	Data 0.002 (0.002)	Loss 0.7591 (0.7591)	Accuracy 90.625 (90.625)
Epoch: [0][1/20]	Time 0.100 (0.090)	Data 0.009 (0.005)	Loss 0.7636 (0.7614)	Accuracy 84.375 (87.500)
Epoch: [0][2/20]	Time 0.121 (0.101)	Data 0.004 (0.005)	Loss 0.7025 (0.7417)	Accuracy 87.500 (87.500)
Epoch: [0][3/20]	Time 0.100 (0.100)	Data 0.003 (0.004)	Loss 0.7136 (0.7347)	Accuracy 78.125 (85.156)
Epoch: [0][4/20]	Time 0.098 (0.100)	Data 0.002 (0.004)	Loss 0.6669 (0.7211)	Accuracy 96.875 (87.500)
Epoch: [0][5/20]	Time 0.081 (0.097)	Data 0.002 (0.004)	Loss 0.6903 (0.7160)	Accuracy 96.875 (89.062)
Epoch: [0][6/20]	Time 0.080 (0.094)	Data 0.003 (0.003)	Loss 0.6701 (0.7094)	Accuracy 96.875 (90.179)
Epoch: [0][7/20]	Time 0.082 (0.093)	Data 0.002 (0.003)	Loss 0.6309 (0.6996)	Accuracy 93.750 (90.625)
Epoch: [0][8/20]	Time 0.099 (0.093)	Data 0.004 (0.003)	Loss 0.6594 (0.6951)	Accuracy 90.625 (90.625)
Epoch: [0][9/20]	Time 0.101 (0.094)	Data 0.004 (0.003)	Loss 0.5596 (0.6816)	Accuracy 100.00
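
The losses in the log above sit close to 0.69. For a two-class cross-entropy loss, that is roughly the value obtained when both classes receive equal scores (ln 2 ≈ 0.693), which suggests that a single epoch leaves the model near chance level in terms of loss. The sketch below is a small worked check of that number, written independently of the notebook's code; `cross_entropy` is a hypothetical helper for a single example.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one example: -log(softmax(logits)[target])."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[target]


print(cross_entropy([0.0, 0.0], 1))   # equal scores for both classes
print(math.log(2))
print(cross_entropy([0.0, 2.0], 1))   # confident and correct: lower loss
print(cross_entropy([0.0, 2.0], 0))   # confident and wrong: higher loss
```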

# Results
In this section, you should finish training your model or loading your trained model. That is a great experiment! Share the results with others, using the necessary metrics and figures.

Please test and report results for all experiments that you run with:

*   specific numbers (accuracy, AUC, RMSE, etc)
*   figures (loss shrinkage, outputs from GAN, annotation or label of sample pictures, etc)


In [None]:
# metrics to evaluate my model

# plot figures to better show the results

# it is better to save the numbers and figures for your presentation.
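
One simple way to keep the numbers for the presentation is to write them to a JSON file next to the figures. The sketch below shows the round trip. The function names, the file name, and the zero values are placeholders, not results from any experiment.

```python
import json
import os
import tempfile


def save_metrics(metrics, path):
    """Write a dict of metric name -> value to a JSON file."""
    with open(path, "w") as handle:
        json.dump(metrics, handle, indent=2, sort_keys=True)


def load_metrics(path):
    """Read the metrics dict back from a JSON file."""
    with open(path) as handle:
        return json.load(handle)


# placeholder values, not results from any experiment
example_metrics = {"target": "los_3", "auroc": 0.0, "auprc": 0.0, "f1": 0.0}

metrics_path = os.path.join(tempfile.mkdtemp(), "los_3_metrics.json")
save_metrics(example_metrics, metrics_path)
print(load_metrics(metrics_path) == example_metrics)
```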

## Model comparison

In [None]:
# compare your model with others
# you don't need to re-run all the other experiments; instead, you can directly cite the metrics/numbers reported in the paper
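
A side-by-side table is often the clearest way to present the comparison. The sketch below renders a Markdown table from a dictionary of metrics. `comparison_table` is a hypothetical helper, and the `None` entries are deliberate gaps to be filled with your own numbers and the numbers reported in the paper.

```python
def comparison_table(rows, metric_names):
    """Render {model name: {metric: value}} as a Markdown table string."""
    header = "| Model | " + " | ".join(metric_names) + " |"
    divider = "|---|" + "---|" * len(metric_names)
    lines = [header, divider]
    for model, metrics in rows.items():
        cells = []
        for name in metric_names:
            value = metrics.get(name)
            cells.append("n/a" if value is None else f"{value:.3f}")
        lines.append(f"| {model} | " + " | ".join(cells) + " |")
    return "\n".join(lines)


# None marks numbers still to be filled in from your runs or from the paper
rows = {
    "Our reproduction": {"AUROC": None, "AUPRC": None, "F1": None},
    "Reported in paper": {"AUROC": None, "AUPRC": None, "F1": None},
}
print(comparison_table(rows, ["AUROC", "AUPRC", "F1"]))
```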

# Discussion

In this section, you should discuss your work and make plans for the future. The discussion should address the following questions:
  * Assess whether the paper is reproducible or not.
  * If your results are negative, explain why the paper is not reproducible.
  * Describe “What was easy” and “What was difficult” during the reproduction.
  * Make suggestions to the authors or other reproducers on how to improve reproducibility.
  * Describe what you will do in the next phase.



In [None]:
# no code is required for this section
'''
if you want to use an image from outside this notebook for explanation,
you can read and plot it here, as in the Scope of Reproducibility section
'''

# References

1.   Sun, J, [paper title], [journal title], [year], [volume]:[issue], doi: [doi link to paper]



# Feel free to add new sections