# Put your Google Colab link here:
*your link here*

## Checklist
Before beginning, please make sure you are connected to a GPU runtime (go to Runtime > Change runtime type > select T4 GPU).

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!unzip "/content/drive/Shareddrives/ECE477 datasets/Assignment12/scout.zip" -d "./"

In [None]:
!pip uninstall -y transformers
!pip install --no-cache-dir -e /content/scout/transformers

In [None]:
#verify installation
import sys
sys.path.append('/content/scout/transformers/src/')
import transformers
print(transformers.__file__)
#Expected output: /content/scout/transformers/src/transformers/__init__.py
#In case of an error, disconnect and delete the runtime (see Runtime > Disconnect and delete runtime)

For this assignment we are going to run a randomized trial using SCouT. Let's start by understanding the problem scenario first.

# Problem Description
The Childhood Asthma Management Program (CAMP) (https://pmc.ncbi.nlm.nih.gov/articles/PMC3546823/) was a clinical trial carried out in children with asthma. The trial was designed to determine the long-term effects of three treatments (budesonide, nedocromil, or placebo) on pulmonary function, as measured by normalized FEV1, over a 5-6.5 year period. CAMP was a multicenter, masked, placebo-controlled, randomized trial. A total of 1,041 children (311 in the budesonide group, 312 in the nedocromil group, and 418 in the placebo group) aged 5-12 years were enrolled between December 1993 and September 1995. The primary outcome of the trial was lung function as measured by the Forced Expiratory Volume in 1 second (FEV1).

The trial’s placebo arm contains anonymized longitudinal data for 275 patients, with over 20 spirometry measurements per patient. For each donor,
we use 16 different respiratory physiological signals. We are particularly interested in the pre-bronchodilator ratio of Forced Expiratory Volume to Forced Vital Capacity (PreFF). The PreFF ratio is a vital metric of
lung function in asthma patients that measures how much air an individual can exhale during a forced breath
before using a bronchodilator. Here, we model the control arm of the RCT and predict the PreFF of a
target patient using the other placebo patients as donors. In other words, we answer the counterfactual question: "what would have happened to the target unit had there been no intervention?"

Let's begin by building our dataset first:

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import os
import glob
from datetime import datetime


df = pd.read_csv('/content/scout/datasets/asthma/camp_teach.csv')
#keep only the control group (TG == 'C'), i.e. the placebo arm
df = df[df['TG'] == 'C']

li_final = []
for idx in df.id.unique():
    df_idx = df[df['id'] == idx]
    dic_list = []
    for date in df.visitc.unique():
        if date not in df_idx.visitc.unique():
            dic = {'id': idx, 'visitc': date}
            dic_list.append(dic)

    rows = pd.DataFrame.from_dict(dic_list)
    df_idx_ = pd.concat([df_idx, rows], ignore_index=True, sort=True)
    li_final.append(df_idx_)

df = pd.concat(li_final)

#extract continuous and discrete features
continuous_features = ['PREFF','age_rz', 'hemog', 'PREFEV', 'PREFVC',  'PREPF', 'POSFEV', 'POSFVC',
                      'POSFF', 'POSPF', 'PREFEVPP', 'PREFVCPP', 'POSFEVPP', 'POSFVCPP', 'wbc', 'agehome']
units = ['id']
time = ['visitc'] #choose months
df = df[units + time + continuous_features] #unit, time, continuous features


scaler = MinMaxScaler()
scaler.fit(df.loc[:, df.columns.isin(continuous_features)])
features = scaler.transform(df.loc[:, df.columns.isin(continuous_features)])
df.loc[:, df.columns.isin(continuous_features)] = features

n_units = len(df.id.unique()) #number of units
n_time = len(df.visitc.unique()) #number of time steps
n_units, n_time

print('Number of placebo units:', n_units)
print('Number of visits:', n_time)
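
The loop in the cell above pads each patient with empty rows for the visits they missed, so that every patient ends up with the same number of rows. As a minimal sketch of the same idea on toy data, the panel can also be balanced by reindexing on every (id, visit) pair. The names `toy`, `full_index`, and `balanced` are illustrative and are not used elsewhere in the notebook; this is an alternative, not the notebook's own method.

```python
import numpy as np
import pandas as pd

# Toy long-format panel: patient 1 missed visit 2, patient 2 missed visit 0.
toy = pd.DataFrame({
    'id':     [1, 1, 2, 2],
    'visitc': [0, 1, 1, 2],
    'PREFF':  [0.80, 0.82, 0.75, 0.78],
})

# Build every (id, visit) combination and reindex on it;
# combinations that were absent become rows filled with NaN.
full_index = pd.MultiIndex.from_product(
    [toy['id'].unique(), np.sort(toy['visitc'].unique())],
    names=['id', 'visitc'],
)
balanced = (
    toy.set_index(['id', 'visitc'])
       .reindex(full_index)
       .reset_index()
)

print(balanced)
```

After balancing, each patient has one row per visit, which is what allows the later reshape into a (units, time, features) array.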

In [None]:
#let's visualise the data
df.head()

The data above contains missing values, shown as 'NaN'. This is common in real-world medical data: patients may skip certain tests at a visit, or the data may simply be missing or corrupted.

The SCouT framework deals with missing values by applying a low-rank transformation (https://en.wikipedia.org/wiki/Low-rank_approximation) to the spatiotemporal matrix. To prepare for that, we replace missing values with zero and create a mask that marks which entries were missing.
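
To build some intuition for what a low-rank approximation does, here is a minimal sketch using a truncated SVD on a small synthetic matrix with zero-filled missing entries. It is an illustration only: the toy data, the rank `k`, and the use of `np.linalg.svd` are assumptions, and SCouT's internal low-rank step may be implemented differently.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy rank-2 matrix (8 units x 6 time steps) standing in for a single feature.
true = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))

# Hide roughly 20% of the entries; True marks a missing value.
toy_mask = rng.random(true.shape) < 0.2
observed = np.where(toy_mask, 0.0, true)  # zero-fill the missing entries

# Truncated SVD: keep only the top-k singular values and vectors.
k = 2
U, s, Vt = np.linalg.svd(observed, full_matrices=False)
low_rank = (U[:, :k] * s[:k]) @ Vt[:k, :]

print('Rank of approximation:', np.linalg.matrix_rank(low_rank))
print('Mean abs error on missing entries:',
      np.abs(low_rank - true)[toy_mask].mean())
```

The approximation keeps only the dominant structure shared across units and time steps, so the zero-filled entries are replaced by values consistent with that structure rather than being left at zero.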

In [None]:
# [unit, visit, feature value]
df = df.sort_values(by=['id', 'visitc'])
mask_df = np.array(df.isna())
df = df.fillna(0)

arr = df.values.reshape(n_units, n_time, df.shape[1])[:, :, 2:] #remove id and visits
mask = mask_df.reshape(n_units, n_time, df.shape[1])[:, :, 2:]  #remove id and months

print('Data shape:', arr.shape)
print('Mask shape:', mask.shape)

np.save('/content/scout/datasets/asthma/data.npy', arr)
np.save('/content/scout/datasets/asthma/mask.npy', mask)
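
The reshape above only works because the rows are sorted by unit and then by visit, and because every unit has the same number of rows. The following sketch repeats the same steps on a toy panel so the resulting layout can be inspected by hand. The names `toy`, `toy_arr`, `toy_mask`, `toy_units`, and `toy_time` are illustrative and do not overwrite anything used in the notebook.

```python
import numpy as np
import pandas as pd

# Toy balanced panel: 2 units x 3 visits, one feature, rows deliberately shuffled.
toy = pd.DataFrame({
    'id':     [2, 1, 1, 2, 1, 2],
    'visitc': [0, 2, 0, 1, 1, 2],
    'PREFF':  [0.5, np.nan, 0.1, 0.6, 0.2, 0.7],
})
toy = toy.sort_values(by=['id', 'visitc'])

toy_mask = np.array(toy.isna())   # True where a value is missing
toy_arr = toy.fillna(0).values    # zero-fill the missing values

toy_units = toy['id'].nunique()
toy_time = toy['visitc'].nunique()

# Reshape to (units, time, columns) and drop the id and visit columns.
toy_arr = toy_arr.reshape(toy_units, toy_time, toy.shape[1])[:, :, 2:]
toy_mask = toy_mask.reshape(toy_units, toy_time, toy.shape[1])[:, :, 2:]

print(toy_arr[0, :, 0])    # first unit across visits 0, 1, 2
print(toy_mask[0, :, 0])   # True marks the zero-filled entry
```

Without the sort, rows from different units would be interleaved and the first axis of the reshaped array would no longer correspond to a single patient.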

# SCouT Library Demonstration

For this assignment, the SCouT library, along with all its trainer and data-loading functionality, has been provided to you.

Let's do a simple demonstration. First, we will explore pretraining SCouT on a sample target id and pre-intervention length. Then we will finetune SCouT on the target and predict the counterfactual.

In [None]:

from matplotlib import pyplot as plt
import sys
sys.path.append('/content/scout/')
import torch
from dsc.dsc_model import DSCModel
from models.bert2bert import Bert2BertSynCtrl
from transformers import BertConfig
import random

random_seed = 42
device = torch.device('cuda:0' if torch.cuda.is_available() else "cpu")
op_path = '/content/scout/logs/'
datapath = '/content/scout/datasets/asthma/'
random.seed(random_seed)
np.random.seed(random_seed)
torch.manual_seed(random_seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(random_seed)
    torch.cuda.manual_seed_all(random_seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

config = {
    'feature_dim': 16,
    'cont_dim' : 16,
    'discrete_dim': 0,
    'hidden_size' : 128,
    'n_layers' : 4,
    'n_heads' : 1,
    'K' : 274,
    'pre_int_len': 2,
    'post_int_len': 2,
    'seq_range' : 275,
    'time_range' : 20,
    'batch_size' : 16,
    'lr': '1e-4',
    'weight_decay' : '1e-4',
    'warmup_steps' : 10000
}

In [None]:

#let's initialize the model with the default params
config_model = BertConfig(hidden_size = config['hidden_size'],
                        num_hidden_layers = config['n_layers'],
                        num_attention_heads = config['n_heads'],
                        intermediate_size = 4*config['hidden_size'],
                        vocab_size = 0,
                        max_position_embeddings = 0,
                        output_hidden_states = True,
                        )
config_model.add_syn_ctrl_config(K=config['K'],
                                pre_int_len=config['pre_int_len'],
                                post_int_len=config['post_int_len'],
                                feature_dim=config['feature_dim'],
                                time_range=config['time_range'],
                                seq_range=config['seq_range'],
                                cont_dim=config['cont_dim'],
                                discrete_dim=config['discrete_dim'],
                                classes = None)
model = Bert2BertSynCtrl(config_model, random_seed)
model = model.to(device)


In [None]:
'''
DSCModel is the main SCouT class; it wraps data loading, training, and model inference in a single interface.
The assignment tasks involve modifying how DSCModel is instantiated.
The main parameter you will need to change is:
target_id: index of the target unit (starting from 0)
'''
target_id = 0
interv_time = 13
dscmodel = DSCModel(model = model,
                    config = config,
                    op_dir = op_path,
                    target_id = target_id,
                    interv_time = interv_time,
                    random_seed = random_seed,
                    datapath = datapath,
                    device = device,
                    topk = None,
                    weights = None,
                    lowrank = True,
                    classes=None)

In [None]:
#let's pretrain SCouT first
dscmodel.pretrain(num_iters=3e3)

In [None]:
#let's finetune SCouT on the target pre-intervention period
dscmodel.finetune(num_iters=3e3)

In [None]:
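'''
We calculate the prediction error as the RMSE between the predicted and observed
PreFF (feature 0) of the target unit over the post-intervention period,
with missing observations zeroed out by the mask.
'''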
mask = np.load(datapath+'mask.npy')
data = np.load(datapath+'data.npy')
#model prediction
pred = dscmodel.predict()[interv_time:]
ctrl  = data[target_id,interv_time:,0]
dsc_rmse = np.sqrt(np.mean(((pred -ctrl)*(1-mask[target_id,interv_time:,0]))**2))
print(dsc_rmse)
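
Note that the formula above zeroes out the error at missing time steps but still divides by the total number of post-intervention steps. As a hedged alternative, the sketch below averages only over the observed entries. The helper `masked_rmse` and the toy arrays are hypothetical; they are not part of the SCouT library or the assignment's required metric.

```python
import numpy as np

def masked_rmse(pred, target, missing):
    """RMSE over observed entries only; `missing` is True where the target is absent."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    observed = ~np.asarray(missing, dtype=bool)
    if not observed.any():
        return float('nan')  # nothing to score
    return float(np.sqrt(np.mean((pred[observed] - target[observed]) ** 2)))

# Toy example: the third target is missing (zero-filled) and is excluded from the score.
toy_pred = np.array([0.50, 0.60, 0.70, 0.80])
toy_true = np.array([0.40, 0.60, 0.00, 0.90])
toy_missing = np.array([False, False, True, False])

print(masked_rmse(toy_pred, toy_true, toy_missing))
```

The two versions agree when nothing is missing. When some targets are missing, dividing by all time steps gives a smaller number than averaging over the observed steps alone, so errors are only comparable across target units if the same convention is used throughout.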

# Task: Analysing the Effect of Pre-Training

SCouT relies on unsupervised pre-training to learn effective transformer embeddings. Analyse the effect of pre-training by training SCouT with and without the pre-training step. Perform this ablation over three different target units. Basic starter code has been provided to you.

## 1. Training with / without the pre-training step (10 pts)
The previous code cells show how to use the SCouT library.

In [None]:
#Fill in the code (10 points)
data = np.load(datapath+'data.npy')
mask = np.load(datapath+'mask.npy')

errors_with_pretraining = []
errors_without_pretraining = []
interv_time = 10
for ablation_id in range(3):
  # With pretraining
  """ TO DO """

  errors_with_pretraining.append(dsc_rmse_with_pretraining)
  print(f"ID {ablation_id}: RMSE with pretraining = {dsc_rmse_with_pretraining}")

  # Without pretraining
  """ TO DO """

  errors_without_pretraining.append(dsc_rmse_without_pretraining)
  print(f"ID {ablation_id}: RMSE without pretraining = {dsc_rmse_without_pretraining}")


## 2. Plot errors (6 pts)

In [None]:
# plot and compare the errors along with error bars (6 points)
# Please average the error over three target ids and plot the std deviation error bars
import matplotlib.pyplot as plt

# compute std and mean
""" TO DO """
std_with_pretraining =
std_without_pretraining =
print(f"Std with pretraining = {std_with_pretraining}")
print(f"Std without pretraining = {std_without_pretraining}")

mean_with_pretraining =
mean_without_pretraining =
print(f"Mean with pretraining = {mean_with_pretraining}")
print(f"Mean without pretraining = {mean_without_pretraining}")

# plot the std error bars, with / without pretraining
# Set labels for better visualization
""" TO DO """


## 3. Briefly explain the trend you observe and why it might occur (4 pts).

Your answer here