<a href="https://colab.research.google.com/github/kathariemer/DiscreteEventSimulation_PopulationModel/blob/main/NLP_Project_milestone_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Milestone 2: Deep Learning

### Data Fetching and Preprocessing

In [1]:
import os

if not os.path.isdir("./data"):
    os.mkdir("./data")

import urllib.request

u = urllib.request.URLopener()
u.retrieve(
    "https://raw.githubusercontent.com/CrowdTruth/Medical-Relation-Extraction/master/ground_truth_cause.csv",
    "data/ground_truth_cause.csv",
)
u.retrieve(
    "https://raw.githubusercontent.com/CrowdTruth/Medical-Relation-Extraction/master/ground_truth_treat.csv",
    "data/ground_truth_treat.csv",
)

('data/ground_truth_treat.csv', <http.client.HTTPMessage at 0x7fe19f4daa90>)

We repeat the same preprocessing steps as in milestone 1, i.e. we replace the terms in the sentences with the strings "term1" and "term2" respectively, and compute a numeric label for the cause/treat relationship by taking the expert label wherever available and in the other cases taking the crowd labels.

In [2]:
import pandas as pd
import numpy as np
import re

def get_term(sentence, start, end):
    """ helper function, which returns the entire term that should be replaced """
    match = re.search("[^\w]", sentence[end:])
    true_end = end + match.start() if match else end
    return(sentence[start: true_end])

def replace_terms(df):
  for i in range(0,len(cause)): #change the terms term1 and term2 in each sentence to "term1" and "term2"
    row = df.iloc[i]
    sentence = row["sentence"]
    term1 = get_term(sentence, row.b1, row.e1)
    term2 = get_term(sentence, row.b2, row.e2)
    sentence = sentence.replace(term1,'term1').replace(term2,'term2 ')
    df.at[i, 'sentence'] = sentence


def extract_labels(df):
    expert= df.expert

    crowd = df.crowd
    label = 0 #default label if no other label

    if expert == 1:
        label = 1 
    elif pd.isnull(expert) and crowd > 0:
        label = 1  
    
    return label

Here we apply the preprocessing to the cause-dataset:

In [3]:
#load cause data
cause = pd.read_csv("data/ground_truth_cause.csv")
replace_terms(cause)

cause["label"] = cause.apply(extract_labels, axis=1) 
cause[["SID", "sentence", "label"]].head()


Unnamed: 0,SID,sentence,label
0,100003,"The limited data suggest that, in children wit...",0
1,100039,term1 are associated with difficult behaviors ...,0
2,100079,The term term1 is employed to indicate ataxia ...,1
3,100086,Non hereditary causes of cerebellar degenerati...,1
4,100145,The disorder can present with a migratory ture...,0


And here we do the same for the treat dataset:

In [4]:
#load treat data
treat = pd.read_csv("data/ground_truth_treat.csv")
replace_terms(treat)
treat["label"] = treat.apply(extract_labels, axis=1)
treat[["SID", "sentence", "label"]].head()

Unnamed: 0,SID,sentence,label
0,100003,"The limited data suggest that, in children wit...",0
1,100039,term1 are associated with difficult behaviors ...,0
2,100079,The term term1 is employed to indicate ataxia ...,0
3,100086,Non hereditary causes of cerebellar degenerati...,0
4,100145,The disorder can present with a migratory ture...,0


## Split the datasets into training and validation sets

As in the practical lecture we will reserve split the data 70% training dat and 30% validation data.

In [5]:
import torch

SEED = 1234
TEST_SIZE=0.3

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [6]:
from sklearn.model_selection import train_test_split as split

In [7]:
tr_cause, val_cause = split(cause, test_size=TEST_SIZE, random_state=SEED)

In [8]:
tr_treat, val_treat = split(treat, test_size=TEST_SIZE, random_state=SEED)

## Vector encoding

Next we will use the same steps presented in the lecture to convert the text into one-hot encoded vectors.

First we will map the text - after some additionally preprocessing using *nltk* - to count vectors, using *CountVectorizer*:

In [9]:
from sklearn.feature_extraction.text import CountVectorizer

import nltk

nltk.download("punkt")
nltk.download("wordnet")

from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize


class LemmaTokenizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()

    def __call__(self, articles):
        return [self.wnl.lemmatize(t) for t in word_tokenize(articles)]


def prepare_vectorizer(tr_data):
    vectorizer = CountVectorizer(
        max_features=3000, tokenizer=LemmaTokenizer(), stop_words="english"
    )

    word_to_ix = vectorizer.fit(tr_data.sentence)

    return word_to_ix

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Now let us map the text in the *cause* data:

In [10]:
cause_words_to_ix = prepare_vectorizer(tr_cause)
CAUSE_VOC_SIZE = len(cause_words_to_ix.vocabulary_)
assert CAUSE_VOC_SIZE == 3000

  "The parameter 'token_pattern' will not be used"
  % sorted(inconsistent)


As well as the text in the *treat* data:

In [11]:
treat_words_to_ix = prepare_vectorizer(tr_treat)
TREAT_VOC_SIZE = len(treat_words_to_ix.vocabulary_)
assert TREAT_VOC_SIZE == 3000

  "The parameter 'token_pattern' will not be used"
  % sorted(inconsistent)


As advised in the lecture, next we make sure that all the arrays are on the same device:

In [12]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In the next step we will map the data with the help of the respective count vectors to one-hot encoded vectors:

In [13]:
def prepare_dataloader(tr_data, val_data, word_to_ix):
    tr_data_vecs = torch.FloatTensor(word_to_ix.transform(tr_data.sentence).toarray()).to(
        device
    )
    tr_labels = torch.LongTensor(tr_data.label.tolist()).to(device)

    val_data_vecs = torch.FloatTensor(
        word_to_ix.transform(val_data.sentence).toarray()
    ).to(device)
    val_labels = torch.LongTensor(val_data.label.tolist()).to(device)

    tr_data_loader = [(sample, label) for sample, label in zip(tr_data_vecs, tr_labels)]
    val_data_loader = [
        (sample, label) for sample, label in zip(val_data_vecs, val_labels)
    ]

    return tr_data_loader, val_data_loader

Let us apply this to our *cause* data:

In [14]:
cause_tr_loader, cause_val_loader = prepare_dataloader(tr_cause, val_cause, cause_words_to_ix)

As well as the *treat* data:

In [15]:
treat_tr_loader, treat_val_loader = prepare_dataloader(tr_treat, val_treat, treat_words_to_ix)

Then we initialize *DataLoader* objects for both lists:

In [16]:
from torch.utils.data import DataLoader

def create_dataloader_iterators(tr_data_loader, val_data_loader, BATCH_SIZE):
    train_iterator = DataLoader(
        tr_data_loader,
        batch_size=BATCH_SIZE,
        shuffle=True,
    )

    valid_iterator = DataLoader(
        val_data_loader,
        batch_size=BATCH_SIZE,
        shuffle=False,
    )

    return train_iterator, valid_iterator

We set the batch size and initialize the *DataLoader* object for the vectors, we computed from the *cause* and *treat* data:

In [17]:
# Batch size may be modified
BATCH_SIZE = 64

In [18]:
cause_train_iterator, cause_valid_iterator = create_dataloader_iterators(
    cause_tr_loader, cause_val_loader, BATCH_SIZE
)
assert type(cause_train_iterator) == torch.utils.data.dataloader.DataLoader

In [19]:
treat_train_iterator, treat_valid_iterator = create_dataloader_iterators(
    treat_tr_loader, treat_val_loader, BATCH_SIZE
)
assert type(treat_train_iterator) == torch.utils.data.dataloader.DataLoader

# Building a neural network

In [20]:
from torch import nn

class BoWDeepClassifier(nn.Module):
    def __init__(self, num_labels, vocab_size, hidden_size):
        super(BoWDeepClassifier, self).__init__()
        self.linear = nn.Linear(vocab_size, num_labels)

        """ 
        Below is the extension of the neural network to more layers
        but it didn't quite work
        """
        # First linear layer
        #self.linear1 = nn.Linear(vocab_size, hidden_size)
        ## Non-linear activation function between them
        #self.relu = torch.nn.ReLU()
        ## Second layer
        # self.linear2 = nn.Linear(hidden_size, num_labels)

    def forward(self, bow_vec, sequence_lens):
        output = self.linear(bow_vec)
        
        """ 
        Below is the extension of the neural network to more layers
        but it didn't quite work
        """
        ## Run the input vector through every layer
        #output = self.linear1(bow_vec)
        # output = self.relu(output)
        # output = self.linear2(output)

        # Get the probabilities
        return F.log_softmax(output, dim=1)

In [21]:
# Size of intermediate representation between linear layers
HIDDEN_SIZE = 200
# We have only 2 output classes
OUTPUT_DIM = 2
LEARNING_RATE = 0.001

In [22]:
CAUSE_INPUT_DIM = CAUSE_VOC_SIZE
cause_model = BoWDeepClassifier(OUTPUT_DIM, CAUSE_INPUT_DIM, HIDDEN_SIZE)

In [23]:
TREAT_INPUT_DIM = TREAT_VOC_SIZE
treat_model = BoWDeepClassifier(OUTPUT_DIM, TREAT_INPUT_DIM, HIDDEN_SIZE)

In [24]:
import torch.optim as optim

In [25]:
tr_cause.groupby("label").size()

label
0    1794
1     994
dtype: int64

In [26]:
cause_optimizer = optim.Adam(cause_model.parameters(), lr=LEARNING_RATE)

# Handling class imbalance
cause_weights = torch.Tensor([1, 2])
cause_criterion = nn.NLLLoss(weight=cause_weights)

cause_model = cause_model.to(device)
cause_criterion = cause_criterion.to(device)

assert cause_model.linear.out_features == 2

In [27]:
tr_treat.groupby("label").size()

label
0    1793
1     995
dtype: int64

In [28]:
treat_optimizer = optim.Adam(treat_model.parameters(), lr=LEARNING_RATE)

# Handling class imbalance
treat_weights = torch.Tensor([1, 2])
treat_criterion = nn.NLLLoss(weight=treat_weights)

treat_model = treat_model.to(device)
treat_criterion = treat_criterion.to(device)

assert treat_model.linear.out_features == 2

# Training and evaluating the models

In [29]:
from sklearn.metrics import precision_recall_fscore_support
import torch.nn.functional as F
import time


def calculate_performance(preds, y):
    """
    Returns precision, recall, fscore per batch
    """
    # Get the predicted label from the probabilities
    rounded_preds = preds.argmax(1)

    # Calculate the correct predictions batch-wise and calculate precision, recall, and fscore
    # WARNING: Tensors here could be on the GPU, so make sure to copy everything to CPU
    precision, recall, fscore, support = precision_recall_fscore_support(
        rounded_preds.cpu(), y.cpu()
    )

    return precision[1], recall[1], fscore[1]

def train(model, iterator, optimizer, criterion):
    # We will calculate loss and accuracy epoch-wise based on average batch accuracy
    epoch_loss = 0
    epoch_prec = 0
    epoch_recall = 0
    epoch_fscore = 0

    # You always need to set your model to training mode
    # If you don't set your model to training mode the error won't propagate back to the weights
    model.train()

    # We calculate the error on batches so the iterator will return matrices with shape [BATCH_SIZE, VOCAB_SIZE]
    for batch in iterator:
        text_vecs = batch[0]
        labels = batch[1]
        sen_lens = []
        texts = []

        # This is for later!
        if len(batch) > 2:
            sen_lens = batch[2]
            texts = batch[3]

        # We reset the gradients from the last step, so the loss will be calculated correctly (and not added together)
        optimizer.zero_grad()

        # This runs the forward function on your model (you don't need to call it directly)
        predictions = model(text_vecs, sen_lens)

        # Calculate the loss and the accuracy on the predictions (the predictions are log probabilities, remember!)
        loss = criterion(predictions, labels)

        prec, recall, fscore = calculate_performance(predictions, labels)

        # Propagate the error back on the model (this means changing the initial weights in your model)
        # Calculate gradients on parameters that requries grad
        loss.backward()
        # Update the parameters
        optimizer.step()

        # We add batch-wise loss to the epoch-wise loss
        epoch_loss += loss.item()
        # We also do the same with the scores
        epoch_prec += prec.item()
        epoch_recall += recall.item()
        epoch_fscore += fscore.item()
    return (
        epoch_loss / len(iterator),
        epoch_prec / len(iterator),
        epoch_recall / len(iterator),
        epoch_fscore / len(iterator),
    )

def evaluate(model, iterator, criterion):

    epoch_loss = 0
    epoch_prec = 0
    epoch_recall = 0
    epoch_fscore = 0
    # On the validation dataset we don't want training so we need to set the model on evaluation mode
    model.eval()

    # Also tell Pytorch to not propagate any error backwards in the model or calculate gradients
    # This is needed when you only want to make predictions and use your model in inference mode!
    with torch.no_grad():

        # The remaining part is the same with the difference of not using the optimizer to backpropagation
        for batch in iterator:
            text_vecs = batch[0]
            labels = batch[1]
            sen_lens = []
            texts = []

            if len(batch) > 2:
                sen_lens = batch[2]
                texts = batch[3]

            predictions = model(text_vecs, sen_lens)
            loss = criterion(predictions, labels)

            prec, recall, fscore = calculate_performance(predictions, labels)

            epoch_loss += loss.item()
            epoch_prec += prec.item()
            epoch_recall += recall.item()
            epoch_fscore += fscore.item()

    # Return averaged loss on the whole epoch!
    return (
        epoch_loss / len(iterator),
        epoch_prec / len(iterator),
        epoch_recall / len(iterator),
        epoch_fscore / len(iterator),
    )

# This is just for measuring training time!
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

def training_loop(model, train_iterator, valid_iterator, optimizer, criterion, epoch_number=15):
    # Set an EPOCH number!
    N_EPOCHS = epoch_number

    best_valid_loss = float("inf")

    # We loop forward on the epoch number
    for epoch in range(N_EPOCHS):

        start_time = time.time()

        # Train the model on the training set using the dataloader
        train_loss, train_prec, train_rec, train_fscore = train(
            model, train_iterator, optimizer, criterion
        )
        # And validate your model on the validation set
        valid_loss, valid_prec, valid_rec, valid_fscore = evaluate(
            model, valid_iterator, criterion
        )

        end_time = time.time()

        epoch_mins, epoch_secs = epoch_time(start_time, end_time)

        # If we find a better model, we save the weights so later we may want to reload it
        if valid_loss < best_valid_loss:
            best_valid_loss = valid_loss
            torch.save(model.state_dict(), "tut1-model.pt")

        print(f"Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s")
        print(
            f"\tTrain Loss: {train_loss:.3f} | Train Prec: {train_prec*100:.2f}% | Train Rec: {train_rec*100:.2f}% | Train Fscore: {train_fscore*100:.2f}%"
        )
        print(
            f"\t Val. Loss: {valid_loss:.3f} |  Val Prec: {valid_prec*100:.2f}% | Val Rec: {valid_rec*100:.2f}% | Val Fscore: {valid_fscore*100:.2f}%"
        )

In [30]:
EPOCH_NUMBER = 15

In [31]:
training_loop(cause_model, cause_train_iterator, cause_valid_iterator, cause_optimizer, cause_criterion)

Epoch: 01 | Epoch Time: 0m 0s
	Train Loss: 0.676 | Train Prec: 88.38% | Train Rec: 43.51% | Train Fscore: 57.74%
	 Val. Loss: 0.656 |  Val Prec: 88.14% | Val Rec: 50.38% | Val Fscore: 63.41%
Epoch: 02 | Epoch Time: 0m 0s
	Train Loss: 0.627 | Train Prec: 91.09% | Train Rec: 56.71% | Train Fscore: 69.54%
	 Val. Loss: 0.631 |  Val Prec: 88.31% | Val Rec: 51.50% | Val Fscore: 64.45%
Epoch: 03 | Epoch Time: 0m 0s
	Train Loss: 0.589 | Train Prec: 90.76% | Train Rec: 59.06% | Train Fscore: 71.17%
	 Val. Loss: 0.614 |  Val Prec: 84.75% | Val Rec: 53.41% | Val Fscore: 64.97%
Epoch: 04 | Epoch Time: 0m 0s
	Train Loss: 0.559 | Train Prec: 92.41% | Train Rec: 60.02% | Train Fscore: 72.35%
	 Val. Loss: 0.600 |  Val Prec: 85.40% | Val Rec: 53.61% | Val Fscore: 65.33%
Epoch: 05 | Epoch Time: 0m 0s
	Train Loss: 0.534 | Train Prec: 92.53% | Train Rec: 63.21% | Train Fscore: 74.65%
	 Val. Loss: 0.593 |  Val Prec: 83.66% | Val Rec: 55.84% | Val Fscore: 66.43%
Epoch: 06 | Epoch Time: 0m 0s
	Train Loss: 0.

In [32]:
training_loop(treat_model, treat_train_iterator, treat_valid_iterator, treat_optimizer, treat_criterion)

Epoch: 01 | Epoch Time: 0m 0s
	Train Loss: 0.658 | Train Prec: 94.09% | Train Rec: 47.26% | Train Fscore: 61.94%
	 Val. Loss: 0.629 |  Val Prec: 87.66% | Val Rec: 58.94% | Val Fscore: 70.15%
Epoch: 02 | Epoch Time: 0m 0s
	Train Loss: 0.584 | Train Prec: 92.68% | Train Rec: 68.03% | Train Fscore: 78.18%
	 Val. Loss: 0.582 |  Val Prec: 83.35% | Val Rec: 66.76% | Val Fscore: 73.77%
Epoch: 03 | Epoch Time: 0m 0s
	Train Loss: 0.532 | Train Prec: 89.59% | Train Rec: 76.70% | Train Fscore: 82.33%
	 Val. Loss: 0.549 |  Val Prec: 80.67% | Val Rec: 70.82% | Val Fscore: 75.11%
Epoch: 04 | Epoch Time: 0m 0s
	Train Loss: 0.493 | Train Prec: 89.65% | Train Rec: 77.91% | Train Fscore: 83.14%
	 Val. Loss: 0.524 |  Val Prec: 80.17% | Val Rec: 71.11% | Val Fscore: 75.09%
Epoch: 05 | Epoch Time: 0m 0s
	Train Loss: 0.460 | Train Prec: 89.78% | Train Rec: 79.54% | Train Fscore: 84.14%
	 Val. Loss: 0.504 |  Val Prec: 79.63% | Val Rec: 73.61% | Val Fscore: 76.25%
Epoch: 06 | Epoch Time: 0m 0s
	Train Loss: 0.