# Model Training and Evaluation with BERT

After ETL and text-processing, we continue to model training and evaluation. 

In [1]:
import pandas as pd
pd.set_option('display.max_columns', None)

This notebook trains only one of the 5 BERT models per run. The INDEX_SPECIFIED parameter selects which model is trained and, with it, which slice of the input texts that model sees. For each of the five values, you have to run preprocessing once and then this notebook. A separate evaluation notebook combines the results.

In [39]:
INDEX_SPECIFIED = 4 

INDEX_L = int(45616*INDEX_SPECIFIED)
INDEX_R = int(45616*(INDEX_SPECIFIED+1))

INDEX_L_VAL = int(11404*INDEX_SPECIFIED)
INDEX_R_VAL = int(11404*(INDEX_SPECIFIED+1))

SAVE_MODEL_NAME = 'model_fifth500only1epoch.bert'
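
To make the index arithmetic concrete, here is a small sketch of the row ranges that each value of INDEX_SPECIFIED produces. It assumes (the notebook does not show this) that the bounds are used as half-open slices over data stacked segment by segment; `segment_bounds` is a hypothetical helper, not part of the original code.

```python
# Block sizes copied from the constants above; the helper name is hypothetical.
TRAIN_ROWS = 45616
VAL_ROWS = 11404

def segment_bounds(index, rows_per_segment):
    """Return the half-open [left, right) row range for one segment."""
    left = rows_per_segment * index
    right = rows_per_segment * (index + 1)
    return left, right

# List the training and validation ranges for all five models.
for index in range(5):
    print(index, segment_bounds(index, TRAIN_ROWS), segment_bounds(index, VAL_ROWS))
```

With INDEX_SPECIFIED = 4, as set above, this selects the last of the five blocks.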


Unnamed: 0,TEXT
8153,
51131,
55065,
7368,
29489,


We will utilize NLTK's stopwords library.

In [61]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/yaolong/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

The necessary imports for our BERT model to work.

In [62]:
import re
from tqdm import tqdm
import transformers
%matplotlib inline

from sklearn.model_selection import train_test_split
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
words = stopwords.words("english")
lemma = nltk.stem.WordNetLemmatizer()


import torch
from transformers import BertTokenizer
from transformers import RobertaTokenizer
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel,RobertaModel
from transformers import AdamW, get_linear_schedule_with_warmup

import random
import time

MAX_LEN = 500

These are the standard preprocessing steps needed for BERT. The text cleaning was already done earlier, but I repeat it here, followed by the BERT-specific step of tokenizing the input data.

In [63]:
#Load the Bert tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased",do_lower_case=True)
# Create a function to tokenize a set of texts

def preprocessing_for_bert(data):
    """Perform required preprocessing steps for pretrained BERT.
    @param    data (np.array): Array of texts to be processed.
    @return   input_ids (torch.Tensor): Tensor of token ids to be fed to a model.
    @return   attention_masks (torch.Tensor): Tensor of indices specifying which
                  tokens should be attended to by the model.
    """
    # create empty lists to store outputs
    input_ids = []
    attention_masks = []
    
    #for every sentence...
    
    for sent in data:
        # 'encode_plus will':
        # (1) Tokenize the sentence
        # (2) Add the `[CLS]` and `[SEP]` token to the start and end
        # (3) Truncate/Pad sentence to max length
        # (4) Map tokens to their IDs
        # (5) Create attention mask
        # (6) Return a dictionary of outputs
        encoded_sent = tokenizer.encode_plus(
            text = text_preprocessing(sent),   #preprocess sentence
            add_special_tokens = True,         #Add `[CLS]` and `[SEP]`
            max_length= MAX_LEN  ,             #Max length to truncate/pad
            pad_to_max_length = True,          #pad sentence to max length 
            return_attention_mask= True        #Return attention mask 
        )
        # Add the outputs to the lists
        input_ids.append(encoded_sent.get('input_ids'))
        attention_masks.append(encoded_sent.get('attention_mask'))
        
    #convert lists to tensors
    input_ids = torch.tensor(input_ids)
    attention_masks = torch.tensor(attention_masks)
    
    return input_ids,attention_masks

def text_preprocessing(text):
    """
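    - Lowercase and expand common contractions (eg. "won't" to "will not")
    - Remove digits and entity mentions (eg. '@united')
    - Correct errors (eg. '&amp;' to '&')
    - Collapse repeated whitespace
    @param    text (str): a string to be processed.
    @return   text (str): the processed string.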
    """
    
    text = text.lower()
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"won't", "will not ", text)
    # Match "'scuse" before the generic "'s" rule, which would otherwise consume it
    text = re.sub(r"\'scuse", " excuse ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "can not ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r"\'\n", " ", text)
    text = re.sub(r"-", " ", text)
    text = re.sub(r"\'\xa0", " ", text)
    text = re.sub(r'\s+', ' ', text)
    text = ''.join(c for c in text if not c.isnumeric())
    
    # Remove '@name'
    text = re.sub(r'(@.*?)[\s]', ' ', text)

    # Replace '&amp;' with '&'
    text = re.sub(r'&amp;', '&', text)

    # Collapse repeated whitespace and strip leading/trailing whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text
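
A chain of `re.sub` calls like this one is order-sensitive: a general pattern applied early can consume the start of a more specific pattern that comes later. The toy sketch below shows the effect with two rules; `expand` and the two rule lists are hypothetical names used only for illustration.

```python
import re

def expand(text, rules):
    """Apply (pattern, replacement) pairs in order, then tidy whitespace."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return re.sub(r"\s+", " ", text).strip()

# Same two rules, different order.
general_first = [(r"'s", " "), (r"'scuse", " excuse ")]
specific_first = [(r"'scuse", " excuse "), (r"'s", " ")]

sample = "'scuse me, that's mine"
print(expand(sample, general_first))   # the "'scuse" rule never gets a chance to match
print(expand(sample, specific_first))  # the specific rule fires before the general one
```

The practical takeaway is to list the more specific contractions before the generic ones.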

We obtain the training and validation inputs and masks required for our BERT input.

In [66]:
# Run function 'preprocessing_for_bert' on the train set and validation set
print('Tokenizing data...')
train_inputs, train_masks = preprocessing_for_bert(X_train['TEXT'])
val_inputs, val_masks = preprocessing_for_bert(X_val['TEXT'])

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Tokenizing data...
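
The warning above is about truncation to the maximum length. As a rough illustration of what padding and truncating to a fixed length does to the token ids and the attention mask, here is a sketch on plain Python lists. It is a simplification and not the tokenizer's actual implementation; the content ids are made up, while 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids in the bert-base-uncased vocabulary.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def pad_or_truncate(token_ids, max_len):
    """Add special tokens, cut to max_len, pad the rest, and build the mask."""
    body = token_ids[: max_len - 2]            # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)                      # 1 = real token
    padding = max_len - len(ids)
    return ids + [PAD_ID] * padding, mask + [0] * padding   # 0 = padding

short_ids, short_mask = pad_or_truncate([7, 8, 9], max_len=8)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1020)), max_len=8)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every row ends up with the same length, which is why the tensors printed further down all have a second dimension of 500.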




We'll need to convert the labels to torch tensors.

In [67]:
# Convert other data types to torch.Tensor
train_labels = torch.tensor(y_train.values)
val_labels = torch.tensor(y_val.values)

## For fine-tuning BERT, the authors recommend a batch size of 16 or 32; a smaller batch of 4 is used here
batch_size = 4

We see that there are actually 45,616 entries in our training dataset.

In [68]:
torch.set_printoptions(profile='full')

len(train_inputs)

45616

We observe the size of the training input, masks and labels.

In [70]:
train_inputs.size()

torch.Size([45616, 500])

In [71]:
train_masks.size()

torch.Size([45616, 500])

In [72]:
#torch.set_printoptions(profile='full')
train_labels.size()

torch.Size([45616, 50])
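
The label tensor has 50 columns per example, which suggests that each row is a multi-hot vector with one slot per label. A minimal sketch of that encoding follows; the active label indices are made up and `multi_hot` is a hypothetical helper.

```python
NUM_LABELS = 50

def multi_hot(active_indices, num_labels=NUM_LABELS):
    """Return a 0/1 list with ones at the active label positions."""
    row = [0] * num_labels
    for i in active_indices:
        row[i] = 1
    return row

row = multi_hot([3, 17, 42])   # an example carrying three labels at once
print(len(row), sum(row))
```

Because several entries can be 1 at the same time, this is a multi-label problem rather than a single-class one.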

We have to create the DataLoaders for our training and validation datasets.

In [73]:
# Create the DataLoader for our training set
train_data = TensorDataset(train_inputs,train_masks, train_labels)
#train_sampler = RandomSampler(train_data)
#train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
train_dataloader = DataLoader(train_data, batch_size=batch_size)

# Create the DataLoader for our validation set
val_data = TensorDataset(val_inputs, val_masks, val_labels)
#val_sampler = SequentialSampler(val_data)
#val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)
val_dataloader = DataLoader(val_data, batch_size=batch_size)

We check the lengths of train_dataloader, train_data, val_data and val_dataloader.

In [74]:
len(train_dataloader)

11404

In [75]:
len(train_data)

45616

In [76]:
len(val_data)

11404

In [77]:
len(val_dataloader)

2851
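
These lengths follow directly from the batch size: a DataLoader that keeps the last partial batch yields ceil(N / batch_size) batches. A quick check of the numbers above, where `num_batches` is a hypothetical helper:

```python
import math

def num_batches(num_examples, batch_size, drop_last=False):
    """Number of batches a loader yields for a dataset of num_examples rows."""
    if drop_last:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)

print(num_batches(45616, 4))   # training loader
print(num_batches(11404, 4))   # validation loader
```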

The following is the boilerplate code of a BERT model for classification tasks.

In [78]:
%%time
# Create the BertClassifier class

class BertClassifier(nn.Module):
    """
        Bert Model for classification Tasks.
    """
    def __init__(self, freeze_bert=False):
        """
        @param   freeze_bert (bool): Set `True` to freeze the BERT weights and train
                     only the classifier head; `False` fine-tunes the whole model.
        """
        super(BertClassifier,self).__init__()
        # Specify hidden size of Bert, hidden size of our classifier, and number of labels
        D_in, H,D_out = 768, 30, 50
        
#         self.bert = RobertaModel.from_pretrained('roberta-base')
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        
        self.classifier = nn.Sequential(
                            nn.Linear(D_in, H),
                            nn.ReLU(),
                            nn.Linear(H, D_out))
        self.sigmoid = nn.Sigmoid()
        # Freeze the Bert Model
        if freeze_bert:
            for param in self.bert.parameters():
                param.requires_grad = False
    
    def forward(self,input_ids,attention_mask):
        """
        Feed input to BERT and the classifier to compute logits.
        @param    input_ids (torch.Tensor): an input tensor with shape (batch_size,
                      max_length)
        @param    attention_mask (torch.Tensor): a tensor that hold attention mask
                      information with shape (batch_size, max_length)
        @return   logits (torch.Tensor): an output tensor with shape (batch_size,
                      num_labels)
        """
        outputs = self.bert(input_ids=input_ids,
                           attention_mask = attention_mask)
        
        # Extract the last hidden state of the token `[CLS]` for classification task
        last_hidden_state_cls = outputs[0][:,0,:]
        
        # Feed input to classifier to compute logits
        logit = self.classifier(last_hidden_state_cls)
        
#         logits = self.sigmoid(logit)
        
        return logit

CPU times: user 311 µs, sys: 9 µs, total: 320 µs
Wall time: 35 µs
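
To get a feel for the size of the classification head, we can count its parameters by hand. The sketch below only covers the two `nn.Linear` layers defined above (weights plus biases); the BERT encoder itself is far larger, at roughly 110 million parameters for the base model.

```python
def linear_params(in_features, out_features):
    """Parameter count of a fully connected layer: weight matrix plus bias."""
    return in_features * out_features + out_features

D_in, H, D_out = 768, 30, 50
head_params = linear_params(D_in, H) + linear_params(H, D_out)
print(head_params)
```

The head is tiny next to the encoder, so with `freeze_bert=False` almost all of the training cost comes from updating BERT itself.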


With the following function, we initialize our BERT model with the optimizer and learning rate scheduler.

In [79]:
def initialize_model(epochs=4):
    """Initialize the Bert Classifier, the optimizer and the learning rate scheduler.
    """
    
    # Instantiate Bert Classifier
    bert_classifier = BertClassifier(freeze_bert=False)
    
    bert_classifier.to(device)
    
    # Create the optimizer
    optimizer = AdamW(bert_classifier.parameters(),
                     lr=5e-5, #Default learning rate
                     eps=1e-8 #Default epsilon value
                     )
    
    # Total number of training steps
    total_steps = len(train_dataloader) * epochs
    
    # Set up the learning rate scheduler
    scheduler = get_linear_schedule_with_warmup(optimizer, 
                                              num_warmup_steps=0, # Default value
                                              num_training_steps=total_steps)
    return bert_classifier, optimizer, scheduler
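
With zero warmup steps, the linear schedule simply decays the learning rate from its starting value to zero over the total number of steps. The sketch below reproduces that rule in plain Python as I understand it from the library documentation; `linear_schedule_lr` is a hypothetical name, and the step count assumes the 11,404 training batches and one epoch used in this notebook.

```python
def linear_schedule_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        factor = step / max(1, warmup_steps)
    else:
        factor = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * factor

total_steps = 11404 * 1   # len(train_dataloader) * epochs
for step in (0, total_steps // 2, total_steps):
    print(step, linear_schedule_lr(step, 5e-5, total_steps))
```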

We make use of the same utility functions as for LR/CNN/BiGRU/CAML.

In [80]:
import numpy as np
import pandas as pd
from collections import defaultdict
import csv
import os
import time
import torch
from sklearn.metrics import accuracy_score, f1_score
from sklearn.metrics import roc_curve, auc

# Helper class that I found online, it's pretty good. Just computes the average. 
class AverageMeter(object):
    """Computes and stores the average and current value"""

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

#  referenced from James Mullenbach, https://github.com/jamesmullenbach/caml-mimic/
def union_size(yhat, y, axis):
    # axis=0 for label-level union (macro). axis=1 for instance-level
    return np.logical_or(yhat, y).sum(axis=axis).astype(float)

#  referenced from James Mullenbach, https://github.com/jamesmullenbach/caml-mimic/
def intersect_size(yhat, y, axis):
    # axis=0 for label-level intersection (macro). axis=1 for instance-level
    return np.logical_and(yhat, y).sum(axis=axis).astype(float)

#  referenced from James Mullenbach, https://github.com/jamesmullenbach/caml-mimic/evaluation.py
def macro_accuracy(yhat, y):
    num = intersect_size(yhat, y, 0) / (union_size(yhat, y, 0) + 1e-10)
    return np.mean(num)

#  referenced from James Mullenbach, https://github.com/jamesmullenbach/caml-mimic/evaluation.py
def macro_precision(yhat, y):
    num = intersect_size(yhat, y, 0) / (yhat.sum(axis=0) + 1e-10)
    return np.mean(num)

#  referenced from James Mullenbach, https://github.com/jamesmullenbach/caml-mimic/evaluation.py
def macro_recall(yhat, y):
    num = intersect_size(yhat, y, 0) / (y.sum(axis=0) + 1e-10)
    return np.mean(num)

#  referenced from James Mullenbach, https://github.com/jamesmullenbach/caml-mimic/evaluation.py
def macro_f1(yhat, y):
    prec = macro_precision(yhat, y)
    rec = macro_recall(yhat, y)
    if prec + rec == 0:
        f1 = 0.
    else:
        f1 = 2 * (prec * rec) / (prec + rec)
    return f1

#  referenced from James Mullenbach, https://github.com/jamesmullenbach/caml-mimic/evaluation.py
def micro_accuracy(yhatmic, ymic):
    return intersect_size(yhatmic, ymic, 0) / union_size(yhatmic, ymic, 0)

#  referenced from James Mullenbach, https://github.com/jamesmullenbach/caml-mimic/evaluation.py
def micro_precision(yhatmic, ymic):
    return intersect_size(yhatmic, ymic, 0) / yhatmic.sum(axis=0)

#  referenced from James Mullenbach, https://github.com/jamesmullenbach/caml-mimic/evaluation.py
def micro_recall(yhatmic, ymic):
    return intersect_size(yhatmic, ymic, 0) / ymic.sum(axis=0)

#  referenced from James Mullenbach, https://github.com/jamesmullenbach/caml-mimic/evaluation.py
def micro_f1(yhatmic, ymic):
    prec = micro_precision(yhatmic, ymic)
    rec = micro_recall(yhatmic, ymic)
    if prec + rec == 0:
        f1 = 0.
    else:
        f1 = 2 * (prec * rec) / (prec + rec)
    return f1

#  referenced from James Mullenbach, https://github.com/jamesmullenbach/caml-mimic/evaluation.py
def inst_precision(yhat, y):
    num = intersect_size(yhat, y, 1) / yhat.sum(axis=1)
    # correct for divide-by-zeros
    num[np.isnan(num)] = 0.
    return np.mean(num)

#  referenced from James Mullenbach, https://github.com/jamesmullenbach/caml-mimic/evaluation.py
def inst_recall(yhat, y):
    num = intersect_size(yhat, y, 1) / y.sum(axis=1)
    # correct for divide-by-zeros
    num[np.isnan(num)] = 0.
    return np.mean(num)

#  referenced from James Mullenbach, https://github.com/jamesmullenbach/caml-mimic/evaluation.py
def inst_f1(yhat, y):
    prec = inst_precision(yhat, y)
    rec = inst_recall(yhat, y)
    f1 = 2 * (prec * rec) / (prec + rec)
    return f1

#  referenced from James Mullenbach, https://github.com/jamesmullenbach/caml-mimic/evaluation.py
def precision_at_k(yhat_raw, y, k):
    # num true labels in top k predictions / k
    sortd = np.argsort(yhat_raw)[:, ::-1]
    topk = sortd[:, :k]

    # get precision at k for each example
    vals = []
    for i, tk in enumerate(topk):
        if len(tk) > 0:
            num_true_in_top_k = y[i, tk].sum()
            denom = len(tk)
            vals.append(num_true_in_top_k / float(denom))

    return np.mean(vals)


#  referenced from James Mullenbach, https://github.com/jamesmullenbach/caml-mimic/evaluation.py
def auc_metrics(yhat_raw, y, ymic):
    if yhat_raw.shape[0] <= 1:
        return
    fpr = {}
    tpr = {}
    roc_auc = {}
    # get AUC for each label individually
    relevant_labels = []
    auc_labels = {}
    for i in range(y.shape[1]):
        # only if there are true positives for this label
        if y[:, i].sum() > 0:
            fpr[i], tpr[i], _ = roc_curve(y[:, i], yhat_raw[:, i])
            if len(fpr[i]) > 1 and len(tpr[i]) > 1:
                auc_score = auc(fpr[i], tpr[i])
                if not np.isnan(auc_score):
                    auc_labels["auc_%d" % i] = auc_score
                    relevant_labels.append(i)

    # macro-AUC: just average the auc scores
    aucs = []
    for i in relevant_labels:
        aucs.append(auc_labels['auc_%d' % i])
    roc_auc['auc_macro'] = np.mean(aucs)

    # micro-AUC: just look at each individual prediction
    yhatmic = yhat_raw.ravel()
    fpr["micro"], tpr["micro"], _ = roc_curve(ymic, yhatmic)
    roc_auc["auc_micro"] = auc(fpr["micro"], tpr["micro"])

    return roc_auc
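
The difference between the macro and micro variants is easiest to see on a tiny example. The sketch below recomputes precision both ways on made-up predictions using plain lists; the function names are hypothetical, and unlike the notebook's `micro_precision` the micro version here returns 0.0 when nothing is predicted.

```python
def per_label_precision(y_pred, y_true, eps=1e-10):
    """Precision of each label column; eps avoids dividing by zero."""
    scores = []
    for j in range(len(y_true[0])):
        true_pos = sum(p[j] and t[j] for p, t in zip(y_pred, y_true))
        predicted = sum(p[j] for p in y_pred)
        scores.append(true_pos / (predicted + eps))
    return scores

def macro_precision_sketch(y_pred, y_true):
    """Average the per-label precisions, so every label counts equally."""
    scores = per_label_precision(y_pred, y_true)
    return sum(scores) / len(scores)

def micro_precision_sketch(y_pred, y_true):
    """Pool all (example, label) decisions, so frequent labels dominate."""
    true_pos = sum(p and t for rp, rt in zip(y_pred, y_true) for p, t in zip(rp, rt))
    predicted = sum(sum(rp) for rp in y_pred)
    return true_pos / predicted if predicted else 0.0

# Made-up data: 4 examples, 3 labels.
y_true = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [1, 0, 1]]
y_pred = [[1, 0, 0], [1, 1, 0], [0, 0, 0], [1, 0, 0]]
print(macro_precision_sketch(y_pred, y_true), micro_precision_sketch(y_pred, y_true))
```

Here the frequent first label is predicted perfectly while the two rare labels are not, so the micro score (0.75) is much higher than the macro score (about 0.33).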

We specify the loss function and the seed, followed by the train and evaluate functions of the BERT model, which are modified from a standard template of BERT for classification tasks. The evaluate function also saves the predictions and the actual labels to separate files.

In [81]:
# Specify loss function
#loss_fn = nn.CrossEntropyLoss()
loss_fn = nn.BCEWithLogitsLoss()

def set_seed(seed_value=42):
    """Set seed for reproducibility.
    """
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

def train(model, train_dataloader, val_dataloader=None, epochs=4, evaluation=False, train=True, save='model.best'):
    """Train the BertClassifier model.
    """
    # Start training loop
    if train==True:
        print("Start training...\n")
        for epoch_i in range(epochs):
            # =======================================
            #               Training
            # =======================================
            # Print the header of the result table
            print(f"{'Epoch':^7} | {'Batch':^7} | {'Train Loss':^12} | {'Val Loss':^10} | {'Val Acc':^9} | {'Elapsed':^9}")
            print("-"*70)

            # Measure the elapsed time of each epoch
            t0_epoch, t0_batch = time.time(), time.time()

            # Reset tracking variables at the beginning of each epoch
            total_loss, batch_loss, batch_counts = 0, 0, 0

            # Put the model into the training mode
            model.train()

            # For each batch of training data...
            for step, batch in enumerate(train_dataloader):
                batch_counts +=1
                # Load batch to GPU
                b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)
                  
                # Zero out any previously calculated gradients
                model.zero_grad()

                # Perform a forward pass. This will return logits.
                logits = model(b_input_ids, b_attn_mask)

                # Compute loss and accumulate the loss values
                loss = loss_fn(logits, b_labels.float())
                batch_loss += loss.item()
                total_loss += loss.item()

                # Perform a backward pass to calculate gradients
                loss.backward()

                # Clip the norm of the gradients to 1.0 to prevent "exploding gradients"
                torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

                # Update parameters and the learning rate
                optimizer.step()
                scheduler.step()

                # Print the loss values and time elapsed every 50,000 batches and at the last batch
                if (step % 50000 == 0 and step != 0) or (step == len(train_dataloader) - 1):
                    # Calculate time elapsed since the last report
                    time_elapsed = time.time() - t0_batch

                    # Print training results
                    print(f"{epoch_i + 1:^7} | {step:^7} | {batch_loss / batch_counts:^12.6f} | {'-':^10} | {'-':^9} | {time_elapsed:^9.2f}")

                    # Reset batch tracking variables
                    batch_loss, batch_counts = 0, 0
                    t0_batch = time.time()

            # Calculate the average loss over the entire training data
            avg_train_loss = total_loss / len(train_dataloader)

            print("-"*70)
            
        torch.save(model, save)
        print("Training complete!")
    else:
        model = torch.load(save)

    # =======================================
    #               Evaluation
    # =======================================
    if evaluation == True:
        # After training (or after loading a saved model), measure the model's
        # performance on our validation set.
        val_loss, val_accuracy = evaluate(model, val_dataloader)

        # Print performance over the validation set
        #time_elapsed = time.time() - t0_epoch
        print(f"{val_loss:^10.6f} | {val_accuracy:^9.2f}")    
        #print(f"{epoch_i + 1:^7} | {'-':^7} | {avg_train_loss:^12.6f} | {val_loss:^10.6f} | {val_accuracy:^9.2f} | {time_elapsed:^9.2f}")
        #print("-"*70)
    #print("\n")

    # Always return model after training
    #return model



def evaluate(model, val_dataloader, to_save=True):
    """After the completion of each training epoch, measure the model's performance
    on our validation set.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled during
    # the test time.
    model.eval()

    # Tracking variables
    val_accuracy = []
    val_loss = []

    # Prediction variables
    y_hats = []
    y_all = []
    yhat_raws = []
    
    # For each batch in our validation set...
    for idx, batch in enumerate(val_dataloader):
        # Load batch to GPU
        b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)

        #print(b_attn_mask.size())
        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids, b_attn_mask)
        #print(b_labels)
        #print(logits)
        #print(b_input_ids[:,0:500])
        #print(b_attn_mask[:,0:500])
        # Compute loss
        loss = loss_fn(logits, b_labels.float())
        val_loss.append(loss.item())
        #print(loss)
        # Get the predictions
        #preds = torch.argmax(logits, dim=1).flatten()
        #print(preds)
        #break
        # Calculate the accuracy rate
        #accuracy = (preds == b_labels).cpu().numpy().mean() * 100
        accuracy = accuracy_thresh(logits.view(-1,50),b_labels.view(-1,50))
        
        val_accuracy.append(accuracy)
        
        # Calculate y, yhat, yhat_raw
        y = b_labels.cpu().detach().numpy()
        y_all.append(y)
        yhat = torch.sigmoid(logits).cpu().detach().round().numpy()
        y_hats.append(yhat)
        yhat_raw = torch.sigmoid(logits).cpu().detach().numpy()
        yhat_raws.append(yhat_raw)        

    y_hats = np.concatenate(y_hats, axis=0)
    y_all = np.concatenate(y_all, axis=0)
    yhat_raws = np.concatenate(yhat_raws, axis=0)
    
    from numpy import savetxt
    if to_save == True:
        if INDEX_SPECIFIED == 0:
            savetxt('y_hat_0.csv', y_hats, delimiter=',')
            savetxt('y_all_0.csv', y_all, delimiter=',')
            savetxt('yhat_raws_0.csv', yhat_raws, delimiter=',')
        elif INDEX_SPECIFIED == 1:
            savetxt('y_hat_500.csv', y_hats, delimiter=',')
            savetxt('y_all_500.csv', y_all, delimiter=',')
            savetxt('yhat_raws_500.csv', yhat_raws, delimiter=',')
        elif INDEX_SPECIFIED == 2:
            savetxt('y_hat_1000.csv', y_hats, delimiter=',')
            savetxt('y_all_1000.csv', y_all, delimiter=',')
            savetxt('yhat_raws_1000.csv', yhat_raws, delimiter=',')
        elif INDEX_SPECIFIED == 3:
            savetxt('y_hat_1500.csv', y_hats, delimiter=',')
            savetxt('y_all_1500.csv', y_all, delimiter=',')
            savetxt('yhat_raws_1500.csv', yhat_raws, delimiter=',')
        elif INDEX_SPECIFIED == 4:
            savetxt('y_hat_2000.csv', y_hats, delimiter=',')
            savetxt('y_all_2000.csv', y_all, delimiter=',')
            savetxt('yhat_raws_2000.csv', yhat_raws, delimiter=',')

        #print('y_hats')
    #print(y_hats.shape)
    num_rows = int(y_hats.shape[0])
    #num_cols = int(y_hats.shape[1])
    #print(num_rows)
    #print(num_cols)
    num_rows_batch_1 = int(y_hats.shape[0]/5)
    num_rows_batch_2 = int(y_hats.shape[0]/5*2)
    num_rows_batch_3 = int(y_hats.shape[0]/5*3)
    num_rows_batch_4 = int(y_hats.shape[0]/5*4)
    num_rows_batch_5 = int(y_hats.shape[0]/5*5)
    
    y_hats = y_hats[0:num_rows_batch_1,:] #+ y_hats[num_rows_batch_1:num_rows_batch_2,:] + y_hats[num_rows_batch_2:num_rows_batch_3,:] + y_hats[num_rows_batch_3:num_rows_batch_4,:] + y_hats[num_rows_batch_4:num_rows,:]
    y_all = y_all[0:num_rows_batch_1,:] #+ y_all[num_rows_batch_1:num_rows_batch_2,:] + y_all[num_rows_batch_2:num_rows_batch_3,:] + y_all[num_rows_batch_3:num_rows_batch_4,:] + y_all[num_rows_batch_4:num_rows,:]
    yhat_raws = yhat_raws[0:num_rows_batch_1,:] #+ yhat_raws[num_rows_batch_1:num_rows_batch_2,:] + yhat_raws[num_rows_batch_2:num_rows_batch_3,:] + yhat_raws[num_rows_batch_3:num_rows_batch_4,:] + yhat_raws[num_rows_batch_4:num_rows,:]
    
    #y_hats[y_hats > 1] = 1
    #y_all[y_all > 1] = 1
    #yhat_raws[yhat_raws > 1] = 1
    
    mac_acc = macro_accuracy(y_hats, y_all)
    mac_rec = macro_recall(y_hats, y_all)
    mac_pre = macro_precision(y_hats, y_all)
    mic_acc = micro_accuracy(y_hats.ravel(), y_all.ravel())
    mic_rec = micro_recall(y_hats.ravel(), y_all.ravel())
    mic_pre = micro_precision(y_hats.ravel(), y_all.ravel())
    mic_f1 = micro_f1(y_hats.ravel(), y_all.ravel())
    mac_f1 = macro_f1(y_hats, y_all)
    auc_dict = auc_metrics(yhat_raws, y_all, y_all.ravel())
    prec_at_5 = precision_at_k(yhat_raws, y_all, 5)
    prec_at_8 = precision_at_k(yhat_raws, y_all, 8)
    prec_at_15 = precision_at_k(yhat_raws, y_all, 15)

    print('Loss {loss:.4f} \t'
          'Macro Accuracy {mac_acc:.3f} \t'
          'Macro Recall {mac_rec:.3f} \t'
          'Macro Precision {mac_pre:.3f} \t'
          'Macro F1 {mac_f1:.3f} \t'
          'Macro AUC {mac_auc:.3f} \t'
          'Micro Accuracy {mic_acc:.3f} \t'
          'Micro Recall {mic_rec:.3f} \t'
          'Micro Precision {mic_pre:.3f} \t'
          'Micro F1 {mic_f1:.3f} \t'
          'Micro AUC {auc_micro:.3f} \t'
          'P@5 {prec_at_5:.3f} \t'
          'P@8 {prec_at_8:.3f} \t'
          'P@15 {prec_at_15:.3f} \t'.format(
        len(val_dataloader),
        loss=np.mean(val_loss),
        mac_acc=mac_acc,
        mac_pre=mac_pre,
        mac_rec=mac_rec,
        mac_f1=mac_f1,
        mac_auc=auc_dict['auc_macro'],
        mic_acc=mic_acc,
        mic_pre=mic_pre,
        mic_rec=mic_rec,
        mic_f1=mic_f1,
        auc_micro=auc_dict['auc_micro'],
        prec_at_5=prec_at_5,
        prec_at_8=prec_at_8,
        prec_at_15=prec_at_15))

    # Compute the average accuracy and loss over the validation set.
    val_loss = np.mean(val_loss)
    val_accuracy = np.mean(val_accuracy)

    return val_loss, val_accuracy

def accuracy_thresh(y_pred, y_true, thresh:float=0.5, sigmoid:bool=True):
    "Compute accuracy when `y_pred` and `y_true` are the same size."
    if sigmoid: 
        y_pred = y_pred.sigmoid()
    return ((y_pred>thresh)==y_true.byte()).float().mean().item()
    #return np.mean(((y_pred>thresh).float()==y_true.float()).float().cpu().numpy(), axis=1).sum()
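
For reference, `nn.BCEWithLogitsLoss` applies a sigmoid to each logit and averages the binary cross-entropy over all entries. The sketch below writes out that computation in plain Python in a numerically stable form; `bce_with_logits` is a hypothetical stand-in for illustration, not the PyTorch implementation.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over flat lists of logits and 0/1 targets."""
    total = 0.0
    for x, y in zip(logits, targets):
        # Stable form of -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))]
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)

undecided = bce_with_logits([0.0, 0.0], [1.0, 0.0])     # logit 0 means probability 0.5
confident = bce_with_logits([10.0, -10.0], [1.0, 0.0])  # confident and correct
print(undecided, confident)
```

An undecided model scores ln 2 (about 0.693) per entry, which is a useful baseline when reading the loss values reported during training.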

We check for CUDA availability and make use of the GPU to speed up our training.

In [82]:
if torch.cuda.is_available():       
    device = torch.device("cuda")
    print(f'There are {torch.cuda.device_count()} GPU(s) available.')
    print('Device name:', torch.cuda.get_device_name(0))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

There are 2 GPU(s) available.
Device name: GeForce RTX 3060 Ti


We initialize and train our BERT model for one epoch, setting both evaluation and train to True so that both training and evaluation occur. It takes around 1-2 hours to train the model, so you can probably go grab a cup of coffee. After training, the model is saved under the specified filename.

In [83]:
torch.set_printoptions(profile='full')
set_seed(42)    # Set seed for reproducibility
bert_classifier, optimizer, scheduler = initialize_model(epochs=1)
train(bert_classifier, train_dataloader, val_dataloader, epochs=1, evaluation=True, train=True, save=SAVE_MODEL_NAME)

Loss 0.3270 	Macro Accuracy 0.000 	Macro Recall 0.000 	Macro Precision 0.000 	Macro F1 0.000 	Macro AUC 0.534 	Micro Accuracy 0.000 	Micro Recall 0.000 	Micro Precision nan 	Micro F1 nan 	Micro AUC 0.690 	P@5 0.282 	P@8 0.246 	P@15 0.202 	
 0.327027  |   0.89   


  return intersect_size(yhatmic, ymic, 0) / yhatmic.sum(axis=0)
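
A precision of nan together with a recall of 0 means the division was 0/0, i.e. no label was predicted above the 0.5 threshold in the evaluated rows. The toy sketch below (made-up sparse labels, hypothetical `safe_ratio` helper) shows how such an all-zero prediction still reaches a high element-wise accuracy, which is one plausible reading of the 0.89 reported above.

```python
import math

def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")

# Toy labels (made up): 2 of 20 entries are positive, loosely mimicking sparse labels.
y_true = [[1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
          [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]]
# A model that never crosses the 0.5 threshold predicts all zeros.
y_pred = [[0] * 10 for _ in y_true]

pairs = [(p, t) for row_p, row_t in zip(y_pred, y_true) for p, t in zip(row_p, row_t)]
accuracy = sum(p == t for p, t in pairs) / len(pairs)
true_positives = sum(p and t for p, t in pairs)
precision = safe_ratio(true_positives, sum(p for p, _ in pairs))
recall = safe_ratio(true_positives, sum(t for _, t in pairs))

print(accuracy, precision, recall)
```

For sparse multi-label data, threshold accuracy is therefore a weak signal on its own; the ranking-based metrics (AUC, P@k) are more informative here.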


Often, however, you will want to train the model once and evaluate it multiple times. In that case, set the train parameter to False and the model will be loaded from the specified save file.

In [80]:
#train(bert_classifier, train_dataloader, val_dataloader, epochs=4, evaluation=True, train=False)

The following predict function will be used to perform predictions.

In [81]:
def bert_predict(model, test_dataloader):
    """Perform a forward pass on the trained BERT model to predict probabilities
    on the test set.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled during
    # the test time.
    model.eval()

    all_logits = []

    # For each batch in our test set...
    for batch in test_dataloader:
        # Load batch to GPU
        b_input_ids, b_attn_mask = tuple(t.to(device) for t in batch)[:2]

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids, b_attn_mask)
        all_logits.append(logits)
    
    # Concatenate logits from each batch
    all_logits = torch.cat(all_logits, dim=0)

    # Apply sigmoid to calculate per-label probabilities
    #probs = F.softmax(all_logits, dim=1).cpu().numpy()
    probs = all_logits.sigmoid().cpu().numpy()
    

    return probs

#probs = all_logits.sigmoid().cpu().numpy()

With the utility function, we can obtain the predicted probability of the specified validation or test set.

In [82]:
## Compute predicted probabilities on the validation set

probs = bert_predict(bert_classifier,val_dataloader)

# Evaluate the bert classifier

# evaluate_roc(probs, y_val)
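
Once we have per-label probabilities, there are two common ways to turn them into predictions: thresholding at 0.5, or taking the k highest-scoring labels. The sketch below shows both on a single made-up row of probabilities; the helper names are hypothetical.

```python
def threshold_labels(probs_row, thresh=0.5):
    """Mark every label whose probability exceeds the threshold."""
    return [1 if p > thresh else 0 for p in probs_row]

def top_k_indices(probs_row, k):
    """Return the indices of the k most probable labels, best first."""
    order = sorted(range(len(probs_row)), key=lambda i: probs_row[i], reverse=True)
    return order[:k]

row = [0.10, 0.45, 0.30, 0.05]   # made-up probabilities for four labels
print(threshold_labels(row))
print(top_k_indices(row, 2))
```

Nothing passes the threshold in this row, yet the top-k ranking still picks out the most likely labels, which is how P@k can be non-zero while the thresholded F1 is zero.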

Lastly, we combine the training and validation sets into a single training set; predictions can then be made on the test set.

In [83]:
# Concatenate the train set and the validation set

full_train_data = torch.utils.data.ConcatDataset([train_data, val_data])
full_train_sampler = RandomSampler(full_train_data)
full_train_dataloader = DataLoader(full_train_data, sampler=full_train_sampler, batch_size=batch_size)

# Train the Bert Classifier on the entire training data
set_seed(42)
bert_classifier, optimizer, scheduler = initialize_model(epochs=10)
# No validation loader is passed here, so we train without the evaluation step
train(bert_classifier, full_train_dataloader, epochs=10, evaluation=False, train=True)

Start training...

 Epoch  |  Batch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
----------------------------------------------------------------------
   1    |  14254  |   0.336295   |     -      |     -     |  2341.04 
----------------------------------------------------------------------
 Epoch  |  Batch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
----------------------------------------------------------------------
   2    |  14254  |   0.329474   |     -      |     -     |  2342.74 
----------------------------------------------------------------------
 Epoch  |  Batch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
----------------------------------------------------------------------


KeyboardInterrupt: 
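
The numbers in this log can be sanity-checked with a little arithmetic. The sketch below uses only figures that appear in this notebook (dataset sizes, batch size, and the first epoch's elapsed time) to estimate how long the full 10 epochs would have taken; the estimate assumes every epoch takes about as long as the first.

```python
import math

full_examples = 45616 + 11404                     # train + validation rows
batches_per_epoch = math.ceil(full_examples / 4)  # batch_size = 4
last_step_index = batches_per_epoch - 1           # steps are counted from 0

seconds_per_epoch = 2341.04                       # elapsed time of epoch 1 in the log
estimated_hours = 10 * seconds_per_epoch / 3600

print(batches_per_epoch, last_step_index, round(estimated_hours, 1))
```

The last step index matches the 14254 shown in the log, and at roughly 39 minutes per epoch the full run would take about six and a half hours.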

## That's all!

In the interest of time, I trained each of the 5 BERT models for a single epoch only. Since the training loss barely improves with a second epoch (0.336 to 0.329 in the run above), I decided to stick with one.