# Fine-tuning BERT for Fact Checking

# A - Introduction

In recent years the NLP community has seen many breakthroughs, especially the shift to transfer learning. Models like ELMo, fast.ai's ULMFiT, Transformer and OpenAI's GPT have allowed researchers to achieve state-of-the-art results on multiple benchmarks and have provided the community with large, high-performing pre-trained models. This shift is often called NLP's ImageNet moment, by analogy with the shift in computer vision a few years ago, when the lower layers of deep learning networks with millions of parameters, trained on a specific task, began to be reused and fine-tuned for other tasks rather than training new networks from scratch.

One of the biggest recent milestones in the evolution of NLP is the release of Google's BERT, which has been described as the beginning of a new era in NLP. In this notebook I'll use Hugging Face's `transformers` library to fine-tune a pretrained BERT model for a classification task. Then I will compare BERT's performance with a baseline model, in which I use a TF-IDF vectorizer and a Naive Bayes classifier. The `transformers` library helps us quickly and efficiently fine-tune the state-of-the-art BERT model and yields an accuracy rate **10%** higher than the baseline model.

**Reference**:

To understand **Transformer** (the architecture which BERT is built on) and learn how to implement BERT, I highly recommend reading the following sources:

- [The Illustrated BERT, ELMo, and co.](http://jalammar.github.io/illustrated-bert/): A very clear and well-written guide to understanding BERT.
- [The documentation of the `transformers` library](https://huggingface.co/transformers/v2.2.0/index.html)
- [BERT Fine-Tuning Tutorial with PyTorch](http://mccormickml.com/2019/07/22/BERT-fine-tuning/) by [Chris McCormick](http://mccormickml.com/): A very detailed tutorial showing how to use BERT with the HuggingFace PyTorch library.



# B - Setup

## 1. Load and download Essential Libraries

In [None]:
#=========variable to change===================

thisIsTrainRun=True #set true to train, set false to load model
enableBrutalTrain=False #set true to train several models at once. CAUTION: this will need a lot of RAM.. by a lot, I mean a LOOOOTT

#directories
modelSaveDir="/content/bert_classifier" #don't put extension
modelLoadDir="/content/bert_classifier.pt" #put extension here
testFileDir=r"/content/test.csv"
outputFileDir=r'submission.csv'

#encode parameters
add_special_tokensSwitch=True         # Add `[CLS]` and `[SEP]`
truncationSwitch=True                 # Truncate sequences longer than the max length
Max_lengthInput=128                   # Max length to truncate/pad to
return_overflowing_tokensSwitch=True  # Return overflowing tokens
pad_to_max_lengthSwitch=True          # Pad every sequence to the max length
pad_to_multiple_ofSwitch=8            # Pad to a multiple of 8 (helps NVIDIA Tensor Cores)
return_tensorsSwitch='pt'             # Return PyTorch tensors
return_attention_maskSwitch=True      # Return attention mask


#model and training parameters
batSize=32 #batch size for fine-tuning (16 or 32)
batSizeTest=1 #batch size for testing
DDDI, HHH, DDDO = 768, 50, 3 #hidden size of BERT, hidden size of our classifier, and number of labels
needForSeed=42 #seed, not a racing game
learningRate=3e-5    # learning rate (aggressive: 5e-5, less aggressive: 3e-5, least aggressive: 2e-5; lower values reduce forgetting)
zaEpsilon=1e-8    # Default epsilon value
epochForInitialization=2 #number of epochs used to size the learning rate schedule (2,3,4)
epochForTrain=2 #number of epochs actually trained (2,3,4); keep equal to epochForInitialization so the schedule spans the whole run
areYouSureAboutEvaluation=True #evaluate on the validation set after each epoch
mFreeze=False #set True to freeze BERT and train only the classifier

In [None]:
if enableBrutalTrain:
  thisIsTrainRun=True

In [None]:
import os
import re
import random
from tqdm import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch

%matplotlib inline

In [None]:
!pip install datasets
!pip install transformers


## 2. Dataset

### 2.1. Download Dataset

In [None]:
from datasets import load_dataset
fever_train = load_dataset("copenlu/fever_gold_evidence", split='train')
fever_valid = load_dataset("copenlu/fever_gold_evidence", split='validation')


In [None]:
# data format, we only need claim, label, evidence
# https://huggingface.co/datasets/copenlu/fever_gold_evidence
print(fever_train[0])

{'claim': 'The number of new cases of shingles per year extends from 1.2–3.4 per 1,000 among healthy individuals.', 'label': 'SUPPORTS', 'evidence': [['Shingles', '31', 'The number of new cases per year ranges from 1.2 -- 3.4 per 1,000 among healthy individuals to 3.9 -- 11.8 per 1,000 among those older than 65 years of age .']], 'id': '98faa551d5973b62591e0835bf898d84', 'verifiable': 'VERIFIABLE', 'original_id': 123397}


In [None]:

unique_tags = set(data['label'] for data in fever_valid)
tag2id = {tag: id for id, tag in enumerate(unique_tags)}
id2tag = {id: tag for tag, id in tag2id.items()}
print(tag2id, id2tag)

{'NOT ENOUGH INFO': 0, 'SUPPORTS': 1, 'REFUTES': 2} {0: 'NOT ENOUGH INFO', 1: 'SUPPORTS', 2: 'REFUTES'}
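One caveat with the cell above: iterating over a `set` of strings can yield a different order from one Python process to the next, because string hashing is randomized by default. The label IDs may therefore differ between a training run and a later run that only loads the model. Below is a minimal sketch of a stable alternative, assuming alphabetical order is acceptable. `LABELS` is a hypothetical stand-in for the labels found in the dataset, and the numbering it produces differs from the output shown above.

```python
# Hypothetical stand-in for the label strings found in the dataset.
LABELS = ["SUPPORTS", "REFUTES", "NOT ENOUGH INFO", "SUPPORTS"]

# sorted() gives the same order in every Python process, unlike set iteration.
unique_tags = sorted(set(LABELS))
tag2id = {tag: idx for idx, tag in enumerate(unique_tags)}
id2tag = {idx: tag for tag, idx in tag2id.items()}

print(tag2id, id2tag)
```

A model trained with one numbering has to be evaluated with the same numbering, so it is worth saving the mapping next to the checkpoint.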


In [None]:
""" TODO: need to read and APPEND evidences HERE """
def read_claim_label(dataset):
  """Return parallel lists of label ids, claims, and raw evidence lists."""
  labels, claims, evidences = [], [], []
  for data in dataset:
    labels.append(tag2id[data['label']])
    claims.append(data['claim'])
    evidences.append(data["evidence"])
  return labels, claims, evidences

if thisIsTrainRun:
  train_labels, train_claims, train_evidences = read_claim_label(fever_train)
  valid_labels, valid_claims, valid_evidences = read_claim_label(fever_valid)

  trueTrain_evidences=[]
  trueValid_evidences=[]

  # Keep only the first evidence entry of each claim
  for i in train_evidences:
    trueTrain_evidences.append(i[0])

  for i in valid_evidences:
    trueValid_evidences.append(i[0])

In [None]:
if thisIsTrainRun:
  print("claims: ",train_claims[0],"\nevidences:",trueTrain_evidences[0])

claims:  The number of new cases of shingles per year extends from 1.2–3.4 per 1,000 among healthy individuals. 
evidences: ['Shingles', '31', 'The number of new cases per year ranges from 1.2 -- 3.4 per 1,000 among healthy individuals to 3.9 -- 11.8 per 1,000 among those older than 65 years of age .']
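As the printed record shows, each evidence entry is a `[page title, line number, sentence]` triple. Below is a minimal sketch, assuming that only the sentence text should be paired with the claim. `evidence_text` is a hypothetical helper that is not part of the original notebook.

```python
def evidence_text(evidence_list):
    """Join the sentence field (index 2) of every [page, line, sentence] triple."""
    return " ".join(triple[2] for triple in evidence_list)

# Evidence field of the first training record, copied from the output above.
evidence = [[
    "Shingles",
    "31",
    "The number of new cases per year ranges from 1.2 -- 3.4 per 1,000 among "
    "healthy individuals to 3.9 -- 11.8 per 1,000 among those older than 65 "
    "years of age .",
]]

sentence = evidence_text(evidence)
print(sentence)
```

Passing a single string like this as `text_pair` lets the tokenizer split the evidence into word pieces. A list of strings, by contrast, appears to be treated as already-tokenized input by the slow tokenizers in `transformers`, so it is worth checking which form reaches `encode_plus`.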


### 2.2. Load Train Data


### 2.3. Load Test Data


## 3. Set up GPU for training

Google Colab offers free GPUs and TPUs. Since we'll be training a large neural network it's best to utilize these features.

A GPU can be added by going to the menu and selecting:

`Runtime -> Change runtime type -> Hardware accelerator: GPU`

Then we need to run the following cell to specify the GPU as the device.

In [None]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


# D - Fine-tuning BERT

## 1. Install the Hugging Face Library

The `transformers` library from Hugging Face contains PyTorch implementations of state-of-the-art NLP models, including BERT (from Google) and GPT (from OpenAI), along with pre-trained model weights.

## 2. Tokenization and Input Formatting

Before tokenizing our text, we will perform some light processing, including removing entity mentions (e.g. @united) and some special characters. The level of processing here is much lower than in previous approaches because BERT was trained on entire sentences.

In [None]:
def text_preprocessing(text):
    """
    - Remove entity mentions (eg. '@united')
    - Correct errors (eg. '&amp;' to '&')
    @param    text (str): a string to be processed.
    @return   text (str): the processed string.
    """
    # Remove '@name' when it is followed by whitespace
    text = re.sub(r'(@.*?)[\s]', ' ', text)

    # Replace '&amp;' with '&'
    text = re.sub(r'&amp;', '&', text)

    # Collapse repeated whitespace and strip leading/trailing whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text
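A quick check of what this function does, using made-up tweet-style strings as input. The function body is repeated in compact form so that the snippet runs on its own.

```python
import re

def text_preprocessing(text):
    text = re.sub(r'(@.*?)[\s]', ' ', text)   # drop '@name' followed by whitespace
    text = re.sub(r'&amp;', '&', text)        # unescape '&amp;'
    return re.sub(r'\s+', ' ', text).strip()  # collapse and strip whitespace

cleaned = text_preprocessing("@united  thanks &amp;   see you soon ")
print(cleaned)  # thanks & see you soon

# A mention at the very end has no trailing whitespace, so the pattern skips it.
trailing = text_preprocessing("hello @united")
print(trailing)  # hello @united
```

The second call shows a limitation of the first pattern: it only matches a mention that is followed by a whitespace character.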

### 2.1. BERT Tokenizer

In order to apply the pre-trained BERT, we must use the tokenizer provided by the library. This is because (1) the model has a specific, fixed vocabulary and (2) the BERT tokenizer has a particular way of handling out-of-vocabulary words.

In addition, we are required to add special tokens to the start and end of each sentence, pad and truncate all sentences to a single constant length, and explicitly mark which tokens are padding with the "attention mask".

The `encode_plus` method of the BERT tokenizer will:

(1) split our text into tokens,

(2) add the special `[CLS]` and `[SEP]` tokens,

(3) convert these tokens into indexes of the tokenizer vocabulary,

(4) pad or truncate sentences to the max length, and

(5) create the attention mask.
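Steps (4) and (5) can be illustrated in plain Python. This is a simplified sketch, not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, and it truncates naively, whereas the real tokenizer truncates the text before adding the special tokens so that the final `[SEP]` is kept.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]      # truncate if too long
    mask = [1] * len(ids)             # 1 marks a real token
    padding = max_length - len(ids)   # number of pad positions to add
    return ids + [pad_id] * padding, mask + [0] * padding

# 101 = [CLS], 102 = [SEP], 0 = [PAD] in the bert-base-uncased vocabulary
input_ids, attention_mask = pad_or_truncate([101, 1996, 2193, 102], max_length=6)
print(input_ids)       # [101, 1996, 2193, 102, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0]
```

The attention mask has a 1 for every real token and a 0 for every padding position, which tells the model to ignore the padding.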






In [None]:
from transformers import BertTokenizer

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

# Create a function to tokenize a set of claim/evidence pairs
# TODO: Modify to load claim and evidences
def preprocessing_for_bert(data, forgiveMySinFather): #preprocessing_for_bert(claim, evidences):
    """Perform required preprocessing steps for pretrained BERT.
    @param    data (list of str): Claims to be processed.
    @param    forgiveMySinFather (list of list of str): Evidence for each claim,
                  aligned with `data`.
    @return   input_ids (torch.Tensor): Tensor of token ids to be fed to a model.
    @return   attention_masks (torch.Tensor): Tensor of 0/1 flags specifying which
                  tokens should be attended to by the model.
    """
    # Create empty lists to store outputs
    input_ids = []
    attention_masks = []
    container=[]

    #preprocessing datasets
    for sent in data:
      container.append(text_preprocessing(sent))

    data=container
    container=[]

    for sentList in forgiveMySinFather:
      container2=[]
      for sent in sentList:
        container2.append(text_preprocessing(sent))
      container.append(container2)

    forgiveMySinFather=container

    # For every sentence...
    for zawarudo, starPlatinum in zip(data, forgiveMySinFather):
        # `encode_plus` will:
        #    (1) Tokenize the sentence
        #    (2) Add the `[CLS]` and `[SEP]` token to the start and end
        #    (3) Truncate/Pad sentence to max length
        #    (4) Map tokens to their IDs
        #    (5) Create attention mask
        #    (6) Return a dictionary of outputs
        '''
        TODO:
        Modify to add claim and evidences
        see https://huggingface.co/docs/transformers/v4.27.2/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.encode_plus for details
        '''
        encoded_sent = tokenizer.encode_plus(
            text=zawarudo,  # Preproc claim
            text_pair=starPlatinum,          #preproc evid
            add_special_tokens=add_special_tokensSwitch,   # Add `[CLS]` and `[SEP]`
            #truncation=truncationSwitch,                #set trunct
            max_length=Max_lengthInput, # Pad sentence to max length
            #stride=MAX_LEN,                
            #return_overflowing_tokens=return_overflowing_tokensSwitch,  #overflow con               
            pad_to_max_length=pad_to_max_lengthSwitch,             # Max length to truncate/pad
            #pad_to_multiple_of=pad_to_multiple_ofSwitch,         #nvidia opt
            #return_tensors=return_tensorsSwitch,           # Return PyTorch tensor
            return_attention_mask=return_attention_maskSwitch      # Return attention mask
            )

        # Add the outputs to the lists
        input_ids.append(encoded_sent.get('input_ids'))
        attention_masks.append(encoded_sent.get('attention_mask'))

    # Convert lists to tensors
    input_ids = torch.tensor(input_ids)
    attention_masks = torch.tensor(attention_masks)

    return input_ids, attention_masks


Now let's tokenize our data.

In [None]:
if thisIsTrainRun:
  from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
  train_inputs, train_masks = preprocessing_for_bert(train_claims, trueTrain_evidences) #preprocessing_for_bert(train_claims, train_evidences)
  val_inputs, val_masks = preprocessing_for_bert(valid_claims, trueValid_evidences) #preprocessing_for_bert(valid_claims, valid_evidences)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Each encoded example is padded or truncated to `Max_lengthInput` (128) tokens. Let's print the first training example to check the result.

In [None]:
if thisIsTrainRun:
  print(train_inputs[0])

tensor([  101,  1996,  2193,  1997,  2047,  3572,  1997, 12277, 17125,  2566,
         2095,  8908,  2013,  1015,  1012,  1016,  1516,  1017,  1012,  1018,
         2566,  1015,  1010,  2199,  2426,  7965,  3633,  1012,   102,   100,
         2861,   100,   102,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0])
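The printed tensor can be read with a few lines of plain Python. In the `bert-base-uncased` vocabulary, ID 101 is `[CLS]`, 102 is `[SEP]`, 0 is `[PAD]` and 100 is `[UNK]`. The sketch below copies the non-zero IDs from the output above and splits them into the claim segment and the evidence segment.

```python
# First 33 IDs of the tensor printed above, followed by 95 padding zeros.
input_ids = [
    101, 1996, 2193, 1997, 2047, 3572, 1997, 12277, 17125, 2566,
    2095, 8908, 2013, 1015, 1012, 1016, 1516, 1017, 1012, 1018,
    2566, 1015, 1010, 2199, 2426, 7965, 3633, 1012, 102, 100,
    2861, 100, 102,
] + [0] * 95

CLS, SEP, PAD, UNK = 101, 102, 0, 100  # special-token IDs in bert-base-uncased

first_sep = input_ids.index(SEP)
second_sep = input_ids.index(SEP, first_sep + 1)
claim_ids = input_ids[1:first_sep]                  # between [CLS] and first [SEP]
evidence_ids = input_ids[first_sep + 1:second_sep]  # between the two [SEP] tokens

print(len(input_ids))           # 128
print(len(claim_ids))           # 27
print(evidence_ids)             # [100, 2861, 100]
print(evidence_ids.count(UNK))  # 2
```

The evidence segment holds only three tokens, two of which are `[UNK]`. That suggests the evidence is not being tokenized as a full sentence, so it may be worth checking what is passed as `text_pair`.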

### 2.2. Create PyTorch DataLoader

We will create an iterator for our dataset using the torch DataLoader class. This will help save on memory during training and boost the training speed.

In [None]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

if thisIsTrainRun:
  # Convert other data types to torch.Tensor
  train_labels = torch.tensor(train_labels)
  val_labels = torch.tensor(valid_labels)

  # Create the DataLoader for our training set
  train_data = TensorDataset(train_inputs, train_masks, train_labels)
  train_sampler = RandomSampler(train_data)
  train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batSize)

  # Create the DataLoader for our validation set
  val_data = TensorDataset(val_inputs, val_masks, val_labels)
  val_sampler = SequentialSampler(val_data)
  val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batSize)

## 3. Train Our Model

### 3.1. Create BertClassifier

BERT-base consists of 12 transformer layers. Each transformer layer takes in a list of token embeddings and produces the same number of embeddings with the same hidden size (or dimensions) on the output. The output of the final transformer layer for the `[CLS]` token is used as the features of the sequence to feed a classifier.

The `transformers` library has the [`BertForSequenceClassification`](https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#bertforsequenceclassification) class which is designed for classification tasks. However, we will create a new class so we can specify our own choice of classifiers.

Below we will create a BertClassifier class with a BERT model to extract the last hidden layer of the `[CLS]` token and a single-hidden-layer feed-forward neural network as our classifier.

In [None]:
%%time
import torch
import torch.nn as nn
from transformers import BertModel

# Create the BertClassifier class
class BertClassifier(nn.Module):
    """Bert Model for Classification Tasks.
    """
    def __init__(self, freeze_bert=mFreeze):
        """
        @param    freeze_bert (bool): Set `False` to fine-tune the BERT model,
                      `True` to train only the classifier
        """
        super(BertClassifier, self).__init__()
        # Specify hidden size of BERT, hidden size of our classifier, and number of labels
        D_in, H, D_out = DDDI, HHH, DDDO

        # Instantiate BERT model
        self.bert = BertModel.from_pretrained('bert-base-uncased')

        # Instantiate a feed-forward classifier with one hidden layer
        self.classifier = nn.Sequential(
            nn.Linear(D_in, H),
            nn.ReLU(),
            #nn.Dropout(0.5),
            nn.Linear(H, D_out)
        )

        # Freeze the BERT model
        if freeze_bert:
            for param in self.bert.parameters():
                param.requires_grad = False
        
    def forward(self, input_ids, attention_mask):
        """
        Feed input to BERT and the classifier to compute logits.
        @param    input_ids (torch.Tensor): an input tensor with shape (batch_size,
                      max_length)
        @param    attention_mask (torch.Tensor): a tensor that hold attention mask
                      information with shape (batch_size, max_length)
        @return   logits (torch.Tensor): an output tensor with shape (batch_size,
                      num_labels)
        """
        # Feed input to BERT
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask)
        
        # Extract the last hidden state of the token `[CLS]` for classification task
        last_hidden_state_cls = outputs[0][:, 0, :]

        # Feed input to classifier to compute logits
        logits = self.classifier(last_hidden_state_cls)

        return logits

CPU times: user 44 ms, sys: 2.03 ms, total: 46 ms
Wall time: 46.7 ms
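To get a feel for the size of the classifier defined above, its parameters can be counted by hand. The arithmetic below uses the layer sizes from the setup cell (768, 50 and 3). These are the only weights that would be updated if `freeze_bert` were set to `True`.

```python
D_in, H, D_out = 768, 50, 3  # same values as DDDI, HHH, DDDO in the setup cell

# A linear layer with n inputs and m outputs has n*m weights plus m biases.
first_layer = D_in * H + H
second_layer = H * D_out + D_out
head_params = first_layer + second_layer

print(first_layer, second_layer, head_params)  # 38450 153 38603
```

This is tiny compared with BERT-base itself, which has roughly 110 million parameters.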


### 3.2. Optimizer & Learning Rate Scheduler

To fine-tune our Bert Classifier, we need to create an optimizer. The BERT authors recommend the following hyperparameters:

- Batch size: 16 or 32
- Learning rate (Adam): 5e-5, 3e-5 or 2e-5
- Number of epochs: 2, 3, 4

Hugging Face provides the [run_glue.py](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L109) script, an example of how to use the `transformers` library. In the script, the AdamW optimizer is used.

In [None]:
from transformers import AdamW, get_linear_schedule_with_warmup

def initialize_model(epochs=4):
    """Initialize the Bert Classifier, the optimizer and the learning rate scheduler.
    """
    # Instantiate Bert Classifier
    bert_classifier = BertClassifier(freeze_bert=mFreeze)

    # Tell PyTorch to run the model on GPU
    bert_classifier.to(device)

    # Create the optimizer
    optimizer = AdamW(bert_classifier.parameters(),
                      lr=learningRate,    # Default learning rate
                      eps=zaEpsilon    # Default epsilon value
                      )

    # Total number of training steps
    total_steps = len(train_dataloader) * epochs

    # Set up the learning rate scheduler
    scheduler = get_linear_schedule_with_warmup(optimizer,
                                                num_warmup_steps=0, # Default value
                                                num_training_steps=total_steps)
    return bert_classifier, optimizer, scheduler
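The scheduler created above scales the learning rate linearly from its initial value down to zero over `total_steps`. Below is a plain-Python sketch of that rule as I understand it; `linear_schedule_factor` is a hypothetical name, and the step counts are made up for illustration.

```python
def linear_schedule_factor(step, num_training_steps, num_warmup_steps=0):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup: ramp up linearly from 0 to 1
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    # Decay: ramp down linearly from 1 to 0, never below 0
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 3e-5
total_steps = 1000  # hypothetical: len(train_dataloader) * epochs
for step in (0, 500, 1000):
    print(step, base_lr * linear_schedule_factor(step, total_steps))
```

Because `total_steps` depends on the `epochs` argument, training for more epochs than the number passed to `initialize_model` would leave the later steps with a learning rate of zero.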

### 3.3. Training Loop

We will train our Bert Classifier for 2 epochs (the value of `epochForTrain` set above). In each epoch, we will train our model and evaluate its performance on the validation set. In more detail, we will:

Training:
- Unpack our data from the dataloader and load the data onto the GPU
- Zero out gradients calculated in the previous pass
- Perform a forward pass to compute logits and loss
- Perform a backward pass to compute gradients (`loss.backward()`)
- Clip the norm of the gradients to 1.0 to prevent "exploding gradients"
- Update the model's parameters (`optimizer.step()`)
- Update the learning rate (`scheduler.step()`)

Evaluation:
- Unpack our data and load onto the GPU
- Forward pass
- Compute loss and accuracy rate over the validation set
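
The gradient-clipping step in the training list above can be sketched in plain Python. This is an illustration of the idea on a flat list of numbers, not PyTorch's implementation; `clip_by_global_norm` is a hypothetical name.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale gradients up
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before)  # 5.0
print(clipped)      # approximately [0.6, 0.8]
```

The direction of the gradient is preserved and only its length is capped, which keeps a single bad batch from causing a very large parameter update.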

The script below is commented with the details of our training and evaluation loop. 

In [None]:
import random
import time

# Specify loss function
loss_fn = nn.CrossEntropyLoss()

def set_seed(seed_value=42):
    """Set seed for reproducibility.
    """
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)

def train(model, train_dataloader, val_dataloader=None, epochs=4, evaluation=True):
    """Train the BertClassifier model.
    Note: uses the global `optimizer` and `scheduler` returned by `initialize_model`.
    """
    # Start training loop
    print("Start training...\n")
    for epoch_i in range(epochs):
        # ==============================================================================
        #               Training
        # ==============================================================================
        # Print the header of the result table
        print(f"{'Epoch':^7} | {'Batch':^7} | {'Train Loss':^12} | {'Val Loss':^10} | {'Val Acc':^9} | {'Elapsed':^9}")
        print("-"*70)

        # Measure the elapsed time of each epoch
        t0_epoch, t0_batch = time.time(), time.time()

        # Reset tracking variables at the beginning of each epoch
        total_loss, batch_loss, batch_counts = 0, 0, 0

        # Put the model into the training mode
        model.train()

        # For each batch of training data...
        for step, batch in enumerate(train_dataloader):
            batch_counts +=1
            # Load batch to GPU
            b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)

            # Zero out any previously calculated gradients
            model.zero_grad()

            # Perform a forward pass. This will return logits.
            logits = model(b_input_ids, b_attn_mask)

            # Compute loss and accumulate the loss values
            loss = loss_fn(logits, b_labels)
            batch_loss += loss.item()
            total_loss += loss.item()

            # Perform a backward pass to calculate gradients
            loss.backward()

            # Clip the norm of the gradients to 1.0 to prevent "exploding gradients"
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

            # Update parameters and the learning rate
            optimizer.step()
            scheduler.step()

            # Print the loss values and time elapsed every 300 batches
            if (step % 300 == 0 and step != 0) or (step == len(train_dataloader) - 1):
                # Calculate time elapsed since the last report
                time_elapsed = time.time() - t0_batch

                # Print training results
                print(f"{epoch_i + 1:^7} | {step:^7} | {batch_loss / batch_counts:^12.6f} | {'-':^10} | {'-':^9} | {time_elapsed:^9.2f}")

                # Reset batch tracking variables
                batch_loss, batch_counts = 0, 0
                t0_batch = time.time()

        # Calculate the average loss over the entire training data
        avg_train_loss = total_loss / len(train_dataloader)

        print("-"*70)
        # ==============================================================================
        #               Evaluation
        # ==============================================================================
        if evaluation == True:
            # After the completion of each training epoch, measure the model's performance
            # on our validation set.
            val_loss, val_accuracy = evaluate(model, val_dataloader)

            # Print performance over the entire training data
            time_elapsed = time.time() - t0_epoch
            
            print(f"{epoch_i + 1:^7} | {'-':^7} | {avg_train_loss:^12.6f} | {val_loss:^10.6f} | {val_accuracy:^9.2f} | {time_elapsed:^9.2f}")
            print("-"*70)
        print("\n")
    
    print("Training complete!")


def evaluate(model, val_dataloader):
    """After the completion of each training epoch, measure the model's performance
    on our validation set.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled during
    # the test time.
    model.eval()

    # Tracking variables
    val_accuracy = []
    val_loss = []

    # For each batch in our validation set...
    for batch in val_dataloader:
        # Load batch to GPU
        b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids, b_attn_mask)

        # Compute loss
        loss = loss_fn(logits, b_labels)
        val_loss.append(loss.item())

        # Get the predictions
        preds = torch.argmax(logits, dim=1).flatten()

        # Calculate the accuracy rate
        accuracy = (preds == b_labels).cpu().numpy().mean() * 100
        val_accuracy.append(accuracy)

    # Compute the average accuracy and loss over the validation set.
    val_loss = np.mean(val_loss)
    val_accuracy = np.mean(val_accuracy)

    return val_loss, val_accuracy
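
One detail of `evaluate` is that it averages per-batch accuracies. When the last batch is smaller than the others, this differs slightly from the accuracy computed over all examples. Below is a plain-Python sketch with made-up predictions; both function names are hypothetical.

```python
def mean_of_batch_accuracies(batches):
    """Unweighted mean of the accuracy of each batch, in percent."""
    per_batch = [sum(p == y for p, y in zip(preds, labels)) / len(labels)
                 for preds, labels in batches]
    return 100 * sum(per_batch) / len(per_batch)

def overall_accuracy(batches):
    """Accuracy over all examples, in percent."""
    correct = sum(p == y for preds, labels in batches for p, y in zip(preds, labels))
    total = sum(len(labels) for _, labels in batches)
    return 100 * correct / total

# Hypothetical data: a batch of 4 (3 correct) and a final batch of 1 (wrong).
batches = [([0, 1, 2, 1], [0, 1, 2, 0]), ([2], [1])]
print(mean_of_batch_accuracies(batches))  # 37.5
print(overall_accuracy(batches))          # 60.0
```

With a large validation set and a batch size of 32 the gap is far smaller than in this toy case, but counting correct predictions over the whole set avoids it entirely.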

Now, let's start training our BertClassifier!
<br><br>*these parts were modified by: ArielP<111524603> to enable saving, brutal training, or loading*

In [None]:
if thisIsTrainRun and enableBrutalTrain==False:
  set_seed(needForSeed)    # Set seed for reproducibility
  bert_classifier, optimizer, scheduler = initialize_model(epochs=epochForInitialization)
  train(bert_classifier, train_dataloader, val_dataloader, epochs=epochForTrain, evaluation=areYouSureAboutEvaluation)
  #save the model bro! I don't want to train it again *insert crying emoji here
  from datetime import datetime
  uidentif=datetime.now()
  uidentif="[E"+str(epochForTrain)+"B"+str(batSize)+"LR"+str(learningRate)+"]"+"["+str(uidentif.year)+"-"+str('{:02d}'.format(uidentif.month))+"-"+str('{:02d}'.format(uidentif.day))+"]["+str(uidentif.hour)+str(uidentif.minute)+str(uidentif.second)+"]"
  modelSaveDir=modelSaveDir+uidentif
  torch.save(bert_classifier,modelSaveDir+"[model].pt")
  torch.save(bert_classifier.state_dict(),modelSaveDir+"[state].pt")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Start training...

 Epoch  |  Batch  |  Train Loss  |  Val Loss  |  Val Acc  |  Elapsed 
----------------------------------------------------------------------
   1    |   300   |   0.804495   |     -      |     -     |  184.23  
   1    |   600   |   0.705062   |     -      |     -     |  184.14  
   1    |   900   |   0.673753   |     -      |     -     |  184.34  
   1    |  1200   |   0.653121   |     -      |     -     |  184.19  
   1    |  1500   |   0.633642   |     -      |     -     |  184.33  
   1    |  1800   |   0.621140   |     -      |     -     |  184.25  
   1    |  2100   |   0.619969   |     -      |     -     |  184.07  
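
A note on the checkpoint name built in the training cell: hour, minute and second are concatenated without zero-padding, so two different times can produce the same digits. Below is a minimal sketch of an unambiguous alternative using `strftime`. `checkpoint_tag` is a hypothetical helper, and the date is made up for illustration.

```python
from datetime import datetime

def checkpoint_tag(epochs, batch_size, lr, now=None):
    """Build a tag such as [E2B32LR3e-05][2023-04-12][090503]."""
    now = now or datetime.now()
    return f"[E{epochs}B{batch_size}LR{lr}]" + now.strftime("[%Y-%m-%d][%H%M%S]")

# Hypothetical timestamp, fixed so the result is reproducible.
tag = checkpoint_tag(2, 32, 3e-5, now=datetime(2023, 4, 12, 9, 5, 3))
print(tag)  # [E2B32LR3e-05][2023-04-12][090503]
```

The tag can be appended to the save path in the same way as `uidentif` in the cell above.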


In [None]:
def brutallyInitialize_model(epochs,rateOfLearning):
    """Initialize the Bert Classifier, the optimizer and the learning rate scheduler.
    """
    # Instantiate Bert Classifier
    bert_classifier = BertClassifier(freeze_bert=mFreeze)

    # Tell PyTorch to run the model on GPU
    bert_classifier.to(device)

    # Create the optimizer
    optimizer = AdamW(bert_classifier.parameters(),
                      lr=rateOfLearning,    # Default learning rate
                      eps=zaEpsilon    # Default epsilon value
                      )

    # Total number of training steps
    total_steps = len(train_dataloader) * epochs

    # Set up the learning rate scheduler
    scheduler = get_linear_schedule_with_warmup(optimizer,
                                                num_warmup_steps=0, # Default value
                                                num_training_steps=total_steps)
    return bert_classifier, optimizer, scheduler

if enableBrutalTrain:
  brutalBatSize=[16,32]
  brutalLearningRate=[5e-5,3e-5,2e-5]
  brutalEpoch=[2,3,4]

  for epp in brutalEpoch:
    for batt in brutalBatSize:
      for rate in brutalLearningRate:
        # Convert other data types to torch.Tensor
        train_labels = torch.tensor(train_labels)
        val_labels = torch.tensor(valid_labels)

        # Create the DataLoader for our training set
        train_data = TensorDataset(train_inputs, train_masks, train_labels)
        train_sampler = RandomSampler(train_data)
        train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batt)

        # Create the DataLoader for our validation set
        val_data = TensorDataset(val_inputs, val_masks, val_labels)
        val_sampler = SequentialSampler(val_data)
        val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batt)

        set_seed(needForSeed)    # Set seed for reproducibility
        bert_classifier, optimizer, scheduler = initialize_model(epp,rate)
        train(bert_classifier, train_dataloader, val_dataloader, epochs=epp, evaluation=areYouSureAboutEvaluation)
        # Save the model so it does not have to be trained again
        from datetime import datetime
        now=datetime.now()
        # epp, batt and rate are numbers, so format them instead of concatenating them to strings
        uidentif="[E{}B{}LR{}][{}][{}]".format(epp,batt,rate,now.strftime("%Y-%m-%d"),now.strftime("%H%M%S"))
        # Build a fresh prefix for every run; reassigning modelSaveDir would make the path grow on each iteration
        savePrefix=modelSaveDir+uidentif
        torch.save(bert_classifier,savePrefix+"[model].pt")
        torch.save(bert_classifier.state_dict(),savePrefix+"[state].pt")
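
The sweep above trains one model per combination of epochs, batch size and learning rate, and each run gets its own schedule because `total_steps` depends on both the batch size and the number of epochs. Below is a minimal sketch of that bookkeeping in plain Python. The helpers `linear_lr` and `run_tag` and the value of `n_train` are hypothetical names I made up for illustration, and the decay formula is my understanding of a linear schedule with zero warmup steps, not code taken from the library.

```python
import math
from itertools import product

def linear_lr(base_lr, step, total_steps, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

def run_tag(epochs, batch_size, lr):
    """Build the '[E..B..LR..]' part of a checkpoint name for one configuration."""
    return "[E{}B{}LR{}]".format(epochs, batch_size, lr)

n_train = 1000  # hypothetical training-set size, for illustration only

# Same search space as the sweep above: epochs x batch sizes x learning rates
grid = list(product([2, 3, 4], [16, 32], [5e-5, 3e-5, 2e-5]))

for epochs, batch_size, lr in grid:
    # A DataLoader that keeps the last partial batch has ceil(n / batch_size) batches
    steps_per_epoch = math.ceil(n_train / batch_size)
    total_steps = steps_per_epoch * epochs
    halfway_lr = linear_lr(lr, total_steps // 2, total_steps)
    print(run_tag(epochs, batch_size, lr), total_steps, halfway_lr)
```

The grid has 18 configurations, so a full sweep costs 18 training runs. Encoding the configuration in the file name keeps the checkpoints from overwriting each other.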

In [None]:
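if not thisIsTrainRun and not enableBrutalTrain:
  # Rebuild the architecture, then load the saved weights onto the current device
  bert_classifier=BertClassifier()
  bert_classifier.load_state_dict(torch.load(modelLoadDir, map_location=device))
  bert_classifier.to(device)
  bert_classifier.eval()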

## 4. Predictions on Test Set

### 4.1. Data Preparation

Before making predictions on the test set, we need to repeat the preprocessing and encoding steps that were applied to the training data. Fortunately, we have already written the `preprocessing_for_bert` function to do that for us.

In [None]:
df=pd.read_csv(testFileDir)

#checking null data
nanCl=df['claim'].isnull().sum()
nanEv=df['evidences'].isnull().sum()
print("blank cells for claim: ",nanCl)
print("blank cells for evidences: ",nanEv)
 
#clean null data
if nanCl>0 or nanEv>0:
  print("performing necessary cleaning")
  # replace missing cells with a placeholder string so that tokenization does not fail on NaN
  df["claim"]=df["claim"].fillna("encoding error")
  df["evidences"]=df["evidences"].fillna("encoding error")
else:
  print("no cleaning needed")
df2=df

display(df2)

blank cells for claim:  0
blank cells for evidences:  2
performing necessary cleaning


Unnamed: 0,Id,claim,evidences
0,23567,Kareena Kapoor was a commercial failure.,This initial success was followed by a series ...
1,10499,Oscar Isaac acted in Twilight.,"Oscar Isaac -LRB- born 1969 -RRB- , actor and ..."
2,72143,Designated Survivor (TV series) is a televisio...,Designated Survivor is an American political d...
3,192200,"Sacre-Coeur, Paris is an embodiment of cultism.",`` Atop The Sacre-Coeur '' by Franck Pourcel
4,25760,"Pierce County, Washington is the home of a foo...",Pierce County Courthouse -LRB- Washington -RRB...
...,...,...,...
995,10150,The IPhone 4 has an advanced mobile operating ...,"It debuted with iOS 5 , the fifth major versio..."
996,116026,Kareem Abdul-Jabbar is ranked in rebounds.,"At the time of his retirement in 1989 , Abdul-..."
997,108163,SpongeBob SquarePants is a long running televi...,He began developing SpongeBob SquarePants into...
998,67747,Jed Whedon was born in a hospital.,"Jed Tucker Whedon -LRB- born July 18 , 1974 -R..."


In [None]:
""" TODO: need to read and APPEND evidences HERE """
def read_test_data(test_file):
  """Return the ids, claims and raw evidence strings of the test DataFrame as lists."""
  df=test_file
  ids=df['Id'].tolist()
  claims=df['claim'].tolist()
  # each evidence entry is kept as one raw string here; it is split in a later cell
  evidences=df['evidences'].tolist()
  return ids, claims, evidences

In [None]:
test_ids, test_claims, test_evidences = read_test_data(df2)

In [None]:
trueTest_evidences=[]
for evid in test_evidences:
  trueTest_evidences.append(evid.split(' , '))

In [None]:
# Sanity check: inspect the first test example
print(test_ids[0], test_claims[0],trueTest_evidences[0])

23567 Kareena Kapoor was a commercial failure. ['This initial success was followed by a series of commercial failures and repetitive roles', 'which garnered her negative reviews .']
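
The output above shows that splitting on `' , '` also cuts a single sentence at its comma. Here is a small sketch of that behaviour, using the evidence string reconstructed by re-joining the two pieces printed above. The helper name `split_evidence` is hypothetical, and treating `' , '` as the evidence separator is my assumption about the data format.

```python
# Evidence string reconstructed from the output above by re-joining its two pieces
raw_evidence = (
    "This initial success was followed by a series of commercial failures "
    "and repetitive roles , which garnered her negative reviews ."
)

def split_evidence(text, sep=" , "):
    """Split one raw evidence string on the separator (hypothetical helper)."""
    return text.split(sep)

pieces = split_evidence(raw_evidence)

# The single sentence comes back as two fragments
for piece in pieces:
    print(repr(piece))

# Joining with the same separator restores the original string
restored = " , ".join(pieces)
print(restored == raw_evidence)
```

If the evidences are really separated by another delimiter, the split should use that delimiter instead, so that commas inside a sentence do not create extra fragments.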


In [None]:
# Run `preprocessing_for_bert` on the test set
print('Tokenizing data...')
test_inputs, test_masks = preprocessing_for_bert(test_claims, test_evidences) 
test_ids = torch.tensor(test_ids)

# Create the DataLoader for our test set
test_dataset = TensorDataset(test_inputs, test_masks, test_ids)
test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(test_dataset, sampler=test_sampler, batch_size=batSizeTest)

Tokenizing data...


Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.
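
This message is a notice, not an error: when a claim and evidence pair is longer than the maximum length, tokens are cut off and the removed tokens are not returned. The idea behind a `longest_first` strategy can be sketched in plain Python as follows. This is a simplified illustration with a hypothetical function name and made-up tokens; the tokenizer's real implementation may differ in details such as tie-breaking and special tokens.

```python
def truncate_longest_first(claim_tokens, evidence_tokens, max_tokens):
    """Drop one token at a time from the end of the longer sequence
    until the pair fits into max_tokens (simplified sketch)."""
    claim, evidence = list(claim_tokens), list(evidence_tokens)
    max_tokens = max(0, max_tokens)
    while len(claim) + len(evidence) > max_tokens:
        if len(claim) > len(evidence):
            claim.pop()
        else:
            # On a tie, this sketch shortens the second sequence
            evidence.pop()
    return claim, evidence

# Made-up tokens: a short claim and a longer evidence
claim = ["oscar", "isaac", "acted", "in"]
evidence = ["oscar", "isaac", "born", "1969", "is", "an", "actor", "and", "a", "musician"]

short_claim, short_evidence = truncate_longest_first(claim, evidence, max_tokens=8)
print(len(short_claim), len(short_evidence))
print(short_evidence)
```

With a budget of 8 tokens only the evidence is shortened, because it stays the longer sequence until both have the same length.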

### 4.2. Predictions

In [None]:
import torch.nn.functional as F

def bert_predict(model, test_dataloader):
    """Perform a forward pass on the trained BERT model to predict a label
    for each example in the test set.
    """
    # Put the model into the evaluation mode. The dropout layers are disabled during
    # the test time.
    model.eval()

    all_preds = []

    # For each batch in our test set...
    for batch in test_dataloader:
        # Load batch to GPU
        b_input_ids, b_attn_mask, b_ids = tuple(t.to(device) for t in batch)

        # Compute logits
        with torch.no_grad():
            logits = model(b_input_ids, b_attn_mask)

        # Get the predicted class index of every example in the batch
        preds = torch.argmax(logits, dim=1).flatten()

        # Store one row per example, not only the first example of the batch
        for in_id, pred in zip(b_ids.cpu().tolist(), preds.cpu().tolist()):
            all_preds.append(dict(Id=in_id, Category=id2tag[pred]))

    return all_preds
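
The prediction step boils down to taking the index of the largest logit for every example and mapping it to a label name. The following torch-free sketch shows that logic on made-up numbers. The ids, the logits and the label names in this `id2tag` are all hypothetical; the real mapping is the one defined earlier in the notebook.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def collect_predictions(batches, id2tag):
    """Turn (ids, logits) batches into one {'Id', 'Category'} row per example."""
    rows = []
    for ids, logits in batches:
        for example_id, example_logits in zip(ids, logits):
            rows.append({"Id": example_id, "Category": id2tag[argmax(example_logits)]})
    return rows

# Hypothetical label mapping and logits, for illustration only
id2tag = {0: "LABEL_A", 1: "LABEL_B", 2: "LABEL_C"}
batches = [
    ([101, 102], [[2.0, 0.1, -1.0], [0.3, 0.2, 1.5]]),  # a batch with two examples
    ([103], [[-0.5, 0.9, 0.0]]),                        # a last, smaller batch
]

rows = collect_predictions(batches, id2tag)
for row in rows:
    print(row)
```

Looping over every example in a batch matters as soon as the batch size is larger than 1; otherwise the output file would contain fewer rows than the test set.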

In [None]:
all_preds = bert_predict(bert_classifier, test_dataloader)
df = pd.DataFrame.from_dict(all_preds)
df.to_csv(outputFileDir, index=False, header=True)
df

Unnamed: 0,Id,Category
0,23567,SUPPORTS
1,10499,SUPPORTS
2,72143,SUPPORTS
3,192200,SUPPORTS
4,25760,NOT ENOUGH INFO
...,...,...
995,10150,SUPPORTS
996,116026,SUPPORTS
997,108163,SUPPORTS
998,67747,NOT ENOUGH INFO
