# **Open Domain QA with BERT**

# Introduction

The following is the first example in the SQuAD `train` dataset:

**question:** what percentage of imperial 's staff was classified as world leading in 2008 ?

**context:** the 2008 research assessment exercise returned 26 % of the 1225 staff submitted as being world-leading ( 4* ) and a further 47 % as being internationally excellent ( 3* ) . the 2008 research assessment exercise also showed five subjects – pure mathematics , epidemiology and public health , chemical engineering , civil engineering , and mechanical , aeronautical and manufacturing engineering – were assessed to be the best [ clarification needed ] in terms of the proportion of internationally recognised research quality .

**answer:** 26 %

**span:** 6 7
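
The span holds inclusive, zero-based token indices into the context. A minimal check of the example above, assuming the context is tokenized by splitting on whitespace (the variable names are illustrative, not part of the dataset):

```python
# Prefix of the context shown above; whitespace splitting is an assumption
# about how the preprocessed SQuAD files were tokenized.
context = "the 2008 research assessment exercise returned 26 % of the 1225 staff submitted"
span_start, span_end = 6, 7  # inclusive token indices from the `span` field

context_tokens = context.split()
answer = " ".join(context_tokens[span_start:span_end + 1])
print(answer)
```

Slicing with `span_end + 1` is needed because the end index is inclusive.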

# Initial Set-up

In [None]:
# Initial set up
!pip install transformers
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import transformers
from transformers import DistilBertTokenizer, DistilBertForQuestionAnswering
from transformers import get_scheduler
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import accuracy_score, f1_score
from tqdm.auto import tqdm
import collections
import os
import random
import time
from datetime import timedelta
import string, re

transformers.logging.set_verbosity_error()  # output only ERROR level logs

In [None]:
# Set seed and working device
def set_seed(seed_value=40):
    """
    Set seed for reproducibility.
    """
    random.seed(seed_value)
    np.random.seed(seed_value)
    torch.manual_seed(seed_value)
    torch.cuda.manual_seed_all(seed_value)


set_seed()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
n_gpu = torch.cuda.device_count()
# the device name is only available when a GPU is present
if torch.cuda.is_available():
    print(torch.cuda.get_device_name())

cuda
Tesla P100-PCIE-16GB


# Data processing

Here we use the pre-trained DistilBERT [Tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer#transformers.PreTrainedTokenizer.__call__) to take our text inputs from the SQuAD dataset (i.e., a list of questions and a list of corresponding contexts) and turn them into tensors. The tokenizer call takes the following arguments:
- `padding`: Pads shorter question/context pairs so that all inputs in the call have the same length.
- `truncation`: Truncates the question and/or context so that the combined input fits within `max_length` tokens (BERT accepts at most 512 tokens, not characters). `truncation=True` selects the `longest_first` strategy.
- `max_length`: Controls the maximum length used by the `truncation` and `padding` parameters.
- `return_tensors`: With `"pt"`, returns PyTorch tensors, from which `input_ids` and `attention_mask` can be read.
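
To make the padding and truncation arguments concrete, here is a minimal pure-Python sketch of the idea behind `longest_first` truncation and right-padding. It is an illustration only, not the library's implementation: the helper names are hypothetical, and it assumes a pair layout of `[CLS] question [SEP] context [SEP]` (three special tokens) and a padding id of 0.

```python
def truncate_longest_first(question, context, max_length, num_special_tokens=3):
    """Drop trailing tokens from whichever sequence is longer until the pair fits.

    num_special_tokens=3 assumes the layout [CLS] question [SEP] context [SEP].
    """
    question, context = list(question), list(context)
    budget = max(max_length - num_special_tokens, 0)
    while len(question) + len(context) > budget:
        # on a tie, the context (second sequence) loses a token
        if len(context) >= len(question):
            context.pop()
        else:
            question.pop()
    return question, context


def pad_with_mask(token_ids, length, pad_id=0):
    """Right-pad token_ids to `length` and build the matching attention mask."""
    n_pad = length - len(token_ids)
    padded = list(token_ids) + [pad_id] * n_pad
    mask = [1] * len(token_ids) + [0] * n_pad
    return padded, mask


# A short question paired with an overly long context.
question = ["when", "?"]
context = ["starting", "today"] + ["!"] * 600
short_question, short_context = truncate_longest_first(question, context, max_length=512)
total_length = len(short_question) + len(short_context) + 3

# Padding a short input: real tokens get mask 1, padding gets mask 0.
padded_ids, attention_mask = pad_with_mask([101, 2339, 1029, 102], length=8)
```

Only the context is shortened in this example because it is always the longer of the two sequences; the attention mask tells the model which positions are padding.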

In [None]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

def convert_to_BERT_tensors(questions, contexts):
    """
    Takes a parallel list of question strings and context strings and converts them into BERT tensors.
    """
    tokenized = tokenizer(questions, contexts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    return tokenized["input_ids"], tokenized["attention_mask"]

In [None]:
test_questions = ["Why?", "How?", "When?"]
test_contexts = ["Because we can", 
                 "Working hard while having fun!", 
                 "Starting today!" + "".join(["!"] * 512) + "Let's go!"]

ids, mask = convert_to_BERT_tensors(test_questions,test_contexts)
print(f"ids shape: {ids.shape}")
print(f"mask shape: {mask.shape}")

ids shape: torch.Size([3, 512])
mask shape: torch.Size([3, 512])


In [None]:
tokenizer.tokenize("[CLS]" + test_questions[0] + "[SEP]" + test_contexts[0])

['[CLS]', 'why', '?', '[SEP]', 'because', 'we', 'can']

In [None]:
# First row: '[CLS]', 'why', '?', '[SEP]', 'because', 'we', 'can' => (With padding)
ids[0][0:100]

tensor([ 101, 2339, 1029,  102, 2138, 2057, 2064,  102,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0])

In [None]:
tokenizer.tokenize("[CLS]" + test_questions[2] + "[SEP]" + test_contexts[2])[:10]

['[CLS]', 'when', '?', '[SEP]', 'starting', 'today', '!', '!', '!', '!']

In [None]:
# Third row: '[CLS]', 'when', '?', '[SEP]', 'starting', 'today' => (Truncated, without padding)
ids[2][0:100]

tensor([ 101, 2043, 1029,  102, 3225, 2651,  999,  999,  999,  999,  999,  999,
         999,  999,  999,  999,  999,  999,  999,  999,  999,  999,  999,  999,
         999,  999,  999,  999,  999,  999,  999,  999,  999,  999,  999,  999,
         999,  999,  999,  999,  999,  999,  999,  999,  999,  999,  999,  999,
         999,  999,  999,  999,  999,  999,  999,  999,  999,  999,  999,  999,
         999,  999,  999,  999,  999,  999,  999,  999,  999,  999,  999,  999,
         999,  999,  999,  999,  999,  999,  999,  999,  999,  999,  999,  999,
         999,  999,  999,  999,  999,  999,  999,  999,  999,  999,  999,  999,
         999,  999,  999,  999])

With an approach in place to convert questions and contexts into tensors, we now turn to the answers provided in the SQuAD dataset, which come in the form of a string (e.g., "26 %") and a span (e.g., [6, 7]) of token indices into the context.

Since the tokenizer merges each question and context into a single input, the token indices in the span need to be recomputed relative to that combined input. The following function does this.

In [None]:
def get_answer_span_tensor(question, context, answer):
    """
    Recomputes the answer span by combining the question and context into an input and
    identifying the correct answer span inside of it.
    If the answer doesn't appear in the input, it will return [0, 0].
    """
    input_str = "[CLS]" + question + "[SEP]" + context
    input_tokens = tokenizer.tokenize(input_str)
    answer_tokens = tokenizer.tokenize(answer)
    span_len = len(answer_tokens)
    for i in range(min(len(input_tokens) - span_len + 1, 512 - span_len - 1)):
        if input_tokens[i:i + span_len] == answer_tokens:
            answer_span = torch.tensor([i,i+span_len - 1])
            break
    else:
        answer_span = torch.tensor([0,0])
        
    return answer_span

In [None]:
test_answer = "Having fun!"
test_answer_span = get_answer_span_tensor(test_questions[1], test_contexts[1], test_answer)
test_answer_span

tensor([7, 9])

In [None]:
input_str = "[CLS]" + test_questions[1] + "[SEP]" + test_contexts[1]
[(i, t) for i, t in enumerate(tokenizer.tokenize(input_str))]

[(0, '[CLS]'),
 (1, 'how'),
 (2, '?'),
 (3, '[SEP]'),
 (4, 'working'),
 (5, 'hard'),
 (6, 'while'),
 (7, 'having'),
 (8, 'fun'),
 (9, '!')]
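
The index shift can be reproduced without the tokenizer. The sketch below is an illustration under simplifying assumptions: it uses whitespace tokens instead of WordPiece tokens, and `find_sublist` is a hypothetical helper rather than part of the notebook. It shows that the span moves by the question length plus two special tokens.

```python
def find_sublist(tokens, target):
    """Return the inclusive (start, end) of the first match of target, or (0, 0)."""
    n = len(target)
    if n == 0:
        return 0, 0
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            return i, i + n - 1
    return 0, 0


# Whitespace tokens stand in for the tokenizer output shown above.
question = "how ?".split()
context = "working hard while having fun !".split()
answer = "having fun !".split()

context_span = find_sublist(context, answer)
merged = ["[CLS]"] + question + ["[SEP]"] + context
merged_span = find_sublist(merged, answer)

# [CLS] and [SEP] add two positions in front of the context.
offset = len(question) + 2
```

Here `merged_span` matches the `tensor([7, 9])` returned by `get_answer_span_tensor`, and it equals `context_span` shifted by `offset`.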

Lastly, we will proceed to create the data structure and dataloaders to prepare our training, dev, and test datasets.

In [None]:
BATCH_SIZE=16

class QAdataset(Dataset):
    """
    A custom dataset for housing QA data, including input_data, output_data, and padding mask.
    """
    def __init__(self, input_data, output_data, mask):
        self.input_data = input_data
        self.output_data = output_data
        self.mask = mask

    def __len__(self):
        return len(self.input_data)

    def __getitem__(self, index):
        data_val = self.input_data[index]
        target = self.output_data[index]
        mask = self.mask[index]
        return data_val, target, mask


def read_files(path, split):
    """
    Reads the SQuAD question, context, and answer files of a given split
    and returns their lines as three parallel lists of strings.
    """
    for entry in os.listdir(path):
        if split in entry:
            with open(os.path.join(path, entry), "r", encoding="utf-8") as f:
                if "question" in entry:
                    questions = f.readlines()
                elif "context" in entry:
                    contexts = f.readlines()
                elif "answer" in entry:
                    answers = f.readlines()
                else:
                    continue
    return questions, contexts, answers


def prepare_QA_dataset(split, path):
    """
    Prepares the PyTorch dataset for the train, dev, and test splits.
    """
    questions, contexts, answers = read_files(path, split)
    ids, mask = convert_to_BERT_tensors(questions, contexts)
    spans = []
    for question, context, answer in zip(questions, contexts, answers):
        spans.append(get_answer_span_tensor(question, context, answer))
    return QAdataset(ids, spans, mask)


def prepare_dataloaders(split, squad_path, output_path, dtl_fn, 
                        batch_size=BATCH_SIZE, shuffle=False):
    """
    Helper function to generate and persist Dataloaders for each split.
    """
    data = prepare_QA_dataset(split, squad_path)
    dataloader = DataLoader(data, 
                            batch_size=batch_size, 
                            shuffle=shuffle)
    torch.save(dataloader, output_path + f"/{dtl_fn}")
    print(f"Generated and saved {dtl_fn}")

In [None]:
# define relevant dir paths
squad_path = '/content/gdrive/MyDrive/Colab Notebooks/open_domain_QA_BERT/data/'
small_squad_path = "/content/gdrive/MyDrive/Colab Notebooks/data/small/"
artifacts_dirpath = "/content/gdrive/MyDrive/Colab Notebooks/open_domain_QA_BERT/artifacts/"

# prepare Dataloaders
prepare_dataloaders("train", squad_path, artifacts_dirpath, "train.dtl")
prepare_dataloaders("dev", squad_path, artifacts_dirpath, "dev.dtl")
prepare_dataloaders("test", squad_path, artifacts_dirpath, "test.dtl")

---
# Model training

To fine-tune our model for the QA task at hand, we will load the `DistilBertForQuestionAnswering` module, initialized from the pre-trained `distilbert-base-uncased` language model. Its span-prediction head (a linear layer producing start and end logits) is newly initialized and learned during fine-tuning.

DistilBERT is a "lightweight" version of BERT obtained through the [knowledge distillation](https://medium.com/huggingface/distilbert-8cf3380435b5) technique, which makes it much faster to train and suits our demonstration purposes nicely.

In the interest of time, we will only set up one loss function, optimizer, and learning rate value (i.e., no hyperparameter optimization will be conducted), and run the model training for just three epochs.

In [None]:
# load Dataloaders
artifacts_dirpath = "/content/gdrive/MyDrive/Colab Notebooks/open_domain_QA_BERT/artifacts/"

train_iter = torch.load(artifacts_dirpath + "/train.dtl")
dev_iter = torch.load(artifacts_dirpath + "/dev.dtl")
test_iter = torch.load(artifacts_dirpath + "/test.dtl")

In [None]:
# load dev and test questions, contexts, and answers
squad_path = '/content/gdrive/MyDrive/Colab Notebooks/open_domain_QA_BERT/data/'
dev_questions, dev_contexts, dev_gold_answers = read_files(squad_path, "dev")
test_questions, test_contexts, test_gold_answers = read_files(squad_path, "test")

In [None]:
# Parameters
LR = 3e-5
MAX_GRAD_NORM = 1.0
EPOCHS = 3
WARMUP_PROPORTION = 0.1
NUM_TRAINING_STEPS = len(train_iter) * EPOCHS
NUM_WARMUP_STEPS = int(NUM_TRAINING_STEPS * WARMUP_PROPORTION)  # the scheduler expects a step count
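
These parameters feed a linear schedule with warmup: the learning rate rises linearly from 0 to `LR` over the warmup steps and then decays linearly back to 0. A minimal sketch of that multiplier follows; the function name is illustrative, and the total of 14,544 steps is taken from the training progress bar in this notebook.

```python
def linear_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # linear ramp from 0 to 1 during warmup
        return step / max(1, num_warmup_steps)
    # linear decay from 1 to 0 over the remaining steps
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 3e-5
total_steps = 14544                    # value shown by the training progress bar
warmup_steps = int(total_steps * 0.1)  # 10% warmup

for step in (0, warmup_steps // 2, warmup_steps, total_steps):
    lr = base_lr * linear_warmup_multiplier(step, warmup_steps, total_steps)
    print(f"step {step:>5}: lr = {lr:.2e}")
```

The peak learning rate is reached exactly at the end of warmup, so most of training happens on the decaying part of the schedule.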

In [None]:
ckpt_path = "/content/gdrive/MyDrive/Colab Notebooks/open_domain_QA_BERT/ckpt/"
model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased").to(device)
optimizer = optim.AdamW(model.parameters(), lr=LR)
lr_scheduler = get_scheduler(name="linear", 
                             optimizer=optimizer, 
                             num_warmup_steps=NUM_WARMUP_STEPS,
                             num_training_steps=NUM_TRAINING_STEPS)
criterion = nn.CrossEntropyLoss()

In [None]:
# model parameter count
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"The model has {count_parameters(model):,} trainable parameters")

The model has 66,364,418 trainable parameters


In [None]:
def train(model, optimizer, scheduler, criterion, iterator, t_batch):
    epoch_loss = 0
    model.train()
    for i, (ids, spans, mask) in enumerate(iterator):
        # reset gradients
        model.zero_grad()
        # loads inputs into GPU
        ids = ids.to(device)
        spans = spans.to(device)
        mask = mask.to(device)
        # calls the model with inputs
        outputs = model(ids, mask)
        # the loss is the sum of the cross-entropies of the start and end positions
        start_loss = criterion(outputs.start_logits, spans[:, 0])
        end_loss = criterion(outputs.end_logits, spans[:, 1])
        total_loss = start_loss + end_loss
        # performs backward pass, then updates the weights and the learning rate
        total_loss.backward()
        optimizer.step()
        scheduler.step()
        # accumulate the loss once per batch, as a Python float
        # (adding both the tensor and its .item() would double-count it)
        epoch_loss += total_loss.item()
        # delete used variables to free GPU memory
        del ids, spans, mask
        # clears cache
        torch.cuda.empty_cache()
        # print status every 20 steps
        if i % 20 == 0:
            time_elapsed = time.time() - t_batch
            print(f"Processed {i * BATCH_SIZE} QA pairs of {len(iterator.dataset)}")
            print(f"Last loss: {total_loss.item()}")
            print(f"Time_elapsed: {str(timedelta(seconds=time_elapsed))}")
        # `progress_bar` is the global tqdm bar created before the training loop
        progress_bar.update(1)
    return epoch_loss
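
A quick sanity check on the loss scale: before fine-tuning, the span head knows nothing, so its predictions should be close to uniform over the input positions. The sketch below computes the summed start and end cross-entropy for exactly uniform logits, assuming inputs padded to 512 positions; the helper is a plain-Python stand-in for `nn.CrossEntropyLoss` on a single example, and the target indices are arbitrary.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one example: -log softmax(logits)[target]."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target]


seq_len = 512                      # assumed padded input length
uniform_logits = [0.0] * seq_len   # an untrained head: no position preferred

start_loss = cross_entropy(uniform_logits, target=22)
end_loss = cross_entropy(uniform_logits, target=23)
total_loss = start_loss + end_loss  # equals 2 * ln(512)
print(f"{total_loss:.2f}")
```

The result, 2 ln(512) ≈ 12.48, is close to the first loss logged by the training run (about 12.63), which suggests the loss is wired up as intended.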

In [None]:
def evaluate(model, tokenizer, iterator, gold_answers):
    pred_starts, pred_ends = [], []
    best_starts, best_ends = [], []
    gold_starts, gold_ends = [], []
    pred_answers, best_answers = [], []
    pred_f1_score = 0
    best_f1_score = 0
    pred_em_score = 0
    best_em_score = 0
    progress_bar = tqdm(range(len(iterator)))
    model.eval()

    with torch.no_grad():
        for ids, spans, mask in iterator:
            # loads inputs into GPU
            ids, spans, mask = ids.to(device), spans.to(device), mask.to(device)
            output = model(ids, mask)
            # take the argmax of the start and end logits as the predicted span
            start_logits = output.start_logits.to("cpu").detach()
            batch_pred_starts = np.argmax(start_logits.numpy(), axis=1)
            pred_starts.extend(list(batch_pred_starts))
            end_logits = output.end_logits.to("cpu").detach()
            batch_pred_ends = np.argmax(end_logits.numpy(), axis=1)
            pred_ends.extend(list(batch_pred_ends))
            # prepare ground truth data
            gold_targets = spans.to("cpu").detach().numpy()
            gold_starts.extend(list(gold_targets[:, 0]))
            gold_ends.extend(list(gold_targets[:, 1]))
            # compute best spans from the log-probabilities
            log_start_probs = F.log_softmax(start_logits, dim=1).numpy()
            log_end_probs = F.log_softmax(end_logits, dim=1).numpy()
            best_spans = select_best_answer_span(log_start_probs, log_end_probs, distance=10)
            best_starts.extend([best_span[0] for best_span in best_spans])
            best_ends.extend([best_span[1] for best_span in best_spans])
            # gathers predicted text answers for the EM and F1 scores; decode the
            # predicted spans, since decoding the gold `spans` here would make
            # pred EM/F1 independent of the model (constant across epochs)
            pred_spans = list(zip(batch_pred_starts, batch_pred_ends))
            pred_answers.extend(get_pred_answers(ids, pred_spans, tokenizer))
            best_answers.extend(get_pred_answers(ids, best_spans, tokenizer))
            progress_bar.update(1)

    # compute evaluation metrics (accuracy is measured on the start index only)
    pred_acc = accuracy_score(gold_starts, pred_starts)
    best_acc = accuracy_score(gold_starts, best_starts)

    assert len(pred_answers) == len(best_answers) == len(gold_answers)
    for pred_answer, best_answer, gold_answer in zip(pred_answers, best_answers, gold_answers):
        pred_em_score += compute_exact_match(pred_answer, gold_answer)
        best_em_score += compute_exact_match(best_answer, gold_answer)
        pred_f1_score += compute_f1(pred_answer, gold_answer)
        best_f1_score += compute_f1(best_answer, gold_answer)
        
    
    results = {
        "pred_acc": pred_acc,
        "best_acc": best_acc,
        "pred_em": pred_em_score / len(pred_answers),
        "best_em": best_em_score / len(best_answers),
        "pred_f1": pred_f1_score / len(pred_answers),
        "best_f1": best_f1_score / len(best_answers)
    }

    return results
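
`evaluate` compares two ways of choosing a span: independent argmaxes of the start and end logits, and the constrained search done by `select_best_answer_span`. The sketch below is a brute-force reference for the constrained version, useful for checking results on small inputs. It is not the notebook's implementation, and the probabilities are made up for illustration.

```python
import math


def best_span_brute_force(log_start_probs, log_end_probs, distance=10):
    """Return the (start, end) pair maximizing log p_start + log p_end
    subject to start <= end <= start + distance."""
    n = len(log_start_probs)
    best_score, best_span = float("-inf"), (0, 0)
    for start in range(n):
        last_end = min(start + distance, n - 1)
        for end in range(start, last_end + 1):
            score = log_start_probs[start] + log_end_probs[end]
            if score > best_score:
                best_score, best_span = score, (start, end)
    return best_span


# Made-up probabilities for a 5-token input.
start_probs = [0.1, 0.6, 0.1, 0.1, 0.1]
end_probs = [0.5, 0.1, 0.1, 0.25, 0.05]
log_start = [math.log(p) for p in start_probs]
log_end = [math.log(p) for p in end_probs]

# Independent argmaxes can yield an end that precedes the start.
independent_argmax = (log_start.index(max(log_start)), log_end.index(max(log_end)))
constrained = best_span_brute_force(log_start, log_end, distance=10)
print(independent_argmax, constrained)
```

In this example the independent argmaxes give an invalid span (the end comes before the start), while the constrained search returns a valid one. The brute-force loop costs O(n · distance) per example, which is fine for verification.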

In [None]:
def select_best_answer_span(start_probs, end_probs, distance=20):
    '''Given two matrices of log-probabilities that each token index is the
    start or the end of an answer span, respectively, finds the highest-scoring
    span for each row under the restriction that the end index is at or after
    the start index and no more than `distance` tokens after it. Returns a list
    of (start index, end index) 2-tuples, one for each row of start/end_probs.'''
    best_starts = np.argsort(start_probs * -1, axis=1)
    best_ends = np.argsort(end_probs * -1, axis=1)
    output_spans = []
    for i in range(len(start_probs)):
        step = 0
        found = False
        sorted_spans = []
        bound = 0
        while not found:
            sorted_spans.extend(
                [(start_probs[i, best_starts[i,j]] + end_probs[i,best_ends[i,step]], 
                                     best_starts[i,j], best_ends[i,step]) for j in range(step + 1)]
                                )
            sorted_spans.extend([(start_probs[i, best_starts[i,step]] + end_probs[i,best_ends[i,j]], 
                                     best_starts[i,step], best_ends[i,j]) for j in range(step)])
              
            sorted_spans.sort()
            bound = max(start_probs[i,best_starts[i, 0]] + end_probs[i, best_ends[i, step]], 
                        start_probs[i,best_starts[i, step]] + end_probs[i, best_ends[i, 0]])
            
            step += 1

            while sorted_spans and sorted_spans[-1][0] >= bound and not found:
                curr = sorted_spans.pop()
                if curr[1] <= curr[2] <= curr[1] + distance:
                    found = (curr[1], curr[2])
                    
        output_spans.append(found)
    return output_spans


def to_list(tensor):
    """
    Helper function to convert tensors into lists.
    """
    return tensor.detach().cpu().tolist()


def get_pred_answers(batch_ids, spans, tokenizer):
    """Takes token text ids and spans from a given batch and outputs text answers."""
    answers = []
    for i in range(len(spans)):
        answers.append(tokenizer.decode(batch_ids[i, spans[i][0]: spans[i][1] + 1]))
    return answers


def normalize_text(s):
    """
    Applies regular text processing techniques on a string.
    Pre-requisite for computing F1 scores between true and pred answers. 
    """
    def remove_articles(text):
        regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
        return re.sub(regex, " ", text)

    def white_space_fix(text):
        return " ".join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return "".join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    output_text = white_space_fix(remove_articles(remove_punc(lower(s))))
    
    return output_text


def compute_exact_match(prediction, gold):
    return int(normalize_text(prediction) == normalize_text(gold))


def compute_f1(prediction, gold):
    """
    Computes F1 Score on top of the words included in pred and gold answers
      - Precision: Proportion of common tokens over total predicted tokens
      - Recall: Proportion of common tokens over total gold tokens
    """
    pred_tokens = normalize_text(prediction).split()
    gold_tokens = normalize_text(gold).split()
    
    # if either the prediction or the gold is no-answer then f1 = 1 if they agree, 0 otherwise
    if len(pred_tokens) == 0 or len(gold_tokens) == 0:
        return int(pred_tokens == gold_tokens)
    
    common_tokens = set(pred_tokens) & set(gold_tokens)
    
    # if there are no common tokens then f1 = 0
    if len(common_tokens) == 0:
        return 0
    
    prec = len(common_tokens) / len(pred_tokens)
    rec = len(common_tokens) / len(gold_tokens)
    
    return 2 * (prec * rec) / (prec + rec)
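
`compute_f1` above intersects *sets* of tokens, so repeated tokens are counted once in the overlap but every time in the lengths. The official SQuAD evaluation script counts overlap with multiplicity instead. A minimal sketch of that variant, operating on already-normalized token lists (the function name is illustrative):

```python
from collections import Counter


def token_f1(pred_tokens, gold_tokens):
    """Token-level F1 where overlapping tokens are counted with multiplicity."""
    if not pred_tokens or not gold_tokens:
        # both empty counts as agreement, exactly one empty as a miss
        return float(pred_tokens == gold_tokens)
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


partial = token_f1("26 % of staff".split(), "26 %".split())
repeated = token_f1(["new", "new"], ["new", "new"])
print(partial, repeated)
```

The two versions agree whenever neither answer repeats a token. They differ on repeats: for two identical answers `["new", "new"]`, the multiset version returns 1.0, whereas the set-based computation yields 0.5.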

In [None]:
progress_bar = tqdm(range(NUM_TRAINING_STEPS))
t_epoch, t_batch = time.time(), time.time()

for epoch in range(EPOCHS):
    print(f"Epoch: {epoch + 1}/{EPOCHS}")
    epoch_loss = train(model, optimizer, lr_scheduler, criterion, train_iter, t_batch)
    dev_results = evaluate(model, tokenizer, dev_iter, dev_gold_answers)
    print(f"After Epoch: {epoch}")
    print(f"Loss: {epoch_loss}")
    print(f"Pred acc: {dev_results['pred_acc']}")
    print(f"Best acc: {dev_results['best_acc']}")
    print(f"Pred EM: {dev_results['pred_em']}")
    print(f"Best EM: {dev_results['best_em']}")
    print(f"Pred F1: {dev_results['pred_f1']}")
    print(f"Best F1: {dev_results['best_f1']}")
    
    # stores model checkpoint
    torch.save({
              'epoch': epoch,
              'model_state_dict': model.state_dict(),
              'optimizer_state_dict': optimizer.state_dict(),
              'lr_scheduler_state_dict': lr_scheduler.state_dict(),
              'loss': epoch_loss,
              }, ckpt_path + f"ckpt_{epoch}.pt")
    
    print(f"Model state saved - epoch: {epoch}")
    print(f"{'=' * 20}")

total_time = time.time() - t_epoch
print(f"Training finished! Total time: {str(timedelta(seconds=total_time))}")

  0%|          | 0/14544 [00:00<?, ?it/s]

Epoch: 1/3
Processed 0 QA pairs of 77558
Last loss: 12.631551742553711
Time_elapsed: 0:00:00.847060
Processed 320 QA pairs of 77558
Last loss: 12.541471481323242
Time_elapsed: 0:00:16.819638
Processed 640 QA pairs of 77558
Last loss: 12.586880683898926
Time_elapsed: 0:00:33.054146
Processed 960 QA pairs of 77558
Last loss: 12.415403366088867
Time_elapsed: 0:00:49.342211
Processed 1280 QA pairs of 77558
Last loss: 12.364994049072266
Time_elapsed: 0:01:05.793931
Processed 1600 QA pairs of 77558
Last loss: 12.22736930847168
Time_elapsed: 0:01:22.409758
Processed 1920 QA pairs of 77558
Last loss: 12.020681381225586
Time_elapsed: 0:01:38.993349
Processed 2240 QA pairs of 77558
Last loss: 11.840444564819336
Time_elapsed: 0:01:55.641556
Processed 2560 QA pairs of 77558
Last loss: 11.04861068725586
Time_elapsed: 0:02:12.373327
Processed 2880 QA pairs of 77558
Last loss: 10.104564666748047
Time_elapsed: 0:02:29.178606
Processed 3200 QA pairs of 77558
Last loss: 9.424127578735352
Time_elapsed: 0

  0%|          | 0/366 [00:00<?, ?it/s]

After Epoch: 0
Loss: 35678.11328125
Pred acc: 0.6247010591048856
Best acc: 0.6236761188930646
Pred EM: 0.8795695251110351
Best EM: 0.5222070379227879
Pred F1: 0.9231976335339813
Best F1: 0.6849603747235941
Model state saved - epoch: 0
Epoch: 2/3
Processed 0 QA pairs of 77558
Last loss: 2.5440421104431152
Time_elapsed: 1:11:12.814811
Processed 320 QA pairs of 77558
Last loss: 2.6152939796447754
Time_elapsed: 1:11:29.601599
Processed 640 QA pairs of 77558
Last loss: 1.8842384815216064
Time_elapsed: 1:11:46.474676
Processed 960 QA pairs of 77558
Last loss: 2.4795351028442383
Time_elapsed: 1:12:03.401732
Processed 1280 QA pairs of 77558
Last loss: 1.988048791885376
Time_elapsed: 1:12:20.349931
Processed 1600 QA pairs of 77558
Last loss: 2.3194291591644287
Time_elapsed: 1:12:37.317107
Processed 1920 QA pairs of 77558
Last loss: 2.246063232421875
Time_elapsed: 1:12:54.344519
Processed 2240 QA pairs of 77558
Last loss: 1.7046608924865723
Time_elapsed: 1:13:11.405376
Processed 2560 QA pairs of

  0%|          | 0/366 [00:00<?, ?it/s]

After Epoch: 1
Loss: 18581.859375
Pred acc: 0.6361462248035531
Best acc: 0.6346088144858216
Pred EM: 0.8795695251110351
Best EM: 0.5213529210796037
Pred F1: 0.9231976335339813
Best F1: 0.6898116642313796
Model state saved - epoch: 1
Epoch: 3/3
Processed 0 QA pairs of 77558
Last loss: 1.5196056365966797
Time_elapsed: 2:22:40.263262
Processed 320 QA pairs of 77558
Last loss: 1.409246563911438
Time_elapsed: 2:22:57.734644
Processed 640 QA pairs of 77558
Last loss: 1.3519692420959473
Time_elapsed: 2:23:15.746069
Processed 960 QA pairs of 77558
Last loss: 1.9611228704452515
Time_elapsed: 2:23:33.499154
Processed 1280 QA pairs of 77558
Last loss: 1.296640157699585
Time_elapsed: 2:23:50.830645
Processed 1600 QA pairs of 77558
Last loss: 1.5063762664794922
Time_elapsed: 2:24:08.158121
Processed 1920 QA pairs of 77558
Last loss: 1.390854835510254
Time_elapsed: 2:24:25.719316
Processed 2240 QA pairs of 77558
Last loss: 1.1184298992156982
Time_elapsed: 2:24:43.353453
Processed 2560 QA pairs of 77

  0%|          | 0/366 [00:00<?, ?it/s]

After Epoch: 2
Loss: 12998.9130859375
Pred acc: 0.6474205671335839
Best acc: 0.6436624530235736
Pred EM: 0.8795695251110351
Best EM: 0.530064912880082
Pred F1: 0.9231976335339813
Best F1: 0.6981181554980165
Model state saved - epoch: 2
Training finished! Total time: 3:35:16.075622


# Model Evaluation and Inference

In [None]:
model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased").to(device)
ckpt_path = "/content/gdrive/MyDrive/Colab Notebooks/open_domain_QA_BERT/ckpt/"
checkpoint = torch.load(ckpt_path + "ckpt_2.pt")
model.load_state_dict(checkpoint['model_state_dict'])

<All keys matched successfully>

In [None]:
squad_path = '/content/gdrive/MyDrive/Colab Notebooks/open_domain_QA_BERT/data/'
test_questions, test_contexts, test_gold_answers = read_files(squad_path, "test")
test_results = evaluate(model, tokenizer, test_iter, test_gold_answers)

  0%|          | 0/500 [00:00<?, ?it/s]

In [None]:
test_results

{'best_acc': 0.6685,
 'best_em': 0.5695,
 'best_f1': 0.714791674841538,
 'pred_acc': 0.672,
 'pred_em': 0.8675,
 'pred_f1': 0.91526472890887}

In [None]:
from transformers import pipeline

model = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")
ckpt_path = "/content/gdrive/MyDrive/Colab Notebooks/open_domain_QA_BERT/ckpt/"
checkpoint = torch.load(ckpt_path + "ckpt_2.pt")
model.load_state_dict(checkpoint['model_state_dict'])
QA_BERT_model = pipeline("question-answering", model=model, tokenizer=tokenizer)

In [None]:
test_question = "What is the difference between Work 4.0 and the other phases of work relations?"
test_context = """
Conceptually, Work 4.0 reflects the current fourth phase of work relations, having been preceded by 
the birth of industrial society and the first workers' organizations in the late 18th century (Work 1.0), 
the beginning of mass production and of the welfare state in the late 19th century (Work 2.0), 
and the advent of globalization, digitalization and the transformation of the social market economy since 
the 1970s (Work 3.0). By contrast, Work 4.0 is characterized by a high degree of integration and cooperation, 
the use of digital technologies (e.g. the internet), and a rise in flexible work arrangements. 
Its drivers include digitalization, globalization, demographic change (ageing, migration), and cultural change.
"""

QA_BERT_model(question=test_question, context=test_context)

{'answer': 'high degree of integration and cooperation,',
 'end': 522,
 'score': 0.13065877556800842,
 'start': 478}

In [None]:
test_question2 = "How is COVID-19 transmitted?"
test_context2 = """
COVID‑19 transmits when people breathe in air contaminated by droplets and small airborne particles containing 
the virus. The risk of breathing these in is highest when people are in close proximity, but they can be 
inhaled over longer distances, particularly indoors. Transmission can also occur if splashed or sprayed with 
contaminated fluids in the eyes, nose or mouth, and, rarely, via contaminated surfaces. 
People remain contagious for up to 20 days, and can spread the virus even if they do not develop symptoms.
"""

QA_BERT_model(question=test_question2, context=test_context2)

{'answer': 'when people breathe in air contaminated by droplets and small airborne particles containing the virus.',
 'end': 123,
 'score': 0.4730822443962097,
 'start': 20}

In [None]:
test_question3 = "Why is the sky blue?"
test_context3 = """
The Earth's atmosphere scatters short-wavelength light more efficiently than that of longer wavelengths. 
Because its wavelengths are shorter, blue light is more strongly scattered than the longer-wavelength lights, 
red or green. Hence the result that when looking at the sky away from the direct incident sunlight, 
the human eye perceives the sky to be blue.
"""
QA_BERT_model(question=test_question3, context=test_context3)

{'answer': 'Because its wavelengths are shorter,',
 'end': 143,
 'score': 0.2641790509223938,
 'start': 107}

In [None]:
test_question4 = "Can the sky ever be green?"
QA_BERT_model(question=test_question4, context=test_context3)

{'answer': 'blue light is more strongly scattered than the longer-wavelength lights, red',
 'end': 221,
 'score': 0.0901738703250885,
 'start': 144}

In [None]:
test_question5 = "What is our projected CAGR for the next 5 years?"
test_context5 = """
Our CFO estimates strong revenue growth forcasts for the next five years. 
According to our calculations, and assuming market conditions hold, we can expect to achieve a 25% CAGR
for this period."""

QA_BERT_model(question=test_question5, context=test_context5)

{'answer': '25%', 'end': 174, 'score': 0.29187580943107605, 'start': 171}

# System limitations

* Assumes that each question has only one possible answer.
* Doesn't return an empty answer when the answer is not contained in the context.
* System interactions are stateless (i.e., the model processes each query independently).
* The user cannot correct the system.
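
One possible mitigation for the second limitation is to abstain when the pipeline's score is low. The sketch below is only an illustration: the helper and the 0.1 threshold are hypothetical, and because the model was not trained on unanswerable questions, its scores are not calibrated for this purpose. The input dictionaries copy two of the pipeline outputs shown earlier.

```python
def answer_or_abstain(result, min_score=0.1):
    """Return the answer text, or None when the score falls below min_score.

    `min_score` is a hypothetical threshold that would need tuning on held-out data.
    """
    if result["score"] < min_score:
        return None
    return result["answer"]


# Pipeline outputs copied from the examples above.
confident = {"answer": "25%", "score": 0.29187580943107605, "start": 171, "end": 174}
doubtful = {
    "answer": "blue light is more strongly scattered than the longer-wavelength lights, red",
    "score": 0.0901738703250885,
    "start": 144,
    "end": 221,
}

print(answer_or_abstain(confident))
print(answer_or_abstain(doubtful))
```

With this threshold, the "Can the sky ever be green?" example would be rejected, while the CAGR example would still be answered.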