# Task

* This assignment focuses on building a simple **Extractive Question Answering** model using **Huggingface** and **Pytorch**. 
Given a context paragraph and a question based on it, the task is to extract the answer from the context.

* The main aim of the assignment is to be familiar with the basic coding concepts in Huggingface and design an inference pipeline for QA. 

* You are required to do the following things: 
    * Download a pretrained model from [Huggingface Model Hub](https://huggingface.co/models).
    * Design the pre-processing pipeline.
    * Design the post-processing pipeline.
    * Perform inference on [SQuAD 2.0](https://arxiv.org/abs/1806.03822) dataset.
    * Get the results on the blind test set.


* Students are required to complete the coding sections which have been marked with `#TODO`.


# Installations

First we need to install 🤗 Transformers, 🤗 Datasets, and 🤗 evaluate libraries from Huggingface.

In [1]:
! pip install datasets transformers evaluate
! pip install torchvision 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.9.0-py3-none-any.whl (462 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m462.8/462.8 KB[0m [31m12.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting transformers
  Downloading transformers-4.26.0-py3-none-any.whl (6.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m6.3/6.3 MB[0m [31m96.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m81.4/81.4 KB[0m [31m11.9 MB/s[0m eta [36m0:00:00[0m
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.12.0-py3-none-any.whl (190 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m190.3/190.3 KB[0m [31m25.3 MB/s[0m eta [36m

# Imports

We start by importing necessary libraries.

In [2]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import random
import collections
from tqdm import tqdm
import os
import json

import torch
from torch.utils.data import (
    DataLoader,
    Dataset
)

from datasets import load_dataset
from evaluate import load

from transformers import (
    AutoTokenizer,
    AutoModelForQuestionAnswering
)

## Setting up the GPU

Following that, we find the available GPUs and save the information in the `DEVICE` variable. This will be useful later on when we need to move tensors and models from CPU to GPU and vice versa.

In [3]:
if torch.cuda.is_available():
    DEVICE = torch.device("cuda")
    print("Using GPU: ", DEVICE)
else:
    DEVICE = torch.device("cpu")
    print("Using CPU: ", DEVICE)

Using GPU:  cuda


## Seeding the code

In [4]:
def set_random_seed(seed: int):
    """
    Helper function to seed experiment for reproducibility.
    If -1 is provided as seed, experiment uses random seed from 0~9999
    Args:
        seed (int): integer to be used as seed, use -1 to randomly seed experiment
    """
    print("Seed: {}".format(seed))

    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.enabled = False
    torch.backends.cudnn.deterministic = True

    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

In [5]:
SEED = 0
set_random_seed(SEED)

Seed: 0


# Preprocessing

We start by declaring a config dictionary containing all the hyperparameters required for data preprocessing and model inference. They are as follows:

1. `model_checkpoint`: The model to be used for from [Huggingface Model Hub](https://huggingface.co/models) for our QA task. We recommend using models the [RoBERTa Model](https://huggingface.co/deepset/roberta-base-squad2) that is already fine-tuned on the SQuAD datasets for good performance.
2. `max_length`: The maximum length of the input sequence. If left unset, the tokenizer will use the predefined model maximum length. (Ideally set between 300-512)
3. `truncation`: Determines whether to truncate the input sequence or not. See the documentation for details on the different values it can accept.
4. `padding`: Determines whether to pad the input sequence or not. See the documentation for details on the different values it can accept.
5. `return_overflowing_tokens`: Determines whether to return overflowing token sequences after truncation/ when the input sequence exceeds the maximum length.
6. `return_offsets_mapping`: Determines whether to return (char_start, char_end) for each token in the input sequence.
7. `stride`: The number of overlapping tokens between the truncated and the overflowing sequences.
8. `n_best_size`: The top 'n' answers to select from the predictions.
9. `max_answer_length`: The maximum length of the answer.
10. `batch_size`: The number of examples to be included in each batch. It should be selected properly such that the batch fits into the GPU. (Ideally from 16 to 128)

Check out the following links for more information on Tokenizers in huggingface.
1. [Summary of Tokenizers](https://huggingface.co/docs/transformers/v4.24.0/en/tokenizer_summary)
2. [Padding and Truncation](https://huggingface.co/docs/transformers/v4.24.0/en/pad_truncation)
3. [Batch Encoding](https://huggingface.co/docs/transformers/v4.24.0/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.batch_encode_plus)

In [6]:
# TODO 1: fill in the values for all the hyper-paramters mentioned in the config dictionary.
config = {
    'model_checkpoint': "roberta-base",
    "max_length": 400,
    "truncation": "longest_first",
    "padding": True,
    "return_overflowing_tokens": True,
    "return_offsets_mapping": True,
    "stride": 128,
    "n_best_size": 33,
    "max_answer_length": 50,
    "batch_size": 96
}

## Loading the Dataset

For this assignment, we will be using [SQuAD](https://arxiv.org/abs/1606.05250), an academic benchmark for extractive question answering. We will use the [SQuAD 2.0](https://arxiv.org/abs/1806.03822), an updated version of the dataset containing harder examples as well as examples which do not have answers in the context.

We will use the `load_dataset` function from [🤗 Datasets](https://github.com/huggingface/datasets) library to load the dataset.

The `load_dataset` function returns a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict) object, which conatins the *train* and *validation* splits for the dataset. We will be using only the *validation* split in this assignment.

In [7]:
%%time
datasets = load_dataset("squad_v2")
datasets

Downloading builder script:   0%|          | 0.00/5.28k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/2.40k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/8.02k [00:00<?, ?B/s]

Downloading and preparing dataset squad_v2/squad_v2 to /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/9.55M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/801k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/130319 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11873 [00:00<?, ? examples/s]

Dataset squad_v2 downloaded and prepared to /root/.cache/huggingface/datasets/squad_v2/squad_v2/2.0.0/09187c73c1b837c95d9a249cd97c2c3f1cebada06efe667b4427714b27639b1d. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

CPU times: user 18 s, sys: 484 ms, total: 18.5 s
Wall time: 33.4 s


DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 130319
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 11873
    })
})

## Dataset Inspection

You can select any example in the dataset by specifying the split and the example index.
The code below prints a randomly selected example.

In [8]:
index = random.randint(0, len(datasets['validation']))
datapoint = datasets['validation'][index]
print(f"index: {index}\n")
for column, info in datapoint.items():
    print(f"\n{column}:\t{info}")

index: 6311


id:	5727c3b02ca10214002d95ba

title:	Harvard_University

context:	Charles W. Eliot, president 1869–1909, eliminated the favored position of Christianity from the curriculum while opening it to student self-direction. While Eliot was the most crucial figure in the secularization of American higher education, he was motivated not by a desire to secularize education, but by Transcendentalist Unitarian convictions. Derived from William Ellery Channing and Ralph Waldo Emerson, these convictions were focused on the dignity and worth of human nature, the right and ability of each person to perceive truth, and the indwelling God in each person.

question:	What president eliminated the Christian position in the curriculum?

answers:	{'text': ['Charles W. Eliot', 'Charles W. Eliot', 'Charles W. Eliot'], 'answer_start': [0, 0, 0]}


We can see that each example contains 4 fields, namely:
1.  ***id*** (a unique identifier for each example)
2. ***title*** (the genre of the example)
3. ***context*** (The paragraph on which the question is asked)
4. ***question*** (the actual question the ML model needs to answer)
5. ***answer*** (the actual answer to the question indicated in text as well as its start and end position in the context). During training, there is only one possible answer. For evaluation, however, there are several possible answers for each sample, which may be the same or different.


## Loading the Tokenizer and the Model

Next, we download the RoBERTa model fine-tuned on SQuAD along with its tokenizer. Refer to the [Huggingface documentation](https://huggingface.co/docs/transformers/autoclass_tutorial) to see how to load pretrained models and tokenizers.

The 🤗 Transformers `Tokenizer` tokenizes the input sequence and converts the tokens to their corresponding IDs in the pretrained vocabulary. It generates various inputs that a model requires such as input_ids, attention_mask, token_type_ids, etc. You can read more details about this in the Huggingface [Tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer) and [Preprocessing](https://huggingface.co/docs/transformers/preprocessing) documentation.

The 🤗 Transformers `AutoModelForQuestionAnswering` is a transformer model with a span classification head for extractive question answering. It returns ***start_logits*** and ***end_logits***, marking the start and end of the answer, respectively. More details on this model is given [here](https://huggingface.co/docs/transformers/model_doc/roberta#transformers.RobertaForQuestionAnswering).

In [9]:
%%time
# TODO 2: Define the tokenizer and QA model. Transfer the QA model to GPU.
from transformers import AutoTokenizer, RobertaForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")
qa_model  =  RobertaForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2")
qa_model     = qa_model.to(DEVICE)


Downloading (…)okenizer_config.json:   0%|          | 0.00/79.0 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

Downloading (…)"pytorch_model.bin";:   0%|          | 0.00/496M [00:00<?, ?B/s]

CPU times: user 3.88 s, sys: 2.07 s, total: 5.95 s
Wall time: 30.8 s


## Dataset Class

Next we define the a custom Dataset class for our SQuAD corpus.

The class implements three main functions: 
1. `__init__`: This function is run once when instantiating the Dataset object. We generally initialize our raw dataset, tokenizer, and tokenized dataset in this function. 
2. `__len__`: This returns the total number of examples in our dataset. Note that we set it to the number of available ***input_ids*** and not the size of the raw dataset. This is because we are allowing context longer than maximum length of the model, resulting in increased number of features.
3. `__getitem__`: This function return the sample from our dataset at a given index. You are supposed to implement this function.

You don't need to understand the details of each of them for the purpose of this assignment. However, if you want to learn more, you can lookup the [PyTorch Documentation](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html).

### Important steps for preprocessing QA data: (`TODO`)

1. As QA task contains two input fields of the question and the context, we concatenate both of them to pass it to the model. Thus, the input is `[CLS] Question [SEP] Context [SEP]`. Note that we never want to truncate the question, only the context. So choose the `truncation` hyper-parameter in the [config cell](https://colab.research.google.com/drive/1HQ9z8cZE8TgLjlkAekEXfih9x9byB65r#scrollTo=8f323c24&line=2&uniqifier=1) accordingly.
2. As a result, in the case of very long documents, we must be careful not to lose the context that contains the answer. To resolve this concern, we will allow longer examples in our dataset to provide multiple input features, each of which is shorter than the maximum length (set as a hyper-parameter in the config dictionary). This can be done using the `stride` and `return_overflowing_tokens` hyper-parameters.
3. The `sequence ids` in the tokenized input can be used to distinguish between the various sequences in an input example. In our case, the question will be assigned 0 and the context will be assigned 1, because the former comes after the latter in the sequence. We know that the answer tokens always lie in the context. Hence, to make things easy for post-processing, we set the offset mapping of the tokens that are not a part of the context to -1.
4. The `overflow_to_sample_mapping` key return by the tokenizer is useful to map each feature we get to its corresponding label.

In [10]:
class QADataset(Dataset):
    
    def __init__(
        self,
        data,
        tokenizer,
        config
    ):

        self.config = config
        self.data = data
        self.tokenizer = tokenizer
        self.tokenized_data = self.tokenizer(
            self.data["question"],
            self.data["context"],
            max_length=self.config["max_length"],
            stride=self.config["stride"],
            truncation=self.config["truncation"],
            padding=self.config["padding"],
            return_overflowing_tokens=self.config["return_overflowing_tokens"],
            return_offsets_mapping=self.config["return_offsets_mapping"],
            return_attention_mask=True,
            add_special_tokens=True
        )
        
        example_ids = []
  
        for i, sample_mapping in enumerate(tqdm(self.tokenized_data["overflow_to_sample_mapping"])):
            example_ids.append(self.data["id"][sample_mapping])

            sequence_ids = self.tokenized_data.sequence_ids(i)
            offset_mapping = self.tokenized_data["offset_mapping"][i]

            
        #     # TODO 3: set the offset mapping of the tokenized data at index i to (-1, -1) 
        #     # if the token is not in the context
            updated_offset_mapping = []
            for mapping in  offset_mapping:
              if mapping == (0,0):
                updated_offset_mapping.append((-1,-1))
              else:
                  updated_offset_mapping.append(mapping)

            self.tokenized_data["offset_mapping"][i] = updated_offset_mapping

        self.tokenized_data["ID"] = example_ids
        
        
        
    def __len__(
        self
    ):
        # TODO 4: define the length of the dataset equal to total number of unique features (not the total number of datapoints)
        return len(self.tokenized_data["input_ids"])
        
    
    
    def __getitem__(
        self,
        index: int
    ):
        # TODO 5: Return the tokenized dataset at the given index. Convert the various inputs to tensor using torch.tensor
        return {
            'input_ids': torch.Tensor(self.tokenized_data["input_ids"][index]),
            'attention_mask': torch.Tensor(self.tokenized_data["attention_mask"][index]),
            'offset_mapping': torch.Tensor(self.tokenized_data["offset_mapping"][index]),
            'example_id': self.tokenized_data["ID"][index],
        }

## Creating Dataloader

1. We create an object of our custom QADataset class.
2. To access examples batch-wise, we create a [Dataloader](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) object that is an iterable around the Dataset object. The `Dataloader` takes the `Dataset object` and `batch size` as parameters. There are many other parameters that one can specify but we only need batch size for this assignment.
3. Note that the length of the Dataloader object multiplied by the batch size should approximately give you the size of the entire dataset.

In [11]:
%%time
eval_dataset = QADataset(
      data=datasets['validation'],
      tokenizer=tokenizer,
      config=config
  )


100%|██████████| 12118/12118 [01:54<00:00, 105.58it/s]

CPU times: user 2min, sys: 862 ms, total: 2min 1s
Wall time: 1min 59s





In [12]:
%%time
eval_dataloader = DataLoader(
    eval_dataset,
    batch_size=config["batch_size"]
)
len(eval_dataloader)

CPU times: user 416 µs, sys: 0 ns, total: 416 µs
Wall time: 685 µs


127

We collect the raw and tokenized dataset in seperate variables as they will be required during post-processing.

In [13]:
eval_data = eval_dataset.data
eval_features = eval_dataset.tokenized_data

# Inference on SQuAD (`TODO`)

* In the cell below you are supposed to perform inference on the SQuAD valdiation set.
* You are supposed to iterate over the DataLoader object, pass the tokenized input to the model, and store the start and end logits.
* Note: Do not forget to transfer the start and end logits tensor from GPU to CPU. Convert them to numpy arrays.

In [14]:
def qa_inference(model, data_loader):
    model.eval()
    start_logits = []
    end_logits = []
    for step, batch in enumerate(tqdm(data_loader, desc="Inference Iteration")):
        with torch.no_grad():
            model_kwargs = {
                'input_ids': batch['input_ids'].to(DEVICE, dtype=torch.long),
                'attention_mask': batch['attention_mask'].to(DEVICE, dtype=torch.long)
            }    

            # TODO 6: pass the model arguments to the model and store the output
            outputs = model(**model_kwargs)
            # TODO 7: Extract the start and end logits by extending `start_logits` and `end_logits`
            start_scores = outputs.start_logits.cpu().numpy()
            end_scores = outputs.end_logits.cpu().numpy()
            # print(start_scores.cpu())
            start_logits.extend(start_scores)
            end_logits.extend(end_scores)

    # TODO 8: Convert the start and end logits to a numpy array (by passing them to `np.array`)
    start_logits = np.array(start_logits)
    # TODO 9: return start and end logits
    end_logits = np.array(end_logits)

    return start_logits, end_logits

In [15]:
%%time
start_logits, end_logits = qa_inference(qa_model, eval_dataloader)
# start_logits.shape, end_logits.shape

Inference Iteration: 100%|██████████| 127/127 [04:21<00:00,  2.06s/it]

CPU times: user 4min 19s, sys: 704 ms, total: 4min 20s
Wall time: 4min 21s





# Postprocessing

The predictions could give rise to various difficulties:
1. The answer span could be the text in the question.
2. Answer would be too long.
3. The start position could be greater than the end position.

We have to do the following postprocessing steps to avoid the abouve mentioned senarios:
1. Skip answers that are not fully in the context (Hint: make use of the modified offset mapping done in the [preprocessing step](https://colab.research.google.com/drive/1HQ9z8cZE8TgLjlkAekEXfih9x9byB65r#scrollTo=916ff6d7&line=1&uniqifier=1)).
2. To select the best possible start and end logits, first sort them and select the top 'n' choices using the `n_best_size` hyper-paramter. Then iterate over the start and end logits and skip the answers with a length that is either < 0 or > `max_answer_length`.

In [16]:
def post_processing(raw_dataset, tokenized_dataset, start_logits, end_logits):
    
    # Map each example to its features. This is done because an example can have multiple features
    # as we split the context into chunks if it exceeded the max length
    data2features = collections.defaultdict(list)
    for idx, feature_id in enumerate(tokenized_dataset['ID']):
        data2features[feature_id].append(idx)

    # Decode the answers for each datapoint
    predictions = []
    for data in tqdm(raw_dataset):
        answers = []
        data_id = data["id"]
        context = data["context"]

        for feature_index in data2features[data_id]:

            # TODO 10: Get the start logit, end logit, and offset mapping for each index.
            start_logit = start_logits[feature_index]
            end_logit = end_logits[feature_index]

            # TODO 11: Sort the start and end logits and get the top n_best_size logits.
            # Hint: look at other QA pipelines/tutorials.
            start_indexes = np.argsort(start_logit)[-config["n_best_size"]:]
            end_indexes = np.argsort(start_logit)[-config["n_best_size"]:]
            
            for start_index in start_indexes:
                for end_index in end_indexes:
                    # TODO 12: Exclde answers that are not in the context
                    if start_index > end_index:
                      continue
                    if end_index > len(context) - 1:
                      continue
                    # TODO 13: Exclude answers if (answer length < 0) or (answer length > max_answer_length)
                    if end_index - start_index < 0:
                      continue
                    if end_index - start_index > config['max_answer_length']:
                      continue
                    # # TODO 14: collect answers in a list.
                    offset_mapping = tokenized_dataset["offset_mapping"][feature_index]
                    
                    answers.append(
                        {
                            "text": context[offset_mapping[start_index][0]: offset_mapping[end_index][1]],
                            "logit_score": start_logit[start_index] + end_logit[end_index],
                        }
                    )

        best_answer = max(answers, key=lambda x: x["logit_score"])
        predictions.append(
            {
                "id": data_id, 
                "prediction_text": best_answer["text"],
                "no_answer_probability": 0.0 if len(best_answer["text"]) > 0 else 1.0
            }
        )    
    return predictions

In [17]:
%%time
predicted_answers = post_processing(
    raw_dataset=eval_data, 
    tokenized_dataset=eval_features,
    start_logits=start_logits,
    end_logits=end_logits
)

100%|██████████| 11873/11873 [00:13<00:00, 859.47it/s]

CPU times: user 13.8 s, sys: 73.4 ms, total: 13.9 s
Wall time: 13.8 s





In [18]:
gold_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in eval_data][:len(eval_data)]

In [19]:
assert len(predicted_answers) == len(gold_answers)

In [20]:
print(predicted_answers[0])
print(gold_answers[0])

{'id': '56ddde6b9a695914005b9628', 'prediction_text': 'France', 'no_answer_probability': 0.0}
{'id': '56ddde6b9a695914005b9628', 'answers': {'text': ['France', 'France', 'France', 'France'], 'answer_start': [159, 159, 159, 159]}}


## Evaluating the Predictions

We use the `🤗 Evaluate` library from Huggingace for evaluating our predictions. 
Specifically, we evaluate the model based on two metrics:
1. `exact match`: This metric measures the percentage of predictions that match any one of the ground truth answers exactly.
2. `macro-averaged f1 score`: This metric mea- sures the average overlap between the prediction and ground truth answer. We treat the prediction and ground truth as bags of tokens, and compute their F1. We take the maximum F1 over all of the ground truth answers for a given question, and then average over all of the questions.

In [21]:
%%time
eval_metric = load("squad_v2")

Downloading builder script:   0%|          | 0.00/6.47k [00:00<?, ?B/s]

Downloading extra modules:   0%|          | 0.00/11.3k [00:00<?, ?B/s]

CPU times: user 128 ms, sys: 12.1 ms, total: 140 ms
Wall time: 4.65 s


In [22]:
eval_results = eval_metric.compute(predictions=predicted_answers, references=gold_answers)
eval_results

{'exact': 79.87029394424324,
 'f1': 82.84272818590236,
 'total': 11873,
 'HasAns_exact': 77.15924426450742,
 'HasAns_f1': 83.11263693509112,
 'HasAns_total': 5928,
 'NoAns_exact': 82.57359125315391,
 'NoAns_f1': 82.57359125315391,
 'NoAns_total': 5945,
 'best_exact': 79.87029394424324,
 'best_exact_thresh': 0.0,
 'best_f1': 82.84272818590223,
 'best_f1_thresh': 0.0}

Save the SQuAD results as a json file. Make use to name the file as `squad_results.json`. 

Make sure to download the json file and upload on gradscope with the same name.

In [23]:
#TODO 15: save the metric results in squad_results.json file
from google.colab import drive

drive.mount('/content/gdrive/', force_remount=True)
with open("squad_results.json", "w") as outfile:
    json.dump(eval_results, outfile)

Mounted at /content/gdrive/


# Blind Test Set

Now you will be given a blind test set for which you need to generate appropriate predictions using the functions given above.

You should be able to use code snippets from the SQuAD evaluation section.

## Load and preprocess the dataset

You can load the blind test set just like the SQuAD corpus using the `load_dataset` function from the `🤗 Datasets` library. 

More information on how to load a csv file using the load_dataset function is given [here](https://huggingface.co/docs/datasets/loading#csv).

In [25]:
# TODO 16: Load the blind test dataset using the load_dataset function, make sure to mention the split as `train`.
test_dataset = load_dataset("csv", data_files="/content/blind_test_set.csv")["train"]



Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-b61f347914c4439c/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-b61f347914c4439c/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

The code below prints a randomly selected example in the test set.

As you can see, the test data only contains id, title, context, and question and not the answers. Your job is to generate appropriate answers for the blind test dataset.

In [26]:
%%time
index = random.randint(0, len(test_dataset))
datapoint = test_dataset[index]
print(f"index: {index}\n")
for column, info in datapoint.items():
    print(f"\n{column}:\t{info}")

index: 1722


id:	30184e1d4b5a5eb3587525795497cd6232c2434d

title:	Genghis_Khan

context:	In Mongolia today, Genghis Khan's name and likeness are endorsed on products, streets, buildings, and other places. His face can be found on everyday commodities, from liquor bottles to candy products, and on the largest denominations of 500, 1,000, 5,000, 10,000, and 20,000 Mongolian tögrög (₮). Mongolia's main international airport in Ulaanbaatar is named Chinggis Khaan International Airport. Major Genghis Khan statues have been erected before the parliament and near Ulaanbaatar. There have been repeated discussions about regulating the use of his name and image to avoid trivialization.

question:	On what consumable products might you see an image of Genghis Khan?
CPU times: user 1.33 ms, sys: 5 µs, total: 1.34 ms
Wall time: 1.13 ms


* Wrap the test data into the custom QADataset object.
* Create a dataloader to loop over the dataset.

In [27]:
# TODO 17: Define the QADataset object for the test data.
%%time
eval_dataset = QADataset(
      data=test_dataset,
      tokenizer=tokenizer,
      config=config
  )


100%|██████████| 3009/3009 [00:07<00:00, 418.44it/s]

CPU times: user 9.26 s, sys: 176 ms, total: 9.44 s
Wall time: 8.66 s





In [28]:
# TODO 18: Define the dataloader for the test set 
%%time
eval_dataloader = DataLoader(
    eval_dataset,
    batch_size=config["batch_size"]
)
len(eval_dataloader)

CPU times: user 131 µs, sys: 2 µs, total: 133 µs
Wall time: 136 µs


32

Collect the raw and tokenized dataset in seperate variables as they will be required during post-processing.

In [29]:
# TODO 19: Save the raw and tokenized test set into seperate variables.
eval_data = eval_dataset.data
eval_features = eval_dataset.tokenized_data

## Inference on Test Set

Use the `qa_inference` function to generate the start and end logits.

In [30]:
# TODO 20: perfom inference on the blind test to get the start and end logits (use the qa_inference function)
start_logits, end_logits = qa_inference(qa_model, eval_dataloader)

Inference Iteration: 100%|██████████| 32/32 [01:02<00:00,  1.96s/it]


Use the `post_processing` function to generate the final candidate answers.

In [31]:
# TODO 21: post process the predictions to generate the candidate answers.
%%time
predicted_answers = post_processing(
    raw_dataset=eval_data, 
    tokenized_dataset=eval_features,
    start_logits=start_logits,
    end_logits=end_logits
)

100%|██████████| 3000/3000 [00:03<00:00, 931.18it/s]

CPU times: user 3.22 s, sys: 16.2 ms, total: 3.24 s
Wall time: 3.23 s





Save the results as a json file. Make sure to name the file as `blind_test_predictions.json`.

In [32]:
#TODO 22: Save the candidate answers in `blind_test_predictions.json` file.

In [33]:
print(predicted_answers[0])
with open("blind_test_predictions.json", "w") as outfile:
    json.dump(predicted_answers, outfile)

{'id': '100303db73e4051089035f246d0aeef2b12c4e47', 'prediction_text': 'Town Moor', 'no_answer_probability': 0.0}
