# Task

* This assignment focuses on building a simple **Extractive Question Answering** model using **Huggingface** and **PyTorch**.
Given a context paragraph and a question based on it, the task is to extract the answer from the context.

* The main aim of the assignment is to become familiar with the basic coding concepts in Huggingface and to design an inference pipeline for QA.

* You are required to do the following things: 
    * Download a pretrained model from [Huggingface Model Hub](https://huggingface.co/models).
    * Design the pre-processing pipeline.
    * Design the post-processing pipeline.
    * Perform inference on [SQuAD 2.0](https://arxiv.org/abs/1806.03822) dataset.
    * Get the results on the blind test set.


* Students are required to complete the coding sections which have been marked with `#TODO`.


# Installations

First, we need to install the 🤗 Transformers, 🤗 Datasets, and 🤗 Evaluate libraries from Huggingface.

In [5]:
! pip install datasets transformers evaluate
! pip install torchvision 



You should consider upgrading via the 'c:\users\nebiyou hailemariam\desktop\development\nlp\env\scripts\python.exe -m pip install --upgrade pip' command.


Collecting torchvision
  Using cached torchvision-0.14.1-cp38-cp38-win_amd64.whl (1.1 MB)
Collecting pillow!=8.3.*,>=5.3.0
  Downloading Pillow-9.4.0-cp38-cp38-win_amd64.whl (2.5 MB)
Collecting torch==1.13.1
  Using cached torch-1.13.1-cp38-cp38-win_amd64.whl (162.6 MB)
Installing collected packages: torch, pillow, torchvision
Successfully installed pillow-9.4.0 torch-1.13.1 torchvision-0.14.1




# Imports

We start by importing necessary libraries.

In [6]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import random
import collections
from tqdm import tqdm
import os
import json

import torch
from torch.utils.data import (
    DataLoader,
    Dataset
)

from datasets import load_dataset
from evaluate import load

from transformers import (
    AutoTokenizer,
    AutoModelForQuestionAnswering
)

## Setting up the GPU

Following that, we find the available GPUs and save the information in the `DEVICE` variable. This will be useful later on when we need to move tensors and models from CPU to GPU and vice versa.

In [7]:
if torch.cuda.is_available():
    DEVICE = torch.device("cuda")
    print("Using GPU: ", DEVICE)
else:
    DEVICE = torch.device("cpu")
    print("Using CPU: ", DEVICE)

Using CPU:  cpu


## Seeding the code

In [8]:
def set_random_seed(seed: int):
    """
    Helper function to seed experiment for reproducibility.
    If -1 is provided as seed, experiment uses random seed from 0~9999
    Args:
        seed (int): integer to be used as seed, use -1 to randomly seed experiment
    """
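    if seed == -1:
        # Honor the docstring: draw a random seed in [0, 9999] when -1 is passed.
        seed = random.randint(0, 9999)
    print("Seed: {}".format(seed))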

    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.enabled = False
    torch.backends.cudnn.deterministic = True

    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

In [11]:
SEED = 0
set_random_seed(SEED)

Seed: 0


# Preprocessing

We start by declaring a config dictionary containing all the hyperparameters required for data preprocessing and model inference. They are as follows:

1. `model_checkpoint`: The model to be used from the [Huggingface Model Hub](https://huggingface.co/models) for our QA task. For good performance, we recommend a model such as the [RoBERTa Model](https://huggingface.co/deepset/roberta-base-squad2), which is already fine-tuned on the SQuAD datasets.
2. `max_length`: The maximum length of the input sequence. If left unset, the tokenizer will use the predefined model maximum length. (Ideally set between 300-512)
3. `truncation`: Determines whether to truncate the input sequence or not. See the documentation for details on the different values it can accept.
4. `padding`: Determines whether to pad the input sequence or not. See the documentation for details on the different values it can accept.
5. `return_overflowing_tokens`: Determines whether to return the overflowing token sequences produced by truncation, i.e. when the input sequence exceeds the maximum length.
6. `return_offsets_mapping`: Determines whether to return (char_start, char_end) for each token in the input sequence.
7. `stride`: The number of overlapping tokens between the truncated and the overflowing sequences.
8. `n_best_size`: The top 'n' answers to select from the predictions.
9. `max_answer_length`: The maximum length of the answer.
10. `batch_size`: The number of examples to be included in each batch. It should be selected properly such that the batch fits into the GPU. (Ideally from 16 to 128)
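
The interplay of `max_length`, `stride`, and `return_overflowing_tokens` can be pictured with a small pure-Python sketch. It is a simplified model, not the tokenizer's actual implementation: it ignores the question and the special tokens, and `split_with_stride` is a hypothetical helper name introduced only for this illustration.

```python
def split_with_stride(tokens, max_length, stride):
    """Split `tokens` into windows of at most `max_length` items,
    where consecutive windows share `stride` items."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break  # this window already reaches the end of the sequence
        start += step
    return windows


context_tokens = list(range(10))  # stand-in for 10 context tokens
for window in split_with_stride(context_tokens, max_length=4, stride=2):
    print(window)
```

Each printed window overlaps the previous one by two items, so an answer that straddles a window boundary still appears whole in at least one window, provided it is no longer than the overlap.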

Check out the following links for more information on tokenizers in Huggingface.
1. [Summary of Tokenizers](https://huggingface.co/docs/transformers/v4.24.0/en/tokenizer_summary)
2. [Padding and Truncation](https://huggingface.co/docs/transformers/v4.24.0/en/pad_truncation)
3. [Batch Encoding](https://huggingface.co/docs/transformers/v4.24.0/en/internal/tokenization_utils#transformers.PreTrainedTokenizerBase.batch_encode_plus)
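
For orientation, here is one plausible way to fill in such a dictionary. The checkpoint is the one recommended in the list above; every other value is an assumption, chosen from the ranges mentioned there or from values commonly seen in QA tutorials. It is not a required answer, so tune the values for your hardware.

```python
# Illustrative values only; named `example_config` to keep it apart from the
# `config` dictionary you fill in for the assignment.
example_config = {
    "model_checkpoint": "deepset/roberta-base-squad2",
    "max_length": 384,                  # within the suggested 300-512 range
    "truncation": "only_second",        # truncate the context, never the question
    "padding": "max_length",            # pad every feature to the same length
    "return_overflowing_tokens": True,  # keep the chunks cut off by truncation
    "return_offsets_mapping": True,     # needed to map tokens back to characters
    "stride": 128,                      # tokens shared by consecutive chunks
    "n_best_size": 20,
    "max_answer_length": 30,
    "batch_size": 32,                   # within the suggested 16-128 range
}
```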

In [None]:
# TODO 1: fill in the values for all the hyper-parameters mentioned in the config dictionary.
config = {
    'model_checkpoint': ,
    "max_length": ,
    "truncation": ,
    "padding": ,
    "return_overflowing_tokens": ,
    "return_offsets_mapping": ,
    "stride": ,
    "n_best_size": ,
    "max_answer_length": ,
    "batch_size": 
}

## Loading the Dataset

For this assignment, we will be using [SQuAD](https://arxiv.org/abs/1606.05250), an academic benchmark for extractive question answering. We will use the [SQuAD 2.0](https://arxiv.org/abs/1806.03822), an updated version of the dataset containing harder examples as well as examples which do not have answers in the context.

We will use the `load_dataset` function from [🤗 Datasets](https://github.com/huggingface/datasets) library to load the dataset.

The `load_dataset` function returns a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict) object, which contains the *train* and *validation* splits for the dataset. We will be using only the *validation* split in this assignment.

In [None]:
%%time
datasets = load_dataset("squad_v2")
datasets

## Dataset Inspection

You can select any example in the dataset by specifying the split and the example index.
The code below prints a randomly selected example.

In [None]:
index = random.randint(0, len(datasets['validation']) - 1)  # randint includes both endpoints
datapoint = datasets['validation'][index]
print(f"index: {index}\n")
for column, info in datapoint.items():
    print(f"\n{column}:\t{info}")

We can see that each example contains 5 fields, namely:
1. ***id*** (a unique identifier for each example)
2. ***title*** (the title of the Wikipedia article the context is taken from)
3. ***context*** (the paragraph on which the question is asked)
4. ***question*** (the actual question the ML model needs to answer)
5. ***answers*** (the actual answer to the question, given as text together with its character start position in the context). During training, there is only one possible answer. For evaluation, however, there are several possible answers for each sample, which may be the same or different. Questions that cannot be answered from the context have an empty answer list.
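
To make the answer field concrete (it is stored under the `answers` key), the sketch below uses a hand-written record that is not taken from SQuAD. It only mimics the layout of a dataset example and shows how the character start offset locates the answer in the context.

```python
# Hand-written, hypothetical record that mirrors the layout of one example.
example = {
    "id": "hypothetical-0001",
    "title": "Example_Article",
    "context": "The Eiffel Tower is located in Paris.",
    "question": "Where is the Eiffel Tower located?",
    "answers": {"text": ["Paris"], "answer_start": [31]},
}

answer = example["answers"]["text"][0]
start = example["answers"]["answer_start"][0]
# The start offset indexes characters of the context, so slicing recovers the answer.
print(example["context"][start:start + len(answer)])
```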


## Loading the Tokenizer and the Model

Next, we download the RoBERTa model fine-tuned on SQuAD along with its tokenizer. Refer to the [Huggingface documentation](https://huggingface.co/docs/transformers/autoclass_tutorial) to see how to load pretrained models and tokenizers.

The 🤗 Transformers `Tokenizer` tokenizes the input sequence and converts the tokens to their corresponding IDs in the pretrained vocabulary. It generates various inputs that a model requires such as input_ids, attention_mask, token_type_ids, etc. You can read more details about this in the Huggingface [Tokenizer](https://huggingface.co/docs/transformers/main_classes/tokenizer) and [Preprocessing](https://huggingface.co/docs/transformers/preprocessing) documentation.

The 🤗 Transformers `AutoModelForQuestionAnswering` is a transformer model with a span classification head for extractive question answering. It returns ***start_logits*** and ***end_logits***, marking the start and end of the answer, respectively. More details on this model are given [here](https://huggingface.co/docs/transformers/model_doc/roberta#transformers.RobertaForQuestionAnswering).

In [None]:
%%time
# TODO 2: Define the tokenizer and QA model. Transfer the QA model to GPU.
# tokenizer = 
# qa_model = 

## Dataset Class

Next, we define a custom Dataset class for our SQuAD corpus.

The class implements three main functions: 
1. `__init__`: This function is run once when instantiating the Dataset object. We generally initialize our raw dataset, tokenizer, and tokenized dataset in this function. 
2. `__len__`: This returns the total number of examples in our dataset. Note that we set it to the number of available ***input_ids*** and not the size of the raw dataset. This is because we allow contexts longer than the maximum length of the model; each such context is split into several features, so there are more features than raw examples.
3. `__getitem__`: This function returns the sample from our dataset at a given index. You are supposed to implement this function.

You don't need to understand the details of each of them for the purpose of this assignment. However, if you want to learn more, you can look up the [PyTorch Documentation](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html).

### Important steps for preprocessing QA data: (`TODO`)

1. As the QA task has two input fields, the question and the context, we concatenate them before passing them to the model. Thus, the input is `[CLS] Question [SEP] Context [SEP]`. Note that we never want to truncate the question, only the context. So choose the `truncation` hyper-parameter in the [config cell](https://colab.research.google.com/drive/1HQ9z8cZE8TgLjlkAekEXfih9x9byB65r#scrollTo=8f323c24&line=2&uniqifier=1) accordingly.
2. As a result, in the case of very long documents, we must be careful not to lose the context that contains the answer. To resolve this concern, we will allow longer examples in our dataset to provide multiple input features, each of which is shorter than the maximum length (set as a hyper-parameter in the config dictionary). This can be done using the `stride` and `return_overflowing_tokens` hyper-parameters.
3. The `sequence_ids` of the tokenized input can be used to distinguish between the sequences in an input example. In our case, question tokens are assigned 0 and context tokens are assigned 1, because the question comes before the context in the sequence (special tokens are assigned `None`). We know that the answer tokens always lie in the context. Hence, to make things easy for post-processing, we set the offset mapping of the tokens that are not part of the context to (-1, -1).
4. The `overflow_to_sample_mapping` key returned by the tokenizer maps each feature to the example it was generated from.
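
Steps 3 and 4 can be illustrated without a real tokenizer. The sketch below uses hand-written stand-ins for the values a fast tokenizer returns (sequence ids, offset mapping, `overflow_to_sample_mapping`). The numbers are made up and `mask_non_context_offsets` is a hypothetical helper, so treat it as one possible approach rather than the required solution.

```python
def mask_non_context_offsets(offsets, sequence_ids, context_id=1):
    """Keep offsets of context tokens; replace all others with (-1, -1)."""
    return [
        offset if seq_id == context_id else (-1, -1)
        for offset, seq_id in zip(offsets, sequence_ids)
    ]


# One feature: special token, 2 question tokens, separator, 3 context tokens, special token.
sequence_ids = [None, 0, 0, None, 1, 1, 1, None]
offsets = [(0, 0), (0, 5), (6, 9), (0, 0), (0, 3), (4, 10), (11, 16), (0, 0)]
masked = mask_non_context_offsets(offsets, sequence_ids)
print(masked)

# Two examples; the first was split into two features, the second into one.
example_ids = ["ex-a", "ex-b"]
overflow_to_sample_mapping = [0, 0, 1]
feature_to_example = [example_ids[i] for i in overflow_to_sample_mapping]
print(feature_to_example)
```

After masking, any candidate span whose start or end token carries `(-1, -1)` can be rejected during post-processing without consulting the sequence ids again.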

In [None]:
class QADataset(Dataset):
    
    def __init__(
        self,
        data,
        tokenizer,
        config
    ):

        self.config = config
        self.data = data
        self.tokenizer = tokenizer
        self.tokenized_data = self.tokenizer(
            self.data["question"],
            self.data["context"],
            max_length=self.config["max_length"],
            stride=self.config["stride"],
            truncation=self.config["truncation"],
            padding=self.config["padding"],
            return_overflowing_tokens=self.config["return_overflowing_tokens"],
            return_offsets_mapping=self.config["return_offsets_mapping"],
            return_attention_mask=True,
            add_special_tokens=True
        )
        
        example_ids = []
        for i, sample_mapping in enumerate(tqdm(self.tokenized_data["overflow_to_sample_mapping"])):
            example_ids.append(self.data["id"][sample_mapping])

            sequence_ids = self.tokenized_data.sequence_ids(i)
            offset_mapping = self.tokenized_data["offset_mapping"][i]
            
            # TODO 3: set the offset mapping of the tokenized data at index i to (-1, -1) 
            # if the token is not in the context
            

        self.tokenized_data["ID"] = example_ids
        
        
        
    def __len__(
        self
    ):
        # TODO 4: define the length of the dataset equal to total number of unique features (not the total number of datapoints)
    
    
    
    def __getitem__(
        self,
        index: int
    ):
        # TODO 5: Return the tokenized dataset at the given index. Convert the various inputs to tensor using torch.tensor
        return {
            'input_ids': ,
            'attention_mask': ,
            'offset_mapping': ,
            'example_id': ,
        }

## Creating Dataloader

1. We create an object of our custom QADataset class.
2. To access examples batch-wise, we create a [DataLoader](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) object that is an iterable around the Dataset object. The `DataLoader` takes the `Dataset object` and `batch size` as parameters. There are many other parameters that one can specify but we only need batch size for this assignment.
3. Note that the length of the DataLoader object multiplied by the batch size should approximately give you the number of features in the dataset, which can exceed the number of raw examples.
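
The relationship in point 3 is a ceiling division, since the last batch may be smaller than the others (assuming the default `drop_last=False`). A quick check with made-up sizes:

```python
import math

num_features = 1000  # hypothetical number of features in the dataset
batch_size = 32      # hypothetical batch size

num_batches = math.ceil(num_features / batch_size)
print(num_batches)               # number of batches the loader would yield
print(num_batches * batch_size)  # at least num_features; larger when the last batch is partial
```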

In [None]:
%%time
eval_dataset = QADataset(
    data=datasets['validation'],
    tokenizer=tokenizer,
    config=config
)

In [None]:
%%time
eval_dataloader = DataLoader(
    eval_dataset,
    batch_size=config["batch_size"]
)
len(eval_dataloader)

We collect the raw and tokenized datasets in separate variables as they will be required during post-processing.

In [None]:
eval_data = eval_dataset.data
eval_features = eval_dataset.tokenized_data

# Inference on SQuAD (`TODO`)

* In the cell below you are supposed to perform inference on the SQuAD validation set.
* You are supposed to iterate over the DataLoader object, pass the tokenized input to the model, and store the start and end logits.
* Note: Do not forget to transfer the start and end logits tensor from GPU to CPU. Convert them to numpy arrays.

In [None]:
def qa_inference(model, data_loader):
    model.eval()
    start_logits = []
    end_logits = []
    for step, batch in enumerate(tqdm(data_loader, desc="Inference Iteration")):
        with torch.no_grad():
            model_kwargs = {
                'input_ids': batch['input_ids'].to(DEVICE, dtype=torch.long),
                'attention_mask': batch['attention_mask'].to(DEVICE, dtype=torch.long)
            }    

            # TODO 6: pass the model arguments to the model and store the output

            # TODO 7: Extract the start and end logits by extending `start_logits` and `end_logits`

    # TODO 8: Convert the start and end logits to a numpy array (by passing them to `np.array`)

    # TODO 9: return start and end logits


In [None]:
%%time
start_logits, end_logits = qa_inference(qa_model, eval_dataloader)
start_logits.shape, end_logits.shape

# Postprocessing

The predictions could give rise to various difficulties:
1. The answer span could include text from the question.
2. The answer could be too long.
3. The start position could be greater than the end position.

We have to do the following postprocessing steps to avoid the above-mentioned scenarios:
1. Skip answers that are not fully in the context (Hint: make use of the modified offset mapping done in the [preprocessing step](https://colab.research.google.com/drive/1HQ9z8cZE8TgLjlkAekEXfih9x9byB65r#scrollTo=916ff6d7&line=1&uniqifier=1)).
2. To select the best possible start and end logits, first sort them and select the top 'n' choices using the `n_best_size` hyper-parameter. Then iterate over the candidate start and end positions and skip the answers with a length that is either < 0 or > `max_answer_length`.
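
The two steps above can be sketched in plain Python on a single feature. This is a minimal illustration under stated assumptions, not the reference solution: `best_span` is a hypothetical helper, the logits and offsets are made up, answer length is counted in tokens, and the score of a span is taken to be the sum of its start and end logits.

```python
def best_span(start_logits, end_logits, offsets, context,
              n_best_size=3, max_answer_length=5):
    """Return the highest-scoring valid answer span for one feature, or None."""
    def top_n(logits):
        # Indexes of the n largest logits, best first.
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return order[:n_best_size]

    candidates = []
    for s in top_n(start_logits):
        for e in top_n(end_logits):
            # Skip spans touching tokens outside the context (masked offsets).
            if offsets[s] == (-1, -1) or offsets[e] == (-1, -1):
                continue
            # Skip spans that end before they start or are too long.
            length = e - s + 1
            if length <= 0 or length > max_answer_length:
                continue
            text = context[offsets[s][0]:offsets[e][1]]
            candidates.append(
                {"text": text, "logit_score": start_logits[s] + end_logits[e]}
            )
    return max(candidates, key=lambda c: c["logit_score"]) if candidates else None


context = "Paris is the capital of France"
# Token 0 stands for a non-context token whose offset was masked in preprocessing.
offsets = [(-1, -1), (0, 5), (6, 8), (9, 12), (13, 20), (21, 23), (24, 30)]
start_logits = [5.0, 4.0, 0.1, 0.2, 0.3, 0.1, 1.0]
end_logits = [5.0, 3.5, 0.1, 0.2, 0.3, 0.1, 2.0]
result = best_span(start_logits, end_logits, offsets, context)
print(result)
```

Even though the masked token has the largest logits, it is never selected, which is exactly why the offsets are overwritten during preprocessing. Returning `None` when no candidate survives also avoids calling `max` on an empty list.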

In [None]:
def post_processing(raw_dataset, tokenized_dataset, start_logits, end_logits):
    
    # Map each example to its features. This is done because an example can have multiple features
    # as we split the context into chunks if it exceeded the max length
    data2features = collections.defaultdict(list)
    for idx, feature_id in enumerate(tokenized_dataset['ID']):
        data2features[feature_id].append(idx)

    # Decode the answers for each datapoint
    predictions = []
    for data in tqdm(raw_dataset):
        answers = []
        data_id = data["id"]
        context = data["context"]

        for feature_index in data2features[data_id]:

            # TODO 10: Get the start logit, end logit, and offset mapping for each index.


            # TODO 11: Sort the start and end logits and get the top n_best_size logits.
            # Hint: look at other QA pipelines/tutorials.
            
            for start_index in start_indexes:
                for end_index in end_indexes:
                    
                    # TODO 12: Exclude answers that are not in the context
                    
                    # TODO 13: Exclude answers if (answer length < 0) or (answer length > max_answer_length)
                    

                    # TODO 14: collect answers in a list.
                    answers.append(
                        {
                            "text": ,
                            "logit_score": ,
                        }
                    )

        best_answer = max(answers, key=lambda x: x["logit_score"])
        predictions.append(
            {
                "id": data_id, 
                "prediction_text": best_answer["text"],
                "no_answer_probability": 0.0 if len(best_answer["text"]) > 0 else 1.0
            }
        )    
    return predictions

In [None]:
%%time
predicted_answers = post_processing(
    raw_dataset=eval_data, 
    tokenized_dataset=eval_features,
    start_logits=start_logits,
    end_logits=end_logits
)

In [None]:
gold_answers = [{"id": ex["id"], "answers": ex["answers"]} for ex in eval_data][:len(eval_data)]

In [None]:
assert len(predicted_answers) == len(gold_answers)

In [None]:
print(predicted_answers[0])
print(gold_answers[0])

## Evaluating the Predictions

We use the `🤗 Evaluate` library from Huggingface for evaluating our predictions.
Specifically, we evaluate the model based on two metrics:
1. `exact match`: This metric measures the percentage of predictions that match any one of the ground truth answers exactly.
2. `macro-averaged f1 score`: This metric measures the average overlap between the prediction and ground truth answer. We treat the prediction and ground truth as bags of tokens, and compute their F1. We take the maximum F1 over all of the ground truth answers for a given question, and then average over all of the questions.
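
The two metrics can be sketched in a few lines. This is a simplified illustration, not the code behind the `squad_v2` metric: it only lowercases and splits on whitespace, whereas the official scorer also strips punctuation and articles and treats unanswerable questions specially. `exact_match` and `token_f1` are hypothetical helper names.

```python
from collections import Counter


def exact_match(prediction, references):
    """1.0 if the prediction equals any reference (case-insensitive), else 0.0."""
    return float(any(prediction.lower() == ref.lower() for ref in references))


def token_f1(prediction, references):
    """Maximum bag-of-tokens F1 between the prediction and any reference."""
    def f1(pred, ref):
        pred_tokens, ref_tokens = pred.lower().split(), ref.lower().split()
        # Multiset intersection counts each shared token at most min(count) times.
        overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        return 2 * precision * recall / (precision + recall)
    return max(f1(prediction, ref) for ref in references)


references = ["the Eiffel Tower", "Eiffel Tower"]
print(exact_match("Eiffel Tower", references))
print(token_f1("the tall Eiffel Tower", references))
```

Averaging `token_f1` over all questions gives the macro-averaged score described in the list above.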

In [None]:
%%time
eval_metric = load("squad_v2")

In [None]:
eval_results = eval_metric.compute(predictions=predicted_answers, references=gold_answers)
eval_results

Save the SQuAD results as a json file. Make sure to name the file `squad_results.json`.

Make sure to download the json file and upload it to Gradescope with the same name.
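
Writing a dictionary of metric values to disk needs only the standard `json` module. The sketch below uses placeholder numbers rather than real results and writes into a temporary directory; in the notebook you would pass your own results dictionary and write `squad_results.json` to the working directory instead.

```python
import json
import os
import tempfile

# Placeholder values, not real scores.
placeholder_results = {"exact": 0.0, "f1": 0.0}

output_path = os.path.join(tempfile.mkdtemp(), "squad_results.json")
with open(output_path, "w") as f:
    json.dump(placeholder_results, f, indent=2)

# Read the file back to confirm it round-trips.
with open(output_path) as f:
    print(json.load(f))
```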

In [None]:
#TODO 15: save the metric results in squad_results.json file

# Blind Test Set

Now you will be given a blind test set for which you need to generate appropriate predictions using the functions given above.

You should be able to use code snippets from the SQuAD evaluation section.

## Load and preprocess the dataset

You can load the blind test set just like the SQuAD corpus using the `load_dataset` function from the `🤗 Datasets` library. 

More information on how to load a csv file using the load_dataset function is given [here](https://huggingface.co/docs/datasets/loading#csv).

In [None]:
# TODO 16: Load the blind test dataset using the load_dataset function, make sure to mention the split as `train`.
# test_dataset =

The code below prints a randomly selected example in the test set.

As you can see, the test data only contains id, title, context, and question and not the answers. Your job is to generate appropriate answers for the blind test dataset.

In [None]:
%%time
index = random.randint(0, len(test_dataset) - 1)  # randint includes both endpoints
datapoint = test_dataset[index]
print(f"index: {index}\n")
for column, info in datapoint.items():
    print(f"\n{column}:\t{info}")

* Wrap the test data into the custom QADataset object.
* Create a dataloader to loop over the dataset.

In [None]:
# TODO 17: Define the QADataset object for the test data.

In [None]:
# TODO 18: Define the dataloader for the test set 

Collect the raw and tokenized datasets in separate variables as they will be required during post-processing.

In [None]:
# TODO 19: Save the raw and tokenized test set into separate variables.

## Inference on Test Set

Use the `qa_inference` function to generate the start and end logits.

In [None]:
# TODO 20: perform inference on the blind test to get the start and end logits (use the qa_inference function)

Use the `post_processing` function to generate the final candidate answers.

In [None]:
# TODO 21: post process the predictions to generate the candidate answers.

Save the results as a json file. Make sure to name the file as `blind_test_predictions.json`.

In [None]:
#TODO 22: Save the candidate answers in `blind_test_predictions.json` file.