# Evaluation of BERT models on Question Answering Tasks

## Introduction

Question answering is a Natural Language Processing (NLP) task in which a system, given a natural-language question and a context document, returns the correct answer to the question. It is an active research area in which performance has improved substantially since the introduction of Transformer-based models.

We use the Stanford Question Answering Dataset (SQuAD), which consists of questions posed by crowdworkers on a set of Wikipedia articles. SQuAD 2.0 combines the questions from SQuAD 1.1 with additional unanswerable questions, so to perform well a system must also determine when to abstain from answering. We evaluate DistilBERT, ELECTRA, and RoBERTa models fine-tuned on the SQuAD 2.0 dataset. Ultimately, we want to compare the models, analyze which ones work best, and explain why.

### Data Analysis

We performed a detailed analysis of the data to understand the nature of SQuAD 2.0.

The [data analysis](./data_analysis.ipynb) notebook will take you through the analysis and inferences. To summarize the analysis:

1. Most contexts are 100-150 words long. Knowing this helps us decide how to configure the models and the fine-tuning for such inputs. The mean context length is 137.9 words, the longest context has 766 words, and the shortest has 22 words.
![](./assets/context_length.png)

2. Most questions are 10-15 words long. The mean question length is 11.29 words, and the longest question in the dataset is 60 words long.
![](./assets/question_length.png)

3. The answers, on the other hand, are comparatively short. The following figure shows that most answers are 3-4 words long. The longest answer is around 46 words long, and the shortest is a single word.
![](./assets/answer_length.png)

4. The objective of our models is to predict the start and end indices of the answer within the given context, so it is interesting to see where answers start relative to the context. The following graph shows that answer spans start most frequently near the beginning of the context, and that the frequency tapers off for positions further into the context.
![](./assets/start_index.png)
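
The statistics summarized above can be reproduced along the following lines. This is a minimal sketch rather than the notebook's actual code: the function name `length_stats` and the tiny `sample` record are hypothetical, and words are counted by simple whitespace splitting, which is an assumption.

```python
from statistics import mean

# Tiny hypothetical sample in the SQuAD 2.0 JSON layout (not real dataset rows).
sample = [{
    "title": "Example",
    "paragraphs": [{
        "context": "Normandy is a region in France. It is named after the Normans.",
        "qas": [
            {"id": "q1", "question": "In what country is Normandy located?",
             "answers": [{"text": "France", "answer_start": 24}],
             "is_impossible": False},
            {"id": "q2", "question": "Who founded Normandy in 1500?",
             "answers": [], "is_impossible": True},
        ],
    }],
}]


def length_stats(articles):
    """Collect word counts for contexts, questions and answers, plus the
    word index at which each answer starts inside its context."""
    ctx, qs, ans, starts = [], [], [], []
    for article in articles:
        for para in article["paragraphs"]:
            context = para["context"]
            ctx.append(len(context.split()))
            for qa in para["qas"]:
                qs.append(len(qa["question"].split()))
                for a in qa["answers"]:
                    ans.append(len(a["text"].split()))
                    # Convert the character offset into a 0-based word index.
                    starts.append(len(context[:a["answer_start"]].split()))
    return {"context": ctx, "question": qs, "answer": ans, "start_word": starts}


stats = length_stats(sample)
for name, values in stats.items():
    print(name, "mean:", mean(values), "min:", min(values), "max:", max(values))
```

Histograms of these four lists correspond to the four figures in this section.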

## Implementation

We used the pre-trained DistilBERT-base-uncased, ELECTRA-base, and RoBERTa-base networks from Hugging Face through the SimpleTransformers library. The library contains easy-to-use pre-trained question-answering models of type ALBERT, BERT, DistilBERT, ELECTRA, XLM, and XLNet. Training took about 13 hours per model per run on a 6GB NVIDIA GeForce GTX 1060 GPU.

### I. Model Selection

The model is selected by passing one of the following three names as a command-line argument when training, evaluating, or checking the performance of a model.

1. distilbert
2. electra-base
3. roberta

For ease of comparison, we have included a shell script `run.sh` that runs all the models sequentially, removing any cache before each run and passing the right set of arguments, as follows.

```
rm -r cache_dir
python train.py distilbert

rm -r cache_dir
python train.py roberta

rm -r cache_dir
python train.py electra-base
```

The training and the evaluation scripts contain the following code to select the model type.

```python
model_type = "roberta"

# Map the command-line name to a Hugging Face checkpoint
if model_type == "distilbert":
    model_name = "distilbert-base-uncased-distilled-squad"

elif model_type == "roberta":
    model_name = "deepset/roberta-base-squad2"

elif model_type == "electra-base":
    # SimpleTransformers expects the model type "electra"
    model_type = "electra"
    model_name = "deepset/electra-base-squad2"
```
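
The same selection logic can also be written as a lookup table combined with command-line parsing, so that an unsupported name fails early. The sketch below is only an illustration: `MODEL_CHOICES`, `resolve_model`, and `parse_args` are hypothetical names, and the checkpoint names are copied from the selection code above.

```python
import argparse

# Command-line name -> (SimpleTransformers model_type, checkpoint name).
MODEL_CHOICES = {
    "distilbert": ("distilbert", "distilbert-base-uncased-distilled-squad"),
    "roberta": ("roberta", "deepset/roberta-base-squad2"),
    "electra-base": ("electra", "deepset/electra-base-squad2"),
}


def resolve_model(cli_name):
    """Return (model_type, model_name) for a supported command-line name."""
    if cli_name not in MODEL_CHOICES:
        raise ValueError(
            f"unknown model {cli_name!r}, expected one of {sorted(MODEL_CHOICES)}"
        )
    return MODEL_CHOICES[cli_name]


def parse_args(argv=None):
    """Parse a single positional argument naming the model to use."""
    parser = argparse.ArgumentParser(description="Select a question-answering model")
    parser.add_argument("model", choices=sorted(MODEL_CHOICES))
    return parser.parse_args(argv)


# Equivalent to running `python train.py electra-base`
args = parse_args(["electra-base"])
model_type, model_name = resolve_model(args.model)
print(model_type, model_name)
```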

### II. Model Training


We load the pre-trained models from Hugging Face using the question-answering module of the SimpleTransformers library.

```python
from simpletransformers.question_answering import QuestionAnsweringModel
```

**Fine-tuning**:
We fine-tuned the pre-trained models on the SQuAD 2.0 data for two complete training epochs, tuning hyper-parameters such as the learning rate and the number of gradient accumulation steps. We experimented with different values and obtained decent results with the hyper-parameters listed below. Due to time constraints, we monitored the training process and stopped training when the loss did not improve over a short period.

* learning_rate (Amount by which the weights are updated during training) = 4e-5
* adam_epsilon (The value added in Adam Optimizer to avoid division by zero) = 1e-8
* warmup_ratio (Ratio of steps used for warm-up (very low learning rate)) = 0.06
* max_grad_norm (Gradient clipping value used to avoid exploding gradients) = 1.0
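
To make these hyper-parameters concrete, the sketch below shows how a warm-up ratio translates into a number of warm-up steps and a learning rate per step, and how gradient-norm clipping rescales large gradients. It assumes a linear warm-up followed by a linear decay to zero, which is a common schedule but is not confirmed by this document. The function names and the step count of 1000 are hypothetical.

```python
import math


def warmup_steps(total_steps, warmup_ratio):
    """Number of optimizer steps spent warming up."""
    return math.ceil(total_steps * warmup_ratio)


def learning_rate_at(step, total_steps, peak_lr=4e-5, warmup_ratio=0.06):
    """Linear warm-up to peak_lr, then linear decay to zero (assumed shape)."""
    n_warm = warmup_steps(total_steps, warmup_ratio)
    if step < n_warm:
        return peak_lr * step / max(1, n_warm)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - n_warm)


def clip_scale(grad_norm, max_grad_norm=1.0):
    """Factor applied to all gradients when their global norm is clipped.

    Simplified: real implementations usually add a small constant to the
    denominator for numerical stability.
    """
    if grad_norm <= 0:
        return 1.0
    return min(1.0, max_grad_norm / grad_norm)


total = 1000  # hypothetical number of optimizer steps
print("warm-up steps:", warmup_steps(total, 0.06))
print("lr halfway through warm-up:", learning_rate_at(30, total))
print("lr at end of warm-up:", learning_rate_at(60, total))
print("scale for a gradient norm of 4.0:", clip_scale(4.0))
```

With these assumptions, 6% of 1000 steps gives 60 warm-up steps, and a gradient with norm 4.0 is scaled by 0.25 so that its norm becomes 1.0.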

```python
train_args = {
    "reprocess_input_data": False,
    "overwrite_output_dir": True,
    "use_cached_eval_features": True,
    "output_dir": f"./models/{model_type}",
    "max_seq_length": 128,
    "num_train_epochs": 2,
    "wandb_project": "QuestionAnswering Model Comparison",
    "wandb_kwargs": {"name": model_name},
    "train_batch_size": 8,

    # Optimizer and schedule hyper-parameters
    "weight_decay": 0,
    "learning_rate": 4e-5,
    "adam_epsilon": 1e-8,
    "warmup_ratio": 0.06,
    "warmup_steps": 0,
    "max_grad_norm": 1.0,
}

# Load the fine-tuned model from its output folder
model = QuestionAnsweringModel(model_type=model_type,
                               model_name=f"./models/{model_type}/",
                               args=train_args,
                               use_cuda=True)
```

The generated model files are stored separately in their respective folders, as specified by `"output_dir": f"./models/{model_type}"` in the training args.

```
├── models 
      └── distilbert
      └── electra
      └── roberta
```

### III. Predict on Dev Data

**Input**
The input has the following format: each paragraph contains a context and an array of associated questions, each identified by a question ID and accompanied by a boolean value `is_impossible` that indicates whether the question is unanswerable.
```
[{
    "title": "Normans", 
    "paragraphs": [
        {
            "context": "The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (\"Norman\" comes from \"Norseman\") raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, agreed to swear fealty to King Charles III of West Francia. Through generations of assimilation and mixing with the native Frankish and Roman-Gaulish populations, their descendants would gradually merge with the Carolingian-based cultures of West Francia. The distinct cultural and ethnic identity of the Normans emerged initially in the first half of the 10th century, and it continued to evolve over the succeeding centuries.",
            "qas": [{"question": "In what country is Normandy located?", 
                    "id": "56ddde6b9a695914005b9628", 
                    "answers": [{"text": "France", "answer_start": 159}, 
                                ...],
                    "is_impossible": false}]
            ...
        }
        ...
    ]
}]
```
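
The nested format above can be flattened into a list of paragraphs, each with a `context` and its `qas`, which matches the input shape used for prediction in the web application below. The sketch is illustrative only: `to_paragraphs` and `split_by_answerability` are hypothetical helper names, and the sample IDs and texts are made up.

```python
def to_paragraphs(articles):
    """Flatten SQuAD-style articles into a list of {'context', 'qas'} dicts."""
    return [
        {"context": para["context"], "qas": para["qas"]}
        for article in articles
        for para in article["paragraphs"]
    ]


def split_by_answerability(paragraphs):
    """Return the question IDs that are answerable and those that are not."""
    answerable, unanswerable = [], []
    for para in paragraphs:
        for qa in para["qas"]:
            target = unanswerable if qa["is_impossible"] else answerable
            target.append(qa["id"])
    return answerable, unanswerable


# Hypothetical two-question article in the format shown above.
articles = [{
    "title": "Normans",
    "paragraphs": [{
        "context": "Normandy is a region in France.",
        "qas": [
            {"question": "In what country is Normandy located?",
             "id": "example-answerable",
             "answers": [{"text": "France", "answer_start": 24}],
             "is_impossible": False},
            {"question": "Who sold Normandy in 1999?",
             "id": "example-unanswerable",
             "answers": [],
             "is_impossible": True},
        ],
    }],
}]

dev_data = to_paragraphs(articles)
has_ans, no_ans = split_by_answerability(dev_data)
print(len(dev_data), has_ans, no_ans)
```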

We then use our fine-tuned models to predict on the dev data. For each question, the model generates its top 10 answer predictions, and we keep the top-ranked answer for scoring.

```python
preds, _ = model.predict(dev_data)

# Keep the top-ranked of the 10 predictions for each question ID
predictions = {pred['id']: pred['answer'][0] for pred in preds}
```

This generates a JSON output file called `predictions.json` in the respective output folder of each model. The file maps each question ID to its predicted answer.

```
{
    "56ddde6b9a695914005b9628": "France.", 
    "56ddde6b9a695914005b9629": "10th and 11th centuries"
    ...
}
```
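
The step from ranked predictions to a `predictions.json` file can be sketched as follows. The `preds` list is a hypothetical stand-in for the first value returned by `model.predict`, and `top_answers` and `save_predictions` are illustrative names, not functions from the project. An empty string stands for "no answer".

```python
import json
import os
import tempfile

# Hypothetical ranked predictions: candidate answers are ordered best-first.
preds = [
    {"id": "q1", "answer": ["France", "Normandy", ""]},
    {"id": "q2", "answer": ["", "Rollo"]},
]


def top_answers(preds):
    """Map each question ID to its top-ranked answer ('' if there is none)."""
    return {p["id"]: (p["answer"][0] if p["answer"] else "") for p in preds}


def save_predictions(predictions, output_dir):
    """Write predictions.json into output_dir and return the file path."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, "predictions.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(predictions, f, indent=4)
    return path


predictions = top_answers(preds)

# Write to a temporary folder so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = save_predictions(predictions, os.path.join(tmp, "output", "roberta"))
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded)
```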

```
├── output
      └── distilbert
            └── predictions.json
      └── electra
            └── predictions.json
      └── roberta
            └── predictions.json
```

### IV. Web Application implementation

The folder structure of the application is as follows:

```
├── web_application
      └── app.py
      └── static
            └── css
                └── home.css
            └── js
                └── home.js
      └── templates
            └── base.html
            └── home.html
            
```


After installing the requirements, run the app with:
```
$ python app/app.py
```

The app should start at http://127.0.0.1:5000/.

The `get_data` function takes the user-entered question and context as a POST request, runs our best-performing model on them, and returns the result in JSON format. We use the RoBERTa model to produce the predictions.

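```python
from flask import Flask, jsonify, request
from simpletransformers.question_answering import QuestionAnsweringModel

app = Flask(__name__)


@app.route('/data', methods=['POST'])
def get_data():
    # Accept either a JSON body or form-encoded fields;
    # silent=True returns None instead of raising when the body is not JSON
    data = request.get_json(silent=True)
    if data is None:
        data = request.form
    context = data['context']
    question = data['question']

    # Single-paragraph input in the format expected by model.predict
    to_predict = [{'context': context, 'qas': [{'question': question, 'id': '0'}]}]

    # Load the fine-tuned RoBERTa model on CPU
    model_type = "roberta"
    model = QuestionAnsweringModel(model_type=model_type,
                                   model_name=f"../models/{model_type}/",
                                   use_cuda=False)

    preds, _ = model.predict(to_predict)

    # An empty top answer means the model abstained
    answer = preds[0]['answer'][0]
    print(answer)
    result = "No answer found" if answer == "" else answer

    return jsonify({'output': result})
```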
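
A client can call this endpoint by POSTing a JSON body with `question` and `context` fields and reading the `output` field of the response. The following is a minimal sketch using only the standard library. The helper names `build_request` and `ask` are hypothetical, and the URL assumes the app is running locally on the default port mentioned above.

```python
import json
import urllib.request

DEFAULT_URL = "http://127.0.0.1:5000/data"  # assumes the local app is running


def build_request(question, context, url=DEFAULT_URL):
    """Build a JSON POST request for the /data endpoint."""
    payload = json.dumps({"question": question, "context": context}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, context, url=DEFAULT_URL):
    """Send the question to the running app and return the predicted answer."""
    with urllib.request.urlopen(build_request(question, context, url)) as resp:
        return json.loads(resp.read().decode("utf-8"))["output"]


# Building the request does not need a running server; ask(...) does.
req = build_request("In what country is Normandy located?",
                    "Normandy is a region in France.")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```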



### V. Model Evaluation

The evaluation is performed on the dev data using precision, recall, and F1 scores. The answers are first normalized by converting them to lower case and removing stop words and punctuation.

Two evaluation metrics are computed:
1. Exact Match: The exact score is 1 if the normalized prediction is identical to one of the normalized gold answers, and 0 otherwise. For unanswerable questions, the only correct answer is an empty string.
2. F1 Score: The F1 score is computed from the precision and recall over the predicted tokens, taking the maximum over all gold answers. An F1 score of 1 is given to an unanswerable question when both the predicted and gold answers are empty strings.

$$
Precision = \frac{count(tokens_{same})}{count(tokens_{pred})}
$$

$$
Recall = \frac{count(tokens_{same})}{count(tokens_{gold})}
$$

$$
F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}
$$

```python
import collections

# get_tokens, normalize_answer and compute_exact are helper functions
# defined elsewhere in the evaluation script.


def compute_f1(a_gold, a_pred):
    gold_toks = get_tokens(a_gold)
    pred_toks = get_tokens(a_pred)
    # Multiset intersection counts the tokens shared by both answers
    common = collections.Counter(gold_toks) & collections.Counter(pred_toks)
    num_same = sum(common.values())
    if len(gold_toks) == 0 or len(pred_toks) == 0:
        # If either is no-answer, then F1 is 1 if they agree, 0 otherwise
        return int(gold_toks == pred_toks)
    if num_same == 0:
        return 0
    precision = 1.0 * num_same / len(pred_toks)
    recall = 1.0 * num_same / len(gold_toks)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1


def get_raw_scores(dataset, preds):
    exact_scores = {}
    f1_scores = {}
    for article in dataset:
        for p in article['paragraphs']:
            for qa in p['qas']:
                qid = qa['id']
                gold_answers = [a['text'] for a in qa['answers']
                                if normalize_answer(a['text'])]
                if not gold_answers:
                    # For unanswerable questions, only correct answer is empty string
                    gold_answers = ['']
                if qid not in preds:
                    print('Missing prediction for %s' % qid)
                    continue
                a_pred = preds[qid]
                # Take max over all gold answers
                exact_scores[qid] = max(compute_exact(a, a_pred) for a in gold_answers)
                f1_scores[qid] = max(compute_f1(a, a_pred) for a in gold_answers)
    return exact_scores, f1_scores
```
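
The helper functions are not shown in this document, so the self-contained sketch below fills them in under stated assumptions to give a worked example of both metrics. In particular, this `normalize_answer` lower-cases the text and removes punctuation and only the articles "a", "an", and "the". That is an assumption for illustration and may differ from the stop-word handling in our script. The name `token_f1` is used to avoid confusion with the function above.

```python
import collections
import re
import string


def normalize_answer(text):
    """Lower-case, drop punctuation and articles, collapse whitespace (assumed rules)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def get_tokens(text):
    """Split a normalized answer into whitespace-separated tokens."""
    return normalize_answer(text).split()


def compute_exact(a_gold, a_pred):
    """1 if the normalized answers are identical, 0 otherwise."""
    return int(normalize_answer(a_gold) == normalize_answer(a_pred))


def token_f1(a_gold, a_pred):
    """Token-level F1 between a gold answer and a predicted answer."""
    gold, pred = get_tokens(a_gold), get_tokens(a_pred)
    if not gold or not pred:
        # No-answer case: full credit only if both are empty
        return float(gold == pred)
    common = collections.Counter(gold) & collections.Counter(pred)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred)
    recall = num_same / len(gold)
    return 2 * precision * recall / (precision + recall)


# Prediction covers 2 of the 4 gold tokens: precision 1.0, recall 0.5
print(token_f1("10th and 11th centuries", "11th centuries"))
# Trailing punctuation is removed by normalization, so this is an exact match
print(compute_exact("France", "France."))
# Unanswerable question: an empty prediction is correct, a non-empty one is not
print(token_f1("", ""), token_f1("", "France"))
```

In the first example, precision is 2/2 and recall is 2/4, so F1 is 2 · 1.0 · 0.5 / 1.5, or about 0.667.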


## Results


RoBERTa achieves the best overall scores (72.72 exact match and 75.10 F1), mainly because it handles unanswerable questions far better than the other two models (86.53 on NoAns, compared with about 70 for both DistilBERT and ELECTRA). On answerable questions, ELECTRA and RoBERTa are close (63.94 versus 63.65 HasAns F1), while DistilBERT trails both. DistilBERT nevertheless provides fairly good results given its simpler architecture, faster training, and lower use of computational resources, which illustrates the compute-performance trade-off in our experiments.

|                   |**DistilBERT** | **ELECTRA** | **RoBERTa** |
| ----------------- | ------------- | ----------- | ----------- |
| **HasAns_exact**  | 47.05         | 57.56       | 58.87       |
| **HasAns_f1**     | 51.75         | 63.94       | 63.65       |
| **NoAns_exact**   | 69.45         | 70.06       | 86.53       |
| **NoAns_f1**      | 69.45         | 70.06       | 86.53       |
| **exact**         | 58.27         | 63.82       | 72.72       |
| **f1**            | 60.61         | 67.00       | 75.10       |
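
The overall `exact` and `f1` rows can be read as averages of the HasAns and NoAns rows, weighted by the number of questions in each group. The sketch below checks this arithmetic against the table. The question counts (5,928 answerable and 5,945 unanswerable) are an assumption based on the commonly reported composition of the SQuAD 2.0 dev set and are not stated in this document. The HasAns values are rounded to two decimals.

```python
# Assumed SQuAD 2.0 dev-set question counts (not stated in this document).
N_HAS_ANS = 5928
N_NO_ANS = 5945

# (HasAns, NoAns) scores per metric, taken from the table above.
group_scores = {
    "DistilBERT": {"exact": (47.05, 69.45), "f1": (51.75, 69.45)},
    "ELECTRA": {"exact": (57.56, 70.06), "f1": (63.94, 70.06)},
    "RoBERTa": {"exact": (58.87, 86.53), "f1": (63.65, 86.53)},
}

# Overall scores reported in the table above.
reported = {
    "DistilBERT": {"exact": 58.27, "f1": 60.61},
    "ELECTRA": {"exact": 63.82, "f1": 67.00},
    "RoBERTa": {"exact": 72.72, "f1": 75.10},
}


def overall(has_ans, no_ans, n_has=N_HAS_ANS, n_no=N_NO_ANS):
    """Average of the two group scores, weighted by group size."""
    return (has_ans * n_has + no_ans * n_no) / (n_has + n_no)


for model, metrics in group_scores.items():
    for metric, (has_ans, no_ans) in metrics.items():
        combined = overall(has_ans, no_ans)
        print(f"{model:10s} {metric:5s} computed={combined:.2f} "
              f"reported={reported[model][metric]:.2f}")
```

Because the two groups are almost the same size, the overall score is close to the plain mean of the HasAns and NoAns scores, which is why RoBERTa's large NoAns advantage carries over to its overall lead.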


* As the figures below show, DistilBERT performs reasonably well considering its simpler architecture, its shorter runtime, and its much lower use of computational resources.

![](./assets/loss.png?raw=true "Relative Loss")



Relative Runtimes                                                | Relative GPU Utilization
:---------------------------------------------------------------:|:-----------------------------------------------------------------------:
![](./assets/relative_runtime.png?raw=true "Relative Runtimes")  |  ![](./assets/relative_gpu_util.png?raw=true "Relative GPU Utilization")



## Conclusion

* RoBERTa is the clear winner, which is not surprising considering its complexity, training data, and pre-training approach.

* Our experiments show a clear compute-performance trade-off. Even so, DistilBERT provides fairly good results given its simpler architecture, faster training, and lower use of computational resources compared with the other two models.