# Exercise 10 - Question Answering in NLP
In this exercise, you will experiment with one of NLP’s exciting tasks - Question Answering!

You will first evaluate the performance of a pre-trained model on SQuAD, a leading question-answering dataset. Those with approved access to GPUs in AWS or a different provider are encouraged to also fine-tune a base model on the SQuAD dataset.

We will use HuggingFace’s Transformers, the leading package for NLP tasks using transformers. Your code should roughly follow the code of [this guide](https://huggingface.co/docs/transformers/tasks/question_answering) and [this notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/question_answering.ipynb).  

(**Important Note:** The guide writes a considerable amount of code to handle contexts longer than the maximum input sequence. For simplicity, in your code, you should remove from the datasets all contexts longer than `max_length = 384`.)
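
A minimal sketch of this filtering step, assuming each example is a dict with `question` and `context` keys as in SQuAD. The `whitespace_token_count` helper is a hypothetical stand-in for a real tokenizer; with a HuggingFace tokenizer you could instead pass something like `lambda q, c: len(tokenizer(q, c)["input_ids"])`, and with a 🤗 `Dataset` the same predicate could be given to `Dataset.filter`. Measuring the length on question plus context (rather than context alone) is also an assumption.

```python
MAX_LENGTH = 384  # maximum number of tokens allowed in the model input


def whitespace_token_count(question, context):
    """Stand-in token counter (assumption): counts whitespace-separated words."""
    return len(question.split()) + len(context.split())


def fits_in_model(example, count_tokens=whitespace_token_count, max_length=MAX_LENGTH):
    """Return True when the question + context fits within max_length tokens."""
    return count_tokens(example["question"], example["context"]) <= max_length


# Toy examples (illustrative data, not taken from SQuAD).
examples = [
    {"question": "Who wrote it?", "context": "Ada wrote it."},
    {"question": "Who wrote it?", "context": "word " * 500},
]

kept = [ex for ex in examples if fits_in_model(ex)]
print(len(kept))
```

Passing the counter as a parameter keeps the predicate independent of any particular tokenizer, so it can be checked on a CPU machine before the real tokenizer is plugged in.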


### Experiment Tracking using ClearML
This exercise uses large models. Although we only fine-tune existing models, fine-tuning still takes a long time (in many cases, over an hour on a V100 GPU). You are not expected to make many runs yourself, but we would like to benefit from the joint work of the entire class, and will therefore report our runs to ClearML.

We will use a shared ClearML project for all student groups, so you can see the runs of other teams (see instructions below). Specifically, you will choose a set of hyper-parameters for your own runs and compare your results to those of other teams, so that everyone benefits from the community's runs.


Install the transformers and datasets libraries.

In [None]:
! pip install datasets transformers

Import required libraries.  
Make sure your version of Transformers is at least 4.11.0.

In [None]:
import transformers

print(transformers.__version__)

We will use the 🤗 [Datasets](https://github.com/huggingface/datasets) library to download the data and get the metric we need to use for evaluation (to compare our model to the benchmark). This can be easily done with the functions `load_dataset` and `load_metric`.

For our example here, we'll use version 1.1 of Stanford's [SQuAD dataset](https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/).

Load the SQuAD v1.1 dataset.

In [None]:
from datasets import load_dataset, load_metric

datasets = load_dataset("squad")

### Getting to know the dataset

The `datasets` object itself is a `DatasetDict`, which contains one key for each available split. For SQuAD v1.1 these are the training and validation sets.

In [113]:
datasets

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

We can see that the training and validation sets both have a column for the context, the question and the answers to those questions.

To access an actual element, you need to select a split first, then give an index:

In [None]:
datasets["train"][0]

Now, answer these questions:  

What is the shortest context in the training dataset?


What is the longest answer in the dataset?

Is there a question that appears multiple times? What is the most common question?
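
One possible way to approach these questions is sketched below on a few toy rows shaped like SQuAD records. The rows are illustrative data made up for this sketch, not taken from the dataset, and measuring length in characters is an assumption. Iterating over `datasets["train"]` yields dicts with the same keys, so the same expressions should carry over.

```python
from collections import Counter

# Toy rows in the same shape as SQuAD records (illustrative, not real data).
rows = [
    {
        "context": "Paris is the capital of France.",
        "question": "What is the capital of France?",
        "answers": {"text": ["Paris"], "answer_start": [0]},
    },
    {
        "context": "The Nile flows north into the Mediterranean Sea.",
        "question": "Where does the Nile end?",
        "answers": {"text": ["the Mediterranean Sea"], "answer_start": [26]},
    },
    {
        "context": "France is in Europe; its capital is Paris.",
        "question": "What is the capital of France?",
        "answers": {"text": ["Paris"], "answer_start": [36]},
    },
]

# Shortest context, measured in characters.
shortest_context = min((r["context"] for r in rows), key=len)

# Longest answer across all answer texts of all rows.
longest_answer = max((t for r in rows for t in r["answers"]["text"]), key=len)

# Most frequent question and how often it appears.
question_counts = Counter(r["question"] for r in rows)
most_common_question, count = question_counts.most_common(1)[0]

print(shortest_context, longest_answer, most_common_question, count, sep="\n")
```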

### Getting to know the HuggingFace transformers’ tokenizers
As a preprocessing step, the HuggingFace code tokenizes input sequences using a Tokenizer. Read more about tokenizers here:
https://huggingface.co/docs/tokenizers/pipeline  
https://huggingface.co/transformers/v3.0.2/preprocessing.html

For this question, use the BERT tokenizer. The tokenizer sometimes breaks words into smaller chunks, so the number of tokens can be larger than the number of words. 

Using the first 1,000 context datapoints, print the 30 most common tokens produced by the tokenizer.
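
The counting part of this task can be sketched with `collections.Counter`. The `simple_tokenize` function below is a hypothetical stand-in that only lowercases and splits on whitespace; for the exercise you would pass a real tokenizer function instead, for example the `tokenize` method of a tokenizer loaded with `AutoTokenizer.from_pretrained("bert-base-uncased")` (the checkpoint name is an assumption).

```python
from collections import Counter


def most_common_tokens(contexts, tokenize, n=30):
    """Count tokens over all contexts and return the n most frequent ones."""
    counts = Counter()
    for text in contexts:
        counts.update(tokenize(text))
    return counts.most_common(n)


def simple_tokenize(text):
    """Stand-in tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


# Toy contexts (illustrative data).
contexts = ["The cat sat on the mat", "The dog sat"]
top = most_common_tokens(contexts, simple_tokenize, n=2)
print(top)
```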


### Load a pretrained Question Answering model
In this section, you will use a model pretrained on the Squad dataset for question answering.  

Choose a model you'd like to use.  
You can see a list of available models here: https://huggingface.co/models?dataset=dataset:squad&sort=downloads


Load the model.

### Pretrained Model Error Analysis
Here you will evaluate your model’s performance.

Write code to manually review a few errors of the model. 


Do you see a pattern there? Is there any hypothesis you form for cases where the model fails?

Write code that runs inference and outputs the predicted answer to a context and question texts typed by the user. We recommend that you use ipywidgets for interactivity:  
https://ipywidgets.readthedocs.io/en/latest/examples/Widget%20Basics.html

Use the award-winning GUI you’ve just created to try to poke holes in the model manually. Try to characterize the cases your model mishandles.


Next, evaluate your model’s performance for different input text lengths and different answer lengths.
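
A rough sketch of such a breakdown is shown below. It uses a plain-Python approximation of SQuAD-style exact match and token-level F1, then averages the scores per context-length bucket. The record layout (`context`, `prediction`, `answers`), the bucket size, and measuring length in whitespace-separated words are all assumptions made for this sketch.

```python
import re
import string
from collections import Counter, defaultdict

PUNCTUATION = set(string.punctuation)


def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction, truth):
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


def scores_by_bucket(records, bucket_size=100):
    """Average EM and F1 per context-length bucket (length in words)."""
    buckets = defaultdict(list)
    for rec in records:
        bucket = (len(rec["context"].split()) // bucket_size) * bucket_size
        # Score against the best-matching reference answer.
        em = max(exact_match(rec["prediction"], t) for t in rec["answers"])
        f1 = max(f1_score(rec["prediction"], t) for t in rec["answers"])
        buckets[bucket].append((em, f1))
    return {
        b: {
            "em": sum(e for e, _ in v) / len(v),
            "f1": sum(f for _, f in v) / len(v),
            "n": len(v),
        }
        for b, v in sorted(buckets.items())
    }


# Toy predictions (illustrative data).
records = [
    {"context": "Paris is the capital of France.",
     "prediction": "Paris", "answers": ["Paris"]},
    {"context": "word " * 150 + "The Nile ends in the Mediterranean Sea.",
     "prediction": "the Sea", "answers": ["the Mediterranean Sea"]},
]
result = scores_by_bucket(records)
print(result)
```

The same function could bucket by answer length instead by replacing the `bucket` expression with one based on the reference answers.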

BONUS: Can you think of other axes that would be interesting to use to evaluate your model?

### [Advanced] Fine-tune a Model
Here, you will fine-tune a base model on the SQuAD dataset, and evaluate its performance. 

Note that since the model is large, running this exercise might cost dozens of dollars. You are strongly encouraged to use your AWS credits for that. We also recommend developing the code on a cheap CPU-based machine, making sure that it runs properly, and only then moving to more expensive GPU instances for the actual runs.

What metric do you find suited? Why?


Fine-tune the model on the dataset. 

Make sure you use ClearML to report your results in real-time.

Report your training and validation loss below. You can compare your results to those of others from the class using ClearML.

## ClearML Integration
You can join the ClearML Slack channel at https://join.slack.com/t/clearml/shared_invite/zt-c0t13pty-aVUZZW1TSSSg2vyIGVPBhg to ask questions and see real-life examples and questions from industry users.

### Parameters & Configurations
Keep your parameters/configs in a single dict within your code. For example 

`config={"param":"data", ...}`

This way you can easily connect it to ClearML using

`Task.connect(config)`

The documentation is here:
https://clear.ml/docs/latest/docs/references/sdk/task#connect_configuration
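
As a hedged sketch of how this might look in practice: the config values below are illustrative assumptions, and `start_tracked_run` is a hypothetical helper name. It relies on `Task.init` and the task object's `connect` method from the `clearml` package, and calling it requires ClearML credentials to be configured on your machine.

```python
# Hyper-parameters kept in one dict (all values are illustrative assumptions).
config = {
    "model_name": "distilbert-base-uncased",
    "max_length": 384,
    "learning_rate": 2e-5,
    "batch_size": 16,
    "num_train_epochs": 3,
}


def start_tracked_run(config, project_name, task_name):
    """Create a ClearML task and connect the config dict to it."""
    # Imported lazily so the module loads even without clearml installed.
    from clearml import Task

    task = Task.init(project_name=project_name, task_name=task_name)
    # connect() registers the dict as the task's hyper-parameters and
    # returns the connected dict, which should be used from then on.
    connected = task.connect(config)
    return task, connected
```

The project and task names are left as parameters here; use the shared class project name given in the course instructions and a run name that identifies your team.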


### Comparisons
Once you have gathered some data, you can select multiple experiments and compare them, as detailed here: https://clear.ml/docs/latest/docs/webapp/webapp_exp_comparing/

### Additional Videos
We also recommend you review these videos to learn industry best practices:
- Day in the life of a data scientist - This video covers nearly everything you might need to use ClearML - https://www.youtube.com/watch?v=quSGXvuK1IM
- Detection in video on Raspberry Pi - a real-world example of what can be done with ML (an example of a nice portfolio project) - https://www.youtube.com/watch?v=ZiOr9EdYEeE


## Recommended Resources
For an open discussion of Question Answering related topics, you are strongly encouraged to watch this workshop: https://www.youtube.com/watch?v=Ihgk8kGLpIE

This screencast uses T5 on a different Q&A dataset: https://www.youtube.com/watch?v=_l2wJb3QPdk



That's it - good luck!