# Knowledge Intensive NLP Summer School

## Notebook 2

The goals of this notebook are:

* Use a BERT-based DPR model to retrieve passages
* Train a T5 model to predict which page contains the answer


## Resources
Documentation for the HuggingFace libraries is available on their website:

* BERT https://huggingface.co/docs/transformers/model_doc/bert
* T5 https://huggingface.co/docs/transformers/model_doc/t5
* Datasets https://huggingface.co/docs/datasets/index

## Tutorial

This notebook is based on the following tutorials:

* BERT https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification
* Fine-tuning https://huggingface.co/docs/transformers/training
* Language Generation https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html


## Exercise 1 - DPR for SQuAD

In [1]:
# Download the SQuAD dataset from huggingface hub (approx 30MB)
from datasets import load_dataset
dataset = load_dataset("squad")

Inspect the first instance of the dataset. Today we will use the Wikipedia page title, the context, and the question.

In [2]:
dataset["train"][0]

{'id': '5733be284776f41900661182',
 'title': 'University_of_Notre_Dame',
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'answers': {'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}}

Use sets to collect all unique context paragraphs and all unique questions.

In [5]:
unique_documents = set()
unique_questions = set()

for item in list(dataset["train"]) + list(dataset["validation"]):
    unique_documents.add(item['context'])
    unique_questions.add(item['question'])

This is the set of documents and queries we have to work with.  

In [8]:
print(len(unique_documents))

20958


## Exercise 1.1 - Dense Embedding

Use the HuggingFace pre-trained DPR encoders to generate embeddings of the questions and documents. Save these in an appropriate data structure.

NB: there are two models: a question encoder and a context encoder.

* Introduction / Model pages:
  * https://huggingface.co/facebook/dpr-question_encoder-single-nq-base
  * https://huggingface.co/facebook/dpr-ctx_encoder-single-nq-base
* Detailed documentation: https://huggingface.co/docs/transformers/model_doc/dpr

Hints:
* Using the `pooler_output` will yield a single embedding for the passage. 
* You may have to limit the max length of the string to 100 tokens with the tokenizer.
* Embedding all passages will take some time. If it is too slow, use a subset of the passages.
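
A minimal sketch of the embedding step, assuming the `transformers` DPR classes described on the model pages above. `embed_texts` and `load_dpr_context_encoder` are hypothetical helper names, and the batch size of 16 is an arbitrary choice. The question encoder can be loaded the same way with `DPRQuestionEncoder` and `DPRQuestionEncoderTokenizer`.

```python
import torch


def embed_texts(texts, tokenizer, encoder, batch_size=16, max_length=100):
    """Embed a list of strings; returns a (len(texts), dim) tensor.

    `texts` must be a list (not a set) so it can be sliced into batches.
    """
    encoder.eval()
    chunks = []
    with torch.no_grad():  # inference only, no gradients needed
        for start in range(0, len(texts), batch_size):
            batch = texts[start:start + batch_size]
            inputs = tokenizer(
                batch,
                padding=True,
                truncation=True,
                max_length=max_length,
                return_tensors="pt",
            )
            # pooler_output holds one vector per input text
            chunks.append(encoder(**inputs).pooler_output)
    return torch.cat(chunks, dim=0)


def load_dpr_context_encoder():
    """Load the context tokenizer and encoder (downloads weights on first call)."""
    from transformers import DPRContextEncoder, DPRContextEncoderTokenizer

    name = "facebook/dpr-ctx_encoder-single-nq-base"
    tokenizer = DPRContextEncoderTokenizer.from_pretrained(name)
    encoder = DPRContextEncoder.from_pretrained(name)
    return tokenizer, encoder


# Example usage (not run here because it downloads the model):
# ctx_tokenizer, ctx_encoder = load_dpr_context_encoder()
# documents = sorted(unique_documents)[:1000]  # subset, as a list
# doc_embeddings = embed_texts(documents, ctx_tokenizer, ctx_encoder)
```

Keeping the embeddings in a single tensor, with the list of texts in the same order, makes the later similarity computation a single matrix product.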

## Exercise 1.2 - Dense Retrieval using Embeddings

* Sample a question from the SQuAD dataset
* Using the embeddings, compute the similarity between the question and all documents. (Hint: this is a dot product of the query vector with each document vector.)
* Provide a list of the top 10 most likely candidate documents. (Hint: Torch has a `topk` function.)
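
The scoring and ranking steps above can be sketched as follows, assuming the document embeddings are stacked in a `(num_docs, dim)` tensor. The toy vectors and the `retrieve_top_k` name are illustrative only.

```python
import torch


def retrieve_top_k(query_vec, doc_matrix, k=10):
    """Rank documents by dot product with the query.

    query_vec: (dim,) tensor, doc_matrix: (num_docs, dim) tensor.
    Returns (indices, scores) of the k highest-scoring documents.
    """
    scores = doc_matrix @ query_vec        # one dot product per document
    k = min(k, scores.numel())             # guard against k > num_docs
    top = torch.topk(scores, k)
    return top.indices.tolist(), top.values.tolist()


# Toy example with three 2-dimensional "documents"
docs = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = torch.tensor([2.0, 1.0])
indices, scores = retrieve_top_k(query, docs, k=2)
print(indices, scores)
```

The returned indices refer to rows of `doc_matrix`, so the document texts need to be kept in a list with the same ordering.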

## Exercise 1.3 - Evaluation

* For the entire validation set (or a suitably large subset that will run on your laptop), perform retrieval and return the top-10 pages. 
* How can you speed this up?
* Compute the Recall@10 of the predicted pages
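
One way to compute the metric, under the assumption that each question has exactly one gold item (its page title or context), so that Recall@k reduces to the fraction of questions whose gold item appears among the top k predictions. The function name and the toy data are illustrative.

```python
def recall_at_k(ranked_lists, gold_items, k=10):
    """Fraction of queries whose gold item is within the top-k predictions.

    ranked_lists: list of ranked prediction lists, one per query.
    gold_items:   list with the single correct item for each query.
    """
    if len(ranked_lists) != len(gold_items):
        raise ValueError("need exactly one gold item per ranked list")
    if not gold_items:
        return 0.0
    hits = sum(
        1 for ranked, gold in zip(ranked_lists, gold_items) if gold in ranked[:k]
    )
    return hits / len(gold_items)


# Toy example: three queries, predictions already ranked best-first
predictions = [["A", "B", "C"], ["D", "E", "F"], ["G", "H", "I"]]
gold = ["B", "Z", "I"]
print(recall_at_k(predictions, gold, k=2))
print(recall_at_k(predictions, gold, k=3))
```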

## Exercise 1.4 - Training / Fine-tuning

* For a given question, sample 10 "negative" contexts from the dataset
* Estimate the loss of the prediction with the noise contrastive formulation (see Eq. 2 of the DPR paper): https://arxiv.org/abs/2004.04906
* Using a PyTorch-style training loop, fine-tune the model on a sample of instances
* See this tutorial for more info: https://pytorch.org/tutorials/beginner/introyt/trainingyt.html#the-training-loop
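
As a sketch of the loss only (not the full fine-tuning loop), the negative log-likelihood of the positive passage can be written as a softmax over dot-product scores. This assumes the question, positive, and negative embeddings have already been computed. `dpr_nll_loss` is a hypothetical name, and random vectors stand in for real encoder outputs.

```python
import torch
import torch.nn.functional as F


def dpr_nll_loss(question_vec, positive_vec, negative_vecs):
    """Negative log-likelihood of the positive passage.

    loss = -log( exp(q.p+) / (exp(q.p+) + sum_j exp(q.p-_j)) )

    question_vec: (dim,), positive_vec: (dim,), negative_vecs: (n, dim)
    """
    # Put the positive passage at index 0, followed by the negatives
    candidates = torch.cat([positive_vec.unsqueeze(0), negative_vecs], dim=0)
    scores = candidates @ question_vec          # (n + 1,) dot-product similarities
    return -F.log_softmax(scores, dim=0)[0]     # log-probability of the positive


# Random stand-ins for encoder outputs: 1 question, 1 positive, 10 negatives
torch.manual_seed(0)
q = torch.randn(8, requires_grad=True)
pos = torch.randn(8)
negs = torch.randn(10, 8)

loss = dpr_nll_loss(q, pos, negs)
loss.backward()                                 # gradients flow back to the question vector
print(loss.item(), q.grad.shape)
```

In a real training loop the vectors would come from the two encoders with gradients enabled, so that `loss.backward()` updates the encoder weights through the optimizer.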

## Exercise 2 - Sequence to Sequence Formulation

For simplicity, these experiments should use the t5-small model. Training and inference might be slow. 

* Model page: https://huggingface.co/t5-small
* Detailed information: https://huggingface.co/docs/transformers/model_doc/t5

**This exercise requires the sentencepiece library. If it is not installed, run `pip install sentencepiece`. You may need to restart the Jupyter runtime afterwards.**

In [10]:
!pip install sentencepiece

Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp38-cp38-macosx_10_9_x86_64.whl (1.2 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99


## Exercise 2.1 - Download T5 model and make simple inferences
The t5-small model is approx 225MB.

In [2]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

In [15]:
input_ids = tokenizer("Summarize: Where is Rome?", return_tensors="pt").input_ids
generated_tok = model.generate(input_ids=input_ids)
generated_str = tokenizer.batch_decode(generated_tok.tolist())

print(generated_tok)
print(generated_str)

tensor([[    0, 11068, 15793,    10,  3488,   229,  7332,    58,     1]])
['<pad> Zusammenfassen: Wo ist Rome?</s>']


Try some other strings. What do you notice about the behaviour of the model?

## Exercise 2.2 - Estimate the loss of a prediction made by the model

In [19]:
input_ids = tokenizer("Summarize: Where is Rome?", return_tensors="pt").input_ids
target_ids = tokenizer("Italy", return_tensors="pt").input_ids

output = model(input_ids=input_ids, labels=target_ids)

If we call the model directly with `labels` instead of calling `generate`, the output contains the loss and the logits, which score every vocabulary token at each target position.

In [23]:
print(output.logits.shape)
print(output.loss)

torch.Size([1, 2, 32128])
tensor(8.5941, grad_fn=<NllLossBackward0>)
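
The loss is a mean cross-entropy over the target tokens, and the second dimension of `output.logits` is the number of target tokens. As a sketch of what that means, the same kind of quantity can be recomputed from logits and labels. This assumes the usual convention that label positions set to -100 are ignored. The dummy tensors below stand in for real model outputs, and the label ids are placeholders, not the actual ids for "Italy".

```python
import torch
import torch.nn.functional as F


def mean_token_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over all target positions not equal to ignore_index.

    logits: (batch, target_len, vocab_size), labels: (batch, target_len)
    """
    vocab_size = logits.size(-1)
    return F.cross_entropy(
        logits.reshape(-1, vocab_size),   # flatten batch and time dimensions
        labels.reshape(-1),
        ignore_index=ignore_index,
    )


# Dummy stand-ins with the same shape as the output above: [1, 2, 32128]
logits = torch.zeros(1, 2, 32128)         # uniform scores over the vocabulary
labels = torch.tensor([[5, 1]])           # placeholder token ids
loss = mean_token_cross_entropy(logits, labels)
print(loss)                               # equals log(vocab_size) for uniform logits
```

With a real forward pass, `mean_token_cross_entropy(output.logits, target_ids)` should be close to `output.loss`.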


Estimate the average loss of predicting the Wikipedia page title for questions in the SQuAD dataset using the T5 model.

## Exercise 2.3 - Training the T5 model

Implement a training loop in PyTorch, following Ex 1.4.
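
A minimal sketch of one training epoch, assuming each batch is a dict of keyword tensors that the model accepts (including `labels`) and that the model output exposes a `.loss` attribute, as the T5 output above does. `train_one_epoch` is a hypothetical helper name, and the optimizer and learning rate in the commented usage are arbitrary choices.

```python
import torch


def train_one_epoch(model, batches, optimizer):
    """Run one pass over `batches`; returns the mean training loss."""
    model.train()
    total_loss, steps = 0.0, 0
    for batch in batches:
        optimizer.zero_grad()             # clear gradients from the previous step
        loss = model(**batch).loss        # forward pass; loss computed from labels
        loss.backward()                   # backpropagate
        optimizer.step()                  # update the weights
        total_loss += loss.item()
        steps += 1
    return total_loss / max(steps, 1)


# Example usage with T5 (not run here), for lists `questions` and `titles`:
# enc = tokenizer(questions, padding=True, truncation=True, return_tensors="pt")
# labels = tokenizer(titles, padding=True, return_tensors="pt").input_ids
# labels[labels == tokenizer.pad_token_id] = -100   # do not score padding
# batch = {"input_ids": enc.input_ids,
#          "attention_mask": enc.attention_mask,
#          "labels": labels}
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# print(train_one_epoch(model, [batch], optimizer))
```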

## Exercise 2.4* - Repeat Ex 2.3 using the HuggingFace Trainer (see Lab 1, Ex 2.4)