# Knowledge Intensive NLP Summer School

## Notebook 1

The goals of this notebook are:

* Get familiar with the Hugging Face Transformers library
* Train a BERT model for natural language inference (NLI)
* (Extension) Train a T5 model for question answering


## Resources
You can find help for the Hugging Face libraries on their website:

* BERT https://huggingface.co/docs/transformers/model_doc/bert
* T5 https://huggingface.co/docs/transformers/model_doc/t5
* Datasets https://huggingface.co/docs/datasets/index

## Tutorial

This notebook is based on the following tutorials:

* BERT https://github.com/huggingface/transformers/tree/main/examples/pytorch/text-classification
* Fine-tuning https://huggingface.co/docs/transformers/training
* Language Generation https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html




# Exercise 1 - Familiarization with BERT

In this exercise, we will download a pre-trained BERT model and use it to predict missing tokens.

## Exercise 1.1 - Tokenization

Download the tokenizer for the BERT model.

In [5]:
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

Tokenize a test string, then try some rarer words. What do you notice?

In [None]:
tok.tokenize("hello world this is a test")

In [None]:
tok.tokenize("i ate fettuccine alfredo")
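To help interpret the output of the cells above, here is a rough, library-free sketch of the greedy longest-match-first splitting that WordPiece-style tokenizers perform. The vocabulary `toy_vocab` and the function `wordpiece_tokenize` are made up for illustration; the real `bert-base-uncased` tokenizer has a vocabulary of about 30,000 entries and applies extra normalization that is not modelled here.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one lowercase word into subword pieces, longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, far smaller than the real one.
toy_vocab = {"hello", "world", "cream", "ski", "##mming", "conceptual", "##ly"}

for word in ["hello", "skimming", "conceptually", "zzz"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

Compare the pieces printed here with what `tok.tokenize` returns for the same words.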

**Question: Describe the effect of the bert-base-uncased tokenizer on long / rare words**

**Answer:** to be completed

The tokenizer can also prepare the input features for the model. Use the `tok.encode` function to encode the string from the previous example.

**Question: Describe the effect of the tok.encode function in the bert-base-uncased tokenizer. Do you have more tokens than the output of tok.tokenize()? Focus on the start/end symbols of the string.**

**Answer:** to be completed

**Question: What is the data type of the encoded string after using the tok.encode function?**

**Answer:** to be completed

We can convert the encoded token IDs back into a string using the `tok.decode` function. What is the effect of the `tok.decode` function?

In [None]:
# encode the string 
# decode the list of IDs

print(encoded)
print(decoded)

## Exercise 1.2 - Encoding with BERT Model

Download and load the pre-trained BERT model from Hugging Face. This may take a while (440MB download).

In [None]:
from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased")

**Question:** Use code to count how many parameters the BERT model has.

**Answer:** to be completed (hint: https://github.com/huggingface/transformers/issues/1479)
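As a cross-check for whatever number you obtain from the loaded model, the count can also be estimated by hand from the architecture. The sketch below assumes the published `bert-base-uncased` configuration (the constants are typed in here, not read from the model) and the standard layout of embeddings, 12 encoder layers, and a pooler. Treat it as a back-of-the-envelope estimate, not as the library's own method.

```python
# Assumed bert-base-uncased configuration values (not read from the model).
VOCAB_SIZE = 30522
HIDDEN = 768
NUM_LAYERS = 12
INTERMEDIATE = 3072
MAX_POSITIONS = 512
TYPE_VOCAB = 2


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


def layer_norm(size):
    """Parameters of a LayerNorm: one scale and one shift per feature."""
    return 2 * size


# Word, position and segment embeddings, followed by a LayerNorm.
embeddings = (VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + layer_norm(HIDDEN)
# Query, key, value and output projections, followed by a LayerNorm.
attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm(HIDDEN)
# Two dense layers (up and down projection), followed by a LayerNorm.
feed_forward = (
    linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN) + layer_norm(HIDDEN)
)
# Dense layer applied to the first token's hidden state.
pooler = linear(HIDDEN, HIDDEN)

estimated_total = embeddings + NUM_LAYERS * (attention + feed_forward) + pooler
print(f"{estimated_total:,}")
```

If the figure you compute from the loaded model differs, check which of these assumptions does not hold for the class you loaded.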

We can encode a string with the tokenizer and then pass the resulting IDs through the BERT model. This assumes that `encoded` was set in Exercise 1.1.

In [None]:
import torch
outputs = model(input_ids=torch.LongTensor([encoded]))
print(outputs)

**Question:** When calling the model, we made two changes to the encoded string. What is the effect of each? Why do we use square brackets, and what is the purpose of `torch.LongTensor`?

`torch.LongTensor([encoded])`

**Answer:** to be completed

In [None]:
# Hint: https://pytorch.org/docs/stable/tensors.html

print(torch.LongTensor([1,2,3,4]))
print(torch.LongTensor([[1,2,3,4]]))
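To see how nesting affects dimensions without involving PyTorch, the sketch below computes the shape of a nested Python list. The helper `nested_shape` is hypothetical and written only for this illustration; it assumes the list is rectangular.

```python
def nested_shape(x):
    """Return the shape of a rectangular nested list as a tuple."""
    shape = []
    while isinstance(x, list):
        shape.append(len(x))
        x = x[0] if x else None  # descend into the first element
    return tuple(shape)


print(nested_shape([1, 2, 3, 4]))    # one level of nesting
print(nested_shape([[1, 2, 3, 4]]))  # two levels of nesting
```

Compare these with the `.shape` of the two tensors printed in the cell above.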

We can understand what the outputs of the model are by using the `.shape` property of a torch tensor.

In [None]:
outputs.last_hidden_state.shape

**Question:** What do the dimensions of the 3D tensor generated by the model correspond to? Try changing the input string or using the tokenizer to encode multiple strings. (Hint: you may have to encode with `padding=True` enabled.)

**Answer:** to be completed

**Question:** How could this model output be used for (a) a masked language modelling task and (b) a sequence classification task?

**Answer:** to be completed
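The `padding=True` hint above exists because sequences of different lengths cannot be stacked into one rectangular tensor. A minimal sketch of what padding does, assuming a pad ID of 0 and the hypothetical helper name `pad_batch`: shorter sequences are extended to the longest length, and an attention mask records which positions hold real tokens.

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token IDs to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


# Two already-encoded sequences of different lengths (illustrative IDs).
batch = [[101, 4031, 1998, 10505, 102], [101, 6949, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

With the real tokenizer, calling it on a list of strings with `padding=True` returns comparable `input_ids` and `attention_mask` fields.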

## Exercise 1.3 - Language Modelling with BERT

Load a version of the model designed for masked language modelling.

In [None]:
from transformers import AutoModelForMaskedLM
language_model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

Encode the test string and make a prediction with the model.

In [None]:
test_string = "The capital city of Italy is [MASK]."

In [None]:
# Answer
...

In [None]:
print(outputs.logits)

Print the shape of `outputs.logits`, then answer the questions below.

**Question:** What token ID is the [MASK] token encoded as?

**Answer:** to be completed

**Question:** What do the dimensions of outputs.logits correspond to?

**Answer:** to be completed

Use `torch.argmax` to get the ID of the predicted token from the logits. Hint: you may have to set the `dim=` argument of `argmax` to return the right value. Then use the tokenizer's `decode` method to turn this ID into a string. Hint: you may have to convert the PyTorch tensor to a Python list using `.tolist()`, and pass only a single input to the decode function.

**Question:** What word does the model predict?

**Answer:** to be completed
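The argmax step described above can be pictured on a toy example before trying it on the real model. Everything in this sketch is hypothetical: the seven-entry vocabulary, the token IDs, and the logit values are invented, and the logits are a plain 2D list for a single sequence with no PyTorch tensor involved.

```python
# Hypothetical toy vocabulary; the position in the list is the token ID.
toy_vocab = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "rome", "paris", "is"]
mask_id = toy_vocab.index("[MASK]")

# Toy input: [CLS] is [MASK] [SEP]
input_ids = [1, 6, 3, 2]

# Invented scores: one row per input position, one column per vocabulary entry.
logits = [
    [0.1, 2.0, 0.0, 0.0, 0.3, 0.2, 0.1],
    [0.0, 0.1, 0.2, 0.0, 0.4, 0.3, 1.5],
    [0.0, 0.0, 0.1, 0.0, 2.2, 1.9, 0.3],
    [0.0, 0.1, 2.5, 0.0, 0.2, 0.1, 0.0],
]


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)


# Only the row at the masked position is relevant for the prediction.
mask_position = input_ids.index(mask_id)
predicted_id = argmax(logits[mask_position])
predicted_token = toy_vocab[predicted_id]
print(mask_position, predicted_id, predicted_token)
```

With the real model the same idea applies, except that the scores live in a tensor and the argmax is taken along the vocabulary dimension.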

# Exercise 2 - Fine-tuning BERT for NLI

## Exercise 2.1 - Loading NLI Dataset

We can load the NLI dataset from the Hugging Face Datasets library. This might take a while (313MB download).

In [8]:
from datasets import load_dataset
dataset = load_dataset("glue", "mnli")

We can look at an example from the training data. The dataset contains a list of instances.

In [9]:
dataset["train"][0]

{'premise': 'Conceptually cream skimming has two basic dimensions - product and geography.',
 'hypothesis': 'Product and geography are what make cream skimming work. ',
 'label': 1,
 'idx': 0}

**Question:** What does each of the fields in a dataset instance correspond to?

**Answer:** To be completed

For use in the model, we will have to tokenize the instances from the dataset. Use the tokenizer to tokenize the instance, putting both premise and hypothesis into a single list of IDs.

In [10]:
print(tok.tokenize(dataset['train'][0]['premise'],dataset['train'][0]['hypothesis']))
print(tok.encode(dataset['train'][0]['premise'],dataset['train'][0]['hypothesis']))

['conceptual', '##ly', 'cream', 'ski', '##mming', 'has', 'two', 'basic', 'dimensions', '-', 'product', 'and', 'geography', '.', 'product', 'and', 'geography', 'are', 'what', 'make', 'cream', 'ski', '##mming', 'work', '.']
[101, 17158, 2135, 6949, 8301, 25057, 2038, 2048, 3937, 9646, 1011, 4031, 1998, 10505, 1012, 102, 4031, 1998, 10505, 2024, 2054, 2191, 6949, 8301, 25057, 2147, 1012, 102]


**Question:** How many separator tokens were inserted by the tokenizer? Why?

**Answer:** To be completed
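One way to picture how a sentence pair is packed into a single sequence is to rebuild the ID list printed above by hand. The helper `build_pair` is hypothetical. The special-token IDs 101 and 102 are read off the output above, and the segment IDs follow the usual BERT convention of 0 for the first segment and 1 for the second.

```python
def build_pair(ids_a, ids_b, cls_id=101, sep_id=102):
    """Join two ID lists into one BERT-style pair input."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers the start token, the first text and its separator;
    # segment 1 covers the second text and the final separator.
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    return input_ids, token_type_ids


# Premise and hypothesis IDs taken from the encoded output above,
# with the special tokens removed.
premise_ids = [17158, 2135, 6949, 8301, 25057, 2038, 2048, 3937, 9646, 1011,
               4031, 1998, 10505, 1012]
hypothesis_ids = [4031, 1998, 10505, 2024, 2054, 2191, 6949, 8301, 25057,
                  2147, 1012]

pair_ids, segment_ids = build_pair(premise_ids, hypothesis_ids)
print(pair_ids)
print(segment_ids)
```

The first printed list should match the output of `tok.encode` above. Calling `tok(premise, hypothesis)` also returns a `token_type_ids` field to compare against the second list.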

## Exercise 2.2 - Tokenization

We will write a tokenization function and apply it to all instances in the dataset.

In [27]:
def tokenize_function(data):
    # What is the type of data?
    # Keys of the returned dictionary will be added to the dataset as columns
    # To be completed
    return tok(...., padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
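To build intuition for what `map(..., batched=True)` hands to your function, the sketch below imitates it with plain Python: the function receives a dictionary mapping column names to lists of values, and the keys it returns become new columns. `toy_batched_map` and `add_lengths` are hypothetical stand-ins, not part of the Datasets library, and the real implementation differs in many details.

```python
def toy_batched_map(columns, fn, batch_size=2):
    """Apply fn to slices of a column dictionary and collect new columns."""
    num_rows = len(next(iter(columns.values())))
    result = {name: list(values) for name, values in columns.items()}
    for start in range(0, num_rows, batch_size):
        # Each batch is a dict of equally long lists, one list per column.
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        # This toy version assumes fn only returns columns that do not exist yet.
        for name, values in fn(batch).items():
            result.setdefault(name, []).extend(values)
    return result


def add_lengths(batch):
    """Toy stand-in for a tokenization function: count words per premise."""
    return {"premise_length": [len(text.split()) for text in batch["premise"]]}


toy_columns = {
    "premise": ["a b c", "d e", "f"],
    "hypothesis": ["x", "y z", "w"],
    "label": [0, 1, 2],
}
mapped = toy_batched_map(toy_columns, add_lengths)
print(mapped["premise_length"])
```

In the real exercise, the returned dictionary comes from calling `tok` on the batch, so its keys (such as `input_ids`) become the added columns.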

From the tokenized data, we will make a small training set and a small validation set.

In [19]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
# GLUE MNLI has no "test" split; use the labelled matched validation split instead
small_eval_dataset = tokenized_datasets["validation_matched"].shuffle(seed=42).select(range(100))

We want to perform sequence classification with the BERT model. We can use the `AutoModelForSequenceClassification` class to instantiate a `bert-base-uncased` model. Remember that you will have to import this class, as we did for the previous models. Call the model `cls_model`. (Hint: MNLI has three classes, so you will likely need to pass `num_labels=3`.)

In [20]:
# To be completed
cls_model = ..... 

## Exercise 2.3 - Training

**Question:** Make a prediction with `cls_model` on the first instance of the dataset. You will notice a field in the model outputs called `loss`. If you only provide `input_ids`, what is the loss value? If you provide both `input_ids` and `labels`, it might differ. What does this loss correspond to?

**Answer:** To be completed
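When thinking about the loss, it can help to compute a classification loss by hand. The sketch below implements cross-entropy between one vector of logits and one gold label, which is the kind of loss typically used for single-label classification. The function name and the example logits are invented for illustration.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the gold label under softmax(logits)."""
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[label]


# Three classes, as in MNLI. The logit values are invented.
example_logits = [0.2, 1.5, -0.3]
print(cross_entropy(example_logits, 1))  # gold label has the largest logit
print(cross_entropy(example_logits, 2))  # gold label has the smallest logit
```

A lower value means the model assigns more probability to the gold label. Uniform logits over three classes give a loss of log 3.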

## Exercise 2.4 
Follow the Hugging Face tutorial to train the model for NLI (note: you can skip the steps for loading the model and preparing the data). Follow only the PyTorch version of the tutorial, not the Keras/TensorFlow version.

Report the accuracy on the 100 validation instances.

* https://huggingface.co/docs/transformers/training

In [1]:
from transformers import TrainingArguments, Trainer
import evaluate
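For the accuracy you are asked to report, a library-free sketch of the computation may be a useful reference point: take the highest-scoring class for each instance and compare it with the gold label. The function name and the toy values are made up. In the tutorial, the equivalent step is done by a metric from the `evaluate` library inside a `compute_metrics` function.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of instances whose highest-scoring class equals the label."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(pred == gold) for pred, gold in zip(predictions, labels))
    return correct / len(labels)


# Invented logits for four instances and three classes.
toy_logits = [
    [2.0, 1.0, 0.0],
    [0.0, 3.0, 1.0],
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 0.0],
]
toy_labels = [0, 1, 1, 0]
print(accuracy_from_logits(toy_logits, toy_labels))
```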



# Exercise 3* - Sequence to Sequence QA
* This will be covered in depth on Tuesday and Wednesday.
* If you finish early, try using the example code from Hugging Face to fine-tune a T5 transformer for question answering with SQuAD.
* Example code: https://github.com/huggingface/transformers/blob/v4.30.2/examples/pytorch/question-answering/trainer_seq2seq_qa.py