<a href="https://colab.research.google.com/github/marcon21/anlp-labs/blob/main/03_ANLP_Measuring_Quality.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Start by copying this into your Google Drive!!


# Advanced Natural Language Processing Course - Tutorial Measuring Quality
Author: Gijs Wijngaard




Version 2024-2025.1


---



Welcome to the tutorial on measuring quality.


The first step is to **enable the GPU**. A GPU (Graphics Processing Unit) performs vector and matrix computations much faster than a CPU like the one in your laptop. Since neural networks consist largely of matrix operations, we gain serious speed improvements by using GPUs.

We enable the GPU by clicking *Runtime* in the menu above, then *Change runtime type*, and selecting *GPU* in the dropdown menu under *Hardware accelerator*. Then click *Save*. If everything is correct, the code below should return *True*.
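
The check itself is not shown in this copy of the notebook. A minimal sketch of what it could look like, assuming PyTorch is the framework in use (as it is later in this tutorial); the helper name `gpu_available` is hypothetical:

```python
import importlib.util


def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA GPU."""
    # Guard the import so the check also runs where PyTorch is missing.
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return torch.cuda.is_available()


print(gpu_available())
```

On a Colab runtime with the GPU enabled this should print `True`; on a CPU-only runtime it prints `False`.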

---

## Measuring Output Quality of a Classification Model
We start by training a machine learning model. We train a Transformers model on GLUE, one of the most popular benchmarks in natural language processing. This benchmark and its successor SuperGLUE are widely used in NLP research to compare models with each other: they offer a common way to test whether a model performs well. You can find the benchmark [here](https://gluebenchmark.com/) and its successor [here](https://super.gluebenchmark.com/). We are going to use one of its datasets as the task for our model to train on. Let's install some packages first:

In [None]:
!pip install -qq transformers datasets

GLUE consists of nine tasks. Today, we will focus on only one of these, the Corpus of Linguistic Acceptability (CoLA). This dataset tests whether a model can recognize whether a sentence is acceptable English or contains grammatical mistakes. Let's import it:

In [None]:
from datasets import load_dataset
data = load_dataset("glue", "cola")

Let's display what our data looks like. This is an example of a correct sentence in our dataset (label = 1):

In [None]:
data["train"][0]

Below is an example of an incorrect sentence in our dataset (label = 0). You can't drink a pub, right? That is what the model has to recognize: can it find the sentences that are incorrect?

In [None]:
data["train"][18]

Let's now import our model. This is the first time we work with transformer models. Transformers is a library by HuggingFace. When working with transformer-based models, it's one of the most convenient tools you can have. It supports many different types of transformer models, and you can download pretrained models through the library to apply them to your own data. Normally, transformer models only work well when trained on large datasets. Pretrained models have already been trained on such datasets, so they do not need to be trained from scratch again. This makes it easy to apply state-of-the-art models to any problem you have.

Let's use the most standard transformer model, BERT. We can use BERT for a variety of tasks; this time we will use it for sequence classification. In a later lecture you will learn more about BERT and its applications. For our task, we want to know whether our input (which is a sequence) is a correct English sentence or not, which is a binary classification task.
We also use the `.to(device)` method to move the model to the GPU, which speeds things up. If you don't want to wait long when predicting outputs with transformers, make sure you are on a GPU in Colab.

You can ignore the warnings starting with `Some weights ...`; they only tell you that the classification head is newly initialized and still needs to be trained, which is what we do next.

In [None]:
from sklearn import metrics
from tqdm import tqdm
import numpy as np
import torch
from transformers import BertTokenizer, BertForSequenceClassification
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

Let's train our model first. We first define a dataloader, which lets us process the data in batches (`batch_size=16` in this case), so that multiple inputs are handled at once. This speeds up the computation even more. In each step, we first reset our gradients to zero. We then put the data through the tokenizer, so that we get numbers instead of text. Because we pad the sentences in a batch to the same length so that they fit through the model, the tokenizer also returns an `attention_mask`, which tells the transformer which positions are real tokens and which are padding. As model input, we feed the output of the tokenizer and the labels, so that the model can compute a loss that measures how far its predictions are from the labels. We then backpropagate and update the weights.

In [None]:
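losses = []
train_dataloader = torch.utils.data.DataLoader(data["train"], batch_size=16)
model.train()  # from_pretrained returns the model in eval mode; enable dropout for training
for item in tqdm(train_dataloader):
    model.zero_grad()  # reset gradients from the previous step
    # Tokenize the batch of sentences and pad them to the same length
    inputs = tokenizer(item["sentence"], padding=True, return_tensors="pt").to(device)
    # Passing labels makes the model return the classification loss
    result = model(**inputs, labels=item["label"].to(device))
    loss = result.loss
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
model.eval()  # switch back to eval mode (dropout off) for prediction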
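
To make the padding and `attention_mask` described above concrete, here is a small pure-Python sketch. It is not the tokenizer's actual implementation; the helper `pad_batch` and the token ids are hypothetical and only illustrate the idea of right-padding with a pad id of 0 and marking real tokens with 1.

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token ids to equal length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # real ids, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Hypothetical token ids, not real BERT vocabulary entries.
ids, mask = pad_batch([[5, 8, 9], [7]])
print(ids)   # [[5, 8, 9], [7, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

The model ignores positions where the mask is 0, so the padding does not influence the prediction.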

Let's now define a simple prediction loop. Remember: always train on a training set and test your model on a validation set.

We don't need to compute gradients, since we will not backpropagate (`torch.no_grad()`). We take the `argmax` of the logits to get the predicted class and move the result onto the CPU with `.cpu().numpy()`.

In [None]:
val_dataloader = torch.utils.data.DataLoader(data["validation"], batch_size=16)
results = []
for item in tqdm(val_dataloader):
    inputs = tokenizer(item["sentence"], padding=True, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    results.append(logits.argmax(dim=-1).cpu().numpy())
results = [result for result_array in results for result in result_array]

These are our true values, the actual labels of the validation set:

In [None]:
# Use a separate loop variable so the name `data` is not reused for two different things
val_labels = [example["label"] for example in data["validation"]]

We can again use the `accuracy_score` function from `scikit-learn`. This function computes the accuracy: the number of true positives and true negatives divided by the total number of predictions.

In [None]:
metrics.accuracy_score(val_labels, results)

We can also plot a confusion matrix with `scikit-learn`. Its values correspond to the four quadrants: true positives, true negatives, false positives and false negatives.

In [None]:
metrics.ConfusionMatrixDisplay.from_predictions(val_labels, results)

We can get the individual values using sklearn's `confusion_matrix`:

In [None]:
tn, fp, fn, tp = metrics.confusion_matrix(val_labels, results).ravel()
tn

In [None]:
fn
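
For binary labels, scikit-learn's confusion matrix puts the true classes in the rows and the predicted classes in the columns, which is why `.ravel()` yields the order `tn, fp, fn, tp`. The same four counts can be tallied by hand; the sketch below uses made-up toy labels and a hypothetical helper `confusion_counts`, not the validation set.

```python
def confusion_counts(y_true, y_pred):
    """Count (tn, fp, fn, tp) for binary labels, in the order sklearn's ravel() gives."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:  # truth == 1 and pred == 0
            fn += 1
    return tn, fp, fn, tp


# Toy labels for illustration only.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
print(confusion_counts(y_true, y_pred))  # (1, 1, 1, 3)
```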

### Exercise 1.1
> 1. Compute the following metrics by hand: $$Precision = \frac{TP}{TP+FP}\quad Recall = \frac{TP}{(TP+FN)} \quad Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \quad F1 = 2 * \frac{Precision * Recall}{Precision + Recall}$$ using the variables `tn`, `fp`, `fn`, `tp` above.
2. Why is `accuracy` not a good metric for this dataset?

In [None]:
# COMPUTE PRECISION, RECALL, ACCURACY AND F1 HERE

ANSWER HERE:

1.1.2:

### Exercise 1.2
> 1. When do we prefer precision?
2. When do we prefer recall?
3. Give examples of datasets you could encounter where you would prefer one over the other.

ANSWER HERE:

1.2.1:

1.2.2:

1.2.3:



### Exercise 1.3
> 1. Now also compute the $F_{0.5}$ and $F_2$ metrics.
2. How do these compare to the $F_1$ metric?
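
For reference, the general $F_\beta$ score is not spelled out above. The standard definition, writing $P$ for precision and $R$ for recall, is:

$$F_\beta = (1 + \beta^2) \cdot \frac{P \cdot R}{\beta^2 \cdot P + R}$$

Setting $\beta = 1$ recovers the $F_1$ formula from Exercise 1.1; other values of $\beta$ change the relative weight given to precision and recall.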

ANSWER HERE

In [None]:
# COMPUTE F0.5 and F2 metrics here

1.3.2:

### Exercise 1.4
> Instead of using `accuracy`, the GLUE benchmark uses a different metric for this dataset, the Matthews correlation coefficient (also known as the Phi coefficient). $$MCC = \frac{TP \times TN - FP\times FN}{\sqrt{(TP+FP)\times(TP+FN)\times(TN+FP)\times(TN+FN)}}$$ When computing this coefficient, we should get a value between -1 and 1, where 1 is a perfect prediction, 0 a random prediction and -1 an inverse prediction. Now also compute the MCC:

In [None]:
# COMPUTE MCC HERE


## Using the Model's Probabilities
We have now computed some metrics using the predictions of the model: whether the model thinks a sentence is correct or incorrect. To get these predictions, we took the `argmax()` of the logits. Let's look one more time at what we did.

In [None]:
inputs = tokenizer(next(iter(val_dataloader))["sentence"], padding=True, return_tensors="pt").to(device)
with torch.no_grad():
    logits = model(**inputs).logits
logits = logits.cpu().detach()

For every pair of logits, we took the index of the highest value (`argmax()`). We can also take the softmax instead, which converts the values so that they sum to 1 per prediction:

In [None]:
logits.softmax(dim=-1)

As you can see, we now have a probability per prediction that expresses how sure the model is: the first value is the probability that the sentence is incorrect, the second that it is correct. Let's now take only the last value of each prediction. We only need the probability that a sentence belongs to class 1: if it is higher than 50%, the sentence is assigned to class 1; if lower, to class 0.

In [None]:
percentage = logits.softmax(dim=-1)[:, -1]

In [None]:
percentage
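
The softmax step used above can also be written out by hand. The sketch below uses only the standard library and hypothetical logit values; it is meant to show the arithmetic, not to replace `logits.softmax()`.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one sentence: [incorrect, correct]
probs = softmax([-1.2, 2.3])
print(probs, sum(probs))
```

The larger logit always gets the larger probability, so taking the `argmax` of the probabilities gives the same class as taking the `argmax` of the logits.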

Let's collect the probabilities for the whole validation dataset and plot the precision-recall curve:

In [None]:
percentages = []
for item in tqdm(val_dataloader):
    inputs = tokenizer(item["sentence"], padding=True, return_tensors="pt").to(device)
    with torch.no_grad():
        logits = model(**inputs).logits
    percentages.append(logits.softmax(dim=-1)[:, -1].cpu().numpy())
percentages = np.array([result for result_array in percentages for result in result_array])

In [None]:
from sklearn.metrics import PrecisionRecallDisplay
import matplotlib.pyplot as plt
PrecisionRecallDisplay.from_predictions(val_labels, percentages)

Let's go over how this graph is calculated. At each step in the graph, we change the probability boundary to get a different precision and recall value (the same can be applied to TPR and FPR in ROC/AUC graphs). For example, normally we would just take 50% as the boundary: all values of 0.5 and above are counted as 1, all values below 0.5 as 0. So these indices are predicted as incorrect (0):

In [None]:
np.where(percentages < 0.5)

In [None]:
indices = np.where(percentages < 0.5, 0, 1)
print(metrics.precision_score(val_labels, indices))
print(metrics.recall_score(val_labels, indices))

Notice that this decision boundary is set to 0.5 in `np.where()`. We can change it to 0.9, 0.4 or anything else; below we use 0.8.

In [None]:
indices = np.where(percentages < 0.8, 0, 1)
print(metrics.precision_score(val_labels, indices))
print(metrics.recall_score(val_labels, indices))

With the default 0.5 decision boundary, we get a somewhat high `precision`, but our `recall` is still low. Sometimes we want one over the other, so it makes sense to change the decision boundary. Say you want the model to be absolutely sure about its predictions: you could then accept only probabilities higher than 0.9 as 1 and treat the rest as 0.
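
The effect of moving the boundary can be seen on a handful of made-up scores. The sketch below is pure Python with a hypothetical helper `precision_recall_at`; it follows the same convention as `np.where(percentages < threshold, 0, 1)`, so a score equal to the threshold counts as 1.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are predicted as class 1."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # no positive predictions: report 0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels and scores for illustration only.
y_true = [1, 1, 1, 0, 0, 1]
scores = [0.95, 0.60, 0.85, 0.70, 0.20, 0.55]
print(precision_recall_at(y_true, scores, 0.5))  # (0.8, 1.0)
print(precision_recall_at(y_true, scores, 0.8))  # (1.0, 0.5)
```

In this toy example the stricter boundary removes the one false positive but also drops two true positives, so precision rises while recall falls.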

Let's compute the values for this precision-recall curve. We can also get the thresholds for this curve, so we know at which values the precision and recall were calculated (remember, the threshold is our boundary). The number of threshold values is equal to the number of unique probabilities we have calculated from our dataset, i.e. `len(np.unique(percentages))`. These are the same thresholds that were used by default to draw the graph above:

In [None]:
precision, recall, thresholds = metrics.precision_recall_curve(val_labels, percentages)
thresholds

### Exercise 2: 11-points Precision Recall
Let's make an averaged 11-point precision-recall graph (similar to the precision-recall graph above, but now with only 11 points), as explained in the lecture, and plot it below. For a refresher, [check here](https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html). You can use the [precision](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html) and [recall](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html) functions from scikit-learn.
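
As a reminder of the definition used in the linked chapter, the interpolated precision at a recall level $r$ is the highest precision found at any recall level $r' \geq r$:

$$p_{interp}(r) = \max_{r' \geq r} p(r')$$

The 11 points are the values of $p_{interp}(r)$ at the recall levels $r = 0.0, 0.1, \ldots, 1.0$.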

In [None]:
# ANSWER HERE

## Metrics in Machine Translation

We now continue with machine translation models. We again work with pretrained `transformers` models. This time we use a translation dataset that consists of book texts aligned in two languages. We ask the model to translate from Dutch to English, and then we test how well the model performs. Don't worry if you don't know Dutch! You do not need to.

Let's load the dataset first and split it into training and test sets.

In [None]:
from datasets import load_dataset

dataset = load_dataset("opus_books", "en-nl")
dataset = dataset["train"].train_test_split(test_size=0.2)

Let's check our dataset.

In [None]:
dataset["train"][0]["translation"]

Alright, let's import an encoder-decoder model. In a later Colab in Course 8 of Advanced Natural Language Processing you will also work on machine translation; for now we just want you to focus on computing the metrics for these models, namely BLEU and METEOR.

Below we define an encoder-decoder model, `t5-small`. We preprocess the input so that it starts with a prompt, which steers the model towards the task we want it to solve. We tokenize the input, truncate texts that are longer than 256 tokens, and process the examples in batches.

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small").to(device)
prefix = "translate Dutch to English: "
def preprocess_function(examples):
    inputs = [prefix + example["nl"] for example in examples["translation"]]
    targets = [example["en"] for example in examples["translation"]]
    return tokenizer(inputs, text_target=targets, max_length=256, truncation=True)
train_data = dataset["train"].map(preprocess_function, batched=True, remove_columns=["id", "translation"])
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

We put the data in a dataloader, which allows for faster processing, and define an optimizer to train our model. Training for 1 epoch with the following code takes about 10 minutes. Please make sure you are on a GPU (check the instructions at the beginning of this notebook), or it will take longer!

In [None]:
train_dataloader = torch.utils.data.DataLoader(train_data, batch_size=32, collate_fn=data_collator)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

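model.train()  # from_pretrained returns the model in eval mode; enable dropout for training
for epoch in range(1):
    for item in tqdm(train_dataloader):
        model.zero_grad()
        # Call the model directly rather than model.forward() so that hooks run as usual
        loss = model(**item.to(device)).loss
        loss.backward()
        optimizer.step()
model.eval()  # switch back to eval mode (dropout off) before generating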

We now take the test dataset and do exactly the same. This time we ask the model to generate the English string it thinks is the translation of the Dutch string.

In [None]:
def preprocess_test_function(examples):
    return tokenizer([prefix + example["nl"] for example in examples["translation"]], max_length=256, truncation=True)
test_data = dataset["test"].map(preprocess_test_function, batched=True, remove_columns=["id", "translation"])
test_dataloader = torch.utils.data.DataLoader(test_data, batch_size=32, collate_fn=data_collator)

We let the model generate the English strings and then decode the output tokens back to strings with `batch_decode()`. In this decoding step we skip the special tokens that are only needed for generation.

In [None]:
outputs = []
for item in tqdm(test_dataloader):
    output = model.generate(**item.to(device))
    outputs.append(output)
translation_results = tokenizer.batch_decode([x for y in outputs for x in y], skip_special_tokens=True)

Let's see what the results look like!

In [None]:
translation_results[:5]

And now let's see what the original sentences look like (the ones the model should come close to when predicting):

In [None]:
# Use a separate loop variable so the earlier `data` variable is not shadowed
references = [example["translation"]["en"] for example in dataset["test"]]
references[:5]

Hmm, it looks like it's far off, but at least some words are there. Now, how do we know for sure whether the model is performing well or not? We need a metric!

## Bilingual Evaluation Understudy (BLEU)
BLEU is a good way to test how well these models perform. Normally you would just import a BLEU metric from a package such as `NLTK` or `torchtext` and calculate the score, but we are going to do it by hand (fun!). Let's implement $BLEU_1$, the BLEU score for n-grams with n = 1: it looks only at unigrams (single words), not at bigrams or trigrams. There is no need to include the brevity penalty or other more complex variations on the default BLEU-1.

Our candidate sentences are defined in `translation_results`. Our reference sentences are defined in `references`.

We iterate over the candidates and references, and on each iteration compute the n-grams for both the candidate and the reference. We do this by counting the words/tokens in each sentence.


### Exercise 3: BLEU
> 1. Implement the BLEU score yourself, using the formulas from [Wikipedia](https://en.wikipedia.org/wiki/BLEU) or the slides.
2. Apply the BLEU score to one of the references and candidates to see if your implementation works.
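
As a reminder, and assuming a single reference per candidate as in this dataset, the modified (clipped) unigram precision from the linked Wikipedia article can be written as follows, where $C$ is the candidate, $R$ the reference, and the sums run over the distinct words $w$ in the candidate:

$$p_1 = \frac{\sum_{w \in C} \min\big(\text{count}_C(w),\ \text{count}_R(w)\big)}{\sum_{w \in C} \text{count}_C(w)}$$

For the candidate "the the the the the the the" and the reference "the cat is on the mat", the word "the" occurs 7 times in the candidate but only 2 times in the reference, so $p_1 = 2/7$.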


In [None]:
# WRITE BLEU SCORE HERE

## Language Model Evaluation

We can also measure how well our model performs by calculating the perplexity. An encoder-decoder model is trained with a cross-entropy loss, and the perplexity is the exponential of that loss. Here `loss` is the loss of the last training batch, so we just do:

In [None]:
perplexity = torch.exp(loss)
perplexity.item()
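
The relationship between cross-entropy and perplexity can be checked on a toy example. The sketch below uses only the standard library and made-up token probabilities; the function name `perplexity_from_probs` is hypothetical. It takes the probability the model assigned to each correct token, averages the negative log-likelihoods (the cross-entropy), and exponentiates.

```python
import math


def perplexity_from_probs(token_probs):
    """exp of the mean negative log-likelihood of the correct tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that always spreads its belief evenly over 4 tokens
# has a perplexity of (approximately, due to floating point) 4.
print(perplexity_from_probs([0.25, 0.25, 0.25, 0.25]))
# More confident correct predictions give a lower perplexity.
print(perplexity_from_probs([0.9, 0.8, 0.95]))
```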

### Exercise 4: Perplexity
> 1. Research what perplexity is, and whether our score above is good or not.
2. Are there other metrics or ways to evaluate language models?

ANSWER HERE:

4.1:


4.2:




# Submission
Please submit your Colab notebook: click *File* in the top-left corner, choose *Download* and then *Download .ipynb*, and upload that file to Canvas.