<a href="https://colab.research.google.com/github/larajakl/Computational-Linguistics/blob/main/tutorial4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial 4: Introduction to Computational Linguistics

This is the fourth tutorial with practical exercises for the lecture Introduction to Computational Linguistics in the winter semester 2024. Hands-on exercises are marked with 👋 ⚒ and questions are marked with ❓. Remember to first **store this notebook** in your Drive or GitHub.

## **Lesson 1: Hugging Face Tutorial**

Hugging Face is a platform that provides access to models, datasets, and metrics. It mostly provides implementations in PyTorch and TensorFlow.


In this tutorial, we will first focus on tokenizers and models. Afterwards, we will look into fine-tuning models. For further reading and to help your project, please check the [Hugging Face Documentation and Tutorials](https://huggingface.co/docs/transformers/index).

As a first step, we need to install the library `transformers` to load transformer models and the library `datasets` to access the Hugging Face datasets. The cell below also installs `evaluate`, `bertviz`, and `accelerate`, which are used later in this tutorial.

In [None]:
!pip install transformers
!pip install datasets
!pip install evaluate
!pip install bertviz transformers
!pip install accelerate --upgrade



When using Hugging Face, there are two very important objects to be initialized, a **tokenizer** and a **model**.



*   **Tokenizer**: converts strings into lists of vocabulary IDs, the input format required by the model
*   **Model**: takes the tokenized input, i.e., the vocabulary IDs, and can be trained to produce a prediction

We will start looking into the tokenizer first.

### Tokenizer

Pre-trained language models come with a tokenizer that processes their input. You can either access tokenizers through the class specific to the model, e.g. `DistilBertTokenizerFast` for DistilBERT, or use the `AutoTokenizer` class, which selects the matching (fast) tokenizer for the model name you pass.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")
print(tokenizer)

DistilBertTokenizerFast(name_or_path='distilbert/distilbert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}


We will now pass a sample sentence to the tokenizer, which maps it to a list of vocabulary IDs, and see how to access those IDs.

The resulting data type is called `BatchEncoding`, which holds the output of the tokenizer call and is derived from a Python dictionary.

**❓** Why did the tokenizer generate more IDs than there are words in the input sentence?

In [None]:
input_str = "Hugging Face is great!"
tokenized_inputs = tokenizer(input_str)

# Two ways to access:
print(tokenized_inputs.input_ids)
print(tokenized_inputs["input_ids"])

[101, 20164, 10932, 10289, 1110, 1632, 106, 102]
[101, 20164, 10932, 10289, 1110, 1632, 106, 102]


Tokenization is a process of several steps. First the string is split into tokens, which works slightly differently from the tokenization in Tutorial 2, and then the tokens are converted to IDs as shown below.

The tokens shown are so-called wordpieces, which are the result of the subword tokenization algorithm underlying this tokenizer. The algorithm is called WordPiece; it was developed at Google and used to pretrain BERT. The characters `##` are a WordPiece prefix marking wordpieces that continue a word rather than start it. See the following tutorial for [more information on WordPiece](https://huggingface.co/learn/nlp-course/chapter6/6?fw=pt).
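
How a word is split into wordpieces can be sketched in plain Python. The following is a minimal illustration of greedy longest-match-first lookup, assuming a tiny hypothetical vocabulary (`toy_vocab`) and a made-up helper name (`wordpiece_tokenize`); the real tokenizer uses a vocabulary of 28,996 entries and additional normalization steps.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into wordpieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first and shrink it until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # pieces inside a word carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen only to mirror the example sentence.
toy_vocab = {"Hu", "##gging", "Face", "is", "great", "!"}

print(wordpiece_tokenize("Hugging", toy_vocab))  # ['Hu', '##gging']
print(wordpiece_tokenize("Face", toy_vocab))     # ['Face']
print(wordpiece_tokenize("xyz", toy_vocab))      # ['[UNK]']
```

Because `Hugging` is not in the vocabulary as a whole, the lookup falls back to the longest known prefix `Hu` and continues with the remainder, which is why a single word can produce several IDs.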

In [None]:
input_tokens = tokenizer.tokenize(input_str)
print(f"Tokens of the input sequence: {input_tokens}")
input_ids = tokenizer.convert_tokens_to_ids(input_tokens)
print(f"IDs assigned to the input sequence: {input_ids}")

Tokens of the input sequence: ['Hu', '##gging', 'Face', 'is', 'great', '!']
IDs assigned to the input sequence: [20164, 10932, 10289, 1110, 1632, 106]


To convert existing IDs back to text, we can use the function `decode`.



In [None]:
decoded = tokenizer.decode(input_ids)
print(decoded)

Hugging Face is great!


👋 ⚒ Decode the original tokenized sequence `tokenized_inputs` in the code cell below. First, find out and print which datatype this variable is. Remember that the `decode` function only takes IDs as input.

In [None]:
# Decode the original variable tokenized_inputs
print(type(tokenized_inputs))
print(tokenized_inputs)  # a BatchEncoding behaves like a dictionary; the attention_mask is not needed yet

# decode() only takes IDs, so we pass the input_ids entry rather than the whole BatchEncoding
print(tokenizer.decode(tokenized_inputs["input_ids"]))

<class 'transformers.tokenization_utils_base.BatchEncoding'>
{'input_ids': [101, 20164, 10932, 10289, 1110, 1632, 106, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
[CLS] Hugging Face is great! [SEP]


So the mystery of more IDs than visible words in the input sequence is explained partly by the subword tokenization and partly by the fact that BERT-type models add the special token `[CLS]` at the beginning of a sequence and `[SEP]` at the end.
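
As a small sketch of this bookkeeping, the wordpiece IDs and the special-token IDs printed earlier can be combined by hand. The helper name `add_special_tokens` is made up for illustration and is not a library function.

```python
# IDs taken from the tokenizer printout above: [CLS] = 101, [SEP] = 102.
CLS_ID, SEP_ID = 101, 102


def add_special_tokens(token_ids):
    """Wrap a single sequence as [CLS] ... [SEP] (illustrative helper only)."""
    return [CLS_ID] + list(token_ids) + [SEP_ID]


# IDs of 'Hu', '##gging', 'Face', 'is', 'great', '!'
wordpiece_ids = [20164, 10932, 10289, 1110, 1632, 106]
full_ids = add_special_tokens(wordpiece_ids)
print(full_ids)       # [101, 20164, 10932, 10289, 1110, 1632, 106, 102]
print(len(full_ids))  # 8 IDs for a sentence of 4 words
```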

Another way to reach a similar result as with `decode` can be seen in the next code cell. The difference is that with this method the special tokens are also represented.

In [None]:
inputs = tokenizer._tokenizer.encode(input_str)  # _tokenizer is the underlying fast tokenizer; useful for inspecting the tokens themselves
special = inputs.tokens
print(special)

['[CLS]', 'Hu', '##gging', 'Face', 'is', 'great', '!', '[SEP]']


Converting the output of the tokenizer to a PyTorch tensor can be done easily by adding `return_tensors="pt"` to the tokenizer call. Compare the following output with the `BatchEncoding` printed in the exercise above.

In [None]:
model_inputs = tokenizer("Hugging Face is great!", return_tensors="pt")  # return_tensors="pt" returns PyTorch tensors instead of Python lists
print(model_inputs)

{'input_ids': tensor([[  101, 20164, 10932, 10289,  1110,  1632,   106,   102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}


We can also tokenize a number of sentences at once. To ease processing, it is common to make all sentences the same length by padding, that is, appending the `[PAD]` token to shorter sequences until they are as long as the longest one.

In [None]:
model_inputs = tokenizer(["Hugging Face is great!",  # a batch of three sentences tokenized in one call
                         "The quick brown fox jumps over the lazy dog.",
                         "We are learning to fine-tune models.",
                         ],
                         return_tensors="pt",
                         padding=True,
                         truncation=True)
print(model_inputs)
print(tokenizer.pad_token, tokenizer.pad_token_id)

# padding=True fills shorter sequences with the [PAD] ID (0) up to the longest sequence in the batch.
# truncation=True cuts sequences that exceed the model's maximum input length (512 here); the cut-off part is lost.

{'input_ids': tensor([[  101, 20164, 10932, 10289,  1110,  1632,   106,   102,     0,     0,
             0,     0],
        [  101,  1109,  3613,  3058, 17594, 15457,  1166,  1103, 16688,  3676,
           119,   102],
        [  101,  1284,  1132,  3776,  1106,  2503,   118,  9253,  3584,   119,
           102,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]])}
[PAD] 0
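
What padding and the attention mask do can be sketched in plain Python. This is a simplified illustration with a made-up helper (`pad_batch`) and made-up token IDs 7 and 8; in particular, the truncation here simply cuts the list, whereas the real tokenizer takes care to keep the final `[SEP]` token.

```python
PAD_ID = 0  # matches tokenizer.pad_token_id printed above


def pad_batch(sequences, max_length=None):
    """Right-pad ID lists to a common length and build the attention mask."""
    # Truncate first when a maximum length is given, then pad to the longest sequence left.
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    target = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = target - len(seq)
        input_ids.append(seq + [PAD_ID] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 0 marks padding positions
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 7, 8, 102], [101, 7, 102]])
print(batch["input_ids"])       # [[101, 7, 8, 102], [101, 7, 102, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The attention mask tells the model which positions hold real tokens (1) and which are padding to be ignored (0).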


In order to decode a whole set of sentences, we can use the function `batch_decode`.

In [None]:
print(tokenizer.batch_decode(model_inputs.input_ids))  #in our case we have a batch of 3 sentences
print()
print("Omitting special characters when decoding:")
print(tokenizer.batch_decode(model_inputs.input_ids, skip_special_tokens=True))

['[CLS] Hugging Face is great! [SEP]']

Omitting special characters when decoding:
['Hugging Face is great!']


### Model

Models are initialized in Hugging Face with code similar to that for tokenizers. There are model-specific classes and the `AutoModel` classes; the latter are preferable when comparing different models.

Hugging Face automatically sets up the architecture you need for a specific task when you specify the model class. For example, `AutoModelForSequenceClassification`, or the model-specific `DistilBertForSequenceClassification`, is used for sequence classification tasks such as sentiment analysis, whereas question answering has its own class, `AutoModelForQuestionAnswering`. For training a model on the masked language modeling task, you need to use other classes, such as `DistilBertForMaskedLM`. More details can be found [here](https://huggingface.co/learn/nlp-course/chapter2/2?fw=pt).


The main three types of models are:


*   Encoders, e.g. BERT
*   Decoders, e.g. GPT-2
*   Encoder-Decoders, e.g. BART or T5, which are sequence-to-sequence models used for tasks such as Machine Translation (MT) and summarization



In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('distilbert/distilbert-base-cased', num_labels=2)

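Loading this model prints a warning that some weights were newly initialized. This is expected because the sequence classification head of the model has not yet been trained.

A string must be tokenized before it can be passed to the model; the tokenizer output can then be unpacked directly into the model call. The output contains one logit per class, i.e., two values here, since we indicated `num_labels=2`.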

In [None]:
import torch

model_inputs = tokenizer(input_str, return_tensors="pt")

model_outputs = model(**model_inputs)

print(model_inputs)
print(model_outputs)

print("To convert the logits outputs to probabilities, we can use softmax again:", torch.softmax(model_outputs.logits, dim=1))

We get two output values even though this is a binary classification task. A binary classifier could also use a single output with a sigmoid, but with `num_labels=2` Hugging Face treats the task as single-label classification over two classes and calculates the loss with cross-entropy.

We can now use the model logits to calculate the loss, i.e., compare the model predictions with the intended label and quantify the difference with a suitable loss function. We will first do this using PyTorch.

For this example, we pretend that label 1 is the correct label and pass this information to a loss function. Note that softmax and sigmoid are not loss functions themselves: they turn logits into probabilities. The loss used here is cross-entropy, which applies a log-softmax to the logits internally, so we pass the raw logits.
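
To make the relation between logits, probabilities, and loss concrete, here is a plain-Python sketch of softmax and cross-entropy for a single example. The logits are made up for illustration; the function names are not part of PyTorch.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability that softmax assigns to the correct class."""
    return -math.log(softmax(logits)[label])


example_logits = [0.0, 0.0]  # made-up logits: the model has no preference between the classes
probs = softmax(example_logits)
print(probs)                             # [0.5, 0.5]
print(cross_entropy(example_logits, 1))  # 0.693..., i.e. ln(2)
```

An untrained classification head typically produces a loss near ln(2) ≈ 0.69 on a balanced two-class problem, because it assigns roughly equal probability to both classes.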

In [None]:
label = torch.tensor([1])
loss = torch.nn.functional.cross_entropy(model_outputs.logits, label)
print(loss)
loss.backward()

# You can get the parameters
list(model.named_parameters())[0]

We can also calculate the loss directly with Hugging Face.

In [None]:
model_inputs = tokenizer(input_str, return_tensors="pt")

labels = ['NEGATIVE', 'POSITIVE']
model_inputs['labels'] = torch.tensor([1])

model_outputs = model(**model_inputs)


print(model_outputs)
print()
print("Model predictions: ", labels[model_outputs.logits.argmax()])

To analyze in more detail what happens inside the model, we can visualize the attention weights and hidden states.

We can use BertViz to explicitly see the change of the attention weights for each layer.

In [None]:
from bertviz import model_view, head_view

# We need to initialize the model with setting output attentions explicitly to True
model = AutoModelForSequenceClassification.from_pretrained('distilbert/distilbert-base-cased', num_labels=2, output_attentions=True)
# We then need to explicitly encode the tokens
inputs = tokenizer.encode(input_str, return_tensors='pt')
outputs = model(inputs)

attention = outputs[-1]
tokens = tokenizer.convert_ids_to_tokens(inputs[0])
head_view(attention, tokens)

There is also another method to analyze hidden states and attention weights.

In [None]:
from transformers import AutoModel

model = AutoModel.from_pretrained("distilbert/distilbert-base-cased", output_attentions=True, output_hidden_states=True)
model.eval()

model_inputs = tokenizer(input_str, return_tensors="pt")
with torch.no_grad():
    model_output = model(**model_inputs)


print("Hidden state size (per layer):  ", model_output.hidden_states[0].shape)
print("Attention shape (per layer):    ", model_output.attentions[0].shape)    # (batch, head, query_token_idx, key_token_idx)
                                                                               # in the plots below, the y-axis is the query and the x-axis is the key
print(model_output)

In [None]:
from matplotlib import pyplot as plt

tokens = tokenizer.convert_ids_to_tokens(model_inputs.input_ids[0])
print(tokens)

n_layers = len(model_output.attentions)
n_heads = len(model_output.attentions[0][0])
n_tokens = len(tokens)  # number of wordpieces, including [CLS] and [SEP]
fig, axes = plt.subplots(n_layers, n_heads)  # one panel per (layer, head) pair
fig.set_size_inches(18.5*2, 10.5*2)
for layer in range(n_layers):
    for i in range(n_heads):
        axes[layer, i].imshow(model_output.attentions[layer][0, i])
        axes[layer, i].set_xticks(list(range(n_tokens)))
        axes[layer, i].set_xticklabels(labels=tokens, rotation="vertical")
        axes[layer, i].set_yticks(list(range(n_tokens)))
        axes[layer, i].set_yticklabels(labels=tokens)

        if layer == n_layers - 1:
            axes[layer, i].set(xlabel=f"head={i}")
        if i == 0:
            axes[layer, i].set(ylabel=f"layer={layer}")

plt.subplots_adjust(wspace=0.3)
plt.show()

## Fine-Tuning

For your final projects, you will need to fine-tune a pretrained language model.

In addition to models, Hugging Face also provides a large repository of datasets.

### **Loading the data**

For this example we are going to work with the `imdb` dataset, the Large Movie Review Dataset.

We load it with the `datasets` library and later convert it to PyTorch tensors.




In [None]:
from datasets import load_dataset, DatasetDict

imdb_dataset = load_dataset("imdb")


# Keep only the first 50 whitespace-separated words of each review for speed on CPU
def truncate(example):
    return {
        'text': " ".join(example['text'].split()[:50]),
        'label': example['label']
    }

# Take 128 shuffled examples for training and the next 32 for validation
small_imdb_dataset = DatasetDict(
    train=imdb_dataset['train'].shuffle(seed=24).select(range(128)).map(truncate),
    val=imdb_dataset['train'].shuffle(seed=24).select(range(128, 160)).map(truncate),
)

In [None]:
small_imdb_dataset

👋 ⚒ Print the first ten examples of the training dataset of the `small_imdb_dataset`.

In [None]:
# Your code here

### Loading the data for use in PyTorch

We need to prepare the dataset as input to the model by tokenization and padding. We also need to:

1. Remove the `text` column because the model does not accept raw text as an input:

    ```py
    >>> tokenized_datasets = tokenized_datasets.remove_columns(["text"])
    ```

2. Rename the `label` column to `labels` because the model expects the argument to be named `labels`:

    ```py
    >>> tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
    ```

3. Set the format of the dataset to return PyTorch tensors instead of lists:

    ```py
    >>> tokenized_datasets.set_format("torch")
    ```

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding=True, truncation=True)

small_tokenized_dataset = small_imdb_dataset.map(tokenize_function, batched=True, batch_size=16)
small_tokenized_dataset = small_tokenized_dataset.remove_columns(["text"])
small_tokenized_dataset = small_tokenized_dataset.rename_column("label", "labels")
small_tokenized_dataset.set_format("torch")

We can now check what the first two sequences of the tokenized training dataset look like.

In [None]:
small_tokenized_dataset['train'][0:2]

We then create a `DataLoader` for the training and validation datasets so we can iterate over batches of data:

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(small_tokenized_dataset['train'], batch_size=16)
eval_dataloader = DataLoader(small_tokenized_dataset['val'], batch_size=16)

### **Training**

Hugging Face models are subclasses of `torch.nn.Module`, like the models you built in Tutorial 3, which means backpropagation happens the same way and the same optimizers can be used. Hugging Face also includes optimizers and learning rate schedules, i.e., rules for changing the learning rate over the course of training, to train Transformer models.

With plain Stochastic Gradient Descent (SGD), the same learning rate is applied to every parameter and it does not adapt during training unless a schedule is added. Adam is an optimizer that extends SGD with per-parameter adaptive step sizes: it combines ideas from the Adaptive Gradient Algorithm (AdaGrad) and from Root Mean Square Propagation (RMSProp), which scales each update by the recent magnitude of the gradients. For more details on different optimizers, see for instance [this information](https://www.ruder.io/optimizing-gradient-descent/#adam).
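
The difference between the two update rules can be sketched for a single scalar parameter. This is a minimal illustration on the toy loss f(x) = x², using the commonly cited default values for Adam's `beta1`, `beta2`, and `eps`; the function names and the `state` dictionary are made up for this sketch and are not the PyTorch implementation.

```python
import math


def sgd_step(x, grad, lr):
    """Plain SGD: the step is the learning rate times the raw gradient."""
    return x - lr * grad


def adam_step(x, grad, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter."""
    state["t"] += 1
    # Running averages of the gradient and of the squared gradient.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    # Bias correction compensates for both averages starting at zero.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return x - lr * m_hat / (math.sqrt(v_hat) + eps)


x0 = 5.0
grad = 2 * x0  # gradient of the toy loss f(x) = x**2 at x0
x_sgd = sgd_step(x0, grad, lr=0.1)
x_adam = adam_step(x0, grad, {"t": 0, "m": 0.0, "v": 0.0}, lr=0.1)
print(x_sgd)   # 4.0: the step scales with the gradient (0.1 * 10)
print(x_adam)  # about 4.9: the first step is roughly the learning rate itself
```

SGD's step grows with the size of the gradient, while Adam normalizes the gradient by its recent magnitude, so its step size is governed mainly by the learning rate.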


### Training Loop with Hugging Face Trainer

A [Hugging Face tutorial](https://huggingface.co/docs/transformers/training) covering both the `Trainer` class and a native PyTorch training loop is available. Here we will use the `Trainer` class, which covers most needs.

Just to be sure that we have the right settings, we load the dataset again and tokenize it.

In [None]:
from datasets import load_dataset, DatasetDict
from transformers import DataCollatorWithPadding

# The IMDB dataset was already loaded above; reloading it here makes this cell self-contained
imdb_dataset = load_dataset("imdb")
# Make sure you have the right tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-cased")


# Keep only the first 100 whitespace-separated words of each review for speed on CPU
def truncate(example):
    return {
        'text': " ".join(example['text'].split()[:100]),
        'label': example['label']
    }

# Take 128 shuffled examples for training and the next 32 for validation
small_imdb_dataset = DatasetDict(
    train=imdb_dataset['train'].shuffle(seed=24).select(range(128)).map(truncate),
    val=imdb_dataset['train'].shuffle(seed=24).select(range(128, 160)).map(truncate),
)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding=True, truncation=True)

small_tokenized_dataset = small_imdb_dataset.map(tokenize_function, batched=True, batch_size=16)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

We can specify all the training hyperparameters with the `TrainingArguments` class, which is detailed [here](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).

The `Trainer` then performs the training, and you can pass all the arguments to this class, including model checkpoints to resume training later. The validation step that a manual training loop would contain is handled here by the `compute_metrics` function.
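
What an accuracy-based `compute_metrics` function calculates can be sketched without any libraries: take the index of the largest logit per example and compare it with the gold label. The logits and labels are made up, and the helper names are hypothetical.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring class equals the gold label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


toy_logits = [[0.2, 1.3], [2.0, -1.0], [0.1, 0.4], [0.9, 0.3]]  # made-up scores for 4 examples
toy_labels = [1, 0, 0, 0]
print(accuracy_from_logits(toy_logits, toy_labels))  # {'accuracy': 0.75}
```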

In [None]:
import numpy as np
import evaluate
from transformers import TrainingArguments, Trainer
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('distilbert/distilbert-base-cased', num_labels=2)
accuracy = evaluate.load("accuracy")

arguments = TrainingArguments(
    output_dir="sample_cl_trainer",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    logging_steps=8,
    num_train_epochs=5,
    eval_strategy="epoch", # run validation at the end of each epoch
    save_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    report_to='none',
    seed=224
)

def compute_metrics(eval_pred):
    """Called at the end of validation. Gives accuracy"""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # calculates the accuracy
    return accuracy.compute(predictions=predictions, references=labels)


trainer = Trainer(
    model=model,
    args=arguments,
    train_dataset=small_tokenized_dataset['train'],
    eval_dataset=small_tokenized_dataset['val'], # change to test when you do your final evaluation!
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

Once the training has been completed, evaluate the model, which is very easy with Hugging Face.

In [None]:
results = trainer.predict(small_tokenized_dataset['val'])
print(results)

Then, we want to load the fine-tuned model for evaluation. We can load our own models just like we load models from Hugging Face, except that we pass the path to our model to the function `from_pretrained()`. We had 40 training steps (5 epochs times 8, where 8 is the number of training examples (128) divided by the batch size (16)). Since `save_strategy="epoch"` saves one checkpoint per epoch, the checkpoints are named after the step at which each epoch ended, e.g. `checkpoint-40` for the last one.
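
The step count and the checkpoint folder names follow from the `TrainingArguments` used above (`num_train_epochs=5`, batch size 16) and the 128 training examples. The arithmetic can be sketched as follows, assuming a single device and no gradient accumulation.

```python
import math

num_examples = 128  # size of the small training set
batch_size = 16     # per_device_train_batch_size
num_epochs = 5      # num_train_epochs

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
# With save_strategy="epoch", one checkpoint folder is written at the end of each epoch.
checkpoint_dirs = [f"checkpoint-{steps_per_epoch * (e + 1)}" for e in range(num_epochs)]

print(steps_per_epoch, total_steps)  # 8 40
print(checkpoint_dirs)  # ['checkpoint-8', 'checkpoint-16', 'checkpoint-24', 'checkpoint-32', 'checkpoint-40']
```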

In [None]:
test_str = "I love this movie!"

fine_tuned_model = AutoModelForSequenceClassification.from_pretrained("sample_cl_trainer/checkpoint-40")
model_inputs = tokenizer(test_str, return_tensors="pt")
prediction = torch.argmax(fine_tuned_model(**model_inputs).logits)
print(["NEGATIVE", "POSITIVE"][prediction])

### Training Loop in PyTorch

We will now write the training loop in native PyTorch, since it is more transparent than the built-in Hugging Face functions.

In this example we are going to use the AdamW optimizer, an extension of Adam with decoupled weight decay. We also use a linear learning rate scheduler, which reduces the learning rate a little after each training step over the course of training.
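
The shape of such a linear schedule can be sketched in plain Python: an optional linear warmup followed by a linear decay to zero. The function `linear_lr` is a made-up helper for illustration, not the library scheduler; the 24 steps correspond to 3 epochs of 8 batches each.

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


total = 24  # e.g. 3 epochs x 8 batches
for step in (0, 12, 24):
    print(step, linear_lr(step, base_lr=5e-5, num_training_steps=total))
```

With no warmup, the learning rate starts at its full value, reaches half of it at the midpoint, and arrives at zero on the final step.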

Let's load our model, optimizer, and learning rate scheduler. For this we need two basic hyperparameters: the number of epochs we wish to train for and the number of training steps, which depends on the size of the training dataset. We again get the same warning as before about this model not having been trained on sequence classification.

To keep track of your training progress, use the [tqdm](https://tqdm.github.io/) library to add a progress bar over the number of training steps.

We also include a validation step that tests the current state of the model on the validation dataset and saves a version of the model called a `checkpoint`. To save the different model states as checkpoints, we first create a folder called `checkpoints`.

You might also want to consider early stopping, that is, interrupting the training loop when a certain threshold (value) has been reached. More information on that can be found [here](https://huggingface.co/docs/transformers/main_classes/callback#transformers.TrainerCallback).
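
One common variant of early stopping, sketched here in plain Python, stops once the validation loss has failed to improve for a fixed number of consecutive epochs. The `patience` parameter, the helper name, and the loss values are all assumptions made for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch index at which training would stop, or None if it never stops."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


made_up_losses = [0.70, 0.62, 0.65, 0.66, 0.60]
print(early_stopping_epoch(made_up_losses))  # 3: two epochs in a row without improvement
```

In this example training stops after epoch 3 and never sees the better loss at epoch 4, which is the trade-off a small patience value makes.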

In [None]:
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup
from torch.optim import AdamW  # the AdamW class in transformers is deprecated in favor of the PyTorch one

# IMDB sentiment is a binary task, so the classification head needs two labels
model = AutoModelForSequenceClassification.from_pretrained("distilbert/distilbert-base-cased", num_labels=2)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
lr_scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)

Lastly, you can specify to use a `GPU` if you have access to one, e.g. on Colab, or else use a `CPU`. Today, we will only train on the `CPU`, so there is no need to run the following cell.

In [None]:
import torch

#device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
#model.to(device)

In [None]:
!mkdir -p checkpoints

In [None]:
from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))
best_val_loss = float("inf")

for epoch in range(num_epochs):
    model.train()  # switch back to training mode after the previous validation pass
    for batch in train_dataloader:
        #batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # validation
    model.eval()
    val_loss = 0.0  # separate accumulator so the last training loss is not mixed in
    for batch in eval_dataloader:
        with torch.no_grad():
            output = model(**batch)
        val_loss += output.loss.item()

    avg_val_loss = val_loss / len(eval_dataloader)
    print(f"Validation loss: {avg_val_loss}")
    if avg_val_loss < best_val_loss:
        print("Saving checkpoint!")
        best_val_loss = avg_val_loss
        # save_pretrained writes a directory containing the weights and the config
        model.save_pretrained(f"checkpoints/epoch_{epoch}")