# Lab Instructions

In each lab, you're presented with a task, such as building a dataset, training a model, or writing a training loop. The code is structured so that you can fill in the blanks using what you learned in the chapters that precede the lab. You should be able to find appropriate snippets of code in the course content that work in the lab with minor or no adjustments.

The blanks in the code are indicated by an ellipsis (`...`) and a comment (`# write your code here`).

In some cases, we'll provide you partial code to ensure the right variables are populated and any code that follows it runs accordingly.

```python
# write your code here
x = ...
```

The solution should be a single statement that replaces the ellipsis, such as:

```python
# write your code here
x = [0, 1, 2]
```

In other cases, when no new variable is being created, the blanks are shown as in the example below:

```python
# write your code here
...
```

Although we're showing you only a single ellipsis (`...`), you may have to write more than one line of code to complete the step, such as:

```python
# write your code here
for i, xi in enumerate(x):
    x[i] = xi * 2
```

## Installation Notes

To run this notebook on Google Colab, you will need to install the `datasets` library.

In Google Colab, you can run the following command to install it:

In [None]:
!pip install datasets

## 8.5 Lab 4: Sentiment Analysis

In this lab, you'll fine-tune an encoder-based model to perform sentiment analysis on the Stanford Sentiment Treebank (SST2) dataset. You'll load RoBERTa's sibling, XLM-RoBERTa, use its prescribed transformations to preprocess text in the SST2 dataset, and fine-tune (train) it for one epoch.

### 8.5.1 Model

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step1.png)

You'll use Hugging Face's `XLMRobertaForSequenceClassification` to perform binary classification (we have two classes, "positive" and "negative" sentiment).

In [None]:
from transformers import XLMRobertaForSequenceClassification
repo_id = "FacebookAI/xlm-roberta-base"

# write your code here
model = ...
model

### 8.5.2 Dataset

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step1.png)

Now, you will load Hugging Face's ["Stanford Sentiment Treebank (SST2)"](https://huggingface.co/datasets/stanfordnlp/sst2) dataset. It is already split into `train`, `validation`, and `test` sets.

In [None]:
from datasets import load_dataset

# write your code here
datasets = ...
datasets

Let's take a look at one data point from the SST2 dataset. Just run the code below as is to visualize the output:

In [None]:
row = datasets['train'][0]
text, label = row['sentence'], row['label']
text, label

Each data point is a dictionary containing a line of text and the corresponding label, that is, the sentiment (0 for negative, 1 for positive).

### 8.5.3 Tokenizer

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step3.png)

You already know the drill: you must preprocess the input (the text) using the prescribed transformation for the model you're using, so it gets tokenized, converted into token ids, and prepended/appended with the appropriate special tokens.

Load XLM-RoBERTa's tokenizer and write a function that takes a dictionary with the `sentence` key (it may have other keys as well) and returns a dictionary with the `input_ids` and `attention_mask` keys (remember that the `map()` method of HF datasets works by _merging_ dictionaries):

In [None]:
from transformers import XLMRobertaTokenizer

# write your code here
tokenizer = ...

In [None]:
def apply_transform(row):
    text = row['sentence']
    # Use the tokenizer you loaded in the previous cell to
    # preprocess the text
    # write your code here
    ...

Let's apply your function to our data point to see if it is working as expected (just run the code below as is to visualize the output):

In [None]:
apply_transform(row)
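
To make the "merging" behavior concrete, here is a minimal pure-Python sketch of what happens to a single row. It is only an illustration: `toy_transform` and `merge_like_map` are hypothetical names, and the fake "token ids" are just word lengths, not the output of a real tokenizer.

```python
# Hypothetical stand-in for a tokenizing function: it returns only the NEW keys.
def toy_transform(row):
    words = row["sentence"].split()
    return {
        "input_ids": [len(word) for word in words],  # fake ids (word lengths)
        "attention_mask": [1] * len(words),
    }

def merge_like_map(row, fn):
    # Keys returned by fn are added to (or overwrite) the keys already in the row
    return {**row, **fn(row)}

row_example = {"sentence": "a fine movie", "label": 1}
merged = merge_like_map(row_example, toy_transform)
print(merged)
```

The original `sentence` and `label` keys survive the transformation, which is why the function only needs to return the keys it creates.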

Now, apply the function to every row in our datasets:

In [None]:
# write your code here
datasets = ...
datasets

To keep our datasets tidy, select only the columns we're interested in (`input_ids` and `label`):

In [None]:
# write your code here
datasets = ...
datasets['train'][:3]

Did you notice the transformation is returning a regular Python list of token ids, not a PyTorch tensor? Remember, we cannot make a tensor out of lists of different lengths (see section 6.3.3). The solution? Padding the shorter sentences, so they all have the same length.

But how can we think of padding sentences if we don't have a mini-batch yet? We delegate this job to the data loader's collate function!

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step5.png)

So far, we've been using data loaders without specifying a collate function, that is, we're using its default collate function. For tabular data, the default collator is more than enough. It simply stacks several data points together and, since they all have the same size, it works smoothly. But this strategy breaks apart when we're dealing with sequences of different lengths, as we've already experienced while trying to make a tensor out of them.

Just like before, padding is the solution for our problem, and we're using a [collator](https://huggingface.co/learn/nlp-course/en/chapter3/2#dynamic-padding) designed to automatically pad the sequences before stacking them together: HF's `DataCollatorWithPadding`. It takes the tokenizer as an argument in order to determine which token is the padding token, and which side (left or right) should be padded.

Let's try it on a slice of four sequences from our training set:

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
data_collator(datasets['train'][:4])

You can easily recognize the padding tokens sitting at the right end of the sequences (a sequence of ones). Moreover, the location of every padding token is indicated by the sequence's corresponding attention mask. The masks tell the model which tokens should be considered (value of one) or ignored (value of zero).
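
Conceptually, right-padding with an attention mask can be sketched in a few lines of plain Python. This is not how `DataCollatorWithPadding` is implemented (it also returns tensors and handles other keys); `pad_batch` is a hypothetical helper, and the padding id of 1 is taken from the sequence of ones observed above.

```python
PAD_ID = 1  # assumption: the padding token id seen in the collator's output

def pad_batch(sequences, pad_id=PAD_ID):
    # Every sequence is padded on the right to the length of the longest one
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 = real token (considered), 0 = padding (ignored)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids for three sequences of different lengths
toy_batch = pad_batch([[0, 11, 12, 2], [0, 21, 2], [0, 31, 32, 33, 34, 2]])
for ids, mask in zip(toy_batch["input_ids"], toy_batch["attention_mask"]):
    print(ids, mask)
```

Once every row has the same length, the nested lists can be safely stacked into a single tensor.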

Next, let's assign this data collator to each dataloader:

In [None]:
from torch.utils.data import DataLoader

dataloaders = {}
# write your code here
dataloaders['train'] = ...
dataloaders['val'] = ...

Now, let's fetch a mini-batch from our data loader (just run the code below as is to visualize the output):

In [None]:
dl_out = next(iter(dataloaders['train']))
dl_out

As you can see, there are plenty of padding tokens there. The collator will always pad the sequences to match the longest sequence in a particular mini-batch. This means the sequence length may differ from one mini-batch to the next, but never inside the same mini-batch.
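
The sketch below illustrates the effect of padding per mini-batch using made-up sequence lengths. The helper names (`batch_widths`, `padding_tokens`) and the numbers are hypothetical; the point is only that each mini-batch gets its own width, and that this usually requires fewer padding tokens than padding everything to the longest sequence in the dataset.

```python
def batch_widths(lengths, batch_size):
    # Width of each padded mini-batch = length of its longest sequence
    return [max(lengths[i:i + batch_size])
            for i in range(0, len(lengths), batch_size)]

def padding_tokens(lengths, batch_size):
    # Total number of padding tokens added when padding per mini-batch
    total = 0
    for i in range(0, len(lengths), batch_size):
        chunk = lengths[i:i + batch_size]
        total += sum(max(chunk) - n for n in chunk)
    return total

lengths = [5, 9, 7, 30, 12, 6, 8, 10]  # hypothetical token counts
widths = batch_widths(lengths, batch_size=4)
dynamic_pad = padding_tokens(lengths, batch_size=4)
# A single batch holding everything is equivalent to padding to the global maximum
global_pad = padding_tokens(lengths, batch_size=len(lengths))
print(widths, dynamic_pad, global_pad)
```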

### 8.5.4 Training

Now, it is time to write a training loop to fine-tune your XLM-RoBERTa model on the SST2 dataset. This is a large model, and the training set has over 67,000 data points, so you can train it over a single epoch, that is, looping over the mini-batches from the data loader only once. For the sake of speed, keep the evaluation for the end only.

#### 8.5.4.1 Loss Function

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step2.png)

Sentiment analysis is a classification task, so we need to use the appropriate loss function for the task. Even though it is a binary classification, RoBERTa's classification head is actually producing two logits instead of one, so you have to use `CrossEntropyLoss` (which can handle two or more logits).
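
To see what a cross-entropy loss over two logits amounts to, here is a small pure-Python sketch for a single data point. It mirrors the quantity `CrossEntropyLoss` computes with default settings (the negative log of the softmax probability of the true class), but the function name and the logit values are made up for illustration.

```python
import math

def cross_entropy(logits, target):
    # -log(softmax(logits)[target]), using log-sum-exp for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

toy_logits = [-1.2, 2.3]  # hypothetical logits for (negative, positive)
loss_if_positive = cross_entropy(toy_logits, target=1)  # confident and right
loss_if_negative = cross_entropy(toy_logits, target=0)  # confident and wrong
print(loss_if_positive, loss_if_negative)
```

With two classes, only the difference between the logits matters, which is why this is equivalent to a binary cross-entropy computed on a single logit equal to that difference.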

In [None]:
import torch.nn as nn

loss_fn = ...

This step is actually redundant now. Since we're using a HF model, the loss is automatically computed and returned whenever labels are passed to the model in the forward pass. We simply retrieve the loss from the output's `loss` attribute.

#### 8.5.4.2 Optimizer

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step3.png)

Although `Adam` is the optimizer of choice, we suggest you try out `AdamW`, a modified version that is also commonly used.

In [None]:
import torch.optim as optim

# suggested learning rate
lr = 1e-5

optimizer = ...

#### 8.5.4.3 Training Loop

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step4.png)

So far, we haven't logged or inspected our losses in real time. Why bother, if it takes only a minute to train the model? This time is different, though: fine-tuning XLM-RoBERTa on more than 67,000 data points, even for a single epoch, will take about 15 minutes or so in Google Colab. So, let's use TensorBoard to see how our loss is doing as training progresses.

First, we need to load it using the corresponding Jupyter magic (just run the code below as is to load TensorBoard):

In [None]:
%load_ext tensorboard
%tensorboard --logdir runs

Next, we need to create an instance of the `SummaryWriter` to be able to send loss values to TensorBoard. Just run the code below as is to create it:

In [None]:
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter('runs/roberta')

Now, it's your turn to write the missing parts of the training loop below. We have already taken care of sending the losses to TensorBoard for you.

In [None]:
import torch
from tqdm import tqdm

device = 'cuda' if torch.cuda.is_available() else 'cpu'

model.to(device)

batch_losses = []

## Training
for i, batch in enumerate(tqdm(dataloaders['train'])):
    # Set the model's mode
    # write your code here
    ...
    
    # Send input_ids, labels, and attention masks to the device
    # write your code here
    ...
    
    # Step 1 - forward pass
    # write your code here
    output = ...
    predictions = output.logits

    # Step 2 - computing the loss
    loss = output.loss

    # Step 3 - computing the gradients
    # Tip: it requires a single method call to backpropagate gradients
    # write your code here
    ...

    batch_losses.append(loss.item())
    
    writer.add_scalars(main_tag='loss',
                       tag_scalar_dict={'training': loss.item()},
                       global_step=i)    

    # Step 4 - updating parameters and zeroing gradients
    # Tip: it takes two calls to optimizer's methods
    # write your code here
    ...


writer.close()

## Validation   
with torch.inference_mode():
    val_losses = []

    for i, val in enumerate(dataloaders['val']):
        # Set the model's mode
        # write your code here
        ...

        # Send input_ids, labels, and attention masks to the device
        # write your code here
        ...

        # Step 1 - forward pass
        # write your code here
        output = ...
        predictions = output.logits

        # Step 2 - computing the loss
        loss = output.loss
        
        val_losses.append(loss.item())

By the end of it, your losses on TensorBoard should look more or less like this (if you drag the slider on the right to the maximum level of smoothing):

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch6/tensorboard.png)

### 8.5.5 Inference

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step5.png)

Write a function that takes some text (a sequence of words), a model, its tokenizer, and a list of target categories for the classification, and returns the most likely category and the corresponding probability.

Since you're handling a single sequence, there's no need for any padding, but you still need to provide a tensor containing a mini-batch (of one) as input to the model.

The model returns two logits, one for each class, so you must use the softmax function to convert them into probabilities.
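
As a reminder of what the softmax does, the pure-Python sketch below turns two made-up logits into probabilities and picks the most likely category. It works on plain lists rather than tensors, so it is not a solution to the exercise; `softmax` and `top_category` are hypothetical helper names.

```python
import math

def softmax(logits):
    # Subtracting the maximum first avoids overflow in exp()
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_category(logits, categories):
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the top probability
    return {"label": categories[best], "value": probs[best]}

toy_result = top_category([-1.2, 2.3], ["negative", "positive"])
print(toy_result)
```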

In [None]:
def predict(sequence, model, tokenizer, categories):        
    # Build a tensor of token ids out of the input sequence
    # write your code here
    ...

    # Set the model to the appropriate mode
    # write your code here
    ...

    device = next(iter(model.parameters())).device
    
    # Use the model to make predictions/logits
    # Tip: Don't forget to send the input to the same device as the model
    # write your code here
    pred = ...
    
    # Compute the probabilities corresponding to the logits
    # and return the top value and index
    # write your code here
    probabilities = ...
    values, indices = ...
    
    return [{'label': categories[i], 'value': v.item()} for i, v in zip(indices, values)]

Now, try out your prediction function and fine-tuned model (just run the code cells below as they are to visualize their outputs):

In [None]:
categories = ['negative', 'positive']
text = "I am really liking this course"
predict(text, model, tokenizer, categories)

In [None]:
text = "This course is too complicated!"
predict(text, model, tokenizer, categories)

That's cool, but what if we could perform sentiment analysis out-of-the-box? That's what we'll do in the second part of Chapter 6.