# Lab Instructions

In the lab, you're presented with a task such as building a dataset, training a model, or writing a training loop. We provide the code structured so that you can fill in the blanks using the knowledge you acquired in the chapters that precede the lab. You should be able to find appropriate snippets of code in the course content that work well in the lab with minor or no adjustments.

The blanks in the code are indicated by ellipsis (`...`) and comments (`# write your code here`).

In some cases, we'll provide you with partial code to ensure the right variables are populated and any code that follows runs accordingly.

```python
# write your code here
x = ...
```

The solution should be a single statement that replaces the ellipsis, such as:

```python
# write your code here
x = [0, 1, 2]
```

In other cases, when no new variable is being created, the blanks are shown as in the example below:

```python
# write your code here
...
```

Although we're showing you only a single ellipsis (`...`), you may have to write more than one line of code to complete the step, such as:

```python
# write your code here
for i, xi in enumerate(x):
    x[i] = xi * 2
```

## Installation Notes

To run this notebook on Google Colab, you will need to install the following libraries: transformers, evaluate, and portalocker.

In Google Colab, you can run the following command to install these libraries:

In [None]:
!pip install transformers evaluate portalocker

## 15.10 Lab 6: Text Classification using Embeddings

It is time to get our hands dirty! Let's use GloVe pretrained word embeddings as features for a multi-class linear classification model. It works like a linear regression model, but it produces four logits as output (one for each class in the AG News Dataset), and we'll use the softmax function to convert the logits into probabilities.
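
To make the logits-to-probabilities step concrete, here is a minimal sketch of the softmax function in plain Python. The four logits are made up for illustration, and the helper name `softmax` is our own; in the lab itself, PyTorch takes care of this step.

```python
import math

def softmax(logits):
    # Subtract the largest logit for numerical stability (the result is unchanged)
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    # Each probability is the exponentiated logit divided by the sum of all of them
    return [e / total for e in exps]

# Four made-up logits, one for each class
logits = [2.0, 0.5, -1.0, 0.1]
probs = softmax(logits)

# The most likely class is the one with the highest probability
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

The probabilities are all positive and add up to one, and the largest logit always yields the largest probability.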

### 15.10.1 Recap

In the last chapter, we created "raw" data pipes that load the CSV files from the AG News Dataset, clean up special characters and HTML tags, and discard the title information, returning only labels and (cleaned) descriptions. Let's quickly retrace our steps here to prepare the dataset.

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step1.png)

First, we need to download the dataset. You can download the files from the following links:
- `https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv`
- `https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/test.csv`
- `https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/classes.txt`

Alternatively, you can download all files as a single compressed file instead:

```
https://github.com/dvgodoy/assets/raw/main/PyTorchInPractice/data/AGNews/agnews.zip
```

If you're running Google Colab, you can download the files using the commands below:

In [None]:
!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv
!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/test.csv
!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/classes.txt

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step2.png)

Next, let's do some data cleaning, getting rid of a few HTML tags, replacing some special characters, etc. Here is a non-exhaustive list of characters and tags for replacement:

In [None]:
import numpy as np

chr_codes = np.array([
     36,   151,    38,  8220,   147,   148,   146,   225,   133,    39,  8221,  8212,   232,   149,   145,   233,
  64257,  8217,   163,   160,    91,    93,  8211,  8482,   234,    37,  8364,   153,   195,   169
])
chr_subst = {f' #{c};':chr(c) for c in chr_codes}
chr_subst.update({' amp;': '&', ' quot;': "'", ' hellip;': '...', ' nbsp;': ' ', '&lt;': '', '&gt;': '',
                  '&lt;em&gt;': '', '&lt;/em&gt;': '', '&lt;strong&gt;': '', '&lt;/strong&gt;': ''})

And here are a couple of helper functions we used to perform the cleanup:

In [None]:
def replace_chars(sent):
    to_replace = [c for c in list(chr_subst.keys()) if c in sent]
    for c in to_replace:
        sent = sent.replace(c, chr_subst[c])
    return sent

def preproc_description(desc):
    desc = desc.replace('\\', ' ').strip()
    return replace_chars(desc)
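
To see what the cleanup does, here is a self-contained sketch that mirrors the two helpers on a tiny, hand-picked subset of the substitution table. The input sentence is made up, and the `_demo` names are ours, chosen so they don't clash with the lab's own functions.

```python
# A tiny subset of the substitution table (for illustration only)
mini_subst = {' amp;': '&', ' quot;': "'"}

def replace_chars_demo(sent, subst):
    # Replace every key found in the sentence with its substitute
    for key, value in subst.items():
        if key in sent:
            sent = sent.replace(key, value)
    return sent

def preproc_demo(desc, subst):
    # Backslashes become spaces, then surrounding whitespace is stripped
    desc = desc.replace('\\', ' ').strip()
    return replace_chars_demo(desc, subst)

# Made-up description containing a backslash and two leftover entities
raw = 'AT amp;T said  quot;no comment quot;\\on Monday '
clean = preproc_demo(raw, mini_subst)
print(clean)  # AT&T said 'no comment' on Monday
```

Notice that the keys include the leading space, so that space is removed together with the leftover entity.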

Then, we used those functions to create a "raw" datapipe that loads the data from a CSV file, parses it, and applies the functions above to clean up the text. The function below also converts the label into a 0-based numeric value, and keeps only the labels and the cleaned-up text.

In [None]:
from torchdata.datapipes.iter import FileLister
from torch.utils.data import DataLoader

def create_raw_datapipe(fname):
    datapipe = FileLister(root='.')
    datapipe = datapipe.filter(filter_fn=lambda v: v.endswith(fname))
    datapipe = datapipe.open_files(mode='rt', encoding="utf-8")
    datapipe = datapipe.parse_csv(delimiter=",", skip_lines=0)
    datapipe = datapipe.map(lambda row: (int(row[0])-1, preproc_description(row[2])))
    return datapipe

In the previous chapter, we didn't actually train any models, so we didn't bother shuffling the training set. In this lab, however, we should:

In [None]:
datapipes = {}
datapipes['train'] = create_raw_datapipe('train.csv').shuffle(buffer_size=125000)
datapipes['test'] = create_raw_datapipe('test.csv')
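
If you're wondering what role the buffer size plays, the sketch below illustrates the general idea of buffer-based shuffling over a stream in plain Python. It is an assumption-laden illustration, not the library's actual implementation, and the function name `buffered_shuffle` is ours.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer first
            buffer.append(item)
        else:
            # Emit a random buffered item and put the incoming one in its place
            idx = rng.randrange(buffer_size)
            yield buffer[idx]
            buffer[idx] = item
    # Emit whatever is left in the buffer, in random order
    rng.shuffle(buffer)
    yield from buffer

# A small buffer only mixes nearby items; a buffer as large as the
# stream shuffles everything
small = list(buffered_shuffle(range(10), buffer_size=3, seed=13))
full = list(buffered_shuffle(range(10), buffer_size=10, seed=13))
print(small)
print(full)
```

In this sketch, a buffer of three means the first item to come out can only be one of the first three items in the stream, while a buffer at least as large as the stream amounts to a full shuffle.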

### 15.10.2 Tokenizing and Embedding

Let's plan ahead what needs to be done:
- create data loaders, one for each data pipe
- write a function that tokenizes the sentences in a given batch
- retrieve the word embeddings for each and every token
- create a linear model that takes the embedding vectors as features
- create the appropriate loss function and optimizer
- write a training loop

Create two data loaders, one for each data pipe (training and validation/test). For now, use a small batch size, such as four, to be able to more easily peek at the values. Later on, you'll recreate the data loader with a more appropriate batch size.

In [None]:
from torch.utils.data import DataLoader

dataloaders = {}
# write your code here
dataloaders['train'] = DataLoader(dataset=datapipes['train'], batch_size=4, shuffle=True)
dataloaders['test'] = DataLoader(dataset=datapipes['test'], batch_size=4)

Fetch one mini-batch of data to make sure it's working fine. Just run the code below as is to visualize the output:

In [None]:
labels, sentences = next(iter(dataloaders['train']))
labels, sentences

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step3.png)

Now, write a function that tokenizes a mini-batch of sentences. The function must take as arguments:
- a tuple or list containing multiple sentences (as returned by the data loader)
- an optional tokenizer: if the tokenizer isn't provided, it should fall back to the default `basic_english` tokenizer we have been using

The function must return a list of lists of tokens.

In [None]:
from torchtext.data import get_tokenizer

def tokenize_batch(sentences, tokenizer=None):
    # Create the basic tokenizer if one isn't provided
    # write your code here
    if tokenizer is None:
        tokenizer = get_tokenizer('basic_english')
    
    # Tokenize sentences and return the result
    # write your code here
    return [tokenizer(s) for s in sentences]

Try your function out and assign its output to the `tokens` variable. Just run the code below as is to visualize the output:

In [None]:
tokens = tokenize_batch(sentences)
tokens
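
The optional-argument pattern can also be tried without any library. The sketch below uses a hypothetical stand-in tokenizer (`simple_tokenizer`, which only lowercases and splits on whitespace, so it is not equivalent to `basic_english`) to show how the fallback and a custom tokenizer behave.

```python
def simple_tokenizer(sentence):
    # Hypothetical stand-in: lowercase and split on whitespace only
    return sentence.lower().split()

def tokenize_batch_demo(sentences, tokenizer=None):
    # Fall back to the stand-in tokenizer if one isn't provided
    if tokenizer is None:
        tokenizer = simple_tokenizer
    return [tokenizer(s) for s in sentences]

# Made-up mini-batch of two sentences
batch = ('Stocks rally on Wall Street', 'Rain delays final')

default_tokens = tokenize_batch_demo(batch)
# Any callable that maps a string to a list works, e.g. character-level tokens
char_tokens = tokenize_batch_demo(batch, tokenizer=list)
print(default_tokens)
print(char_tokens[1][:4])
```

Either way, the result is a list of lists of tokens, one inner list per sentence.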

More likely than not, each sentence in a mini-batch has a different number of tokens. How many tokens are there in each sentence? Just run the code below as is to see the answer:

In [None]:
[len(s) for s in tokens]

Now, let's briefly discuss two different approaches to handling this issue.

#### 15.10.2.1 Alternative 1: Padding

Did padding come to your mind? We have taken this approach time and again. However, we've always performed it on top of token indices, not tokens themselves: that's what we used `ToTensor()` for.

Now, you'll write a function called `fixed_length()` that combines both truncating and padding operations at token (word) level. The function must take as arguments:
- a list of lists of tokens (as returned by the `tokenize_batch()` function)
- the maximum number of tokens per sequence, beyond which sequences are truncated
- the string that represents the padding token (default `<pad>`)

The function must truncate sequences of tokens that are too long and, afterward, pad the sequences so the shorter ones match the length of the longest.

It must return a list of lists of tokens, every inner list having the same length.

In [None]:
def fixed_length(tokens_batch, max_len=128, pad_token='<pad>'):
    # Truncate every sentence to max_len
    # write your code here
    truncated = [s[:max_len] for s in tokens_batch]
    
    # Check the actual maximum length of the (truncated) inputs
    # write your code here
    current_max = max([len(s) for s in truncated])
    
    # Append as many padding tokens as necessary to make every
    # sentence as long as the actual maximum length
    # write your code here
    padded = [s + [pad_token]*(current_max-len(s)) for s in truncated]
    return padded

Double-check that every inner list has the same length, as expected. Just run the code below as is to visualize the output:

In [None]:
lengths = [len(s) for s in fixed_length(tokens)]
lengths

Same length everywhere? Great!
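
A toy batch makes the two operations easier to follow. The sketch below restates `fixed_length()` compactly so it runs on its own, and applies it to made-up single-letter tokens.

```python
def fixed_length(tokens_batch, max_len=128, pad_token='<pad>'):
    # Truncate first, then pad up to the longest (truncated) sequence
    truncated = [s[:max_len] for s in tokens_batch]
    current_max = max(len(s) for s in truncated)
    return [s + [pad_token] * (current_max - len(s)) for s in truncated]

# Made-up batch with sequences of lengths 5, 2, and 1
toy_batch = [['a', 'b', 'c', 'd', 'e'], ['f', 'g'], ['h']]

# The first sequence is truncated to three tokens, the others are padded
short = fixed_length(toy_batch, max_len=3)
# With a generous max_len, sequences are padded to the longest one (5), not to max_len
loose = fixed_length(toy_batch, max_len=10)
print(short)
print([len(s) for s in loose])
```

Notice that `max_len` is only an upper bound: the sequences end up as long as the longest one in the mini-batch, which may be shorter than `max_len`.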

Now, run the code below to load and uncompress GloVe vectors:

In [None]:
import os
from torchtext.vocab import GloVe

new_locations = {key: os.path.join('https://huggingface.co/stanfordnlp/glove/resolve/main',
                                   os.path.split(GloVe.url[key])[-1]) for key in GloVe.url.keys()}
GloVe.url = new_locations

vec = GloVe(name='6B', dim=50)

Next, write a function that takes as arguments:
- a list of lists of tokens
- an instance of `Vectors` (such as our own GloVe)

And retrieves the corresponding embeddings as a tensor in the shape (N, L, D) where:
- N is the number of data points in a mini-batch
- L is the number of tokens in each sequence (they all have the same length now)
- D is the number of dimensions in each embedding vector (50 in our instance of GloVe)

In [None]:
import torch

def get_embeddings(tokens, vec):
    # Pad all lists so they have matching lengths
    # write your code here
    padded = fixed_length(tokens)
    
    # Retrieve embeddings from the `Vectors` object using `get_vecs_by_tokens`
    # Make sure to get the shapes right, and concatenate the tensors so
    # the resulting shape is N, L, D
    # write your code here
    embeddings = torch.cat([vec.get_vecs_by_tokens(s).unsqueeze(0) for s in padded], dim=0)
    
    return embeddings

Just run the code below as is to inspect the shape of the embeddings:

In [None]:
embeddings = get_embeddings(tokens, vec)
embeddings.shape

There it is, the expected (N, L, D) shape. Let's take a quick look at the embeddings themselves. Just run the code below as is to visualize them:

In [None]:
embeddings

At the end of each tensor (there are four of them along the first dimension), you'll see a bunch of zeros. These correspond to the padding tokens: since they are unknown to GloVe, they are mapped to vectors of zeros.
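
The same behavior can be mimicked with a toy lookup table. In the sketch below, the three-dimensional vectors and the `lookup()` helper are made up for illustration, and plain nested lists stand in for tensors.

```python
# Hypothetical 3-dimensional vectors for a two-word vocabulary
toy_vectors = {'hello': [0.1, 0.2, 0.3], 'world': [0.4, 0.5, 0.6]}
D = 3

def lookup(token):
    # Unknown tokens (including '<pad>') fall back to a vector of zeros
    return toy_vectors.get(token, [0.0] * D)

# Two sequences already padded to the same length
padded = [['hello', 'world'], ['hello', '<pad>']]
embedded = [[lookup(tok) for tok in sent] for sent in padded]

# Nested lists playing the role of an (N, L, D) tensor
shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)           # (2, 2, 3)
print(embedded[1][1])  # [0.0, 0.0, 0.0]
```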

It looks like a waste of space and computation to handle all these zero embeddings, right? As it turns out, these can either be ignored (by using masks that identify which tokens are meaningful - more on that later), or they can be completely dismissed at a much earlier stage, which brings us to the second alternative.
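
As a preview of the masking idea, here is a minimal sketch that builds a mask from padded tokens. The `padding_mask()` helper is hypothetical, and real models typically expect masks as tensors rather than lists.

```python
def padding_mask(padded_batch, pad_token='<pad>'):
    # True marks a real token; False marks padding
    return [[tok != pad_token for tok in sent] for sent in padded_batch]

# Made-up batch, already padded to length three
padded = [['the', 'cat', 'sat'], ['hello', '<pad>', '<pad>']]
mask = padding_mask(padded)

# Summing each row of the mask recovers the original lengths
real_lengths = [sum(row) for row in mask]
print(mask)
print(real_lengths)  # [3, 1]
```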

#### 15.10.2.2 Alternative 2: Bag of Embeddings

The main purpose of padding sequences is to get matching lengths for all of them. After all, our models can only handle neatly organized tensors as inputs.

But what if we could get a single, neatly organized tensor directly out of the sequence? One way to accomplish this is to simply compute the embeddings for each token in a sequence, regardless of how long the sequence actually is, and then aggregate all these tensors together by averaging them. That's called a bag of embeddings (BoE), and PyTorch even offers a special layer for it (`nn.EmbeddingBag`) that does the whole thing.

The result, in this case, is a single tensor, with as many elements as the dimensionality of our vector (50, in the case of our GloVe), for each sentence. In this approach, it doesn't make sense to pad the sequences; otherwise, we would be lowering the average by introducing a lot of zeros.
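
A quick calculation shows how much padding can distort the average. The sketch below uses made-up two-dimensional vectors and a hypothetical `mean_vector()` helper in plain Python.

```python
def mean_vector(vectors):
    # Element-wise average of a list of equal-length vectors
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Two made-up embeddings for the real tokens of a sentence
real = [[2.0, 4.0], [4.0, 8.0]]
# The same sentence followed by two padding tokens (zero vectors)
padded = real + [[0.0, 0.0], [0.0, 0.0]]

mean_real = mean_vector(real)
mean_padded = mean_vector(padded)
print(mean_real)    # [3.0, 6.0]
print(mean_padded)  # [1.5, 3.0]
```

Two padding tokens were enough to cut the average in half, and the effect would vary from sentence to sentence depending on how much padding each one gets.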

Let's try this approach out! First, we retrieve the embeddings corresponding to the tokens in a given sentence. Just run the code below as is:

In [None]:
embeddings = vec.get_vecs_by_tokens(tokens[0])
embeddings.shape

We'll get as many vectors back as there are tokens in the first sentence. Let's average them. Just run the code below as is to compute the average embedding for the sentence:

In [None]:
boe = embeddings.mean(axis=0)
boe.shape

That's it, a single tensor of average embeddings. Easy, right?

Now, write a function that takes as arguments:
- a list of lists of tokens
- an instance of `Vectors` (such as our own GloVe)

It must retrieve the embeddings for the tokens in each inner list, average them, and concatenate the results together, so the resulting tensor to be returned has the shape (N, D):

In [None]:
def get_bag_of_embeddings(tokens, vec):
    # Retrieve embeddings from the `Vectors` object using `get_vecs_by_tokens`
    # For every list of tokens, take the average of their embeddings
    # Make sure to get the shapes right, and concatenate the tensors so
    # the resulting shape is N, D    
    # write your code here
    embeddings = torch.cat([vec.get_vecs_by_tokens(s).mean(axis=0).unsqueeze(0) for s in tokens], dim=0)
    
    return embeddings

Just run the code below as is to inspect the shape of the embeddings:

In [None]:
boe = get_bag_of_embeddings(tokens, vec)
boe.shape

The bag of embeddings is surely much easier to handle, so we're sticking with that in this lab. Later on, when using larger models such as BERT, we'll go back to using the first alternative, including padding and masking.

#### 15.10.2.3 Datapipes and Data Loaders

Now, recreate the "raw" datapipes and data loaders using a larger batch size this time. Don't forget to shuffle the training set.

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step4.png)

In [None]:
datapipes = {}
# write your code here
datapipes['train'] = create_raw_datapipe('train.csv').shuffle(buffer_size=125000)
datapipes['test'] = create_raw_datapipe('test.csv')

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step5.png)

In [None]:
dataloaders = {}
# write your code here
dataloaders['train'] = DataLoader(dataset=datapipes['train'], batch_size=32, shuffle=True)
dataloaders['test'] = DataLoader(dataset=datapipes['test'], batch_size=32)

### 15.10.3 Training Loop

Before writing the training loop itself, you need to:
- create a model that's able to take a batch of bags of embeddings as inputs, and produce four logits as outputs (we suggest keeping it as simple as a single linear layer, but you're welcome to try more complex models)
- create an appropriate loss function for multi-class classification
- create an optimizer to handle the model's parameters
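
If you'd like a reminder of what a multi-class loss computes, here is a minimal sketch of cross-entropy for a single data point in plain Python. The logits are made up and the `cross_entropy()` helper is ours; PyTorch's own loss function works on whole mini-batches and, by default, averages the result over them.

```python
import math

def cross_entropy(logits, label):
    # Log-sum-exp, with the largest logit subtracted for numerical stability
    largest = max(logits)
    log_sum_exp = largest + math.log(sum(math.exp(z - largest) for z in logits))
    # Negative log-probability assigned to the correct class
    return log_sum_exp - logits[label]

# Equal logits: every class gets probability 1/4, so the loss is log(4)
uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], label=2)
# A confident prediction is cheap when right and expensive when wrong
confident_right = cross_entropy([5.0, 0.0, 0.0, 0.0], label=0)
confident_wrong = cross_entropy([5.0, 0.0, 0.0, 0.0], label=1)
print(uniform, confident_right, confident_wrong)
```

A four-class model that spreads its probability evenly has a loss of log(4), roughly 1.39, which is a handy reference point when looking at the losses at the start of training.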

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step1.png)

In [None]:
import torch
import torch.nn as nn

torch.manual_seed(11)
# write your code here
model = nn.Sequential(nn.Linear(50, 4))

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step2.png)

In [None]:
# write your code here
loss_fn = nn.CrossEntropyLoss()

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step3.png)

In [None]:
import torch.optim as optim

# Suggested learning rate
lr = 1e-3
# write your code here
optimizer = optim.Adam(model.parameters(), lr=lr)

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step4.png)

Finally, you may write the training loop. It is mostly the typical stuff we've done time and again, but remember that your mini-batches are tuples of `(labels, sentences)`, and you have to tokenize the sentences, and compute their corresponding bags of embeddings before feeding them to the model. You may leverage the functions you've already written to easily accomplish that.

In [None]:
vec = GloVe(name='6B', dim=50)

batch_losses = []
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model.to(device)

# Training
for i, batch in enumerate(dataloaders['train']):
    # Set the model's mode
    # write your code here
    model.train()

    # Unpack your batch (it has labels and sentences)
    # Tokenize the sentences, and compute their bags of embeddings
    # write your code here
    labels, sentences = batch
    tokens = tokenize_batch(sentences)
    embeddings = get_bag_of_embeddings(tokens, vec)

    embeddings = embeddings.to(device)
    labels = labels.to(device)

    # Step 1 - forward pass
    # write your code here
    predictions = model(embeddings)

    # Step 2 - computing the loss
    # write your code here
    loss = loss_fn(predictions, labels)
    
    # Step 3 - computing the gradients
    # write your code here
    loss.backward()
    
    batch_losses.append(loss.item())

    # Step 4 - updating parameters and zeroing gradients
    # write your code here
    optimizer.step()
    optimizer.zero_grad()
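
To see the four steps without any framework, the sketch below trains a tiny linear classifier by hand in plain Python. Everything in it is an assumption made for illustration: the toy data, the two classes, and the use of plain gradient descent instead of Adam.

```python
import math

def softmax(logits):
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy data: two features and two classes (made up for illustration)
data = [([1.0, 0.0], 0), ([0.9, 0.1], 0), ([0.0, 1.0], 1), ([0.2, 0.8], 1)]
n_classes, n_features, lr = 2, 2, 0.5

# Parameters of a single linear layer, initialized with zeros
W = [[0.0] * n_features for _ in range(n_classes)]
b = [0.0] * n_classes

losses = []
for epoch in range(20):
    for x, y in data:
        # Step 1 - forward pass
        logits = [sum(w * xi for w, xi in zip(W[k], x)) + b[k]
                  for k in range(n_classes)]
        probs = softmax(logits)

        # Step 2 - computing the loss
        losses.append(-math.log(probs[y]))

        # Step 3 - computing the gradients with respect to the logits
        # (for softmax followed by cross-entropy: probabilities minus one-hot label)
        grad_logits = [p - (1.0 if k == y else 0.0) for k, p in enumerate(probs)]

        # Step 4 - updating parameters (plain gradient descent)
        for k in range(n_classes):
            for j in range(n_features):
                W[k][j] -= lr * grad_logits[k] * x[j]
            b[k] -= lr * grad_logits[k]

print(losses[0], sum(losses[-4:]) / 4)
```

There is no gradient zeroing here because the gradients are recomputed from scratch for every data point instead of being accumulated.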

It shouldn't take long to train this model (if you followed our suggestion to keep it as simple as it can be, that is). Just run the code below as is to visualize the losses:

In [None]:
from matplotlib import pyplot as plt
plt.plot(batch_losses)

### 15.10.4 Evaluation

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step5.png)

Losses are looking ok-ish, how about actual metrics? Let's use HuggingFace's `evaluate` package once again. This time, though, we're loading each metric (precision, recall, and accuracy) separately because we're dealing with a multi-class classification task, and this doesn't sit well with the `combine()` method (at the time of writing). Just run the code below as is to create evaluators for the three metrics:

In [None]:
import evaluate

metric1 = evaluate.load('precision', average=None)
metric2 = evaluate.load('recall', average=None)
metric3 = evaluate.load('accuracy')

Write an evaluation loop that goes over the mini-batches in the test data pipe and:
- tokenizes the sentences
- retrieves their corresponding bags of embeddings
- gets predictions from the model (logits)
- gets the most-likely class from the logits
- adds both predicted classes and labels to the metrics objects we've just created using their `add_batch()` method

In [None]:
model.eval()

for batch in dataloaders['test']:
    # Unpack your batch (it has labels and sentences)
    # Tokenize the sentences, and compute their bags of embeddings
    # write your code here
    labels, sentences = batch
    tokens = tokenize_batch(sentences)
    embeddings = get_bag_of_embeddings(tokens, vec)
        
    embeddings = embeddings.to(device)
    labels = labels.to(device)

    # write your code here
    predictions = model(embeddings)

    # write your code here
    pred_class = predictions.argmax(dim=1)
    
    pred_class = pred_class.tolist()
    labels = labels.tolist()

    metric1.add_batch(references=labels, predictions=pred_class)
    metric2.add_batch(references=labels, predictions=pred_class)
    metric3.add_batch(references=labels, predictions=pred_class)

Finally, call each metric's `compute()` method to get the results. Just run the code below as is to visualize the resulting metrics:

In [None]:
metric1.compute(average=None), metric2.compute(average=None), metric3.compute()
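
Using `average=None` asks for one precision and one recall value per class instead of a single aggregate. The sketch below computes the same kind of per-class figures by hand on made-up labels and predictions. The `per_class_metrics()` helper is ours, and returning zero for a class with nothing to divide by is a choice made for this sketch.

```python
def per_class_metrics(references, predictions, n_classes):
    precision, recall = [], []
    for c in range(n_classes):
        # True positives: predicted as class c and actually class c
        tp = sum(1 for r, p in zip(references, predictions) if r == c and p == c)
        predicted_c = sum(1 for p in predictions if p == c)
        actual_c = sum(1 for r in references if r == c)
        # Guard against division by zero for classes never predicted or never present
        precision.append(tp / predicted_c if predicted_c else 0.0)
        recall.append(tp / actual_c if actual_c else 0.0)
    correct = sum(1 for r, p in zip(references, predictions) if r == p)
    return precision, recall, correct / len(references)

# Made-up labels and predictions for a four-class problem
refs = [0, 0, 1, 1, 2, 2, 3, 3]
preds = [0, 1, 1, 1, 2, 0, 3, 3]

precision, recall, accuracy = per_class_metrics(refs, preds, n_classes=4)
print(precision)
print(recall)
print(accuracy)  # 0.75
```

Per-class values tell you which classes the model confuses with one another, something a single accuracy figure hides.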

A single linear layer can achieve roughly 85% accuracy, which isn't bad at all! Even old, traditional embeddings such as GloVe can lead to pretty decent results.