# Lab Instructions

In each lab, you're presented with a task, such as building a dataset, training a model, or writing a training loop. We provide code structured so that you can fill in the blanks using the knowledge you acquired in the chapters that precede the lab. You should be able to find appropriate snippets of code in the course content that work in the lab with minor or no adjustments.

The blanks in the code are indicated by an ellipsis (`...`) and a comment (`# write your code here`).

In some cases, we'll provide partial code to ensure that the right variables are populated and that any code that follows runs as expected:

```python
# write your code here
x = ...
```

The solution should be a single statement that replaces the ellipsis, such as:

```python
# write your code here
x = [0, 1, 2]
```

In other cases, when no new variable is being created, the blanks look like the example below:

```python
# write your code here
...
```

Although we're showing you only a single ellipsis (`...`), you may have to write more than one line of code to complete the step, such as:

```python
# write your code here
for i, xi in enumerate(x):
    x[i] = xi * 2
```

## Installation Notes

To run this notebook on Google Colab, you will need to install the following libraries: transformers, evaluate, and datasets.

In Google Colab, you can run the following command to install these libraries:

In [None]:
!pip install transformers evaluate datasets

## 15.10 Lab 6: Text Classification using Embeddings

It is time to get our hands dirty! Let's use GloVe pretrained word embeddings as features for a multi-class linear classification model. It works like a linear regression model, but it produces four logits as output (one for each class in the AG News Dataset), and we'll use the softmax function to convert the logits into probabilities.
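
To make the logits-to-probabilities step concrete, here is a minimal sketch of the softmax function in plain Python. The logit values are made up for illustration, and the lab itself works with PyTorch tensors rather than Python lists.

```python
import math

def softmax(logits):
    # Subtract the largest logit first for numerical stability;
    # this doesn't change the resulting probabilities
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Four made-up logits, one per AG News class
logits = [2.0, 0.5, -1.0, 0.1]
probs = softmax(logits)
print([round(p, 3) for p in probs])
print(sum(probs))  # the probabilities add up to (approximately) one
```

The class with the largest logit always ends up with the largest probability, which is why we can pick the most likely class straight from the logits later on.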

### 15.10.1 Recap

In the last chapter, we loaded the AG News Dataset, cleaned out special characters and HTML tags, and discarded the title information, returning only labels and (cleaned) descriptions. Let's quickly retrace our steps here to prepare the dataset.

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step1.png)

First, we need to download the dataset. You can download the files from the following links:
- `https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv`
- `https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/test.csv`
- `https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/classes.txt`

Alternatively, you can download all files as a single compressed file instead:

```
https://github.com/dvgodoy/assets/raw/main/PyTorchInPractice/data/AGNews/agnews.zip
```

If you're running Google Colab, you can download the files using the commands below:

In [None]:
!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv
!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/test.csv
!wget https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/classes.txt

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step2.png)

Next, let's do some data cleaning, getting rid of a few HTML tags, replacing some special characters, etc. Here is a non-exhaustive list of characters and tags for replacement:

In [None]:
import numpy as np

chr_codes = np.array([
     36,   151,    38,  8220,   147,   148,   146,   225,   133,    39,  8221,  8212,   232,   149,   145,   233,
  64257,  8217,   163,   160,    91,    93,  8211,  8482,   234,    37,  8364,   153,   195,   169
])
chr_subst = {f' #{c};':chr(c) for c in chr_codes}
chr_subst.update({' amp;': '&', ' quot;': "'", ' hellip;': '...', ' nbsp;': ' ', '&lt;': '', '&gt;': '',
                  '&lt;em&gt;': '', '&lt;/em&gt;': '', '&lt;strong&gt;': '', '&lt;/strong&gt;': ''})

And here are a couple of helper functions we used to perform the cleanup:

In [None]:
def replace_chars(sent):
    to_replace = [c for c in list(chr_subst.keys()) if c in sent]
    for c in to_replace:
        sent = sent.replace(c, chr_subst[c])
    return sent

def preproc_description(desc):
    desc = desc.replace('\\', ' ').strip()
    return replace_chars(desc)
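
To see what these helpers do to a raw description, here is a small self-contained sketch. It repeats the two functions with a reduced, illustrative substitution table (the real one above is much larger), and the sample sentence is made up.

```python
# Reduced, illustrative substitution table (the lab's table is larger)
chr_subst = {' #36;': '$', ' amp;': '&', ' quot;': "'"}

def replace_chars(sent):
    to_replace = [c for c in chr_subst if c in sent]
    for c in to_replace:
        sent = sent.replace(c, chr_subst[c])
    return sent

def preproc_description(desc):
    # Backslashes become spaces, then surrounding whitespace is stripped
    desc = desc.replace('\\', ' ').strip()
    return replace_chars(desc)

raw = 'AT amp;T shares hit #36;20\\ '
clean = preproc_description(raw)
print(clean)
```

Notice that every key in the table starts with a space, so the space that precedes a code such as ` #36;` is consumed by the replacement.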

After loading the CSV files using `load_dataset()` and building a `DatasetDict` out of them, we used the functions above to transform our datasets, cleaning up the text and converting the label into a 0-based numeric value:

In [108]:
from datasets import load_dataset, Split, DatasetDict

colnames = ['topic', 'title', 'news']

train_ds = load_dataset("csv", data_files='train.csv', sep=',', split=Split.ALL, column_names=colnames)
test_ds = load_dataset("csv", data_files='test.csv', sep=',', split=Split.ALL, column_names=colnames)

datasets = DatasetDict({'train': train_ds, 'test': test_ds})
datasets = datasets.map(lambda row: {'topic': row['topic']-1, 'news': preproc_description(row['news'])})
datasets = datasets.select_columns(['topic', 'news'])



### 15.10.2 Tokenizing and Embedding

Let's plan ahead what needs to be done:
- create data loaders, one for each split
- write a function that tokenizes the sentences in a given batch
- write a function that converts tokens into token ids for every sentence in a given batch
- retrieve the word embeddings for each and every token
- create a linear model that takes the embedding vectors as features
- create the appropriate loss function and optimizer
- write a training loop

Create two data loaders, one for each split (training and validation/test). For now, use a small batch size, such as four, to be able to more easily peek at the values. Later on, you'll recreate the data loader with a more appropriate batch size.

In [None]:
from torch.utils.data import DataLoader

dataloaders = {}
# write your code here
dataloaders['train'] = ...
dataloaders['test'] = ...

Fetch one mini-batch of data to make sure it's working fine. Just run the code below as is to visualize the output:

In [None]:
batch = next(iter(dataloaders['train']))
labels, sentences = batch['topic'], batch['news']
labels, sentences
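
If you're wondering why a mini-batch can be indexed by column name, the sketch below shows how PyTorch's default collation handles rows that are dictionaries. The list of rows is a toy stand-in for the lab's dataset, not the real thing.

```python
from torch.utils.data import DataLoader

# Toy stand-in for the lab's dataset: each row is a dictionary
rows = [
    {'topic': 0, 'news': 'first story'},
    {'topic': 2, 'news': 'second story'},
    {'topic': 1, 'news': 'third story'},
    {'topic': 3, 'news': 'fourth story'},
]

toy_loader = DataLoader(rows, batch_size=2, shuffle=False)
toy_batch = next(iter(toy_loader))
print(toy_batch['topic'])  # integer labels are stacked into a tensor
print(toy_batch['news'])   # strings are gathered into a list
```

The mini-batch is a dictionary with the same keys as the rows, where each value holds that column for the whole mini-batch.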

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step3.png)

Now, write a function that tokenizes a mini-batch of sentences. The function must take as arguments:
- a tuple or list containing multiple sentences (as returned by the data loader)
- an optional tokenizer: if the tokenizer isn't provided, it should fall back to the default `simple_preprocess()` function we have used before

The function must return a list of lists of tokens.

In [None]:
from gensim.utils import simple_preprocess

def tokenize_batch(sentences, tokenizer=None):
    # Create the basic tokenizer if one isn't provided
    # write your code here
    ...
    
    # Tokenize sentences and return the result
    # write your code here
    ...

Try your function out and assign its output to the `tokens` variable. Just run the code below as is to visualize the output:

In [None]:
tokens = tokenize_batch(sentences)
for v in tokens:
    print(v)

More likely than not, each sentence in a mini-batch has a different number of tokens. How many tokens are there in each sentence? Just run the code below as is to see the answer:

In [None]:
[len(s) for s in tokens]

Now, let's briefly discuss two different approaches to handling this issue.

#### 15.10.2.1 Alternative 1: Padding

Did padding come to mind? We have taken this approach time and again.

Now, you'll write a function called `encode_batch()` that combines both truncating and padding operations. The function must take as arguments:
- a vocabulary dictionary, mapping tokens/words to their corresponding indices
- a list of lists of tokens (as returned by the `tokenize_batch()` function)
- the maximum number of tokens per sequence, above which sequences are truncated
- an optional boolean argument indicating if the sequences should be padded
- an optional id for the padding token (e.g. -1)
- an optional id for the unknown token (e.g. -1)

The function must truncate sequences of tokens that are too long and, afterward, pad the sequences so the shorter ones match the length of the longest.

It must return a list of lists of token ids, every inner list having the same length.
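
The truncate-then-pad logic is easier to picture on a toy example. The sketch below uses a hypothetical helper, `truncate_and_pad()`, that works on lists of made-up ids that are already encoded. It isn't the `encode_batch()` function itself, which also has to convert tokens into ids.

```python
def truncate_and_pad(batch_ids, max_len=None, pad_id=-1):
    # Hypothetical helper, for illustration only
    # Cut every sequence down to at most max_len items
    if max_len is not None:
        batch_ids = [ids[:max_len] for ids in batch_ids]
    # The target length is the longest sequence left after truncation
    longest = max(len(ids) for ids in batch_ids)
    # Right-pad the shorter sequences with pad_id
    return [ids + [pad_id] * (longest - len(ids)) for ids in batch_ids]

toy_ids = [[5, 8, 2, 9, 4], [7, 3], [1, 6, 0]]
padded = truncate_and_pad(toy_ids, max_len=4)
print(padded)
```

The first sequence loses its fifth item, and the other two are padded up to four items, the longest length left after truncation.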

We're loading Gensim's GloVe embeddings, so you may use its `key_to_index` attribute (a regular Python dictionary) as the vocabulary dictionary. You can also call the `encode_str()` function from Chapter 15 to convert words/tokens into their corresponding ids.

Perhaps you've also noticed that the default values for both padding and unknown tokens are the same. We'll keep them like that for now, but we'll assign them other values shortly.

In [137]:
from gensim import downloader

vec = downloader.load('glove-wiki-gigaword-50')

def encode_str(key_to_index, tokens, unk_token=-1):
    token_ids = [key_to_index.get(token, unk_token) for token in tokens]
    return token_ids
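
Here is a quick, self-contained look at how `encode_str()` behaves, using a made-up three-word vocabulary in place of GloVe's `key_to_index`:

```python
def encode_str(key_to_index, tokens, unk_token=-1):
    token_ids = [key_to_index.get(token, unk_token) for token in tokens]
    return token_ids

# Made-up three-word vocabulary standing in for GloVe's key_to_index
toy_vocab = {'the': 0, 'cat': 1, 'sat': 2}
encoded = encode_str(toy_vocab, ['the', 'cat', 'meowed'])
print(encoded)
```

Any token that's missing from the vocabulary (`'meowed'`, in this case) is mapped to the unknown token id.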

In [138]:
def encode_batch(key_to_index, batch, max_len=None, padding=False, pad_token_id=-1, unk_token_id=-1):
    # Truncate every sentence to max_len
    if isinstance(max_len, int):
        # write your code here
        truncated = ...
    else:
        truncated = batch[:]

    # Check the actual maximum length of the (truncated) inputs
    # write your code here
    ...
    
    batch_ids = []
    for tokens in truncated:
        # write your code here
        token_ids = ...
        
        if padding:
            # Append as many padding tokens as necessary to make every
            # sentence as long as the actual maximum length
            # write your code here
            ...
        batch_ids.append(token_ids)
    return batch_ids

In [None]:
print(encode_batch(vec.key_to_index, tokens, padding=True, max_len=50))

Double-check that every inner list has the same length, as expected. Just run the code below as is to visualize the output:

In [None]:
padded_token_ids = encode_batch(vec.key_to_index, tokens, padding=True)
lengths = [len(s) for s in padded_token_ids]
lengths

Same length everywhere? Great!

What if we try retrieving the embeddings for the padded sequences?

In [143]:
import torch
import torch.nn as nn

tensor_glove = torch.as_tensor(vec.vectors).float()
embedding = nn.Embedding.from_pretrained(tensor_glove)

def get_embeddings(embedding, token_ids):
    valid_ids = torch.as_tensor([token_id for token_id in token_ids if token_id >= 0])
    embedded_tokens = embedding(valid_ids)
    return embedded_tokens

In [144]:
print(len(padded_token_ids[0]))
print(padded_token_ids[0])
print(get_embeddings(embedding, padded_token_ids[0]).shape)

31
[39, 1765, 15697, 204, 0, 3429, 17, 85, 281, 19348, 207, 39, 1007, 163, 4364, 5, 2385, 4, 0, 16171, 5, 0, 20498, 46674, 5, 81, 3807, 12, 4214, -1, -1]
torch.Size([29, 50])


It shouldn't be a surprise that there are fewer embeddings than ids in the padded sequence. After all, we are explicitly filtering out invalid ids in the `get_embeddings()` function. We have to make one small change to our embeddings to account for the possibility of padding and unknown tokens.

So, we are appending not one, but two tensors full of zeros to our embeddings, one for the padding token and another one for the unknown token. We're setting the pad token id to the index corresponding to the second-to-last entry in the embedding layer, and the unknown token to the entry after that. This way, the mappings (from word/token to index) are preserved, and the only difference is that we call the `encode_batch()` function using different padding and unknown token ids this time.

Run the code below to see an example:

In [146]:
tensor_glove = torch.as_tensor(vec.vectors).float()
tensor_glove = torch.cat([tensor_glove, torch.zeros((2, vec.vector_size))])

embedding = nn.Embedding.from_pretrained(tensor_glove)
# padding and unknown tokens are the last ones, so we don't mess with the key_to_index
pad_token_id = embedding.num_embeddings - 2
unk_token_id = pad_token_id + 1

padded_token_ids = encode_batch(vec.key_to_index, tokens, padding=True, pad_token_id=pad_token_id, unk_token_id=unk_token_id)

get_embeddings(embedding, padded_token_ids[0])

tensor([[ 0.7084, -0.5736,  0.1538,  ..., -0.2311, -0.3122, -0.3049],
        [ 0.4970, -0.3865,  0.5308,  ...,  0.0519,  0.3175,  0.1139],
        [ 0.4887, -0.7783, -0.1362,  ..., -0.7730, -0.3977, -1.0417],
        ...,
        [-0.4884, -0.1495, -0.6888,  ..., -0.7637,  0.1588, -0.1984],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]])

Cool, right? Now, we can retrieve embeddings using the same `get_embeddings()` function whether our sequences are padded or not.

What's next? Let's write a function that takes as arguments:
- an embedding layer
- a list of lists of token ids

And retrieves the corresponding embeddings for the whole batch as a tensor in the shape (N, L, D) where:
- N is the number of data points in a mini-batch
- L is the number of tokens in each sequence (they all have the same length now)
- D is the number of dimensions in each embedding vector (50 in our instance of GloVe)

In [None]:
def get_batch_embeddings(embedding, token_ids):
    # Retrieve embeddings from the embedding layer using the token ids
    # Make sure to get the shapes right, and concatenate the tensors so
    # the resulting shape is N, L, D
    # write your code here
    embeddings = ...
    return embeddings

Just run the code below as is to inspect the shape of the embeddings:

In [None]:
token_ids = encode_batch(vec.key_to_index, tokens, padding=True, pad_token_id=pad_token_id, unk_token_id=unk_token_id)
embeddings = get_batch_embeddings(embedding, token_ids)
embeddings.shape

There it is, the expected (N, L, D) shape. Let's take a quick look at the embeddings themselves. Just run the code below as is to visualize them:

In [None]:
embeddings

At the end of each tensor (in the first dimension, there are four of them), you'll see a bunch of zeros. These correspond to the padding tokens, which map to one of the zero vectors we appended to the GloVe embeddings.

It looks like a waste of space and computation to handle all these zero embeddings, right? As it turns out, these can either be ignored (by using masks that identify which tokens are meaningful - more on that later), or they can be completely dismissed at a much earlier stage, which brings us to the second alternative.
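
As a small preview of the masking idea, the sketch below builds a boolean mask from made-up token ids and uses it to average only the meaningful positions. The padding id and the tensors are hypothetical, and this is just one possible use of a mask, not necessarily how larger models apply it.

```python
import torch

pad_id = 9  # hypothetical padding id for this toy example
toy_token_ids = torch.tensor([[3, 1, 4], [2, 9, 9]])  # shape (N=2, L=3)
toy_embeddings = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 4.0], [0.0, 0.0], [0.0, 0.0]],
])                                                     # shape (N=2, L=3, D=2)

# True for real tokens, False for padding
mask = toy_token_ids != pad_id                         # shape (N, L)

# Zero out padded positions, then divide by the number of real tokens
summed = (toy_embeddings * mask.unsqueeze(-1)).sum(dim=1)
counts = mask.sum(dim=1, keepdim=True)
masked_mean = summed / counts
print(masked_mean)
```

The second sequence has a single real token, so its masked average is that token's embedding, unaffected by the two padded positions.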

#### 15.10.2.2 Alternative 2: Bag of Embeddings

The main purpose of padding sequences is to get matching lengths for all of them. After all, our models can only handle neatly organized tensors as inputs.

But what if we could get a single, neatly organized tensor directly out of the sequence? One way to accomplish this is to simply compute the embeddings for each token in a sequence, regardless of how long the sequence actually is, and then aggregate all these tensors together by averaging them. That's called a bag of embeddings (BoE), and PyTorch even offers a special layer for it (`nn.EmbeddingBag`) that does the whole thing.
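
For reference, here is a minimal sketch of `nn.EmbeddingBag` in action. The weights are made up (four tokens, two dimensions each) instead of GloVe's, and the lab doesn't require this layer, since you'll compute the averages yourself.

```python
import torch
import torch.nn as nn

# Made-up "pretrained" weights: a vocabulary of 4 tokens, 2 dimensions each
weights = torch.tensor([[0.0, 0.0], [2.0, 4.0], [4.0, 8.0], [6.0, 0.0]])
bag = nn.EmbeddingBag.from_pretrained(weights, mode='mean')

# Two sequences of different lengths, flattened into a single 1D tensor:
# [1, 2] and [3, 1, 2]
flat_ids = torch.tensor([1, 2, 3, 1, 2])
offsets = torch.tensor([0, 2])  # starting position of each sequence

toy_boe = bag(flat_ids, offsets)
print(toy_boe)        # one averaged vector per sequence
print(toy_boe.shape)
```

The layer takes sequences of different lengths and returns one averaged vector for each, without any padding involved.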

The result, in this case, is a single tensor, with as many elements as the dimensionality of our vector (50, in the case of our GloVe), for each sentence. In this approach, it doesn't make sense to pad the sequences, otherwise we would be lowering the average by introducing a lot of zeros.
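
A tiny numeric example shows how padding drags the average down. The vectors below are made up, and plain Python lists stand in for tensors.

```python
def mean_vector(vectors):
    # Element-wise average of a list of equally sized vectors
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]

# Two made-up token embeddings with two dimensions each
tokens_only = [[2.0, 4.0], [4.0, 8.0]]
# The same sequence after appending two zero-valued padding vectors
with_padding = tokens_only + [[0.0, 0.0], [0.0, 0.0]]

print(mean_vector(tokens_only))   # [3.0, 6.0]
print(mean_vector(with_padding))  # [1.5, 3.0]
```

Doubling the sequence length with zeros halves the average, even though the meaningful tokens are exactly the same.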

Let's try this approach out! First, we retrieve the embeddings corresponding to the tokens in a given sentence. Just run the code below as is:

In [None]:
token_ids = encode_batch(vec.key_to_index, tokens, padding=False)
embeddings = get_embeddings(embedding, token_ids[0])

embeddings.shape

We'll get as many vectors back as there are tokens in the first sentence. Let's average them. Just run the code below as is to compute the average embedding for the sentence:

In [None]:
boe = embeddings.mean(axis=0)
boe.shape

That's it, a single tensor of average embeddings. Easy, right?

Now, write a function that takes as arguments:
- an embedding layer
- a list of lists of token ids

It must retrieve the embeddings for the tokens in each inner list, average them, and concatenate the results together, so the resulting tensor to be returned has the shape (N, D):

In [None]:
def get_bag_of_embeddings(embedding, token_ids):
    # Retrieve embeddings from the embedding layer using the token ids
    # For every list of tokens, take the average of their embeddings
    # Make sure to get the shapes right, and concatenate the tensors so
    # the resulting shape is N, D    
    # write your code here
    embeddings = ...
    return embeddings

Just run the code below as is to inspect the shape of the embeddings:

In [None]:
token_ids = encode_batch(vec.key_to_index, tokens, padding=False)
boe = get_bag_of_embeddings(embedding, token_ids)
boe.shape

The bag of embeddings is surely much easier to handle, so we're sticking with that in this lab. Later on, when using larger models such as BERT, we'll go back to using the first alternative, including padding and masking.

#### 15.10.2.3 Data Loaders

Now, recreate the data loaders using a larger batch size this time.

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step5.png)

In [None]:
dataloaders = {}
# write your code here
dataloaders['train'] = ...
dataloaders['test'] = ...

### 15.10.3 Training Loop

Before writing the training loop itself, you need to:
- create a model that's able to take a batch of bags of embeddings as inputs, and produce four logits as outputs (we suggest keeping it as simple as a single linear layer, but you're welcome to try more complex models)
- create an appropriate loss function for multi-class classification
- create an optimizer to handle the model's parameters
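
If you'd like a reminder of what a multi-class loss actually computes, the sketch below works out the cross-entropy for a single data point in plain Python. It assumes cross-entropy is the loss you pick, and the logits are made up. In the lab, you'd use PyTorch's built-in loss function on tensors instead of this hand-written version.

```python
import math

def cross_entropy(logits, label):
    # Log-sum-exp of the logits, computed in a numerically stable way
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
    # Negative log-probability assigned to the true class
    return log_sum_exp - logits[label]

logits = [2.0, 0.5, -1.0, 0.1]  # made-up logits for the four classes
print(cross_entropy(logits, label=0))  # lower loss: class 0 has the largest logit
print(cross_entropy(logits, label=2))  # higher loss: class 2 has the smallest logit
```

The loss takes raw logits, not probabilities, so there's no need to apply softmax to the model's outputs before computing it.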

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step1.png)

In [None]:
import torch
import torch.nn as nn

torch.manual_seed(11)
# write your code here
model = ...

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step2.png)

In [None]:
# write your code here
loss_fn = ...

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step3.png)

In [None]:
import torch.optim as optim

# Suggested learning rate
lr = 1e-3
# write your code here
optimizer = ...

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step4.png)

Finally, you may write the training loop. It is mostly the typical stuff we've done time and again, but remember that your mini-batches are dictionaries, and you have to tokenize and encode the sentences (that is, convert tokens into token ids), and compute their corresponding bags of embeddings before feeding them to the model. You may leverage the functions you've already written to easily accomplish that.

In [None]:
vec = downloader.load('glove-wiki-gigaword-50')

tensor_glove = torch.as_tensor(vec.vectors).float()
# we don't need to bother appending zero tensors for padding and unknown tokens
# since we're using a bag of embeddings, that is, we simply average the valid
# tokens only and ignore the rest.
embedding = nn.Embedding.from_pretrained(tensor_glove)

batch_losses = []
device = 'cuda' if torch.cuda.is_available() else 'cpu'

model.to(device)

## Training
for i, batch in enumerate(dataloaders['train']):
    # Set the model's mode
    # write your code here
    ...

    # Unpack your batch (it has labels and sentences)
    # Tokenize and encode the sentences, and compute their bags of embeddings
    # write your code here
    ...
    embeddings = ...

    embeddings = embeddings.to(device)
    labels = labels.to(device)

    # Step 1 - forward pass
    # write your code here
    predictions = ...

    # Step 2 - computing the loss
    # write your code here
    loss = ...
    
    # Step 3 - computing the gradients
    # write your code here
    ...
    
    batch_losses.append(loss.item())

    # Step 4 - updating parameters and zeroing gradients
    # write your code here
    ...

It shouldn't take long to train this model (if you followed our suggestion to keep it as simple as it can be, that is). Just run the code below as is to visualize the losses:

In [None]:
from matplotlib import pyplot as plt
plt.plot(batch_losses)

### 15.10.4 Evaluation

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step5.png)

Losses are looking ok-ish, how about actual metrics? Let's use HuggingFace's `evaluate` package once again. This time, though, we're loading each metric (precision, recall, and accuracy) separately because we're dealing with a multi-class classification task, and this doesn't sit well with the `combine()` method (at the time of writing). Just run the code below as is to create evaluators for the three metrics:

In [None]:
import evaluate

metric1 = evaluate.load('precision', average=None)
metric2 = evaluate.load('recall', average=None)
metric3 = evaluate.load('accuracy')
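
To get a feel for what per-class metrics look like, here is a plain-Python sketch that computes precision and recall for each class, plus overall accuracy, on made-up labels and predictions. The helper function is hypothetical and only illustrates the arithmetic. In the lab, the `evaluate` metrics do this work for you.

```python
def per_class_precision_recall(references, predictions, n_classes):
    precision, recall = [], []
    for c in range(n_classes):
        true_pos = sum(1 for r, p in zip(references, predictions) if r == c and p == c)
        predicted = sum(1 for p in predictions if p == c)
        actual = sum(1 for r in references if r == c)
        # Guard against division by zero when a class is never predicted or never present
        precision.append(true_pos / predicted if predicted else 0.0)
        recall.append(true_pos / actual if actual else 0.0)
    return precision, recall

# Made-up labels and predictions for four classes
toy_references = [0, 0, 1, 1, 2, 2, 3, 3]
toy_predictions = [0, 1, 1, 1, 2, 0, 3, 3]

toy_precision, toy_recall = per_class_precision_recall(toy_references, toy_predictions, n_classes=4)
toy_accuracy = sum(r == p for r, p in zip(toy_references, toy_predictions)) / len(toy_references)
print(toy_precision)
print(toy_recall)
print(toy_accuracy)
```

Each of the two lists has one value per class, which is the kind of result you get when no averaging is applied across classes.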

Write an evaluation loop that goes over the mini-batches in the test data loader and:
- tokenizes the sentences
- encodes the sentences (converts their tokens into token ids)
- retrieves their corresponding bags of embeddings
- gets predictions from the model (logits)
- gets the most likely class from the logits
- adds both predicted classes and labels to the metric objects we've just created using their `add_batch()` method

In [None]:
model.eval()

for batch in dataloaders['test']:
    # Unpack your batch (it has labels and sentences)
    # Tokenize and encode the sentences, and compute their bags of embeddings
    # write your code here
    ...
    embeddings = ...
        
    embeddings = embeddings.to(device)
    labels = labels.to(device)

    # write your code here
    predictions = ...

    # write your code here
    pred_class = ...
    
    pred_class = pred_class.tolist()
    labels = labels.tolist()

    metric1.add_batch(references=labels, predictions=pred_class)
    metric2.add_batch(references=labels, predictions=pred_class)
    metric3.add_batch(references=labels, predictions=pred_class)

Finally, call each metric's `compute()` method to get the results. Just run the code below as is to visualize the resulting metrics:

In [None]:
metric1.compute(average=None), metric2.compute(average=None), metric3.compute()

A single linear layer can achieve roughly 85% accuracy, which isn't bad at all! Even old, traditional embeddings such as GloVe can lead to pretty decent results.