# Miscellanea

## Introduction

In this chapter we will go through various topics that did not find their own place in previous chapters.

## Multilabel classification

There is the following rough classification of classification tasks:

1. Binary - you assign to each sample one of two classes
2. Multiclass - you assign to each sample one of $N > 2$ classes
3. Multilabel - you assign to each sample a **subset** of $N$ classes, i.e. one sample can belong to multiple classes

## Multilabel classification

We have covered binary and multiclass classification but have not touched multilabel.
So let's do it now.

## Multilabel classification

The way to do multilabel classification with NNs in `Pytorch` is not much different then the multiclass case.

1. Instead of one encoding your target you encode it using a binary vector of lentgh $N$ (number of classes), where $1$ is in the $n$-th position if the sample is of class $n$ and $0$ otherwise.
2. Your model still outputs a $N$ dimensional vector.

## Multilabel classification

3. You use `nn.BCEWithLogitsLoss` loss function (binary cross-entropy with logit loss). This is sigmoid applied component-wise + cross-entropy.
4. When inferencing, to get class confidences, you apply sigmoid component-wise to the output.

## Multilabel classification

As an example, let's use this [dataset](https://huggingface.co/datasets/google-research-datasets/go_emotions).
It contains reddit comments and the goal is to predict what emotions the comment exhibits.
Of course, a tweet might exhibit multiple emotions, so this is a multilabel exercise.

Let's load the data first.

## Multilabel classification

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from spacy.lang.en import English
import torch
from torch.utils.data import Dataset, DataLoader

CLASS_LABELS = [
  'admiration', 'amusement', 'anger', 'annoyance',
  'approval', 'caring', 'confusion', 'curiosity',
  'desire', 'disappointment', 'disapproval', 'disgust',
  'embarrassment', 'excitement', 'fear', 'gratitude',
  'grief', 'joy', 'love', 'nervousness',
  'optimism', 'pride', 'realization', 'relief',
  'remorse', 'sadness', 'surprise', 'neutral'
]

TRAIN = pd.read_parquet("hf://datasets/google-research-datasets/go_emotions/simplified/train-00000-of-00001.parquet")
TEST = pd.read_parquet("hf://datasets/google-research-datasets/go_emotions/simplified/test-00000-of-00001.parquet")

TRAIN

  from .autonotebook import tqdm as notebook_tqdm


Unnamed: 0,text,labels,id
0,My favourite food is anything I didn't have to...,[27],eebbqej
1,"Now if he does off himself, everyone will thin...",[27],ed00q6i
2,WHY THE FUCK IS BAYLESS ISOING,[2],eezlygj
3,To make her feel threatened,[14],ed7ypvh
4,Dirty Southern Wankers,[3],ed0bdzj
...,...,...,...
43405,Added you mate well I’ve just got the bow and ...,[18],edsb738
43406,Always thought that was funny but is it a refe...,[6],ee7fdou
43407,What are you talking about? Anything bad that ...,[3],efgbhks
43408,"More like a baptism, with sexy results!",[13],ed1naf8


## Transformers for classification

Transformers can be used for classification.

Let's train a transformer for this multilabel classification task.
There are two modifications we need to make to adapt the transformer we used for text generation to a classification task.

## Transformers for classification

First, a transformer block produces a sequence of $e$ dimensional vectors, where $e$ is the embedding dimension.
What is usually done in classification is that at the end the output of transformer blocks is pooled over the sequence length dimension.
Doing this you get a tensor of shape $(b, e),$ where $b$ is the batch dimension.
You then project this tensor to a tensor of shape $(b, c),$ where $c$ is the number of classes, using a matrix.

## Transformers for classification

Second, we need to change the mask that we use to mask attention weights. If our true sequence length is $l$ (without padding) and the padded sequence length is $n$, then we can mask the weights in attention layers responsible for generating the last $n-l$ outputs. This mask essentially allows you to work with arbitrary length sequences that are bounded by some fixed length.

This mask improves performance significantly so you should always use it!

## Transformers for classification

First let's make a Dataset class for the comments.

In [2]:
from spacy.lang.en import English
import torch
from torch.utils.data import Dataset, DataLoader

class Dictionary:
  def __init__(self, min_count=10, init_tokens=None):
    self.nlp = English()
    self.min_count = min_count
    self.init_tokens = init_tokens
    self.i2t, self.t2i, self.no_tokens = self._default_maps()
    self.pad_idx = 0
    self.unk_idx = 1

  def _default_maps(self):
    # <pad> - token used for padding
    # <unk> - unknown, used for tokens not encountered in dictionary building
    i2t = ['<pad>', '<unk>']
    if self.init_tokens != None:
      i2t = [*i2t, *self.init_tokens]
    t2i = {token:index for index, token in enumerate(i2t)}
    return i2t, t2i, len(i2t)

  def build(self, corpus):
    tokens = {}
    for idx, row in enumerate(corpus):
      for token in self.nlp(row):
        if token.text.lower() not in tokens:
          tokens[token.text.lower()] = 1
        else:
          tokens[token.text.lower()] += 1
    i2t, _, _ = self._default_maps()
    self.i2t = [
      *i2t,
      *[token for token, count in tokens.items() if count >= self.min_count]
    ]
    self.t2i = {token:index for index, token in enumerate(self.i2t)}
    self.no_tokens = len(self.i2t)

  def string_to_idx(self, string, seq_length=None):
    tokens = [token.text.lower() for token in self.nlp(string) if not token.is_punct]
    return self.tokens_to_idx(tokens, seq_length)

  def tokens_to_idx(self, tokens, seq_length=None):
    idxs = [self.t2i[token] if token in self.t2i else self.unk_idx for token in tokens]
    if seq_length is not None:
      idxs = idxs + [self.pad_idx] * (seq_length - len(idxs))
      idxs = idxs[:seq_length]
    return idxs

  def idx_to_string(self, indices, ignore_pad=True):
    tokens = self.idx_to_tokens(indices, ignore_pad)
    return tokens.join(' ')

  def idx_to_tokens(self, indices, ignore_pad=True):
    if ignore_pad:
      return [self.i2t[idx] for idx in indices if idx != self.pad_idx]
    return [self.i2t[idx] for idx in indices]

class Comments(Dataset):
  def __init__(self, seq_length, train=False, dictionary=None):
    self.seq_length = seq_length

    if train:
      self.dataset = TRAIN
    else:
      self.dataset = TEST

    if dictionary is None:
      self.dictionary = Dictionary()
      self.dictionary.build(self.dataset["text"])
    else:
      self.dictionary = dictionary

    self.no_tokens = self.dictionary.no_tokens

  def __len__(self):
    return len(self.dataset)

  def __getitem__(self, idx):
    tokens = self.dictionary.string_to_idx(self.dataset.iloc[idx]["text"], seq_length=self.seq_length)
    tokens = torch.LongTensor(tokens)
    mask = ~(tokens == self.dictionary.pad_idx)

    target = self.dataset.iloc[idx]["labels"]
    target = torch.zeros(len(CLASS_LABELS), dtype=torch.float).scatter_(0, torch.tensor(target), value=1)
    return tokens, mask, target

seq_length = 50
batch_size = 128

train_data = Comments(
  seq_length=seq_length,
  train=True
)

test_data = Comments(
  seq_length=seq_length,
  train=False,
  dictionary=train_data.dictionary
)

train_dataloader = DataLoader(
  train_data,
  batch_size=batch_size,
  shuffle=True
)

test_dataloader = DataLoader(
  test_data,
  batch_size=batch_size,
  shuffle=False
)

## Transformers for classification

Let's build the model.

In [None]:
import torch
from torch import nn

class TransformerClassifier(nn.Module):
  def __init__(self, no_classes, seq_length, no_tokens, embed_dim, no_heads, depth):
    super().__init__()
    self.embed_dim = embed_dim
    self.no_heads = no_heads
    self.depth = depth

    self.token_embedding = nn.Embedding(embedding_dim=embed_dim, num_embeddings=no_tokens)
    self.pos_embedding = nn.Embedding(embedding_dim=embed_dim, num_embeddings=seq_length)

    self.tblocks = nn.ModuleList([
      nn.TransformerEncoderLayer(d_model=embed_dim, nhead=no_heads, dim_feedforward=3072, batch_first=True) # Implements GPT style transformer block
    ])

    self.toprobs = nn.Linear(embed_dim, no_classes)

  def forward(self, x, mask):
    tokens = self.token_embedding(x)
    
    b, n, e = tokens.size()
    positions = self.pos_embedding(torch.arange(n, device=tokens.device)).unsqueeze(0).expand(b, n, e)
    x = tokens + positions

    for tblock in self.tblocks:
      x = tblock(x, src_key_padding_mask=mask)

    x, _ = torch.max(x, dim=1) # Max pool across the seq_length dimension, conveniently this also
                               # removes the seq_length dimension from the tensor so we do not need to flatten

    return self.toprobs(x)

  def predict(self, x, mask):
    # We can abuse the fact that if x > 0 then sigmoid(x) > 0.5
    x = self.forward(x, mask)
    return torch.heaviside(x, torch.tensor(0, dtype=torch.float32)) # Returns 1 if x > 0 and 0 if x <= 0

## Transformers for classification

Let's copy over the model training code. Note that we edit it a bit to accomodate the mask and also to account for the multilabel task.

## Transformers for classification

In [4]:
from tqdm import tqdm
import sys

def train_epoch(dataloader, model, loss_fn, optimizer):
  model.train() # Set model to training mode

  total_loss = 0
  total_batches = 0

  with tqdm(dataloader, unit="batch", file=sys.stdout) as ep_tqdm:
    ep_tqdm.set_description("Train")
    for X, mask, y in ep_tqdm:
      X, mask, y = X.to(device), mask.to(device), y.to(device)

      # Forward pass
      pred = model(X, mask)
      loss = loss_fn(pred, y)
        
      # Backward pass
      loss.backward()
      optimizer.step()

      # Reset the computed gradients back to zero
      optimizer.zero_grad()

      # Output stats
      total_loss += loss
      total_batches += 1
      ep_tqdm.set_postfix(average_batch_loss=(total_loss/total_batches).item())

def eval_epoch(dataloader, model, loss_fn):
  model.eval() # Set model to inference mode
  
  total_loss = 0
  total_batches = 0

  with torch.no_grad(): # Do not compute gradients
    with tqdm(dataloader, unit="batch", file=sys.stdout) as ep_tqdm:
      ep_tqdm.set_description("Val")
      for X, mask, y in ep_tqdm:
        X, mask, y = X.to(device), mask.to(device), y.to(device)
        pred = model(X, mask)

        total_loss += loss_fn(pred, y)
        total_batches += 1

        ep_tqdm.set_postfix(average_batch_loss=(total_loss/total_batches).item())

## Transformers for classification

Now we can train the model! Remember to use the appropriate loss function for multilabel tasks.

In [5]:
#| output-location: slide
# Hyperparameters
learning_rate = 0.0001
epochs = 20

device = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"
print(f"Using {device} device")

model = TransformerClassifier(len(CLASS_LABELS), seq_length, train_data.no_tokens, 1024, 16, 6).to(device)

loss_fn = nn.BCEWithLogitsLoss().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Organize the training loop
for t in range(epochs):
  print(f"Epoch {t+1}\n-------------------------------")
  train_epoch(train_dataloader, model, loss_fn, optimizer)
  eval_epoch(test_dataloader, model, loss_fn)

print("Done!")

Using cuda device
Epoch 1
-------------------------------
Train: 100%|██████████| 340/340 [00:17<00:00, 19.56batch/s, average_batch_loss=0.148]
Val: 100%|██████████| 43/43 [00:01<00:00, 32.53batch/s, average_batch_loss=0.125]
Epoch 2
-------------------------------
Train: 100%|██████████| 340/340 [00:17<00:00, 19.81batch/s, average_batch_loss=0.116]
Val: 100%|██████████| 43/43 [00:01<00:00, 34.09batch/s, average_batch_loss=0.11] 
Epoch 3
-------------------------------
Train: 100%|██████████| 340/340 [00:17<00:00, 19.78batch/s, average_batch_loss=0.106]
Val: 100%|██████████| 43/43 [00:01<00:00, 33.78batch/s, average_batch_loss=0.104]
Epoch 4
-------------------------------
Train: 100%|██████████| 340/340 [00:17<00:00, 19.93batch/s, average_batch_loss=0.1]  
Val: 100%|██████████| 43/43 [00:01<00:00, 33.76batch/s, average_batch_loss=0.102]
Epoch 5
-------------------------------
Train: 100%|██████████| 340/340 [00:17<00:00, 19.75batch/s, average_batch_loss=0.0961]
Val: 100%|██████████| 4

## Transformers for classification

Let's compute precision and recall.

In [None]:
#| output-location: slide
from sklearn.metrics import classification_report

def compute_classification_report(model, dataloader):
  with torch.no_grad():
    y_pred = []
    y_true = []
    for X, mask, y in dataloader:
      X, mask, y = X.to(device), mask.to(device), y.to(device)
      y_pred = [*y_pred, *model.predict(X, mask).cpu()]
      y_true = [*y_true, *y.cpu()]
    print(classification_report(y_true, y_pred, target_names=CLASS_LABELS, zero_division=0))

compute_classification_report(model, test_dataloader)

                precision    recall  f1-score   support

    admiration       0.68      0.51      0.58       504
     amusement       0.77      0.75      0.76       264
         anger       0.52      0.27      0.36       198
     annoyance       0.43      0.15      0.22       320
      approval       0.40      0.25      0.31       351
        caring       0.35      0.17      0.23       135
     confusion       0.35      0.24      0.29       153
     curiosity       0.42      0.25      0.31       284
        desire       0.54      0.25      0.34        83
disappointment       0.36      0.13      0.19       151
   disapproval       0.40      0.08      0.13       267
       disgust       0.59      0.38      0.46       123
 embarrassment       0.44      0.19      0.26        37
    excitement       0.45      0.27      0.34       103
          fear       0.70      0.49      0.58        78
     gratitude       0.94      0.91      0.93       352
         grief       0.00      0.00      0.00  

## Early stopping

If you check the training statistics you will see that the model overfit.

It is a good idea to stop training when your model starts to overfit, i.e. the performance degrades on the validation set.
This is called early stopping.

To implement early stopping in `pytorch` we first need to make our model evaluation function output the statistics.

## Early stopping

In [7]:
def eval_epoch(dataloader, model, loss_fn):
  model.eval() # Set model to inference mode
  
  total_loss = 0
  total_batches = 0

  with torch.no_grad(): # Do not compute gradients
    with tqdm(dataloader, unit="batch", file=sys.stdout) as ep_tqdm:
      ep_tqdm.set_description("Val")
      for X, mask, y in ep_tqdm:
        X, mask, y = X.to(device), mask.to(device), y.to(device)
        pred = model(X, mask)

        total_loss += loss_fn(pred, y)
        total_batches += 1

        ep_tqdm.set_postfix(average_batch_loss=(total_loss/total_batches).item())
  
  return (total_loss/total_batches).item()

## Early stopping

To decide whether we should stop early we are going to need to keep some extra state.
So the cleanest solution would be to have an extra object that monitors training and decides whether we should stop.

## Early stopping

In [8]:
class EarlyStopper:
  def __init__(self, patience=1, threshold=0):
    self.patience = patience
    self.annoyance = 0

    self.threshold = threshold
    self.best_epoch = 0
    self.min_loss = 99999999999

  def should_early_stop(self, loss, epoch):
    if loss < self.min_loss:
      self.min_loss = loss
      self.annoyance = 0
      self.best_epoch = epoch
    elif loss > (self.min_loss + self.threshold):
      self.annoyance += 1
      if self.annoyance >= self.patience:
        return True
    return False

## Early stopping

Next we update our training routine to monitor overfitting.
We also save our model every epoch so that we could go back to the best iteration.

Saving your model every training epoch is called model checkpointing and its a good thing to do in general.
For example, if the machine you are training on decides to crash you will not loose all your progress if you saved some checkpoints.

## Early stopping

In [9]:
#| output-location: slide
import os

# Create a directory for checkpoints
if not os.path.exists("./checkpoints"):
  os.makedirs("./checkpoints")

# Hyperparameters
learning_rate = 0.0001
epochs = 20

device = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"
print(f"Using {device} device")

model = TransformerClassifier(len(CLASS_LABELS), seq_length, train_data.no_tokens, 768, 12, 6).to(device)

loss_fn = nn.BCEWithLogitsLoss().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
stopper = EarlyStopper(patience=2, threshold=0.001)

# Organize the training loop
for t in range(epochs):
  print(f"Epoch {t+1}\n-------------------------------")
  train_epoch(train_dataloader, model, loss_fn, optimizer)
  loss = eval_epoch(test_dataloader, model, loss_fn)
  torch.save(model, f"./checkpoints/model_epoch_{t+1}.pth")

  if stopper.should_early_stop(loss, t+1):
    print("Stopping early")
    break

print(f"Loading checkpoint for best epoch {stopper.best_epoch}")
model = torch.load(f"./checkpoints/model_epoch_{stopper.best_epoch}.pth", weights_only=False)
print("Done!")

Using cuda device
Epoch 1
-------------------------------
Train: 100%|██████████| 340/340 [00:13<00:00, 25.21batch/s, average_batch_loss=0.154]
Val: 100%|██████████| 43/43 [00:01<00:00, 41.25batch/s, average_batch_loss=0.134]
Epoch 2
-------------------------------
Train: 100%|██████████| 340/340 [00:13<00:00, 25.35batch/s, average_batch_loss=0.123]
Val: 100%|██████████| 43/43 [00:01<00:00, 41.09batch/s, average_batch_loss=0.115]
Epoch 3
-------------------------------
Train: 100%|██████████| 340/340 [00:13<00:00, 25.54batch/s, average_batch_loss=0.11] 
Val: 100%|██████████| 43/43 [00:01<00:00, 42.58batch/s, average_batch_loss=0.107]
Epoch 4
-------------------------------
Train: 100%|██████████| 340/340 [00:13<00:00, 25.40batch/s, average_batch_loss=0.104]
Val: 100%|██████████| 43/43 [00:01<00:00, 40.71batch/s, average_batch_loss=0.103]
Epoch 5
-------------------------------
Train: 100%|██████████| 340/340 [00:13<00:00, 25.56batch/s, average_batch_loss=0.0993]
Val: 100%|██████████| 4

## Early stopping

Let's compute precision and recall again.

In [10]:
compute_classification_report(model, test_dataloader)

                precision    recall  f1-score   support

    admiration       0.64      0.57      0.61       504
     amusement       0.75      0.82      0.78       264
         anger       0.62      0.19      0.29       198
     annoyance       0.44      0.15      0.22       320
      approval       0.57      0.19      0.29       351
        caring       0.41      0.13      0.20       135
     confusion       0.45      0.07      0.11       153
     curiosity       0.48      0.18      0.26       284
        desire       0.58      0.23      0.33        83
disappointment       0.50      0.01      0.01       151
   disapproval       0.37      0.06      0.10       267
       disgust       0.65      0.27      0.38       123
 embarrassment       0.00      0.00      0.00        37
    excitement       0.54      0.26      0.35       103
          fear       0.76      0.37      0.50        78
     gratitude       0.93      0.90      0.91       352
         grief       0.00      0.00      0.00  

## Balancing class weights

Now let's return to a multiclass classification problem.

Very often the distribution of the labels in your training dataset will not match the distribution that you will encouter in the "wild" (i.e. production).
This happens due to a variety of factors.
For example, it might be that indentifying one specific label is much easier then the rest.
So, over time your training set will gather unproportionally more examples of that label.

## Balancing class weights

The problem is that the model will learn this incorrect distribution. For example, if 90% of your training set is made up of one label then the model will be very trigger happy when assigning that label.

In practice, if the distribution of your training set is skewed, it is better to rebalance the distribution such that it is uniform.
That is, make the model learn that each label is as likely as the next one.

## Balancing class weights

This can be done by balancing the class weights.
That is, you assign a bigger weight to a sample if its label is more rare in the training dataset.

The formula you can use is
$$
  \text{weight}_i = \frac{\text{total no of samples}}{\text{no of classes}\times\text{no of samples of class i}}.
$$

## Balancing class weights

Note that doing this will probably reduce the performance of the model on your validation and test datasets (if these have the same wrong distribution), however the point is that it will improve performance in production.

## Balancing class weights

In `sklearn`, models have a `class_weight` parameter, which you can set to `'balanced'` to apply the above formula.

Here is one way to do this in `pytorch`:

## Balancing class weights

```
def compute_class_weights(dataloader):
  classes = []
  for _, _, y in dataloader:
    classes = [*classes, *y.argmax(dim=1)]
  
  no_classes = len(np.unique(classes))
  no_total = len(classes)
  no_in_class = np.unique_counts(classes).counts
  return torch.tensor(no_total/(no_classes*no_in_class))

weights = compute_class_weights(some_dataloader)
loss_fn = nn.CrossEntropyLoss(weight=class_weights).to(device)
```

## Learning rate scheduling and warmup

Neural networks with many layers might have trouble converging when you start training on fresh random weights.
I.e. if you had weights that were approximately correct the training would converge, however when you start with random weights training diverges.

There is a standard technique for mitigating this called warmup. The idea is that you start training with a very low learning rate for the first few epochs and then crank the learning rate back up to a normal level.

Our model is not really deep enough to benefit from warmup, but we can still checkout how to implement it in `pytorch`.

## Learning rate scheduling and warmup

Also, if you are using SGD to train it might be a good idea to periodically reduce the learning rate as you are training. This is not that necessary when using Adam or its variants.

In general, if you are tweaking the learning rate during training this is called learning rate scheduling.

## Learning rate scheduling and warmup

In [12]:
#| output-location: slide
import os

# Create a directory for checkpoints
if not os.path.exists("./checkpoints"):
  os.makedirs("./checkpoints")

# Hyperparameters
learning_rate = 0.0001
epochs = 20

device = torch.accelerator.current_accelerator().type if torch.accelerator.is_available() else "cpu"
print(f"Using {device} device")

model = TransformerClassifier(len(CLASS_LABELS), seq_length, train_data.no_tokens, 1024, 16, 6).to(device)

loss_fn = nn.BCEWithLogitsLoss().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
scheduler = torch.optim.lr_scheduler.ConstantLR(optimizer, factor=0.1, total_iters=5) # Multiplies the LR by factor for total_iters iterations, there are other schedulers available in Pytorch
stopper = EarlyStopper(patience=2, threshold=0.1)

# Organize the training loop
for t in range(epochs):
  print(f"Epoch {t+1}\n-------------------------------")
  train_epoch(train_dataloader, model, loss_fn, optimizer)
  loss = eval_epoch(test_dataloader, model, loss_fn)
  torch.save(model, f"./checkpoints/model_epoch_{t+1}.pth")
  scheduler.step()

  if stopper.should_early_stop(loss, t+1):
    print("Stopping early")
    break

print(f"Loading checkpoint for best epoch {stopper.best_epoch}")
model = torch.load(f"./checkpoints/model_epoch_{stopper.best_epoch}.pth", weights_only=False)
print("Done!")

Using cuda device
Epoch 1
-------------------------------
Train: 100%|██████████| 340/340 [00:17<00:00, 19.71batch/s, average_batch_loss=0.178]
Val: 100%|██████████| 43/43 [00:01<00:00, 33.97batch/s, average_batch_loss=0.147]
Epoch 2
-------------------------------
Train: 100%|██████████| 340/340 [00:17<00:00, 19.48batch/s, average_batch_loss=0.148]
Val: 100%|██████████| 43/43 [00:01<00:00, 33.55batch/s, average_batch_loss=0.145]
Epoch 3
-------------------------------
Train: 100%|██████████| 340/340 [00:17<00:00, 19.54batch/s, average_batch_loss=0.146]
Val: 100%|██████████| 43/43 [00:01<00:00, 33.15batch/s, average_batch_loss=0.143]
Epoch 4
-------------------------------
Train: 100%|██████████| 340/340 [00:17<00:00, 19.25batch/s, average_batch_loss=0.142]
Val: 100%|██████████| 43/43 [00:01<00:00, 33.09batch/s, average_batch_loss=0.138]
Epoch 5
-------------------------------
Train: 100%|██████████| 340/340 [00:17<00:00, 19.37batch/s, average_batch_loss=0.137]
Val: 100%|██████████| 43

## Dropout

There is a standard way of reducing overfitting in neural networks called dropout.
The idea of dropout is that you randomly kill some inputs to a layer during training.
So the model cannot focus on using specific weights during training and therefore is forced to train all weights.

![](../images/dropout.png){fig-align="center"}

## Dropout

`nn.TransformerEncoderLayer` already adds dropout by default with probabily to kill an input with probability $0.1$. You can control this using the `dropout` parameter.

You can add dropout to your `Pytorch` model using [nn.Dropout](https://docs.pytorch.org/docs/stable/generated/torch.nn.Dropout.html#dropout).

## Data augmentation

The more data you have the better.
However, getting more data to train your model might be expensive.
Instead you can try generating synthetic data out of the data that you already have.
This is called data augmentation.

Of course getting completely novel data is better, but adding synthetic data can also boost the performance of your model with almost no overhead cost.

## Data augmentation

If you are dealing with images you can try:

1. Cropping
2. Flipping
3. Zooming
4. Rotation
5. Hue adjustment

to get images that are slightly different then the original.

## Data augmentation

For example, if you are doing image classification then performing the above operations will not change the class the image is in (unless you go very wild), so you will have a new sample of the class.

## Data augmentation

In image classification there is also an interesting data augmentation technique called mixup.
Suppose matrices $A$ and $B$ represent your images.
Then take some $t \in (0, 1)$ and create a new image $C = tA+(1-t)B$.
The class of $C$ will also be a linear combination of the classes of $A$ and $B$, that is if the one hot encoded labels of $A$ and $B$ are $y_A$ and $y_B$, then $y_C = ty_A+(1-t)y_B$.

You can also mixup three or more images.

## Data augmentation

You can also augment natural text, for example you can try:

1. Replacing words with synonyms
2. Removing random words from sentences
3. Adding random words to sentences
4. Machine translating to a different language and then translating back

## Data augmentation

There are two ways of implementing data augmentation in your training pipeline:

1. Offline - you augment the data before training
2. Online - you perform data augmentation during training on a batch before feeding it to the model

Be aware that in NLP if your data augmentation technique has the possibility of adding new tokens to the dataset then you can't do it in an online way.

## Data augmentation

This is not strictly data augmentation, but in NLP nowadays you can generate new labels using LLMs.

For example if you are doing text classification and have a bunch of data that is not labelled then you can get a LLM to label some of it.

If you write a good prompt then the accuracy of those labels will probably be around the same as you would get if you paid some company to do mass labelling for you.
At least this is the case from my own personal experience.

## Extra tools

Here are some extra tools that are worth looking at but we will not cover in this course:

1. [LangChain](https://github.com/langchain-ai/langchain) - a framework for working with LLMs programatically.
2. [Apache Beam](https://beam.apache.org/) - a thing that helps you write parallel data processing pipelines.
3. Also getting used to Linux might be useful at some point.

## Practice task

1. Start working on your homework project if you have not already!