# Tiny Transformer Classifier from Scratch
The objective of this notebook is to train a small transformer model from scratch on a small dataset. There is no pre-training step: the model is trained directly on the *text classification* task. The main topics we will be working on are:
1. Building the model architecture
2. Data preparation
3. Training and evaluating the model

We start with the usual library imports.

In [None]:
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

## Encoder-only architecture
Since we are training a *transformer model* for the task of text classification, an *encoder-only* transformer will suffice. This will allow us to focus on the content seen in today's lecture without having to worry about the decoder part.

Our encoder-only model consists of the following modules and sub-modules:
1. An *embedding module*, formed by:
  - An *input embedding* sub-module.
  - A *positional encoding* sub-module.
2. A *transformer encoder block*, formed by:
  - A *multi-head (self) attention* sub-module.
  - Two *Layer Norm* layers.
  - A *feed forward* sub-module.
  - *Residual connections*.
3. A *classification head*, formed by a linear layer.

Next we will implement the main building blocks of the *Transformer encoder block*.

### Multi-Head Attention
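The *multi-head* attention layer projects the input into queries, keys and values, splits each of them along the hidden dimension into `num_heads` chunks of equal size, and lets each head apply the attention mechanism to its own chunk: it computes attention scores from the queries and keys, turns them into attention weights with a softmax, and uses those weights to average the values. The outputs of all heads are then concatenated and passed through a final linear layer.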
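
To make the attention mechanism concrete, here is a minimal sketch of scaled dot-product attention for a single head. The tensor sizes and the random queries, keys and values are illustrative assumptions; in the real layer they come from the learned linear projections.

```python
import torch

torch.manual_seed(0)
batch_size, num_tokens, head_dim = 2, 5, 4  # illustrative sizes (assumed)

# Stand-ins for the projected queries, keys and values of one head
queries = torch.randn(batch_size, num_tokens, head_dim)
keys = torch.randn(batch_size, num_tokens, head_dim)
values = torch.randn(batch_size, num_tokens, head_dim)

# Attention scores: one score per (query token, key token) pair
attn_scores = queries @ keys.transpose(-2, -1)  # (batch_size, num_tokens, num_tokens)

# Attention weights: scale by sqrt(head_dim), then softmax over the key dimension
attn_weights = torch.softmax(attn_scores / head_dim**0.5, dim=-1)

# Attention output: weighted average of the values
attn_output = attn_weights @ values  # (batch_size, num_tokens, head_dim)

print(attn_scores.shape, attn_output.shape)
print(attn_weights.sum(dim=-1))  # every row of weights sums to 1
```

Each token ends up with a new representation that mixes the values of all tokens in the sequence, weighted by how well its query matches their keys.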

**Exercise.** Implement the `MultiHeadAttention` by filling in the `TODO` flags below.

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, hidden_dim, num_heads):
        super().__init__()
        # TODO: use the 'assert' statement to check that hidden_dim is divisible by num_heads

        self.hidden_dim = hidden_dim
        self.num_heads = num_heads
        self.head_dim = ...  # TODO: Compute the per-head hidden dimension

        self.W_query = ... # TODO: initialize the linear layer for the queries with the appropriate dimensions
        self.W_key = ... # TODO: initialize the linear layer for the keys with the appropriate dimensions
        self.W_value = ... # TODO: initialize the linear layer for the values with the appropriate dimensions

        self.W_out = ... # TODO: initialize the linear layer for the output with the appropriate dimensions

    def forward(self, x):
        batch_size, num_tokens, hidden_dim = x.shape
        assert hidden_dim == self.hidden_dim, f"hidden_dim must be {self.hidden_dim}"

        keys = ... # TODO: apply the keys layer to the input # Shape: (batch_size, num_tokens, hidden_dim)
        queries = ... # TODO: apply the queries layer to the input # Shape: (batch_size, num_tokens, hidden_dim)
        values = ... # TODO: apply the values layer to the input # Shape: (batch_size, num_tokens, hidden_dim)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (batch_size, num_tokens, hidden_dim) -> (batch_size, num_tokens, num_heads, head_dim)
        keys = keys.view(batch_size, num_tokens, self.num_heads, self.head_dim)
        values = values.view(batch_size, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(batch_size, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (batch_size, num_tokens, num_heads, head_dim) -> (batch_size, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention)
        attn_scores = torch.einsum("bijk, bikl -> bijl", queries, keys.transpose(2, 3))
        attn_weights = torch.softmax(attn_scores / (self.head_dim**0.5), dim=-1)

        # Attention output
        attn_output = torch.einsum("bijk, bikl -> bijl", attn_weights, values)

        # Transpose back: (batch_size, num_heads, num_tokens, head_dim) -> (batch_size, num_tokens, num_heads, head_dim)
        attn_output = attn_output.transpose(1, 2)

        # Concatenate heads: (batch_size, num_tokens, num_heads, head_dim) -> (batch_size, num_tokens, hidden_dim)
        attn_output = attn_output.reshape(batch_size, num_tokens, self.hidden_dim)

        # Compute output
        output = self.W_out(attn_output)

        return output

**Solution.** Click below to check the solution.

In [None]:
# @title
class MultiHeadAttention(nn.Module):
    def __init__(self, hidden_dim, num_heads):
        super().__init__()
        assert hidden_dim % num_heads == 0, "hidden_dim must be divisible by num_heads"

        self.hidden_dim = hidden_dim
        self.num_heads = num_heads
        self.head_dim = hidden_dim // num_heads  # Compute the per-head hidden dimension

        self.W_query = nn.Linear(hidden_dim, hidden_dim) # queries weight matrix
        self.W_key = nn.Linear(hidden_dim, hidden_dim) # keys weight matrix
        self.W_value = nn.Linear(hidden_dim, hidden_dim) # values weight matrix

        self.W_out = nn.Linear(hidden_dim, hidden_dim)  # output weight matrix

    def forward(self, x):
        batch_size, num_tokens, hidden_dim = x.shape
        assert hidden_dim == self.hidden_dim, f"hidden_dim must be {self.hidden_dim}"

        keys = self.W_key(x)  # Shape: (batch_size, num_tokens, hidden_dim)
        queries = self.W_query(x) # Shape: (batch_size, num_tokens, hidden_dim)
        values = self.W_value(x) # Shape: (batch_size, num_tokens, hidden_dim)

        # We implicitly split the matrix by adding a `num_heads` dimension
        # Unroll last dim: (batch_size, num_tokens, hidden_dim) -> (batch_size, num_tokens, num_heads, head_dim)
        keys = keys.view(batch_size, num_tokens, self.num_heads, self.head_dim)
        values = values.view(batch_size, num_tokens, self.num_heads, self.head_dim)
        queries = queries.view(batch_size, num_tokens, self.num_heads, self.head_dim)

        # Transpose: (batch_size, num_tokens, num_heads, head_dim) -> (batch_size, num_heads, num_tokens, head_dim)
        keys = keys.transpose(1, 2)
        queries = queries.transpose(1, 2)
        values = values.transpose(1, 2)

        # Compute scaled dot-product attention (aka self-attention)
        attn_scores = torch.einsum("bijk, bikl -> bijl", queries, keys.transpose(2, 3))
        attn_weights = torch.softmax(attn_scores / (self.head_dim**0.5), dim=-1)

        # Attention output
        attn_output = torch.einsum("bijk, bikl -> bijl", attn_weights, values)

        # Transpose back: (batch_size, num_heads, num_tokens, head_dim) -> (batch_size, num_tokens, num_heads, head_dim)
        attn_output = attn_output.transpose(1, 2)

        # Concatenate heads: (batch_size, num_tokens, num_heads, head_dim) -> (batch_size, num_tokens, hidden_dim)
        attn_output = attn_output.reshape(batch_size, num_tokens, self.hidden_dim)

        # Compute output
        output = self.W_out(attn_output)

        return output
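
The split into heads and the final concatenation in the code above rely only on `view`, `transpose` and `reshape`. The following sketch, using made-up sizes, checks that each head receives a contiguous slice of the hidden dimension and that merging the heads recovers the original layout.

```python
import torch

batch_size, num_tokens, num_heads, head_dim = 2, 3, 4, 5  # illustrative sizes (assumed)
hidden_dim = num_heads * head_dim

x = torch.randn(batch_size, num_tokens, hidden_dim)

# Split: (b, s, hidden_dim) -> (b, s, heads, head_dim) -> (b, heads, s, head_dim)
heads = x.view(batch_size, num_tokens, num_heads, head_dim).transpose(1, 2)

# Head h sees the feature slice [h * head_dim, (h + 1) * head_dim)
head_1_matches_slice = torch.equal(heads[:, 1], x[:, :, head_dim:2 * head_dim])

# A freshly computed per-head tensor is contiguous in the (b, heads, s, head_dim)
# layout, so transposing it back gives a non-contiguous tensor
per_head_output = heads.contiguous()
transposed = per_head_output.transpose(1, 2)
needs_reshape = not transposed.is_contiguous()

# Merge: .reshape copies the data when needed, whereas .view would raise an error here
merged = transposed.reshape(batch_size, num_tokens, hidden_dim)
round_trip_ok = torch.equal(merged, x)

print(head_1_matches_slice, needs_reshape, round_trip_ok)
```

This is why the heads are concatenated with `reshape` rather than `view` after the final transpose.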

### Layer Norm
You might have heard of BatchNorm, where the inputs of a layer are normalized across the batch: the mean and standard deviation of the inputs are computed across the batch dimension, then the inputs are normalized by subtracting the mean and dividing by the standard deviation.

The *LayerNorm* normalization technique is similar, except that the normalization happens across the feature dimension rather than the batch dimension.

Once the feature mean $\mu$ and standard deviation $\sigma$ are computed, the inputs are normalized as follows:
$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}},$$
where $\epsilon$ is a small constant typically taken to be equal to $10^{-5}$ to avoid division by numbers close to zero.

The output of the `LayerNorm` is not $\hat{x}_i$ though, but
$$y_i = \gamma \hat{x}_i + \beta$$
where $\gamma$ (scale parameter) and $\beta$ (shift parameter) are learnable parameters.

**Exercise.** Complete the `LayerNorm` class below by computing the feature mean and variance and performing the appropriate normalization.

In [None]:
class LayerNorm(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.eps = 1e-5 # small value to avoid division by zero
        self.scale = nn.Parameter(torch.ones(hidden_dim)) # scale parameter (learnable)
        self.shift = nn.Parameter(torch.zeros(hidden_dim)) # shift parameter (learnable)

    def forward(self, x):
        mean = ... # TODO: compute the mean of the input tensor over the last dimension
        var = ... # TODO: compute the variance of the input tensor over the last dimension
        norm_x = ... # TODO: normalize the input tensor
        y = ... # TODO: apply the learned scale and shift parameters
        return y

**Solution.** Click below to check the solution.

In [None]:
# @title
class LayerNorm(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.eps = 1e-5 # small value to avoid division by zero
        self.scale = nn.Parameter(torch.ones(hidden_dim)) # scale parameter (learnable)
        self.shift = nn.Parameter(torch.zeros(hidden_dim)) # shift parameter (learnable)

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        y = self.scale * norm_x + self.shift
        return y
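
As a sanity check of the formulas above, the sketch below normalizes a small hand-written tensor step by step and compares the result with PyTorch's built-in `torch.nn.functional.layer_norm`. The input values are made up, and the scale and shift are left at their initial values (1 and 0), so the output is just $\hat{x}$.

```python
import torch
import torch.nn.functional as F

# Two made-up feature vectors of size 4
x = torch.tensor([[1.0, 2.0, 3.0, 4.0],
                  [10.0, 0.0, -10.0, 0.0]])
eps = 1e-5

# Statistics are computed over the feature (last) dimension, one pair per row
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)  # biased variance, as in LayerNorm
x_hat = (x - mean) / torch.sqrt(var + eps)

# Reference implementation from PyTorch, without scale and shift
reference = F.layer_norm(x, normalized_shape=(4,), eps=eps)

print(x_hat)
print(torch.allclose(x_hat, reference, atol=1e-6))
```

After normalization every row has mean close to 0 and variance close to 1, regardless of the scale of the original values.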

### Feed-Forward Network
The *feed forward* sub-modules in our *transformer blocks* will consist of:
1. A linear layer, the output dimension being 4 times `hidden_dim`.
2. A ReLU activation
3. A linear layer, the output dimension being `hidden_dim`.

**Exercise.** Work out the input dimension of the two linear layers, and complete the `FeedForward` class by filling in the `TODO` flags.

In [None]:
class FeedForward(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        # TODO: add the two linear layers with the intermediate ReLU activation function to the following sequential NN
        self.layers = nn.Sequential(
            ... # TODO: add the layers
        )

    def forward(self, x):
        return self.layers(x)

**Solution.** Click below to check the solution.

In [None]:
# @title
class FeedForward(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim * 4),
            nn.ReLU(),
            nn.Linear(hidden_dim * 4, hidden_dim)
        )

    def forward(self, x):
        return self.layers(x)
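
To get a feeling for the size of this sub-module, the sketch below counts its parameters by hand and compares the result with what PyTorch reports. It uses `hidden_dim = 256`, the value set later in this notebook.

```python
import torch.nn as nn

hidden_dim = 256  # value used later in this notebook

feed_forward = nn.Sequential(
    nn.Linear(hidden_dim, 4 * hidden_dim),
    nn.ReLU(),
    nn.Linear(4 * hidden_dim, hidden_dim),
)

# Each linear layer has (in_features * out_features) weights plus out_features biases
first_layer = hidden_dim * 4 * hidden_dim + 4 * hidden_dim
second_layer = 4 * hidden_dim * hidden_dim + hidden_dim
expected = first_layer + second_layer

actual = sum(p.numel() for p in feed_forward.parameters())
print(expected, actual)  # 525568 525568
```

The count grows with the square of `hidden_dim` because of the two weight matrices.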

We are done with the three main building blocks that form the *transformer encoder block*. The next steps are to implement the *transformer encoder block* itself as well as the *embedding sub-module*. Once those two sub-modules are implemented, we will be ready to implement the final *end-to-end* transformer classifier model.


### Transformer Encoder Block
As we have mentioned earlier, the *Transformer encoder block* is composed of:
1. A *Multi-head attention* sub-module
2. A *LayerNorm* normalization
3. A *Feed Forward* sub-module
4. A *LayerNorm* normalization

Moreover, two skip-connections are present in the *Transformer Block*:
- *(A)* A skip connection that adds the original input to the output of the operation *1* above.
- *(B)* A skip connection that adds the output of the operation *2* to the output of the operation *3* above.

**Exercise.** Implement the `TransformerEncoderBlock` module below by completing the `TODO` tags.

In [None]:
class TransformerEncoderBlock(nn.Module):
  def __init__(self, hidden_dim, num_heads):
    super(TransformerEncoderBlock, self).__init__()

    self.attention = ... # TODO: initialize the multi-head attention module with the appropriate dimensions
    self.norm1 = ... # TODO: initialize the first Layer Norm with the appropriate dimensions
    self.norm2 = ... # TODO: initialize the second Layer Norm with the appropriate dimensions
    self.feed_forward = ... # TODO: initialize the Feed Forward module with the appropriate dimensions

  def forward(self, x):
      # TODO: implement the forward pass of the TransformerEncoderBlock
      return x

**Solution.** Click below to check the solution.

In [None]:
# @title
class TransformerEncoderBlock(nn.Module):
  def __init__(self, hidden_dim, num_heads):
    super(TransformerEncoderBlock, self).__init__()

    self.attention = MultiHeadAttention(hidden_dim, num_heads)
    # Use the LayerNorm implemented above (nn.LayerNorm(hidden_dim) is a drop-in alternative)
    self.norm1 = LayerNorm(hidden_dim)
    self.norm2 = LayerNorm(hidden_dim)
    self.feed_forward = FeedForward(hidden_dim)

  def forward(self, x):
      attn_output = self.attention(x)
      x = x + attn_output
      x = self.norm1(x)
      ff_output = self.feed_forward(x)
      x = x + ff_output
      x = self.norm2(x)
      return x
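
The ordering used above (add the skip connection first, then normalize) is often called a *post-norm* arrangement. As a cross-check of the data flow, here is a sketch of the same block written with PyTorch built-ins as stand-ins for our own sub-modules. The class name `PostNormBlockSketch` and the sizes are assumptions made for illustration; this is not the implementation used in the rest of the notebook.

```python
import torch
import torch.nn as nn


class PostNormBlockSketch(nn.Module):
    """Same data flow as the encoder block, using built-in PyTorch layers (illustrative)."""

    def __init__(self, hidden_dim, num_heads):
        super().__init__()
        # batch_first=True so that inputs are (batch_size, num_tokens, hidden_dim)
        self.attention = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden_dim)
        self.feed_forward = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim),
            nn.ReLU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )
        self.norm2 = nn.LayerNorm(hidden_dim)

    def forward(self, x):
        # Self-attention: queries, keys and values all come from x
        attn_output, _ = self.attention(x, x, x, need_weights=False)
        x = self.norm1(x + attn_output)           # skip connection (A), then LayerNorm
        x = self.norm2(x + self.feed_forward(x))  # skip connection (B), then LayerNorm
        return x


torch.manual_seed(0)
block = PostNormBlockSketch(hidden_dim=16, num_heads=4)
x = torch.randn(2, 5, 16)
y = block(x)
print(y.shape)  # the block preserves the input shape
```

Because the block maps a tensor of shape `(batch_size, num_tokens, hidden_dim)` to a tensor of the same shape, several such blocks could be stacked one after another.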

### Embedding
As we have mentioned above, the *embedding sub-module* consists of both:
- An *input embedding*: taking as input a token ID given by the tokenizer
- A *positional encoding*: taking as input the position of a token in the sentence.

**Question.** What is the input dimension of the *input embedding*? What about the *positional encoding*?

**Exercise.** Implement the `Embedding` sub-module by completing the `TODO` tags below.

In [None]:
class Embedding(nn.Module):
  def __init__(self, vocab_size, max_length, hidden_dim):
    super().__init__()
    # TODO: use the nn.Embedding layer to initialize the embedding and positional encoding below:
    self.embedding = ... # TODO: initialize the input embedding with the appropriate dimensions
    self.position_encoding = ... # TODO: initialize the positional encoding with the appropriate dimensions

  def forward(self, x):
    # TODO: implement the forward pass of the Embedding
    return x  # placeholder so that the cell runs; replace it with your implementation

**Solution.** Click below to check the solution.

In [None]:
# @title
class Embedding(nn.Module):
  def __init__(self, vocab_size, max_length, hidden_dim):
    super().__init__()
    # Both the input embedding and the positional encoding are nn.Embedding lookup tables
    self.embedding = nn.Embedding(vocab_size, hidden_dim)
    self.position_encoding = nn.Embedding(max_length, hidden_dim)

  def forward(self, x):
    _, seq_length = x.shape
    token_embeddings = self.embedding(x)
    pos_encodings = self.position_encoding(torch.arange(seq_length, device=x.device))
    return token_embeddings + pos_encodings
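
The sum in the forward pass combines tensors of different shapes: the token embeddings have shape `(batch_size, seq_length, hidden_dim)` while the positional encodings have shape `(seq_length, hidden_dim)`. The sketch below, with toy sizes chosen for illustration, shows that broadcasting adds the same positional vectors to every sequence in the batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, max_length, hidden_dim = 100, 16, 8  # toy sizes (assumed)

token_embedding = nn.Embedding(vocab_size, hidden_dim)     # one row per token ID
position_embedding = nn.Embedding(max_length, hidden_dim)  # one row per position

input_ids = torch.randint(0, vocab_size, (2, 5))  # (batch_size, seq_length)

token_vectors = token_embedding(input_ids)                               # (2, 5, 8)
position_vectors = position_embedding(torch.arange(input_ids.shape[1]))  # (5, 8)

# Broadcasting: the (5, 8) positional encodings are added to each sequence in the batch
embedded = token_vectors + position_vectors

print(token_vectors.shape, position_vectors.shape, embedded.shape)
```

Note that the positional table has only `max_length` rows, so sequences longer than `max_length` cannot be embedded.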

### Transformer Encoder-Only Model for Classification
We have implemented all the necessary sub-modules and we are now ready to implement the *end-to-end* architecture of the *transformer encoder-only* model. To that end, we will make use of:
- The `Embedding` sub-module
- The `TransformerEncoderBlock` sub-module
- A `nn.Linear` layer to act as a classification head.

**Note.** The output of the `TransformerEncoderBlock` sub-module is a tensor of shape $(b, s, d_h)$ where $b$ represents the batch size, $s$ the sequence length, and $d_h$ the hidden dimension of the model. To perform the classification task, we need one vector of logits (one logit per class) for each sequence in the batch. To obtain it, we use a strategy common to the *BERT family* of models: we only use the representation of the special token `[CLS]` as input to the classification head. For this strategy to work, we will use a tokenizer that adds the token `[CLS]` at the beginning of each input token sequence.
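
The shapes involved in this strategy can be sketched as follows. The random tensor stands in for the encoder output, and all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_length, hidden_dim, num_classes = 4, 10, 16, 6  # illustrative sizes

# Stand-in for the encoder output: one vector per token
encoder_output = torch.randn(batch_size, seq_length, hidden_dim)
classifier_head = nn.Linear(hidden_dim, num_classes)

# Keep only the first token of each sequence (the [CLS] position)
cls_representation = encoder_output[:, 0, :]  # (batch_size, hidden_dim)

logits = classifier_head(cls_representation)  # (batch_size, num_classes)
predicted_classes = logits.argmax(dim=-1)     # (batch_size,)

print(cls_representation.shape, logits.shape, predicted_classes.shape)
```

The representation at position 0 can summarize the whole sentence because, through self-attention, it is computed from all the tokens in the sequence.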

**Exercise.** Implement the `TransformerClassifier` model by completing the `TODO` tags below.

In [None]:
class TransformerClassifier(nn.Module):
  def __init__(self,
               vocab_size,
               max_length,
               hidden_dim,
               num_heads,
               num_classes):
    super().__init__()

    self.embedding = ... # TODO: initialize the embedding with the appropriate parameters
    self.encoder = ... # TODO: initialize the encoder with the appropriate parameters
    self.classifier_head = ... # TODO: initialize the classifier head with the appropriate parameters

  def forward(self, x):
    x = ... # TODO: compute the embedding of x
    x = ... # TODO: compute the encoding of x
    x = x[:, 0, :] # We only use the encoding of the token [CLS] as input of the classification head.
    x = ... # TODO: compute the logits through the classification head
    return x

**Solution.** Click below to check the solution.

In [None]:
# @title
class TransformerClassifier(nn.Module):
  def __init__(self,
               vocab_size,
               max_length,
               hidden_dim,
               num_heads,
               num_classes):
    super().__init__()

    self.embedding = Embedding(vocab_size, max_length, hidden_dim)
    self.encoder = TransformerEncoderBlock(hidden_dim, num_heads)
    self.classifier_head = nn.Linear(hidden_dim, num_classes)

  def forward(self, x):
    x = self.embedding(x)
    x = self.encoder(x)
    x = x[:, 0, :]
    x = self.classifier_head(x)
    return x

## Data Preparation
We will train the `TransformerClassifier` model on a text classification task using the [dair-ai/emotion](https://huggingface.co/datasets/dair-ai/emotion/viewer) dataset. It consists of sentences grouped into 6 classes according to the main emotion they convey:

| Class ID | Emotion |
| -------- | ------- |
| 0 | sadness |
| 1 | joy |
| 2 | love |
| 3 | anger |
| 4 | fear |
| 5 | surprise |



In [None]:
!pip install datasets

**Questions.** After loading the dataset in the code cell below, answer the following questions:
1. What type of object is `raw_dataset`?
2. How many elements are there in `raw_dataset`?
3. What type of object is `raw_dataset["train"]`?
4. Describe the `raw_dataset["train"]` object.
5. Are the train, validation and test datasets balanced?

**Exercise.** Print one of the elements of `raw_dataset["train"]`.

In [None]:
from datasets import load_dataset

raw_dataset = load_dataset("dair-ai/emotion")
raw_dataset

In [None]:
# TODO: print one of the elements of raw_dataset["train"]
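
One way to approach the question about class balance is to compute the share of each label in a split. The helper `class_distribution` below is a hypothetical utility written for this purpose, and the labels it is run on are made up rather than taken from the dataset.

```python
from collections import Counter


def class_distribution(labels):
    """Return the fraction of examples per class ID, sorted by class ID."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}


# Made-up labels for illustration only
toy_labels = [0, 1, 1, 3, 1, 0, 4, 1]
print(class_distribution(toy_labels))
# {0: 0.25, 1: 0.5, 3: 0.125, 4: 0.125}
```

Assuming the labels of a split can be read as a sequence of integers, for instance with `raw_dataset["train"]["label"]`, the same helper can be applied to the train, validation and test splits in turn.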

### Pre-trained Tokenizer
In order to convert the textual data in the above datasets into a format that can be used as input for our models, we first *tokenize* the text using a pre-trained tokenizer.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)

**Exercise.** Let's find out more about the tokenizer we will be using. Write code in order to answer the following questions:
1. What is the name of the tokenizer being used?
2. What is the size of the vocabulary?
3. What is the maximum model input length?
4. What special tokens does the tokenizer use? What are their IDs?

**Remark.** Check that the special token `[CLS]` is indeed one of the special tokens in the pre-trained tokenizer. What is its token ID?

In [None]:
# TODO: print the necessary information about the automatically loaded tokenizer

### Tokenizing the raw input
We next use the pre-trained tokenizer to tokenize the whole raw dataset:

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True)


tokenized_dataset = raw_dataset.map(tokenize_function, batched=True)
tokenized_dataset.set_format("torch", columns=["input_ids", "label"])
tokenized_dataset

### Data Loaders
Finally, we create three separate PyTorch data loaders (train, validation and test) so that we can easily iterate over each split during the training and evaluation phases.

In [None]:
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

train_loader = torch.utils.data.DataLoader(
    tokenized_dataset["train"],
    shuffle=True,
    batch_size=8,
    collate_fn=data_collator
)

val_loader = torch.utils.data.DataLoader(
    tokenized_dataset["validation"],
    shuffle=False,
    batch_size=8,
    collate_fn=data_collator
)

test_loader = torch.utils.data.DataLoader(
    tokenized_dataset["test"],
    shuffle=False,
    batch_size=8,
    collate_fn=data_collator
)
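
The sentences in a batch have different lengths, so they must be padded before they can be stacked into a single tensor. `DataCollatorWithPadding` pads every batch to the length of its longest sequence and renames the `label` field to `labels`. The sketch below imitates that behaviour with a hypothetical `pad_collate` function; the token IDs are made up and the padding ID of 0 is an assumption.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def pad_collate(examples, pad_token_id=0):
    """Hypothetical stand-in for the collator: pad to the longest sequence in the batch."""
    input_ids = pad_sequence(
        [example["input_ids"] for example in examples],
        batch_first=True,
        padding_value=pad_token_id,
    )
    labels = torch.stack([example["label"] for example in examples])
    return {"input_ids": input_ids, "labels": labels}


# Made-up token IDs of different lengths (not produced by the real tokenizer)
examples = [
    {"input_ids": torch.tensor([5, 8, 13, 6]), "label": torch.tensor(1)},
    {"input_ids": torch.tensor([5, 21, 6]), "label": torch.tensor(0)},
    {"input_ids": torch.tensor([5, 9, 9, 34, 2, 6]), "label": torch.tensor(4)},
]

batch = pad_collate(examples)
print(batch["input_ids"])  # shape (3, 6): shorter sequences are right-padded with 0
print(batch["labels"])
```

Padding per batch rather than to a fixed global length keeps the tensors as small as possible, which matters because the cost of attention grows with the square of the sequence length.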

## Training and Evaluation
Now that we have both our model architecture defined and our data prepared, we can proceed to the last phase of the lab project: training and evaluating the model.

### Initialize model
We start by setting all the necessary arguments in order to instantiate the concrete model that we will be using for the classification task.

**Exercise.** Set all the necessary arguments:
- `VOCAB_SIZE`: the size of the vocabulary of the pre-trained tokenizer.
- `MAX_LENGTH`: the maximum token sequence length supported by the pre-trained tokenizer.
- `HIDDEN_DIM`: 256
- `NUM_HEADS`: 8
- `NUM_CLASSES`: The number of classes of the multi-class classification task.

**Remark.** Make sure that the hidden dimension is divisible by the number of heads.

In [None]:
VOCAB_SIZE = ... # TODO: extract the vocabulary size from the tokenizer
MAX_LENGTH = ... # TODO: extract the maximum sequence length from the tokenizer
HIDDEN_DIM = ... # TODO: set the hidden dimension
NUM_HEADS = ... # TODO: set the number of heads
NUM_CLASSES = ... # TODO: extract the number of classes from the dataset

print(f"VOCAB_SIZE: {VOCAB_SIZE}")
print(f"MAX_LENGTH: {MAX_LENGTH}")
print(f"HIDDEN_DIM: {HIDDEN_DIM}")
print(f"NUM_HEADS: {NUM_HEADS}")
print(f"NUM_CLASSES: {NUM_CLASSES}")

**Solution.** Click below to check the solution.

In [None]:
# @title
VOCAB_SIZE = tokenizer.vocab_size
MAX_LENGTH = tokenizer.model_max_length
HIDDEN_DIM = 256
NUM_HEADS = 8
NUM_CLASSES = tokenized_dataset["train"].features["label"].num_classes

print(f"VOCAB_SIZE: {VOCAB_SIZE}")
print(f"MAX_LENGTH: {MAX_LENGTH}")
print(f"HIDDEN_DIM: {HIDDEN_DIM}")
print(f"NUM_HEADS: {NUM_HEADS}")
print(f"NUM_CLASSES: {NUM_CLASSES}")

**Exercise.** Instantiate the `classifier` model using the above arguments.

In [None]:
classifier = ... # TODO: instantiate the model
print(classifier)

**Solution.** Click below to check the solution.

In [None]:
# @title
classifier = TransformerClassifier(VOCAB_SIZE, MAX_LENGTH, HIDDEN_DIM, NUM_HEADS, NUM_CLASSES)
print(classifier)

### Train Model
We can now proceed to the model training phase. In order to do so, we will define two functions:
- A `train_epoch` function that will train the model by iterating through the given data loader once.
- An `evaluate` function that will evaluate the model on the given data loader.

In [None]:
import time
from tqdm import tqdm

In [None]:
def train_epoch(model, dataloader, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0

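    model.train()
    # Run on whichever device the model's parameters live on
    model_device = next(model.parameters()).device

    for batch in tqdm(dataloader, desc="Processing Batches"):
        optimizer.zero_grad()

        input_ids = batch["input_ids"].to(model_device)
        labels = batch["labels"].to(model_device)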

        outputs = model(input_ids)
        loss = criterion(outputs, labels)
        acc = (outputs.argmax(dim=1) == labels).float().mean()

        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc.item()

    return epoch_loss / len(dataloader), epoch_acc / len(dataloader)

In [None]:
def evaluate(model, dataloader, criterion):
    epoch_loss = 0
    epoch_acc = 0
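    model.eval()
    # Run on whichever device the model's parameters live on
    model_device = next(model.parameters()).device

    with torch.no_grad():
        for batch in tqdm(dataloader, desc="Processing Batches"):

            input_ids = batch["input_ids"].to(model_device)
            labels = batch["labels"].to(model_device)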

            outputs = model(input_ids)
            loss = criterion(outputs, labels)
            acc = (outputs.argmax(dim=1) == labels).float().mean()

            epoch_loss += loss.item()
            epoch_acc += acc.item()

    return epoch_loss / len(dataloader), epoch_acc / len(dataloader)
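
Both functions compute the accuracy of a batch by comparing the class with the highest logit to the true label. Here is a minimal sketch of that computation on made-up logits for 4 examples and 3 classes.

```python
import torch

# Made-up logits: 4 examples, 3 classes
logits = torch.tensor([[2.0, 0.1, -1.0],
                       [0.3, 0.2, 1.5],
                       [0.0, 3.0, 0.5],
                       [1.0, 0.9, 0.8]])
labels = torch.tensor([0, 2, 1, 1])

predictions = logits.argmax(dim=1)                 # tensor([0, 2, 1, 0])
accuracy = (predictions == labels).float().mean()  # 3 correct out of 4

print(predictions, accuracy.item())  # accuracy is 0.75
```

Since the functions average the per-batch values, a smaller final batch carries the same weight as a full one. With a batch size of 8 the effect on the reported metrics is small.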

Let us proceed to the actual training of the model.

**Question.** What loss function should we use to train the model?

**Exercise.** Train the model by:
- Setting the number of epochs to 2 and the learning rate to 0.001.
- Completing the `TODO` tags below.

In [None]:
EPOCHS = ... # TODO: set the number of epochs
LEARNING_RATE = ... # TODO: set the learning rate

optimizer = torch.optim.Adam(classifier.parameters(), lr=LEARNING_RATE)
criterion = ... # TODO: set the loss function

for epoch in range(EPOCHS):
    start_time = time.time()

    train_loss, train_acc = ... # TODO: train the model for one epoch
    valid_loss, valid_acc = ... # TODO: evaluate the model on the validation set

    epoch_time = time.time() - start_time  # elapsed time in seconds

    print("")
    print(f'Epoch: {epoch+1:02} | Time: {epoch_time:.1f}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

**Solution.** Click below to check the solution.

In [None]:
# @title
EPOCHS = 2
LEARNING_RATE = 1e-3

optimizer = torch.optim.Adam(classifier.parameters(), lr=LEARNING_RATE)
criterion = nn.CrossEntropyLoss().to(device)

for epoch in range(EPOCHS):
    start_time = time.time()

    train_loss, train_acc = train_epoch(classifier, train_loader, optimizer, criterion)
    valid_loss, valid_acc = evaluate(classifier, val_loader, criterion)

    epoch_time = time.time() - start_time  # elapsed time in seconds

    print("")
    print(f'Epoch: {epoch+1:02} | Time: {epoch_time:.1f}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')
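
The solution above uses `nn.CrossEntropyLoss`, which expects raw logits (no softmax applied) and integer class IDs. The sketch below, on random logits for 4 examples and 6 classes, checks that it matches a manual computation: apply a log-softmax, pick the log-probability of the true class, and average the negated values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 6)           # raw model outputs: 4 examples, 6 classes
labels = torch.tensor([0, 5, 2, 2])  # made-up class IDs

# Built-in loss: takes logits and integer labels directly
loss = nn.CrossEntropyLoss()(logits, labels)

# Manual computation: negative log-probability of the true class, averaged over the batch
log_probs = F.log_softmax(logits, dim=-1)
manual_loss = -log_probs[torch.arange(len(labels)), labels].mean()

print(loss.item(), manual_loss.item())
```

This is also why the classification head outputs logits rather than probabilities: the softmax is already part of the loss.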

**Question.** Is the model overfitting? Should we keep training for more epochs?

### Evaluation
Finally, we evaluate the model on the test set.

In [None]:
test_loss, test_acc = evaluate(classifier, test_loader, criterion)
print("")
print(f'Test Loss: {test_loss:.3f} |  Test Acc: {test_acc*100:.2f}%')

**Exercise.** Try out the model on a few sentences of your own.
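
One possible starting point for this exercise is a small helper that tokenizes a list of sentences, runs the model and maps the predicted class IDs to emotion names. The function `predict_emotions` is a hypothetical helper, not part of the notebook so far; the label order is taken from the table in the Data Preparation section.

```python
import torch

# Class names in the order of the class IDs listed in the Data Preparation section
EMOTIONS = ["sadness", "joy", "love", "anger", "fear", "surprise"]


def predict_emotions(texts, model, tokenizer):
    """Return the predicted emotion name for each sentence in `texts` (illustrative helper)."""
    model.eval()
    # Pad to the longest sentence so that the batch forms a single tensor
    encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    model_device = next(model.parameters()).device
    with torch.no_grad():
        logits = model(encoded["input_ids"].to(model_device))
    return [EMOTIONS[class_id] for class_id in logits.argmax(dim=-1).tolist()]
```

Assuming the trained `classifier` and the `tokenizer` from the cells above, a call would look like `predict_emotions(["i feel great today", "this makes me so angry"], classifier, tokenizer)`.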