# HOWTO: Write A Tiny Language Model (TLM)

> There's no comfort, you just choose your burden.
> -- Loki S2

## Write at the very begining:

The majority of the paragraph is composed by ChatGPT, which in turn provides me with an increased opportunity to focus on the coding aspect.

## Package Import

Let's proceed to import the necessary Python packages that will be required for our future tasks.

In [1]:
import vectorlab as vl

import torch
import torch.nn as nn
import torch.nn.functional as F

from typing import List
from collections import OrderedDict

from torch.utils.data.dataset import TensorDataset
from torch.utils.data.dataloader import DataLoader

[2023-11-16 13:49:09,678] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)


## Tokenizer

A tokenizer is a component in Natural Language Processing (NLP) that breaks up text into smaller pieces, called tokens. These tokens help machines understand and interpret human language by converting text into a format that's easier for algorithms to process.

Tokenization is a crucial step in text preprocessing for many NLP tasks, such as sentiment analysis, text classification, and machine translation.

The simplest form of tokenization is known as character-level tokenization. In this method, every character in the sentence is treated as a separate token.

For example, the sentence "I love AI!" would be tokenized into ["I", " ", "l", "o", "v", "e", " ", "A", "I", "!"] using character-level tokenization.

This approach can be useful in certain scenarios, such as language modeling or text generation tasks, or when dealing with languages where a 'token' is not easily defined as a space-separated word, like in Chinese or Japanese. However, it can also result in a larger and more complex model due to the increased number of tokens.

### Set up vocabulary

Setting up a vocabulary for tokenization involves creating a list or a dictionary of unique tokens (words, characters, subwords, etc.) from your text data. Here are the general steps:

1. Tokenization: First, you need to tokenize your text data. This could be done at the word level, character level, or using more complex methods like subword tokenization. The choice of method depends on your specific task.

2. Building the Vocabulary: After tokenization, you create your vocabulary by collecting all unique tokens. Each unique token will correspond to a unique integer ID. Often, special tokens are added to the vocabulary, such as <PAD> for padding, <UNK> for unknown words, <SOS> and <EOS> for indicating the start and end of sentences in some tasks.

3. Indexing: Assign each unique token in your vocabulary an index (integer). This is necessary because machine learning models don't understand text, but they do understand numbers.

4. Encoding and Decoding: With the vocabulary and the index, you can now convert (encode) your text data into sequences of integers for your model to process. After the model has made its predictions, you can convert (decode) its output back into text.

#### tokenization, building the vocabulary and indexing

We have opted for a compact dataset to establish our vocabulary and are employing the most straightforward character-based tokenizer.

For the sake of simplicity in our implementation, we have refrained from adding any special tokens. 

Here, we are utilizing a small dataset known as TinyStories. You could download this dataset [here](https://huggingface.co/datasets/roneneldan/TinyStories).

We use the _TinyStoriesV2-GPT4-train.txt_ file, and we extract first 10000 rows as our dataset.

In [2]:
with open("./train.txt", "r") as f:
    lines = f.read().strip()
    
stories = lines.split("<|endoftext|>\n")

print(f"number of stories: {len(stories)}")

vocab = sorted(set("".join(stories)))
vocab_size = len(vocab)

print(f"vocab size: {vocab_size}")

idx2chr = dict(zip(range(vocab_size), vocab))
chr2idx = dict(zip(idx2chr.values(), idx2chr.keys()))

number of stories: 1728
vocab size: 73


#### encoding and decoding

We have developed our encoding and decoding methods grounded on the vocabulary we previously constructed.

In [3]:
def encode(sentence: str) -> List[int]:
    tokens = [chr2idx[chr_] for chr_ in sentence]
    return tokens
    
def decode(tokens: List[int]) -> str:
    sentence = [idx2chr[idx] for idx in tokens]
    return "".join(sentence)

sentence = "hello world!"
tokens = encode(sentence)
decoded_sentence = decode(tokens)

print(
    f"raw sentence: {sentence}\n"
    f"encoded tokens: {tokens}\n"
    f"decoded sentence: {decoded_sentence}"
)

raw sentence: hello world!
encoded tokens: [49, 46, 53, 53, 56, 1, 64, 56, 59, 53, 45, 2]
decoded sentence: hello world!


### set up dataset

Let's proceed to examine our tokenized dataset and compute the total number of tokens contained within it.

In [4]:
dataset = torch.tensor(encode("".join(stories))).long()

print(f"Total tokens: {dataset.size()[0]}")

Total tokens: 1399926


We have approximately one million tokens! While this may seem substantial, it's important to remember that our tokenization is solely character-based, with a vocabulary size of only 73.

Modern Large Language Models (LLMs) typically have a significantly larger vocabulary size and have been trained on trillions of tokens. Therefore, while our TLM is compact, it's important to manage expectations regarding its performance.

## Training

Training a Language Model involves several steps. Here's a general outline:

1. Data Preparation: Collect and preprocess a large corpus of text data. This could involve cleaning the text (removing punctuation, converting to lowercase, etc.), tokenizing the text into words or subwords, and encoding the tokens into numerical values using a vocabulary.

2. Model Architecture: Decide on the architecture of your model. This could be a simple Recurrent Neural Network (RNN), or more complex architectures like Long Short-Term Memory (LSTM), Gated Recurrent Units (GRU), or Transformer models. The choice of architecture depends on your specific task and the amount of data you have.

3. Model Training: Train the model on your data. This usually involves feeding the model a sequence of tokens and having it predict the next token in the sequence. The model's predictions are compared to the actual next tokens to calculate a loss, which is then minimized using an optimization algorithm like stochastic gradient descent.

4. Evaluation: Evaluate the model on a separate validation set to check its performance. This could involve calculating the perplexity of the model, which measures how well it predicts the validation data.

5. Tuning: Based on the model's performance, you might need to tune its hyperparameters (like the learning rate, the batch size, or the architecture of the model itself) and repeat the training and evaluation steps.

6. Inference: Once the model is trained, it can be used to generate new text by feeding it a seed sequence and having it predict the next token, then feeding the predicted token back into the model to generate the next token, and so on.

### Training Hyperparamters

Before delving into the intricate steps of training a Tiny Language Model (TLM), let's first define some essential hyperparameters.

In [5]:
batch_size = 10240
context_length = 32

d_model = 128
n_layers = 4
n_heads = 8

num_epochs = 2

### Data Preparation

We will partition the dataset into three subsets: training, validation, and testing. Each subset of the dataset will be structured into an appropriate format.

In [6]:
def construct_dataset(dataset: torch.tensor) -> TensorDataset:
    xs, ys = [], []
    
    for i in range(0, len(dataset) - context_length):
        x = dataset[i:i + context_length]
        y = dataset[i + 1:i + context_length + 1]
        
        xs.append(x)
        ys.append(y)
        
    xs = torch.stack(xs).long()
    ys = torch.stack(ys).long()
    
    return TensorDataset(xs, ys)

In [7]:
train_dataset = construct_dataset(
    dataset[:int(.8 * len(dataset))]
)
valid_dataset = construct_dataset(
    dataset[int(.8 * len(dataset)):int(.9 * len(dataset))]
)
test_dataset = construct_dataset(
    dataset[int(.9 * len(dataset)):]
)

Let's examine the contents of the dataset.

In [8]:
xs, ys = next(DataLoader(train_dataset, batch_size=4, shuffle=True).__iter__())

[(decode(xs[i].tolist()), decode(ys[i].tolist())) for i in range(4)]

[('u have to be careful. You have t', ' have to be careful. You have to'),
 ('e could hop so fast on the carpe', ' could hop so fast on the carpet'),
 ('he fireplace. It is nice and war', 'e fireplace. It is nice and warm'),
 ('he could not find his favorite t', 'e could not find his favorite to')]

### Model Architecture

We'll begin with a very basic model and gradually incorporate the essential components found in modern Large Language Models (LLMs) one at a time.

#### Base Model

We'll start with a basic model that only includes a naive embedding and linear layers. This simple model will serve as our foundation, which we can progressively build upon by adding more complex layers and functionalities.

In [9]:
class BaseModel(nn.Module):
    
    def __init__(self, vocab_size, d_model):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, vocab_size)
        )
        
    def forward(self, x):
        
        x = self.embedding(x)
        x = self.ffn(x)
        
        return x
    
    def loss_fn(self, yhat, y):
        
        yhat = yhat.view(-1, yhat.size(-1))
        y = y.view(-1)
        
        loss = F.cross_entropy(yhat, y)
        
        return loss

In [10]:
net = BaseModel(vocab_size, d_model)

explorer = vl.nn.Explorer(
    net=net,
    loss_fn=net.loss_fn,
    batch_input="x, y",
    net_input="x",
    loss_input="y",
    batch_size=batch_size,
    num_epochs=num_epochs,
    learning_rate=1e-3,
    optimizer_fn="adamw",
    scheduler_fn=None,
    earlystopping_fn=None,
)

explorer.train(
    train_dataset, valid_dataset,
    save_best=False, save_last=False,
    verbose=1
)

Training with parameters: batch_size=10240, num_workers=0, num_epochs=2, lr=0.001, weight_decay=0 on devices cuda
Result: train loss 2.295810, valid loss 2.293752


batch_size=10240, num_workers=0, num_epochs=2, lr=0.001, weight_decay=0

Now we have a baseline loss with 2.294791 / 2.292472.

#### RMSNorm

RMSNorm is a type of normalization technique used in deep learning. It stands for Root Mean Square Normalization.

Normalization techniques are used to standardize the inputs to a layer in a neural network, which can help the network learn more effectively. Other popular normalization techniques include Batch Normalization, Layer Normalization, and Instance Normalization.

RMSNorm is designed to overcome some of the limitations of these techniques. Specifically, it normalizes the features by their root mean square (RMS) values, which makes it more effective for Recurrent Neural Networks (RNNs) and attention-based models.

The RMSNorm technique calculates the root mean square of the features and then divides each feature by this value. This ensures that the RMS value of the features is 1, which can help stabilize the learning process and improve the performance of the model.


The Root Mean Square (RMS) can be calculated using the following formula:

$$
RMS(x) = \sqrt{\frac{1}{n} \sum^{n}_{i=1} x_i^2}
$$

$$
\bar{x_i} = \frac{x_i}{RMS(x)} * \gamma + \beta
$$

In [11]:
class RMSNorm(nn.Module):
    
    def __init__(self, features, bias=True, eps=1e-8):
        
        super().__init__()
        
        self.weight = nn.Parameter(torch.ones(features))
        
        if bias:
            self.bias = nn.Parameter(torch.zeros(features))
        else:
            self.register_parameter("bias", None)
        
        self.eps = eps
        
    def forward(self, x):
        
        rms = torch.sqrt(
            (x).pow(2).mean(dim=-1, keepdim=True) + self.eps
        )
        x = x / rms
        
        x = self.weight * x
        if self.bias is not None:
            x = x + self.bias
        
        return x

RMSNorm is used as a pre-normalization step. It helps to stabilize the learning process and accelerates the training of deep neural networks.

Now, we will add this component to our base model to enhance its performance and efficiency.

In [12]:
class BaseModelRMS(nn.Module):
    
    def __init__(self, vocab_size, context_length, d_model):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.rms = RMSNorm((context_length, d_model), bias=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.ReLU(),
            nn.Linear(d_model, vocab_size)
        )
        
    def forward(self, x):
        
        x = self.embedding(x)
        x = self.rms(x)
        x = self.ffn(x)
        
        return x
    
    def loss_fn(self, yhat, y):
        
        yhat = yhat.view(-1, yhat.size(-1))
        y = y.view(-1)
        
        loss = F.cross_entropy(yhat, y)
        
        return loss

In [13]:
net = BaseModelRMS(vocab_size, context_length, d_model)

explorer = vl.nn.Explorer(
    net=net,
    loss_fn=net.loss_fn,
    batch_input="x, y",
    net_input="x",
    loss_input="y",
    batch_size=batch_size,
    num_epochs=num_epochs,
    learning_rate=1e-3,
    optimizer_fn="adamw",
    scheduler_fn=None,
    earlystopping_fn=None,
)

explorer.train(
    train_dataset, valid_dataset,
    save_best=False, save_last=False,
    verbose=1
)

Training with parameters: batch_size=10240, num_workers=0, num_epochs=2, lr=0.001, weight_decay=0 on devices cuda
Result: train loss 2.294807, valid loss 2.293107


batch_size=10240, num_workers=0, num_epochs=2, lr=0.001, weight_decay=0

The loss is now 2.293807 / 2.292060. it's essential not to concede defeat but to persist and progress.

#### SwiGLU

SwiGLU stands for Swish Gated Linear Unit. It is a type of activation function used in neural networks. The SwiGLU is a combination of the Swish activation function and the Gated Linear Unit (GLU).

The Swish activation function is a self-gated activation function introduced by researchers at Google. The formula for the Swish function is:

$$
swish_\beta(x) = x * sigmoid(\beta * x)
$$

where $x$ is the input to the function, $sigmoid$ is the sigmoid function, and $\beta$ is a learnable parameter.

The Gated Linear Unit (GLU) is a type of activation function that uses a gating mechanism to control the information flow in a neural network. The GLU takes as input a vector x and outputs a vector of the same size, where each element is the product of the corresponding element in x and a gating value (between 0 and 1) computed from x.

The SwiGLU combines these two ideas into a single activation function.

The formula of SwiGLU can be calculated as:

$$
SwiGLU(x) = swish_\beta(x * W + b) * (x * V + x)
$$

In [16]:
class SwiGLU(nn.Module):
    
    def __init__(self, dim):
        
        super().__init__()
        
        self.w = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        
        self.beta = nn.Parameter(torch.ones(1))
        
    def forward(self, x):
        
        swish = self.w(x) * torch.sigmoid(self.beta * self.w(x))
        act = swish * self.v(x)
        
        return act

We are now in a position to utilize SwiGLU as a substitute for the activation function within the foundational model.

In [17]:
class BaseModelRMSSwiGLU(nn.Module):
    
    def __init__(self, vocab_size, context_length, d_model):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.rms = RMSNorm((context_length, d_model), bias=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model),
            SwiGLU(d_model),
            nn.Linear(d_model, vocab_size)
        )
        
    def forward(self, x):
        
        x = self.embedding(x)
        x = self.rms(x)
        x = self.ffn(x)
        
        return x
    
    def loss_fn(self, yhat, y):
        
        yhat = yhat.view(-1, yhat.size(-1))
        y = y.view(-1)
        
        loss = F.cross_entropy(yhat, y)
        
        return loss

In [18]:
net = BaseModelRMSSwiGLU(vocab_size, context_length, d_model)

explorer = vl.nn.Explorer(
    net=net,
    loss_fn=net.loss_fn,
    batch_input="x, y",
    net_input="x",
    loss_input="y",
    batch_size=batch_size,
    num_epochs=num_epochs,
    learning_rate=1e-3,
    optimizer_fn="adamw",
    scheduler_fn=None,
    earlystopping_fn=None,
)

explorer.train(
    train_dataset, valid_dataset,
    save_best=False, save_last=False,
    verbose=1
)

Training with parameters: batch_size=10240, num_workers=0, num_epochs=2, lr=0.001, weight_decay=0 on devices cuda
Result: train loss 2.286692, valid loss 2.285747


batch_size=10240, num_workers=0, num_epochs=2, lr=0.001, weight_decay=0

The loss is even better 2.286692 / 2.285747. Just keep going!

#### Rotary Embedding

We now approach the most significant section.

Rotary Embedding is a technique used in Transformer models to encode the position of tokens in a sequence. It was introduced in the paper "RoFormer: Enhanced Transformer with Rotary Position Embedding" by Jiyuan Zhang et al.

Traditional Transformer models use absolute position embeddings, where each position in the sequence has a unique embedding. This approach has some limitations, such as a maximum sequence length and difficulty generalizing to longer sequences.

Rotary Embedding, on the other hand, uses relative position embeddings. It encodes the relative positions of tokens with respect to each other, rather than their absolute positions in the sequence. This allows the model to handle sequences of any length and to generalize better to longer sequences.

The key idea of Rotary Embedding is to apply a rotation operation to the token embeddings based on their positions. This rotation is done in the complex number space, which allows the model to capture the cyclic nature of the position information.

The Rotary Embedding is calculated as follows:

1. Convert the position indices into continuous values in the range [0, 1].

2. Map these values into the complex number space using the sine and cosine functions.

3. Apply a rotation operation to the token embeddings using these complex values.

This approach allows the model to capture both the order and the distance between tokens in a sequence.


**ME**, not ChatGPT: I believe that ChatGPT could provide you with a broader perspective. However, if you're particularly interested in the beautiful mathematical aspect of positional embedding, I would highly recommend reading this [blog](https://kazemnejad.com/blog/transformer_architecture_positional_encoding/), and simply taking pleasure in the reading process.

In the original RoFormer paper, we can find the general form of rotary matrix defined as follows:

$$
R^d_{\Theta, m} = 
\begin{pmatrix}
\cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \dots & 0 & 0 \\
\sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \dots & 0 & 0 \\
0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \dots & 0 & 0 \\
0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \dots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & 0 & \dots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\
0 & 0 & 0 & 0 & \dots & \sin m\theta_{d/2} & \cos m\theta_{d/2}
\end{pmatrix}
$$

where pre-defined parameter $\Theta = \{ \theta_i = 10000^{-2(i - 1) / d}, i \in [1, 2, \dots, d / 2] \}$

In [19]:
def get_rotary_matrix(context_length, d_model):
    
    R = torch.zeros((context_length, d_model, d_model), requires_grad=False)
    
    for m in range(context_length):
        for i in range(d_model // 2):
            
            theta = 10000.0 ** (-2.0 * (i - 1) / d_model)
            m_theta = torch.tensor(m * theta)
            
            R[m, 2 * i, 2 * i] = torch.cos(m_theta)
            R[m, 2 * i, 2 * i + 1] = -torch.sin(m_theta)
            R[m, 2 * i + 1, 2 * i] = torch.sin(m_theta)
            R[m, 2 * i + 1, 2 * i + 1] = torch.cos(m_theta)
            
    return R

We are now able to integrate this rotary matrix within the self-attention mechanism. Please note that we are only capable of computing the self-attention weight for preceding tokens.

In [20]:
class RoPECausalAttention(nn.Module):
    
    def __init__(self, context_length, d_model, n_heads):
        
        super().__init__()
        
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_head = self.d_model // self.n_heads
        
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model)
        
    def forward(self, x):
        
        # batch_size, context_length, d_model
        B, C, D = x.size()
        
        q, k, v = self.qkv(x).split(self.d_model, dim=-1)
        
        # make q, k, v into batch_size, n_heads, context_length, d_heads
        q = q.view(B, C, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, C, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, C, self.n_heads, self.d_head).transpose(1, 2)
        
        # get rotary matrix
        R = get_rotary_matrix(C, self.d_head).to(q.device)
        
        # rotate q, k
        q_rotated = torch.einsum("bhci,cij->bhcj", q, R)
        k_rotated = torch.einsum("bhci,cij->bhcj", k, R)
        
        # calculate attention
        attn_numerator = torch.exp(
            (q_rotated @ k_rotated.transpose(-2, -1)) / torch.sqrt(torch.tensor(self.d_head))
        )
        attn_denominator = torch.exp(
            (q @ k.transpose(-2, -1)) / torch.sqrt(torch.tensor(self.d_head))
        )
        attn_denominator = torch.sum(attn_denominator, dim=-1, keepdim=True)
        
        attn = attn_numerator / attn_denominator
        
        # mask attention to make it causal
        attn_mask = torch.tril(torch.ones(C, C)).view(1, 1, C, C).to(attn.device)
        attn = attn.masked_fill(attn_mask[:, :, :C, :C] == 0, .0)
        
        # batch_size, n_heads, context_length, d_heads
        y = attn @ v
        
        # re-assemble all heads
        y = y.transpose(1, 2).contiguous().view(B, C, D)
        
        # out projection
        y = self.out(y)
        
        return y

Let's incorporate this module into the model.

In [21]:
class BaseModelRMSSwiGLURoPE(nn.Module):
    
    def __init__(self, vocab_size, context_length, d_model, n_heads):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.rms = RMSNorm((context_length, d_model), bias=True)
        self.attn = RoPECausalAttention(context_length, d_model, n_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model),
            SwiGLU(d_model),
            nn.Linear(d_model, vocab_size)
        )
        
    def forward(self, x):
        
        x = self.embedding(x)
        x = self.rms(x)
        x = self.attn(x)
        x = self.ffn(x)
        
        return x
    
    def loss_fn(self, yhat, y):
        
        yhat = yhat.view(-1, yhat.size(-1))
        y = y.view(-1)
        
        loss = F.cross_entropy(yhat, y)
        
        return loss

In [22]:
net = BaseModelRMSSwiGLURoPE(vocab_size, context_length, d_model, n_heads)

explorer = vl.nn.Explorer(
    net=net,
    loss_fn=net.loss_fn,
    batch_input="x, y",
    net_input="x",
    loss_input="y",
    batch_size=batch_size,
    num_epochs=num_epochs,
    learning_rate=1e-3,
    optimizer_fn="adamw",
    scheduler_fn=None,
    earlystopping_fn=None,
)

explorer.train(
    train_dataset, valid_dataset,
    save_best=False, save_last=False,
    verbose=1
)

Training with parameters: batch_size=10240, num_workers=0, num_epochs=2, lr=0.001, weight_decay=0 on devices cuda
Result: train loss 1.887828, valid loss 1.885723


batch_size=10240, num_workers=0, num_epochs=2, lr=0.001, weight_decay=0

Now the loss is even better with 1.887828 / 1.885723.

#### Go deeper

Let's transform the current model into a block and stack additional blocks to construct a deep neural network.

In [24]:
class TLMBlock(nn.Module):
    
    def __init__(self, context_length, d_model, n_heads):
        
        super().__init__()
        
        self.rms = RMSNorm((context_length, d_model))
        self.attn = RoPECausalAttention(context_length, d_model, n_heads)
        
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model),
            SwiGLU(d_model),
        )
        
    def forward(self, x):
        
        # pre-normalization
        x = self.rms(x)
        
        # attention and skip connection
        x = x + self.attn(x)
        
        # pre-normalization
        x = self.rms(x)
        
        # ffn and skip connection
        x = x + self.ffn(x)
        
        return x
    

class TLM(nn.Module):
    
    def __init__(self, vocab_size, context_length, d_model, n_layers, n_heads):
        
        super().__init__()
        
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.blocks = nn.Sequential(
            OrderedDict(
                [
                    (f"TLMBlock_{i}", TLMBlock(context_length, d_model, n_heads))
                    for i in range(n_layers)
                ]
            )
        )
        
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model),
            SwiGLU(d_model),
            nn.Linear(d_model, vocab_size)
        )
        
        
    def forward(self, x):
        
        x = self.embedding(x)
        x = self.blocks(x)
        x = self.ffn(x)
        
        return x
    
    def loss_fn(self, yhat, y):
        
        yhat = yhat.view(-1, yhat.size(-1))
        y = y.view(-1)
        
        loss = F.cross_entropy(yhat, y)
        
        return loss

Given that the neural network is more profound, we should allow it additional time to reach convergence.

In [38]:
num_epochs = 10

In [39]:
net = TLM(vocab_size, context_length, d_model, n_layers, n_heads)

explorer = vl.nn.Explorer(
    net=net,
    loss_fn=net.loss_fn,
    batch_input="x, y",
    net_input="x",
    loss_input="y",
    batch_size=batch_size,
    num_epochs=num_epochs,
    learning_rate=1e-3,
    optimizer_fn="adamw",
    scheduler_fn=None,
    earlystopping_fn=None,
)

explorer.train(
    train_dataset, valid_dataset,
    save_best=False, save_last=False,
    verbose=1
)

Training with parameters: batch_size=10240, num_workers=0, num_epochs=10, lr=0.001, weight_decay=0 on devices cuda
Result: train loss 0.041278, valid loss 0.044005


batch_size=10240, num_workers=0, num_epochs=10, lr=0.001, weight_decay=0

Significant progress has been made! You can now observe the outcome of your diligent efforts.

### Inference

Let's now take a glimpse at the output generated by your trained model.

In [44]:
@torch.no_grad()
def inference(net, prompt, max_new_tokens=64):
    
    encoded_prompt = torch.tensor(encode(prompt)).unsqueeze(0).cuda()

    
    for _ in range(max_new_tokens):
        
        logits = net(encoded_prompt[:, -context_length:])
        last_logits = logits[:, -1, :]
        p = F.softmax(last_logits, dim=-1)
        
        next_token = p.argmax(keepdim=True)
        encoded_prompt = torch.cat((encoded_prompt, next_token), dim=-1)
        
    return [decode(x) for x in encoded_prompt.tolist()]
    
inference(net, "Once upon a time, there was a jolly frog named Bob.")

['Once upon a time, there was a jolly frog named Bob. She saw a big started to the ball and said, "Yes, "Yes, she sai']

Up until now, all the magical aspects of building a Tiny Language Model (TLM) have been revealed.

## Write at the very end:

You might find that the output of the trained model doesn't make much sense, but the goal is to give you a basic understanding of how to build a language model. By expanding and deepening this TLM, and gathering more data to train it, you're on the path to developing a robust LLM of your own one day.

I hope you enjoyed reading this tutorial.