# Data Preparation
In this notebook, we walk through the data preparation steps that come before training the GPT model.

# A few words before we start
This project demonstrates how to train a GPT-style architecture from scratch, but only as a proof of concept. Real GPT models are trained on billions of tokens (terabytes of data, on thousands of GPUs). Our project trains on roughly a hundred thousand tokens with a small model of several million parameters, and training takes 10 to 30 minutes depending on your machine's computational resources. As you will see at the end, the model cannot produce meaningful sentences, but it does manage to produce Vietnamese words.

In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F

## Vocab
First, we need to build the vocabulary of the Vietnamese text and determine its size. Here we simply parse the Truyen Kieu dataset and collect the unique characters in the text. In a real-life scenario we would start from a dictionary of known words, and we would also have to handle unexpected new words, for example newly coined names.

In [2]:
with open('data/truyen_kieu_clean.txt', 'r', encoding='utf-8') as f:
    text = f.read()

chars = sorted(list(set(text)))
vocab_size = len(chars)

print("Vocab size:", vocab_size)
print("Number of characters:", len(text))

Vocab size: 129
Number of characters: 104804
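
One thing worth checking with Vietnamese text is Unicode normalization: an accented letter such as `ă` can be stored either as a single precomposed code point or as a base letter plus a combining mark, and the two forms count as different characters in the vocabulary. The sketch below uses a short sample string rather than the dataset file (whose normalization form is not stated here, so this is something to verify, not a known problem) and shows how `unicodedata.normalize` makes the two forms agree.

```python
import unicodedata

# The same visible text in two Unicode forms (illustrative sample, not the dataset).
composed = "Tr\u0103m n\u0103m"                       # 'ă' as one code point (NFC)
decomposed = unicodedata.normalize("NFD", composed)   # 'a' followed by a combining breve

print(len(composed), len(set(composed)))       # 8 characters, 6 unique
print(len(decomposed), len(set(decomposed)))   # 10 characters, 7 unique

# Normalizing to NFC before building the vocabulary makes both forms identical.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)                  # True
```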


## Encoder and decoder
The encoder and decoder depend on your tokenizer, that is, on how you turn text into numbers. The actual GPT models use a sub-word tokenizer: the model predicts pieces of words, which are then glued together into words and sentences. Since we only have a small amount of text here, we use a character-level tokenizer instead.

In [3]:
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encoder = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decoder = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

test_text = """
Trăm năm trong cõi người ta,
Chữ tài chữ mệnh khéo là ghét nhau.
Trải qua một cuộc bể dâu,
Những điều trông thấy mà đau đớn lòng.
"""
print(encoder(test_text))
print(decoder(encoder(test_text)))

[0, 26, 47, 74, 42, 1, 43, 74, 42, 1, 49, 47, 44, 43, 37, 1, 34, 69, 39, 1, 43, 37, 82, 113, 39, 1, 49, 32, 4, 0, 12, 38, 122, 1, 49, 57, 39, 1, 34, 38, 122, 1, 42, 102, 43, 38, 1, 40, 38, 62, 44, 1, 41, 57, 1, 37, 38, 62, 49, 1, 43, 38, 32, 50, 6, 0, 26, 47, 84, 39, 1, 46, 50, 32, 1, 42, 111, 49, 1, 34, 50, 111, 34, 1, 33, 100, 1, 35, 59, 50, 4, 0, 20, 38, 122, 43, 37, 1, 76, 39, 99, 50, 1, 49, 47, 68, 43, 37, 1, 49, 38, 85, 53, 1, 42, 57, 1, 76, 32, 50, 1, 76, 112, 43, 1, 41, 66, 43, 37, 6, 0]

Trăm năm trong cõi người ta,
Chữ tài chữ mệnh khéo là ghét nhau.
Trải qua một cuộc bể dâu,
Những điều trông thấy mà đau đớn lòng.
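
The `encoder` above raises a `KeyError` for any character that did not appear in the training text. Below is a minimal sketch of one way to handle that, assuming a hypothetical placeholder character `?` reserved for unknown input. The names `build_codec`, `encode`, `decode`, and `unk` are illustrative and not part of the original notebook.

```python
def build_codec(text: str, unk: str = "?"):
    """Build character-level encode/decode functions with an unknown-character fallback."""
    # Make sure the placeholder itself is in the vocabulary.
    chars = sorted(set(text) | {unk})
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}

    def encode(s: str) -> list[int]:
        # Characters never seen in `text` map to the placeholder id.
        return [stoi.get(c, stoi[unk]) for c in s]

    def decode(ids: list[int]) -> str:
        return "".join(itos[i] for i in ids)

    return encode, decode


encode, decode = build_codec("Trăm năm trong cõi người ta")
print(decode(encode("Trăm năm")))   # round-trips: every character is known
print(decode(encode("năm 2024")))   # digits are unseen, so they come back as '?'
```

The cost of this design is that decoding is no longer lossless for unseen characters, which is usually acceptable for a proof of concept.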



## Data

Here we split the data into training and validation sets with a 0.9 ratio: we train on the first 90% of the data and evaluate the model on the remaining 10%.

In [6]:
data = torch.tensor(encoder(text), dtype=torch.long)
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]
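
With the 104,804 characters reported earlier, the split sizes can be worked out directly. The sketch below only reproduces the arithmetic on a stand-in list (it does not load the dataset, and the `toy_*` names are illustrative) and checks that the two slices are contiguous and do not overlap.

```python
num_tokens = 104_804                 # character count printed earlier in this notebook
toy_data = list(range(num_tokens))   # stand-in for the encoded text

n_split = int(0.9 * len(toy_data))   # index where the validation part starts
toy_train = toy_data[:n_split]
toy_val = toy_data[n_split:]

print(len(toy_train), len(toy_val))        # 94323 10481
# The split is by position: validation starts right where training ends.
print(toy_train[-1] + 1 == toy_val[0])     # True
```

Because the split is positional, the validation set is the final tenth of the text rather than a random sample of it.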

## Batch

The `get_batch` function draws a fresh batch every time we take a training step. One notable thing about GPT training is that it is usually measured in steps rather than epochs, unlike many other common applications. The reason is efficiency: with text data, iterating over billions of tokens in order can be expensive, so instead we randomly sample chunks of the data and train on those.
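
To relate steps to epochs, it helps to count how many tokens one step consumes. The following back-of-the-envelope sketch assumes the `batch_size=64` and `block_size=32` defaults that `get_batch` uses in this notebook, together with the 90% training split of the 104,804 characters. Because windows are sampled at random and may overlap, the result is only a rough equivalence, not an exact pass over the data.

```python
batch_size, block_size = 64, 32          # defaults used by get_batch in this notebook
train_tokens = int(0.9 * 104_804)        # size of the 90% training split

tokens_per_step = batch_size * block_size
steps_per_epoch = train_tokens / tokens_per_step

print(tokens_per_step)                   # 2048
print(round(steps_per_epoch, 1))         # 46.1
```

So roughly 46 steps see about as many tokens as one pass over the training text.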

In [7]:
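def get_batch(data, batch_size: int = 64, block_size: int = 32):
    # Sample random start offsets, leaving room for the shifted target window
    idx = torch.randint(len(data) - block_size, (batch_size,))
    # Inputs: block_size tokens starting at each offset
    x = torch.stack([data[i:i+block_size] for i in idx])
    # Targets: the same windows shifted one token to the right
    y = torch.stack([data[i+1:i+block_size+1] for i in idx])
    return x, y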

x, y = get_batch(train_data)
print("x shape:", x.shape)
print("y shape:", y.shape)
print("x:", x)
print("y:", y)

x shape: torch.Size([64, 32])
y shape: torch.Size([64, 32])
x: tensor([[39, 98, 49,  ..., 43, 37,  1],
        [ 1, 43, 38,  ..., 38, 50,  1],
        [76, 98, 43,  ..., 50, 68, 43],
        ...,
        [ 1, 76, 60,  ..., 71, 32,  1],
        [ 6,  0, 18,  ..., 38, 64, 43],
        [53,  1, 48,  ..., 76, 57, 44]])
y: tensor([[98, 49,  1,  ..., 37,  1, 48],
        [43, 38, 58,  ..., 50,  1, 34],
        [98, 43,  1,  ..., 68, 43,  1],
        ...,
        [76, 60,  1,  ..., 32,  1, 47],
        [ 0, 18, 57,  ..., 64, 43,  1],
        [ 1, 48, 32,  ..., 57, 44,  4]])
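
Each row of `y` is the matching row of `x` shifted one position to the right, so a single window of `block_size` tokens yields `block_size` training examples: the target at position `t` is predicted from the tokens up to and including position `t`. Here is a small illustration on a plain string, where characters are used directly instead of integer ids purely for readability and the variable names are illustrative.

```python
sample = "Trăm năm trong"
window = 8                            # plays the role of block_size

x_chars = sample[:window]             # input window
y_chars = sample[1:window + 1]        # same window shifted by one character

# Every prefix of the input is a context; the aligned target is the next character.
for t in range(window):
    context, target = x_chars[:t + 1], y_chars[t]
    print(f"{context!r} -> {target!r}")
```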


## Flash Attention (Optional)

The code below is optional: it enables Flash Attention in place of the standard attention implementation. Flash Attention computes attention roughly 2x to 4x faster than the traditional approach. Without it, you can still train your model without any errors. Note that `allow_tf32` enables TF32 matrix multiplications, which is a separate speed-up, and that `enable_flash_sdp` only affects attention computed through `torch.nn.functional.scaled_dot_product_attention`.

In [8]:
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cuda.enable_flash_sdp(True)

print(torch.backends.cuda.matmul.allow_tf32)  # Should print True
print(torch.cuda.is_available())  # Should print True if CUDA is available
print(torch.__version__)  # Check your PyTorch version
print(torch.version.cuda)  # Check the CUDA version PyTorch was built with
print(torch.backends.cudnn.version())  # Check the cuDNN version

True
True
2.5.1+cu121
12.1
90100
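
For context, the flags above only matter if attention is computed through `torch.nn.functional.scaled_dot_product_attention`, which picks a backend automatically (Flash Attention among them, on a supported GPU and dtype). The sketch below is not the model code from this project: the tensor shapes are illustrative assumptions, and it only checks that the fused call matches a hand-written causal attention. It also runs on CPU, where a non-flash backend is used.

```python
import math
import torch
from torch.nn import functional as F

torch.manual_seed(0)
B, H, T, D = 2, 4, 32, 16            # batch, heads, sequence length, head size (illustrative)
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))

# Hand-written causal attention: mask out future positions, then softmax.
scores = (q @ k.transpose(-2, -1)) / math.sqrt(D)
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~mask, float("-inf"))
manual = F.softmax(scores, dim=-1) @ v

# Fused version; PyTorch selects the backend for the current device and dtype.
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(torch.allclose(manual, fused, atol=1e-5))
```

Using the fused call keeps the model code identical whether or not Flash Attention is available, since backend selection happens inside PyTorch.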
