# Dev notebook for data exploration and model baselines

Preliminary exploration of a Russian literature text dataset.

In [486]:
# path to data file
data_path = 'data/tiny-russian-lit/very_clean_tiny_russian_lit.txt'

In [487]:
# read it in for inspection
with open(data_path, 'r', encoding='utf-8') as f:
    text = f.read()

In [488]:
print(f'Length of dataset at {data_path} is {len(text)} characters')

Length of dataset at data/tiny-russian-lit/very_clean_tiny_russian_lit.txt is 34824628 characters


In [489]:
print(f'First 1000 characters of the dataset:\n {text[:1000]}')

First 1000 characters of the dataset:
 Михаил Лермонтов
  

Выхожу один я на дорогу;
Сквозь туман кремнистый путь блестит;
Ночь тиха. Пустыня внемлет богу,
И звезда с звездою говорит.

В небесах торжественно и чудно!
Спит земля в сиянье голубом...
Что же мне так больно и так трудно?
Жду ль чего? жалею ли о чем?

Уж не жду от жизни ничего я,
И не жаль мне прошлого ничуть;
Я ищу свободы и покоя!
Я б хотел забыться и заснуть!

Но не тем холодным сном могилы...
Я б желал навеки так заснуть,
Чтоб в груди дремали жизни силы,
Чтоб, дыша, вздымалась тихо грудь;

Чтоб всю ночь, весь день мой слух лелея,
Про любовь мне сладкий голос пел,
Надо мной чтоб, вечно зеленея,
Темный дуб склонялся и шумел.
Михаил Лермонтов
ВАЛЕРИК
Я к вам пишу случайно; право,
Не знаю как и для чего.
Я потерял уж это право.
И что скажу вам? — ничего!
Что помню вас? — но, боже правый,
Вы это знаете давно;
И вам, конечно, все равно.
И знать вам также нету нужды,
Где я? что я? в какой глуши?
Душою мы друг другу чужды,
Да вр

In [490]:
# find the unique characters that occur in the text
chars = sorted(list(set(text)))
vocab = ''.join(chars)
vocab_size = len(chars)
print(f'Text vocabulary: {vocab}\nVocabulary size: {vocab_size}')

Text vocabulary: 
 !&,-.:;?i ̀́ЁІЉАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюяёєі–—’
Vocabulary size: 87
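Some entries in the printed vocabulary are hard to read by eye, for example the two combining accent marks that attach to the preceding character. One way to inspect them is to list each character's code point and Unicode name with the standard-library `unicodedata` module. This is only a sketch: `sample_chars` and `describe` are hypothetical names, and the list is a hand-picked stand-in for `chars`.

```python
import unicodedata

# Hypothetical stand-in for `chars`; in the notebook you would pass `chars` itself.
sample_chars = ['\n', ' ', 'i', '\u0300', '\u0301', 'Ё', 'я']

def describe(ch):
    # Return (code point, Unicode category, name) for one character.
    # Control characters such as '\n' have no Unicode name, so supply a default.
    return (f'U+{ord(ch):04X}', unicodedata.category(ch), unicodedata.name(ch, '<unnamed>'))

for ch in sample_chars:
    print(describe(ch))
```

Characters in category `Mn` (non-spacing marks) are the ones that visually merge with their neighbour in the printed vocabulary.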


Next, we need to tokenize the input: convert the raw string into a sequence of integers drawn from our vocabulary.

For a character-level language model, each character in the vocabulary is its own token and maps to exactly one integer ID.

In [491]:
# create a simple character-level tokenizer: a mapping from characters to integers
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i:ch for i, ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s] # encoder: convert string to list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: convert list of integers to string
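Note that `encode` indexes `stoi` directly, so any character outside the vocabulary raises a bare `KeyError`. That is not a problem while we only encode the dataset itself, but it can bite when encoding a user-supplied prompt. A minimal sketch of a guarded encoder follows; `toy_chars`, `toy_stoi`, `find_unknown`, and `safe_encode` are hypothetical names on a toy vocabulary, not part of the notebook.

```python
# Toy vocabulary for illustration only; the notebook builds `chars` from the dataset.
toy_chars = sorted(set('мой дядя'))
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}

def find_unknown(s, stoi):
    # Return the sorted list of characters in `s` that have no entry in `stoi`.
    return sorted(set(s) - set(stoi))

def safe_encode(s, stoi):
    # Encode `s`, raising a descriptive error instead of a bare KeyError.
    unknown = find_unknown(s, stoi)
    if unknown:
        raise ValueError(f'characters not in vocabulary: {unknown}')
    return [stoi[c] for c in s]

print(safe_encode('дядя', toy_stoi))
print(find_unknown('дядя z', toy_stoi))
```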

In [492]:
def verify(string):
    print(f"The string '{string}' has the encoding {encode(string)}")
    print(decode(encode(string)) == string)

In [493]:
char = decode([1])
utf8_encoded = char.encode('utf-8')
print(utf8_encoded.hex())

20


In [494]:
verify(' ')
verify('\n')
verify('и')
verify('Мой дядя самых честных правил')

The string ' ' has the encoding [1]
True
The string '
' has the encoding [0]
True
The string 'и' has the encoding [57]
True
The string 'Мой дядя самых честных правил' has the encoding [29, 63, 58, 1, 53, 80, 53, 80, 1, 66, 49, 61, 76, 70, 1, 72, 54, 66, 67, 62, 76, 70, 1, 64, 65, 49, 51, 57, 60]
True


In [495]:
# encode the entire text dataset and store in a tensor
import torch

data = torch.tensor(encode(text), dtype=torch.long)
print(f'Input data tensor has shape {data.shape} and type {data.dtype}')
print(f'First 1000 elements of data tensor:\n {data[:1000]}')

Input data tensor has shape torch.Size([34824628]) and type torch.int64
First 1000 elements of data tensor:
 tensor([29, 57, 70, 49, 57, 60,  1, 28, 54, 65, 61, 63, 62, 67, 63, 51,  0,  1,
         1,  0,  0, 19, 76, 70, 63, 55, 68,  1, 63, 53, 57, 62,  1, 80,  1, 62,
        49,  1, 53, 63, 65, 63, 52, 68,  8,  0, 34, 59, 51, 63, 56, 77,  1, 67,
        68, 61, 49, 62,  1, 59, 65, 54, 61, 62, 57, 66, 67, 76, 58,  1, 64, 68,
        67, 77,  1, 50, 60, 54, 66, 67, 57, 67,  8,  0, 30, 63, 72, 77,  1, 67,
        57, 70, 49,  6,  1, 32, 68, 66, 67, 76, 62, 80,  1, 51, 62, 54, 61, 60,
        54, 67,  1, 50, 63, 52, 68,  4,  0, 25,  1, 56, 51, 54, 56, 53, 49,  1,
        66,  1, 56, 51, 54, 56, 53, 63, 79,  1, 52, 63, 51, 63, 65, 57, 67,  6,
         0,  0, 19,  1, 62, 54, 50, 54, 66, 49, 70,  1, 67, 63, 65, 55, 54, 66,
        67, 51, 54, 62, 62, 63,  1, 57,  1, 72, 68, 53, 62, 63,  2,  0, 34, 64,
        57, 67,  1, 56, 54, 61, 60, 80,  1, 51,  1, 66, 57, 80, 62, 77, 54,  1,
        52,

In [496]:
# split data into train and validation sets to test for overfitting
split = 0.8
n = int(split*len(data))
train_data = data[:n]
val_data = data[n:]

Block size, or context length, is the maximum length of any chunk of text the transformer is trained on. A chunk of length `block_size + 1` yields `block_size` training examples, one for each prefix paired with the character that follows it. It also means the input to the transformer at sampling time never exceeds `block_size` tokens.
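To make the sampling-time limit concrete, here is a small sketch of how a running context might be cropped to its last `block_size` tokens before being fed to a model. `crop_context` is a hypothetical helper and the tensors are made-up token IDs; the bigram baseline in this notebook does not need this step, since it only looks at the last token.

```python
import torch

block_size = 8  # same value the notebook uses

def crop_context(idx, block_size):
    # Keep only the last `block_size` tokens of each row of a (B, T) tensor.
    return idx[:, -block_size:]

long_context = torch.arange(12).unsqueeze(0)   # shape (1, 12)
short_context = torch.arange(3).unsqueeze(0)   # shape (1, 3)

print(crop_context(long_context, block_size))   # keeps tokens 4..11
print(crop_context(short_context, block_size))  # unchanged, since 3 < 8
```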

In [497]:
block_size = 8
first_block = train_data[:block_size + 1]
print(f'First block of the training data, + 1 character: {first_block}')

First block of the training data, + 1 character: tensor([29, 57, 70, 49, 57, 60,  1, 28, 54])


For a given block of text of length `block_size + 1`, we train the transformer on every sequence/target pair with sequence lengths from 1 to `block_size`, where the target is the character immediately following the last character of the sequence. This gets the transformer used to predicting the next token from contexts as short as 1 and as long as `block_size`. That matters at sampling time, when the transformer may have to start generating from a context shorter than `block_size`.

In [498]:
print(f'Training examples/sequences in first block of data')
for i in range(1, block_size + 1):
    print(f'{i}/{block_size}: When input is, {first_block[:i]} target is {first_block[i]}')

Training examples/sequences in first block of data
1/8: When input is, tensor([29]) target is 57
2/8: When input is, tensor([29, 57]) target is 70
3/8: When input is, tensor([29, 57, 70]) target is 49
4/8: When input is, tensor([29, 57, 70, 49]) target is 57
5/8: When input is, tensor([29, 57, 70, 49, 57]) target is 60
6/8: When input is, tensor([29, 57, 70, 49, 57, 60]) target is 1
7/8: When input is, tensor([29, 57, 70, 49, 57, 60,  1]) target is 28
8/8: When input is, tensor([29, 57, 70, 49, 57, 60,  1, 28]) target is 54


In [499]:
torch.manual_seed(3)
batch_size = 4  # the number of independent sequences that we will process in parallel
block_size = 8  # maximum context length for predictions

def get_batch(split):
    # generate a batch of data consisting of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))  # batch_size random offsets in the interval [0, len(data) - block_size)
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('-' * 10)

for b in range(batch_size): # batch dimension
    print(f'Batch {b + 1}/{batch_size}')
    for t in range(block_size): # time/position dimension
        context = xb[b, : t+1]
        target = yb[b, t]
        print(f'When input is {context.tolist()}, target is {target}')

inputs:
torch.Size([4, 8])
tensor([[70,  6,  1, 31, 67, 65, 80, 53],
        [49,  1, 72, 54, 60, 63, 51, 54],
        [66, 54, 50, 54,  1, 57, 61, 57],
        [65, 62, 63, 54,  4,  1, 55, 53]])
targets:
torch.Size([4, 8])
tensor([[ 6,  1, 31, 67, 65, 80, 53,  1],
        [ 1, 72, 54, 60, 63, 51, 54, 59],
        [54, 50, 54,  1, 57, 61, 57,  1],
        [62, 63, 54,  4,  1, 55, 53, 54]])
----------
Batch 1/4
When input is [70], target is 6
When input is [70, 6], target is 1
When input is [70, 6, 1], target is 31
When input is [70, 6, 1, 31], target is 67
When input is [70, 6, 1, 31, 67], target is 65
When input is [70, 6, 1, 31, 67, 65], target is 80
When input is [70, 6, 1, 31, 67, 65, 80], target is 53
When input is [70, 6, 1, 31, 67, 65, 80, 53], target is 1
Batch 2/4
When input is [49], target is 1
When input is [49, 1], target is 72
When input is [49, 1, 72], target is 54
When input is [49, 1, 72, 54], target is 60
When input is [49, 1, 72, 54, 60], target is 63
When input is [4

Probably the simplest language model is a bigram model over character tokens: given a single character, it predicts the next character in the sequence. I implement one here as a baseline for our Russian text generation task.
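For intuition, the same idea can be expressed without a neural network by counting which character follows which. The sketch below is a count-based variant on a toy string, not the trainable model used in this notebook; `bigram_counts`, `next_char_probs`, and `toy_text` are hypothetical names.

```python
from collections import Counter, defaultdict

def bigram_counts(text):
    # Map each character to a Counter of the characters that follow it.
    counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return counts

def next_char_probs(counts, ch):
    # Normalize the follower counts of `ch` into a probability distribution.
    total = sum(counts[ch].values())
    return {nxt: n / total for nxt, n in counts[ch].items()}

toy_text = 'мама мыла раму'  # toy string for illustration, not the dataset
counts = bigram_counts(toy_text)
print(next_char_probs(counts, 'м'))
```

The neural version learns a comparable table of next-character scores by gradient descent instead of by counting.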

In [500]:
import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token reads off the logits (input to softmax) for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
        
    def forward(self, idx, targets=None):
        # idx and targets are both (B, T) tensors of integers (B = batch size, T = number of timesteps, at most block_size)
        # we are essentially predicting the next character based on the embedding of a single token
        logits = self.token_embedding_table(idx)  # (B, T, C) : batch, time, channels
        
        if targets is None:
            loss = None
        else:
            # flatten to (B*T, C): cross_entropy expects the class dimension second, i.e. (N, C) or (B, C, T)
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)  # equivalently, targets.view(-1)

            # negative log likelihood loss - calculates quality of our logits with respect to the true targets
            # a 'good' logit will have a high value in the target dimension and low values in other dimensions
            loss = F.cross_entropy(logits, targets)
        
        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        # the bigram only uses the last char as the context
        # we pass in the full context here as practice for generation using transformer
        for _ in range(max_new_tokens):
            # get predictions
            logits, loss = self(idx)  # calls the forward function
            # retrieve only final timestep
            logits = logits[:, -1, :] # (B, T, C) -> (B, C)
            # apply softmax to get probability distribution
            dist = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(dist, num_samples=1) # (B, 1)
            # append new sample to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T + 1)
        return idx


In [501]:
model = BigramLanguageModel(vocab_size)
logits, loss = model(xb, yb)
print(logits.shape)  # (B*T, C) = (32, 87): forward() flattens the logits when targets are given
print(loss)

torch.Size([32, 87])
tensor(4.8334, grad_fn=<NllLossBackward0>)
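A useful reference point for this number is the loss of a model that assigns equal probability to every character, which is the natural log of the vocabulary size. The sketch below computes it for the vocabulary size of 87 reported earlier; `uniform_loss` is a name introduced here for illustration.

```python
import math

vocab_size = 87  # vocabulary size printed earlier in the notebook

# Cross-entropy of a uniform distribution over the vocabulary: -ln(1 / vocab_size).
uniform_loss = -math.log(1.0 / vocab_size)
print(f'{uniform_loss:.4f}')  # 4.4659
```

The untrained loss of 4.8334 is somewhat higher than this, which is plausible because `nn.Embedding` initializes its weights from a standard normal distribution, so the initial logits are random and not uniform.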


In [502]:
torch.manual_seed(3)

def sample(context, new_tokens=100):
    # generate new_tokens tokens after the given (B, T) context and print the first decoded sequence
    print(f'Context: {decode(context[0].tolist())}')
    generated = model.generate(context, new_tokens)
    print(f'Sample: {decode(generated[0].tolist())}')


# as the model's starting context for sampling, let's provide a newline character
blank_context = torch.tensor([encode('\n')])
sample(blank_context, 250)

Context: 

Sample: 
ПхчМнЭХ–г!,цЙщЫ,рАНвщiсЛжрЭтЗБИзняН—́’зЩЯщБги;ыРуШжмгЕЙСыТПУг–ПФЦырщп́ФЕпЫЧпо т!Их-фУЬюш–лрёФъЛшШi:пМш’̀Гб—мСЁМвєГчВ
Х̀Ю&ТЁбєЙлкiыубц
ЛИЭЫтущп́оНд:р:мвущфНдеєоЪЬЁІЙёЯэбИ?кІЙнЉк’є РЙ!Ип’м-оф&ЙЦкЭ’,Хп́каыцУуйНБ
Н’лЯШшРОящ
СБШОЮЁфу;—––тЯэ?ЁеЛзІыцфшАзП;є


The above sampled text is gibberish. Let's train the model so it can produce something that looks more reasonable.

In [503]:
# typical lr setting is 3e-4, but for small models we can use a much higher lr
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

In [504]:
batch_size = 32
num_steps = 10000
for step in range(num_steps):
    # sample a batch of data
    xb, yb = get_batch('train')
    
    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    
    if step == 0 or step == num_steps - 1:
        print(f'Step {step + 1}/{num_steps}: loss={loss.item()}')

Step 1/10000: loss=4.9720001220703125
Step 10000/10000: loss=2.5729243755340576
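Each of the two losses printed here comes from a single random batch, so both are noisy and neither says anything about the validation set. A common remedy is to average the loss over several batches from each split with gradients disabled. The sketch below shows one possible shape for this on random stand-in data; `estimate_loss`, `eval_iters`, `splits`, and `table` are assumptions for illustration, not part of the notebook.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)
vocab_size, block_size, batch_size = 87, 8, 32

# Random stand-in data; in the notebook these would be train_data and val_data.
splits = {
    'train': torch.randint(vocab_size, (5000,)),
    'val': torch.randint(vocab_size, (1000,)),
}

def get_batch(split):
    # Same batching scheme as the notebook, reading from the stand-in splits.
    data = splits[split]
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])
    return x, y

# Stand-in for the bigram model: a (vocab_size, vocab_size) lookup table of logits.
table = nn.Embedding(vocab_size, vocab_size)

@torch.no_grad()  # no gradients are needed for evaluation
def estimate_loss(eval_iters=50):
    # Average the loss over `eval_iters` random batches from each split.
    out = {}
    for split in splits:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            x, y = get_batch(split)
            logits = table(x)  # (B, T, C)
            losses[k] = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
        out[split] = losses.mean().item()
    return out

print(estimate_loss())
```

Comparing the averaged train and validation losses during training is what makes the earlier train/validation split useful for spotting overfitting.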


After optimization, let's see if we can sample something more reasonable.

In [507]:
sample(blank_context, 250)

Context: 

Sample: 
Го-ибедора зл скро, но? чуковши  побннонизмиведныеебой некаяЫЫЕй бебля — воси вистспрося, еле, дак ся, жеюц вероя бы жныстомуб бугле Пра. узал этудукахой Вые дил космяме, мегажапошесиемонаюм х а в ногого Г?
Ф-сстракат ушарать ро гомыцей я уде былетат


This is starting to look more like Russian text, but it is still mostly gibberish. The reason is that the bigram model predicts the next token from the last token in the context window alone; in other words, its effective context is a single token, so it cannot learn longer-range language patterns. With a transformer, tokens can 'talk' to each other over longer ranges, which lets the model learn more complex dependencies.