In [9]:
with open('../input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
    
print(text[:150])
print(f"Length of text: {len(text)}")

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

A
Length of text: 1115394


In [12]:
chars = sorted(list(set(text)))
n_vocab = len(chars)


print("".join(chars))
print(f"Vocabulary size: {n_vocab}")


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
Vocabulary size: 65


### Tokenizer

We use a custom character-level tokenizer for the sake of simplicity.

Popular tokenizers include tiktoken (byte pair encoding) and sentencepiece (subword unit encoding).

Such tokenizers typically have very large vocabularies (~50k tokens), which results in much shorter token sequences for the same text.

In our case the character-level vocabulary has only 65 tokens, so every character maps to exactly one token and the encoded sequence is as long as the text itself. That makes our sequences much longer than a subword tokenizer would produce (which is bad).

> TODO: try one of the popular tokenizers later in the implementation to see the difference
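
As a rough illustration of the trade-off between vocabulary size and sequence length, the sketch below encodes a short sample two ways: character by character, and with a toy splitter that treats whole words as single tokens. The word splitter is only a stand-in assumed for illustration, not byte pair encoding, and the names `sample`, `pieces`, and `piece_vocab` are hypothetical.

```python
import re

sample = "First Citizen:\nBefore we proceed any further, hear me speak."

# Character-level: one token per character.
char_vocab = sorted(set(sample))
char_ids = [char_vocab.index(ch) for ch in sample]

# Toy word-level stand-in (NOT real BPE): each run of word characters becomes
# one token, and every other character is its own token.
pieces = re.findall(r"\w+|\W", sample)
piece_vocab = sorted(set(pieces))
piece_ids = [piece_vocab.index(p) for p in pieces]

print(f"char-level:  {len(char_ids)} tokens")
print(f"piece-level: {len(piece_ids)} tokens")
```

The piece-level sequence is shorter because each token covers more text. On a full corpus the price is a vocabulary far larger than the 65 characters used here.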

In [18]:
char2idx = { ch: i for i, ch in enumerate(chars) }
idx2char = { i: ch for i, ch in enumerate(chars) }

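def encode(string):
    # Map each character to its integer index
    return [char2idx[char] for char in string]

def decode(indices):
    # Map each integer index back to its character and join into a string
    return "".join(idx2char[idx] for idx in indices)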

print(encode("hii there"))
print(decode(encode("hii there")))


[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there
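
One property of this mapping is worth keeping in mind: `encode` only knows the characters present in the vocabulary, so any other character raises a `KeyError`. A minimal self-contained sketch, assuming a tiny stand-in vocabulary built from a sample string (`sample_chars` is a hypothetical name, not the notebook's `chars`):

```python
# Stand-in vocabulary (assumption): only the characters of "hii there"
sample_chars = sorted(set("hii there"))
char2idx = {ch: i for i, ch in enumerate(sample_chars)}
idx2char = {i: ch for i, ch in enumerate(sample_chars)}

def encode(string):
    return [char2idx[ch] for ch in string]

def decode(indices):
    return "".join(idx2char[i] for i in indices)

# Round trip works for any string made of known characters
print(decode(encode("hi there")))

# Characters outside the vocabulary cannot be encoded
try:
    encode("hello")  # 'l' and 'o' are not in the stand-in vocabulary
except KeyError as err:
    print(f"unknown character: {err}")
```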


### Tokenize the dataset

In [21]:
import torch

data = torch.tensor(encode(text), dtype=torch.long)

print(data.shape, data.dtype)
print(data[:100])
print(decode(data[:100].tolist()))

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])
First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


### Train - Validate Split

In [23]:
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]

print(len(train_data), len(val_data))

1003854 111540


### Hyperparameters

`block_size`

> We train the transformer on chunks of the dataset rather than on the whole text at once, which would be too computationally expensive. We randomly sample chunks from the dataset and train on them; `block_size` sets the length of each sampled chunk.

`n_vocab`

> The size of the vocabulary, that is, the number of unique tokens that our transformer can see and generate.

In [24]:
block_size = 8

Each sampled sequence packs multiple training examples: a chunk of `block_size + 1` tokens contains `block_size` examples (8 here), one for each context length from 1 to `block_size`.

The `+1` accommodates the target `y` for the last training example, since `y` is `x` shifted by one position.

### Note

> Taking multiple training examples from a single sequence, with context lengths from 1 to `block_size`, is not only computationally efficient; it also gets the transformer used to seeing contexts of every length in that range. `block_size` is essentially the context length of the transformer. During generation we keep appending generated tokens, and on each forward pass the transformer sees only the last `block_size` tokens.
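
The context cropping during generation can be sketched as follows. This is only an illustration under stated assumptions: `crop_context` and `fake_next_token` are hypothetical helpers, and `fake_next_token` merely stands in for a model forward pass plus sampling.

```python
block_size = 8
n_vocab = 65

def crop_context(tokens, block_size):
    # Keep only the most recent block_size tokens
    return tokens[-block_size:]

def fake_next_token(context):
    # Hypothetical stand-in for a model forward pass followed by sampling
    return (context[-1] + 1) % n_vocab

tokens = [18, 47, 56]  # prompt, shorter than block_size
for _ in range(10):
    context = crop_context(tokens, block_size)  # never longer than block_size
    tokens.append(fake_next_token(context))

print(len(tokens), crop_context(tokens, block_size))
```

Early in generation the context is shorter than `block_size`, which is exactly the situation the shorter training examples prepare the model for.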

In [26]:
x = train_data[:block_size]
y = train_data[1:block_size + 1]

print(decode(x.tolist()))
print(x, y)

for t in range(block_size):
    context = x[:t + 1]
    target = y[t]
    
    print(f"when input is: {context} the target: {target}")

First Ci
tensor([18, 47, 56, 57, 58,  1, 15, 47]) tensor([47, 56, 57, 58,  1, 15, 47, 58])
when input is: tensor([18]) the target: 47
when input is: tensor([18, 47]) the target: 56
when input is: tensor([18, 47, 56]) the target: 57
when input is: tensor([18, 47, 56, 57]) the target: 58
when input is: tensor([18, 47, 56, 57, 58]) the target: 1
when input is: tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is: tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is: tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


In [None]:
def get_batch():
    pass
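
The `get_batch` stub above is left unimplemented. One way it might be filled in, following the random chunk sampling described under `block_size`, is sketched below. The signature, the `batch_size` parameter, and the toy data are assumptions made for illustration, not the notebook's eventual implementation.

```python
import torch

def get_batch(data, block_size, batch_size):
    # Random start offsets; the upper bound leaves room for the +1 shifted target
    starts = torch.randint(len(data) - block_size, (batch_size,)).tolist()
    # Inputs: batch_size chunks of block_size tokens each
    x = torch.stack([data[i:i + block_size] for i in starts])
    # Targets: the same chunks shifted one position to the right
    y = torch.stack([data[i + 1:i + block_size + 1] for i in starts])
    return x, y

torch.manual_seed(0)
toy_data = torch.arange(100)  # stand-in for train_data (assumption)
xb, yb = get_batch(toy_data, block_size=8, batch_size=4)
print(xb.shape, yb.shape)
```

Because the toy data is a simple range, every target equals its input plus one, which makes the one-position shift between `x` and `y` easy to verify by eye.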