To use my existing Python environment, I had to point Visual Studio Code to my environments folder.

In [2]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print(f"Length of dataset: {len(text)} characters.")

# There are a total of 65 unique characters in the dataset.
chars = sorted(list(set(text)))
print(len(chars))
print("".join(chars))

# We tokenize at the character level: every character in the vocabulary is represented
# by an integer. Sub-word tokenizers are also possible (ChatGPT uses tiktoken).
# First, build the character-to-integer and integer-to-character mappings as dictionaries.
chtoi = {ch:i for i,ch in enumerate(chars)}
itoch = {i:ch for i,ch in enumerate(chars)}

def encode(s):  
    return [chtoi[ch] for ch in s] # Take a string, output list of integers.

def decode(list_int):
    return "".join([itoch[i] for i in list_int]) # Take a list of integers, output string.

Length of dataset: 1115394 characters.
65

 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz


In [3]:
# Encode the entire "input.txt" and store it in a torch tensor.
import torch
data = torch.tensor(encode(text))

# The first 90% of the data is used for training, the remaining 10% for validation.
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]
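
As a quick sanity check of the mapping and the 90/10 split, the sketch below repeats the same steps on a toy string in plain Python. The names `toy_text`, `toy_chtoi`, `toy_itoch` and the toy corpus itself are assumptions made for illustration; they are not part of the notebook.

```python
# Toy corpus standing in for input.txt (an assumption for this sketch).
toy_text = "hello world"

# Character-level vocabulary, built the same way as in the cell above.
toy_chars = sorted(set(toy_text))
toy_chtoi = {ch: i for i, ch in enumerate(toy_chars)}
toy_itoch = {i: ch for i, ch in enumerate(toy_chars)}

ids = [toy_chtoi[ch] for ch in toy_text]        # encode: string -> list of integers
restored = "".join(toy_itoch[i] for i in ids)   # decode: list of integers -> string

# 90/10 split by position, mirroring train_data / val_data.
n_toy = int(0.9 * len(ids))
toy_train, toy_val = ids[:n_toy], ids[n_toy:]

print(ids)
print(restored == toy_text)
print(len(toy_train), len(toy_val))
```

Decoding the encoded string should always give back the original text; if it does not, the two dictionaries are out of sync.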

When training a transformer, we work only with random chunks sampled from the dataset.

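A chunk of 9 characters contains 8 training examples of increasing context length (1 to 8 tokens). The maximum context length we train with is given by block_size. Training on every context length is useful for inference, because the transformer gets used to predicting from contexts as short as one token and as long as block_size tokens. At inference time, inputs longer than block_size have to be split or cropped so that the model never sees more than block_size tokens at once.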
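
One common way to handle an input longer than block_size at inference time is to keep only the most recent block_size tokens. The sketch below assumes that cropping strategy; the helper `crop_context` is hypothetical and is not defined anywhere in the notebook.

```python
def crop_context(tokens, block_size):
    """Keep only the most recent block_size tokens (block_size must be > 0)."""
    return tokens[-block_size:]

block_size = 8
long_context = list(range(12))   # longer than block_size
short_context = [18, 47, 56]     # shorter than block_size, returned unchanged

print(crop_context(long_context, block_size))
print(crop_context(short_context, block_size))
```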

In [4]:
block_size = 8

print("CONTEXT")
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"When {context} is the context, the target is {target}.")

CONTEXT
When tensor([18]) is the context, the target is 47.
When tensor([18, 47]) is the context, the target is 56.
When tensor([18, 47, 56]) is the context, the target is 57.
When tensor([18, 47, 56, 57]) is the context, the target is 58.
When tensor([18, 47, 56, 57, 58]) is the context, the target is 1.
When tensor([18, 47, 56, 57, 58,  1]) is the context, the target is 15.
When tensor([18, 47, 56, 57, 58,  1, 15]) is the context, the target is 47.
When tensor([18, 47, 56, 57, 58,  1, 15, 47]) is the context, the target is 58.


In [5]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for prediction?

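def get_batch(split):
    """
    Return a context tensor X and a target tensor Y, each of shape (batch_size, block_size).
    """
    data = train_data if split == "train" else val_data
    # Random start offsets. high is exclusive, so i + block_size + 1 never exceeds len(data).
    ix = torch.randint(low=0, high=len(data) - block_size, size=(batch_size,))

    # Stack the 1-D chunks as rows. torch.hstack would instead concatenate them
    # into a single 1-D tensor of length batch_size * block_size.
    X = torch.stack([data[i:i+block_size] for i in ix])
    Y = torch.stack([data[i+1:i+block_size+1] for i in ix])

    return X, Y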

# BIGRAM

Bigram models are very simple. They use a look-up table and no context beyond the current character: only the current character is used to predict the next one.
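
To make the look-up table idea concrete, here is a minimal count-based sketch in plain Python. It is a stand-in for illustration, not the notebook's model: a neural bigram model would learn the table from data instead of counting, and the helpers `build_bigram_table` and `predict_next` are hypothetical names.

```python
from collections import Counter, defaultdict

def build_bigram_table(text):
    """Count, for every character, which characters follow it."""
    table = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        table[current][following] += 1
    return table

def predict_next(table, current):
    """Return the most frequent follower of `current`, or None if it has none."""
    followers = table.get(current)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

table = build_bigram_table("abababac")
print(predict_next(table, "a"))
print(predict_next(table, "b"))
```

The prediction depends only on the single character passed in, which is exactly the "no context" property described above.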

The objective of the generate() function is to extend a (batch_size, block_size) tensor of token indices along the time dimension by predicting additional tokens. Each generation step maps (B,T) -> (B,T+1).
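
The shape bookkeeping can be sketched without PyTorch, using nested lists in place of a (B,T) tensor. Everything here is an assumption for illustration: `toy_next_token` is a hypothetical stand-in for the model, and a real model would typically sample the next token from predicted probabilities rather than apply a fixed rule.

```python
def generate(batch, next_token, max_new_tokens):
    """Extend each row of a (B, T) batch to (B, T + max_new_tokens)."""
    rows = [list(row) for row in batch]  # copy so the input batch is untouched
    for _ in range(max_new_tokens):
        for row in rows:
            # A bigram model looks only at the last token of each row.
            row.append(next_token(row[-1]))
    return rows

# Hypothetical stand-in for the model: the next id is (current + 1) mod 65.
def toy_next_token(token_id):
    return (token_id + 1) % 65

batch = [[18, 47, 56], [1, 15, 47]]   # B = 2, T = 3
extended = generate(batch, toy_next_token, max_new_tokens=2)
print(extended)
```

Each pass of the outer loop adds one column, so after max_new_tokens passes every row has grown from T to T + max_new_tokens entries.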

min 38