## Contents
1. Generate a single toy batch from the Tiny Shakespeare dataset
2. Overfit on that batch

In [1]:
from gpt2 import GPT, GPTConfig # our GPT class
import tiktoken
import torch
import torch.nn as nn
import torch.nn.functional as F

In [2]:
enc = tiktoken.get_encoding('gpt2')
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # dynamic device

## Generating a batch

In [3]:
# read the input text file; note that newlines are stripped before tokenizing
with open('data/input.txt', 'r') as file:
    data = file.read().replace('\n', '')

tokens = enc.encode(data)
print(f'Length of tokens: {len(tokens)}')

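B, T = 4, 6  # batch size, sequence length
# inputs are the first B*T tokens; targets are the same window shifted by one token
x = torch.tensor(tokens[:B*T]).reshape(B, T)
y = torch.tensor(tokens[1:B*T+1]).reshape(B, T)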
print(f'input: \n{x}')
print(f'target: \n{y}')

Length of tokens: 297884
input: 
tensor([[ 5962, 22307,    25,  8421,   356,  5120],
        [  597,  2252,    11,  3285,   502,  2740],
        [   13,  3237,    25,  5248,   461,    11],
        [ 2740,    13,  5962, 22307,    25,  1639]])
target: 
tensor([[22307,    25,  8421,   356,  5120,   597],
        [ 2252,    11,  3285,   502,  2740,    13],
        [ 3237,    25,  5248,   461,    11,  2740],
        [   13,  5962, 22307,    25,  1639,   389]])
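The relationship between `input` and `target` above can be checked on a toy example. The sketch below uses plain Python lists and made-up token ids (the `tokens` values are hypothetical, not taken from the dataset) to show that each target is the token that follows the corresponding input, including across row boundaries.

```python
# Hypothetical token ids standing in for enc.encode(data)
tokens = list(range(10, 35))  # needs at least B*T + 1 entries

B, T = 4, 6
flat_x = tokens[:B * T]        # first B*T tokens
flat_y = tokens[1:B * T + 1]   # same window shifted by one

# Reshape the flat lists into B rows of T tokens each
x = [flat_x[r * T:(r + 1) * T] for r in range(B)]
y = [flat_y[r * T:(r + 1) * T] for r in range(B)]

# Within a row, the target at position t is the input at position t + 1
within_row_ok = all(y[b][t] == x[b][t + 1] for b in range(B) for t in range(T - 1))

# The last target of a row is the first input of the next row
across_rows_ok = all(y[b][T - 1] == x[b + 1][0] for b in range(B - 1))

print(within_row_ok, across_rows_ok)
```

The same pattern is visible in the real output: the first row of `target` ends in `597`, which is the first entry of the second row of `input`.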


## Overfitting on a single batch

In [4]:
model = GPT(GPTConfig()).to(device)  # instantiate the config rather than passing the class

In [5]:
B, T = 4, 32
x = torch.tensor(tokens[:B*T]).reshape(B, T).to(device)
y = torch.tensor(tokens[1:B*T+1]).reshape(B, T).to(device)

model.train()  # put the model in training mode
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
for i in range(100):
    optimizer.zero_grad()
    logits, loss = model(x, y)
    loss.backward()
    optimizer.step()
    print(f'Loss at iteration {i}: {loss.item()}')

Loss at iteration 0: 11.043309211730957
Loss at iteration 1: 8.613590240478516
Loss at iteration 2: 10.750606536865234
Loss at iteration 3: 8.073920249938965
Loss at iteration 4: 7.72338342666626
Loss at iteration 5: 7.411949157714844
Loss at iteration 6: 6.878050327301025
Loss at iteration 7: 6.415774345397949
Loss at iteration 8: 5.869007110595703
Loss at iteration 9: 5.304551601409912
Loss at iteration 10: 4.673495769500732
Loss at iteration 11: 4.278286457061768
Loss at iteration 12: 4.2021050453186035
Loss at iteration 13: 3.4737212657928467
Loss at iteration 14: 3.127682685852051
Loss at iteration 15: 2.6242685317993164
Loss at iteration 16: 2.2600083351135254
Loss at iteration 17: 1.910429835319519
Loss at iteration 18: 1.6248867511749268
Loss at iteration 19: 1.3340888023376465
Loss at iteration 20: 1.0824331045150757
Loss at iteration 21: 0.8567084074020386
Loss at iteration 22: 0.6526219844818115
Loss at iteration 23: 0.4932633936405182
Loss at iteration 24: 0.386574059724807
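As a sanity check on the first logged value, the cross-entropy of a uniform guess over the vocabulary gives a rough reference point for an untrained model. The sketch below assumes the standard GPT-2 vocabulary size of 50257 (the `vocab_size` name is an assumption about the config, not taken from the notebook).

```python
import math

vocab_size = 50257  # assumed: standard GPT-2 BPE vocabulary size

# Cross-entropy when every token gets probability 1 / vocab_size
uniform_loss = -math.log(1.0 / vocab_size)

first_logged_loss = 11.043309211730957  # iteration 0 from the log above
gap = first_logged_loss - uniform_loss

print(f'uniform baseline: {uniform_loss:.2f}, gap to first loss: {gap:.2f}')
```

The baseline comes out at roughly 10.82, so the iteration 0 loss of about 11.04 is in the range one would expect from a randomly initialized model.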

Why does tying the weights make the loss plateau at around `1.0`? Removing the weight tying allows the loss to go well below `1.0`.
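For reference, weight tying means the token embedding and the output projection share a single parameter matrix. The sketch below illustrates this in PyTorch under assumptions: the names `wte` and `lm_head` and the toy sizes are illustrative and may not match what the `GPT` class in `gpt2.py` does. It does not answer the question above; it only shows what changes when the tie is added or removed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, n_embd = 100, 16  # toy sizes, chosen only for illustration

wte = nn.Embedding(vocab_size, n_embd)               # weight shape: (vocab_size, n_embd)
lm_head = nn.Linear(n_embd, vocab_size, bias=False)  # weight shape: (vocab_size, n_embd)
pair = nn.ModuleDict({'wte': wte, 'lm_head': lm_head})

# Untied: two independent matrices
untied_params = sum(p.numel() for p in pair.parameters())

# Tied: both modules now reference the same Parameter object
lm_head.weight = wte.weight
tied_params = sum(p.numel() for p in pair.parameters())  # shared parameters are counted once

# One backward pass sends gradients from both uses into the shared matrix
idx = torch.tensor([[1, 2, 3]])
targets = torch.tensor([2, 3, 4])
logits = lm_head(wte(idx))  # (1, 3, vocab_size)
loss = F.cross_entropy(logits.view(-1, vocab_size), targets)
loss.backward()

print(lm_head.weight is wte.weight)
print(untied_params, tied_params)
```

With the tie in place, the optimizer updates one matrix that has to serve as both the input embedding and the output classifier, whereas the untied model can fit each role separately.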