# GPT - Part 1: Biogram
- Video: [Andrej Karpathy - Let's build GPT](https://www.youtube.com/watch?v=kCc8FmEb1nY&t=1413s)
- Papers
    - [Attention is All You Need paper](https://arxiv.org/abs/1706.03762)
    - [OpenAI GPT-3 paper](https://arxiv.org/abs/2005.14165) 
    - [OpenAI ChatGPT blog post](https://openai.com/blog/chatgpt/)

In [1]:
# core libraries
import torch
import torch.nn.functional as F
import torch.nn as nn

import numpy as np
import pandas as pd
import pyarrow as pa
import random
import seaborn as sns
import matplotlib.pyplot as plt

# matpolitlib config
%matplotlib inline

In [2]:
# !wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

In [3]:
# read all the words
with open('./source/input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
    
len(text)

1115394

## Simple encoder

Common example of more sofisticated encoders are:
- Google uses [SentencePiece](https://github.com/google/sentencepiece). SentencePiece implements subword units
  (e.g., byte-pair-encoding (BPE) [Sennrich et al.]) and unigram language model [Kudo.]).
- OpenAi uses [tiktoken](https://github.com/openai/tiktoken). tiktoken is a fast BPE tokeniser for use with
  OpenAI's models. Example code using tiktoken can be found in the [OpenAI Cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb). 

In [4]:
# build the vocabulary of characters and mappings to/from integers
chars = sorted(list(set(text)))
stoi = {ch:i for i,ch in enumerate(chars)}
itos = {i:ch for i,ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])
vocab_size = len(chars)
vocab_size

65

In [5]:
print(encode('This house'))
print(decode(encode('This house')))

[32, 46, 47, 57, 1, 46, 53, 59, 57, 43]
This house


### Bonus example using tiktokenizer (OpenAI)

In [6]:
import tiktoken

enc = tiktoken.get_encoding('gpt2')
print(enc.n_vocab)
print(enc.encode("The quick brown fox"))
print(enc.decode(enc.encode("The quick brown fox")))

enc = tiktoken.get_encoding('cl100k_base')
print(enc.n_vocab)
print(enc.encode("The quick brown fox"))
print(enc.decode(enc.encode("The quick brown fox")))

50257
[464, 2068, 7586, 21831]
The quick brown fox
100277
[791, 4062, 14198, 39935]
The quick brown fox


## Set the initial values
#### encode the text as tokens

In [7]:
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.type())
print(data[:10])

torch.Size([1115394]) torch.LongTensor
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47])


#### split the data to avoid overfitting

In [8]:
# split the data
n = int(0.9*len(data))
train_data = data[n:]
val_data = data[:n]

#### **block_size** is the fixed lenght blocks of data
- what is the minimum context length for predictions?
- how many characters do we look back on to predict the next

In [9]:
block_size = 8

#### **batch_size** is the number of sequential block sizes to run in parallel
- how many independant sequences will we process in parallel
- how many forward and backward passes in the training. Torch sorts the parallelism.

In [10]:
batch_size = 4

#### set the global random seed for selecting batches from the text
This is only to repeat same results

In [11]:
torch.manual_seed(1337)

<torch._C.Generator at 0x7fc617344690>

## Get the batch split
all the batch from 1 to batch_size (x) and all their targets (y)

In [12]:
def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i: i+block_size] for i in ix])
    y = torch.stack([data[i+1: i+block_size+1] for i in ix])
    return x, y

# get a sigle train block
xb, yb = get_batch('train')

#### what we have for each block
- block size of 4 by 8
- note the target is the previous contect together as above
- [1, 47] together in this order has a target 57
- likewise [1, 47, 57] has a target 1

In [13]:
print('inputs')
print(xb.shape)
print(xb)
print('targets')
print(yb.shape)
print(yb)
print('----------')

inputs
torch.Size([4, 8])
tensor([[ 6,  1, 52, 53, 58,  1, 58, 47],
        [ 6,  1, 54, 50, 39, 52, 58, 43],
        [ 1, 58, 46, 47, 57,  1, 50, 47],
        [ 0, 32, 46, 43, 56, 43,  1, 42]])
targets
torch.Size([4, 8])
tensor([[ 1, 52, 53, 58,  1, 58, 47, 50],
        [ 1, 54, 50, 39, 52, 58, 43, 58],
        [58, 46, 47, 57,  1, 50, 47, 60],
        [32, 46, 43, 56, 43,  1, 42, 53]])
----------


#### what we have for one batch block
- batch size of 8 (time) the block size of 4 
- this is an indepentant batch over time for 4 repeating blocks so we don't do this for all text when training
- note the target represents the context tokens in this order and this size only
- 6 followed by 1 followed by 52 has a probability that the next token will be 53
- the target comes in at the end (ouput layer) to indicate the loss (how far were we from getting it right)

In [14]:
for b in range(batch_size): # batch timention
    for t in range(block_size): # time dimention
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()}    the target is {target}")

when input is [6]    the target is 1
when input is [6, 1]    the target is 52
when input is [6, 1, 52]    the target is 53
when input is [6, 1, 52, 53]    the target is 58
when input is [6, 1, 52, 53, 58]    the target is 1
when input is [6, 1, 52, 53, 58, 1]    the target is 58
when input is [6, 1, 52, 53, 58, 1, 58]    the target is 47
when input is [6, 1, 52, 53, 58, 1, 58, 47]    the target is 50
when input is [6]    the target is 1
when input is [6, 1]    the target is 54
when input is [6, 1, 54]    the target is 50
when input is [6, 1, 54, 50]    the target is 39
when input is [6, 1, 54, 50, 39]    the target is 52
when input is [6, 1, 54, 50, 39, 52]    the target is 58
when input is [6, 1, 54, 50, 39, 52, 58]    the target is 43
when input is [6, 1, 54, 50, 39, 52, 58, 43]    the target is 58
when input is [1]    the target is 58
when input is [1, 58]    the target is 46
when input is [1, 58, 46]    the target is 47
when input is [1, 58, 46, 47]    the target is 57
when input i

## Build a simple Neural Network

In [15]:
class BiogramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (Batch Time Container)
        if targets is None:
            loss = None
        else:
            # torch cross_entropt expects B C T if three dimentional so make it 2 dimentions
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            # targets to match logits
            targets = targets.view(B*T)
            # cross_entropy of output (logits) against target labels
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is the (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # use softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # returns (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # returns (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # returns (B, T+1)
        return idx


m = BiogramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

print(decode(m.generate(torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))
        

torch.Size([32, 65])
tensor(4.7895, grad_fn=<NllLossBackward0>)

l-QYjt'CL?jLDuQcLzy'RIo;'KdhpV
vLixa,nswYZwLEPS'ptIZqOZJ$CA$zy-QTkeMk x.gQSFCLg!iW3fO!3DGXAqTsq3pdgq


## Train the model
#### create the optimizer
- Before we used Stocastic Gradient Decent, which is a very simple optimizer.
- Now we are going to use Adam
    - considered the best optimizer
    - learning rate best at 3e-4
    - simpler networks we can get away with higher learning rates

In [16]:
# optimizer = torch.optim.SGD(m.parameters(), lr=1e-3)
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

#### typical training loop
- reset the batch_size to 32

In [17]:
batch_size = 32

for steps in range(10000):
    # sample a batch of data
    xb, yb = get_batch('train')
    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

    # track stats
    if steps % 1000 == 0: 
        print(f'{loss.item():.4f}')

    

4.6662
3.6984
3.0699
2.7992
2.5873
2.5084
2.3623
2.3209
2.4716
2.3864


#### optimizer results
- Better looking results than 1 run
- Not great because using the simples type of model

In [18]:
print(decode(m.generate(torch.zeros((1, 1), dtype=torch.long), max_new_tokens=200)[0].tolist()))


Ong h hasbe pave pirance
GRO:
Bagathathar's we!
PeKAd ith henoangincenonthioneir thoniteay heltieiengerofo'PTIsit ey
KANld pe wisher ve pllouthers nowl t,
Kay ththind tt hinio t ouchos tes; sw yo hind
