# GPT From Scratch

Based on the theory in the paper ['Attention Is All You Need'](https://arxiv.org/pdf/1706.03762).

The **dataset** used in this project: [Tiny Shakespeare](https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt), a single text file of about 1.1 million characters drawn from Shakespeare's works.

The **goal** of this model is to predict the next token (a single character in our case) based on the given context.

In [64]:
# all the necessary imports in one place
import wget

import numpy as np

import torch
import torch.nn as nn
from torch.nn import functional as F

In [9]:
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
wget.download(url)

100% [......................................................] 1115394 / 1115394

'input (1).txt'

## Importing and Preprocessing necessary data

In [10]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [13]:
print(text[:300])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us


In [11]:
print('Length of dataset in characters:', len(text))

Length of dataset in characters: 1115394


Let's obtain all the unique characters in our dataset, sorted. <br>
(When printing, the characters are joined with `' '` so that the whitespace characters at the start of the vocabulary, the newline and the blank space, are visible. They count as characters in the vocabulary too.)

### Tokenization

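To keep things simple we use a character-level tokenizer, where every distinct character is one token. Real GPTs mainly use subword tokenizers such as BPE (byte-pair encoding), which trade a larger vocabulary for shorter sequences.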
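To give a feel for what BPE does differently, here is a minimal sketch of a single merge step on a toy string. It is only an illustration under simplifying assumptions: the helper names `most_frequent_pair` and `merge_pair` are hypothetical, and production tokenizers work on bytes and repeat this step many times.

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count every pair of adjacent tokens and return the most common one
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    # Replace each non-overlapping occurrence of `pair` with one merged token
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

toy_tokens = list("speak, speak.")        # character-level starting point
best_pair = most_frequent_pair(toy_tokens)
toy_tokens_after = merge_pair(toy_tokens, best_pair)

print(best_pair)
print(toy_tokens_after)
```

After one merge the sequence is shorter and the vocabulary has gained one new symbol. A character-level tokenizer never performs this step.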

In [56]:
chars = sorted(list(set(text)))
vocab_size = len(chars)
print('All the characters present in the dataset: ', "' '".join(chars))
print('Vocabulary Length : ', len(chars))

All the characters present in the dataset:  
' ' ' '!' '$' '&' ''' ',' '-' '.' '3' ':' ';' '?' 'A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O' 'P' 'Q' 'R' 'S' 'T' 'U' 'V' 'W' 'X' 'Y' 'Z' 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o' 'p' 'q' 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z
Vocabulary Length :  65


### Vectorization

In this section we turn our string into an integer representation by defining the two halves of the tokenizer (not to be confused with the encoder and decoder stacks of the Transformer architecture):
- Encoder : takes a string and outputs a list of integers
- Decoder : takes a list of integers and outputs the corresponding string

In [20]:
stoi = {ch:i for i, ch in enumerate(chars)}
itos = {i:ch for i, ch in enumerate(chars)}

encoder = lambda seq: [stoi[char] for char in seq]
decoder = lambda int_list: ''.join([itos[i] for i in int_list])

In [24]:
print(encoder('hello there'))
print(decoder([46, 43, 50, 50, 53, 1, 58, 46, 43, 56, 43]))

[46, 43, 50, 50, 53, 1, 58, 46, 43, 56, 43]
hello there
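The mapping is a plain lookup table, so decoding an encoded string must give back the original, and any character missing from the vocabulary cannot be encoded at all. The sketch below checks both on a toy corpus. The names `sample_text`, `encode` and `decode` are illustrative stand-ins, not the notebook's variables.

```python
# Toy corpus standing in for the real dataset (illustrative assumption)
sample_text = "First Citizen:\nSpeak, speak."

sample_chars = sorted(set(sample_text))
sample_stoi = {ch: i for i, ch in enumerate(sample_chars)}
sample_itos = {i: ch for ch, i in sample_stoi.items()}

def encode(seq):
    # string -> list of integer ids
    return [sample_stoi[ch] for ch in seq]

def decode(ids):
    # list of integer ids -> string
    return "".join(sample_itos[i] for i in ids)

round_trip = decode(encode("speak"))
print(round_trip == "speak")

# A character that never appears in the corpus has no id
try:
    encode("q")
    unknown_rejected = False
except KeyError:
    unknown_rejected = True
print(unknown_rejected)
```

With the full dataset the same limitation applies: only the 65 characters found in the text can be encoded.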


Let's apply this to the entire dataset:

In [32]:
data = torch.tensor(encoder(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:100])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


Let's split the data into train and validation sets:
- 90% train
- 10% validation

In [35]:
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]
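The split is a plain slice, so the validation set is the final 10% of the text rather than a shuffled sample. As a rough sketch, the arithmetic can be checked without PyTorch, using the dataset length printed earlier and a small list as a stand-in for the tensor (`toy_data` is an illustrative name):

```python
total_chars = 1115394                 # dataset length reported above
n_train = int(0.9 * total_chars)      # int() truncates towards zero
n_val = total_chars - n_train
print(n_train, n_val)

# Same slicing on a toy sequence
toy_data = list(range(20))
n_toy = int(0.9 * len(toy_data))
toy_train, toy_val = toy_data[:n_toy], toy_data[n_toy:]
print(toy_train[-1], toy_val)
```

The two slices do not overlap and together cover the whole sequence.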

Key point: we are not going to feed all the data to the transformer at once, since that would be too computationally expensive. Instead, we sample random chunks of data with a maximum length (block_size) and train on those.

In [36]:
block_size = 8
train_data[:block_size + 1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

With block_size = 8, a chunk of block_size + 1 = 9 tokens gives the model 8 predictions to make, so in our example: <br>
given 18, predict 47, <br>
given 18, 47, predict 56, etc. <br>

In [43]:
x = train_data[:block_size]
y = train_data[1:block_size + 1]

for t in range(block_size): # t for time dimension
    context = x[:t+1]
    target = y[t]
    print(f'When input is {context} the target: {target}')

When input is tensor([18]) the target: 47
When input is tensor([18, 47]) the target: 56
When input is tensor([18, 47, 56]) the target: 57
When input is tensor([18, 47, 56, 57]) the target: 58
When input is tensor([18, 47, 56, 57, 58]) the target: 1
When input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
When input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
When input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


The subtlety of building the context this way is that the transformer sees not only the last character and the target, but also the characters that came before it, so it learns to predict from contexts of every length from 1 up to block_size.<br><br>
We will also stack several such chunks into a batch of shape (BATCH_SIZE, BLOCK_SIZE).

In [44]:
torch.manual_seed(1337)
BATCH_SIZE = 4
BLOCK_SIZE = 8

In [47]:
def get_batch(split):
    # generate a batch of data of inputs x and target y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - BLOCK_SIZE, (BATCH_SIZE, ))
    x = torch.stack([data[i:i + BLOCK_SIZE] for i in ix])
    y = torch.stack([data[i + 1:i + BLOCK_SIZE + 1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:', xb.shape)
print(xb)
print('targets:', yb.shape)
print(yb)

inputs: torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets: torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
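One detail worth checking is the upper bound `len(data) - BLOCK_SIZE` passed to `torch.randint`. The bound is exclusive, so the largest start index still leaves room for the target window, which is shifted one position to the right. The sketch below reproduces the same indexing in plain Python. `sample_windows` is a hypothetical helper, and `random.Random.randrange` stands in for `torch.randint`.

```python
import random

def sample_windows(data, block_size, batch_size, rng):
    # Start indices lie in [0, len(data) - block_size - 1] (upper bound exclusive)
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    xs = [data[i:i + block_size] for i in starts]
    ys = [data[i + 1:i + block_size + 1] for i in starts]  # shifted by one
    return xs, ys

toy_data = list(range(12))
xs, ys = sample_windows(toy_data, block_size=8, batch_size=4, rng=random.Random(0))

for x, y in zip(xs, ys):
    print(x, "->", y)
```

Every target row is the input row shifted by one position, and no slice runs past the end of the data.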


In [49]:
for b in range(BATCH_SIZE):
    for t in range(BLOCK_SIZE):
        context = xb[b, :t+1]
        target = yb[b, t]
        print(f"when input is {context.tolist()} the target is {target}")

when input is [24] the target is 43
when input is [24, 43] the target is 58
when input is [24, 43, 58] the target is 5
when input is [24, 43, 58, 5] the target is 57
when input is [24, 43, 58, 5, 57] the target is 1
when input is [24, 43, 58, 5, 57, 1] the target is 46
when input is [24, 43, 58, 5, 57, 1, 46] the target is 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target is 39
when input is [44] the target is 53
when input is [44, 53] the target is 56
when input is [44, 53, 56] the target is 1
when input is [44, 53, 56, 1] the target is 58
when input is [44, 53, 56, 1, 58] the target is 46
when input is [44, 53, 56, 1, 58, 46] the target is 39
when input is [44, 53, 56, 1, 58, 46, 39] the target is 58
when input is [44, 53, 56, 1, 58, 46, 39, 58] the target is 1
when input is [52] the target is 58
when input is [52, 58] the target is 1
when input is [52, 58, 1] the target is 58
when input is [52, 58, 1, 58] the target is 46
when input is [52, 58, 1, 58, 46] the target is 39
[output truncated]

Before creating the GPT, we will first try the dataset on lighter models.

#### Bigram Model

We'll start with the Bigram model, which predicts the next character based only on the previous one.
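To make the idea concrete before the neural version, here is a count-based sketch of a bigram model on a toy string. This is a different technique from the learned lookup table defined next: it estimates next-character probabilities by counting, and names such as `toy_text` and `next_char_probs` are illustrative.

```python
from collections import Counter, defaultdict

toy_text = "speak, speak."

# counts[prev][nxt] = how often character `nxt` follows character `prev`
counts = defaultdict(Counter)
for prev, nxt in zip(toy_text, toy_text[1:]):
    counts[prev][nxt] += 1

def next_char_probs(prev):
    # Normalise the counts of one row into a probability distribution
    total = sum(counts[prev].values())
    if total == 0:
        return {}  # `prev` was never followed by anything in the toy text
    return {ch: c / total for ch, c in counts[prev].items()}

print(next_char_probs("s"))
print(next_char_probs("k"))
```

Each row of the table depends on one previous character only, which is exactly the limitation the transformer is meant to lift.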

In [73]:
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token reads off the logits for the next token directly from the lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx and targets are both (B, T) tensors of integers
        logits = self.token_embedding_table(idx) # scores for the next character, shape (B, T, C)

        if targets is None:
            # no targets at generation time: keep logits as (B, T, C) and skip the loss
            return logits, None

        # F.cross_entropy expects (N, C) logits and (N,) targets, so flatten batch and time
        B, T, C = logits.shape
        logits = logits.view(B * T, C)
        targets = targets.view(B * T)

        loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is a (B, T) tensor of token indices forming the current context
        for _ in range(max_new_tokens):
            logits, _ = self(idx) # (B, T, C)
            logits = logits[:, -1, :] # keep only the last time step: (B, C)
            probs = F.softmax(logits, dim=-1) # softmax over the vocabulary to get probabilities
            idx_next = torch.multinomial(probs, num_samples=1) # shape (B, 1)
            idx = torch.cat((idx, idx_next), dim=1) # append the sampled token: (B, T + 1)

        # return only after all max_new_tokens have been sampled
        return idx
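The core of `generate` is a two-step loop: turn the last position's logits into probabilities with softmax, then draw the next token from that distribution. Below is a minimal pure-Python sketch of that step. The logits are made-up numbers and stay fixed for brevity, whereas the real model recomputes them from the growing context, and `random.Random.choices` stands in for `torch.multinomial`.

```python
import math
import random

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, rng):
    # Draw one token index according to the softmax probabilities
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(1337)
toy_logits = [2.0, 0.5, -1.0]   # hypothetical scores for a 3-token vocabulary
context = [0]                   # start from a single token
for _ in range(5):
    context.append(sample_next(toy_logits, rng))

print(softmax(toy_logits))
print(context)
```

Sampling, rather than always taking the most likely token, is what lets the generated text vary from run to run.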

In [74]:
bigram_model = BigramLanguageModel(vocab_size)
logits, loss = bigram_model(xb, yb)

print(logits.shape)
print(loss)

TypeError: cannot unpack non-iterable NoneType object
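The traceback above is what Python raises when a call returns `None` and the result is unpacked into two names. Once `forward` returns a `(logits, loss)` pair, the flattened logits for this batch should have shape (B * T, C) = (32, 65). A useful sanity check for the loss is the value a model would get by spreading probability evenly over the vocabulary, which is ln(65). A freshly initialised embedding table is not uniform, so its loss can be expected to come out somewhat higher. The sketch below computes that reference value by hand. `cross_entropy` here is an illustrative pure-Python function, not `F.cross_entropy`.

```python
import math

def cross_entropy(logits, target):
    # -log softmax(logits)[target], computed with a stable log-sum-exp
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

vocab_size = 65                       # vocabulary length reported earlier
uniform_logits = [0.0] * vocab_size   # equal scores -> uniform probabilities
uniform_loss = cross_entropy(uniform_logits, target=0)

B, T = 4, 8                           # BATCH_SIZE and BLOCK_SIZE from above
flat_shape = (B * T, vocab_size)

print(flat_shape)
print(uniform_loss, math.log(vocab_size))
```

A training loss that falls below this reference value indicates the model has learned more than a uniform guess.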