# Nano GPT

Nano GPT implementation by Andrej Karpathy.

* [nanoGPT](https://github.com/karpathy/nanoGPT)
* [Let's build GPT: from scratch, in code, spelled out.](https://www.youtube.com/watch?v=kCc8FmEb1nY)
* [nanogpt-lecture](https://github.com/karpathy/ng-video-lecture)


In [1]:
%%html
<style>
table {float:left}
</style>

In [2]:
%load_ext autoreload
%autoreload 2


import inspect
import torch
from torch import nn
from torch.nn import functional as F
from bigram import (
    V,
    B,
    T,
    C,
    get_batch,
)

# Data

The dataset is tinyshakespeare, downloaded from the char-rnn repository.

In [3]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2023-12-10 10:44:41--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.2’


2023-12-10 10:44:42 (7.24 MB/s) - ‘input.txt.2’ saved [1115394/1115394]



# Terminology

* B: Batch size
* T: Time steps, i.e. sequence length (e.g. 512 for a BERT input sequence)
* C: Channels, i.e. features per token (the name "channel" is perhaps a carry-over from Andrej's CNN background). For example, ```C=2``` means each x has two features.
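
To make the three axes concrete, here is a minimal sketch of a tensor laid out as `(B, T, C)`. The sizes are illustrative values chosen for this example, not the notebook's settings.

```python
import torch

# Illustrative sizes only (assumption): 4 sequences, 8 time steps, 2 features.
B, T, C = 4, 8, 2

# One batch: B sequences, each with T time steps, each step a C-dimensional vector.
x = torch.zeros(B, T, C)

print(x.shape)        # torch.Size([4, 8, 2])
print(x[0].shape)     # one sequence:  torch.Size([8, 2])
print(x[0, 0].shape)  # one time step: torch.Size([2])
```

Indexing the first axis selects a sequence from the batch, and indexing the second axis selects a single token's feature vector.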

## Batch Input

<img src="./image/gpt_batch.jpeg" align="left" width=750/>



In [4]:
print(inspect.getsource(get_batch))

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y



In [5]:
print(B,T,C)
x, y = get_batch('train')
x.shape

32 8 65


torch.Size([32, 8])
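
The relation between inputs and targets in `get_batch` can be seen on a toy example. The sketch below repeats the same slicing logic on a stand-in dataset (`torch.arange(20)`, an assumption made purely for illustration) and shows that each target row is the input row shifted by one position.

```python
import torch

torch.manual_seed(0)

# Toy stand-in for the encoded text (assumption): token ids 0..19.
data = torch.arange(20)
block_size, batch_size = 8, 4

# Same slicing as get_batch: random start offsets, then two windows one step apart.
ix = torch.randint(len(data) - block_size, (batch_size,))
x = torch.stack([data[i:i + block_size] for i in ix])
y = torch.stack([data[i + 1:i + block_size + 1] for i in ix])

# Each target row is the input row shifted left by one position.
assert torch.equal(x[:, 1:], y[:, :-1])

# One row of x therefore yields block_size training examples:
# the context x[0, :t+1] is used to predict the next token y[0, t].
for t in range(block_size):
    context = x[0, :t + 1].tolist()
    target = y[0, t].item()
    print(f"context {context} -> target {target}")
```

So a batch of shape `(B, T)` packs `B * T` next-token prediction examples.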

# Transformer

A Transformer builds a graph of connections between position-encoded tokens.

1. Take unconnected tokens as a sequence (e.g. a sentence).
2. Wire connections among the tokens, using weights learned from the co-occurrences of tokens across billions of sequences.

## Q and K (Similarity Score)

For every token in a sequence, its query ```Q``` is compared with the key ```K``` of the other tokens in the sequence to score how strongly they relate or communicate (for GPT, only the current and previous tokens). These scores form the graph of Self Attention.
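
As a rough sketch of the idea, the score between two tokens is the dot product of one token's query with the other token's key, and the GPT-style restriction to earlier positions is a lower-triangular mask. The tensors `q` and `k` below are random placeholders (an assumption for illustration), not learned projections.

```python
import torch

torch.manual_seed(0)

# Illustrative sizes (assumption): 4 tokens, 3-dimensional queries and keys.
T, head_size = 4, 3
q = torch.randn(T, head_size)  # one query vector per token
k = torch.randn(T, head_size)  # one key vector per token

# All pairwise scores at once: (T, head_size) @ (head_size, T) -> (T, T)
score = q @ k.T

# Entry [t, s] is the dot product of token t's query with token s's key.
assert torch.allclose(score[2, 1], q[2] @ k[1], atol=1e-6)

# GPT-style causal mask: token t may look only at tokens s <= t.
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
masked = score.masked_fill(~causal, float('-inf'))
print(masked)
```

Row `t` of `masked` holds the edges from token `t` to itself and to every earlier token; later tokens are cut off with `-inf`.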

## V (BoW/Bag of Words)

One way to connect the tokens and distill their knowledge or relations is ```BoW```: average the vectors of the current and previous tokens, feature by feature.

<img src="./image/self_attention.jpeg" align="left" width=700/>
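
The bag-of-words average can be sketched in two equivalent ways: an explicit loop over positions, and a single matrix multiplication with a row-normalized lower-triangular matrix. The input `x` is random and the sizes are illustrative (assumptions for this example).

```python
import torch

torch.manual_seed(0)

# Illustrative sizes (assumption).
B, T, C = 2, 4, 3
x = torch.randn(B, T, C)

# Version 1: explicit loop. Position t gets the mean of tokens 0..t.
xbow = torch.zeros(B, T, C)
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, :t + 1].mean(dim=0)

# Version 2: the same average as a matrix multiply.
# Each row t of `wei` has weight 1/(t+1) on positions 0..t and 0 elsewhere.
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(dim=-1, keepdim=True)
xbow2 = wei @ x  # (T, T) @ (B, T, C) -> (B, T, C)

assert torch.allclose(xbow, xbow2, atol=1e-6)
```

Self-attention keeps the matrix-multiply form but replaces the uniform weights in `wei` with data-dependent ones.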

### Similarity Score (Q & K)

|Similarity Score (Q & K)| Probability as Softmax |
|---|---|
|<img src="./image/transformer_dot_product_attention_similarity_score.jpeg" align="left" width=500/>|<img src="./image/transformer_dot_product_attention.png" align="left" width=175/>|

In [6]:
# let's see a single Head perform self-attention
torch.manual_seed(1337)

# B: batch size
# T: time steps, i.e. number of tokens in a sequence (sequence length)
# C: channels, i.e. embedding vector dimension (features per token)
B,T,C = 4,8,32 # batch, time, channels
head_size = 16
    
Wk = nn.Linear(C, head_size, bias=False)
Wq = nn.Linear(C, head_size, bias=False)

def calculate_similarity_score(x):
    k = Wk(x)   # (B, T, head_size)
    q = Wq(x)   # (B, T, head_size)
    
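    # First MatMul: (B, T, head_size) @ (B, head_size, T) ---> (B, T, T)
    # Note: the scores are left unscaled here; scaled dot-product attention
    # would additionally multiply them by head_size ** -0.5.
    score = q @ k.transpose(-2, -1)

    # Causal mask: position t may attend only to positions <= t
    tril = torch.tril(torch.ones(T, T))
    score = score.masked_fill(tril == 0, float('-inf'))
    # Softmax over the last axis turns each row into probabilities
    score = F.softmax(score, dim=-1)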

    return score    # shape:(B, T, T)

In [7]:
x = torch.randn(B,T,C)

similarity_score = calculate_similarity_score(x)
similarity_score.shape

torch.Size([4, 8, 8])
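
Two properties of masked-softmax attention weights are worth checking: every row sums to 1, and all weight on future positions is exactly 0. The sketch below recomputes the weights inline so that it runs on its own; the variable names are chosen for this example.

```python
import torch
from torch import nn
from torch.nn import functional as F

torch.manual_seed(1337)

B, T, C, head_size = 4, 8, 32, 16
Wk = nn.Linear(C, head_size, bias=False)
Wq = nn.Linear(C, head_size, bias=False)
x = torch.randn(B, T, C)

# Same steps as the cell above: scores, causal mask, softmax.
score = Wq(x) @ Wk(x).transpose(-2, -1)          # (B, T, T)
tril = torch.tril(torch.ones(T, T))
weights = F.softmax(score.masked_fill(tril == 0, float('-inf')), dim=-1)

# 1. Each row is a probability distribution over the visible tokens.
row_sums = weights.sum(dim=-1)                   # (B, T)
assert torch.allclose(row_sums, torch.ones(B, T), atol=1e-6)

# 2. No weight leaks to future positions (strict upper triangle is zero).
future_weight = torch.triu(weights, diagonal=1).abs().sum()
assert future_weight.item() == 0.0

# 3. The first token can see only itself, so its single weight is 1.
assert torch.allclose(weights[:, 0, 0], torch.ones(B))
```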

### BoW (V)

<img src="./image/transformer_dot_product_attention_bow.png" align="left" width=700/>  

In [8]:
Wv = nn.Linear(C, head_size, bias=False)

def calculate_attention_value(score, x):
    v = Wv(x)            # (B,T,C) @ (C,head_size) -> (B,T,head_size)
    value = score @ v    # (B,T,T) @ (B,T,head_size) -> (B,T,head_size)

    return value         # (B,T,head_size)

In [10]:
attention_value = calculate_attention_value(similarity_score, x)
attention_value.shape

torch.Size([4, 8, 16])
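
Putting the two steps together, a single causal attention head can be sketched as one function. This is a minimal illustration that follows the cells above (unscaled scores, no dropout); the helper name `head` is hypothetical. It also checks the causal property: changing the last token must leave the outputs of all earlier positions unchanged.

```python
import torch
from torch import nn
from torch.nn import functional as F

torch.manual_seed(1337)

B, T, C, head_size = 4, 8, 32, 16
Wq = nn.Linear(C, head_size, bias=False)
Wk = nn.Linear(C, head_size, bias=False)
Wv = nn.Linear(C, head_size, bias=False)


def head(x):
    """Hypothetical helper: one causal self-attention head, (B,T,C) -> (B,T,head_size)."""
    q, k, v = Wq(x), Wk(x), Wv(x)
    score = q @ k.transpose(-2, -1)                      # (B, T, T)
    future = torch.tril(torch.ones(x.shape[1], x.shape[1])) == 0
    weights = F.softmax(score.masked_fill(future, float('-inf')), dim=-1)
    return weights @ v                                   # (B, T, head_size)


x = torch.randn(B, T, C)
out = head(x)
print(out.shape)  # torch.Size([4, 8, 16])

# The first token attends only to itself, so its output equals its own value vector.
assert torch.allclose(out[:, 0], Wv(x)[:, 0], atol=1e-6)

# Causality: replacing the last token does not affect earlier positions.
x2 = x.clone()
x2[:, -1] = torch.randn(B, C)
out2 = head(x2)
assert torch.allclose(out[:, :-1], out2[:, :-1], atol=1e-6)
```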