<a href="https://colab.research.google.com/github/joshmcadams/nanoGPT/blob/main/nanoGPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# nanoGPT

nanoGPT is a wonderful exploration into building a GPT model from scratch. We'll start with an empty notebook and quickly build and train a model to generate Shakespeare-like text.

This notebook is based on code and videos released by Andrej Karpathy. Please be sure to check out his work!

  * [Reference Video Lecture](https://www.youtube.com/watch?v=kCc8FmEb1nY)
  * [Reference Git Repository](https://github.com/karpathy/ng-video-lecture)

# Data Exploration and Preparation

## Acquiring the data

We'll first pull the data down into this lab using the `wget` shell command.

In [1]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-05-11 18:18:33--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-05-11 18:18:33 (15.1 MB/s) - ‘input.txt’ saved [1115394/1115394]



Next we'll load the data into Python and take a look at a sample.

First, we simply open the file and store the entire contents of the file in the variable `raw_training_data`. We get the length of the data and print that out so that we can see how much data we are dealing with.

In [2]:
with open('input.txt', 'r', encoding='utf-8') as f:
    raw_training_data = f.read()

training_data_size = len(raw_training_data)
print(training_data_size)

1115394


Next we sample the data just to get an idea of what we are dealing with. The sample comes from a random location in the data. Run the code block below a few times to see some different data samples.

In [3]:
import random

SAMPLE_SIZE = 500
start = random.randrange(0, training_data_size - SAMPLE_SIZE)
print(raw_training_data[start:start+SAMPLE_SIZE])

, ere I go, Hastings and Montague,
Resolve my doubt. You twain, of all the rest,
Are near to Warwick by blood and by alliance:
Tell me if you love Warwick more than me?
If it be so, then both depart to him;
I rather wish you foes than hollow friends:
But if you mind to hold your true obedience,
Give me assurance with some friendly vow,
That I may never have you in suspect.

MONTAGUE:
So God help Montague as he proves true!

HASTINGS:
And Hastings as he favours Edward's cause!

KING EDWARD IV:
No


## Data Analysis

We've now acquired the data and even poked around a bit to see what the data looks like. However, we need to dig deeper and do some more detailed analysis on the data.

One important fact to know about data that will be fed into a generative model is how many different characters we are working with. This is easy to find in Python using the built-in `set` type.

In [4]:
tokens = sorted(list(set(raw_training_data)))
token_count = len(tokens)

print(f'There are {token_count} unique characters in the {training_data_size} ' +
      f'characters of training data. The characters are: {"".join(tokens)}')

There are 65 unique characters in the 1115394 characters of training data. The characters are: 
 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
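
As one way to dig a little deeper, a frequency count shows how unevenly the characters are distributed. The following is a minimal sketch that uses only the first two lines of the dataset as a stand-in sample; the names `sample_text` and `char_counts` are illustrative and not part of the original notebook.

```python
from collections import Counter

# The opening two lines of the dataset, used here as a small stand-in sample.
sample_text = "First Citizen:\nBefore we proceed any further, hear me speak."

# Counter tallies how many times each character occurs.
char_counts = Counter(sample_text)

# Show the five most frequent characters with their counts.
for ch, count in char_counts.most_common(5):
    print(repr(ch), count)
```

Running the same count on `raw_training_data` would give the distribution for the full corpus.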


## Encoding

We now know how many unique tokens exist in our training data. In our case, these tokens are individual characters, but this is a choice that we have made. We could alternatively have chosen pairs, triplets, or some other combination of characters to be a token.
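
To illustrate the alternative, here is a minimal sketch that splits a short sample string into two-character tokens instead of single characters. The sample string and the helper name `split_into_pairs` are hypothetical and are not used elsewhere in this notebook.

```python
def split_into_pairs(text: str) -> list[str]:
    """Split text into non-overlapping two-character tokens (hypothetical helper)."""
    # The final token is a single character when the length is odd.
    return [text[i:i + 2] for i in range(0, len(text), 2)]

sample = "First Citizen"  # illustrative sample, not the full dataset

# Vocabulary when each character is a token.
char_tokens = sorted(set(sample))
# Vocabulary when each pair of characters is a token.
pair_tokens = sorted(set(split_into_pairs(sample)))

print(len(char_tokens), char_tokens)
print(len(pair_tokens), pair_tokens)
```

Pair tokens make each encoded sequence shorter, but on a full corpus the number of distinct pairs can grow far beyond the number of distinct characters.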

Regardless of the data that comprises our tokens, the token itself isn't what we feed to the model. Instead, we need to convert the token into a numeric representation since models are performing numeric calculations internally.

For our character-based tokens we could just use the character's code point (its ASCII value for this dataset), which the `ord` function returns. However, you might eventually want to experiment with sequences of characters as tokens, and in that case you'll need a more complex token-to-number mapping, so let's just create a token-to-number mapping scheme now.
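
For comparison, this small sketch prints the `ord` values of a handful of characters from the vocabulary. The list `sample_tokens` is an illustrative selection, not the full set of 65 characters.

```python
# A few characters that appear in the dataset's vocabulary.
sample_tokens = ['\n', ' ', '!', 'A', 'a', 'z']

# ord returns the Unicode code point of a single character.
for ch in sample_tokens:
    print(repr(ch), ord(ch))

codes = [ord(ch) for ch in sample_tokens]
print(min(codes), max(codes))
```

The code points are spread over a range with gaps, while an index-based mapping numbers the tokens contiguously starting at 0.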

Since we ordered our unique tokens into the `tokens` list, we can just use the index of the token in the list as the encoding, which is what we do.

Note that this does come with some trade-offs, though. The position of any given token in the list depends on which unique tokens appear in the training data. If we use different training data, the tokens might map to different indexes, so we need to be careful to preserve our token list across training data sets if we are using multiple pieces of training data.
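
The trade-off can be seen on two tiny strings. This is a minimal sketch; `build_vocab` is a hypothetical helper that mirrors the sorted-index scheme described here, and the two strings are made up for illustration.

```python
def build_vocab(text: str) -> dict[str, int]:
    """Map each unique character to its index in sorted order (hypothetical helper)."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}

vocab_a = build_vocab("hello")  # unique characters: e, h, l, o
vocab_b = build_vocab("help!")  # unique characters: !, e, h, l, p

print(vocab_a)
print(vocab_b)

# The same character receives a different number in each vocabulary.
print(vocab_a['h'], vocab_b['h'])
```

Because `'!'` sorts ahead of the letters, every letter shifts by one position in the second vocabulary, so numbers encoded with one mapping cannot be decoded with the other.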

Anyway, let's write a token-to-number `encode` function and a number-to-token `decode` function.

First, create a mapping of tokens to numbers.

In [5]:
token_to_number = {t: i for i, t in enumerate(tokens)}
token_to_number

{'\n': 0,
 ' ': 1,
 '!': 2,
 '$': 3,
 '&': 4,
 "'": 5,
 ',': 6,
 '-': 7,
 '.': 8,
 '3': 9,
 ':': 10,
 ';': 11,
 '?': 12,
 'A': 13,
 'B': 14,
 'C': 15,
 'D': 16,
 'E': 17,
 'F': 18,
 'G': 19,
 'H': 20,
 'I': 21,
 'J': 22,
 'K': 23,
 'L': 24,
 'M': 25,
 'N': 26,
 'O': 27,
 'P': 28,
 'Q': 29,
 'R': 30,
 'S': 31,
 'T': 32,
 'U': 33,
 'V': 34,
 'W': 35,
 'X': 36,
 'Y': 37,
 'Z': 38,
 'a': 39,
 'b': 40,
 'c': 41,
 'd': 42,
 'e': 43,
 'f': 44,
 'g': 45,
 'h': 46,
 'i': 47,
 'j': 48,
 'k': 49,
 'l': 50,
 'm': 51,
 'n': 52,
 'o': 53,
 'p': 54,
 'q': 55,
 'r': 56,
 's': 57,
 't': 58,
 'u': 59,
 'v': 60,
 'w': 61,
 'x': 62,
 'y': 63,
 'z': 64}

And then a mapping from numbers to tokens.

In [6]:
number_to_token = {i: t for i, t in enumerate(tokens)}
number_to_token

{0: '\n',
 1: ' ',
 2: '!',
 3: '$',
 4: '&',
 5: "'",
 6: ',',
 7: '-',
 8: '.',
 9: '3',
 10: ':',
 11: ';',
 12: '?',
 13: 'A',
 14: 'B',
 15: 'C',
 16: 'D',
 17: 'E',
 18: 'F',
 19: 'G',
 20: 'H',
 21: 'I',
 22: 'J',
 23: 'K',
 24: 'L',
 25: 'M',
 26: 'N',
 27: 'O',
 28: 'P',
 29: 'Q',
 30: 'R',
 31: 'S',
 32: 'T',
 33: 'U',
 34: 'V',
 35: 'W',
 36: 'X',
 37: 'Y',
 38: 'Z',
 39: 'a',
 40: 'b',
 41: 'c',
 42: 'd',
 43: 'e',
 44: 'f',
 45: 'g',
 46: 'h',
 47: 'i',
 48: 'j',
 49: 'k',
 50: 'l',
 51: 'm',
 52: 'n',
 53: 'o',
 54: 'p',
 55: 'q',
 56: 'r',
 57: 's',
 58: 't',
 59: 'u',
 60: 'v',
 61: 'w',
 62: 'x',
 63: 'y',
 64: 'z'}

And now we can write our encoder.

In [7]:
def encode(text: str) -> list[int]:
  return [token_to_number[t] for t in text]

encode('apple')

[39, 54, 54, 50, 43]

And the decoder.

In [8]:
def decode(numbers: list[int]) -> str:
  return ''.join(number_to_token[n] for n in numbers)

decode([16, 17, 18])

'DEF'
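
A quick sanity check for any encoder and decoder pair is that decoding an encoded string returns the original string. The sketch below shows this on a toy vocabulary so that it runs on its own; the names `toy_encode`, `toy_decode`, and `toy_text` are illustrative and deliberately distinct from the notebook's functions.

```python
toy_text = "hello world"  # illustrative text
toy_vocab = sorted(set(toy_text))
toy_token_to_number = {ch: i for i, ch in enumerate(toy_vocab)}
toy_number_to_token = {i: ch for i, ch in enumerate(toy_vocab)}

def toy_encode(text: str) -> list[int]:
    return [toy_token_to_number[ch] for ch in text]

def toy_decode(numbers: list[int]) -> str:
    return ''.join(toy_number_to_token[n] for n in numbers)

# Round trip: decoding the encoded text gives back the original.
encoded = toy_encode(toy_text)
print(encoded)
print(toy_decode(encoded) == toy_text)

# A character that is missing from the vocabulary cannot be encoded.
try:
    toy_encode("hello!")
except KeyError as err:
    print(f"cannot encode unseen character: {err}")
```

The same limitation applies to a dictionary-based encoder built from the Shakespeare text: any character outside its 65 tokens raises a `KeyError`.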

We now have a tokenizer and a large amount of text. Let's tokenize all the text!

[Video Reference Point](https://youtu.be/kCc8FmEb1nY?t=778)

In [10]:
import torch
data = torch.tensor(encode(raw_training_data), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:100])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


Now we need to split the data into a training set and a validation set. We keep the first 90% for training and hold out the final 10% for validation.

In [17]:
training_data_size = int(0.9*len(data))
training_data = data[:training_data_size]
validation_data = data[training_data_size:]
print(len(training_data), len(validation_data))

1003854 111540
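
The two sizes follow directly from the arithmetic. As a small check that assumes nothing beyond the data length reported earlier, the sketch below reproduces them and shows on a toy list that slicing keeps the original order, so the validation portion is the tail of the text and not a random sample. All variable names here are illustrative.

```python
total_size = 1115394  # length of the dataset reported earlier

# int() truncates toward zero, so the training portion is rounded down.
train_size = int(0.9 * total_size)
val_size = total_size - train_size
print(train_size, val_size)

# The same slicing on a toy list: the split is sequential, not shuffled.
toy_data = list(range(10))
toy_split = int(0.9 * len(toy_data))
print(toy_data[:toy_split], toy_data[toy_split:])
```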


## Time Dimension

Next we work on training in blocks.
In this case we train on blocks of 8 characters, but each position also needs the "next" character as its target, so we slice chunks of `block_size + 1` characters.

[Video Reference Point](https://youtu.be/kCc8FmEb1nY?t=913)

In [18]:
block_size = 8
training_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [19]:
x = training_data[:block_size]
y = training_data[1:block_size+1]
for t in range(block_size):
  context = x[:t+1]
  target = y[t]
  print(f"in: {context}, out: {target}")

in: tensor([18]), out: 47
in: tensor([18, 47]), out: 56
in: tensor([18, 47, 56]), out: 57
in: tensor([18, 47, 56, 57]), out: 58
in: tensor([18, 47, 56, 57, 58]), out: 1
in: tensor([18, 47, 56, 57, 58,  1]), out: 15
in: tensor([18, 47, 56, 57, 58,  1, 15]), out: 47
in: tensor([18, 47, 56, 57, 58,  1, 15, 47]), out: 58
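
The same context and target pairs are easier to read as characters. This is a minimal sketch on the opening characters of the dataset (the token values above decode to `First Cit`); the names `toy_block_size`, `inputs`, `targets`, and `examples` are illustrative.

```python
text = "First Citizen"  # opening characters of the dataset
toy_block_size = 8

# One chunk of toy_block_size + 1 characters yields toy_block_size examples.
chunk = text[:toy_block_size + 1]
inputs = chunk[:toy_block_size]
targets = chunk[1:]

# Each example pairs a growing context with the character that follows it.
examples = [(inputs[:t + 1], targets[t]) for t in range(toy_block_size)]
for context, target in examples:
    print(f"in: {context!r}, out: {target!r}")
```

A single chunk therefore trains the model on every context length from 1 up to the block size.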


## Batch Dimension

[Video Reference Point](https://youtu.be/kCc8FmEb1nY?t=1129)

In [20]:
torch.manual_seed(1337)

batch_size = 4
block_size = 8

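def get_batch(split):
  # Pick the training or validation tensor.
  data = training_data if split == 'train' else validation_data
  # Sample batch_size random start offsets; each leaves room for block_size + 1 tokens.
  ix = torch.randint(len(data) - block_size, (batch_size,))
  # Inputs are blocks of block_size tokens; targets are the same blocks shifted by one.
  x = torch.stack([data[i:i+block_size] for i in ix])
  y = torch.stack([data[i+1:i+block_size+1] for i in ix])
  return x, y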

xb, yb = get_batch('train')
print('inputs')
print(xb.shape)
print(xb)
print('targets')
print(yb.shape)
print(yb)
print('----------------------------------')

for b in range(batch_size):
  for t in range(block_size):
    context = xb[b, :t+1]
    target = yb[b,t]
    print(f'input {context.tolist()}, target {target}')

inputs
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----------------------------------
input [24], target 43
input [24, 43], target 58
input [24, 43, 58], target 5
input [24, 43, 58, 5], target 57
input [24, 43, 58, 5, 57], target 1
input [24, 43, 58, 5, 57, 1], target 46
input [24, 43, 58, 5, 57, 1, 46], target 43
input [24, 43, 58, 5, 57, 1, 46, 43], target 39
input [44], target 53
input [44, 53], target 56
input [44, 53, 56], target 1
input [44, 53, 56, 1], target 58
input [44, 53, 56, 1, 58], target 46
input [44, 53, 56, 1, 58, 46], target 39
input [44, 53, 56, 1, 58, 46, 39], target 58
input [44, 53, 56, 1, 58, 46, 39, 58], target 1
input [52], tar
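
The sampling logic in `get_batch` can be mirrored in plain Python to make the indexing easier to follow. This is only a sketch under simplifying assumptions: it uses lists instead of tensors, the standard `random` module instead of `torch.randint`, and a hypothetical helper named `toy_get_batch`.

```python
import random

def toy_get_batch(data, batch_size, block_size, rng):
    """Sample input and target blocks from a list (hypothetical helper)."""
    # randrange excludes its upper bound, so the largest start offset is
    # len(data) - block_size - 1, which leaves room for the shifted targets.
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    xs = [data[i:i + block_size] for i in starts]
    ys = [data[i + 1:i + block_size + 1] for i in starts]
    return xs, ys

rng = random.Random(0)      # seeded for repeatable sampling
toy_data = list(range(20))  # stand-in for the encoded text
xs, ys = toy_get_batch(toy_data, batch_size=4, block_size=8, rng=rng)

for x_row, y_row in zip(xs, ys):
    print(x_row, y_row)
```

Because the toy data is just consecutive integers, each target row is visibly the input row shifted forward by one position.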

# BiGram

[Video Reference Point](https://youtu.be/kCc8FmEb1nY?t=1348)

[Makemore Series covering the BiGram language model](https://www.youtube.com/watch?v=PaCmpygFfXo)

In [23]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

  def __init__(self, vocab_size):
    super().__init__()
    self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

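  def forward(self, idx, targets):
    # idx and targets are (B, T) tensors of token indices.
    # Each index looks up a row of vocab_size scores, giving (B, T, C) logits.
    logits = self.token_embedding_table(idx)

    # Flatten batch and time together so that F.cross_entropy receives
    # logits shaped (B*T, C) and targets shaped (B*T,).
    B, T, C = logits.shape
    logits = logits.view(B*T, C)
    targets = targets.view(B*T)
    loss = F.cross_entropy(logits, targets)

    return logits, loss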

m = BigramLanguageModel(token_count)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)


In [27]:
import numpy as np

print(-np.log(1/65.0))

4.174387269895637
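
The value above is the loss of a model that assigns every one of the 65 tokens the same probability, which makes it a useful baseline. The sketch below recomputes it and confirms it with a small hand-written cross-entropy; the function `cross_entropy` defined here is an illustrative pure-Python version for a single example, not PyTorch's implementation.

```python
import math

vocab_size = 65

# Uniform guess: every token gets probability 1 / vocab_size.
uniform_loss = -math.log(1 / vocab_size)
print(uniform_loss)

def cross_entropy(logits, target):
    """Negative log of the softmax probability assigned to the target index."""
    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
    return log_sum_exp - logits[target]

# Equal logits produce a uniform softmax, so the loss matches the baseline.
print(cross_entropy([0.0] * vocab_size, target=3))
```

The untrained model's loss of 4.8786 sits above this baseline, most likely because the randomly initialized embedding table produces uneven logits that put too little probability on the correct next character.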
