<a href="https://colab.research.google.com/github/mathschelsea/data_science/blob/feature%2Fllm_underthehood/notebooks/llm_underthehood.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [77]:
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(7663);

In [10]:
# Downloading the script of Bill & Ted's Bogus Journey
!wget https://raw.githubusercontent.com/mathschelsea/data_science/feature/llm_underthehood/data/raw/bandt_bogus_script.txt

--2023-10-05 18:41:25--  https://raw.githubusercontent.com/mathschelsea/data_science/feature/llm_underthehood/data/raw/bandt_bogus_script.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25535 (25K) [text/plain]
Saving to: ‘bandt_bogus_script.txt’


2023-10-05 18:41:25 (21.3 MB/s) - ‘bandt_bogus_script.txt’ saved [25535/25535]



In [19]:
# Reading in the script
with open('bandt_bogus_script.txt', 'r', encoding='utf-8') as f:
  text = f.read()

In [20]:
print(f'Length of dataset in characters: {len(text)}')

Length of dataset in characters: 24643


In [22]:
# Looking at the first 1000 characters
print(text[:1000])

It is time. 
They've reached the second crucial turning point in their destiny. Their message is about to 
reach millions. 
But... we will change all that. When our mission is successful... ...no longer will the world 
be dominated... 
...by the legacy of these two fools! No longer will we hear this: 
We will stop them now! 
Brothers and sisters... 
...are we ready? 
- Greetings, my excellent pupils. - Station. 
Let's continue our study of the physics of acoustical reverberation. Meet today's most non-bogus 
guest speakers. 
Say hello to Thomas Edison. 
Hello there. 
To help us on the musical side: Johann Sebastian Bach. 
And Sir James Martin 
of Faith No More... 
...founder of the Faith No More Spiritual and Theological Center. - Station! 
- Station! 
And a special treat 
from the 23rd century: 
Miss Ria Paschelle. 
Miss Paschelle is 
the inventor of the... 
...statiophonicoxygeneticamp 
lifiagraphiphonideliverberator. Hard to imagine the world 
without them, isn't it? 
Remember, this

In [25]:
# Listing all the unique characters in the text
chars = sorted(set(text))
vocab_size = len(chars)
print(f'Vocabulary size: {vocab_size}')
print(f'All unique characters: {"".join(chars)}')

Vocabulary size: 67
All unique characters: 
 !"$',-.0123567:?ABCDEFGHIJKLMNOPQRSTUWYabcdefghijklmnopqrstuvwxyz


In [None]:
# Developing a strategy to tokenise the text, i.e. to convert the raw string into a sequence of integers

In [26]:
# Creating a mapping from characters to integers and back again

# Encode (string to index)
stoi = { ch:i for i,ch in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]

# Decode (index to string)
itos = { i:ch for i,ch in enumerate(chars)}
decode = lambda l: ''.join([itos[i] for i in l])

print(encode('dude'))
print(decode(encode('dude')))

[44, 61, 44, 45]
dude
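The mapping above only covers characters that occur in the script, so `encode` raises a `KeyError` for anything else (the vocabulary printed earlier has no uppercase `V`, for instance). Below is a minimal sketch of the round trip and of that failure mode. It uses a short stand-in string rather than the full script, and `sample` and `safe_encode` are illustrative names that are not part of the notebook.

```python
# Stand-in text (an assumption for this sketch, not the real script)
sample = "Be excellent to each other."

chars = sorted(set(sample))
stoi = {ch: i for i, ch in enumerate(chars)}  # string to index
itos = {i: ch for i, ch in enumerate(chars)}  # index to string

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

# Encoding then decoding should give back the original string
assert decode(encode(sample)) == sample

def safe_encode(s):
    """Hypothetical helper: name the unknown characters instead of a bare KeyError."""
    unknown = sorted(set(s) - set(stoi))
    if unknown:
        raise ValueError(f"Characters not in vocabulary: {unknown}")
    return encode(s)

try:
    safe_encode("Vile!")  # 'V', 'i' and '!' do not occur in `sample`
except ValueError as err:
    print(err)
```

A character-level vocabulary built this way is tied to the training text, so any prompt given to the trained model later has to be restricted to the same character set.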


In [31]:
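# Character-level tokenisation is the simplest option; production models use sub-word tokenisers instead
# e.g. Google's SentencePiece, or the byte-pair encoding (BPE) tokeniser used by GPT (available via OpenAI's tiktoken library)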
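To give a feel for what a sub-word tokeniser does, here is a toy sketch of the core byte-pair encoding idea: repeatedly find the most frequent adjacent pair of tokens and merge it into a single new token. This is a simplification under stated assumptions. Real BPE tokenisers work on bytes, learn their merges from a large corpus, and store the learned merge table; the function names `most_frequent_pair` and `merge_pair` are made up for this illustration.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common adjacent pair of tokens."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from characters and apply three merge steps
tokens = list("station station")
for _ in range(3):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))

print(tokens)
```

Each merge shortens the sequence while growing the vocabulary by one entry, which is the trade-off between character-level and sub-word tokenisation: longer sequences over a small vocabulary versus shorter sequences over a large one.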

In [34]:
# Encode the entire dataset and store it into a torch tensor
data = torch.tensor(encode(text), dtype=torch.long)
print(f'Shape: {data.shape}')
print(f'Data type: {data.dtype}')
print(data[:1000])

Shape: torch.Size([24643])
Data type: torch.int64
tensor([26, 60,  1, 49, 59,  1, 60, 49, 53, 45,  8,  1,  0, 37, 48, 45, 65,  5,
        62, 45,  1, 58, 45, 41, 43, 48, 45, 44,  1, 60, 48, 45,  1, 59, 45, 43,
        55, 54, 44,  1, 43, 58, 61, 43, 49, 41, 52,  1, 60, 61, 58, 54, 49, 54,
        47,  1, 56, 55, 49, 54, 60,  1, 49, 54,  1, 60, 48, 45, 49, 58,  1, 44,
        45, 59, 60, 49, 54, 65,  8,  1, 37, 48, 45, 49, 58,  1, 53, 45, 59, 59,
        41, 47, 45,  1, 49, 59,  1, 41, 42, 55, 61, 60,  1, 60, 55,  1,  0, 58,
        45, 41, 43, 48,  1, 53, 49, 52, 52, 49, 55, 54, 59,  8,  1,  0, 19, 61,
        60,  8,  8,  8,  1, 63, 45,  1, 63, 49, 52, 52,  1, 43, 48, 41, 54, 47,
        45,  1, 41, 52, 52,  1, 60, 48, 41, 60,  8,  1, 39, 48, 45, 54,  1, 55,
        61, 58,  1, 53, 49, 59, 59, 49, 55, 54,  1, 49, 59,  1, 59, 61, 43, 43,
        45, 59, 59, 46, 61, 52,  8,  8,  8,  1,  8,  8,  8, 54, 55,  1, 52, 55,
        54, 47, 45, 58,  1, 63, 49, 52, 52,  1, 60, 48, 45,  1, 63, 55

In [38]:
# Splitting data into train and validation sets
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

In [39]:
# We now want to start plugging these integer (text) sequences into the transformer so that it can be trained
# We can't push the entire text through the transformer all at once, so we push 'chunks' of the text through instead
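As a rough sense of scale, the arithmetic below counts how many distinct chunks the training split can supply. It is a sketch that hard-codes the dataset length printed earlier (24643) and assumes a context length of 8, the value used in the following cells; `n_windows` is an illustrative name.

```python
n_chars = 24643               # dataset length reported earlier in the notebook
n_train = int(0.9 * n_chars)  # same 90/10 split as above
n_val = n_chars - n_train
block_size = 8                # assumed context length

# Each chunk needs block_size inputs plus one extra token for the last target,
# so a chunk can start at any index from 0 up to n_train - block_size - 1
n_windows = n_train - block_size

print(f'Train tokens: {n_train}, validation tokens: {n_val}')
print(f'Possible training chunks: {n_windows}')
```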

In [42]:
block_size = 8 # context length
train_data[:block_size+1]

tensor([26, 60,  1, 49, 59,  1, 60, 49, 53])

In [45]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
  context = x[:t+1]
  target = y[t]
  print(f'When the input is {context} the target is {target}')

When the input is tensor([26]) the target is 60
When the input is tensor([26, 60]) the target is 1
When the input is tensor([26, 60,  1]) the target is 49
When the input is tensor([26, 60,  1, 49]) the target is 59
When the input is tensor([26, 60,  1, 49, 59]) the target is 1
When the input is tensor([26, 60,  1, 49, 59,  1]) the target is 60
When the input is tensor([26, 60,  1, 49, 59,  1, 60]) the target is 49
When the input is tensor([26, 60,  1, 49, 59,  1, 60, 49]) the target is 53


In [46]:
# When feeding the transformer we want to pass it multiple sequences/chunks at once, i.e. a batch of sequences/chunks
# So we need to add one more dimension (the batch dimension)

In [73]:
batch_size = 4 # the number of independent sequences that will be processed in parallel
block_size = 8 # the maximum context length for predictions

def get_batch(split):
  # generate a small batch of inputs x and targets y
  data = train_data if split == 'train' else val_data
  ix = torch.randint(len(data)-block_size, (batch_size,)) # 4 random numbers between 0 and length of data minus block size
  x = torch.stack([data[i:i+block_size] for i in ix])
  y = torch.stack([data[i+1:i+block_size+1] for i in ix])
  return x, y

xb, yb = get_batch('train')
print('Inputs:')
print(xb)
print(f'Input shape: {xb.shape}')
print('Targets')
print(yb)
print(f'Targets shape: {yb.shape}')

Inputs:
tensor([[42, 52, 55, 55, 44,  8,  1,  0],
        [26,  1, 63, 55, 58, 51,  1, 55],
        [ 1,  0, 37, 48, 41, 54, 51, 59],
        [ 8,  1,  0, 37, 48, 45, 65,  1]])
Input shape: torch.Size([4, 8])
Targets
tensor([[52, 55, 55, 44,  8,  1,  0,  7],
        [ 1, 63, 55, 58, 51,  1, 55, 61],
        [ 0, 37, 48, 41, 54, 51, 59,  6],
        [ 1,  0, 37, 48, 45, 65,  1, 60]])
Targets shape: torch.Size([4, 8])
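A quick way to convince yourself that the targets really are the inputs shifted by one position is to check it directly. The sketch below repeats the sampling scheme from `get_batch` on a stand-in tensor (`torch.arange(100)` instead of `train_data`, with an arbitrary seed), so the values differ from the output above but the relationship is the same.

```python
import torch

torch.manual_seed(0)  # arbitrary seed, only to make this sketch repeatable

data = torch.arange(100)  # stand-in for train_data (an assumption)
batch_size, block_size = 4, 8

# Same sampling as get_batch: start indices drawn from [0, len(data) - block_size)
ix = torch.randint(len(data) - block_size, (batch_size,))
xb = torch.stack([data[i:i + block_size] for i in ix])
yb = torch.stack([data[i + 1:i + block_size + 1] for i in ix])

# Dropping the first input column and the last target column leaves identical tensors
assert torch.equal(xb[:, 1:], yb[:, :-1])

# The largest start index that can be drawn still leaves room for a full target row
last_start = len(data) - block_size - 1
assert len(data[last_start + 1:last_start + block_size + 1]) == block_size

print(xb.shape, yb.shape)
```

Because the upper bound of `torch.randint` is exclusive, subtracting `block_size` from the data length is exactly enough to keep the target slice inside the tensor.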


In [75]:
for b in range(batch_size):
  for t in range(block_size):
    context = xb[b, :t+1]
    target = yb[b,t]
    print(f'When the input is {context} the output is {target}')

When the input is tensor([42]) the output is 52
When the input is tensor([42, 52]) the output is 55
When the input is tensor([42, 52, 55]) the output is 55
When the input is tensor([42, 52, 55, 55]) the output is 44
When the input is tensor([42, 52, 55, 55, 44]) the output is 8
When the input is tensor([42, 52, 55, 55, 44,  8]) the output is 1
When the input is tensor([42, 52, 55, 55, 44,  8,  1]) the output is 0
When the input is tensor([42, 52, 55, 55, 44,  8,  1,  0]) the output is 7
When the input is tensor([26]) the output is 1
When the input is tensor([26,  1]) the output is 63
When the input is tensor([26,  1, 63]) the output is 55
When the input is tensor([26,  1, 63, 55]) the output is 58
When the input is tensor([26,  1, 63, 55, 58]) the output is 51
When the input is tensor([26,  1, 63, 55, 58, 51]) the output is 1
When the input is tensor([26,  1, 63, 55, 58, 51,  1]) the output is 55
When the input is tensor([26,  1, 63, 55, 58, 51,  1, 55]) the output is 61
When the input