<a href="https://colab.research.google.com/github/mervegb/nlp-sequence-prediction/blob/main/gpt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [26]:
# We always start with a dataset to train on. Let's download the tiny shakespeare dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-05-19 11:10:22--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.1’


2024-05-19 11:10:22 (18.3 MB/s) - ‘input.txt.1’ saved [1115394/1115394]



In [27]:
#read it in to inspect it
with open('input.txt', 'r', encoding='utf-8') as f:
  text = f.read()

In [28]:
print('length of dataset in characters', len(text))

length of dataset in characters 1115394


In [29]:
#look at the first 1000 characters
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [30]:
#all the unique characters that occur in this text
chars = sorted(list(set(text))) #the set of characters the model can see or emit
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


**Tokenization**

Breaking a larger body of text into smaller pieces called tokens. Here each token is a single character.
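
As a rough illustration, the same sentence can be split at different granularities. The sample sentence and variable names below are assumptions made for this sketch; the notebook itself works with character-level tokens.

```python
# Illustrative only: the sample sentence and variable names are assumptions.
sentence = "Speak, speak."

# Character-level tokenization: every character is its own token.
char_tokens = list(sentence)

# Word-level tokenization (naive whitespace split, shown only for contrast).
word_tokens = sentence.split()

print(char_tokens)
print(word_tokens)
print(len(char_tokens), len(word_tokens))
```

Character-level tokens keep the vocabulary very small at the cost of longer sequences.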







In [31]:
#create dictionaries for character-integer mapping
#stoi => string to integer
#itos => integer to string
stoi = {ch:i for i,ch in enumerate(chars)}
itos = {i:ch for i,ch in enumerate(chars)}

print(itos)
print(stoi)

#these mappings are crucial for converting characters to integers and back, essentially
#creating a lookup table for both

{0: '\n', 1: ' ', 2: '!', 3: '$', 4: '&', 5: "'", 6: ',', 7: '-', 8: '.', 9: '3', 10: ':', 11: ';', 12: '?', 13: 'A', 14: 'B', 15: 'C', 16: 'D', 17: 'E', 18: 'F', 19: 'G', 20: 'H', 21: 'I', 22: 'J', 23: 'K', 24: 'L', 25: 'M', 26: 'N', 27: 'O', 28: 'P', 29: 'Q', 30: 'R', 31: 'S', 32: 'T', 33: 'U', 34: 'V', 35: 'W', 36: 'X', 37: 'Y', 38: 'Z', 39: 'a', 40: 'b', 41: 'c', 42: 'd', 43: 'e', 44: 'f', 45: 'g', 46: 'h', 47: 'i', 48: 'j', 49: 'k', 50: 'l', 51: 'm', 52: 'n', 53: 'o', 54: 'p', 55: 'q', 56: 'r', 57: 's', 58: 't', 59: 'u', 60: 'v', 61: 'w', 62: 'x', 63: 'y', 64: 'z'}
{'\n': 0, ' ': 1, '!': 2, '$': 3, '&': 4, "'": 5, ',': 6, '-': 7, '.': 8, '3': 9, ':': 10, ';': 11, '?': 12, 'A': 13, 'B': 14, 'C': 15, 'D': 16, 'E': 17, 'F': 18, 'G': 19, 'H': 20, 'I': 21, 'J': 22, 'K': 23, 'L': 24, 'M': 25, 'N': 26, 'O': 27, 'P': 28, 'Q': 29, 'R': 30, 'S': 31, 'T': 32, 'U': 33, 'V': 34, 'W': 35, 'X': 36, 'Y': 37, 'Z': 38, 'a': 39, 'b': 40, 'c': 41, 'd': 42, 'e': 43, 'f': 44, 'g': 45, 'h': 46, 'i': 47,

In [32]:
#Lambda functions in Python:
#a way to create small anonymous functions that don't need to be named
add = lambda x,y: x+y
print(add(5,3))

numbers = [1,2,3,4,5]
squared = list(map(lambda x: x**2, numbers))
print(squared)

even_numbers = list(filter(lambda x: x%2== 0, numbers))

data = [('John', 45), ('Diane', 32), ('James', 28)]
sorted_data = sorted(data, key=lambda x: x[1]) #sort by age
print(sorted_data)

8
[1, 4, 9, 16, 25]
[('James', 28), ('Diane', 32), ('John', 45)]


In [33]:
#define the encoder function
encode = lambda s: [stoi[c] for c in s] #encoder: takes a string, output list of integers

#define the decoder function
decode = lambda l: ''.join([itos[i] for i in l ]) #decoder: take a list of integers, output a string

print(encode("hii there"))
print(decode(encode('hii there')))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there
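
One property worth noting is that `encode` and `decode` are inverses only for characters that exist in the vocabulary. The sketch below uses a hypothetical miniature corpus (not `input.txt`) to show the round trip and what happens with an unseen character.

```python
# Hypothetical miniature corpus; the notebook builds its vocabulary from input.txt instead.
sample_text = "hii there"
chars = sorted(set(sample_text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return "".join(itos[i] for i in ids)

# Round trip: decoding an encoded string gives back the original.
round_trip = decode(encode("there hi"))

# Characters outside the vocabulary are not handled: the dict lookup raises KeyError.
try:
    encode("hello")  # 'l' and 'o' never appear in sample_text
    unknown_ok = True
except KeyError:
    unknown_ok = False

print(round_trip, unknown_ok)
```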


In [34]:
#Tensor in PyTorch:
#a multi-dimensional array that can also live on a GPU
#and integrates with automatic gradient computation, which makes it better suited to training neural networks than a NumPy array

import torch

#encode the entire text dataset and store it into torch.Tensor
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000])

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

In [35]:
#split the data into train & validation sets
n = int(0.9* len(data)) #first 90% will be the train and rest will be validation
train_data = data[:n]
val_data = data[n:]

#holding out validation data lets us measure to what extent the model is overfitting
#the model never trains on the validation set; it is used only for evaluation

In [36]:
block_size = 8 #context length: the maximum number of tokens the model sees at one time

#a block size that is too small may not provide enough context for accurate predictions
#a larger block size gives more context but costs more memory and compute per step

train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [37]:
x = train_data[:block_size] # train_data[:8] - tensor([18, 47, 56, 57, 58, 1, 15, 47])
y= train_data[1: block_size + 1] # train_data[1:9] tensor([47, 56, 57, 58, 1, 15, 47, 58])

for t in range(block_size):
  context = x[:t+1]
  target = y[t]
  print(f"when input is {context} the target: {target}")


#y[t] is the element that immediately follows the context x[:t+1]
#this setup trains the model to predict the next character in a sequence
#iterating over block_size yields one training example per position: given the context, predict the target

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


**Batch Size**

The number of training examples used in one training iteration of a model.
When you train a model, you typically don't pass the entire dataset through the network at once, for reasons of computational efficiency and memory limits.
Instead, you divide the dataset into smaller sets called batches.


**Example**:
Suppose the batch size is 32. This means 32 sequences are processed in parallel during each training step.

<br>

**Block Size**

The sequence length: the maximum number of previous tokens the model considers before making a prediction.

In other words, it is the maximum context length, i.e. how many past data points are used to predict the next value. This is relevant in models that generate text or make predictions based on past information.
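
To make the two sizes concrete, here is a minimal sketch on toy data. The tensor `toy_data` and the fixed start offsets are assumptions chosen for readability; the notebook samples the offsets at random.

```python
import torch

# Toy stand-in for the encoded dataset (an assumption for illustration).
toy_data = torch.arange(20)
batch_size = 2   # number of independent sequences per batch
block_size = 4   # number of tokens per sequence
starts = [0, 10]  # fixed start offsets instead of random ones, for readability

# Inputs: one window of block_size tokens per start offset.
x = torch.stack([toy_data[i : i + block_size] for i in starts])
# Targets: the same windows shifted one position to the right.
y = torch.stack([toy_data[i + 1 : i + block_size + 1] for i in starts])

print(x.shape)  # (batch_size, block_size)
print(x)
print(y)
# Each of the batch_size * block_size positions is one (context, target) training example.
print(x.numel())
```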



In [38]:
torch.manual_seed(1337) #to get same sequence of random numbers
batch_size = 4 #the number of sequences in each batch
block_size = 8 #the number of items in each sequence

#batch_size = 4 means every batch contains 4 independent sequences;
#each entry of ix is the random starting index of one sequence

#block_size = 8 means each sequence holds 8 tokens

#torch.randint(high, size) samples integers from [0, high), so start indices go up to
#len(data) - block_size - 1 and the target slice data[i+1 : i+block_size+1] stays in range

def get_batch(split):
  data = train_data if split == 'train' else val_data
  ix = torch.randint(len(data) - block_size, (batch_size,)) #batch_size -> will generate 4 random integers

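  #data[i : i + block_size] takes block_size (8) consecutive elements starting at index i;
  #torch.stack combines the batch_size slices into one (batch_size, block_size) tensor
  x = torch.stack([data[i: i + block_size] for i in ix])
  #targets are the same windows shifted one position to the right
  y = torch.stack([data[i+1: i + block_size+1] for i in ix])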
  return x,y


xb, yb = get_batch('train')
print('inputs')
print(xb.shape)
print(xb)

print('targets')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): #iterates over each sequence in the batch
  for t in range(block_size): #iterates over each element within sequence
    context = xb[b,:t+1]
    target = yb[b,t]
    print(f"when input is {context.tolist()} the target: {target}")

inputs
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----
when input is [24] the target: 43
when input is [24, 43] the target: 58
when input is [24, 43, 58] the target: 5
when input is [24, 43, 58, 5] the target: 57
when input is [24, 43, 58, 5, 57] the target: 1
when input is [24, 43, 58, 5, 57, 1] the target: 46
when input is [24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
when input is [44] the target: 53
when input is [44, 53] the target: 56
when input is [44, 53, 56] the target: 1
when input is [44, 53, 56, 1] the target: 58
when input is [44, 53, 56, 1, 58] the target: 46
when input is [44, 53, 

**Bigram Language Model**

Predicts the next token in a sequence based only on the previous token. In this notebook the tokens are characters.

- Bigram models are limited by lack of context: they can only look one token back, which leads to less accurate predictions in complex texts where more context is necessary.

- Zero-probability problem: in a count-based bigram model, a pair of tokens that never occurred in the training data gets probability zero, so the model can never predict it.
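
The zero-probability problem is easiest to see in a count-based bigram model. The sketch below is not the neural model trained in this notebook; the corpus, the `prob` helper, and the `alpha` smoothing parameter are all illustrative assumptions.

```python
from collections import Counter, defaultdict

# Tiny hypothetical corpus; names here are illustrative, not from the notebook.
corpus = "abab"
vocab = sorted(set(corpus + "c"))  # 'c' is in the vocabulary but never seen in the corpus

# Count how often each token follows each other token.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def prob(prev, nxt, alpha=0.0):
    """Estimate P(nxt | prev), with optional add-alpha smoothing."""
    total = sum(counts[prev].values()) + alpha * len(vocab)
    if total == 0:
        return 0.0
    return (counts[prev][nxt] + alpha) / total

print(prob("a", "b"))           # 'b' always follows 'a' in the corpus
print(prob("a", "c"))           # unseen pair: no probability mass at all
print(prob("a", "c", alpha=1))  # add-one smoothing assigns it a small non-zero value
```

A softmax over learned logits, as used later in the notebook, never produces an exact zero, so this issue mainly concerns count-based estimates.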

**Embedding Lookup**

Each token in a sequence is mapped to its corresponding vector in an embedding matrix. In general this vector can capture semantic meaning; in the bigram model below the table has shape (vocab_size, vocab_size), so each row is read directly as the logits for the next token.

Reshaping prepares the tensors for the loss computation.
Cross-entropy loss expects the input tensor to have shape (N, C) and the target to have shape (N), where N is the total number of predictions (B*T).

B: batch size

T: sequence length

C: number of classes (vocabulary size)
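
The lookup and the reshape can be checked on small tensors. The sizes and variable names below are assumptions for illustration, smaller than the ones the notebook uses.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)

B, T, C = 2, 3, 5  # small illustrative sizes
table = nn.Embedding(C, C)           # one row of C scores per token id
idx = torch.randint(C, (B, T))       # token ids, shape (B, T)
targets = torch.randint(C, (B, T))   # next-token ids, shape (B, T)

logits = table(idx)                  # lookup: (B, T) -> (B, T, C)
# The vector returned for a token id is exactly that row of the weight matrix.
same_row = torch.equal(logits[0, 0], table.weight[idx[0, 0]])

# Flatten batch and time so cross_entropy sees (N, C) scores and (N,) targets.
loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
print(logits.shape, same_row, loss.shape)
```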

In [43]:
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

  def __init__(self, vocab_size):
    super().__init__()
    #initialize embedding layer
    self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)


  def forward(self, idx, targets=None):
    logits = self.token_embedding_table(idx) #embedding lookup

    if targets is None:
      loss = None

    else:
      B,T,C = logits.shape
      logits = logits.view(B*T,C) #reshapes logits from (B,T,C) to (B*T,C) to flatten the batch and sequence dimensions into one
      targets = targets.view(B*T)
      loss = F.cross_entropy(logits, targets)

    return logits, loss

  #this method generates new tokens
  #idx tensor of shape (B,T) representing the current context, B batch size, T sequence length
  #max_new_tokens number of new tokens to generate
  def generate(self, idx, max_new_tokens):
    for _ in range(max_new_tokens):
      logits, loss = self(idx) #forward pass; loss is None because no targets are passed
      logits = logits[:,-1,:] #keep only the last time step: (B,T,C) becomes (B,C)
      probs = F.softmax(logits, dim=-1) #convert logits to probabilities over the vocabulary

      idx_next = torch.multinomial(probs, num_samples=1) #(B,1) this contains the indices of the newly generated tokens
      idx = torch.cat((idx, idx_next), dim=1) #extends the current sequences by adding the newly generated tokens (B,T+1)

    return idx


m = BigramLanguageModel(vocab_size) #creates instance of the model
logits, loss = m(xb, yb) #performs forward pass, where xb is input tensor & yb is target tensor

print(logits.shape)
print(loss)

idx = torch.zeros((1,1), dtype=torch.long) #start from token 0, which is the newline character
print(decode(m.generate(idx, max_new_tokens=100)[0].tolist()))

torch.Size([32, 65])
tensor(4.8786, grad_fn=<NllLossBackward0>)

Sr?qP-QWktXoL&jLDJgOLVz'RIoDqHdhsV&vLLxatjscMpwLERSPyao.qfzs$Ys$zF-w,;eEkzxjgCKFChs!iWW.ObzDnxA Ms$3
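
For reference, a model that assigns equal probability to all 65 characters would have a cross-entropy of ln(65), roughly 4.17. The initial loss printed above (4.8786) is higher, which is what one would expect from randomly initialised logits rather than uniform ones. A minimal sketch of that baseline, with variable names chosen here for illustration:

```python
import math
import torch
from torch.nn import functional as F

vocab_size = 65
# Cross-entropy of a uniform guess over vocab_size classes is ln(vocab_size).
uniform_loss = math.log(vocab_size)

# Same value via PyTorch: all-zero logits give a uniform softmax.
logits = torch.zeros(10, vocab_size)
targets = torch.randint(vocab_size, (10,))
torch_loss = F.cross_entropy(logits, targets).item()

print(round(uniform_loss, 4), round(torch_loss, 4))
```

A trained model should end up well below this baseline.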


In [44]:
#Create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [50]:
batch_size = 32 #each batch will contain 32 samples

#Training loop
for steps in range(10000):
  #sample batch of data
  xb, yb = get_batch('train')

  #evaluate the loss
  logits, loss = m(xb,yb)
  optimizer.zero_grad(set_to_none=True) #clears old gradients from previous steps
  loss.backward() #performs backpropagation
  optimizer.step() #updates model's parameters using the computed gradients

print(loss.item()) #loss on the final training batch only

2.4988934993743896
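
The number above comes from a single training batch, so it is noisy, and the validation split has not been used yet. One common way to get a steadier estimate is to average the loss over several batches from each split. The sketch below is self-contained and uses stand-ins (random data, a bare embedding table as the model, and a hypothetical `eval_iters` parameter) rather than the notebook's own objects.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)

# Stand-ins for the notebook's objects (assumptions for a self-contained sketch).
vocab_size, block_size, batch_size = 10, 8, 4
data = torch.randint(vocab_size, (1000,))
n = int(0.9 * len(data))
splits = {"train": data[:n], "val": data[n:]}

def get_batch(split):
    d = splits[split]
    ix = torch.randint(len(d) - block_size, (batch_size,))
    x = torch.stack([d[i : i + block_size] for i in ix])
    y = torch.stack([d[i + 1 : i + block_size + 1] for i in ix])
    return x, y

model = nn.Embedding(vocab_size, vocab_size)  # bare bigram table standing in for the model

@torch.no_grad()  # gradients are not needed for evaluation
def estimate_loss(eval_iters=20):
    """Average the loss over eval_iters batches for each split."""
    out = {}
    for split in splits:
        batch_losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            logits = model(xb)
            batch_losses[k] = F.cross_entropy(logits.view(-1, vocab_size), yb.view(-1))
        out[split] = batch_losses.mean().item()
    return out

avg_losses = estimate_loss()
print(avg_losses)
```

A validation loss that stays clearly above the training loss would be a sign of overfitting.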


In [52]:
print(decode(m.generate(idx, max_new_tokens=300)[0].tolist()))


For, I thy whundlyo d yome PUG Ximuriusanro; shes dur'd, s at CEn CURDYWey t havee.

Yomequcee n r owis poboungalknajus fo tze yonout eit r thom t, ch ar t g
I LOfarsmalle thenierd p ourry, be horar OLI'd TII thithoubepred dar toris, thaums, athacknthene, traMe ars ame nen; frs thenil courit,
An.
Me
