**EE4685 Assignment 2: Building a miniGPT** by Josephine King and Alec Daalman

**References:**
- "Let's build GPT: from scratch, in code, spelled out." YouTube tutorial by Andrej Karpathy. https://www.youtube.com/watch?v=kCc8FmEb1nY
- HuggingFace Tokenizer developer guides. https://huggingface.co/docs/transformers/en/notebooks


In [1]:
# Import packages
import json
import os
from tqdm.notebook import tqdm
import tiktoken
import torch
import torch.nn as nn
from torch.nn import functional as F
import torch.utils.data as data
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Setup
torch.manual_seed(6250513)
CHECKPOINT_PATH = "./saved_models/"
device = torch.device("cpu") if not torch.cuda.is_available() else torch.device("cuda:0")
print("Using device", device)

# Initialize model parameters
TRAIN_PCT = 0.8
BLOCK_SIZE = 128
BATCH_SIZE = 32
MAX_ITER = 5000
VOCAB_SIZE = 750
LR = 1e-3

# Download the TinyShakespeare dataset
!wget -O tinyshakespeare.txt https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('tinyshakespeare.txt', 'r', encoding='utf-8') as f: raw_data = f.read()


Using device cpu
--2025-03-04 18:01:30--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘tinyshakespeare.txt’


2025-03-04 18:01:30 (30.4 MB/s) - ‘tinyshakespeare.txt’ saved [1115394/1115394]



**Data Preprocessing**

Train a custom byte-pair-encoding (BPE) tokenizer using the HuggingFace Tokenizers package. Then encode the data, convert it into a PyTorch tensor, and split it into training data (the first 80%) and validation data (the remaining 20%).

In [3]:
# Create the tokenizer 
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(vocab_size=VOCAB_SIZE)
tokenizer.train(["tinyshakespeare.txt"], trainer)
tokenizer.save("tinyshakespeare_tokenizer.json")

# Tokenize the data
tokenizer = Tokenizer.from_file("tinyshakespeare_tokenizer.json")
tokenized_data = tokenizer.encode(raw_data).ids
# Convert into a pytorch tensor
tensor_data = torch.tensor(tokenized_data, dtype=torch.long)

# Split into training and validation sets
train_end = int(len(tensor_data)*TRAIN_PCT)
training_data = tensor_data[:train_end]
validation_data = tensor_data[train_end:]
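
To give a feel for what BPE training does, here is a minimal sketch in plain Python. It is a simplified illustration, not the HuggingFace implementation: the toy corpus and the helper names `most_frequent_pair` and `merge_pair` are made up for this example. Each round counts adjacent symbol pairs and merges the most frequent pair into a new symbol.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical toy corpus; every word starts out as a tuple of characters
corpus = Counter("thou art thou the thee".split())
words = {tuple(word): freq for word, freq in corpus.items()}

merges = []
for _ in range(3):  # three merge rounds, just for illustration
    pair = most_frequent_pair(words)
    if pair is None:
        break
    merges.append(pair)
    words = merge_pair(words, pair)

print("learned merges:", merges)
print("segmented words:", list(words))
```

In the notebook, the number of merges is controlled indirectly through VOCAB_SIZE: training stops once the vocabulary reaches that size.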






**Basic Untrained Bigram Language Model**

Create a basic bigram language model, copied directly from Karpathy's tutorial. To use this model, we need the function get_batch, which returns a random batch of input blocks and their target blocks from the dataset. Using the untrained model, generate some text to see what it produces.

In [4]:
def get_batch(data, batch_size, block_size):
    # Choose batch_size random starting points
    block_starts = torch.randint(0, len(data) - block_size, (batch_size,))
    # Get the inputs and outputs for the chosen blocks, stack them into tensors
    batch_inputs = torch.stack([data[start: start + block_size] for start in block_starts])
    batch_outputs = torch.stack([data[start + 1: start + block_size + 1] for start in block_starts])
    return batch_inputs, batch_outputs

# Copied from Karpathy's tutorial
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

# Create the model and generate some text 
m = BigramLanguageModel(VOCAB_SIZE)
starting_text = "Romeo Romeo wherefore art thou Romeo"
starting_tokens = tokenizer.encode(starting_text).ids
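# reshape(-1,1) gives shape (T, 1): T sequences of one token each, so indexing [0] below
# continues only from the first prompt token; reshape(1,-1) would keep the prompt as one sequence
starting_tokens = torch.tensor(starting_tokens, dtype=torch.long).reshape(-1,1)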
print(tokenizer.decode(m.generate(idx = starting_tokens, max_new_tokens=100)[0].tolist()))


R ENTIO:
  we  ou  be  ugh re T fear VINCENTIO:
 su ted  th ge  , my   f O fi oul US:
 
 ra when ET  her lif then C est ust   than pra the hat  k I s -  them in  this  R thy  Ha per no  are  am lea MI ot  with I have  aw ough and  ARD fear Ed a  e,
  that  ell that   th up OM ow  ha  st it  X ol es,
 ENTIO:
 Or sir y KING  ain der al ge  to the  , DW ge  ed  ter  LI onour z EDW Which the  ENIUS:
 vo N TIO:
 des is sel
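
To make the input/target offset in get_batch concrete, the following sketch mirrors it with plain Python lists instead of tensors. The function name `get_batch_lists` and the toy token ids are hypothetical, chosen only for this illustration. Each target block is the input block shifted one position to the right, so position t of the target is the token that follows position t of the input.

```python
import random

def get_batch_lists(data, batch_size, block_size, rng):
    """Plain-list analogue of get_batch: targets are the inputs shifted by one token."""
    # randrange excludes its upper bound, like torch.randint
    starts = [rng.randrange(0, len(data) - block_size) for _ in range(batch_size)]
    inputs = [data[s : s + block_size] for s in starts]
    targets = [data[s + 1 : s + block_size + 1] for s in starts]
    return inputs, targets

toy_tokens = list(range(10, 30))  # hypothetical token ids
rng = random.Random(0)
inputs, targets = get_batch_lists(toy_tokens, batch_size=2, block_size=4, rng=rng)

for x, y in zip(inputs, targets):
    print("input ", x)
    print("target", y)
```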


**Train the Bigram Language Model**

Create a function train_model, which takes in training data, a model, and an optimizer and trains the model. Copied/modified from the EE4685 optimization exercise.

In [None]:
# These functions are all copied/modified from the optimization exercise 
def _get_config_file(model_path, model_name):
    return os.path.join(model_path, model_name + ".config")

def _get_model_file(model_path, model_name):
    return os.path.join(model_path, model_name + ".tar")

def _get_result_file(model_path, model_name):
    return os.path.join(model_path, model_name + "_results.json")

def save_model(model, model_path, model_name):
    # BigramLanguageModel defines no config attribute, so fall back to an empty dict
    config_dict = getattr(model, "config", {})
    os.makedirs(model_path, exist_ok=True)
    config_file, model_file = _get_config_file(model_path, model_name), _get_model_file(model_path, model_name)
    with open(config_file, "w") as f:
        json.dump(config_dict, f)
    torch.save(model.state_dict(), model_file)

def train_model(train_set, model, model_name, optimizer, max_iter=1000, batch_size=256, block_size=32, overwrite=False, save_checkpoint=False):
    """
    Train a language model on random blocks of tokens sampled from the training set

    Inputs:
        train_set - 1D tensor of token ids used for training
        model - Model whose forward pass returns (logits, loss)
        model_name - (str) Name of the model, used for creating the checkpoint names
        optimizer - Optimizer created over the model's parameters
        max_iter - Number of iterations we want to train for
        batch_size - Size of batches used in training
        block_size - Number of tokens in each training sequence
        overwrite - Determines how to handle the case when there already exists a checkpoint. If True, it will be overwritten. Otherwise, we skip training.
        save_checkpoint - If True, save the trained model under CHECKPOINT_PATH
    """
    model.to(device)
    model_file = _get_model_file(CHECKPOINT_PATH, model_name)
    file_exists = os.path.isfile(model_file)
    if file_exists and not overwrite:
        print(f"Model file of \"{model_name}\" already exists. Skipping training...")
        # Load the saved weights so that the returned model is the trained one
        model.load_state_dict(torch.load(model_file, map_location=device))
    else:
        if file_exists:
            print("Model file exists, but will be overwritten...")

        ############
        # Training #
        ############
        model.train()
        for step in range(max_iter):
            # Sample a random batch of input blocks and their shifted-by-one targets
            inputs, targets = get_batch(train_set, batch_size, block_size)
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            preds, loss = model(inputs, targets)
            if step % 500 == 0 or step == max_iter - 1:
                print(f"step {step}: train loss {loss.item()}")
            loss.backward()
            optimizer.step()

        if save_checkpoint:
            save_model(model, CHECKPOINT_PATH, model_name)

    return model


Train the model and print the loss every 500 iterations to see how it improves as training progresses.

In [6]:
# Create our bigram model and an AdamW optimizer
bigram_model = BigramLanguageModel(VOCAB_SIZE)
optimizer = torch.optim.AdamW(bigram_model.parameters(), lr=LR)
bigram_model = train_model(training_data, bigram_model, "bigram_model", 
                            optimizer, max_iter=MAX_ITER, batch_size=BATCH_SIZE, block_size=BLOCK_SIZE)

step 0: train loss 7.143737316131592
step 500: train loss 6.596506595611572
step 1000: train loss 6.02191686630249
step 1500: train loss 5.642648696899414
step 2000: train loss 5.2434821128845215
step 2500: train loss 4.948191165924072
step 3000: train loss 4.658974647521973
step 3500: train loss 4.507926940917969
step 4000: train loss 4.327456474304199
step 4500: train loss 4.228084564208984
step 4999: train loss 4.095427513122559
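
To put these loss values in context, a short calculation helps. A model that assigns equal probability to all 750 tokens has a cross-entropy of ln(750), roughly 6.62. The logged starting loss of about 7.14 is somewhat above that, which is plausible for a randomly initialized lookup table whose logits are not uniform. The sketch below also converts the final logged loss into a perplexity, which can be read as the effective number of equally likely next tokens; the variable names are chosen for this example only.

```python
import math

VOCAB_SIZE = 750

# Cross-entropy of a model that guesses uniformly over the vocabulary
uniform_loss = math.log(VOCAB_SIZE)

# Final training loss taken from the log above
final_loss = 4.095427513122559
perplexity = math.exp(final_loss)

print(f"uniform baseline loss: {uniform_loss:.3f}")
print(f"final perplexity: {perplexity:.1f}")
```

These are training losses on single random batches, so they are noisy and say nothing yet about performance on the validation data.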


Generate some text from the trained model and see how it compares to the untrained model.

In [7]:
starting_text = "Romeo Romeo wherefore art thou Romeo"
starting_tokens = tokenizer.encode(starting_text).ids
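# reshape(-1,1) gives shape (T, 1): T sequences of one token each, so indexing [0] below
# continues only from the first prompt token; reshape(1,-1) would keep the prompt as one sequence
starting_tokens = torch.tensor(starting_tokens, dtype=torch.long).reshape(-1,1)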
print(tokenizer.decode(bigram_model.generate(idx = starting_tokens, max_new_tokens=100)[0].tolist()))

R ome ap  thy  n ,  Go ese  't ant :
 S OM ge  su ck ET ain 
 LE sin ,  I  l ow ment s ad had  bl OR D IUS:
 pla in  your any  think  s o,  you  are  not  had  b ear ful  w o o  was  fi e ver n  t w   hath  hear t
 
 C ish 'd  he  ap ,  no  off their  b ack ,  his  m But  I HEN augh t,  li e g er:
 my  di es  will  fair ,  N ust 
 A  am ong u Which   est
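
Each step of generate turns the logits at the last position into probabilities with a softmax and then samples one token id from that distribution. The sketch below reproduces that single step in plain Python, using `random.choices` as a stand-in for torch.multinomial. The helper names and the four-token toy logits are hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, rng):
    """Sample one token id in proportion to its softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

toy_logits = [2.0, 0.5, -1.0, 0.0]  # hypothetical logits for a 4-token vocabulary
rng = random.Random(0)

probs = softmax(toy_logits)
samples = [sample_next(toy_logits, rng) for _ in range(1000)]

print("probabilities:", [round(p, 3) for p in probs])
print("sample counts:", [samples.count(i) for i in range(len(toy_logits))])
```

Because the next token is sampled rather than chosen greedily, repeated runs with different seeds produce different text even from the same trained model.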
