<a href="https://colab.research.google.com/github/nnema05/learning-LLMs/blob/main/small_scale_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a small scale LLM

Implementing a small-scale LLM with a small dataset and a character-level tokenizer.

References:

[Let's build GPT: from scratch, in code, spelled out.](https://www.youtube.com/watch?v=kCc8FmEb1nY&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&index=7)

[Building LLMs from the Ground Up: A 3-hour Coding Workshop](https://www.youtube.com/watch?v=quh7z1q7-uc)


In [22]:
!pip install datasets

# PyTorch (like TensorFlow) stores data in tensors, which are general-purpose n-dimensional containers
import torch
import torch.nn as nn
import torch.nn.functional as F

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Using device:", device)

Using device: cuda


In [23]:
# pip install -U datasets

In [24]:
## STEP 1: Dataset!

from datasets import load_dataset
# small dataset for training small scale llms
dataset = load_dataset("roneneldan/TinyStories")
print(dataset["train"][0])

{'text': 'One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.\n\nLily went to her mom and said, "Mom, I found this needle. Can you share it with me and sew my shirt?" Her mom smiled and said, "Yes, Lily, we can share the needle and fix your shirt."\n\nTogether, they shared the needle and sewed the button on Lily\'s shirt. It was not difficult for them because they were sharing and helping each other. After they finished, Lily thanked her mom for sharing the needle and fixing her shirt. They both felt happy because they had shared and worked together.'}


In [25]:
print(dataset)


DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 2119719
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 21990
    })
})


In [26]:
## STEP 2: Get Text!

# an LLM needs a sequence of text
# extract text from the dataset and combine it into one long string
  # we take only the first 500 text samples to keep things small
texts = [example["text"] for example in dataset["train"].select(range(500))]
raw_text = "\n".join(texts)

print("Total characters:", len(raw_text))
print("Preview:", raw_text[:300])

Total characters: 402933
Preview: One day, a little girl named Lily found a needle in her room. She knew it was difficult to play with it because it was sharp. Lily wanted to share the needle with her mom, so she could sew a button on her shirt.

Lily went to her mom and said, "Mom, I found this needle. Can you share it with me and 


In [27]:
## STEP 3: Create Tokenizer!

# character-level tokenizer
  # this means each character in a text is a separate token
  # has a fixed vocab: define it once and it maps char → int
  # the tokenizer is a tool that converts text into tokens (in our case characters)
  # then maps each character to a token id using a vocabulary (the list of unique tokens)
  # and uses the map to encode (token -> id) and decode (id -> token)
  # ids let the machine process text as ints (and eventually as vectors of floats)


# get a list of unique characters that occur in this text
# unique characters = vocabulary
# the number of them is our vocabulary size

chars = sorted(list(set(raw_text)))
vocab_size = len(chars)
print(f"Unique characters: {vocab_size}")


Unique characters: 74


In [28]:
# continue tokenizer by making a map from token char -> int for our vocab!
class CharTokenizer:
    def __init__(self, text):
        self.chars = sorted(list(set(text))) # vocabulary = the unique chars in text, sorted
        self.vocab_size = len(self.chars)
        # stoi is our dictionary mapping for each token char -> int
        self.stoi = {ch: i for i, ch in enumerate(self.chars)}
        # itos is dictionary mapping each int -> token char
        self.itos = {i: ch for i, ch in enumerate(self.chars)}

    def encode(self, s):
        return [self.stoi[c] for c in s] # takes a string, returns a list of integers corresponding to each char in string

    def decode(self, ids):
        return ''.join([self.itos[i] for i in ids]) # takes a list of integers, output a string

In [29]:
tokenizer = CharTokenizer(raw_text)

encoded = tokenizer.encode("hellooo")
decoded = tokenizer.decode(encoded)

print("Encoded:", encoded)
print("Decoded:", decoded)

Encoded: [49, 46, 53, 53, 56, 56, 56]
Decoded: hellooo
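
To make the char → int mapping concrete, here is a minimal sketch on a toy string rather than the TinyStories text. The string `"hello world"` and the `unknown` check are illustrative assumptions, not part of the notebook's pipeline. The second half shows a limitation worth knowing: `encode` as written above would raise `KeyError` for any character that never appeared in the text the vocabulary was built from.

```python
# Toy vocabulary built the same way as CharTokenizer, but from a short string.
text = "hello world"
chars = sorted(set(text))                      # unique characters, sorted
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> id
itos = {i: ch for ch, i in stoi.items()}       # id -> char

ids = [stoi[c] for c in "hello"]
roundtrip = "".join(itos[i] for i in ids)
print(chars)       # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(ids)         # [3, 2, 4, 4, 5]
print(roundtrip)   # hello

# Characters that never appeared in the source text have no id.
# (Hypothetical check; looking them up in stoi would raise KeyError.)
unknown = [c for c in "hazel" if c not in stoi]
print(unknown)     # ['a', 'z']
```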


In [30]:
## STEP 4: Turn tokenized text into training sequences!
  # Create training data for the model by slicing your tokenized text into many short input→target pairs.
  # LLMs are trained to predict next token
  # so need training examples like:
    # Input Sequence: "The el" -> Target Sequence: "he elk"
    # Input Sequence: "he elk" -> Target Sequence: "e elk "

# tensors are containers that work like a matrix but extend to any number of dimensions (needed because ML work is mostly matrix math)
# encode text into list of token id's and store it in a tensor!
data = torch.tensor(tokenizer.encode(raw_text), dtype=torch.long)
print(data.shape, data.dtype) # 1D vector with 402933 tokens (vector of all our token ids)
  # 402933 is number of characters in our raw text!

torch.Size([402933]) torch.int64
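
The input → target idea from the comments above can be sketched on plain text before working with token ids. This is only an illustration: the sentence `"The elk runs"` and the window length of 6 are assumptions chosen to reproduce the "The el" → "he elk" example, not values used for training.

```python
# Toy text instead of the encoded TinyStories tensor (an assumption for readability).
text = "The elk runs"
block_size = 6

pairs = []
for start in range(2):
    x = text[start : start + block_size]           # input window
    y = text[start + 1 : start + block_size + 1]   # same window shifted by one
    pairs.append((x, y))
    print(repr(x), "->", repr(y))
# 'The el' -> 'he elk'
# 'he elk' -> 'e elk '
```

Each character of the target is the character that follows the one at the same position in the input, so one window provides `block_size` next-token examples.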


In [31]:
# split data into training data and validation data
# this is to get a sense of overfitting
# overfitting occurs when the model memorizes the training data but performs badly on new or unseen data
# hiding part of the data from the model (val_data) lets us check how well it predicts on the unseen part
  # training data is the only data the model sees during training
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

In [32]:
# block size = how many tokens the model sees at once
  # this is the length of patterns the model can learn at a time
  # we will never feed the entire text into transformer
  # the chunks have a max size or maximum length
  # maximum length is called block size!
block_size = 64 # do 64 tokens

# batch size is how many independent training sequences (each of length block_size) are processed in parallel during one training step
  # basically giving model 4 mini documents each 64 tokens long
  # it reads each document independently but in parallel!
  # this batching speeds up training!
    # instead of updating the model 1 sequence at a time, we do 4 at once
batch_size = 4

In [33]:
# to visualize how LLM will check tokens one at time
# gets the input data up to block size and then the y is the target offset by 1
  # so it includes the next token
x = train_data[:block_size]
y = train_data[1:block_size+1] # y is next block size characters so its offset by 1

# at each position t, the model sees the tokens up to and including t and aims to predict the next token
  # for easier visualization in the for loop, use a block size of 8!
small_block_size = 8
for t in range(small_block_size):
    context = x[:t+1] # the context: all tokens up to and including position t
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([31]) the target: 55
when input is tensor([31, 55]) the target: 46
when input is tensor([31, 55, 46]) the target: 1
when input is tensor([31, 55, 46,  1]) the target: 45
when input is tensor([31, 55, 46,  1, 45]) the target: 42
when input is tensor([31, 55, 46,  1, 45, 42]) the target: 66
when input is tensor([31, 55, 46,  1, 45, 42, 66]) the target: 6
when input is tensor([31, 55, 46,  1, 45, 42, 66,  6]) the target: 1


In [34]:
# function that gets a batch of 4 block-sized sequences at a time
# we pick 4 random starting positions i in the text
  # then extract x (input) and y (target), which are block-sized sequences
    # where the target is offset by one from the input!
    # random starting points give you a new batch of examples every time
       # this also helps against overfitting because it discourages the model from memorizing specific patterns at specific positions
       #  model can’t rely on positional bias like “always start with ‘Once’”
       # this pushes it to learn structure, not just surface patterns.
def get_batch(split):
    source = train_data if split == "train" else val_data # token id tensor for the requested split
    ix = torch.randint(len(source) - block_size, (batch_size,)) # picks batch size number of random positions
    x = torch.stack([source[i:i+block_size] for i in ix]) # inputs
      # extracts 4 slices of 64 tokens at random starting positions and stacks them in x
    y = torch.stack([source[i+1:i+block_size+1] for i in ix]) # targets
      # extract same 64 tokens but shifted 1 token forward
    return x, y

xb, yb = get_batch("train")
print("input batch dimensions:", xb.shape) # (4, 64): 4 sequences of 64 tokens each
print('inputs:')
print(xb)
print("target batch dimensions:", yb.shape)
print('targets:')
print(yb)

input batch dimensions: torch.Size([4, 64])
inputs:
tensor([[ 1,  3, 24, 46, 53, 53, 56,  1, 61, 49, 46, 59, 46,  6,  1, 25,  5, 54,
          1, 61, 59, 62, 44, 52,  2,  1, 39, 56, 62, 53, 45,  1, 66, 56, 62,  1,
         53, 50, 52, 46,  1, 61, 56,  1, 61, 42, 52, 46,  1, 42,  1, 59, 50, 45,
         46, 16,  3,  0, 36, 49, 46,  1, 53, 50],
        [ 1, 29, 42, 59, 66,  1, 64, 46, 55, 61,  1, 61, 56,  1, 61, 49, 46,  1,
         57, 42, 59, 52,  6,  1, 52, 50, 44, 52, 46, 45,  1, 61, 49, 46,  1, 44,
         50, 59, 44, 53, 46, 60,  1, 42, 59, 56, 62, 55, 45,  1, 47, 56, 59,  1,
         42,  1, 64, 49, 50, 53, 46,  6,  1, 42],
        [55, 48,  1, 50, 47,  1, 60, 49, 46,  1, 62, 60, 46, 45,  1, 49, 46, 59,
          1, 50, 54, 42, 48, 50, 55, 42, 61, 50, 56, 55,  8,  0, 31, 55, 44, 46,
          1, 62, 57, 56, 55,  1, 42,  1, 61, 50, 54, 46,  6,  1, 61, 49, 46, 59,
         46,  1, 64, 42, 60,  1, 42,  1, 53, 50],
        [64,  1, 50, 61,  1, 54, 42, 45, 46,  1, 49, 46, 59,  1, 47, 
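
A quick way to convince yourself that the targets really are the inputs shifted by one token is to run the same slicing on data where the answer is obvious. The sketch below is an assumption-laden toy: `source` is `torch.arange(100)` standing in for `train_data`, and the sizes are shrunk to 8 and 4 for readability.

```python
import torch

torch.manual_seed(0)           # only to make the sketch repeatable
block_size, batch_size = 8, 4  # smaller than the notebook's block size, for readability
source = torch.arange(100)     # stand-in for train_data: each "token id" is its own position

# Same slicing logic as get_batch, applied to the toy source.
ix = torch.randint(len(source) - block_size, (batch_size,))
x = torch.stack([source[i : i + block_size] for i in ix])
y = torch.stack([source[i + 1 : i + block_size + 1] for i in ix])

# The target at position t is the input at position t + 1.
shifted_ok = torch.equal(x[:, 1:], y[:, :-1])
# Because source is arange, every target is exactly input + 1.
plus_one_ok = torch.equal(y, x + 1)
print(x.shape, y.shape, shifted_ok, plus_one_ok)
```

The upper bound `len(source) - block_size` passed to `torch.randint` is exclusive, so the largest start index still leaves room for the extra target token at the end.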

In [35]:
## STEP 5: Build a transformer! (small, only 2 layers!)
  # like GPT, this is a decoder-only transformer (which generates text by predicting the next token)

  # training data gets passed through the transformer
  # a transformer is a type of neural network (a neural network is made of layers: an input layer that receives the initial data, one or more hidden layers that perform computations, and an output layer)
    # neurons within and between layers are connected, and these connections have weights
    # the network learns weights that help it make decisions (about which token comes next)
    # weights are updated during training to reduce prediction error
  # transformers handle sequences: they see all tokens in a sequence at once and use attention
  # transformers have two key ideas
    # Self-attention: every token can use other tokens in the input as context
    # Layered architecture: transformers stack layers of attention + feedforward networks, each layer transforming the input


In [36]:
# variables needed for transformer
batch_size = 4
block_size = 64
vocab_size = tokenizer.vocab_size  # number of unique characters
# embedding is a way to turn an integer (like a token ID) into a vector
embed_dim = 128  # each token becomes a vector of size 128

# heads: One head = one way to look at relationships between tokens.
# single head attention: learns how one token should pay attention to others in the same sequence
num_layers = 2  # how many transformer blocks
  # a transformer block is a building unit that has Self-Attention, a Feedforward MLP, LayerNorm + Residual Connections
  # we want two layers of this, so the process repeats twice
  # because a deeper network can learn more complex and abstract patterns

In [37]:
# 1. first neural network component: the Embedding layer!
  # Why Embedding?: an embedding is a way to turn an integer (like a token ID) into a vector
  # the vector is the way the neural network understands token ids
  # that's how the transformer looks at every token (as a vector!)
  # these vectors are learnable: embedding vectors are initialized randomly at the start
  # During training: model predicts the next token, loss is calculated, then we do backpropagation to update the embedding values to reduce that error
  # After training: the values in each token's embedding vector capture how that token relates to other tokens
    # similar tokens should have similar vectors after training.

  # HOW THE VECTORS CHANGE: each layer takes vectors in and outputs updated vectors!!
    # they change as the neurons in each layer do some math on them!

class EmbeddingLayer(nn.Module): ## nn.Module is a PyTorch base class (PyTorch way to define and run a neural network)
  # build on nn.Module for anything that has layers and parameters and knows how to run forward computations

    def __init__(self, vocab_size, embed_dim, block_size):
        super().__init__()
        # creates a lookup table that maps token IDs to vectors: each vector has length embed_dim and there are vocab_size vectors
          # vocab_size by embed_dim = 74 by 128
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)
        # lookup table of positions: attention by itself has no sense of order, so we map each of the 64 position indexes to a vector
          # block_size by embed_dim = 64 by 128
        self.position_embedding = nn.Embedding(block_size, embed_dim)
        # input = token_embedding + position_embedding = what token is + where token is in our block sequence

    # forward() defines how data moves through the layers, this is the computation
      # takes in the batch of tokens as input, embeds the tokens, and passes the result to the next layer!
    def forward(self, x): # x is a batch of token ID sequences (input into embedding layer!)

        B, T = x.shape # B = batch size, T = time steps (number of tokens per sequence)
        # token_embed is the vector for each of our tokens in x
        token_embed = self.token_embedding(x)                   # (B, T, D) # D = embed_dim = size of vector
        # tensor that has all positions in the sequence [0, 1, .. T-1]
        positions = torch.arange(T, device=x.device)           # (T,)
        # one vector for each position T from position embedding look up table
        pos_embed = self.position_embedding(positions)[None, :, :]  # (1, T, D)
        # adds positional info to each vector token
        return token_embed + pos_embed  # (B, T, D)
          # this output is like a stack of 4 (B) matrices, each with 64 rows (T) × 128 columns (D)
            # so each token in the batch is associated with its vector!
          # this output will be passed through the later transformer layers!

# neuron in your transformer is something like: output = ReLU(w₁·x₁ + w₂·x₂ + ... + b)
  # takes input numbers (like values in an embedding), Multiplies by weights, Adds a bias, passes through an activation, outputs a number
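
To check the shapes described in the comments above, here is a small standalone sketch. It uses the notebook's sizes (74 characters, 128-dimensional vectors, 64 positions, batch of 4) but a random batch of fake token ids, which is an assumption made so the block runs without the dataset.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, block_size, batch_size = 74, 128, 64, 4  # sizes from this notebook

token_embedding = nn.Embedding(vocab_size, embed_dim)
position_embedding = nn.Embedding(block_size, embed_dim)

x = torch.randint(0, vocab_size, (batch_size, block_size))  # fake batch of token ids
tok = token_embedding(x)                                     # (4, 64, 128)
pos = position_embedding(torch.arange(block_size))[None]     # (1, 64, 128)
out = tok + pos                                              # broadcasts over the batch

print(token_embedding.weight.shape)     # torch.Size([74, 128])
print(position_embedding.weight.shape)  # torch.Size([64, 128])
print(out.shape)                        # torch.Size([4, 64, 128])
```

The token table has one row per vocabulary entry and the position table has one row per position, so only the position table depends on `block_size`. The same position vectors are added to every sequence in the batch.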

In [38]:
# Attention is a communication mechanism.
# Can be seen as nodes in a directed graph looking at each other and aggregating information with a weighted sum from all nodes that point to them, with data-dependent weights.
# we can allow token vectors to talk to each other
  # if we mask, then tokens can only talk to tokens in the past!
# masked self-attention does this! (allows each token to look at itself and every earlier token, and decide how much to weigh each token's contribution)
# every single token emits a query and a key (plus a value, which carries the information to share)
# the query vector is "what am I looking for" and the key vector is "what do I contain"
# take the dot product between query and key: after scaling, masking and softmax, these scores become the attention weights
  # attention weight = how much each token wants to focus on each token it is allowed to see in its sequence

# self attention: each token wants to compute a new version of itself based on its own meaning and the context
class SelfAttention(nn.Module):
    def __init__(self, embed_dim): # embed dim: size of vector you are working with
        super().__init__()
        # each token vector is passed through three separate linear layers to get three new vectors:
          # linear layers have weights that control how input is turned into query, key and value
          # these weights are learned
        self.key = nn.Linear(embed_dim, embed_dim, bias=False) # what do I contain
        self.query = nn.Linear(embed_dim, embed_dim, bias=False) # what am I looking for
          # each query vector is a way for this token to define what kinds of relationships it cares about
            # or what tokens in the sequence are relevant to me right now
        self.value = nn.Linear(embed_dim, embed_dim, bias=False) # what information should I share with other tokens

        # builds a lower triangular matrix mask
          # so token at position t can only look at itself and tokens before it (not future tokens)
            # to get context!
            # bc LLM's are generating text sequentially, masked self-attention is used to prevent model from looking at future tokens that haven't been generated yet!
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size)))

        # dropout: randomly zeroes some values during training
          # each pass therefore trains a slightly different subnetwork
          # at evaluation time (model.eval()) dropout is switched off and the full network is used
            # This forces the network to not rely too heavily on any single neuron or token
        self.dropout = nn.Dropout(0.1) # each value is dropped with probability 0.1 during training

  # forward() defines how data moves through the layers, this is the computation
    def forward(self, x): # x is the input from the embedding layer (or from the previous transformer block)
        B, T, C = x.shape # C = "channels" by naming convention = embed_dim = size of token vector
        # gets key, query, value vectors for each token in the sequence
        # one C-dim vector for each of T tokens in each of B sequences
        k = self.key(x)    # (B, T, C)
        q = self.query(x)  # (B, T, C)
        v = self.value(x)  # (B, T, C)

        # each query is then dot-multiplied with all the key vectors in the sequence to compute
          # How strongly does token A match token B?
          # these are raw scores (how well each token’s query matches every other token’s key in the sequence)
          # they are scaled by 1/sqrt(C) so the softmax below does not saturate
          # after masking and softmax they become attention weights: how much each token focuses on each token it can see
        attn_weights = q @ k.transpose(-2, -1) * (C ** -0.5)  # (B, T, T)
        # applies the lower triangular mask (sets future tokens to -inf)
          # ensures model predicts one token at a time using only the past
        attn_weights = attn_weights.masked_fill(self.mask[:T, :T] == 0, float('-inf')) # -infinity
        # turn raw scores into probabilities between 0-1 that sum to 1
          # this is bc we want to compute weighted average of value vectors (later on)
            # so weights need to be non-negative, sum to 1 so model can say "I’ll take 70% of this token’s value, 20% of that one, and 10% of this other one"
        attn_weights = F.softmax(attn_weights, dim=-1)
        # dropout: 0's out some attention probabilities at random
        attn_weights = self.dropout(attn_weights)

        # compute a weighted average of value vectors: multiply each value vector by how much it was attended to
          # Each token gets its own query, which creates its own attention distribution:
          # for each token, multiply its attention weights by the value vectors of the tokens it can see and sum them
        # matrix multiplication does this for all tokens at once
        out = attn_weights @ v  # (B, T, C)
        return out # output is made of new vector per token which has CONTEXT from other tokens
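
The mask → softmax → weighted-average steps are easier to see on a 3-token toy case. The sketch below assumes all raw scores are zero (every query matches every key equally) and uses made-up one-dimensional value vectors, so the only thing shaping the result is the causal mask.

```python
import torch
import torch.nn.functional as F

T = 3
scores = torch.zeros(T, T)                 # pretend every query matches every key equally
mask = torch.tril(torch.ones(T, T))        # lower triangle: 1 = allowed, 0 = future token
masked = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(masked, dim=-1)        # each row sums to 1
print(weights)
# row 0 -> [1, 0, 0]; row 1 -> [0.5, 0.5, 0]; row 2 -> [1/3, 1/3, 1/3]

# Weighted average of toy value vectors: each row mixes only itself and earlier rows.
v = torch.tensor([[1.0], [3.0], [5.0]])
out = weights @ v
print(out)                                 # [[1.0], [2.0], [3.0]]
```

Setting future positions to `-inf` before the softmax gives them a weight of exactly zero, which is why token 0 can only attend to itself while token 2 averages over all three tokens.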




In [39]:
# transformer block
  # runs self-attention: builds context-aware vectors
  # refines each vector with a feedforward network/Multi-Layer Perceptron
  # adds the original token vectors to the newly computed vectors (residual connection)
  # normalizes everything: rescales each token vector to roughly zero mean and unit variance (LayerNorm)
# the net effect: refining each token

class TransformerBlock(nn.Module):
    def __init__(self, embed_dim):
        super().__init__()
        # creates our self-attention block which will output our contextualized batch of token vectors
        self.attn = SelfAttention(embed_dim)

        # feedforward = Multi-Layer Perceptron
        # the feedforward block is a small neural network that helps each token get refined
          # adds nonlinearity to each token separately
          # with nonlinearity you can model complex, non-linear patterns
        self.ffwd = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim), # expands the vector to 4x its size
            nn.ReLU(), # adds nonlinearity (which allows for detection of complex things rather than just using linear functions)
            nn.Linear(4 * embed_dim, embed_dim), # compresses the vector back to embed_dim
        )
        # layer normalization keeps all token vectors on the same stable scale of numbers
        self.ln1 = nn.LayerNorm(embed_dim)
        self.ln2 = nn.LayerNorm(embed_dim)

    def forward(self, x): # x is our batch
      # do the attention to return a new context-aware version of each token
      # Residual connection: add original token to attention output to keep original token info in total output
      # Normalizes result!
        x = self.ln1(x + self.attn(x))   # residual connection + norm
      # after that, each vector goes through the Multi-Layer Perceptron
        # add another residual connection (original info about token added back into new computed output)
        # normalize result
        x = self.ln2(x + self.ffwd(x))   # residual connection + norm
        return x # (B, T, D) -> batch of refined token vectors
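
What "normalizing" means for `nn.LayerNorm` can be checked on two made-up token vectors with very different scales. This is a sketch with hypothetical 4-dimensional inputs rather than the notebook's 128-dimensional ones, using a freshly created layer (scale 1, shift 0).

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(4)                          # fresh LayerNorm: scale = 1, shift = 0
x = torch.tensor([[1.0, 2.0, 3.0, 4.0],
                  [10.0, 20.0, 30.0, 40.0]])  # two "token vectors" on very different scales
out = ln(x)

mean = out.mean(dim=-1)                       # per-token mean, approximately 0
var = out.var(dim=-1, unbiased=False)         # per-token variance, approximately 1
print(out)
print(mean, var)
```

Each token vector is normalized on its own, across its features, so both rows end up with nearly identical values even though the second input is ten times larger.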



In [40]:
# Full transformer with all of our layers

class TinyTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        # embedding layer: turns token ids into vectors with positional information for each token
        self.embedding = EmbeddingLayer(vocab_size, embed_dim, block_size) # (B, T, D)
        # num_layers of Transformer blocks (which applies self attention, feedforward and adds in residuals (old info))
        self.blocks = nn.Sequential(*[TransformerBlock(embed_dim) for _ in range(num_layers)])
        # layer norm which normalizes vector one last time
        self.ln_f = nn.LayerNorm(embed_dim)
        # final prediction layer which maps each token vector to a vector of length vocab_size
        self.head = nn.Linear(embed_dim, vocab_size)  # maps final vector to vocab logits

    def forward(self, x, targets=None):
        tok = self.embedding(x)        # (B, T, D) # get embeddings
        out = self.blocks(tok)         # (B, T, D) # pass through transformer blocks
        out = self.ln_f(out)           # (B, T, D) # normalizes vectors
        # gets logit scores!
         # so for each token position in the input sequence, the model outputs a corresponding vector of logit scores
          # the vector of logit scores has length vocab_size
          # each logit score is the raw prediction of how likely each possible next token is
          # logit scores are NOT probabilities yet!
        logits = self.head(out)        # (B, T, vocab_size)

        # inference mode: if no training targets are given, just return logits
        if targets is None:
            return logits

        # else train! compute loss for training from the targets
        B, T, C = logits.shape # logits holds Batch × Time-step predictions, where each prediction is a vector of vocab_size logit scores
        logits = logits.view(B*T, C) # reshapes logits and targets for the cross entropy function
          # B*T = total number of tokens in the batch
        targets = targets.view(B*T)
        # computes loss by comparing predicted scores to the correct next-token IDs and gives 1 loss number
          # loss number is small if model predicts correctly!
        loss = F.cross_entropy(logits, targets)
        return logits, loss

    # generate(): produces new tokens!
    # gets initial input, runs the model forward, selects the next token, appends
    # repeats until you've generated total max_new_tokens
    @torch.no_grad() # tells pytorch to not track gradients during function
      # gradients are used during training to update the model weights, calculates how wrong the model was and how to adjust weights for next time
      # but since we are in generation (not training) there is no need to use gradients so it runs faster
    def generate(self, idx, max_new_tokens):
      # idx is the token ids (B, T) B = batch size, T = length of input
      # max new tokens is how many additional tokens to generate
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -block_size:]  # crop sequence to the last block_size tokens
            # run model to get logits! (with no targets, forward() returns only the logits, so one call is enough)
            logits = self(idx_cond)
            # logits shape before: (B, T, vocab_size) — one prediction per token
            # After: (B, vocab_size) = prediction for the last token
            logits = logits[:, -1, :]
            # convert logits into probabilities
            probs = F.softmax(logits, dim=-1)
            # randomly pick one token from the probability distribution
              # sampling instead of always taking the max allows for variation
            idx_next = torch.multinomial(probs, num_samples=1)
            # append to sequence
            idx = torch.cat((idx, idx_next), dim=1)
        # after generating max_new_tokens new tokens, return the entire sequence (prompt + generated tokens)!
        return idx
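
The loss computed in `forward` can be worked out by hand for a single position. The sketch below is a plain-Python re-implementation for illustration only (the helper name `cross_entropy` and the example logits are assumptions); `F.cross_entropy` computes the same quantity and, by default, averages it over all `B*T` positions.

```python
import math

def cross_entropy(logits, target):
    """Loss for one position: -log(softmax(logits)[target])."""
    m = max(logits)                               # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    prob_target = exps[target] / sum(exps)
    return -math.log(prob_target)

vocab_size = 74                                   # vocabulary size from this notebook
uniform = [0.0] * vocab_size                      # model has no preference yet
confident = [0.0] * vocab_size
confident[5] = 10.0                               # strongly prefers token 5

print(round(cross_entropy(uniform, 5), 4))        # 4.3041, i.e. ln(74)
print(cross_entropy(confident, 5))                # close to 0: confident and right
print(cross_entropy(confident, 6))                # about 10: confident and wrong
```

A model that guesses uniformly over 74 characters scores about 4.30, so a freshly initialized model would be expected to start in that neighborhood, and training should push the loss below it.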



In [41]:
model = TinyTransformer()
print(f"Model parameters: {sum(p.numel() for p in model.parameters())}")

Model parameters: 390218
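
The parameter count printed above can be reproduced with plain arithmetic from the layer definitions. This sketch assumes the sizes used in this notebook (74 characters, `embed_dim = 128`, `block_size = 64`, 2 blocks) and counts only learnable parameters; the attention mask is registered as a buffer, so it is not included.

```python
vocab_size, embed_dim, block_size, num_layers = 74, 128, 64, 2  # sizes from this notebook

embeddings = vocab_size * embed_dim + block_size * embed_dim     # token + position tables
attention = 3 * embed_dim * embed_dim                            # key, query, value (no bias)
ffwd = (embed_dim * 4 * embed_dim + 4 * embed_dim) \
     + (4 * embed_dim * embed_dim + embed_dim)                   # two Linear layers with bias
layer_norms = 2 * (2 * embed_dim)                                # ln1 and ln2: weight + bias each
block = attention + ffwd + layer_norms                           # one TransformerBlock
final_norm = 2 * embed_dim                                       # ln_f
head = embed_dim * vocab_size + vocab_size                       # output Linear with bias

total = embeddings + num_layers * block + final_norm + head
print(block)   # 181376
print(total)   # 390218
```

Most of the parameters sit in the two transformer blocks, and within each block the feedforward network is larger than the attention layer.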


In [43]:
# TEST MODEL WITHOUT TRAINING!
model = model.to(device)
model.eval() # switch off dropout while generating
# pick a starting character or sequence
start_text = "The elk runs through"

# encode it to token IDs
start_ids = tokenizer.encode(start_text)

# convert to tensor then move to device
start_tensor = torch.tensor([start_ids], dtype=torch.long).to(device)

# Generate 100 tokens after the start
output_tensor = model.generate(start_tensor, max_new_tokens=100)

# Convert tensor back to list of token IDs
output_ids = output_tensor[0].tolist()

# Decode back to string
generated_text = tokenizer.decode(output_ids)

print("Generated text:")
print(generated_text)

# The text is gibberish! This is because we haven't trained it!

Generated text:
The elk runs throughq"wo 1"GkZUft '-sE$ku gUrqzOVdâ:L$g2bLœ“™kz
s.q:JT™2KlœpZNDUHo!sLœr3Pz;?.t””LvvIVJ.pKufmf.cD3jwO;0zv
