<a href="https://colab.research.google.com/github/neel26desai/transformers_and_finetuning_with_LLM/blob/main/Creating_your_own_GPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

In this Colab we will create our own GPT, trained on a large corpus of text. We will use the play "As You Like It" by William Shakespeare as our main corpus, so the trained transformer will generate text as if it were part of the play itself.

We will follow https://www.youtube.com/watch?v=kCc8FmEb1nY&t=18s as our guide on how to train our model.


The data we are using can be found at https://shakespeare.mit.edu/asyoulikeit/full.html. The entire play is there; copy it and paste it into a text file named `as_you_like_it.txt`, which is the filename the loading code reads.


Instead of predicting the next word in the sequence, we make the problem easier for a from-scratch implementation by predicting the next character of the sequence.
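To make the next-character framing concrete, here is a minimal plain-Python sketch. The sample string and the helper name `next_char_examples` are made up for illustration; `context_len` plays the role that `block_size` plays later in the notebook.

```python
# Illustrative only: the sample string and context length are made up.
sample = "Rosalind"
context_len = 4  # plays the role of block_size in the notebook

def next_char_examples(s, context_len):
    """Return (context, target) pairs for next-character prediction."""
    pairs = []
    for end in range(1, min(context_len, len(s) - 1) + 1):
        # everything up to `end` is the context, the next character is the target
        pairs.append((s[:end], s[end]))
    return pairs

for context, target in next_char_examples(sample, context_len):
    print(f"context {context!r} -> target {target!r}")
```

Each window of text therefore yields several training examples at once, one per context length.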

In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F

In [15]:
# Hyperparameters: batch size, block size, number of iterations, learning rate, etc., which control model training behavior.
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0

# Data Loading and Encoding/Decoding


1.   Loads text data from `as_you_like_it.txt`.
2.   Sets up character-level encoding and decoding:
      1. `stoi`: Converts characters to integers.
      2. `itos`: Converts integers back to characters.
      3. `encode`/`decode` are lambda functions for encoding and decoding strings.
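As a quick illustration of the round trip, here is a small sketch on a toy string instead of the play. The names `toy_text`, `toy_encode` and `toy_decode` are hypothetical and exist only for this example.

```python
# Toy corpus standing in for the play text (hypothetical sample).
toy_text = "hello world"

toy_chars = sorted(set(toy_text))                      # unique characters, sorted
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}   # character -> integer
toy_itos = {i: ch for i, ch in enumerate(toy_chars)}   # integer -> character

def toy_encode(s):
    return [toy_stoi[c] for c in s]

def toy_decode(ids):
    return "".join(toy_itos[i] for i in ids)

ids = toy_encode("hello")
print(toy_chars)
print(ids)
print(toy_decode(ids))
```

Note that a character that never appears in the corpus has no entry in the mapping, so encoding it raises a `KeyError`.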



In [3]:
# read it in to inspect it
with open('as_you_like_it.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [4]:
print("length of dataset in characters: ", len(text))

length of dataset in characters:  120352


In [5]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string


# Train Test Split

Converts the loaded text data into a tensor, then splits it into training (90%) and validation (10%) datasets.
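For a sense of scale, the split sizes can be worked out from the dataset length printed earlier (120352 characters). This is only the arithmetic; the variable names are made up for the example.

```python
total_chars = 120352               # dataset length printed earlier in the notebook
n_train = int(0.9 * total_chars)   # first 90% of the characters
n_val = total_chars - n_train      # remaining characters
print(n_train, n_val)
```

Because the split is contiguous rather than shuffled, the validation set is the final part of the play.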

In [13]:
# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# Batch Data Loader
Generates batches of data for training and validation by sampling random indices.
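The key idea is that each target sequence is the input sequence shifted one position to the right. A minimal plain-Python sketch of that sampling scheme, assuming a toy integer list in place of the encoded text (`sample_batch` and `toy_data` are hypothetical names):

```python
import random

def sample_batch(data, block_size, batch_size, rng):
    """Pick batch_size random windows; targets are the inputs shifted by one."""
    xs, ys = [], []
    for _ in range(batch_size):
        # the start index must leave room for block_size inputs plus one target
        i = rng.randrange(len(data) - block_size)
        xs.append(data[i:i + block_size])
        ys.append(data[i + 1:i + block_size + 1])
    return xs, ys

toy_data = list(range(100, 120))   # stand-in for the encoded text
xb, yb = sample_batch(toy_data, block_size=4, batch_size=3, rng=random.Random(0))
for x, y in zip(xb, yb):
    print(x, "->", y)
```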

In [None]:
# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y


# Loss Estimation for Evaluation

Estimates the training and validation loss by averaging over `eval_iters` random batches with gradient tracking disabled, so the model can be evaluated periodically during training.

In [6]:
# estimate the mean loss on the train and val splits
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out


# Self-Attention Mechanism

## Head Class


**Self-Attention Basics**

The `Head` class implements a single attention head to focus on different parts of the input sequence.

**Key Concepts:**

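* **Key (k):** A projection of each token describing what information that token offers to others.
* **Query (q):** A projection of each token describing what information that token is looking for; the dot product of a query with a key gives the affinity between two tokens.
* **Value (v):** A projection of each token holding the content that is aggregated once the affinities are known.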

**Masking:**

A triangular mask (tril) ensures that the model does not attend to future tokens, preserving causality.

**Output:**

The weighted attention scores (wei) are combined with values (v) to aggregate information from the input sequence.
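The effect of the causal mask followed by softmax can be seen on a tiny example. The sketch below is plain Python rather than PyTorch, and `causal_attention_weights` and `toy_scores` are hypothetical names used only here.

```python
import math

def causal_attention_weights(scores):
    """Mask future positions with -inf, then apply softmax to each row."""
    T = len(scores)
    weights = []
    for t in range(T):
        # position t may only look at positions 0..t
        row = [scores[t][j] if j <= t else float("-inf") for j in range(T)]
        m = max(row)
        exps = [math.exp(s - m) for s in row]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

toy_scores = [[0.0, 0.0, 0.0] for _ in range(3)]   # equal affinities everywhere
for row in causal_attention_weights(toy_scores):
    print([round(w, 3) for w in row])
```

With equal scores, each position spreads its attention uniformly over itself and the positions before it, and gives exactly zero weight to later positions.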

In [7]:
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs), where hs = head_size
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

## Multi-Head Attention

Instead of a single attention head, multi-head attention utilizes several heads operating in parallel. This empowers the model to capture diverse relationships between tokens. Each head can focus on distinct aspects of the input, resulting in a richer understanding.

**Initialization:**

The class creates a list of attention heads. Each head independently computes its own set of query, key, and value transformations.

**Forward Pass:**

The outputs from each head are concatenated along the last dimension using `torch.cat`.

These concatenated outputs are then fed through a final linear projection (`proj`) followed by a dropout layer.
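A small shape check may help explain why the projection maps `n_embd` to `n_embd`. The sketch uses the hyperparameter values from this notebook; the stand-in head outputs are made up.

```python
n_embd, n_head = 64, 4            # values from the hyperparameter cell
head_size = n_embd // n_head      # features produced by each head

# Stand-in head outputs for a single token: head h produces head_size numbers.
head_outputs = [[float(h)] * head_size for h in range(n_head)]

# Concatenating along the feature dimension restores a vector of length n_embd.
concatenated = [value for head in head_outputs for value in head]
print(head_size, len(concatenated))
```

Since `n_head * head_size` equals `n_embd` here, the concatenated vector has the same width as the block's input.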

In [8]:
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out


## Feedforward Network

The feedforward network is a small multilayer perceptron applied to each token position independently, after the tokens have exchanged information through attention.

It comprises:

* **Linear Layer:** Expands the input dimension (`n_embd`) to four times its size (`4 * n_embd`).
* **ReLU Activation:** Introduces non-linearity.
* **Linear Layer:** Reduces the dimension back to the original size (`n_embd`).
* **Dropout Layer:** Regularizes training to reduce overfitting (a no-op in this notebook, since `dropout = 0.0`).

This component helps the model learn more complex per-token transformations after the attention mechanism has been applied.
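The expand, ReLU, project-back pattern can be sketched without PyTorch. This is an illustration under assumptions: a toy embedding size of 4, random weights, and dropout left out because it is disabled in this notebook. The helper names `linear` and `feed_forward` are hypothetical.

```python
import random

def linear(x, weight, bias):
    """Compute weight @ x + bias for a single vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def feed_forward(x, w1, b1, w2, b2):
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]  # expand to 4*d, then ReLU
    return linear(hidden, w2, b2)                      # project back to d

rng = random.Random(0)
d = 4  # toy embedding size (the notebook uses n_embd = 64)
w1 = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(4 * d)]
b1 = [0.0] * (4 * d)
w2 = [[rng.uniform(-1, 1) for _ in range(4 * d)] for _ in range(d)]
b2 = [0.0] * d

out = feed_forward([1.0, -2.0, 0.5, 3.0], w1, b1, w2, b2)
print(len(out))
```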

In [9]:
class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)


## Transformer Block

The Transformer block is the fundamental building block of Transformer-based architectures. It applies two sublayers in sequence:

1. **Multi-Head Self-Attention (sa):** Preceded by Layer Normalization (ln1).
2. **Feed Forward Network (ffwd):** Preceded by another Layer Normalization (ln2).

**Residual Connections (x +):**

These connections are pivotal for ensuring effective gradient flow during backpropagation. They expedite learning and enhance convergence during training.
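In this notebook's `Block`, layer normalization is applied to the input of each sublayer and the sublayer's output is added back to the unnormalized input. A minimal plain-Python sketch of that update, assuming a layer norm without the learned scale and shift (`layer_norm` and `residual_step` are hypothetical helper names):

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_step(x, sublayer):
    """Pre-norm residual update: x + sublayer(layer_norm(x))."""
    update = sublayer(layer_norm(x))
    return [xi + ui for xi, ui in zip(x, update)]

x = [1.0, 2.0, 3.0, 4.0]
# A sublayer that outputs zeros leaves x unchanged, which is the identity path.
print(residual_step(x, lambda h: [0.0] * len(h)))
```

The identity path is what lets gradients flow directly from the output back to earlier layers.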

In [10]:
class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x


## BigramLanguageModel

**Embedding Layers**

* **Token Embedding (token_embedding_table):** Converts token indices into vector representations.
* **Positional Embedding (position_embedding_table):** Learns a representation for each position in the sequence, enabling the model to comprehend the order of tokens.
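The two embeddings are simply added element-wise. A toy sketch with made-up tables (three tokens, two positions, embedding size two) shows why the same token ends up with different vectors at different positions:

```python
# Toy lookup tables; all values are made up for illustration.
token_table = [[0.5, 0.25], [0.0, 1.0], [2.0, 4.0]]   # one row per token id
position_table = [[1.0, 0.0], [0.0, 1.0]]             # one row per position

def embed(idx):
    """Add the token vector and the position vector at each position."""
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(idx)
    ]

print(embed([2, 2]))
```

Token 2 appears twice, yet its two rows differ because each one includes a different position vector.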

**Blocks**

The model is constructed by stacking multiple Transformer blocks (`blocks`), each comprising self-attention and feedforward layers.

**Layer Normalization and Linear Projection**

Following all Transformer blocks, the model applies Layer Normalization (`ln_f`) and a final linear projection (`lm_head`) to generate logits for each token in the vocabulary.

**Loss Calculation**

If targets are provided, the model computes the cross-entropy loss between the logits and targets.
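For intuition about the loss values, here is a plain-Python sketch of cross-entropy for a single prediction. The vocabulary size of 64 is a hypothetical value chosen for illustration, and `cross_entropy` here is a hand-written helper, not the PyTorch function.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

toy_vocab = 64   # hypothetical vocabulary size, for illustration only
uniform_loss = cross_entropy([0.0] * toy_vocab, target=3)
print(round(uniform_loss, 4), round(math.log(toy_vocab), 4))
```

A model that spreads probability evenly over the vocabulary scores `ln(vocab_size)`, which is a useful reference point when reading the loss at the start of training.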

**Text Generation (generate)**

Generates new text iteratively by predicting the next token and appending it to the sequence. The multinomial sampling strategy is employed to generate tokens from the distribution.
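The generation loop can be sketched in plain Python with a stand-in for the model. Everything named here (`generate_tokens`, `toy_logits`, the five-token vocabulary) is hypothetical; `random.Random.choices` plays the role of multinomial sampling.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate_tokens(next_logits, idx, max_new_tokens, block_size, rng):
    """Autoregressive sampling with a cropped context window."""
    idx = list(idx)
    for _ in range(max_new_tokens):
        context = idx[-block_size:]              # keep only the last block_size tokens
        probs = softmax(next_logits(context))    # distribution over the next token
        next_token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        idx.append(next_token)                   # extend the running sequence
    return idx

# Hypothetical stand-in model: strongly prefers (last token + 1) mod 5.
def toy_logits(context):
    logits = [0.0] * 5
    logits[(context[-1] + 1) % 5] = 5.0
    return logits

generated = generate_tokens(toy_logits, [0], max_new_tokens=10, block_size=3,
                            rng=random.Random(0))
print(generated)
```

Sampling, rather than always taking the most likely token, is what gives the generated text its variety.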

In [11]:
# GPT-style transformer language model (the class name is a holdover from the simpler bigram model)
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # token and position embeddings feed a stack of transformer blocks; lm_head maps the result to vocabulary logits
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx


# Training

**Model Instantiation and Device Placement**

The model is instantiated and transferred to the appropriate device (CPU or GPU).
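The parameter count that the training cell prints can be reproduced by hand from the layer shapes. The sketch below is an estimate under one loud assumption: `vocab_size = 64` is not stated anywhere in the notebook and is used here only because it is consistent with the printed figure of 0.2096 M. The function name `count_parameters` is hypothetical.

```python
def count_parameters(vocab_size, n_embd=64, n_head=4, n_layer=4, block_size=32):
    head_size = n_embd // n_head
    heads = n_head * 3 * n_embd * head_size          # key, query, value (no bias)
    proj = n_embd * n_embd + n_embd                  # attention output projection
    ffwd = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    norms = 2 * 2 * n_embd                           # ln1 and ln2: weight + bias each
    block = heads + proj + ffwd + norms
    embeddings = vocab_size * n_embd + block_size * n_embd
    final = 2 * n_embd + (n_embd * vocab_size + vocab_size)  # ln_f + lm_head
    return n_layer * block + embeddings + final

# vocab_size=64 is an assumption, not a value taken from the notebook.
print(count_parameters(vocab_size=64) / 1e6, "M parameters")
```

The `tril` mask is registered as a buffer, not a parameter, so it does not contribute to the count.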

**Optimizer**

AdamW is used for optimization; it combines Adam's adaptive per-parameter step sizes with decoupled weight decay.

**Training Loop**

The model is trained for `max_iters` iterations.

Every `eval_interval` steps, and on the final step, the training and validation losses are estimated and printed to monitor training progress.

A batch is sampled using `get_batch`, and the model calculates the loss.

Backpropagation (`loss.backward()`) and an optimizer step are used to update model weights.

In [16]:
model = BigramLanguageModel()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for step in range(max_iters):  # named `step` to avoid shadowing the built-in `iter`

    # every once in a while evaluate the loss on train and val sets
    if step % eval_interval == 0 or step == max_iters - 1:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()


0.2096 M parameters
step 0: train loss 4.3407, val loss 4.3299
step 100: train loss 2.6129, val loss 2.6656
step 200: train loss 2.4445, val loss 2.4997
step 300: train loss 2.3624, val loss 2.4274
step 400: train loss 2.2596, val loss 2.3299
step 500: train loss 2.1844, val loss 2.2557
step 600: train loss 2.1278, val loss 2.2025
step 700: train loss 2.0697, val loss 2.1367
step 800: train loss 2.0155, val loss 2.0997
step 900: train loss 1.9880, val loss 2.0648
step 1000: train loss 1.9429, val loss 2.0381
step 1100: train loss 1.9172, val loss 1.9998
step 1200: train loss 1.8804, val loss 1.9787
step 1300: train loss 1.8564, val loss 1.9493
step 1400: train loss 1.8392, val loss 1.9496
step 1500: train loss 1.8125, val loss 1.9240
step 1600: train loss 1.7844, val loss 1.9244
step 1700: train loss 1.7637, val loss 1.9024
step 1800: train loss 1.7542, val loss 1.8913
step 1900: train loss 1.7418, val loss 1.8726
step 2000: train loss 1.7344, val loss 1.8747
step 2100: train loss 1.71

# Inference

**Context Initialization**

Generation starts from a `(1, 1)` tensor containing index 0, which `decode` maps to the first character of the sorted vocabulary; this single token serves as the seed context.

**Text Generation**

The model generates new tokens for `max_new_tokens` steps using the `generate` function.

Finally, the generated indices are converted back to text using `decode` and printed.


In [17]:
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))



Alcan thee, now be to change ward that sin,
I thrugh all take from span than this this honestreatisual.
ORLLANDO
O, all
And as his all enemortand many, sir: I fir should deast two be at man haved you.
ROSALIND
Alas I had did not brafess thou ctand:
If as she mursure and under, he word place.
JAQUES
Thus we know. He, wise
Come, that sbetted en my for fair?
ORLANDO
I tell be no muzk canst: If you as take thoubbere.
LE BesieuK
ORLAND, CERICK
ORLANDO
Why, you clean to this rewifiture as I much her than swo
as bask dead kind his but of blapench?
ORLANDO
Gone west throufarn good blat gives?
ORLANDO
Callin you do alt you.
ORLANDO
Whow I will go yout pisted, as a burt, To-morrow;
And, thou will you have me, been chime man hath
her liescansword; why sick upon sportd? O, who counster the duke it this
whenly lifestard mays: cond, poor of forth now,
A Hock What was give Oliver, bein shid, with his heart

he forest a reshed black this stai.
Then a haave not stan?
CORIN
Is did will in aymile, it ha