## GPT Model trained on tiny-Shakespeare data and Optimized performance and efficiency
This is a separate notebook for experimenting with the model.
Key Components of the model are:
1. word embedding: **Sub-word level tokens**
2. Positional encoding: **Rotational PE**
3. multi-headed self-attention (No cross attention): **GeLU activation function**
4. Transformer Decoder block (No Encoder since it is generating text on its own) - **RMS Normalization**

Other Optimization Implemented during Training:
- Weight Decay
- Learning Rate Scheduling
- Gradient Clipping
- Mixed precision training using Pytoch's Automatic Mixed Precision package

In [1]:
# Download the tiny-Shakesspeare text data from Github repo
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2023-12-21 07:14:32--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: 'input.txt.3'


2023-12-21 07:14:32 (17.8 MB/s) - 'input.txt.3' saved [1115394/1115394]



In [2]:
import torch
import torch.nn as nn
from torch.nn import functional as F
import time
from torch.cuda.amp import GradScaler, autocast # For mixed precision Training
#import torch_xla
#import torch_xla.core.xla_model as xm

Import data and Explore

In [3]:
# Read the tiny-Shakespeare txt file
with open('input.txt',mode='r',encoding='utf-8',closefd=True) as f:
  text = f.read()

In [4]:
# View file and its stats
print(type(text),len(text),text[1:100])

<class 'str'> 1115394 irst Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You


In [5]:
# List all unique characters used in the text
char1 = set(text)
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars),vocab_size,char1)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz 65 {"'", '$', 'H', 'P', 'l', 's', 'g', 'C', ':', 'G', 'b', ' ', 'D', 'K', 'v', 'J', 'S', ';', 'u', 'r', 'y', 'T', 'E', 'L', 'n', 'F', '!', '3', 'a', '&', '\n', 'f', 'O', 'B', 'w', 'q', 'm', 'p', 'U', '?', 'j', 'x', 'Q', 'Y', 'A', 'I', 'N', ',', '.', 'o', 'd', 'W', 'i', 'R', 'z', 'M', 'e', '-', 'Z', 't', 'k', 'X', 'V', 'c', 'h'}


Tokenize the text data

In [6]:
#print(''.join(chars),vocab_size,char1) # Create a mapping of characters and integers
stoi = {ch:i for i,ch in enumerate(chars)}
itos = {i:ch for i,ch in enumerate(chars)}
#print(stoi['A'],'\n', itos)

# Encoder takes string as input and provides integers as output.
encode = lambda s: [stoi[c] for c in s]

# Decoder takes list of integers as input an provides string as output
decode = lambda l: ''.join([itos[i] for i in l])

#print(encode('Hi. I am Nimish!!'))
#print(decode(encode('Hi. I am Nimish!!')))


In [7]:
# Lets encode entire text dataset and store it into a torch.tensor
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape,'\n',data.dtype,'\n',data[:100])

torch.Size([1115394]) 
 torch.int64 
 tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59])


Data prep for training
Split the data in to Train and Val (10%)
Divide the data in to chunks/blocks and batches

In [8]:
# Divide the data in to train (90%) and validation set (10%)
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]

In [9]:
# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data

    # Create batch of 4 randomly generated integers within length of text data
    ix = torch.randint(len(data) - block_size, (batch_size,))

    # Create batch of 4 blocks (each of block size 8) from randomly selected integers for parallel processing
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])

    #move x, y parameters to GPU if available or on CPU
    x, y = x.to(device), y.to(device)
    return x, y

In [10]:
@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        
        # Evaluation is run for eval_iters iteration and computes losses for each iteration
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            
            
            # Loss for each batch is averaged over GPU tensors and stored as scalar in losses tensor at k index
            #losses[k] = loss.item()
            losses[k] = loss.mean() #Modified for GPU parallelism
        
        
        
        # The mean of losses across the evaluation iterations is stored for current data split
        #out[split] = losses.mean()
        out[split] = losses.mean().item() # Modified for Parallel GPU's to convert tensor to scalar
    
    model.train() # Model is switched back to train mode for further training
    return out # The function returns the out dictionary, which contains the average losses for the training and validation sets.

In [11]:
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False) # Key represents the information each element of sequence holds
        self.query = nn.Linear(n_embd, head_size, bias=False) # Defines the query or search the model is asking for
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)

        # masking out the attention scores in the upper triangular portion of the matrix, the model is forced to only focus on the elements that appear before the current position in the sequence.
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        # the model gives more weight to elements that have higher attention scores, meaning they are more relevant to the current query.
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

In [12]:
# Define RMS Normalization Block
class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x):
        rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True))
        return self.scale * x / (rms + self.eps)

In [13]:
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(), ## GELU activation function
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size) #Initialization of self-multihead attention module defined earlier
        self.ffwd = FeedFoward(n_embd) # Initialization of feedforward module defined earlier
        #self.ln1 = nn.LayerNorm(n_embd) #Initialize first normalization layer
        #self.ln2 = nn.LayerNorm(n_embd)  #Initialize second normalization layer
        self.ln1 = RMSNorm(n_embd) #Initialize first normalization layer with RMS Norm
        self.ln2 = RMSNorm(n_embd)  #Initialize second normalization layer with RMS Norm

    def forward(self, x):
        x = x + self.sa(self.ln1(x)) #Input block (x) undergoes normalization followed by self attention
        x = x + self.ffwd(self.ln2(x)) # Output of previous block goes through normalization followed by feedforward layer followed
        return x

In [14]:
# Simple bigram model
class BigramLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd) # Word Embedding
        self.position_embedding_table = nn.Embedding(block_size, n_embd) # Positional Embedding
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)]) # Creates stack of n_layer transformer blocks using sequential container
        self.ln_f = nn.LayerNorm(n_embd) # final layer normalization
        self.lm_head = nn.Linear(n_embd, vocab_size) # Map normalized embedding outputs to logits which are used for loss calculation

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C) Token Embedding
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C) Positional Embedding
        x = tok_emb + pos_emb # (B,T,C) Embeddings added to create input sequence
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

In [15]:
# hyperparameters
batch_size = 64 # how many independent sequences will we process in parallel?
block_size = 512 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 200
learning_rate = 3e-4
eval_iters = 200
n_embd = 384
n_head = 8
n_layer = 6
dropout = 0.2

max_norm = 1.0 # maximum norm of the gradients. If the norm of the gradient exceeds this value, it will be scaled down to fit the norm limit.
lr_decay = 0.2 # Learning rate decays to lr_decay*learning rate every 1000 steps

#Optimizer parameters
wt_decay = 0.1 # regularization technique that penalizes large weights during training.
beta1 = 0.9 #  decay rate of the first moment estimates, the exponentially weighted average of past gradients
beta2 = 0.95 # decay rate of the second moment estimates, which is the exponentially weighted average of past squared gradients
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Use TPU if available, otherwise use CPU
# device = xm.xla_device() if xm.xla_device() is not None else 'cpu'


In [17]:
model = BigramLanguageModel()
model = nn.DataParallel(model)
m = model.to(device)

scaler = GradScaler() #  helps in adjusting the gradients to account for the reduced precision:

# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate, weight_decay=wt_decay, betas=(beta1, beta2))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2000, gamma=lr_decay)

# Start timing
stat_time = time.time()

for iter in range(max_iters):

    optimizer.zero_grad(set_to_none=True)  # Zero gradients at the start

    xb, yb = get_batch('train')  # Get a batch of data
    
    # Computations within this block will be done in float16 wherever possible using autocast().
    with autocast():
        logits, loss = model(xb, yb)
        loss = loss.mean()  # Ensure loss is a scalar (important for DataParallel), averaging over multiple GPU's

    # Backward pass with scaled loss
    scaler.scale(loss).backward()

    # Unscale gradients and perform gradient clipping
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)

    # Step with scaler (performs optimizer step and updates scaler)
    scaler.step(optimizer)
    scaler.update()

    # Update learning rate
    scheduler.step()

    # Periodically evaluate loss
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

# Stop timing
end_time = time.time()

# Calculate and print training duration
training_time = end_time - start_time
print(f"Training completed in: {training_time:.2f} seconds")

# Generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.module.generate(context, max_new_tokens=2000)[0].tolist()))

10.882625 M parameters




step 0: train loss 3.7109, val loss 3.7242
step 200: train loss 2.4393, val loss 2.4739
step 400: train loss 2.2467, val loss 2.2952
step 600: train loss 2.0395, val loss 2.1214
step 800: train loss 1.8526, val loss 1.9870
step 1000: train loss 1.7098, val loss 1.8746
step 1200: train loss 1.6680, val loss 1.8449
step 1400: train loss 1.6441, val loss 1.8262
step 1600: train loss 1.6173, val loss 1.8003
step 1800: train loss 1.5923, val loss 1.7838
step 2000: train loss 1.5700, val loss 1.7647
step 2200: train loss 1.5639, val loss 1.7582
step 2400: train loss 1.5576, val loss 1.7530
step 2600: train loss 1.5543, val loss 1.7508
step 2800: train loss 1.5491, val loss 1.7497
step 3000: train loss 1.5457, val loss 1.7418
step 3200: train loss 1.5423, val loss 1.7399
step 3400: train loss 1.5423, val loss 1.7393
step 3600: train loss 1.5411, val loss 1.7388
step 3800: train loss 1.5394, val loss 1.7366
step 4000: train loss 1.5379, val loss 1.7397
step 4200: train loss 1.5396, val loss 1.

NameError: name 'start_time' is not defined

In [19]:
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.module.generate(context, max_new_tokens=2000)[0].tolist()))


CAMINIUS:
But the spersed:
And heaven he call to thy lover-dangue
Evooun steembren stoodd I my do awnin myou
In yet in the love frantured. Then let may prows: for
To father, noble and fathorsey consice what too I,
were though see the woulds of younsel moner.
Saituout, our?

HESTINBIO:
Sir.
DYew, my grave Mores aster.

DUKE VINCENV:
I would rece, main:
Gody me kepboke;
For 't a upon alst; you suck yof you sin,
Warick.

Nurse:
What have is her, if her, my be is i' so?

KING RICHARD II:
A trie Lorcase couble, Plift I kneep your comb;
Thand hims that thous natle seeech ans an done
Of this the surnnes itservisent.
No oupase you my Closingmans me
Here therer comere; I not at your times,
Yoou prown the crembum for heldsency'd, tere
To yet tland my atals, neears, fir that sand Lucender,
Fors: what, he havirt faults, wish, all pusiltss
'Take cretmain witde fir cripevibles.
That ex how he it; and not 'ow, this dong? wet that, not cusile, my by and:
Bensing you clasburs day amain?
Fathw heread h