# <center> nanoGPT: Exploring Applications of Language Models </center> <a class='tocSkip'>
MSDS 2023 Term 4 ML3 | **Maria Loraine R. Menorca**
    
**Learning Goals:**

1. What is a language model?
2. Describe the dataset being used. What preprocessing steps need to be done in preparation for training the model?
3. What is self-attention? 
4. Compare and contrast the concept of attention, self-attention, and cross-attention.
5. What is multi-head attention? 
6. What is a transformer? 
7. Describe the other components of a transformer: residual connections, layer normalization, and dropout. What purpose do each of them serve?
8. Tune the nanoGPT model (e.g., add more epochs, adjust batch size, learning rate, embedding dimensions, dropout, other hyperparameters, etc.). Compare a set of generated samples from before and after your tuning.
9. As you experiment with tuning, describe your thought process. Which hyperparameters did you decide to adjust? What were your hypotheses for how it would affect the model?


![Banner](language_models.png)

# I. Introduction

Language models (LMs) are statistical models that estimate probability distributions over linguistic units, such as words or sentences [1]. There are two main categories of language models: count-based models, also known as N-gram LMs, and continuous-space models such as neural LMs.
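To make the count-based category concrete, here is a minimal sketch of a character-level bigram model built purely from counts. The toy corpus and the helper names `train_bigram_counts` and `next_char_probs` are illustrative assumptions, not part of the notebook.

```python
from collections import Counter, defaultdict

def train_bigram_counts(text):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts

def next_char_probs(counts, prev):
    """Maximum-likelihood estimate of P(next | prev) from raw counts."""
    total = sum(counts[prev].values())
    return {ch: c / total for ch, c in counts[prev].items()}

toy_text = "abababac"  # hypothetical toy corpus, for illustration only
counts = train_bigram_counts(toy_text)
probs = next_char_probs(counts, "a")
print(probs)  # after "a", "b" was seen 3 times and "c" once
```

A neural LM replaces the count table with learned parameters, but it estimates the same kind of conditional distribution.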

Transformers and attention mechanisms [2] are architectural components that a language model can use to capture context and relationships between words or positions in a sequence. Common applications of LMs include text completion and generation, machine translation, sentiment analysis, and text summarization.

## Attention

Attention mechanisms [3] in neural networks selectively focus on parts of the input or output when processing sequential data. An attention function involves three components, all represented as vectors: a query, a set of key-value pairs, and an output. The weights are determined by the similarity between the query and each key, and the output is the weighted sum of the values.
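The query/key/value description can be sketched for a single query in plain Python. This is a minimal illustration that assumes dot-product similarity and a softmax over the scores; the vectors and the helper names `softmax` and `attend` are made up for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Dot-product attention for one query over a set of key-value pairs."""
    # similarity between the query and every key
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    # output = weighted sum of the value vectors
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# hypothetical toy vectors
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attend(query, keys, values)
print(weights)
print(output)
```

The key most aligned with the query receives the largest weight, so its value dominates the output.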


Self-attention, cross-attention, and multi-head attention are variants of attention used to improve an LM's performance.

***Self-attention*** [2, 4] is the ability of a model to refer to the same sequence and capture relationships between different positions within that sequence. That is, all of the keys, values, and queries come from the same place.

On the other hand, ***cross-attention*** [5] allows a model to learn relationships between elements of different sequences. The query vectors are derived from one sequence, and the key-value vectors come from another sequence.

Finally, ***multi-head attention*** [2] is a stack of multiple attention heads running in parallel to attend to different parts of the input sequence. Each attention head independently calculates its own weights and output which are then combined and transformed to produce a final output within the layer.

## Transformers

Transformers [6, 7] are a neural network architecture well suited to processing sequential data. They are designed to encode the context of an input sequence into a vector representation and subsequently decode this information into another sequence. Their key components include residual connections, layer normalization, and dropout.

***Residual connections*** [8] facilitate the flow of information through the layers of the network by using skip connections. They help mitigate the vanishing gradient problem by providing shortcut paths that let gradients flow directly from later layers back to earlier ones. The vanishing gradient problem occurs when gradients become extremely small during backpropagation, preventing effective learning in deep neural networks.
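The effect of the skip path on gradients can be seen in a deliberately simplified setting. The sketch below assumes each layer is a scalar linear map `y = weight * x`, which is a toy stand-in rather than a real network layer: without the skip path the gradient is a product of small factors, while with it each factor becomes `1 + weight`.

```python
def plain_gradient(weight, depth):
    """d(output)/d(input) for `depth` stacked layers y = weight * x."""
    grad = 1.0
    for _ in range(depth):
        grad *= weight
    return grad

def residual_gradient(weight, depth):
    """Same stack, but each layer computes y = x + weight * x."""
    grad = 1.0
    for _ in range(depth):
        grad *= 1.0 + weight  # the "1" is the contribution of the skip path
    return grad

print(plain_gradient(0.1, 20))     # shrinks toward zero as depth grows
print(residual_gradient(0.1, 20))  # stays well away from zero
```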

***Layer normalization*** [9] serves the purpose of normalizing the outputs of each layer in the network by directly computing the normalization statistics from the summed inputs to the neurons within a hidden layer. Doing so avoids the introduction of additional dependencies or correlations between training cases and allows for more flexibility in batch processing.
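As a minimal sketch of this idea, the function below normalizes one example using only that example's own features, so the result does not depend on what else is in the batch. It assumes the biased (population) variance and omits the learnable gain and bias for brevity; the name `layer_norm` and the toy batch are illustrative.

```python
import math

def layer_norm(features, eps=1e-5):
    """Normalize one example's features to zero mean and unit variance."""
    mean = sum(features) / len(features)
    var = sum((f - mean) ** 2 for f in features) / len(features)
    return [(f - mean) / math.sqrt(var + eps) for f in features]

# two hypothetical examples on very different scales
batch = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
normalized = [layer_norm(row) for row in batch]
for row in normalized:
    print([round(v, 3) for v in row])
```

Both rows end up on almost the same scale because the statistics are computed per example rather than across the batch.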

***Dropout*** [10] is a popular regularization technique for fully-connected neural networks. It aims to prevent overfitting by randomly deactivating or "dropping out" a fraction of the neural units during training. By forcing the network to rely on the remaining active units, Dropout encourages the model to learn more robust and generalized patterns using the available connections and parameters.
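A minimal sketch of dropout on a list of activations follows. It assumes the "inverted" variant, in which surviving units are rescaled by `1 / (1 - p)` during training so that nothing needs to change at evaluation time; this is also how `torch.nn.Dropout` behaves. The function name and the toy input are illustrative.

```python
import random

def dropout(activations, p, training=True, rng=random):
    """Zero each activation with probability p; rescale the survivors."""
    if not training or p == 0.0:
        return list(activations)  # dropout is a no-op at evaluation time
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)       # fixed seed so the run is repeatable
activations = [1.0] * 10     # hypothetical layer output
dropped = dropout(activations, p=0.5, rng=rng)
print(dropped)               # a mix of 0.0 (dropped) and 2.0 (kept and rescaled)
```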

# II. Application

In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1337)
print(f"Device used: {'CUDA' if torch.cuda.is_available() else 'CPU'}")

Device used: CUDA


## Dataset

The project uses a sample subset of the [`Tiny Shakespeare`](https://cs.stanford.edu/people/karpathy/char-rnn/) dataset consisting of around **1.1M** characters and **65** unique characters extracted from the literary works of William Shakespeare.


In [2]:
# # We always start with a dataset to train on. Let's download the tiny shakespeare dataset
# !wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

In [3]:
# read it in to inspect it
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [4]:
print("length of dataset in characters: ", len(text))

length of dataset in characters:  1115394


In [5]:
# let's look at the first 1000 characters
print(text[:1000])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.

All:
We know't, we know't.

First Citizen:
Let us kill him, and we'll have corn at our own price.
Is't a verdict?

All:
No more talking on't; let it be done: away, away!

Second Citizen:
One word, good citizens.

First Citizen:
We are accounted poor citizens, the patricians good.
What authority surfeits on would relieve us: if they
would yield us but the superfluity, while it were
wholesome, we might guess they relieved us humanely;
but they think we are too dear: the leanness that
afflicts us, the object of our misery, is as an
inventory to particularise their abundance; our
sufferance is a gain to them Let us revenge this with
our pikes, ere we become rakes: for the gods know I
speak this in hunger for bread, not in thirst for revenge.



In [6]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)


 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


## Preprocessing steps

### Mapping

As a preliminary step, a mapping between characters and integer indices is created for each unique character in the text. An encoder (string to list of integers) and a decoder (list of integers back to string) are also defined for convenient conversion between textual and numerical data.

1. `stoi = { ch:i for i,ch in enumerate(chars) }` : This line creates a dictionary called stoi (string-to-index) that maps each character in the chars list to its corresponding index or integer value. The enumerate() function generates pairs of indices and characters, which are then used to create the dictionary.

2. `itos = { i:ch for i,ch in enumerate(chars) }` : This line creates a dictionary called itos (index-to-string) that maps each index to its corresponding character from the chars list.

3. `encode = lambda s: [stoi[c] for c in s]` : This line defines an encoder function named encode. It takes a string s as input and converts it into a list of integers by iterating over each character c in the string and retrieving its corresponding index from the stoi dictionary.

4. `decode = lambda l: ''.join([itos[i] for i in l])` : This line defines a decoder function named decode. It takes a list of integers l as input and converts it back into a string by iterating over each integer i in the list and retrieving its corresponding character from the itos dictionary. The retrieved characters are then joined together using the join() function to form the decoded string.

In [7]:
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

print(encode("hii there"))
print(decode(encode("hii there")))

[46, 47, 47, 1, 58, 46, 43, 56, 43]
hii there


### Text Encoding

The text dataset was then encoded as a PyTorch tensor allowing for efficient storage, manipulation, and processing for deep learning tasks.

`data = torch.tensor(encode(text), dtype=torch.long)` : This line encodes the `text` string using the previously defined `encode` function, which converts the text into a list of integers, one per character. The resulting list is then converted into a PyTorch tensor using `torch.tensor()`. The `dtype=torch.long` argument makes the tensor a 64-bit integer tensor (`torch.int64`), the type expected for indices into an embedding table.

In [8]:
# let's now encode the entire text dataset and store it into a torch.Tensor
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000]) # the 1000 characters we looked at earlier will look like this to the GPT

torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

### Splitting & Batching

The encoded data was then split into training and validation sets at index `n`, set to 90% of the dataset length: the first 90% is used for training and the remainder for validation. A block size was also defined to determine the length of the subsequences used for training or processing.

In [9]:
# Let's now split up the data into train and validation sets
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# Define the block size
block_size = 8
train_data[:block_size+1]

tensor([18, 47, 56, 57, 58,  1, 15, 47, 58])

In [10]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target: {target}")

when input is tensor([18]) the target: 47
when input is tensor([18, 47]) the target: 56
when input is tensor([18, 47, 56]) the target: 57
when input is tensor([18, 47, 56, 57]) the target: 58
when input is tensor([18, 47, 56, 57, 58]) the target: 1
when input is tensor([18, 47, 56, 57, 58,  1]) the target: 15
when input is tensor([18, 47, 56, 57, 58,  1, 15]) the target: 47
when input is tensor([18, 47, 56, 57, 58,  1, 15, 47]) the target: 58


In [11]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----')

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[24, 43, 58,  5, 57,  1, 46, 43],
        [44, 53, 56,  1, 58, 46, 39, 58],
        [52, 58,  1, 58, 46, 39, 58,  1],
        [25, 17, 27, 10,  0, 21,  1, 54]])
targets:
torch.Size([4, 8])
tensor([[43, 58,  5, 57,  1, 46, 43, 39],
        [53, 56,  1, 58, 46, 39, 58,  1],
        [58,  1, 58, 46, 39, 58,  1, 46],
        [17, 27, 10,  0, 21,  1, 54, 39]])
----
when input is [24] the target: 43
when input is [24, 43] the target: 58
when input is [24, 43, 58] the target: 5
when input is [24, 43, 58, 5] the target: 57
when input is [24, 43, 58, 5, 57] the target: 1
when input is [24, 43, 58, 5, 57, 1] the target: 46
when input is [24, 43, 58, 5, 57, 1, 46] the target: 43
when input is [24, 43, 58, 5, 57, 1, 46, 43] the target: 39
when input is [44] the target: 53
when input is [44, 53] the target: 56
when input is [44, 53, 56] the target: 1
when input is [44, 53, 56, 1] the target: 58
when input is [44, 53, 56, 1, 58] the target: 46
when input is [44, 53

## Models

### Bigram

A bigram language model was defined that supports both training (through the computed loss) and generation of new sequences by sampling from the predicted distribution.

1. The `forward` method takes an input sequence `idx` and optional `targets` for training. It computes the logits (unnormalized log-probabilities) for the next token prediction based on the input sequence. If `targets` are provided, it calculates the cross-entropy loss between the predicted logits and the actual targets.

2. The `generate` method generates new tokens given an input sequence `idx` and a maximum number of tokens to generate `max_new_tokens`. It repeatedly predicts the next token by sampling from the softmax distribution over the logits and appends the sampled token to the running sequence. The generated sequence is returned as the output.

In [12]:
class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)
        
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss
    
    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

In [13]:
m = BigramLanguageModel(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))

torch.Size([32, 65])
tensor(5.0364, grad_fn=<NllLossBackward0>)

l-QYjt'CL?jLDuQcLzy'RIo;'KdhpV
vLixa,nswYZwLEPS'ptIZqOZJ$CA$zy-QTkeMk x.gQSFCLg!iW3fO!3DGXAqTsq3pdgq


For this case, `AdamW` was chosen as the optimization algorithm, which is an extension of the Adam optimizer with weight decay regularization.
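To show what "Adam with weight decay" means in practice, here is a scalar sketch of a single AdamW update in plain Python. It is an illustration of the update rule with decoupled weight decay, not the notebook's code; the function name `adamw_step` is hypothetical, and the default values are assumptions chosen to mirror common `torch.optim.AdamW` defaults.

```python
import math

def adamw_step(param, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single scalar parameter at step t (t >= 1)."""
    beta1, beta2 = betas
    # decoupled weight decay: shrink the parameter directly,
    # instead of adding an L2 term to the gradient
    param = param - lr * weight_decay * param
    # running averages of the gradient and the squared gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # bias correction for the zero-initialized averages
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

param, m, v = adamw_step(param=1.0, grad=1.0, m=0.0, v=0.0, t=1)
print(param, m, v)
```

On the first step the bias-corrected ratio is close to 1, so the parameter moves by roughly `lr` plus the small decay term.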

In [14]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

The neural network model was then trained using mini-batch gradient descent.

In each step, a batch of training data (`xb` and `yb`) is sampled using the `get_batch` function. The batch size is set to 32, meaning that 32 input sequences and their corresponding targets are processed together.

The model `m` is then used to compute the logits (unnormalized log-probabilities) and the associated loss between the predicted logits and the actual targets. This is done by calling `m(xb, yb)`, which returns the logits and the loss.

Before computing the gradients, the optimizer's `zero_grad` function is called to reset the gradients of the model parameters to zero. This is necessary because PyTorch accumulates gradients by default for each parameter, so we need to clear the gradients from the previous iteration.

The loss is then backpropagated through the model using `loss.backward()`, which computes the gradients of the loss with respect to the model parameters.

Finally, the optimizer's `step` function is called to update the model parameters based on the computed gradients and the chosen optimization algorithm (AdamW in this case). This step performs the gradient descent update, adjusting the model parameters to minimize the loss.

After completing all the steps, the final loss value (`loss.item()`) is printed, representing the loss achieved after the last training iteration.
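A useful reference point when reading the loss values in this section is the cross-entropy of a model that guesses uniformly over the vocabulary, which equals ln(vocab_size). The sketch below computes it in plain Python for the 65-character vocabulary found earlier; the helper name `cross_entropy` is illustrative.

```python
import math

def cross_entropy(probs, target):
    """Negative log-probability assigned to the correct next character."""
    return -math.log(probs[target])

vocab_size = 65  # number of unique characters reported for this dataset
uniform = [1.0 / vocab_size] * vocab_size
uniform_loss = cross_entropy(uniform, target=0)
print(f"cross-entropy of a uniform guess: {uniform_loss:.4f}")
```

The untrained bigram model above scored 5.0364, which is higher than this baseline because randomly initialized logits are not uniform. The first training steps therefore mainly bring the loss down toward the uniform level.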

In [15]:
batch_size = 32
for steps in range(100): # increase number of steps for good results... 
    
    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())

4.574859619140625


In [16]:
# Sample 
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))


lROZtgqVuoCbMfq!H.ukmspuW,OM!,OFfmMyg;.JknOhmyQGI&fgPhLyuWYdGXaaIo&UZ:o;tpOMdFG!vHQImqSq!hON
O?Zz;?enfCAY;X&.Omq$RQXN3McnaZ
tFGXAY'SYQXcPcnXg,Optp3?!AQwaiuglfjtJM-U3f:H:HY,:OAOECWS;mGGJsn3GJsuM.iajbt$V-GGCg .uPLghOnoPGU.SXJWD?sXLjGJYjtNA,-'N
Y;oqZLXATajooKrH
lXV-GCF&kHK&L,O?MqUigko;?zxsCCaRfbuzxn y3Vn:OyxWtXaG?erT'3KFGBexFZ?e:H!d:H.urm&TT iq3YD
H PyjNEFpkEPyDuFww.!PW,RPzN
kWDdnz&xGtJe!mtREFEzyd.BeXhydOVD3KKpkZYZE,..FvV,IRub-Yci&jU;aX;H&.GMHxJIIm Cg;tpZEjgl.Y$Oerz?qarrKl;oCCyGX;t:Ehop-ACb.zRh3ptq


### Self-attention

Self-attention was performed using the `Head` module by projecting the input tensor, calculating attention scores, applying a triangular mask, computing attention weights, and aggregating the values.

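In the forward pass, the input tensor `x` is projected to obtain key (`k`), query (`q`), and value (`v`) tensors. Attention scores are then computed as the dot product of the query and key tensors, scaled by `C**-0.5`, the inverse square root of the input embedding dimension `C`. Scaled dot-product attention is usually defined with the inverse square root of the key dimension instead, which here would be `head_size`.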
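The reason for scaling the scores can be checked numerically. The dot product of two vectors with unit-variance entries has a variance of roughly the vector length, and large scores push the softmax toward a one-hot distribution. The plain-Python sketch below estimates the score variance before and after scaling by `head_size**-0.5`; the helper names and the number of trials are illustrative assumptions.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def variance(samples):
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)

def score_variances(head_size, trials=5000, seed=0):
    """Estimate the variance of q.k before and after scaling."""
    rng = random.Random(seed)
    raw = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(head_size)]
        k = [rng.gauss(0.0, 1.0) for _ in range(head_size)]
        raw.append(dot(q, k))
    scaled = [s * head_size ** -0.5 for s in raw]
    return variance(raw), variance(scaled)

raw_var, scaled_var = score_variances(head_size=16)
print(raw_var)     # close to head_size
print(scaled_var)  # close to 1
```

The cells further below that compare `k.var()`, `q.var()`, and `wei.var()`, and that multiply softmax inputs by 8, illustrate the same point with tensors.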

The triangular mask is applied to the attention scores to prevent attending to future positions in the sequence.
The attention scores are passed through a softmax function to obtain attention weights (`wei`), which are then subjected to dropout.
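The masking and softmax steps can be sketched without a tensor library. Filling future positions with `-inf` before the softmax is equivalent to leaving them out of the normalization, because `exp(-inf)` is zero. The helper below takes that shortcut; the name `causal_softmax` and the all-zero scores are illustrative assumptions.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position t may only attend to positions <= t."""
    weights = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]           # positions at or before t
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        hidden = [0.0] * (len(row) - t - 1)  # future positions get zero weight
        weights.append([e / total for e in exps] + hidden)
    return weights

T = 4
scores = [[0.0] * T for _ in range(T)]  # equal affinities everywhere
weights = causal_softmax(scores)
for row in weights:
    print(row)
```

With equal scores, each position simply averages over itself and everything before it, which gives the lower-triangular pattern seen in `wei[0]`.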

Finally, the values (`v`) are weighted by the attention weights (`wei`) and aggregated to produce the output tensor (`out`).

In [17]:
# version 4: self-attention!
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei =  q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = F.softmax(wei, dim=-1)

v = value(x)
out = wei @ v
#out = wei @ x

out.shape

torch.Size([4, 8, 16])

In [18]:
wei[0]

tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.9824, 0.0176, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0917, 0.6751, 0.2332, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.1937, 0.1384, 0.4777, 0.1902, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0801, 0.2292, 0.0693, 0.1070, 0.5143, 0.0000, 0.0000, 0.0000],
        [0.0671, 0.4667, 0.1121, 0.0808, 0.2024, 0.0709, 0.0000, 0.0000],
        [0.0193, 0.1579, 0.0204, 0.1551, 0.0546, 0.0450, 0.5476, 0.0000],
        [0.0711, 0.4844, 0.2670, 0.0235, 0.0388, 0.0146, 0.0345, 0.0661]],
       grad_fn=<SelectBackward0>)

In [19]:
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5

In [20]:
k.var()

tensor(0.9051)

In [21]:
q.var()

tensor(0.9981)

In [22]:
wei.var()

tensor(0.9188)

In [23]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)

tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])

In [24]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1) # gets too peaky, converges to one-hot

tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])

In [25]:
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B, T, head_size)
        q = self.query(x) # (B, T, head_size)
        # compute attention scores ("affinities"); note that C here is n_embd, not head_size
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B, T, head_size)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

### Multi-headed Attention

Multiple self-attention heads are run in parallel by the `MultiHeadAttention` module, while the feed-forward module (named `FeedFoward` in the code) applies a non-linear transformation to its input tensor.

`MultiHeadAttention` : This module takes an input tensor `x` and performs self-attention computations using multiple instances of the `Head` module (defined in the previous subsection). The outputs of the individual heads are concatenated and passed through a linear projection layer and a dropout layer.

`FeedFoward` : This module represents a simple feed-forward network. It consists of two linear layers with a ReLU activation function between them; the first expands the features to `4 * n_embd` and the second projects them back to `n_embd`. A dropout layer is applied after the second linear layer.

In [26]:
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out
    
class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

### Transformer

The `Block` module encapsulates the communication and computation steps of a transformer block, where self-attention is performed followed by a feed-forward network transformation.
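To get a feel for the size of one block, the sketch below counts its learnable parameters from the layer shapes used in this notebook: bias-free key, query, and value projections per head, an output projection, a two-layer feed-forward network with a 4x expansion, and two layer norms with a scale and shift each. The function name is hypothetical, and the example values `n_embd=64`, `n_head=4` are the defaults of the `nanogpt` function defined later.

```python
def block_parameter_count(n_embd, n_head):
    """Count learnable parameters in one transformer block."""
    head_size = n_embd // n_head
    # key, query, value: three bias-free linear layers per head
    attention = n_head * 3 * n_embd * head_size
    # output projection after concatenating the heads (weights + bias)
    projection = n_embd * n_embd + n_embd
    # feed-forward: n_embd -> 4*n_embd -> n_embd, each with a bias
    feed_forward = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    # two layer norms, each with a scale and a shift vector
    layer_norms = 2 * (2 * n_embd)
    return attention + projection + feed_forward + layer_norms

print(block_parameter_count(n_embd=64, n_head=4))  # 49792
```

About two thirds of the parameters sit in the feed-forward network, so changing `n_embd` affects model size roughly quadratically.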

In [27]:
class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

### LayerNorm

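The `LayerNorm1d` module performs layer normalization: it normalizes each input vector along dimension 1 (the feature dimension, not the batch dimension), then applies a scale `gamma` and a shift `beta`, producing an output with the same shape as the input. In this minimal version, `gamma` and `beta` are plain tensors initialized to ones and zeros, so they leave the normalized values unchanged unless they are trained.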

In [28]:
class LayerNorm1d: # (used to be BatchNorm1d)
  
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)

    def __call__(self, x):
        # calculate the forward pass
        xmean = x.mean(1, keepdim=True) # per-example mean over the feature dimension
        xvar = x.var(1, keepdim=True) # per-example variance over the feature dimension
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
        self.out = self.gamma * xhat + self.beta
        return self.out

    def parameters(self):
        return [self.gamma, self.beta]

torch.manual_seed(1337)
module = LayerNorm1d(100)
x = torch.randn(32, 100) # batch size 32 of 100-dimensional vectors
x = module(x)
x.shape

torch.Size([32, 100])

In [29]:
x[:,0].mean(), x[:,0].std() # mean,std of one feature across all batch inputs

(tensor(0.1469), tensor(0.8803))

## Complete Pipeline

In [30]:
def nanogpt(text, batch_size=16, block_size=32, max_iters=5000,
            eval_interval=100, learning_rate=1e-3, device='cuda',
            eval_iters=200, n_embd=64, n_head=4, n_layer=4, dropout=0.0):
    """
    This function trains a simplified version of the GPT language model called Nano GPT.
    It takes a text corpus as input and trains a language model that can generate new text
    similar to the input text.

    Parameters:
    -------------------
    text (str):
        The input text corpus used for training the language model.
    batch_size (int):
        The batch size for training data. Default is 16.
    block_size (int):
        The sequence length or context size for each training sample. Default is 32.
    max_iters (int):
        The maximum number of training iterations. Default is 5000.
    eval_interval (int):
        The interval at which to evaluate the loss on train and validation sets. Default is 100.
    learning_rate (float):
        The learning rate for the optimizer. Default is 1e-3.
    device (str):
        The device to run the training on. Default is 'cuda'.
    eval_iters (int):
        The number of iterations to estimate the loss on train and validation sets. Default is 200.
    n_embd (int):
        The embedding dimension of the language model. Default is 64.
    n_head (int):
        The number of attention heads in the multi-head attention mechanism. Default is 4.
    n_layer (int):
        The number of transformer blocks in the language model. Default is 4.
    dropout (float):
        The dropout probability for regularization. Default is 0.0.
    """
    torch.manual_seed(1337)

    # here are all the unique characters that occur in this text
    chars = sorted(list(set(text)))
    vocab_size = len(chars)
    # create a mapping from characters to integers
    stoi = { ch:i for i,ch in enumerate(chars) }
    itos = { i:ch for i,ch in enumerate(chars) }
    encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
    decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

    # Train and test splits
    data = torch.tensor(encode(text), dtype=torch.long)
    n = int(0.9*len(data)) # first 90% will be train, rest val
    train_data = data[:n]
    val_data = data[n:]

    # data loading
    def get_batch(split):
        # generate a small batch of data of inputs x and targets y
        data = train_data if split == 'train' else val_data
        ix = torch.randint(len(data) - block_size, (batch_size,))
        x = torch.stack([data[i:i+block_size] for i in ix])
        y = torch.stack([data[i+1:i+block_size+1] for i in ix])
        x, y = x.to(device), y.to(device)
        return x, y

    @torch.no_grad()
    def estimate_loss():
        out = {}
        model.eval()
        for split in ['train', 'val']:
            losses = torch.zeros(eval_iters)
            for k in range(eval_iters):
                X, Y = get_batch(split)
                logits, loss = model(X, Y)
                losses[k] = loss.item()
            out[split] = losses.mean()
        model.train()
        return out

    class Head(nn.Module):
        """ one head of self-attention """

        def __init__(self, head_size):
            super().__init__()
            self.key = nn.Linear(n_embd, head_size, bias=False)
            self.query = nn.Linear(n_embd, head_size, bias=False)
            self.value = nn.Linear(n_embd, head_size, bias=False)
            self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

            self.dropout = nn.Dropout(dropout)

        def forward(self, x):
            B,T,C = x.shape
            k = self.key(x)   # (B, T, head_size)
            q = self.query(x) # (B, T, head_size)
            # compute attention scores ("affinities"); note that C here is n_embd, not head_size
            wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
            wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
            wei = F.softmax(wei, dim=-1) # (B, T, T)
            wei = self.dropout(wei)
            # perform the weighted aggregation of the values
            v = self.value(x) # (B, T, head_size)
            out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
            return out

    class MultiHeadAttention(nn.Module):
        """ multiple heads of self-attention in parallel """

        def __init__(self, num_heads, head_size):
            super().__init__()
            self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
            self.proj = nn.Linear(n_embd, n_embd)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x):
            out = torch.cat([h(x) for h in self.heads], dim=-1)
            out = self.dropout(self.proj(out))
            return out

    class FeedFoward(nn.Module):
        """ a simple linear layer followed by a non-linearity """

        def __init__(self, n_embd):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_embd, 4 * n_embd),
                nn.ReLU(),
                nn.Linear(4 * n_embd, n_embd),
                nn.Dropout(dropout),
            )

        def forward(self, x):
            return self.net(x)

    class Block(nn.Module):
        """ Transformer block: communication followed by computation """

        def __init__(self, n_embd, n_head):
            # n_embd: embedding dimension, n_head: the number of heads we'd like
            super().__init__()
            head_size = n_embd // n_head
            self.sa = MultiHeadAttention(n_head, head_size)
            self.ffwd = FeedFoward(n_embd)
            self.ln1 = nn.LayerNorm(n_embd)
            self.ln2 = nn.LayerNorm(n_embd)

        def forward(self, x):
            x = x + self.sa(self.ln1(x))
            x = x + self.ffwd(self.ln2(x))
            return x

    # decoder-only transformer language model (the class keeps its original "bigram" name)
    class BigramLanguageModel(nn.Module):

        def __init__(self):
            super().__init__()
            # lookup tables that map token ids and positions to n_embd-dimensional vectors
            self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
            self.position_embedding_table = nn.Embedding(block_size, n_embd)
            self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
            self.ln_f = nn.LayerNorm(n_embd) # final layer norm
            self.lm_head = nn.Linear(n_embd, vocab_size)

        def forward(self, idx, targets=None):
            B, T = idx.shape

            # idx and targets are both (B,T) tensor of integers
            tok_emb = self.token_embedding_table(idx) # (B,T,C)
            pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
            x = tok_emb + pos_emb # (B,T,C)
            x = self.blocks(x) # (B,T,C)
            x = self.ln_f(x) # (B,T,C)
            logits = self.lm_head(x) # (B,T,vocab_size)

            if targets is None:
                loss = None
            else:
                B, T, C = logits.shape
                logits = logits.view(B*T, C)
                targets = targets.view(B*T)
                loss = F.cross_entropy(logits, targets)

            return logits, loss

        def generate(self, idx, max_new_tokens):
            # idx is (B, T) array of indices in the current context
            for _ in range(max_new_tokens):
                # crop idx to the last block_size tokens
                idx_cond = idx[:, -block_size:]
                # get the predictions
                logits, loss = self(idx_cond)
                # focus only on the last time step
                logits = logits[:, -1, :] # becomes (B, C)
                # apply softmax to get probabilities
                probs = F.softmax(logits, dim=-1) # (B, C)
                # sample from the distribution
                idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
                # append sampled index to the running sequence
                idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
            return idx

    model = BigramLanguageModel()
    m = model.to(device)
    # print the number of parameters in the model
    print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

    # create a PyTorch optimizer
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

    for iter in range(max_iters):

        # every once in a while evaluate the loss on train and val sets
        if iter % eval_interval == 0 or iter == max_iters - 1:
            losses = estimate_loss()
            print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

        # sample a batch of data
        xb, yb = get_batch('train')

        # evaluate the loss
        logits, loss = model(xb, yb)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

    # generate from the model
    context = torch.zeros((1, 1), dtype=torch.long, device=device)
    print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))
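
The masking step in `Head.forward` (fill the upper triangle with `-inf`, then take a row-wise softmax) can be checked in isolation. The following is a minimal, dependency-free sketch rather than the notebook's PyTorch code; the function name `causal_attention_weights` and the example scores are assumptions made for illustration.

```python
import math

def causal_attention_weights(scores):
    """Apply a causal mask and a row-wise softmax to a T x T score matrix."""
    T = len(scores)
    weights = []
    for t in range(T):
        # position t may only attend to positions 0..t (the lower triangle)
        row = [scores[t][s] if s <= t else float("-inf") for s in range(T)]
        row_max = max(row)  # finite, because the diagonal is never masked
        exps = [math.exp(v - row_max) for v in row]  # masked entries become 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# hypothetical affinities for a sequence of three tokens
scores = [[0.0, 1.0, 2.0],
          [1.0, 0.0, 3.0],
          [2.0, 1.0, 0.0]]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row sums to 1 and puts zero weight on future positions, so the first token can only attend to itself.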


# III. Results

In [31]:
# wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
with open('input.txt', 'r', encoding='utf-8') as f:
    raw_text = f.read()

## Baseline

In [32]:
nanogpt(raw_text)

0.209729 M parameters
step 0: train loss 4.4116, val loss 4.4022
step 100: train loss 2.6568, val loss 2.6670
step 200: train loss 2.5090, val loss 2.5058
step 300: train loss 2.4194, val loss 2.4335
step 400: train loss 2.3505, val loss 2.3569
step 500: train loss 2.2965, val loss 2.3129
step 600: train loss 2.2410, val loss 2.2500
step 700: train loss 2.2047, val loss 2.2186
step 800: train loss 2.1635, val loss 2.1868
step 900: train loss 2.1238, val loss 2.1503
step 1000: train loss 2.1024, val loss 2.1289
step 1100: train loss 2.0705, val loss 2.1189
step 1200: train loss 2.0396, val loss 2.0808
step 1300: train loss 2.0243, val loss 2.0631
step 1400: train loss 1.9928, val loss 2.0369
step 1500: train loss 1.9699, val loss 2.0306
step 1600: train loss 1.9627, val loss 2.0476
step 1700: train loss 1.9412, val loss 2.0150
step 1800: train loss 1.9098, val loss 1.9967
step 1900: train loss 1.9082, val loss 1.9873
step 2000: train loss 1.8838, val loss 1.9931
step 2100: train loss 1.

## Hyperparameter tuning

### Case 1

Default parameters, increased `max_iters` from 5000 to 10000.

***Assumption:***

Increasing the maximum number of iterations might give the model more opportunity to learn from the training data and improve its performance. Ideally, the loss should be monitored as the number of iterations grows, until it stabilizes or stops changing.

The drawbacks are a longer training time and a higher risk of overfitting, since the model could start to memorize the training data instead of learning general patterns.
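
One way to monitor this is to read the printed log back in and track two things: the gap between validation and training loss, and how many evaluations have passed since the best validation loss. The sketch below is an illustration only; `parse_log` and `evals_since_best` are hypothetical helpers, not part of the notebook, and the three log lines are copied from the baseline run.

```python
import re

# matches lines such as "step 1800: train loss 1.9098, val loss 1.9967"
LOG_LINE = re.compile(r"step (\d+): train loss ([\d.]+), val loss ([\d.]+)")

def parse_log(text):
    """Return (step, train_loss, val_loss) tuples found in the printed log."""
    return [(int(s), float(tr), float(va)) for s, tr, va in LOG_LINE.findall(text)]

def evals_since_best(val_losses):
    """Number of evaluations since the lowest validation loss was seen."""
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return len(val_losses) - 1 - best_index

log = """
step 1800: train loss 1.9098, val loss 1.9967
step 1900: train loss 1.9082, val loss 1.9873
step 2000: train loss 1.8838, val loss 1.9931
"""
records = parse_log(log)
gaps = [round(va - tr, 4) for _, tr, va in records]
stale = evals_since_best([va for _, _, va in records])
print(gaps, stale)
```

A gap that keeps widening while `stale` grows would be a sign that further iterations mostly fit the training data.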

In [33]:
# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 10000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------

In [34]:
nanogpt(raw_text, batch_size, block_size, max_iters,
        eval_interval, learning_rate, device,
        eval_iters, n_embd, n_head, n_layer, dropout)

0.209729 M parameters
step 0: train loss 4.4116, val loss 4.4022
step 100: train loss 2.6568, val loss 2.6670
step 200: train loss 2.5090, val loss 2.5058
step 300: train loss 2.4194, val loss 2.4335
step 400: train loss 2.3505, val loss 2.3569
step 500: train loss 2.2965, val loss 2.3129
step 600: train loss 2.2410, val loss 2.2500
step 700: train loss 2.2047, val loss 2.2186
step 800: train loss 2.1635, val loss 2.1868
step 900: train loss 2.1238, val loss 2.1503
step 1000: train loss 2.1024, val loss 2.1289
step 1100: train loss 2.0705, val loss 2.1189
step 1200: train loss 2.0396, val loss 2.0808
step 1300: train loss 2.0243, val loss 2.0631
step 1400: train loss 1.9928, val loss 2.0369
step 1500: train loss 1.9699, val loss 2.0306
step 1600: train loss 1.9627, val loss 2.0476
step 1700: train loss 1.9412, val loss 2.0150
step 1800: train loss 1.9098, val loss 1.9967
step 1900: train loss 1.9082, val loss 1.9873
step 2000: train loss 1.8838, val loss 1.9931
step 2100: train loss 1.

### Case 2

Default parameters, increased `block_size` from 32 to 64.

***Assumption:***

Increasing the block size lets the model condition on longer sequences of text, which can help it capture longer-range dependencies. This is useful here because generating coherent paragraphs or a storyline is important.

This change also increases memory and computational requirements, since the model needs to process and store longer sequences. Moreover, a longer context may overwhelm a small model and make it harder to pick out the relevant patterns.
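
The extra cost can be estimated with simple arithmetic. This is a rough sketch with hypothetical helper names: it counts the entries of the `(T, T)` attention score matrices in one layer and the extra rows in the position embedding table when `block_size` goes from 32 to 64.

```python
def attention_score_entries(batch_size, n_head, block_size):
    """Entries in the (T, T) score matrices of one layer, over all heads and sequences."""
    return batch_size * n_head * block_size * block_size

def position_embedding_params(block_size, n_embd):
    """Parameters in nn.Embedding(block_size, n_embd)."""
    return block_size * n_embd

base = attention_score_entries(batch_size=16, n_head=4, block_size=32)
case2 = attention_score_entries(batch_size=16, n_head=4, block_size=64)
extra_params = position_embedding_params(64, 64) - position_embedding_params(32, 64)
print(base, case2, case2 / base, extra_params)
```

Doubling the context quadruples the attention scores per layer, while the parameter count grows only by the 2,048 extra position embeddings, which matches the reported increase from 0.209729 M to 0.211777 M parameters.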

In [35]:
# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 64 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------

In [36]:
nanogpt(raw_text, batch_size, block_size, max_iters,
        eval_interval, learning_rate, device,
        eval_iters, n_embd, n_head, n_layer, dropout)

0.211777 M parameters
step 0: train loss 4.3391, val loss 4.3483
step 100: train loss 2.6521, val loss 2.6653
step 200: train loss 2.5204, val loss 2.5179
step 300: train loss 2.4497, val loss 2.4702
step 400: train loss 2.3791, val loss 2.3923
step 500: train loss 2.3265, val loss 2.3391
step 600: train loss 2.2855, val loss 2.2954
step 700: train loss 2.2419, val loss 2.2676
step 800: train loss 2.2122, val loss 2.2333
step 900: train loss 2.1666, val loss 2.1923
step 1000: train loss 2.1302, val loss 2.1585
step 1100: train loss 2.0901, val loss 2.1322
step 1200: train loss 2.0550, val loss 2.1145
step 1300: train loss 2.0388, val loss 2.0796
step 1400: train loss 2.0033, val loss 2.0634
step 1500: train loss 1.9718, val loss 2.0452
step 1600: train loss 1.9616, val loss 2.0292
step 1700: train loss 1.9292, val loss 2.0132
step 1800: train loss 1.9059, val loss 1.9840
step 1900: train loss 1.8909, val loss 1.9804
step 2000: train loss 1.8743, val loss 1.9641
step 2100: train loss 1.

### Case 3

Default parameters, increased `n_head` from 4 to 8.

***Assumption:***

Increasing the number of attention heads lets the model attend to several kinds of relationships between positions in parallel. The model could become more "expressive" and flexible since it is able to learn more diverse representations, which can be useful in this case since generating coherent paragraphs or a storyline is important. Because `n_embd` stays at 64, each head shrinks from 16 to 8 dimensions (`head_size = n_embd // n_head`).

As a disadvantage, training may require more memory and computation, since more attention matrices are computed per layer. There is also a potential to overfit when the model learns patterns that are too specific to the training data.
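
Splitting the same embedding width across more heads does not add parameters, which a hand count can confirm. The sketch below rests on two assumptions not shown in this section: a vocabulary of 65 characters, and bias-free key, query, and value projections. The helper `count_parameters` is hypothetical; under these assumptions it reproduces the 0.209729 M figure printed by the notebook.

```python
def count_parameters(vocab_size, block_size, n_embd, n_head, n_layer):
    """Hand count of trainable parameters for the model defined above."""
    head_size = n_embd // n_head
    # key, query, value per head: assumed bias-free maps n_embd -> head_size
    attention = n_head * 3 * n_embd * head_size
    proj = n_embd * n_embd + n_embd                      # output projection with bias
    ffwd = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    layer_norms = 2 * (2 * n_embd)                       # ln1 and ln2, weight + bias
    block = attention + proj + ffwd + layer_norms
    embeddings = vocab_size * n_embd + block_size * n_embd
    final = 2 * n_embd + (n_embd * vocab_size + vocab_size)  # ln_f and lm_head
    return embeddings + n_layer * block + final

four_heads = count_parameters(vocab_size=65, block_size=32, n_embd=64, n_head=4, n_layer=4)
eight_heads = count_parameters(vocab_size=65, block_size=32, n_embd=64, n_head=8, n_layer=4)
print(four_heads, eight_heads)
```

Both calls return the same total, so any change in Case 3 comes from how the 64 dimensions are divided among heads, not from added capacity.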

In [37]:
# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 32 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 8
n_layer = 4
dropout = 0.0
# ------------

In [38]:
nanogpt(raw_text, batch_size, block_size, max_iters,
        eval_interval, learning_rate, device,
        eval_iters, n_embd, n_head, n_layer, dropout)

0.209729 M parameters
step 0: train loss 4.4322, val loss 4.4217
step 100: train loss 2.6608, val loss 2.6740
step 200: train loss 2.5182, val loss 2.5130
step 300: train loss 2.4444, val loss 2.4574
step 400: train loss 2.3802, val loss 2.3877
step 500: train loss 2.3223, val loss 2.3392
step 600: train loss 2.2742, val loss 2.2834
step 700: train loss 2.2369, val loss 2.2463
step 800: train loss 2.1988, val loss 2.2203
step 900: train loss 2.1566, val loss 2.1776
step 1000: train loss 2.1297, val loss 2.1519
step 1100: train loss 2.1014, val loss 2.1424
step 1200: train loss 2.0642, val loss 2.1038
step 1300: train loss 2.0508, val loss 2.0854
step 1400: train loss 2.0240, val loss 2.0629
step 1500: train loss 1.9996, val loss 2.0578
step 1600: train loss 1.9832, val loss 2.0673
step 1700: train loss 1.9708, val loss 2.0401
step 1800: train loss 1.9382, val loss 2.0297
step 1900: train loss 1.9300, val loss 2.0038
step 2000: train loss 1.9136, val loss 2.0203
step 2100: train loss 1.

# IV. Conclusion

<span style="font-size: 14px">
    <center><b>Comparison of model performance using different parameters</b></center>
</span>
<table>
    <thead>
        <tr>
            <th>Model</th>
            <th>Run time (mins., s)</th>
            <th>Train Loss</th>
            <th>Validation Loss</th>
            <th>Val. Loss Diff vs. Baseline (%)</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <th>Baseline</th>
            <td>5m, 58s</td>
            <td>1.6647</td>
            <td>1.8235</td>
            <td>-</td>
        </tr>
        <tr style="background-color: rgb(255, 255, 204);">
            <th>Case 1</th>
            <th>11m, 31s</th>
            <th>1.5636</th>
            <th>1.7391</th>
            <td>(-) 4.63%</td>
        </tr>
        <tr>
            <th>Case 2</th>
            <td>5m, 45s</td>
            <td>1.6021</td>
            <td>1.7830</td>
            <td>(-) 2.22%</td>
        </tr>
        <tr>
            <th>Case 3</th>
            <td>8m, 33s</td>
            <td>1.6844</td>
            <td>1.8442</td>
            <td>(+) 1.14%</td>
        </tr>
    </tbody>
</table>


As shown in the table above, ***Case 1*** (increasing the maximum number of iterations) produced the largest decrease in validation loss relative to the baseline model among the scenarios explored. However, this improvement comes at the expense of a longer run time: doubling the number of iterations roughly doubled the training time, from ~6 mins. for the baseline to ~12 mins.

This trade-off between improved performance and increased computational time is a common consideration in training machine learning models. It is crucial to assess whether the performance gains obtained by increasing the number of iterations justify the additional time investment. The decision to increase the maximum number of iterations should be made based on the specific requirements of the task, the available computational resources, and the desired balance between model performance and efficiency.
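
The percentage differences and the run-time ratio can be recomputed from the values in the table. This is a small sketch using hypothetical helper names; the numbers are copied from the table above.

```python
def to_seconds(minutes, seconds):
    """Convert a run time given as minutes and seconds to seconds."""
    return 60 * minutes + seconds

def pct_change(value, baseline):
    """Signed percentage change of value relative to baseline."""
    return 100.0 * (value - baseline) / baseline

baseline_val = 1.8235
val_loss = {"Case 1": 1.7391, "Case 2": 1.7830, "Case 3": 1.8442}
changes = {name: round(pct_change(v, baseline_val), 2) for name, v in val_loss.items()}
runtime_ratio = to_seconds(11, 31) / to_seconds(5, 58)  # Case 1 vs. baseline
print(changes, round(runtime_ratio, 2))
```

A negative value means a lower validation loss than the baseline. Cases 1 and 2 come out negative, while Case 3 comes out positive, meaning its validation loss is slightly higher than the baseline's.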

***Baseline***

And they bride with that yet King thou was to take Ourtuned?
It us bartht he usque, to bardetle
Hate away, my fears' comzorm he owns,
Hof is heart milending, and if ensent,
A latistriviov the does me now on you so, like die; litthus wonchiry:
Auf the speak you love's nor
To this deserving would that
To Winsught their as to them, His The shire
And Let were to
To knom thrugh fir tression must wind.


***Case 1***

When thy bride will kiss;
And mad yet beauth enanting. Sight, hangs; and have will, to beceding
Hate away, my fears are zosom:
Young, to find I commil; I like
es it ensel cin latest in overs, and in the woeld
jess, lesing me up the rece wity: therein speak yea:
That I can pate a lown morld, dellow?


***Case 2***

YORK:
Ricking will that year madise, buble weranty them me?
That such happe us comment? we that anes will my feans,
You orm heavens, toful the covert;
And butes if ensent, wilatise in overs,
He wife now on than twels: now, all lise, cours:
To carraiss hew yet lorn'd nor
To tell death:
I lood mother
To Wild to do piing o' man doubt, and shore
And of my heart
To kindnest firs so;
And he must wre male ofte,
Mades of my offer froul
Have you arm adand the Edwarms:
if courtear tey it? the hand for his need.
Ponce, you see, what you sorrow. When-wen;
There with ready shall bling win the cours.

***Case 3***

All before
Thow and is someth
backen bube to take One my dalina
My art that usquet to barderlancate away, my fackstary zorman
Your prooffice you have now
Whigef it entengmining is the overs, and
Will may is was twelll not, and thus, come by prave aiss hiw youngs
Has norfoldess togent:
Gllood mettake only son her evily, what
For His hangs in his somelien; thus pray nonter be son; if his shall no fled
And, and are grone my fright
Hastingdion
Should the Edwarn his best asare:
your his change care, time you bembry.
You contrantym son, and sevien these were throand?

For a qualitative comparison, the first paragraph generated in each case was compared with that of the baseline. As shown in the snippets above, ***Case 1*** stood out as it produced the most coherent and sensible text among the three cases explored. Its output also resembled the baseline's, with some lines opening the same way (e.g., "Hate away, my fears").

# V. Generative AI Documentation

ChatGPT was used in this work for the following:

1. Proofreading the explanations provided for the concepts.

2. Simplifying explanations of complex topics (e.g., Transformers; Attention).

3. Generating the documentation for certain code blocks.

4. Generating explanations of how certain code blocks work.


# VI. References

[1] Synced. (2018, June 7). Language Model: A Survey of the State-of-the-Art Technology. Medium. https://medium.com/syncedreview/language-model-a-survey-of-the-state-of-the-art-technology-64d1a2e5a466

[2] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems, 30, 5998–6008. https://arxiv.org/pdf/1706.03762v5

[3] Luong, M., Pham, H., & Manning, C. D. (2015, August 17). Effective Approaches to Attention-based Neural Machine Translation. arXiv. https://arxiv.org/abs/1508.04025

[4] Wydmanski, W. (2022, December 30). Self attention vs attention in transformers | MLearning.ai. Medium. https://medium.com/mlearning-ai/whats-the-difference-between-self-attention-and-attention-in-transformer-architecture-3780404382f3

[5] Kalra, G. (2022, July 18). Attention Networks: A simple way to understand Cross-Attention. Medium. https://medium.com/@geetkal67/attention-networks-a-simple-way-to-understand-cross-attention-3b396266d82e

[6] Alammar, J. (n.d.-a). The Illustrated Transformer. https://jalammar.github.io/illustrated-transformer/

[7] Alammar, J. (n.d.-b). Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention). https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

[8] Papers with Code - Residual Connection Explained. (n.d.). https://paperswithcode.com/method/residual-connection

[9] Papers with Code - Layer Normalization Explained. (n.d.). https://paperswithcode.com/method/layer-normalization

[10] Luo, J. (2021). Dropout Regularization for Self-Supervised Learning of Transformer Encoder Speech Representation. https://www.semanticscholar.org/paper/Dropout-Regularization-for-Self-Supervised-Learning-Luo-Wang/d918c11715bf8e24a81b4988916e8478c970deee