In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F

In [None]:
batch_size = 64 # how many independent sequences will we process in parallel?
block_size = 256 # what is the maximum context length for predictions?
max_iters = 5000
eval_interval = 500
learning_rate = 3e-4
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 384
n_head = 6
n_layer = 6
dropout = 0.2

n_embd, n_head, n_layer, dropout:

These hyperparameters define the architecture of the Transformer language model. n_embd is the embedding dimension, n_head is the number of attention heads in each multi-head attention layer, n_layer is the number of Transformer blocks, and dropout is the probability of zeroing an activation during training, a regularization technique that reduces overfitting. With n_embd = 384 and n_head = 6, each head works in a 384 / 6 = 64-dimensional subspace.

The Transformer architecture, introduced by Vaswani et al., was originally designed for sequence-to-sequence tasks. It relies on self-attention rather than recurrence, so all positions in a sequence can be processed in parallel during training.

An embedding is a vector representation of a discrete item, often used in natural language processing to convert categorical data (like words or characters) into a continuous vector space.
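
As a rough illustration of the lookup-table view, the sketch below builds a tiny embedding table in plain Python (no PyTorch) and adds a position vector to each token vector, which is how the model later combines its token and position embeddings. The three-character vocabulary, the table sizes, and the helper name `embed` are assumptions made for this example, not values from the notebook.

```python
import random

random.seed(0)

vocab = ['a', 'b', 'c']      # hypothetical 3-character vocabulary
n_embd_demo = 4              # hypothetical embedding width (the notebook uses 384)
block_size_demo = 8          # hypothetical maximum context length

# One row per token id and one row per position; in the real model these rows are learned.
token_table = [[random.gauss(0.0, 0.02) for _ in range(n_embd_demo)] for _ in vocab]
position_table = [[random.gauss(0.0, 0.02) for _ in range(n_embd_demo)]
                  for _ in range(block_size_demo)]

def embed(token_ids):
    """Look up each token's row and add the row for its position."""
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]

vectors = embed([2, 0, 0])   # "caa" as token ids
print(len(vectors), len(vectors[0]))   # 3 positions, 4 numbers each
```

The same token ('a') at positions 1 and 2 ends up with different vectors, because the position rows differ.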

In [None]:
torch.manual_seed(1337)

<torch._C.Generator at 0x7e9cf1775690>

In [None]:
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-01-13 08:48:59--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt’


2024-01-13 08:49:00 (35.8 MB/s) - ‘input.txt’ saved [1115394/1115394]



In [None]:
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

In [None]:
chars = sorted(list(set(text)))
vocab_size = len(chars)
vocab_size

65

In [None]:
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and validation splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]
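
To make the encode/decode round trip concrete, here is a small self-contained sketch on a made-up string instead of the Shakespeare file. The sample text and all `demo_` names are assumptions for illustration; the mapping logic mirrors the cell above.

```python
demo_text = "hello world"                 # stand-in for the contents of input.txt
demo_chars = sorted(set(demo_text))       # unique characters, sorted for a stable id order
demo_stoi = {ch: i for i, ch in enumerate(demo_chars)}
demo_itos = {i: ch for ch, i in demo_stoi.items()}

def demo_encode(s):
    """String -> list of integer ids."""
    return [demo_stoi[c] for c in s]

def demo_decode(ids):
    """List of integer ids -> string."""
    return ''.join(demo_itos[i] for i in ids)

ids = demo_encode("hello")
print(ids)                  # [3, 2, 4, 4, 5]
print(demo_decode(ids))     # hello

# 90/10 split by position, as in the notebook: the tail is held out for validation.
encoded = demo_encode(demo_text)
split = int(0.9 * len(encoded))
train_part, val_part = encoded[:split], encoded[split:]
```

A character that never appears in the source text has no id, so encoding it raises a KeyError; the vocabulary is fixed by the training file.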


In [None]:
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y
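
The relationship between x and y is easiest to see on a toy sequence. The plain-Python sketch below (the sequence, the sizes, and the name `toy_batch` are assumptions for illustration) draws random windows the same way and shows that each target row is the input row shifted by one token.

```python
import random

random.seed(0)

toy_data = list(range(100, 120))   # stand-in for the encoded text: 20 token ids
toy_block_size = 4                 # hypothetical context length
toy_batch_size = 3                 # hypothetical number of sequences per batch

def toy_batch():
    """Sample random windows; targets are the same windows shifted by one token."""
    # Start offsets run from 0 to len(data) - block_size - 1, so the target slice never overruns.
    starts = [random.randrange(len(toy_data) - toy_block_size) for _ in range(toy_batch_size)]
    xs = [toy_data[i:i + toy_block_size] for i in starts]
    ys = [toy_data[i + 1:i + toy_block_size + 1] for i in starts]
    return xs, ys

xs, ys = toy_batch()
for x_row, y_row in zip(xs, ys):
    print(x_row, '->', y_row)
```

For every position t in a row, the model is trained to predict y[t] from x[0..t], so one window of block_size tokens yields block_size training examples.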

In [None]:
@torch.no_grad()  # disable gradient tracking inside this function (evaluation only)
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out



The estimate_loss function above averages the loss over eval_iters batches from each split, with gradient tracking disabled and the model switched to evaluation mode for the duration.

The Head class below defines one head of self-attention.
It uses three linear layers (key, query, and value) to project the input tensor x into key, query, and value spaces.
The tril buffer is a lower-triangular matrix of ones that serves as the causal mask.
The forward method computes the attention scores, applies the mask, normalizes with softmax, and aggregates the values.


**SELF-ATTENTION MECHANISM**



In [None]:
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # input of size (batch, time-step, channels)
        # output of size (batch, time-step, head size)
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs)
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        # mask out future positions (entries above the diagonal, where tril == 0) so that
        # each position attends only to itself and to earlier positions
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

super().__init__() calls the constructor of the parent class (nn.Module). It's necessary for proper initialization.

self.key, self.query, and self.value are linear layers responsible for projecting the input tensor into key, query, and value spaces, respectively. These projections are crucial for the self-attention mechanism.

nn.Linear(n_embd, head_size, bias=False) creates a linear layer with no bias term, where n_embd is the input size and head_size is the output size.

self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size))) registers a buffer named tril in the module. Buffers are saved in the state_dict and move with the module (for example on .to(device)), but they are not parameters, so the optimizer never updates them. Here, tril is a block_size x block_size lower-triangular matrix of ones created with torch.tril.

self.dropout = nn.Dropout(dropout) creates a dropout layer with a dropout probability specified by the dropout variable. Dropout is a regularization technique used to prevent overfitting by randomly setting a fraction of input units to zero during training.

FORWARD

The forward function in the Head class aims to perform self-attention on the input tensor x. Self-attention is a mechanism that allows the model to weigh different positions of the input differently when making predictions, considering dependencies between different positions in the sequence.

Here's a breakdown of what the function does:

Linear Transformations:

The input tensor x is passed through three linear layers (key, query, value), producing three tensors k (key), q (query), and v (value), each of shape (batch, time-step, head size).

Attention Scores Calculation:

Attention scores, often called "affinities," are the dot products between the query (q) and key (k) vectors. The scores are multiplied by head_size ** -0.5, that is, divided by the square root of the key dimension, which keeps their variance from growing with head_size. Each score measures how much one position should attend to another. The result is a tensor (wei) of shape (batch, time-step, time-step).

Masking:

Entries above the diagonal of the attention matrix (where tril is 0) are set to -inf, which softmax turns into zero weight. This ensures that each position attends only to positions at or before it, so no information flows from future positions to earlier ones.

Softmax Activation:

Softmax is applied along the last dimension to obtain normalized attention weights (wei), so each row of weights sums to 1.

Dropout:

Dropout is applied to the attention weights for regularization, randomly setting some of the weights to zero during training.

Weighted Aggregation:

Each output vector is a weighted sum of the value vectors (v), using the attention weights (wei) for that position.

Output:

The final output (out), of shape (batch, time-step, head size), is the result of the self-attention mechanism for the input tensor x.
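
The steps above can be sketched for a single sequence in plain Python, without PyTorch. This is a minimal illustration under stated assumptions: the function name `causal_attention` and the toy q, k, v values are made up, the learned projections are skipped (q, k, v are given directly), and dropout is omitted.

```python
import math

def causal_attention(q, k, v):
    """q, k, v: lists of T vectors. Returns (outputs, weights) for one sequence."""
    T, head_size = len(q), len(q[0])
    scale = head_size ** -0.5
    outputs, weights = [], []
    for t in range(T):
        # Scaled dot-product scores; positions after t are masked with -inf.
        scores = [
            scale * sum(a * b for a, b in zip(q[t], k[s])) if s <= t else float('-inf')
            for s in range(T)
        ]
        # Softmax over the row; exp(-inf) is 0, so masked positions get zero weight.
        m = max(scores)
        exps = [math.exp(sc - m) for sc in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[s][d] for s, w in enumerate(row)) for d in range(len(v[0]))])
    return outputs, weights

# Toy inputs: T = 3 positions, head_size = 2 (hypothetical values)
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out, wei = causal_attention(q, k, v)
for row in wei:
    print([round(w, 3) for w in row])
```

The first row of weights is [1.0, 0.0, 0.0]: the first position can only attend to itself, so its output equals its own value vector.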

**MULTIPLE HEADS OF SELF ATTENTION IN PARALLEL**

In [None]:

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ two linear layers with a ReLU in between, followed by dropout """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

class GPTLanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # lookup tables that map token ids and positions to n_embd-dimensional vectors
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

        # initialize weights: normal(mean=0, std=0.02) for linear and embedding layers, zeros for biases
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=idx.device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)  # logits are unnormalized scores over the vocabulary
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

The MultiHeadAttention class applies several heads of self-attention in parallel and then combines their outputs. This is a key component of the Transformer architecture: each head has its own projections, so different heads can capture different patterns and dependencies in the input sequence.

Here's an explanation of the class:

Initialization:

The __init__ method takes two parameters: num_heads and head_size.
It creates num_heads instances of the Head class and stores them in an nn.ModuleList, so each head's learnable parameters are registered with the module.
The proj linear layer mixes the concatenated head outputs. Its input dimension is head_size * num_heads and its output dimension is n_embd, which matches the rest of the model (here 64 * 6 = 384 for both).
A dropout layer is created for use on the projected output.

Forward Pass:

The forward method takes an input tensor x and applies each head's self-attention independently.
The head outputs are concatenated along the last dimension (dim=-1), giving a tensor of shape (B, T, num_heads * head_size).
The concatenated output is linearly transformed by the proj layer.
Dropout is applied to the projected output for regularization.
The final output has shape (B, T, n_embd) and represents the result of applying multiple heads of self-attention to the input.
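
The concatenation step can be pictured with plain lists. The sketch below is only an illustration: `concat_heads` and the two toy heads are hypothetical, and the learned proj layer and dropout are left out.

```python
n_embd, n_head = 384, 6            # values from the hyperparameter cell
head_size = n_embd // n_head       # 64
assert head_size * n_head == n_embd

def concat_heads(head_outputs):
    """head_outputs: one entry per head, each a list of T vectors of length head_size.
    Returns T vectors of length num_heads * head_size (concatenation along the feature axis)."""
    T = len(head_outputs[0])
    return [[x for head in head_outputs for x in head[t]] for t in range(T)]

# Two hypothetical heads, T = 2 time steps, toy head size of 3
head_a = [[1, 2, 3], [4, 5, 6]]
head_b = [[7, 8, 9], [10, 11, 12]]

merged = concat_heads([head_a, head_b])
print(merged)   # [[1, 2, 3, 7, 8, 9], [4, 5, 6, 10, 11, 12]]
```

Each time step keeps its own row; only the feature dimension grows, which is what dim=-1 selects in torch.cat.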

In [None]:

model = GPTLanguageModel()
m = model.to(device)

In [None]:
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')  # total number of parameters in the model

10.788929 M parameters
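
The printed figure can be cross-checked by hand from the layer shapes. The arithmetic below is a sketch in plain Python; the helper names `linear` and `layer_norm` are made up for the example and only count parameters. The tril buffers are not included because buffers are not parameters.

```python
vocab_size, block_size = 65, 256
n_embd, n_head, n_layer = 384, 6, 6
head_size = n_embd // n_head

def linear(n_in, n_out, bias=True):
    """Parameter count of nn.Linear(n_in, n_out)."""
    return n_in * n_out + (n_out if bias else 0)

def layer_norm(n):
    """nn.LayerNorm(n) holds a weight vector and a bias vector."""
    return 2 * n

head = 3 * linear(n_embd, head_size, bias=False)                 # key, query, value
attention = n_head * head + linear(n_head * head_size, n_embd)   # heads + output projection
feed_forward = linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)
block = attention + feed_forward + 2 * layer_norm(n_embd)

total = (
    vocab_size * n_embd             # token embedding table
    + block_size * n_embd           # position embedding table
    + n_layer * block               # six Transformer blocks
    + layer_norm(n_embd)            # ln_f
    + linear(n_embd, vocab_size)    # lm_head
)
print(total / 1e6, 'M parameters')   # 10.788929 M parameters
```

Almost all of the parameters sit in the blocks (about 1.77 M each); the two feed-forward linear layers account for roughly two thirds of every block.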


In [None]:
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
# AdamW is a variant of Adam with decoupled weight decay, a form of regularization
# that is applied directly to the weights instead of being folded into the gradient.
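
To show what "decoupled weight decay" means in practice, here is a sketch of one AdamW update for a single scalar parameter in plain Python. The function name `adamw_step` is made up, and the default values for betas, eps, and weight_decay are assumed to match PyTorch's defaults; only lr comes from this notebook.

```python
import math

def adamw_step(p, grad, m, v, t, lr=3e-4, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter p at step t (t starts at 1). Returns (p, m, v)."""
    b1, b2 = betas
    p = p - lr * weight_decay * p           # decoupled weight decay: shrink the weight directly
    m = b1 * m + (1 - b1) * grad            # running mean of gradients
    v = b2 * v + (1 - b2) * grad * grad     # running mean of squared gradients
    m_hat = m / (1 - b1 ** t)               # bias correction for the zero initialization
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

p, m, v = 1.0, 0.0, 0.0
p, m, v = adamw_step(p, grad=0.5, m=m, v=v, t=1)
print(p)
```

On the first step the bias-corrected ratio m_hat / sqrt(v_hat) is close to 1 in magnitude, so the parameter moves by roughly lr regardless of the gradient's scale, plus the small decay term.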

In [None]:
for step in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if step % eval_interval == 0 or step == max_iters - 1:
        losses = estimate_loss()
        print(f"step {step}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=500)[0].tolist()))

step 0: train loss 4.2221, val loss 4.2306
step 500: train loss 1.7600, val loss 1.9146
step 1000: train loss 1.3903, val loss 1.5987
step 1500: train loss 1.2644, val loss 1.5271
step 2000: train loss 1.1835, val loss 1.4978
step 2500: train loss 1.1233, val loss 1.4910
step 3000: train loss 1.0718, val loss 1.4804
step 3500: train loss 1.0179, val loss 1.5127
step 4000: train loss 0.9604, val loss 1.5102
step 4500: train loss 0.9125, val loss 1.5351
step 4999: train loss 0.8589, val loss 1.5565

But with prison, I will steal for the fimker.

KING HENRY VI:
To prevent it, as I love this country's cause.

HENRY BOLINGBROKE:
I thank bhop my follow. Walk ye were so?

NORTHUMBERLAND:
My lord, I hearison! Who may love me accurse
Some chold or flights then men shows to great the cur
Ye cause who fled the trick that did princely action?
Take my captiving sound, althoughts thy crown.

RICHMOND NE:
God neit will he not make it wise this!

DUKE VINCENTIO:
Worthy Prince forth from Lord Claudio!


The loop runs for max_iters iterations, the number of training steps set in the hyperparameters.
Every eval_interval iterations, and again at the last iteration (max_iters - 1), the model is evaluated on both the training and validation sets.
The estimate_loss function averages the loss over eval_iters batches per split and returns a dictionary with the training and validation losses, which are printed to monitor progress.
In the run above, validation loss reaches its minimum of 1.4804 at step 3000 and then rises while training loss keeps falling, which indicates that the model starts to overfit.
In each iteration, the model takes a batch of inputs (xb) and targets (yb) and returns the logits and the corresponding loss.
optimizer.zero_grad(set_to_none=True) clears the gradients left over from the previous iteration so that they do not accumulate.
loss.backward() backpropagates the loss to compute fresh gradients.
optimizer.step() then updates the parameters using those gradients.

After training, the model is used for text generation.
The context tensor is a single token with index 0, and the generate method extends it by 500 new tokens.
The generated sequence is decoded back to characters and printed.
Note that the model is still in training mode at this point, so dropout remains active during sampling; calling model.eval() first would disable it.
Together, these steps form the core of the training process: periodic loss evaluation, parameter updates, and a final text sample.
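
The loss values in the log are easier to interpret against a baseline. The sketch below, in plain Python, computes the cross-entropy of a model that guesses uniformly over the 65 characters and converts a loss into a perplexity. The function name `cross_entropy` is a stand-in written for this example, not the PyTorch function.

```python
import math

vocab_size = 65

# A model that assigns equal probability to every character has cross-entropy ln(vocab_size).
uniform_loss = math.log(vocab_size)
print(round(uniform_loss, 4))          # 4.1744

def cross_entropy(logits, target):
    """Negative log-probability of the target under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# All-zero logits reproduce the uniform baseline.
print(round(cross_entropy([0.0] * vocab_size, target=3), 4))   # 4.1744

# Perplexity = exp(loss): the effective number of equally likely choices per character.
best_val_loss = 1.4804                 # step 3000 in the log above
print(round(math.exp(best_val_loss), 2))   # roughly 4.39
```

The step-0 losses in the log (about 4.22) sit close to this uniform baseline, which is what one would expect from a freshly initialized model with small weights.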

How Does Dropout Work?

During training, on every forward pass, dropout randomly "drops out" (sets to zero) a subset of the activations it is applied to, each with probability p (here dropout = 0.2).
A dropped unit contributes nothing to the forward or backward pass for that training step; a different random subset is dropped at the next step.
PyTorch's nn.Dropout also scales the surviving activations by 1/(1 - p) so that their expected value is unchanged, and it does nothing in evaluation mode (model.eval()).
This injected noise prevents the network from becoming overly reliant on specific neurons or features and encourages more robust, generalizable representations.
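
A minimal plain-Python sketch of this behavior (often called inverted dropout) is shown below. The function name `dropout` and the toy activations are assumptions for illustration; it is meant to mirror the described behavior of nn.Dropout, not its implementation.

```python
import random

def dropout(values, p, training=True, rng=random):
    """Zero each value with probability p and scale the survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(values)              # identity at evaluation time
    keep = 1.0 - p
    return [x / keep if rng.random() >= p else 0.0 for x in values]

random.seed(0)
activations = [1.0] * 10
print(dropout(activations, p=0.2))                   # zeros where dropped, 1.25 where kept
print(dropout(activations, p=0.2, training=False))   # unchanged
```

Because survivors are scaled up during training, nothing has to be rescaled at evaluation time.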

The FeedFoward class represents the feedforward network used in each Transformer block. Let's break down its structure and purpose:

1. Initialization:

The __init__ method is the constructor that defines the architecture of the feedforward layer.
It takes n_embd as a parameter, which represents the embedding dimension, a key parameter in the Transformer model.
2. Network Architecture:

The feedforward layer consists of a simple neural network, defined using nn.Sequential.
It starts with a linear (fully connected) layer that maps the input of dimension n_embd to an intermediate dimension of 4 * n_embd.
This is followed by a Rectified Linear Unit (ReLU) activation function, introducing non-linearity to the network.
The output of the ReLU activation is then passed through another linear layer that maps from the intermediate dimension back to the original embedding dimension, n_embd.
Finally, a dropout layer is applied. Dropout is a regularization technique that randomly sets a fraction of input units to zero during training to prevent overfitting.
3. Forward Pass:

The forward method defines how the input x is processed through the feedforward layer during the forward pass.
It simply passes the input x through the defined network (self.net), applying the linear transformations and non-linearities in sequence.
4. Purpose:

The purpose of this feedforward layer is to introduce non-linearity and to transform each position's vector independently after attention has mixed information across positions.
The intermediate dimension of 4 * n_embd (1536 here) gives the layer more hidden units to work with; the 4x expansion follows the original Transformer design.
Dropout is applied to regularize the network, reducing overfitting and improving generalization to unseen data.
5. Analogy: Math Problem Solving

Think of this like solving a math problem: you start with a simple equation, apply a non-linear transformation (like squaring), then simplify it further. Dropout is like occasionally skipping a step or introducing variability in your problem-solving approach to improve your understanding.
In the context of the Transformer model, this feedforward layer is a crucial component within each transformer block, contributing to the model's ability to capture and learn complex patterns in sequential data.
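
The expand, ReLU, project-back pattern can be traced by hand on a tiny example. The sketch below uses plain Python lists; the weights are hand-picked (hypothetical) rather than learned, and dropout is omitted as it would be at evaluation time.

```python
def linear(x, weight, bias):
    """y = W x + b for one input vector x; weight has one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    """Element-wise max(0, x)."""
    return [max(0.0, xi) for xi in x]

def feed_forward(x, w1, b1, w2, b2):
    """Expand to the hidden width, apply ReLU, project back to the input width."""
    return linear(relu(linear(x, w1, b1)), w2, b2)

# Hypothetical toy sizes: width 2, hidden width 4 (the notebook uses 384 and 1536).
w1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]   # 4 x 2
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]         # 2 x 4
b2 = [0.0, 0.0]

print(feed_forward([3.0, -2.0], w1, b1, w2, b2))   # [3.0, 2.0]
```

With these weights the network computes the element-wise absolute value, a function no single linear layer can represent; the ReLU between the two layers is what makes this possible.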

The Block class represents a single Transformer block, which is a fundamental building block of the Transformer model. Let's break down its structure and purpose:

1. Initialization:

The __init__ method is the constructor that defines the components of the Transformer block.
It takes n_embd as the embedding dimension and n_head as the number of attention heads.
head_size is calculated as n_embd // n_head, representing the size of each attention head.
2. Components of the Transformer Block:

Self-Attention (sa): This is an instance of the MultiHeadAttention class, which is responsible for capturing relationships between different positions of the input sequence.
Feedforward (ffwd): This is an instance of the FeedFoward class (spelled this way in the code), a small per-position network that introduces non-linearity and captures complex patterns in the data.
Layer Normalization (ln1, ln2): Two instances of nn.LayerNorm. ln1 is applied to the input of the self-attention component and ln2 to the input of the feedforward component (a "pre-norm" arrangement). Layer normalization helps stabilize the training process.
3. Forward Pass:

The forward method defines the forward pass of the Transformer block.
It first applies self-attention (self.sa) to the input x after passing it through layer normalization (self.ln1(x)).
The output is then added to the original input (x), creating a residual connection.
After another layer normalization (self.ln2(x)), the result is passed through the feedforward component (self.ffwd).
Again, the output is added to the previous result, creating another residual connection.
The final output of the block is returned.
4. Purpose:

The Transformer block enables the model to capture hierarchical and long-range dependencies in the input sequence.
Self-attention allows the model to focus on different parts of the sequence, and the feedforward component captures complex patterns.
Layer normalization and residual connections contribute to stable and efficient training.
5. Analogy: Team Collaboration

Think of this like a team collaboration: self-attention is like team members communicating and sharing information, feedforward is like individual team members working on specific tasks, and layer normalization and residuals are like team members maintaining a consistent and stable workflow.
In the context of the overall Transformer model, stacking multiple such blocks allows the model to learn and represent intricate patterns in sequential data, making it effective for various natural language processing tasks.
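
The data flow of the block, normalize, apply a sublayer, add the result back, can be sketched for a single vector in plain Python. Everything here is a simplification under stated assumptions: `layer_norm` omits the learned scale and shift, and the sublayers `zero` and `halve` are hypothetical stand-ins, whereas real self-attention also mixes information across positions.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (learned scale/shift omitted)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def block(x, attention, feed_forward):
    """Pre-norm residual block applied to a single position's vector."""
    a = attention(layer_norm(x))
    x = [xi + ai for xi, ai in zip(x, a)]        # x = x + sa(ln1(x))
    f = feed_forward(layer_norm(x))
    return [xi + fi for xi, fi in zip(x, f)]     # x = x + ffwd(ln2(x))

def zero(v):
    """Stand-in sublayer that contributes nothing."""
    return [0.0 for _ in v]

def halve(v):
    """Stand-in sublayer that returns half of its (normalized) input."""
    return [0.5 * vi for vi in v]

x = [1.0, 2.0, 3.0, 4.0]
print(block(x, attention=zero, feed_forward=zero))    # [1.0, 2.0, 3.0, 4.0]
print(block(x, attention=halve, feed_forward=zero))
```

When a sublayer outputs zeros, the input passes through unchanged. That identity path is what the residual connections provide, and it is one reason deep stacks of such blocks remain trainable.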

The GPTLanguageModel class defines the architecture of the GPT (Generative Pre-trained Transformer) language model. Let's go through its components and their functionalities:

1. Token and Position Embeddings:

self.token_embedding_table: Embedding layer for token embeddings. Each token in the vocabulary is represented as an embedding vector.
self.position_embedding_table: Embedding layer for position embeddings. It assigns each position in the input sequence a unique embedding vector. This helps the model understand the sequential order of tokens.
2. Transformer Blocks:

self.blocks: A stack of Transformer blocks. The number of blocks is determined by the n_layer hyperparameter. Each block consists of self-attention, feedforward networks, and layer normalization.
3. Layer Normalization and Final Layer:

self.ln_f: Layer normalization applied to the final output of the Transformer blocks. It helps stabilize and normalize the outputs.
self.lm_head: Linear layer that produces logits for the next token based on the final output of the model.
4. Initialization:

The constructor calls self.apply(self._init_weights); nn.Module.apply runs the given function on every submodule and on the model itself. Weights of linear and embedding layers are drawn from a normal distribution with mean 0 and standard deviation 0.02, a small scale that keeps the initial activations and logits small.
5. Initialization Method (_init_weights):

The _init_weights method is used to initialize the weights of linear and embedding layers.
For linear layers (nn.Linear), weights are initialized from a normal distribution, and biases are set to zero.
For embedding layers (nn.Embedding), weights are also initialized from a normal distribution.
6. Purpose:

The GPT language model processes input sequences by embedding tokens and positions, passing them through multiple Transformer blocks, and generating logits for the next token.
The model is trained to minimize the cross-entropy loss between predicted logits and actual tokens in the training data.
7. Model Initialization:

An instance of this model is created using model = GPTLanguageModel() and moved to the specified device (CPU or GPU).
The total number of parameters in the model is printed to give an idea of the model size.
This architecture is the core of GPT-style models, enabling them to capture complex patterns and dependencies in sequential data.

The forward method of the GPTLanguageModel class defines how input sequences are processed during both training and inference. Additionally, the generate method is designed for generating new sequences. Let's break down these methods:

1. Forward Method (forward):

Input:
idx: Tensor of shape (B, T), representing a batch of input sequences where each element is an index corresponding to a token in the vocabulary.
targets: Optional tensor of shape (B, T), representing the target tokens for training.
Processing Steps:
Token and Position Embedding: Embeds tokens and adds positional embeddings.
Transformer Blocks: Passes the embedded sequences through the stack of Transformer blocks (self.blocks).
Layer Normalization: Applies layer normalization (self.ln_f) to the output of Transformer blocks.
Logits Generation: Produces logits for the next token using a linear layer (self.lm_head).
Loss Computation (if targets provided): Reshapes the logits to (B*T, C) and the targets to (B*T), the layout F.cross_entropy expects, then computes the cross-entropy loss between predicted logits and target tokens.
2. Generation Method (generate):

Input:
idx: Tensor of shape (B, T), representing a batch of input sequences for generating new tokens.
max_new_tokens: Maximum number of new tokens to generate.
Processing Steps:
Crop Sequence: Keeps only the last block_size tokens of the running sequence, because the position embedding table has only block_size entries.
Get Predictions: Generates logits for the next token by calling the forward method.
Focus on Last Time Step: Extracts logits corresponding to the last time step.
Softmax: Applies softmax to obtain a probability distribution over the vocabulary.
Sampling: Uses multinomial sampling to select the next token based on the probability distribution.
Update Sequence: Appends the sampled token to the running sequence.
Repeat: Repeats the process for the specified number of max_new_tokens.
These methods demonstrate how the GPT language model can be used for both training (with targets provided) and generation (without targets). During generation, the model autoregressively predicts the next token and updates the input sequence for subsequent predictions. This mechanism allows the model to generate coherent sequences of text.
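
The generation steps listed above can be sketched without PyTorch. In the example below, `toy_logits` is a hypothetical stand-in for the trained model with a five-token vocabulary, and `generate` here is a simplified single-sequence version written for illustration, not the method from the class.

```python
import math
import random

def softmax(logits):
    """Turn unnormalized scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(next_token_logits, context, max_new_tokens, block_size, rng):
    """Autoregressive sampling: crop, predict, sample, append."""
    tokens = list(context)
    for _ in range(max_new_tokens):
        window = tokens[-block_size:]                 # only the last block_size tokens are used
        probs = softmax(next_token_logits(window))    # distribution over the vocabulary
        next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_id)
    return tokens

def toy_logits(window):
    """Hypothetical model: strongly prefers (last token + 1) mod 5."""
    preferred = (window[-1] + 1) % 5
    return [5.0 if i == preferred else 0.0 for i in range(5)]

sample = generate(toy_logits, context=[0], max_new_tokens=10, block_size=4,
                  rng=random.Random(0))
print(sample)
```

Because the next token is sampled rather than chosen greedily, the output usually follows the preferred pattern but occasionally deviates, which is the same reason the Shakespeare sample above differs from run to run.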