# Group 5 GPT Project IS 640

https://www.kaggle.com/datasets/suraj520/imdb-tv-series-data?select=musical_series.csv

## Milestone 1: Dataset Exploration and Preparation

### Import all the modules and packages

In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F
import pandas as pd
import numpy as np
import zipfile
import os

### Description: 
This dataset contains information about TV series from IMDb, including details such as title, IMDb ID, release year, genre, cast, synopsis, rating, runtime, certificate, number of votes, and gross revenue. The data is scraped from the IMDb website using web scraping techniques and is organized into separate CSV files for each genre.

### Features:

- Title: The title of the TV series.
- IMDb ID: The unique identifier for the series on IMDb.
- Release Year: The year in which the series was released.
- Genre: The genre(s) of the series.
- Cast: The main cast members of the series.
- Synopsis: A brief summary or description of the series.
- Rating: The average rating of the series on IMDb (scaled from 1 to 10).
- Runtime: The duration of each episode or the total runtime of the series.
- Certificate: The content rating or certificate assigned to the series (e.g., PG-13, TV-MA).
- Number of Votes: The total number of votes or ratings received by the series.
- Gross Revenue: The total gross revenue generated by the series (if available).

### Objective:

We aim to generate text using the GPT transformer model, focusing exclusively on the 'Synopsis' column of the TV series dataset. Our goal is to clean and preprocess the 'Synopsis' data by converting all text to lowercase and replacing non-alphanumeric characters (except dots) with spaces, and then utilize the GPT transformer to generate coherent and relevant text based on the cleaned synopsis data.

### Basic ETL and Data Cleansing

### Extract the Zip folder

In [None]:
# Path to the local ZIP file
zip_file_path = 'tv_series_data.zip'

# Extract the ZIP file to a folder
extracted_folder = 'tv_series_data'
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extracted_folder)

### Combine all CSV files into one DataFrame

In [None]:

combined_data = pd.DataFrame()
for file in os.listdir(extracted_folder):
    if file.endswith('.csv'):
        file_path = os.path.join(extracted_folder, file)
        df = pd.read_csv(file_path)
        combined_data = pd.concat([combined_data, df], ignore_index=True)

  combined_data = pd.concat([combined_data, df], ignore_index=True)


### View the first 5 rows of the dataset

In [None]:
combined_data.head()

Unnamed: 0,Title,IMDb ID,Release Year,Genre,Cast,Synopsis,Rating,Runtime,Certificate,Number of Votes,Gross Revenue
0,Spider-Man: Across the Spider-Verse,tt9362722,2023,"Animation, Action, Adventure","Directors:, Joaquim Dos Santos, , Kemp Powers,...","Miles Morales catapults across the Multiverse,...",9.1,140 min,PG,71960,
1,FUBAR,tt13064902,2023–,"Action, Adventure, Thriller","Stars:, Arnold Schwarzenegger, , Monica Barbar...",A C.I.A. operative on the edge of retirement d...,6.5,,TV-MA,15422,
2,Barry,tt5348176,2018–2023,"Action, Comedy, Crime","Stars:, Bill Hader, , Stephen Root, , Sarah Go...",A hit man from the Midwest moves to Los Angele...,8.4,30 min,TV-MA,101883,
3,John Wick: Chapter 4,tt10366206,2023,"Action, Crime, Thriller","Director:, Chad Stahelski, | , Stars:, Kea...",John Wick uncovers a path to defeating The Hig...,8.0,169 min,R,195078,
4,Fast X,tt5433140,2023,"Action, Adventure, Crime","Director:, Louis Leterrier, | , Stars:, Vi...",Dom Toretto and his family are targeted by the...,6.3,141 min,PG-13,39326,


### Inspect the combined data and view the information of the dataset

In [None]:
print("Combined Data Shape:", combined_data.shape)
print("\nCombined Data Columns:", combined_data.columns)
print("\nCombined Data Info:")
print(combined_data.info())
print("\nCombined Data Description:")
print(combined_data.describe())
print("\nMissing Values:")
print(combined_data.isnull().sum())

Combined Data Shape: (236828, 11)

Combined Data Columns: Index(['Title', 'IMDb ID', 'Release Year', 'Genre', 'Cast', 'Synopsis',
       'Rating', 'Runtime', 'Certificate', 'Number of Votes', 'Gross Revenue'],
      dtype='object')

Combined Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 236828 entries, 0 to 236827
Data columns (total 11 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   Title            236828 non-null  object 
 1   IMDb ID          236828 non-null  object 
 2   Release Year     236819 non-null  object 
 3   Genre            236828 non-null  object 
 4   Cast             235956 non-null  object 
 5   Synopsis         236828 non-null  object 
 6   Rating           236828 non-null  float64
 7   Runtime          216983 non-null  object 
 8   Certificate      169091 non-null  object 
 9   Number of Votes  236828 non-null  object 
 10  Gross Revenue    45611 non-null   object 
dtypes: float64(1), objec

### Clean the data: convert to lowercase and replace non-alphanumeric characters (except dots) with spaces

In [None]:
def clean_text(text):
    return ''.join(char.lower() if char.isalnum() or char == '.' else ' ' for char in text)

### Create a new df called cleaned_data which contains only the cleaned text

In [None]:
cleaned_text = combined_data['Synopsis'].apply(clean_text)

In [None]:
cleaned_data = pd.DataFrame(cleaned_text, columns=['Synopsis'])

In [None]:
# Display the first few rows of the cleaned data
print(cleaned_data.head())

                                            Synopsis
0  miles morales catapults across the multiverse ...
1  a c.i.a. operative on the edge of retirement d...
2  a hit man from the midwest moves to los angele...
3  john wick uncovers a path to defeating the hig...
4  dom toretto and his family are targeted by the...


### Rename column header from Synopsis to text

In [None]:
cleaned_data = cleaned_data.rename(columns={'Synopsis': 'text'})

### Save it into a csv file called `tv_series_synopsis_full.csv`

In [None]:
# Save the cleaned data to a CSV file
cleaned_data.to_csv('tv_series_synopsis_full.csv', index=False)

### Define the hyperparameters for fine tuning

In [2]:
batch_size = 32 # Number of sequences processed in parallel during training
block_size = 128 # Maximum context length for predictions (sequence length)
max_iters = 5000 # Total number of training iterations
eval_interval = 100 # How often to evaluate the model (every 100 iterations)
learning_rate = 1e-3  # Step size for gradient descent optimization
device = 'cuda' if torch.cuda.is_available() else 'cpu' # Use GPU if available, otherwise CPU
eval_iters = 200 # Number of iterations for loss estimation during evaluation
n_embd = 128 # Dimensionality of the token embeddings and model's hidden layers
n_head = 8  # Number of attention heads in each self-attention layer
n_layer = 8 # Number of transformer layers in the model
dropout = 0.1 # Probability of dropping out neurons during training (regularization)

torch.manual_seed(1337)  # Set random seed for reproducibility

<torch._C.Generator at 0x222ff8b3fb0>

### Choosing TV show data as the dataset

In [3]:
df = pd.read_csv('Dataset/tv_series_synopsis_full.csv', encoding='latin-1')
df['combined'] =  df['text'].astype(str)
text = " ".join(df['combined'].dropna().tolist())
text[:500]  # print the first 500 characters of the text

'miles morales catapults across the multiverse, where he encounters a team of spiderpeople charged with protecting its very existence. when the heroes clash on how to handle a new threat, miles must redefine what it means to be a hero. a c.i.a. operative on the edge of retirement discovers a family secret and is called back into the field for one last job. a hit man from the midwest moves to los angeles and gets caught up in the citys theatre arts scene. john wick uncovers a path to defeating the'

### Converting string to numerical format for training and testing.
1. Extract the unique characters and find the count of the vocabulary
2. Map the characters to integers and vice versa
3. Define the encode function which converts strings into numerical format
4. Define the decode function which converts numbers into strings

In [4]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

### Diving the data into training and validation sets
1. Encode the text into numbers so that it can be processed as a pytorch tensor
2. Define the split ratio
3. Make the training and validation sets

In [5]:
# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

### Create functions for batch loading and loss estimation
`get_batch`:
Creates small, random batches of input-output pairs for training or validation.
Ensures the model learns from diverse examples within the dataset.

`estimate_loss`:
Provides a measure of the model's performance on both training and validation datasets.
Helps monitor overfitting (training loss much lower than validation loss) and guide hyperparameter tuning.

In [6]:
# data loading
def get_batch(split):
    """
    Generate a small batch of data of inputs x and targets y.

    Args:
        split: 'train' or 'val'. if 'train', we sample from train_data, otherwise val_data

    Returns:
        x: a tensor of shape (bs, block_size) representing the input sequence
        y: a tensor of shape (bs, block_size) representing the target sequence
    """
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    """
    Estimates the average loss for the training and validation datasets 
    over a fixed number of evaluation iterations.

    Returns:
        Dict[str, float]: A dictionary containing the mean loss for both the 
        training and validation datasets. Keys are:
            - 'train': Mean loss for the training dataset.
            - 'val': Mean loss for the validation dataset.
    """
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

## Milestone 2: Basic Model Usage (Bigram Language Model)

Description: This milestone introduces a simple bigram language model. It predicts the next token based solely on the current token, without considering any broader context.

How it works: The model uses a simple lookup table to predict the next token based on the current one.

Code changes:
- Implementation of a basic nn.Embedding layer for token prediction
- Simple forward pass that uses only the current token to predict the next

Metrics: Basic tracking of training and validation loss.

In [None]:
class BigramLanguageModel(nn.Module):
    """
    A simple bigram-based language model that predicts the next token 
    based on the current token using an embedding layer. This model is 
    primarily used as a basic demonstration of language modeling concepts.

    Args:
        vocab_size (int): The size of the vocabulary, defining the number of unique tokens.

    Attributes:
        token_embedding_table (nn.Embedding): Embedding layer that maps tokens to logits 
            for all tokens in the vocabulary.

    Methods:
        forward(idx, targets=None):
            Performs the forward pass of the model, computing logits for the next token 
            and optionally calculating the cross-entropy loss.

            Args:
                idx (torch.Tensor): Tensor of shape (B, T) containing input token indices, 
                    where B is the batch size and T is the sequence length.
                targets (torch.Tensor, optional): Tensor of shape (B, T) containing target 
                    token indices for loss computation. Default is None.

            Returns:
                Tuple[torch.Tensor, torch.Tensor or None]:
                    - logits (torch.Tensor): Tensor of shape (B, T, vocab_size) containing 
                      predicted logits for the next token.
                    - loss (torch.Tensor or None): Scalar tensor representing the cross-entropy 
                      loss if `targets` is provided, otherwise None.

        generate(idx, max_new_tokens):
            Generates a sequence of tokens by sampling from the model's predictions.

            Args:
                idx (torch.Tensor): Tensor of shape (B, T) containing the initial context 
                    (sequence of token indices).
                max_new_tokens (int): Number of new tokens to generate.

            Returns:
                torch.Tensor: Tensor of shape (B, T + max_new_tokens) containing the initial 
                context concatenated with the generated tokens.

    Examples:
        >>> vocab_size = 100
        >>> model = BigramLanguageModel(vocab_size)
        >>> idx = torch.tensor([[1, 2, 3]])
        >>> logits, loss = model(idx, targets=torch.tensor([[2, 3, 4]]))
        >>> generated_sequence = model.generate(idx, max_new_tokens=2000)
    """
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

        
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

In [8]:
model = BigramLanguageModel(vocab_size)
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

0.001681 M parameters


### Create a PyTorch optimizer for updating the model's parameter's during training
AdamW is a variant of the Adam optimizer that includes decoupled weight decay, making it better suited for modern deep learning models like transformers.
Key features:
Combines adaptive learning rates (like Adam) with the L2 regularization benefits of weight decay.
Helps prevent overfitting and stabilizes training by penalizing large weights.

In [9]:
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

In [10]:
# Initialize lists to store losses
train_losses = []
val_losses = []


for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()

        train_loss = losses['train']
        val_loss = losses['val']
        
        # Store losses
        train_losses.append(train_loss)
        val_losses.append(val_loss)
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated_text = decode(m.generate(context, max_new_tokens=2000)[0].tolist())
avg_train_loss = np.mean(train_losses)
avg_val_loss = np.mean(val_losses)
print(f"\nTraining completed.")
print(f"Average training loss: {avg_train_loss:.4f}")
print(f"Average validation loss: {avg_val_loss:.4f}")
print(generated_text)

step 0: train loss 3.7144, val loss 3.7144
step 100: train loss 3.5966, val loss 3.5968
step 200: train loss 3.4881, val loss 3.4882
step 300: train loss 3.3879, val loss 3.3888
step 400: train loss 3.2970, val loss 3.2976
step 500: train loss 3.2133, val loss 3.2137
step 600: train loss 3.1374, val loss 3.1377
step 700: train loss 3.0673, val loss 3.0676
step 800: train loss 3.0045, val loss 3.0047
step 900: train loss 2.9477, val loss 2.9475
step 1000: train loss 2.8961, val loss 2.8960
step 1100: train loss 2.8501, val loss 2.8489
step 1200: train loss 2.8076, val loss 2.8065
step 1300: train loss 2.7710, val loss 2.7692
step 1400: train loss 2.7357, val loss 2.7348
step 1500: train loss 2.7042, val loss 2.7033
step 1600: train loss 2.6782, val loss 2.6760
step 1700: train loss 2.6546, val loss 2.6520
step 1800: train loss 2.6320, val loss 2.6294
step 1900: train loss 2.6111, val loss 2.6089
step 2000: train loss 2.5963, val loss 2.5913
step 2100: train loss 2.5815, val loss 2.5766


In [11]:
# Save the text to a file
with open('milestone2.txt', 'w', encoding='utf-8') as f:
    f.write(generated_text)


## Milestone 3: Self-attention & Softmax Iteration

Description: This milestone introduces self-attention, allowing the model to consider the entire input sequence when making predictions.

How it works: The model computes attention scores between all pairs of tokens in the input sequence, allowing it to capture dependencies between distant tokens.

Code changes:
- Implementation of self-attention mechanism
- Addition of query, key, and value projections
- Softmax applied to attention scores

Metrics: Potentially lower loss due to the model's increased ability to capture context.

In [None]:
# Define the Self-Attention Head
class Head(nn.Module):
    """
    A self-attention head that computes attention weights and applies them to the input.

    This class is a component of a larger transformer model, responsible for computing self-attention over the input sequence.

    Attributes:
        key (nn.Linear): Linear layer to transform input into key vectors.
        query (nn.Linear): Linear layer to transform input into query vectors.
        value (nn.Linear): Linear layer to transform input into value vectors.
        tril (torch.Tensor): A triangular matrix used for masking to prevent future tokens from attending to past tokens.

    Methods:
        __init__(head_size): Initializes the self-attention head with the given head size.
        forward(x): Performs the forward pass through the self-attention mechanism.

    Examples:
        >>> head = Head(head_size=64)
        >>> x = torch.randn(1, 10, 128)  # Example input sequence
        >>> output = head(x)
    """
    def __init__(self, head_size):
        """
        Initializes the self-attention head.

        Args:
            head_size (int): The size of the key, query, and value vectors.
        """
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        """
        Performs the forward pass through the self-attention mechanism.

        Args:
            x (torch.Tensor): Input sequence. Shape: (B, T, C)

        Returns:
            out (torch.Tensor): The output of the self-attention mechanism. Shape: (B, T, C)
        """
        B, T, C = x.shape
        k = self.key(x)  # (B, T, C)
        q = self.query(x)  # ( B, T, C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # (B, T, T)
        wei = F.softmax(wei, dim=-1)  # (B, T, T)
        # perform the weighted aggregation of the values
        v = self.value(x)  # (B, T, C)
        out = wei @ v  # (B, T, C)
        return out

In [None]:
# Update the Bigram Language Model to include the Self-Attention Head

class BigramLanguageModelWithAttention(nn.Module):
    """
    A language model that extends the basic Bigram model by incorporating self-attention mechanisms.

    This model uses token embeddings, position embeddings, and a self-attention head to capture contextual relationships between tokens.
    It is designed to predict the next token in a sequence based on the context provided by the previous tokens.

    Attributes:
        vocab_size (int): The size of the vocabulary.
        n_embd (int): The dimensionality of the embeddings.
        block_size (int): The maximum sequence length that the model can handle.
        device (torch.device): The device on which the model is run.

    Methods:
        __init__(vocab_size): Initializes the model with the given vocabulary size.
        _init_weights(module): Initializes the weights of the model using a normal distribution.
        forward(idx, targets=None): Forward pass through the model to generate logits and optionally compute the loss.
        generate(idx, max_new_tokens): Generates new tokens based on the given context.

    Examples:
        >>> model = BigramLanguageModelWithAttention(vocab_size=10000)
        >>> idx = torch.randint(0, 10000, (1, 10))  # Example input sequence
        >>> logits, loss = model.forward(idx)
        >>> generated_sequence = model.generate(idx, max_new_tokens=20)
    """

    def __init__(self, vocab_size):
        """
        Initializes the BigramLanguageModelWithAttention.

        Args:
            vocab_size (int): The size of the vocabulary.
        """
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.sa_head = Head(n_embd)  # self-attention head
        self.lm_head = nn.Linear(n_embd, vocab_size)

        
        self.apply(self._init_weights)

    def _init_weights(self, module):
        """
        Initializes the weights of the model using a normal distribution.

        Args:
            module (nn.Module): The module to initialize.
        """
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        """
        Performs a forward pass through the model to generate logits and optionally compute the loss.

        Args:
            idx (torch.Tensor): Input sequence of token indices. Shape: (B, T)
            targets (torch.Tensor, optional): Target sequence for computing loss. Shape: (B, T). Defaults to None.

        Returns:
            logits (torch.Tensor): Logits for the next token in the sequence. Shape: (B, T, vocab_size)
            loss (torch.Tensor or None): Cross-entropy loss if targets are provided, otherwise None.
        """
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)  # (B, T, C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))  # (T, C)
        x = tok_emb + pos_emb  # (B, T, C)
        x = self.sa_head(x)  # apply one head of self-attention
        logits = self.lm_head(x)  # (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        """
        Generates new tokens based on the given context.

        Args:
            idx (torch.Tensor): Initial context sequence of token indices. Shape: (B, T)
            max_new_tokens (int): The maximum number of new tokens to generate.

        Returns:
            idx (torch.Tensor): The extended sequence with new tokens. Shape: (B, T + max_new_tokens)
        """
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # New Forward Pass
            logits, _ = self(idx_cond)
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :]  # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx

In [14]:
# Create the model and move it to the device
model = BigramLanguageModelWithAttention(vocab_size)
m = model.to(device)

#Create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

In [15]:
# Initialize lists to store losses
train_losses = []
val_losses = []


for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()

        train_loss = losses['train']
        val_loss = losses['val']
        
        # Store losses
        train_losses.append(train_loss)
        val_losses.append(val_loss)
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    #Sample a batch of data
    xb, yb = get_batch('train')

    #Evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

#Generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated_text = decode(m.generate(context, max_new_tokens=2000)[0].tolist())
avg_train_loss = np.mean(train_losses)
avg_val_loss = np.mean(val_losses)
print(f"\nTraining completed.")
print(f"Average training loss: {avg_train_loss:.4f}")
print(f"Average validation loss: {avg_val_loss:.4f}")
print(generated_text)

#Save the generated text to a file
with open('milestone3.txt', 'w', encoding='utf-8') as f:
    f.write(generated_text)
		

step 0: train loss 3.7135, val loss 3.7135
step 100: train loss 2.9268, val loss 2.9213
step 200: train loss 2.8600, val loss 2.8560
step 300: train loss 2.7558, val loss 2.7559
step 400: train loss 2.7301, val loss 2.7247
step 500: train loss 2.6967, val loss 2.6891
step 600: train loss 2.6297, val loss 2.6304
step 700: train loss 2.5942, val loss 2.5912
step 800: train loss 2.5697, val loss 2.5658
step 900: train loss 2.5377, val loss 2.5330
step 1000: train loss 2.5142, val loss 2.5120
step 1100: train loss 2.4849, val loss 2.4833
step 1200: train loss 2.4725, val loss 2.4701
step 1300: train loss 2.4646, val loss 2.4621
step 1400: train loss 2.4553, val loss 2.4552
step 1500: train loss 2.4499, val loss 2.4453
step 1600: train loss 2.4421, val loss 2.4367
step 1700: train loss 2.4394, val loss 2.4353
step 1800: train loss 2.4337, val loss 2.4325
step 1900: train loss 2.4348, val loss 2.4317
step 2000: train loss 2.4331, val loss 2.4278
step 2100: train loss 2.4311, val loss 2.4270


## Milestone 4: Multi-head Attention

Description: This milestone extends self-attention to multi-head attention, allowing the model to capture different types of relationships between tokens.

How it works: The model computes multiple sets of attention (heads) in parallel, then combines their outputs.

Code changes:
- Implementation of multiple attention heads
- Concatenation and projection of multiple head outputs

Metrics: Possible further reduction in loss; may see improved performance on tasks requiring different types of attention.

In [None]:
# Define the Self-Attention Head
class Head(nn.Module):
    """
    A self-attention head that computes attention weights and applies them to the input.

    This class is a component of a larger transformer model, responsible for computing self-attention over the input sequence.

    Attributes:
        key (nn.Linear): Linear layer to transform input into key vectors.
        query (nn.Linear): Linear layer to transform input into query vectors.
        value (nn.Linear): Linear layer to transform input into value vectors.
        tril (torch.Tensor): A triangular matrix used for masking to prevent future tokens from attending to past tokens.

    Methods:
        __init__(head_size): Initializes the self-attention head with the given head size.
        forward(x): Performs the forward pass through the self-attention mechanism.

    Examples:
        >>> head = Head(head_size=64)
        >>> x = torch.randn(1, 10, 128)  # Example input sequence
        >>> output = head(x)
    """
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)  # (B, T, C)
        q = self.query(x)  # (B, T, C)
        # Compute attention scores ("affinities")
        wei = q @ k.transpose(-2, -1) * C ** -0.5  # (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # (B, T, T)
        wei = F.softmax(wei, dim=-1)  # (B, T, T)
        # Perform the weighted aggregation of the values
        v = self.value(x)  # (B, T, C)
        out = wei @ v  # (B, T, C)
        return out

# Define the Multi-head Attention
class MultiHeadAttention(nn.Module):
    """
    A multi-head attention mechanism that combines the outputs of multiple self-attention heads.

    This class is used to enhance the model's ability to capture different aspects of the input data by using multiple attention heads.

    Attributes:
        heads (nn.ModuleList): A list of self-attention heads.
        proj (nn.Linear): A linear layer to project the concatenated outputs of the heads back to the original embedding size.
        # dropout (nn.Dropout): Optional dropout layer (currently commented out).

    Methods:
        __init__(num_heads, head_size): Initializes the multi-head attention mechanism with the given number of heads and head size.
        forward(x): Performs the forward pass through the multi-head attention mechanism.

    Examples:
        >>> mha = MultiHeadAttention(num_heads=8, head_size=64)
        >>> x = torch.randn(1, 10, 128)  # Example input sequence
        >>> output = mha(x)
    """
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        # self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        # out = self.dropout(self.proj(out))
        return out

In [None]:

# Update the Bigram Language Model to include Multi-head attention
class BigramLanguageModelWithMultiHeadAttention(nn.Module):
    """
    A language model that extends the basic Bigram model by incorporating multi-head attention and multiple layers.

    This model uses token embeddings, position embeddings, and a stack of multi-head attention blocks to capture complex contextual relationships between tokens.
    It is designed to predict the next token in a sequence based on the context provided by the previous tokens.

    Attributes:
        token_embedding_table (nn.Embedding): Embedding layer to map tokens to embeddings.
        position_embedding_table (nn.Embedding): Embedding layer to add positional information.
        blocks (nn.Sequential): A sequence of multi-head attention blocks.
        ln_f (nn.LayerNorm): Final layer normalization layer.
        lm_head (nn.Linear): Linear layer to transform embeddings to logits.
        block_size (int): The maximum sequence length that the model can handle.

    Methods:
        __init__(vocab_size, n_embd, block_size, n_head, n_layer): Initializes the model with the given parameters.
        _init_weights(module): Initializes the weights of the model using a normal distribution.
        forward(idx, targets=None): Performs a forward pass through the model to generate logits and optionally compute the loss.
        generate(idx, max_new_tokens): Generates new tokens based on the given context.

    Examples:
        >>> model = BigramLanguageModelWithMultiHeadAttention(vocab_size=10000, n_embd=128, block_size=10, n_head=8, n_layer=2)
        >>> idx = torch.randint(0, 10000, (1, 10))  # Example input sequence
        >>> logits, loss = model.forward(idx)
        >>> generated_sequence = model.generate(idx, max_new_tokens=20)
    """
    def __init__(self, vocab_size, n_embd, block_size, n_head, n_layer):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[MultiHeadAttention(n_head, n_embd // n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)  # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)
        self.block_size = block_size

        
        self.apply(self._init_weights)


    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)  # (B, T, C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))  # (T, C)
        x = tok_emb + pos_emb  # (B, T, C)
        x = self.blocks(x)  # apply one multi-head attention block
        x = self.ln_f(x)  # (B, T, C)
        logits = self.lm_head(x)  # (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -self.block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :]  # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx
    


In [18]:
# Create the model and move it to the device
# model = BigramLanguageModelWithMultiHeadAttention(vocab_size, n_embd, block_size, n_head, n_layer, dropout)
model = BigramLanguageModelWithMultiHeadAttention(vocab_size, n_embd, block_size, n_head, n_layer)
m = model.to(device)

# Create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

In [19]:
# Initialize lists to store losses
train_losses = []
val_losses = []


for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()

        train_loss = losses['train']
        val_loss = losses['val']
        
        # Store losses
        train_losses.append(train_loss)
        val_losses.append(val_loss)
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # Sample a batch of data
    xb, yb = get_batch('train')

    # Evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

#Generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated_text = decode(m.generate(context, max_new_tokens=2000)[0].tolist())
avg_train_loss = np.mean(train_losses)
avg_val_loss = np.mean(val_losses)
print(f"\nTraining completed.")
print(f"Average training loss: {avg_train_loss:.4f}")
print(f"Average validation loss: {avg_val_loss:.4f}")
print(generated_text)

# Save the generated text to a file
with open('milestone4.txt', 'w', encoding='utf-8') as f:
    f.write(generated_text)

step 0: train loss 3.7136, val loss 3.7136
step 100: train loss 2.9258, val loss 2.9219
step 200: train loss 2.9264, val loss 2.9214
step 300: train loss 2.9270, val loss 2.9219
step 400: train loss 2.9281, val loss 2.9226
step 500: train loss 2.9266, val loss 2.9259
step 600: train loss 2.9256, val loss 2.9215
step 700: train loss 2.9255, val loss 2.9220
step 800: train loss 2.9259, val loss 2.9237
step 900: train loss 2.9300, val loss 2.9294
step 1000: train loss 2.9273, val loss 2.9234
step 1100: train loss 2.9321, val loss 2.9297
step 1200: train loss 2.9266, val loss 2.9323
step 1300: train loss 2.9296, val loss 2.9273
step 1400: train loss 2.9285, val loss 2.9260
step 1500: train loss 3.0335, val loss 3.0347
step 1600: train loss 2.9328, val loss 2.9279
step 1700: train loss 2.9248, val loss 2.9230
step 1800: train loss 2.9278, val loss 2.9264
step 1900: train loss 2.9281, val loss 2.9232
step 2000: train loss 2.9269, val loss 2.9221
step 2100: train loss 2.9256, val loss 2.9226


## Milestone 5: Feed Forward Layers

Description: This milestone adds feed-forward layers after the attention mechanism, increasing the model's capacity to process information.

How it works: After attention, the output passes through a feed-forward neural network, typically with one hidden layer and ReLU activation.

Code changes:
- Addition of feed-forward layers after attention
- Typically includes a dimensionality increase followed by a decrease

Metrics: Potential for lower loss and better generalization due to increased model capacity.

In [None]:
# Define the Feed Forward Layer
class FeedForward(nn.Module):
    """
    A feed-forward network (FFN) layer, often used in transformer models to transform the output of self-attention mechanisms.

    This layer consists of two linear layers with a ReLU activation function in between, which helps in transforming the input embeddings.

    Attributes:
        net (nn.Sequential): A sequential module containing the feed-forward network layers.

    Methods:
        __init__(n_embd): Initializes the feed-forward layer with the given embedding size.
        forward(x): Performs the forward pass through the feed-forward network.

    Examples:
        >>> ffn = FeedForward(n_embd=128)
        >>> x = torch.randn(1, 10, 128)  # Example input sequence
        >>> output = ffn(x)
    """
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)

In [None]:
# Define the Transformer Block
class Block(nn.Module):
    """
    A transformer block that combines self-attention and feed-forward network (FFN) layers.

    This block is a fundamental component of transformer models, allowing the model to capture both contextual relationships and complex transformations of the input data.

    Attributes:
        sa (MultiHeadAttention): Multi-head attention layer.
        ffwd (FeedForward): Feed-forward network layer.

    Methods:
        __init__(n_embd, n_head): Initializes the transformer block with the given embedding size and number of heads.
        forward(x): Performs the forward pass through the transformer block.

    Examples:
        >>> block = Block(n_embd=128, n_head=8)
        >>> x = torch.randn(1, 10, 128)  # Example input sequence
        >>> output = block(x)
    """
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        
    def forward(self, x):
        x = self.sa(x)
        x = self.ffwd(x)
        return x

In [None]:
# Update the Bigram Language Model to include Feed Forward Layers
class BigramLanguageModelWithFeedForward(nn.Module):
    """
    A language model that extends the basic Bigram model by incorporating transformer blocks with self-attention and feed-forward layers.

    This model uses token embeddings, position embeddings, and a stack of transformer blocks to capture complex contextual relationships between tokens.
    It is designed to predict the next token in a sequence based on the context provided by the previous tokens.

    Attributes:
        token_embedding_table (nn.Embedding): Embedding layer to map tokens to embeddings.
        position_embedding_table (nn.Embedding): Embedding layer to add positional information.
        blocks (nn.Sequential): A sequence of transformer blocks.
        lm_head (nn.Linear): Linear layer to transform embeddings to logits.
        block_size (int): The maximum sequence length that the model can handle.

    Methods:
        __init__(vocab_size, n_embd, block_size, n_head, n_layer): Initializes the model with the given parameters.
        _init_weights(module): Initializes the weights of the model using a normal distribution.
        forward(idx, targets=None): Performs a forward pass through the model to generate logits and optionally compute the loss.
        generate(idx, max_new_tokens): Generates new tokens based on the given context.

    Examples:
        >>> model = BigramLanguageModelWithFeedForward(vocab_size=10000, n_embd=128, block_size=10, n_head=8, n_layer=2)
        >>> idx = torch.randint(0, 10000, (1, 10))  # Example input sequence
        >>> logits, loss = model.forward(idx)
        >>> generated_sequence = model.generate(idx, max_new_tokens=20)
    """
    def __init__(self, vocab_size, n_embd, block_size, n_head, n_layer):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.lm_head = nn.Linear(n_embd, vocab_size)
        self.block_size = block_size

        
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)  # (B, T, C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))  # (T, C)
        x = tok_emb + pos_emb  # (B, T, C)
        x = self.blocks(x)  # apply transformer blocks
        # x = self.ln_f(x)  # (B, T, C)
        logits = self.lm_head(x)  # (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -self.block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :]  # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx

In [23]:
# model = BigramLanguageModelWithFeedForward(vocab_size, n_embd, block_size, n_head, n_layer, dropout)
model = BigramLanguageModelWithFeedForward(vocab_size, n_embd, block_size, n_head, n_layer)
m = model.to(device)

# Create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

In [24]:
# Initialize lists to store losses
train_losses = []
val_losses = []


for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()

        train_loss = losses['train']
        val_loss = losses['val']
        
        # Store losses
        train_losses.append(train_loss)
        val_losses.append(val_loss)
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # Sample a batch of data
    xb, yb = get_batch('train')

    # Evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

#Generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated_text = decode(m.generate(context, max_new_tokens=2000)[0].tolist())
avg_train_loss = np.mean(train_losses)
avg_val_loss = np.mean(val_losses)
print(f"\nTraining completed.")
print(f"Average training loss: {avg_train_loss:.4f}")
print(f"Average validation loss: {avg_val_loss:.4f}")
print(generated_text)


# Save the generated text to a file
with open('milestone5.txt', 'w', encoding='utf-8') as f:
    f.write(generated_text)

step 0: train loss 3.7136, val loss 3.7136
step 100: train loss 2.9260, val loss 2.9204
step 200: train loss 2.9260, val loss 2.9214
step 300: train loss 2.9277, val loss 2.9230
step 400: train loss 2.9251, val loss 2.9246
step 500: train loss 2.9256, val loss 2.9207
step 600: train loss 2.9257, val loss 2.9201
step 700: train loss 2.9248, val loss 2.9207
step 800: train loss 2.9266, val loss 2.9215
step 900: train loss 2.9252, val loss 2.9199
step 1000: train loss 2.9247, val loss 2.9201
step 1100: train loss 2.9247, val loss 2.9210
step 1200: train loss 2.9254, val loss 2.9237
step 1300: train loss 2.9257, val loss 2.9211
step 1400: train loss 2.9284, val loss 2.9191
step 1500: train loss 2.9285, val loss 2.9215
step 1600: train loss 2.9235, val loss 2.9222
step 1700: train loss 2.9259, val loss 2.9207
step 1800: train loss 2.9266, val loss 2.9207
step 1900: train loss 2.9246, val loss 2.9226
step 2000: train loss 2.9272, val loss 2.9234
step 2100: train loss 2.9261, val loss 2.9209


## Milestone 6: Residual Connections

Description: This milestone introduces residual connections, allowing the model to bypass the attention and feed-forward layers when necessary.

How it works: The input to each sub-layer is added to its output, creating a residual connection.

Code changes:
- Addition of residual connections around attention and feed-forward layers

Metrics: May see more stable training and potentially better performance, especially for deeper models.

In [None]:
# Define the Feed Forward Layer
class FeedForward(nn.Module):
    """
    A feed-forward network (FFN) layer, often used in transformer models to transform the output of self-attention mechanisms.

    This layer consists of two linear layers with a ReLU activation function in between, which helps in transforming the input embeddings.

    Attributes:
        net (nn.Sequential): A sequential module containing the feed-forward network layers.

    Methods:
        __init__(n_embd): Initializes the feed-forward layer with the given embedding size.
        forward(x): Performs the forward pass through the feed-forward network.

    Examples:
        >>> ffn = FeedForward(n_embd=128)
        >>> x = torch.randn(1, 10, 128)  # Example input sequence
        >>> output = ffn(x)
    """
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            # nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)

In [None]:
# Define the Transformer Block with Residual Connections
class Block(nn.Module):
    """
    A transformer block that combines self-attention and feed-forward network (FFN) layers with residual connections.

    This block is a fundamental component of transformer models, allowing the model to capture both contextual relationships and complex transformations of the input data while leveraging residual connections for better training stability.

    Attributes:
        sa (MultiHeadAttention): Multi-head attention layer.
        ffwd (FeedForward): Feed-forward network layer.

    Methods:
        __init__(n_embd, n_head): Initializes the transformer block with the given embedding size and number of heads.
        forward(x): Performs the forward pass through the transformer block.

    Examples:
        >>> block = Block(n_embd=128, n_head=8)
        >>> x = torch.randn(1, 10, 128)  # Example input sequence
        >>> output = block(x)
    """
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        x = x + self.sa(x)  # Residual connection
        x = x + self.ffwd(x)  # Residual connection
        return x

In [None]:
# Update the Bigram Language Model to include Residual Connections
class BigramLanguageModelWithResidualConnections(nn.Module):
    """
    A language model that extends the basic Bigram model by incorporating transformer blocks with residual connections.

    This model uses token embeddings, position embeddings, and a stack of transformer blocks to capture complex contextual relationships between tokens.
    It is designed to predict the next token in a sequence based on the context provided by the previous tokens, leveraging residual connections for better training stability.

    Attributes:
        token_embedding_table (nn.Embedding): Embedding layer to map tokens to embeddings.
        position_embedding_table (nn.Embedding): Embedding layer to add positional information.
        blocks (nn.Sequential): A sequence of transformer blocks with residual connections.
        lm_head (nn.Linear): Linear layer to transform embeddings to logits.
        block_size (int): The maximum sequence length that the model can handle.

    Methods:
        __init__(vocab_size, n_embd, block_size, n_head, n_layer): Initializes the model with the given parameters.
        _init_weights(module): Initializes the weights of the model using a normal distribution.
        forward(idx, targets=None): Performs a forward pass through the model to generate logits and optionally compute the loss.
        generate(idx, max_new_tokens): Generates new tokens based on the given context.

    Examples:
        >>> model = BigramLanguageModelWithResidualConnections(vocab_size=10000, n_embd=128, block_size=10, n_head=8, n_layer=2)
        >>> idx = torch.randint(0, 10000, (1, 10))  # Example input sequence
        >>> logits, loss = model.forward(idx)
        >>> generated_sequence = model.generate(idx, max_new_tokens=20)
    """
    def __init__(self, vocab_size, n_embd, block_size, n_head, n_layer):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head) for _ in range(n_layer)])
        # self.ln_f = nn.LayerNorm(n_embd)  # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)
        self.block_size = block_size

        
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)  # (B, T, C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))  # (T, C)
        x = tok_emb + pos_emb  # (B, T, C)
        x = self.blocks(x)  # apply transformer blocks
        # x = self.ln_f(x)  # (B, T, C)
        logits = self.lm_head(x)  # (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -self.block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :]  # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx

In [28]:
# model = BigramLanguageModelWithResidualConnections(vocab_size, n_embd, block_size, n_head, n_layer, dropout)
model = BigramLanguageModelWithResidualConnections(vocab_size, n_embd, block_size, n_head, n_layer)
m = model.to(device)

# Create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

In [29]:
# Initialize lists to store losses
train_losses = []
val_losses = []


for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()

        train_loss = losses['train']
        val_loss = losses['val']
        
        # Store losses
        train_losses.append(train_loss)
        val_losses.append(val_loss)
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # Sample a batch of data
    xb, yb = get_batch('train')

    # Evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

#Generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated_text = decode(m.generate(context, max_new_tokens=2000)[0].tolist())
avg_train_loss = np.mean(train_losses)
avg_val_loss = np.mean(val_losses)
print(f"\nTraining completed.")
print(f"Average training loss: {avg_train_loss:.4f}")
print(f"Average validation loss: {avg_val_loss:.4f}")
print(generated_text)


# Save the generated text to a file
with open('milestone6.txt', 'w', encoding='utf-8') as f:
    f.write(generated_text)

step 0: train loss 3.7140, val loss 3.7141
step 100: train loss 2.5140, val loss 2.5104
step 200: train loss 2.4340, val loss 2.4137
step 300: train loss 2.3875, val loss 2.3631
step 400: train loss 2.2791, val loss 2.2545
step 500: train loss 2.1404, val loss 2.1249
step 600: train loss 2.0039, val loss 1.9973
step 700: train loss 1.9341, val loss 1.9275
step 800: train loss 1.8453, val loss 1.8490
step 900: train loss 1.8029, val loss 1.8076
step 1000: train loss 1.7380, val loss 1.7530
step 1100: train loss 1.7095, val loss 1.7229
step 1200: train loss 1.6724, val loss 1.6849
step 1300: train loss 1.6567, val loss 1.6737
step 1400: train loss 1.6268, val loss 1.6445
step 1500: train loss 1.6148, val loss 1.6338
step 1600: train loss 1.5809, val loss 1.6093
step 1700: train loss 1.5614, val loss 1.5859
step 1800: train loss 1.5536, val loss 1.5672
step 1900: train loss 1.5477, val loss 1.5682
step 2000: train loss 1.5348, val loss 1.5483
step 2100: train loss 1.5112, val loss 1.5375


## Milestone 7: Layer Normalization

Description: This milestone adds layer normalization, which helps stabilize the activations in each layer.

How it works: Layer normalization normalizes the inputs across the features, reducing internal covariate shift.

Code changes:
- Addition of layer normalization after attention and feed-forward layers

Metrics: Often results in more stable training, potentially faster convergence, and sometimes better final performance.

In [None]:
# Define the Feed Forward Layer
class FeedForward(nn.Module):
    """
    A feed-forward network (FFN) layer, often used in transformer models to transform the output of self-attention mechanisms.

    This layer consists of two linear layers with a ReLU activation function in between, which helps in transforming the input embeddings.

    Attributes:
        net (nn.Sequential): A sequential module containing the feed-forward network layers.

    Methods:
        __init__(n_embd): Initializes the feed-forward layer with the given embedding size.
        forward(x): Performs the forward pass through the feed-forward network.

    Examples:
        >>> ffn = FeedForward(n_embd=128)
        >>> x = torch.randn(1, 10, 128)  # Example input sequence
        >>> output = ffn(x)
    """
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)

In [None]:
class Block(nn.Module):
    """
    A transformer block that combines self-attention and feed-forward network (FFN) layers with residual connections and layer normalization.

    This block is a fundamental component of transformer models, allowing the model to capture both contextual relationships and complex transformations of the input data while leveraging residual connections and layer normalization for better training stability.

    Attributes:
        sa (MultiHeadAttention): Multi-head attention layer.
        ffwd (FeedForward): Feed-forward network layer.
        ln1 (nn.LayerNorm): Layer normalization layer for the output of self-attention.
        ln2 (nn.LayerNorm): Layer normalization layer for the input to the feed-forward network.

    Methods:
        __init__(n_embd, n_head): Initializes the transformer block with the given embedding size and number of heads.
        forward(x): Performs the forward pass through the transformer block.

    Examples:
        >>> block = Block(n_embd=128, n_head=8)
        >>> x = torch.randn(1, 10, 128)  # Example input sequence
        >>> output = block(x)
    """
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

In [None]:
# Update the Bigram Language Model to include Layer Normalization
class BigramLanguageModelWithLayerNorm(nn.Module):
    """
    A language model that extends the basic Bigram model by incorporating transformer blocks with self-attention, feed-forward networks, residual connections and layer normalization.

    This model uses token embeddings, position embeddings, and a stack of transformer blocks to capture complex contextual relationships between tokens.
    It is designed to predict the next token in a sequence based on the context provided by the previous tokens, leveraging layer normalization for better training stability.

    Attributes:
        token_embedding_table (nn.Embedding): Embedding layer to map tokens to embeddings.
        position_embedding_table (nn.Embedding): Embedding layer to add positional information.
        blocks (nn.Sequential): A sequence of transformer blocks.
        ln_f (nn.LayerNorm): Final layer normalization layer.
        lm_head (nn.Linear): Linear layer to transform embeddings to logits.
        block_size (int): The maximum sequence length that the model can handle.

    Methods:
        __init__(vocab_size, n_embd, block_size, n_head, n_layer): Initializes the model with the given parameters.
        _init_weights(module): Initializes the weights of the model using a normal distribution.
        forward(idx, targets=None): Performs a forward pass through the model to generate logits and optionally compute the loss.
        generate(idx, max_new_tokens): Generates new tokens based on the given context.

    Examples:
        >>> model = BigramLanguageModelWithLayerNorm(vocab_size=10000, n_embd=128, block_size=10, n_head=8, n_layer=2)
        >>> idx = torch.randint(0, 10000, (1, 10))  # Example input sequence
        >>> logits, loss = model.forward(idx)
        >>> generated_sequence = model.generate(idx, max_new_tokens=20)
    """

    def __init__(self, vocab_size, n_embd, block_size, n_head, n_layer):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)  # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)
        self.block_size = block_size

        
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)  # (B, T, C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))  # (T, C)
        x = tok_emb + pos_emb  # (B, T, C)
        x = self.blocks(x)  # apply transformer blocks
        x = self.ln_f(x)  # (B, T, C)
        logits = self.lm_head(x)  # (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -self.block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :]  # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx

In [33]:
# Create the model and move it to the device
model = BigramLanguageModelWithLayerNorm(vocab_size, n_embd, block_size, n_head, n_layer)
m = model.to(device)

# Create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

In [34]:
# Initialize lists to store losses
train_losses = []
val_losses = []


for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()

        train_loss = losses['train']
        val_loss = losses['val']
        
        # Store losses
        train_losses.append(train_loss)
        val_losses.append(val_loss)
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # Sample a batch of data
    xb, yb = get_batch('train')

    # Evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

#Generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated_text = decode(m.generate(context, max_new_tokens=2000)[0].tolist())
avg_train_loss = np.mean(train_losses)
avg_val_loss = np.mean(val_losses)
print(f"\nTraining completed.")
print(f"Average training loss: {avg_train_loss:.4f}")
print(f"Average validation loss: {avg_val_loss:.4f}")
print(generated_text)


# Save the generated text to a file
with open('milestone7.txt', 'w', encoding='utf-8') as f:
    f.write(generated_text)

step 0: train loss 3.7858, val loss 3.7868
step 100: train loss 2.5384, val loss 2.5330
step 200: train loss 2.4261, val loss 2.4037
step 300: train loss 2.3466, val loss 2.3229
step 400: train loss 2.2468, val loss 2.2223
step 500: train loss 2.0959, val loss 2.0824
step 600: train loss 1.9576, val loss 1.9528
step 700: train loss 1.8419, val loss 1.8437
step 800: train loss 1.7604, val loss 1.7682
step 900: train loss 1.6934, val loss 1.7093
step 1000: train loss 1.6436, val loss 1.6661
step 1100: train loss 1.6022, val loss 1.6216
step 1200: train loss 1.5747, val loss 1.5972
step 1300: train loss 1.5462, val loss 1.5654
step 1400: train loss 1.5218, val loss 1.5446
step 1500: train loss 1.5007, val loss 1.5247
step 1600: train loss 1.4831, val loss 1.5001
step 1700: train loss 1.4705, val loss 1.4852
step 1800: train loss 1.4517, val loss 1.4756
step 1900: train loss 1.4425, val loss 1.4627
step 2000: train loss 1.4254, val loss 1.4508
step 2100: train loss 1.4179, val loss 1.4410


## Milestone 8: Dropout

Description: This milestone introduces dropout, a regularization technique to prevent overfitting.

How it works: During training, randomly "drops out" (sets to zero) a proportion of neurons, forcing the network to be more robust.

Code changes:
- Addition of dropout layers after attention and feed-forward layers
- Implementation of dropout in the forward pass

Metrics: May see higher training loss but improved validation loss, indicating better generalization. The gap between training and validation performance often narrows.

In [None]:
class Head1(nn.Module):
    """
    A single head of self-attention with dropout.

    This class is a component of a larger transformer model, responsible for computing self-attention over the input sequence.
    It includes a dropout layer to prevent overfitting.

    Attributes:
        key (nn.Linear): Linear layer to transform input into key vectors.
        query (nn.Linear): Linear layer to transform input into query vectors.
        value (nn.Linear): Linear layer to transform input into value vectors.
        tril (torch.Tensor): A triangular matrix used for masking to prevent future tokens from attending to past tokens.
        dropout (nn.Dropout): Dropout layer to prevent overfitting.

    Methods:
        __init__(head_size): Initializes the self-attention head with the given head size.
        forward(x): Performs the forward pass through the self-attention mechanism.

    Examples:
        >>> head = Head1(head_size=64, dropout=0.1)
        >>> x = torch.randn(1, 10, 128)  # Example input sequence
        >>> output = head(x)
    """
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # input of size (batch, time-step, channels)
        # output of size (batch, time-step, head size)
        B,T,C = x.shape
        k = self.key(x)   # (B,T,hs)
        q = self.query(x) # (B,T,hs)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,hs)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

In [None]:
# Define the Multi-head Attention
class MultiHeadAttention1(nn.Module):
    """
    A multi-head attention mechanism with dropout.

    This class combines the outputs of multiple self-attention heads and includes a dropout layer to prevent overfitting.

    Attributes:
        heads (nn.ModuleList): A list of self-attention heads.
        proj (nn.Linear): Linear layer to project the concatenated outputs back to the original embedding size.
        dropout (nn.Dropout): Dropout layer to prevent overfitting.

    Methods:
        __init__(num_heads, head_size, dropout): Initializes the multi-head attention mechanism with the given number of heads and head size.
        forward(x): Performs the forward pass through the multi-head attention mechanism.

    Examples:
        >>> mha = MultiHeadAttention1(num_heads=8, head_size=64, dropout=0.1)
        >>> x = torch.randn(1, 10, 128)  # Example input sequence
        >>> output = mha(x)
    """
    def __init__(self, num_heads, head_size,dropout):
        super().__init__()
        self.heads = nn.ModuleList([Head1(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(head_size * num_heads, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

In [None]:
# Define the Feed Forward Layer
class FeedForward(nn.Module):
    """
    A feed-forward network (FFN) layer with dropout.

    This layer consists of two linear layers with a ReLU activation function in between and includes a dropout layer to prevent overfitting.

    Attributes:
        net (nn.Sequential): A sequential module containing the feed-forward network layers.

    Methods:
        __init__(n_embd, dropout): Initializes the feed-forward layer with the given embedding size and dropout rate.
        forward(x): Performs the forward pass through the feed-forward network.

    Examples:
        >>> ffn = FeedForward(n_embd=128, dropout=0.1)
        >>> x = torch.randn(1, 10, 128)  # Example input sequence
        >>> output = ffn(x)
    """

    def __init__(self, n_embd, dropout):
        """
        Initializes the feed-forward layer.

        Args:
            n_embd (int): The dimensionality of the input and output embeddings.
            dropout (float): The dropout rate.
        """
    def __init__(self, n_embd, dropout):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)

In [None]:
# Define the Transformer Block with Residual Connections, Layer Normalization, and Dropout
class Block(nn.Module):
    """
    A transformer block that combines self-attention and feed-forward network (FFN) layers with residual connections, layer normalization, and dropout.

    This block is a fundamental component of transformer models, allowing the model to capture both contextual relationships and complex transformations of the input data while leveraging residual connections, layer normalization, and dropout for better training stability.

    Attributes:
        sa (MultiHeadAttention1): Multi-head attention layer.
        ffwd (FeedForward): Feed-forward network layer.
        ln1 (nn.LayerNorm): Layer normalization layer for the output of self-attention.
        ln2 (nn.LayerNorm): Layer normalization layer for the input to the feed-forward network.
        dropout (nn.Dropout): Dropout layer to prevent overfitting.

    Methods:
        __init__(n_embd, n_head, dropout): Initializes the transformer block with the given embedding size, number of heads, and dropout rate.
        forward(x): Performs the forward pass through the transformer block.

    Examples:
        >>> block = Block(n_embd=128, n_head=8, dropout=0.1)
        >>> x = torch.randn(1, 10, 128)  # Example input sequence
        >>> output = block(x)
    """
    def __init__(self, n_embd, n_head, dropout):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention1(n_head, head_size, dropout)
        self.ffwd = FeedForward(n_embd, dropout)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = x + self.dropout(self.sa(self.ln1(x)))  # Residual connection after self-attention
        x = x + self.dropout(self.ffwd(self.ln2(x)))  # Residual connection after feed-forward
        return x

In [None]:
# Update the Bigram Language Model to include Dropout
class BigramLanguageModelWithDropout(nn.Module):
    """
    A language model that extends the basic Bigram model by incorporating transformer blocks with self-attention, feed-forward networks, residual connections, layer normalization, and dropout.

    This model uses token embeddings, position embeddings, and a stack of transformer blocks to capture complex contextual relationships between tokens.
    It is designed to predict the next token in a sequence based on the context provided by the previous tokens, leveraging dropout to prevent overfitting.

    Attributes:
        token_embedding_table (nn.Embedding): Embedding layer to map tokens to embeddings.
        position_embedding_table (nn.Embedding): Embedding layer to add positional information.
        blocks (nn.Sequential): A sequence of transformer blocks with dropout.
        ln_f (nn.LayerNorm): Final layer normalization layer.
        lm_head (nn.Linear): Linear layer to transform embeddings to logits.
        block_size (int): The maximum sequence length that the model can handle.
        dropout (nn.Dropout): Dropout layer to prevent overfitting.

    Methods:
        __init__(vocab_size, n_embd, block_size, n_head, n_layer, dropout): Initializes the model with the given parameters.
        _init_weights(module): Initializes the weights of the model using a normal distribution.
        forward(idx, targets=None): Performs a forward pass through the model to generate logits and optionally compute the loss.
        generate(idx, max_new_tokens): Generates new tokens based on the given context.

    Examples:
        >>> model = BigramLanguageModelWithDropout(vocab_size=10000, n_embd=128, block_size=10, n_head=8, n_layer=2, dropout=0.1)
        >>> idx = torch.randint(0, 10000, (1, 10))  # Example input sequence
        >>> logits, loss = model.forward(idx)
        >>> generated_sequence = model.generate(idx, max_new_tokens=20)
    """

    def __init__(self, vocab_size, n_embd, block_size, n_head, n_layer, dropout):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head, dropout) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd)  # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)
        self.block_size = block_size
        self.dropout = nn.Dropout(dropout)

        
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)  # (B, T, C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))  # (T, C)
        x = tok_emb + pos_emb  # (B, T, C)
        x = self.blocks(x)  # apply transformer blocks
        x = self.ln_f(x)  # (B, T, C)
        logits = self.lm_head(x)  # (B, T, vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -self.block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :]  # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx

In [40]:
# Create the model and move it to the device
model = BigramLanguageModelWithDropout(vocab_size, n_embd, block_size, n_head, n_layer, dropout)
m = model.to(device)

# Create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

In [41]:
# Initialize lists to store losses
train_losses = []
val_losses = []


for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()

        train_loss = losses['train']
        val_loss = losses['val']
        
        # Store losses
        train_losses.append(train_loss)
        val_losses.append(val_loss)
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # Sample a batch of data
    xb, yb = get_batch('train')

    # Evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

#Generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
generated_text = decode(m.generate(context, max_new_tokens=2000)[0].tolist())
avg_train_loss = np.mean(train_losses)
avg_val_loss = np.mean(val_losses)
print(f"\nTraining completed.")
print(f"Average training loss: {avg_train_loss:.4f}")
print(f"Average validation loss: {avg_val_loss:.4f}")
print(generated_text)


# Save the generated text to a file
with open('milestone8.txt', 'w', encoding='utf-8') as f:
    f.write(generated_text)

step 0: train loss 3.7835, val loss 3.7824
step 100: train loss 2.4651, val loss 2.4493
step 200: train loss 2.3835, val loss 2.3565
step 300: train loss 2.2979, val loss 2.2785
step 400: train loss 2.1274, val loss 2.1079
step 500: train loss 1.9991, val loss 1.9943
step 600: train loss 1.9137, val loss 1.9118
step 700: train loss 1.8458, val loss 1.8490
step 800: train loss 1.7853, val loss 1.7883
step 900: train loss 1.7331, val loss 1.7451
step 1000: train loss 1.6900, val loss 1.6966
step 1100: train loss 1.6585, val loss 1.6763
step 1200: train loss 1.6261, val loss 1.6395
step 1300: train loss 1.6034, val loss 1.6136
step 1400: train loss 1.5661, val loss 1.5855
step 1500: train loss 1.5555, val loss 1.5777
step 1600: train loss 1.5334, val loss 1.5552
step 1700: train loss 1.5141, val loss 1.5337
step 1800: train loss 1.5083, val loss 1.5286
step 1900: train loss 1.4904, val loss 1.5077
step 2000: train loss 1.4776, val loss 1.4993
step 2100: train loss 1.4684, val loss 1.4877
