Name: Pallavi Dhakne
Roll No:BT20CSE201

# Task 1 : Building a basic GPT-2 Model

Start by implementing the `GPT2-small` model (with 125 million parameters) using Python and PyTorch. Make sure you touch upon the key aspects of the model like multi-head self-attention mechanism, feed-forward networks and positional encoding.

Key points:

- Follow the original GPT-2 design of using both token and positional embeddings.
- Implement the transformer layers with multi-head self-attention and point-wise feed-forward network.
- You're required to abstain from using pre-built transformer libraries.

Refer to the GPT-2 paper's architecture descriptions in Sections 1 and 2 for further help. ([GPT-2 paper](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)). Additionally, a great resource could be Andrej Karpathy’s [nanogpt](https://github.com/karpathy/nanoGPT) repository and the [makemore](https://youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&feature=shared) series.

To validate your implementation, load the original GPT-2 125M model checkpoints and run a sample prediction.

**Deliverable:** Complete Python code featuring the GPT-2 model along with demonstration of appropriate testing to verify its functioning.

First, set up the imports. PyTorch can be installed with:

pip install torch


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

**Positional Encoding:**

The positional encoding gives the model information about each token's position in the sequence. This implementation builds the encodings from fixed sine and cosine functions.

In [None]:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512):
        super(PositionalEncoding, self).__init__()
        # Build the table in a local tensor first: assigning self.encoding before
        # register_buffer raises a KeyError because the attribute already exists
        encoding = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-torch.log(torch.tensor(10000.0)) / d_model))
        encoding[:, 0::2] = torch.sin(position * div_term)
        encoding[:, 1::2] = torch.cos(position * div_term)
        # Shape (1, max_len, d_model) so it broadcasts over the batch dimension
        self.register_buffer('encoding', encoding.unsqueeze(0))

    def forward(self, x):
        return self.encoding[:, :x.size(1)].detach() + x
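
The original GPT-2 learns its position embeddings instead of using fixed sinusoids. A version closer to that design might look like the sketch below. The class name `GPT2Embeddings` is hypothetical, the attribute names `wte` and `wpe` follow the naming used in the released checkpoints, and the sizes in the demo are toy values chosen only for the shape check.

```python
import torch
import torch.nn as nn

class GPT2Embeddings(nn.Module):
    """Token embedding plus learned position embedding (hypothetical helper)."""
    def __init__(self, vocab_size, d_model, max_len=1024):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, d_model)  # token embeddings
        self.wpe = nn.Embedding(max_len, d_model)     # learned position embeddings

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) of integer token indices
        seq_len = token_ids.size(1)
        positions = torch.arange(seq_len, device=token_ids.device)
        # (batch, seq_len, d_model) + (seq_len, d_model) broadcasts over the batch
        return self.wte(token_ids) + self.wpe(positions)

emb = GPT2Embeddings(vocab_size=100, d_model=16, max_len=32)
tokens = torch.randint(0, 100, (2, 10))
out = emb(tokens)
print(out.shape)  # torch.Size([2, 10, 16])
```

Because both tables are `nn.Embedding` layers, they are trained together with the rest of the model.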

**MultiHead Attention:**

The multi-head self-attention mechanism uses linear layers to project the input embeddings into query, key, and value vectors. Scaled dot-product attention is applied to these vectors in each head, and the per-head outputs are concatenated and passed through a final linear layer.

In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.depth = d_model // num_heads

        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)

        self.output = nn.Linear(d_model, d_model)

    def split_heads(self, x, batch_size):
        x = x.view(batch_size, -1, self.num_heads, self.depth)
        return x.permute(0, 2, 1, 3)

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        q = self.split_heads(self.query(q), batch_size)
        k = self.split_heads(self.key(k), batch_size)
        v = self.split_heads(self.value(v), batch_size)

        scores = torch.matmul(q, k.transpose(-2, -1)) / torch.sqrt(torch.tensor(self.depth, dtype=torch.float32))

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attention = F.softmax(scores, dim=-1)
        output = torch.matmul(attention, v)
        output = output.permute(0, 2, 1, 3).contiguous().view(batch_size, -1, self.d_model)
        return self.output(output)
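
GPT-2 is an autoregressive model, so each position may attend only to itself and to earlier positions. The `mask` argument above can carry that constraint. The sketch below uses two hypothetical helpers, `causal_mask` and `masked_attention`, to build a lower-triangular mask and to check that changing the last token leaves earlier outputs untouched.

```python
import math
import torch
import torch.nn.functional as F

def causal_mask(seq_len):
    # Lower-triangular matrix: 1 where attention is allowed, 0 where it is blocked
    return torch.tril(torch.ones(seq_len, seq_len))

def masked_attention(q, k, v, mask):
    # q, k, v: (batch, heads, seq_len, depth)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    scores = scores.masked_fill(mask == 0, float('-inf'))
    return F.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
q, k, v = (torch.randn(1, 2, 5, 4) for _ in range(3))
mask = causal_mask(5)
out = masked_attention(q, k, v, mask)

# Changing the last token's key and value must not affect earlier positions
k2, v2 = k.clone(), v.clone()
k2[:, :, -1] += 1.0
v2[:, :, -1] += 1.0
out2 = masked_attention(q, k2, v2, mask)
print(torch.allclose(out[:, :, :-1], out2[:, :, :-1]))  # True
```

A `(seq_len, seq_len)` mask broadcasts over the batch and head dimensions, so the same tensor can be passed as `mask` to the attention module.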

**Feed Forward Networks:**

The position-wise feed-forward network applies two linear transformations with a non-linear activation in between, independently at every sequence position.

In [None]:
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForward, self).__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.linear_2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.linear_2(F.relu(self.linear_1(x)))
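
The block above uses ReLU, whereas the released GPT-2 code uses the tanh approximation of GELU. For loading the original checkpoints the activation has to match, so a variant along the following lines may be useful. `gelu_tanh` and `GELUFeedForward` are hypothetical names, and the sizes in the demo are toy values.

```python
import math
import torch
import torch.nn as nn

def gelu_tanh(x):
    # Tanh approximation of GELU
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

class GELUFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.linear_2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # Expand to d_ff, apply GELU, project back to d_model
        return self.linear_2(gelu_tanh(self.linear_1(x)))

ff = GELUFeedForward(d_model=16, d_ff=64)
y = ff(torch.randn(2, 5, 16))
print(y.shape)  # torch.Size([2, 5, 16])
```

In GPT-2 the hidden size `d_ff` is four times `d_model`.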

**Transformer Layer:**

The transformer layer combines the components above: multi-head self-attention followed by the feed-forward network, each wrapped in a residual connection and layer normalization.

In [None]:
class TransformerLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super(TransformerLayer, self).__init__()
        self.self_attention = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForward(d_model, d_ff)
        self.layer_norm_1 = nn.LayerNorm(d_model)
        self.layer_norm_2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        attention_output = self.self_attention(x, x, x, mask)
        x = self.layer_norm_1(x + attention_output)
        ff_output = self.feed_forward(x)
        x = self.layer_norm_2(x + ff_output)
        return x
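
The layer above normalizes after each residual addition (post-LN). The GPT-2 paper instead moves layer normalization to the input of each sub-block (pre-LN). A minimal sketch of that ordering follows. `PreLNBlock` is a hypothetical name, and the sub-layers in the demo are stand-ins used only to check shapes, not real attention.

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Pre-layer-norm residual block (hypothetical helper).

    `attention` and `feed_forward` are any modules that map
    (batch, seq_len, d_model) to (batch, seq_len, d_model).
    """
    def __init__(self, d_model, attention, feed_forward):
        super().__init__()
        self.ln_1 = nn.LayerNorm(d_model)
        self.ln_2 = nn.LayerNorm(d_model)
        self.attention = attention
        self.feed_forward = feed_forward

    def forward(self, x):
        # Normalize first, then add the sub-layer output to the residual stream
        x = x + self.attention(self.ln_1(x))
        x = x + self.feed_forward(self.ln_2(x))
        return x

d_model = 16
block = PreLNBlock(
    d_model,
    attention=nn.Linear(d_model, d_model),  # stand-in for a self-attention module
    feed_forward=nn.Sequential(nn.Linear(d_model, 64), nn.GELU(), nn.Linear(64, d_model)),
)
x = torch.randn(2, 5, d_model)
y = block(x)
print(y.shape)  # torch.Size([2, 5, 16])
```

An attention module with a `(q, k, v, mask)` signature would need a thin wrapper that passes the normalized input as all three arguments.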

**Usage Example:**

In [None]:
# Usage example
seq_length = 50
batch_size = 16
vocab_size = 10000
embedding_size = 256
num_heads = 8
hidden_size = 512

# Create input tensor
input_data = torch.randint(0, vocab_size, (batch_size, seq_length))
embedding = nn.Embedding(vocab_size, embedding_size)(input_data)
pos_encoding = PositionalEncoding(embedding_size)(embedding)

transformer_layer = TransformerLayer(embedding_size, num_heads, hidden_size)
output = transformer_layer(pos_encoding)

print(output.shape)  # Check the output shape
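
The usage example runs with toy sizes. For reference, the parameter count of GPT-2 small can be worked out from its published configuration (vocabulary 50257, context length 1024, model width 768, 12 layers). The function below is a hypothetical helper that assumes biases on every linear layer and an output head tied to the token embedding.

```python
def gpt2_param_count(vocab_size=50257, n_ctx=1024, d_model=768, n_layer=12):
    d_ff = 4 * d_model
    # Token embedding table plus learned position embedding table
    embeddings = vocab_size * d_model + n_ctx * d_model
    # Fused Q/K/V projection plus the attention output projection (weights + biases)
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Two linear layers of the feed-forward network (weights + biases)
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two layer norms per block, each with a scale and a shift vector
    layer_norms = 2 * (2 * d_model)
    per_layer = attention + mlp + layer_norms
    final_ln = 2 * d_model
    return embeddings + n_layer * per_layer + final_ln

total = gpt2_param_count()
print(f"{total:,}")  # 124,439,808
```

This is the figure usually rounded to 124M or 125M.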

# Task 2: Transformer Architectural Changes

In the second task, you are required to add alterations to your original GPT-2 model architecture to experiment and assess the potential of improvements. Here's what you need to do:

- **Rotary Positional Embedding:** Replace the original positional embeddings in the GPT-2 model with Rotary embeddings. You may refer to [Su et. al. RoFormer](https://arxiv.org/pdf/2104.09864.pdf).
- **Group Query Attention:** Equip your model with the Group Query Attention mechanism following the insights from the [Ainslie et. al. GQA: Training Generalized Multi-Query Transformer](https://arxiv.org/pdf/2305.13245v2.pdf). Analyze how this mechanism can modify the model's operation compared to the standard attention mechanism.
- **Sliding Window Attention:** Imbibe the Sliding Window Attention mechanism in your model and observe its effects on model performance. Refer to the work by [Beltagy et. al. Longformer](https://arxiv.org/pdf/2004.05150v2.pdf) for better comprehension of its implementation and advantages.

**Deliverable:** Python code with any one, two or all three changes. Comment on the model size and capabilities, potential pitfalls and/or any improvement after each change. Points will be awarded for any combination of successful implementations.

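1. **Rotary Positional Embedding:**

Rotary positional embeddings (RoPE), introduced in the RoFormer paper by Su et al., change how position information enters the model. Instead of adding a sine/cosine encoding to the token embeddings, RoPE rotates each pair of feature dimensions by an angle proportional to the token's position. When the rotation is applied to queries and keys, their dot product depends only on the relative distance between the two positions. The rotation tables are fixed, so this change removes the learned position embedding table and adds no new parameters.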

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class RotaryPositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=512, base=10000.0):
        super(RotaryPositionalEncoding, self).__init__()
        assert d_model % 2 == 0, "Rotary embeddings need an even feature dimension."
        self.d_model = d_model
        self.max_len = max_len

        # One rotation frequency per pair of feature dimensions
        inv_freq = 1.0 / (base ** (torch.arange(0, d_model, 2).float() / d_model))
        position = torch.arange(max_len).float()
        angles = torch.outer(position, inv_freq)  # (max_len, d_model // 2)

        # Fixed (non-learned) tables, stored as buffers so they follow .to(device)
        self.register_buffer('cos', torch.cos(angles))
        self.register_buffer('sin', torch.sin(angles))

    def forward(self, x):
        # x: (..., seq_len, d_model). In a full model this is applied to the per-head
        # queries and keys inside attention (with d_model set to the head size),
        # not added to the token embeddings.
        seq_len = x.size(-2)
        assert seq_len <= self.max_len, "Input sequence length exceeds maximum sequence length."

        cos = self.cos[:seq_len]
        sin = self.sin[:seq_len]

        # Rotate each (even, odd) feature pair by its position-dependent angle
        x_even, x_odd = x[..., 0::2], x[..., 1::2]
        rotated_even = x_even * cos - x_odd * sin
        rotated_odd = x_even * sin + x_odd * cos

        # Interleave the rotated pairs back into the original feature order
        return torch.stack((rotated_even, rotated_odd), dim=-1).flatten(-2)

# Example usage
seq_length = 50
batch_size = 16
vocab_size = 10000
embedding_size = 256

# Create input tensor
input_data = torch.randint(0, vocab_size, (batch_size, seq_length))
embedding = nn.Embedding(vocab_size, embedding_size)(input_data)

# Apply Rotary Positional Encoding
rotary_positional_encoding = RotaryPositionalEncoding(embedding_size)
output = rotary_positional_encoding(embedding)

print(output.shape)  # Check the output shape
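
The property that makes rotary embeddings attractive is that the query-key score depends only on the offset between two positions. The sketch below checks this numerically with a hypothetical standalone `rotate` function for a single vector. It uses float64 so the comparison is not dominated by rounding.

```python
import torch

def rotate(x, position, base=10000.0):
    # x: (d,) vector with even d; rotate each (even, odd) pair by position * frequency
    d = x.size(-1)
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float64) / d))
    angles = position * inv_freq
    cos, sin = torch.cos(angles), torch.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    pairs = torch.stack((x_even * cos - x_odd * sin, x_even * sin + x_odd * cos), dim=-1)
    return pairs.flatten()

torch.manual_seed(0)
q = torch.randn(8, dtype=torch.float64)
k = torch.randn(8, dtype=torch.float64)

# Same offset (3 positions apart) at two different absolute positions
score_a = torch.dot(rotate(q, 5), rotate(k, 2))
score_b = torch.dot(rotate(q, 105), rotate(k, 102))
print(torch.allclose(score_a, score_b))  # True
```

Shifting both positions by the same amount leaves the score unchanged, which is the relative-position behaviour described in the RoFormer paper.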


2. **Group Query Attention:**


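Group Query Attention (grouped-query attention in Ainslie et al.) modifies standard multi-head attention so that each group of query heads shares a single key head and value head. With fewer key/value heads, the key and value projections and the key/value cache used during generation shrink, while the number of query heads stays the same. The potential pitfall is a small loss in quality compared with full multi-head attention, since the heads in a group can no longer use different keys and values.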

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupQueryAttention(nn.Module):
    def __init__(self, d_model, num_heads, group_size):
        super(GroupQueryAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads."
        assert num_heads % group_size == 0, "num_heads must be divisible by group_size."
        self.d_model = d_model
        self.num_heads = num_heads
        self.group_size = group_size  # Number of query heads sharing one key/value head
        self.num_kv_heads = num_heads // group_size
        self.depth = d_model // num_heads

        # Queries keep all heads; keys and values get fewer heads
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, self.num_kv_heads * self.depth)
        self.value = nn.Linear(d_model, self.num_kv_heads * self.depth)

        self.output = nn.Linear(d_model, d_model)

    def split_heads(self, x, batch_size, num_heads):
        x = x.view(batch_size, -1, num_heads, self.depth)
        return x.permute(0, 2, 1, 3)

    def forward(self, q, k, v, mask=None):
        batch_size = q.size(0)

        # Apply linear transformations for queries, keys, and values
        q = self.split_heads(self.query(q), batch_size, self.num_heads)
        k = self.split_heads(self.key(k), batch_size, self.num_kv_heads)
        v = self.split_heads(self.value(v), batch_size, self.num_kv_heads)

        # Repeat each key/value head so that group_size query heads share it
        k = k.repeat_interleave(self.group_size, dim=1)
        v = v.repeat_interleave(self.group_size, dim=1)

        # Compute attention scores
        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.depth ** 0.5)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attention = F.softmax(scores, dim=-1)
        output = torch.matmul(attention, v)
        output = output.permute(0, 2, 1, 3).contiguous().view(batch_size, -1, self.d_model)

        return self.output(output)

# Example usage
seq_length = 50
batch_size = 16
embedding_size = 256
num_heads = 8
group_size = 4

# Create input tensors
q = torch.randn(batch_size, seq_length, embedding_size)
k = torch.randn(batch_size, seq_length, embedding_size)
v = torch.randn(batch_size, seq_length, embedding_size)

# Apply Group Query Attention
group_query_attention = GroupQueryAttention(embedding_size, num_heads, group_size)
output = group_query_attention(q, k, v)

print(output.shape)  # Check the output shape
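
To comment on model size, the effect of sharing key/value heads can be estimated with simple arithmetic. The two functions below are hypothetical helpers. They assume biased linear projections and use the same sizes as the example above, where 8 query heads in groups of 4 give 2 key/value heads.

```python
def kv_projection_params(d_model, num_heads, num_kv_heads):
    depth = d_model // num_heads
    # Weight and bias of the key projection plus weight and bias of the value projection
    return 2 * (d_model * num_kv_heads * depth + num_kv_heads * depth)

def kv_cache_values_per_token(d_model, num_heads, num_kv_heads):
    depth = d_model // num_heads
    # One key vector and one value vector per key/value head
    return 2 * num_kv_heads * depth

d_model, num_heads = 256, 8
mha_params = kv_projection_params(d_model, num_heads, num_kv_heads=8)
gqa_params = kv_projection_params(d_model, num_heads, num_kv_heads=2)
print(mha_params, gqa_params)  # 131584 32896

mha_cache = kv_cache_values_per_token(d_model, num_heads, num_kv_heads=8)
gqa_cache = kv_cache_values_per_token(d_model, num_heads, num_kv_heads=2)
print(mha_cache, gqa_cache)  # 512 128
```

With these sizes the key/value projections and the per-token cache are both four times smaller per layer, while the query and output projections are unchanged.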

3. **Sliding Window Attention:**


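Sliding Window Attention restricts each token to attend only to tokens within a fixed-size window around it, as in Longformer (Beltagy et al.). The number of attended pairs then grows linearly with sequence length instead of quadratically, which makes attention over long sequences cheaper. The pitfall is that a single layer can no longer relate tokens that are further apart than the window, so long-range information has to travel through several layers.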

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlidingWindowAttention(nn.Module):
    def __init__(self, d_model, num_heads, window_size):
        super(SlidingWindowAttention, self).__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.window_size = window_size
        self.depth = d_model // num_heads

        # Initialize linear layers for queries, keys, and values
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)

        self.output = nn.Linear(d_model, d_model)

    def split_heads(self, x, batch_size):
        x = x.view(batch_size, -1, self.num_heads, self.depth)
        return x.permute(0, 2, 1, 3)

    def forward(self, q, k, v, mask=None):
        # Self-attention: q, k, and v are assumed to have the same sequence length
        batch_size = q.size(0)
        seq_length = q.size(1)

        # Apply linear transformations for queries, keys, and values
        q = self.split_heads(self.query(q), batch_size)
        k = self.split_heads(self.key(k), batch_size)
        v = self.split_heads(self.value(v), batch_size)

        # Compute attention scores. This reference version still builds the full
        # score matrix, so it shows the attention pattern but not the memory savings.
        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.depth ** 0.5)

        # Band mask: position i may attend to j only if |i - j| <= window_size // 2
        idx = torch.arange(seq_length, device=scores.device)
        band = (idx[None, :] - idx[:, None]).abs() <= self.window_size // 2
        scores = scores.masked_fill(~band, float('-inf'))

        # An extra mask (for example a causal mask) can be combined with the band
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attention = F.softmax(scores, dim=-1)
        output = torch.matmul(attention, v)
        output = output.permute(0, 2, 1, 3).contiguous().view(batch_size, -1, self.d_model)
        return self.output(output)

# Example usage
seq_length = 1000  # Example sequence length
batch_size = 2  # Kept small because the full score matrix is materialized
embedding_size = 256
num_heads = 8
window_size = 128  # Define window size

# Create input tensors
q = torch.randn(batch_size, seq_length, embedding_size)
k = torch.randn(batch_size, seq_length, embedding_size)
v = torch.randn(batch_size, seq_length, embedding_size)

# Apply Sliding Window Attention
sliding_window_attention = SlidingWindowAttention(embedding_size, num_heads, window_size)
output = sliding_window_attention(q, k, v)

print(output.shape)  # Check the output shape
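
The sliding-window pattern is easiest to see as a band mask on a short sequence. The function `sliding_window_mask` below is a hypothetical helper. It assumes a symmetric window in which each position sees `window_size // 2` neighbours on each side, plus itself.

```python
import torch

def sliding_window_mask(seq_len, window_size):
    idx = torch.arange(seq_len)
    # True where |i - j| is at most half the window size
    return (idx[None, :] - idx[:, None]).abs() <= window_size // 2

mask = sliding_window_mask(seq_len=8, window_size=4)
print(mask.int())

allowed = int(mask.sum())
print(allowed, "of", mask.numel())  # 34 of 64
```

Full attention would use all 64 pairs here. With a window, the number of allowed pairs is at most `seq_len * (window_size + 1)`, which grows linearly with sequence length.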


# Task 3: Training Loop Implementation

Finally, create a training loop considering these following requirements:

1. **Single GPU Training Loop:** Your base implementation should be equipped to train your model on a single GPU setup.
2. **Distributed Data Parallel (DDP):** Extend your single GPU training loop to support training across multiple GPUs using DDP. Revisit the [PyTorch's DDP tutorial](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) for guidance.
3. **Fully Sharded Data Parallel (FSDP):** Implement FSDP as a part of your training loop to shard the model parameters, gradients, and optimizer state. You can follow [Gupta et al., 2020, Training GPT-3 Like Models on a Single Machine](https://arxiv.org/pdf/2101.06840.pdf) for a comprehensive understanding of it.

**Deliverable:** A Python script containing a functional training loop that is compatible with single GPU, DDP, and FSDP options along with a documentation illustrating how the code adapts to each setting.

**Evaluation Scheme:** Each feature implementation will account for:

- Single GPU: 10 points
- DDP: 10 points
- FSDP: 20 points

**Note:** Document your code, approaches, difficulties encountered, and your solutions thoroughly. Include any reference materials you used in your report. Focus on clear communication of your methodologies and results.

**Submission:**

For each subtask, submit your source code and a brief description of your implementations. If relevant, please support your findings with visualizations of the alterations and their impacts.

Please remember, partial points will be awarded for each part, so it's better to submit an incomplete assignment than no assignment at all.

1. **Single GPU Training Loop:**

Training a deep learning model on a single GPU involves setting up the training loop, handling data loading, optimizer updates, and logging. Here's an explanation and a simplified code snippet illustrating a training loop for a model using a single GPU in PyTorch:

* **Device Assignment:** Set the device to the GPU for computation.
* **Data Loading:** Prepare your dataset and data loaders.
* **Model Initialization:** Instantiate your model and move it to the GPU.
* **Loss Function and Optimizer:** Define the loss function and optimizer.
* **Training Loop:** Iterate through batches, perform forward pass, calculate loss, backward pass (gradient computation), and optimizer step.
* **Logging and Monitoring:** Optionally, log and monitor training metrics.

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Define your model architecture (replace this with your model)
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        # Placeholder layer so the optimizer has parameters to update
        self.linear = nn.Linear(10, 2)

    def forward(self, x):
        # Map the 10 input features to 2 class logits
        return self.linear(x)

# Create a dummy dataset and DataLoader (replace this with your dataset)
input_data = torch.randn(1000, 10)  # Example input data
target_data = torch.randint(0, 2, (1000,))  # Example target data

dataset = TensorDataset(input_data, target_data)
data_loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Define your model, loss function, and optimizer
model = Model().cuda()  # Move model to GPU
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop
num_epochs = 10

for epoch in range(num_epochs):
    model.train()  # Set the model in training mode
    epoch_loss = 0.0

    for inputs, targets in data_loader:
        inputs, targets = inputs.cuda(), targets.cuda()  # Move inputs and targets to GPU
        optimizer.zero_grad()

        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item() * inputs.size(0)

    epoch_loss /= len(dataset)
    print(f"Epoch [{epoch + 1}/{num_epochs}] Loss: {epoch_loss:.4f}")

# Optionally, save the trained model
torch.save(model.state_dict(), 'my_model.pth')
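
The loop above trains a dummy classifier. For a GPT-style language model the loss is computed on next-token prediction, which needs the logits and targets shifted by one position. The function `next_token_loss` below is a hypothetical helper illustrating that step. It assumes the model returns logits of shape `(batch, seq_len, vocab_size)`.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, token_ids):
    # logits: (batch, seq_len, vocab_size); token_ids: (batch, seq_len)
    # Position t predicts token t + 1, so drop the last logit and the first token
    shifted_logits = logits[:, :-1, :].contiguous()
    targets = token_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shifted_logits.view(-1, shifted_logits.size(-1)),
        targets.view(-1),
    )

torch.manual_seed(0)
vocab_size = 50
token_ids = torch.randint(0, vocab_size, (4, 12))
logits = torch.zeros(4, 12, vocab_size)  # Uniform prediction over the vocabulary
loss = next_token_loss(logits, token_ids)
print(round(loss.item(), 3))  # 3.912, which is log(50)
```

A uniform prediction gives a loss of `log(vocab_size)`, which is a useful sanity check for the first training steps of a randomly initialized model.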


2. **Distributed Data Parallel (DDP):**

Distributed Data Parallel (DDP) in PyTorch trains a model across multiple GPUs, on one machine or several. Each process holds a full replica of the model and works on its own shard of the data. During the backward pass, DDP averages the gradients across processes with an all-reduce, so every replica applies the same update. The steps are:

* **Initialize DDP Process Group:** Initialize the process group for DDP.
* **Set Device and Model:** Bind each process to one GPU and move that process's model replica to it.
* **Wrap Model with DDP:** Wrap the model with torch.nn.parallel.DistributedDataParallel.
* **Data Loading and Sharding:** Use torch.utils.data.distributed.DistributedSampler so that each process loads a different subset of the dataset.
* **Training Loop:** As in the single GPU training loop, iterate through batches and perform the forward pass, loss calculation, backward pass, and optimizer step.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn.parallel
import torch.utils.data.distributed
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset

# Define your model architecture (replace this with your model)
class YourModel(nn.Module):
    def __init__(self):
        super(YourModel, self).__init__()
        # Placeholder layer: DDP requires at least one parameter that needs a gradient
        self.linear = nn.Linear(10, 2)

    def forward(self, x):
        # Map the 10 input features to 2 class logits
        return self.linear(x)

def train(rank, world_size):
    torch.manual_seed(42)
    # Initialize the process group for DDP
    dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456', world_size=world_size, rank=rank)

    # Set the device to GPU and assign model to device
    device = torch.device(f'cuda:{rank}')
    torch.cuda.set_device(device)

    model = YourModel().to(device)

    # Wrap model with DDP
    model = DDP(model, device_ids=[rank])

    # Create a dummy dataset and DataLoader (replace this with your dataset)
    input_data = torch.randn(1000, 10)  # Example input data
    target_data = torch.randint(0, 2, (1000,))  # Example target data

    dataset = TensorDataset(input_data, target_data)
    sampler = torch.utils.data.distributed.DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    data_loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    # Define your loss function and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # Training loop
    num_epochs = 10

    for epoch in range(num_epochs):
        model.train()  # Set the model in training mode
        sampler.set_epoch(epoch)  # Use a different shuffle of the shards each epoch
        epoch_loss = 0.0
        num_samples = 0

        for inputs, targets in data_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()

            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item() * inputs.size(0)
            num_samples += inputs.size(0)

        # Each rank sees only its own shard, so average over the samples it processed
        epoch_loss /= num_samples
        print(f"Rank {rank} - Epoch [{epoch + 1}/{num_epochs}] Loss: {epoch_loss:.4f}")

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)
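
How `DistributedSampler` divides the data can be inspected on a CPU without starting any processes, because passing `num_replicas` and `rank` explicitly avoids the need for a process group. The sketch below uses a toy dataset of ten items and turns shuffling off so the split is easy to read.

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(10))
world_size = 2

# One list of dataset indices per rank
shards = [
    list(DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=False))
    for rank in range(world_size)
]
print(shards[0])  # [0, 2, 4, 6, 8]
print(shards[1])  # [1, 3, 5, 7, 9]
```

Each rank receives a disjoint, equally sized subset, which is why the per-rank loss in the training loop covers only part of the dataset.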


3. **Fully Sharded Data Parallel (FSDP):**

Fully Sharded Data Parallel (FSDP) shards model parameters, gradients, and optimizer state across multiple GPUs. Because no GPU has to hold the complete training state, it allows training models that would not fit on a single GPU.

* **Sharding Model Parameters:** Each rank stores only its own shard of the parameters. The full parameters of a wrapped unit are gathered just before its forward and backward computation and freed afterwards.
* **Sharding Gradients and Optimizer State:** After the backward pass, gradients are reduced and scattered so that each rank keeps only the gradient shard matching its parameter shard, and the optimizer state exists only for that shard.
* **Communication Pattern:** Where DDP all-reduces gradients, FSDP uses an all-gather for parameters and a reduce-scatter for gradients, trading extra communication for lower memory use.
* **Optimizer Setup:** The optimizer is created after the model is wrapped, so that it holds the sharded parameters. Standard PyTorch optimizers such as Adam can be used.
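
The memory effect of sharding can be estimated with simple arithmetic. The function `training_state_bytes` below is a hypothetical helper. It assumes fp32 parameters, fp32 gradients, and Adam with two moment buffers per parameter, and it ignores activations and the temporarily gathered parameters, so it is a lower bound, not a measurement.

```python
def training_state_bytes(num_params, world_size, sharded, bytes_per_value=4):
    # Parameters + gradients + two Adam moment buffers = 4 values per parameter
    values_per_param = 4
    total = num_params * values_per_param * bytes_per_value
    # With full sharding each rank holds 1 / world_size of the training state
    return total // world_size if sharded else total

num_params = 124_439_808  # GPT-2 small
ddp = training_state_bytes(num_params, world_size=4, sharded=False)
fsdp = training_state_bytes(num_params, world_size=4, sharded=True)
print(f"DDP:  {ddp / 2**30:.2f} GiB per GPU")
print(f"FSDP: {fsdp / 2**30:.2f} GiB per GPU")
```

For a model of this size the saving is modest in absolute terms. The same ratio applies to models with billions of parameters, where it decides whether training fits at all.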

In [None]:
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP  # Requires PyTorch 1.12 or newer
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Define your model architecture (replace this with your model)
class YourModel(nn.Module):
    def __init__(self):
        super(YourModel, self).__init__()
        # Placeholder layer so the example has parameters to shard
        self.linear = nn.Linear(10, 2)

    def forward(self, x):
        # Map the 10 input features to 2 class logits
        return self.linear(x)

def train(rank, world_size):
    torch.manual_seed(42)
    # FSDP uses the same process group setup as DDP
    dist.init_process_group(backend='nccl', init_method='tcp://127.0.0.1:23456', world_size=world_size, rank=rank)

    # Set the device to GPU and assign model to device
    device = torch.device(f'cuda:{rank}')
    torch.cuda.set_device(device)

    model = YourModel().to(device)

    # Wrap model with FSDP so parameters, gradients, and optimizer state are sharded
    model = FSDP(model)

    # Create a dummy dataset and DataLoader (replace this with your dataset)
    input_data = torch.randn(1000, 10)  # Example input data
    target_data = torch.randint(0, 2, (1000,))  # Example target data

    dataset = TensorDataset(input_data, target_data)
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    data_loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    # Define your loss function and optimizer.
    # The optimizer is created after wrapping so that it holds the sharded parameters.
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    # Training loop
    num_epochs = 10

    for epoch in range(num_epochs):
        model.train()  # Set the model in training mode
        sampler.set_epoch(epoch)  # Use a different shuffle of the shards each epoch
        epoch_loss = 0.0
        num_samples = 0

        for inputs, targets in data_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()

            outputs = model(inputs)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()

            epoch_loss += loss.item() * inputs.size(0)
            num_samples += inputs.size(0)

        # Each rank sees only its own shard, so average over the samples it processed
        epoch_loss /= num_samples
        print(f"Rank {rank} - Epoch [{epoch + 1}/{num_epochs}] Loss: {epoch_loss:.4f}")

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(train, args=(world_size,), nprocs=world_size)
