# Assignment 2

The way NLP and ML models are trained has gone through several phases in recent years. With the advent of a wide variety of pretrained models, fine-tuning them for downstream tasks has become standard practice. In this assignment, we will explore instruction tuning, a technique designed to make LLMs more useful. The techniques we cover are not only robust but also resource-efficient, allowing us to fine-tune even on the free T4 GPUs available in Kaggle notebooks.

Among the techniques we will explore is **Low-Rank Adaptation (LoRA)**, a method that has proven to be efficient and effective in adapting large pre-trained language models to specific tasks. LoRA is grounded in the hypothesis that updates to the weights during adaptation have a low "intrinsic rank", allowing us to constrain weight updates and reduce computational complexity, while preserving model performance.

Complementing LoRA, we will also engage with **mixed-precision training**. This technique combines different numerical precisions to perform computations, aiming to maximize the computational power of modern GPUs. Mixed-precision training can accelerate model training, reduce memory requirements, and thus enable us to train larger, more powerful models.
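
To see why float16 training usually comes with gradient scaling, here is a minimal sketch that uses only Python's `struct` module (format code `e` is IEEE half precision). The gradient value and the scale factor are illustrative assumptions, not values used in this assignment.

```python
import struct

def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]

tiny_grad = 1e-8   # hypothetical gradient magnitude
scale = 65536.0    # hypothetical loss scale (2**16)

# Stored directly in float16, the gradient underflows to zero.
unscaled = to_fp16(tiny_grad)

# Scaling before the cast keeps the value representable;
# the unscaling happens afterwards in full precision.
scaled = to_fp16(tiny_grad * scale)
recovered = scaled / scale

print(unscaled)   # 0.0
print(recovered)  # close to 1e-8
```

This is the idea behind a gradient scaler: multiply the loss before the backward pass so that small gradients survive the cast to float16, then divide them back before the optimizer step.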

Finally, we will delve into **distributed training**, a must-know technique for handling very large models or datasets. With distributed training, we can leverage multiple GPUs or even multiple machines to collectively train a single model, effectively overcoming the limitations posed by the memory capacity of individual GPUs.
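
As a rough illustration of how data is split across processes, the sketch below mimics, in plain Python, the way a distributed sampler can assign sample indices to each rank. It assumes no shuffling and that the dataset is padded by wrapping around to the start; `shard_indices` is a hypothetical helper, not part of the assignment code.

```python
import math

def shard_indices(dataset_len: int, world_size: int, rank: int) -> list[int]:
    """Simplified, non-shuffled view of how samples are split across ranks."""
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    indices = list(range(dataset_len))
    # Pad by wrapping around so every rank gets the same number of samples.
    indices += indices[: total - dataset_len]
    # Each rank takes every world_size-th index, starting at its own rank.
    return indices[rank:total:world_size]

for rank in range(2):
    print(rank, shard_indices(dataset_len=5, world_size=2, rank=rank))
# 0 [0, 2, 4]
# 1 [1, 3, 0]
```

Each process sees a different slice of the data, so one pass over all ranks covers the whole dataset (plus a few repeated samples when the length is not divisible by the number of processes).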

By the end of this assignment, you should be well-acquainted with these cutting-edge techniques and be capable of integrating them into your own machine learning projects. Let's embark on this exciting journey into the vanguard of machine learning fine-tuning methodologies!

### Dataset

The Stanford Alpaca dataset is a synthetic dataset developed by Stanford researchers as part of a project to build and share an instruction-following model called Alpaca. The dataset contains 52K examples used for fine-tuning the Alpaca model. Each example consists of a unique instruction that the model should follow, an optional context or input for the task, and the corresponding output generated by OpenAI's text-davinci-003 model. In this assignment, we will use an updated version, the Alpaca-GPT4 dataset, which also contains 52K instruction-following examples, but with outputs generated by GPT-4 from the Alpaca prompts. More information is available at the [Data release](https://github.com/tatsu-lab/stanford_alpaca/blob/main/README.md#data-release), the [Alpaca-GPT4](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM#how-good-is-the-data) repository, and the [Alpaca project page](https://crfm.stanford.edu/2023/03/13/alpaca.html).

### Model

The [Qwen2.5 models](https://qwenlm.github.io/blog/qwen2.5/) were launched by Alibaba Cloud and pretrained on the latest dataset, consisting of up to 18 trillion tokens. Qwen2.5 models are designed to be more resilient to diverse system prompts, enhancing role-play implementation, condition-setting for chatbots, and supporting over 29 languages, including Vietnamese. In terms of benchmarking, Qwen2.5-72B stands out as a top-tier performer among open-source models. For example, Qwen2.5-72B achieves an MMLU score of 86.1, surpassing even larger models like LLaMA-3-405B.

As the Qwen2.5 series offers models ranging from 0.5 billion to 72 billion parameters, for this assignment, we will select Qwen2.5 1.5B, which aligns with our resource constraints.

For our purposes, we will fine-tune the Qwen2.5 1.5B model using the Alpaca-GPT4 dataset. This assignment simulates the supervised fine-tuning phase, similar to the process employed in the development of models like ChatGPT. By leveraging the Alpaca-GPT4 dataset, we aim to enhance Qwen2.5 1.5B's ability to follow instructions and perform tasks as specified by users.




### Initial setup

To prepare the environment for our project, please execute the commands below:

#### Clone and Install Libraries (Skip this cell if you have already cloned the repository)

In [2]:
!git clone https://github.com/vietai-courses/Advanced-NLP05.git
%cd Advanced-NLP05/assignment_02
!pip install -r requirements.txt

Cloning into 'Advanced-NLP05'...
remote: Enumerating objects: 26, done.
remote: Counting objects: 100% (26/26), done.
remote: Compressing objects: 100% (23/23), done.
remote: Total 26 (delta 2), reused 26 (delta 2), pack-reused 0 (from 0)
Receiving objects: 100% (26/26), 14.18 MiB | 22.10 MiB/s, done.
Resolving deltas: 100% (2/2), done.
/kaggle/working/Advanced-NLP05/assignment_02
Collecting git+https://github.com/huggingface/accelerate.git (from -r requirements.txt (line 7))
  Cloning https://github.com/huggingface/accelerate.git to /tmp/pip-req-build-ydg04oin
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/accelerate.git /tmp/pip-req-build-ydg04oin
  Resolved https://github.com/huggingface/accelerate.git to commit 200c9eb7833cfa505907f6f224ebf5a275aa6d92
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting g

#### RUN ONLY AS NEEDED (Execute this code if you have recently restarted the kernel).

Run this cell only if you hit an Out of Memory error and had to restart the kernel. After the restart, it moves you back into the assignment directory so you can resume where you left off.

In [3]:
%cd Advanced-NLP05/assignment_02

[Errno 2] No such file or directory: 'Advanced-NLP05/assignment_02'
/kaggle/working/Advanced-NLP05/assignment_02


## Part 1: Low-Rank Adaptation of Large Language Models for Efficient Fine-tuning

![](https://miro.medium.com/v2/resize:fit:730/1*D_i25E9dTd_5HMa45zITSg.png)

Figure 1: LoRA method. We only train A and B.


### 1. Introduction

**LoRA, or Low-Rank Adaptation**, is a technique for adapting large language models to specific tasks or domains more efficiently. It's based on the observation that as models get larger, the conventional approach of full fine-tuning becomes less feasible due to the large number of parameters involved.

Instead of updating every weight, LoRA injects a pair of small trainable low-rank matrices alongside each adapted dense layer and optimizes only those for the adaptation task, while the original pretrained weights remain unchanged.

Here are some of the key points of the LoRA technique:

- **Freezing Pretrained Weights**: Instead of modifying all the parameters of a pretrained model during fine-tuning, LoRA freezes the pretrained weights. This means that the original model weights remain unchanged during the adaptation process.

- **Rank Decomposition Matrices**: LoRA freezes the pretrained model weights and *injects trainable rank decomposition matrices* into each layer of the Transformer architecture. These matrices are used to adjust the output of each layer in a way that's specific to the adaptation task.

- **Indirect Training of Dense Layers**: The rank decomposition matrices allow for the indirect training of each dense layer in the neural network. They are injected into the layer's update during the adaptation process and optimized to enhance the layer's performance on the specific task or domain.

- **Significant Reduction in Trainable Parameters**: By focusing on these rank decomposition matrices instead of the entire set of model weights, LoRA greatly reduces the number of trainable parameters for downstream tasks. For instance, in the case of GPT-3, LoRA can reduce the number of trainable parameters by a factor of 10,000.

- **Maintaining Model Performance**: Despite the significant reduction in the number of trainable parameters, the LoRA technique is designed to maintain or even improve the performance of the large language model on the specific task or domain.

In summary, LoRA is a method that tackles the challenge of adapting large language models to specific tasks or domains in a more efficient and feasible way, making the fine-tuning process more manageable and less resource-intensive.

### 2. Details

The LoRA technique introduces a mathematical concept known as low-rank approximation into the fine-tuning process of large language models. Here's a mathematical description of the process:

LoRA involves modifying the pre-trained weight matrix $\mathbf{W}_0 \in \mathbb{R}^{d \times k}$ of a neural network layer by introducing a low-rank parametrized update matrix $\Delta \mathbf{W} = \mathbf{B}\mathbf{A}$, where $\mathbf{B} \in \mathbb{R}^{d \times r}$, $\mathbf{A} \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$.

During the adaptation process, $\mathbf{W}_0$ is kept frozen, which means it does not receive any gradient updates. The trainable parameters are contained within $\mathbf{A}$ and $\mathbf{B}$, which form the low-rank update matrix $\Delta \mathbf{W}$.
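
To make the saving concrete, the following sketch counts trainable parameters for a single weight matrix with and without LoRA. The layer size and rank are hypothetical values chosen for round numbers, not the actual Qwen2.5 dimensions.

```python
def full_params(d: int, k: int) -> int:
    """Trainable parameters when the whole d x k matrix is fine-tuned."""
    return d * k

def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for B (d x r) plus A (r x k)."""
    return r * (d + k)

d = k = 1024  # hypothetical layer size
r = 8         # hypothetical rank

print(full_params(d, k))                          # 1048576
print(lora_params(d, k, r))                       # 16384
print(full_params(d, k) // lora_params(d, k, r))  # 64
```

The ratio grows as the layer gets larger or the rank gets smaller, which is why the savings are most dramatic for very large models.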

It's important to note that both $\mathbf{W}_0$ and $\Delta \mathbf{W} = \mathbf{B}\mathbf{A}$ are multiplied with the same input, and their respective output vectors are summed. If $\mathbf{x}$ is the input and $\mathbf{h} = \mathbf{W}_0\mathbf{x}$ is the output of the original weight matrix, the modified output is:

$\mathbf{h} = \mathbf{W}_0\mathbf{x} + \Delta \mathbf{W} \mathbf{x} = \mathbf{W}_0\mathbf{x} + \mathbf{B}\mathbf{A}\mathbf{x} = (\mathbf{W}_0 + \mathbf{B}\mathbf{A})\mathbf{x}$ 

in which computing $\mathbf{W}_0 + \mathbf{B}\mathbf{A}$ is called the **merge** operation. We will implement it in this assignment.

At the beginning of training, we initialize $\mathbf{A}$ with a random Gaussian distribution and $\mathbf{B}$ with zero, such that $\Delta \mathbf{W} = \mathbf{B}\mathbf{A}$ is zero, as shown in **Figure 1**. This ensures that the initial output of the model remains the same as in the pre-training phase, and the adaptation starts from the original model state. 

The low-rank update $\Delta \mathbf{W} = \mathbf{B}\mathbf{A}$ then evolves during training, helping to specialize the model for a specific task while keeping the number of trainable parameters manageable. Additionally, $\Delta \mathbf{W}$ is scaled by $\frac{\alpha}{r}$ where $\alpha$ is a constant hyper-parameter. **(*)**
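
The equations above can be checked on a toy example. This is a minimal sketch with made-up numbers ($d = k = 2$, $r = 1$); plain Python lists are used so the arithmetic is easy to follow, whereas the assignment itself works with PyTorch tensors. The helper functions are hypothetical and exist only for this illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    return [[s * x for x in row] for row in a]

W0 = [[1.0, 2.0], [3.0, 4.0]]  # frozen weight, d x k
A = [[0.5, -1.0]]              # r x k, randomly initialised in practice
B_init = [[0.0], [0.0]]        # d x r, zero at the start of training
B_trained = [[2.0], [-1.0]]    # hypothetical values after some updates
x = [[1.0], [2.0]]             # input as a column vector
scaling = 2.0                  # alpha / r with alpha = 2, r = 1

def lora_forward(B):
    # Two separate paths whose outputs are summed.
    return add(matmul(W0, x), scale(matmul(B, matmul(A, x)), scaling))

def merged_forward(B):
    # One path through the merged matrix W0 + scaling * B A.
    merged = add(W0, scale(matmul(B, A), scaling))
    return matmul(merged, x)

print(lora_forward(B_init))       # [[5.0], [11.0]]  (same as W0 x)
print(lora_forward(B_trained))    # [[-1.0], [14.0]]
print(merged_forward(B_trained))  # [[-1.0], [14.0]]
```

With $\mathbf{B} = 0$ the layer reproduces the pretrained output exactly, and once $\mathbf{B}$ is non-zero the two-path and merged computations give the same result.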

This process is applied to each linear layer in the self-attention blocks of the Qwen2.5 language model, yielding an adapted model that is specialized for a specific task or domain with significantly fewer trainable parameters than the original model. For example, with GPT-3 175B, VRAM consumption during training is reduced from 1.2TB to 350GB. If $r = 4$ and only the query and value projection matrices are adapted, the checkpoint size is reduced by approximately 10,000 times (from 350GB to 35MB). This allows training with significantly fewer GPUs and helps to avoid communication overhead.
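
The GPT-3 checkpoint figures can be reproduced with back-of-the-envelope arithmetic. The sketch below uses the published GPT-3 175B shape (96 layers, hidden size 12288) and assumes 16-bit storage for every parameter; that storage assumption is ours.

```python
n_layers = 96          # GPT-3 175B Transformer layers
d_model = 12288        # GPT-3 175B hidden size
r = 4                  # LoRA rank from the example above
adapted_per_layer = 2  # query and value projections
bytes_per_param = 2    # assumption: 16-bit storage

# Each adapted d_model x d_model matrix gets B (d_model x r) and A (r x d_model).
lora_param_count = n_layers * adapted_per_layer * r * (d_model + d_model)
lora_bytes = lora_param_count * bytes_per_param
full_bytes = 175_000_000_000 * bytes_per_param
reduction = full_bytes / lora_bytes

print(lora_param_count)    # 18874368
print(lora_bytes / 2**20)  # 36.0 (MiB)
print(round(reduction))    # on the order of 10,000
```

Roughly 19 million LoRA parameters take a few tens of megabytes, against about 350GB for the full model, which matches the order of magnitude quoted above.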

Another benefit is the ability to switch between tasks at a lower cost by only swapping the LoRA weights, as opposed to all parameters. This enables the creation of many customized models that can be swapped in and out on the fly on machines that store the pre-trained weights in VRAM.

**(*)** The reason for scaling the update $\Delta \mathbf{W}\mathbf{x}$ by $\frac{\alpha}{r}$ is primarily for easier optimization.

Consider the scenario where you try different values of the rank $r$ across training runs. Without this scaling factor, increasing or decreasing $r$ would significantly affect the magnitude of the weight updates and thereby the learning dynamics of the model. In other words, every change to $r$ would require retuning the learning rate or other hyperparameters, which is a laborious and time-consuming task.

By scaling the updates by $\frac{\alpha}{r}$, the authors make the learning process more robust to changes in $r$. $\alpha$ is a constant, so this scaling factor effectively normalizes the magnitude of the updates relative to the rank of the low-rank approximation.

This way, even when $r$ changes, the overall scale of the updates remains approximately constant, meaning you can use the same learning rate and other hyperparameters. This is advantageous because it makes the training process more efficient and less sensitive to the choice of $r$.

Keep in mind that this is a heuristic and it may not always provide the optimal solution for every problem or dataset, but it is a practical choice that often works well in practice.
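
As a small illustration, here is how the factor $\frac{\alpha}{r}$ changes when $\alpha$ is held fixed and only the rank varies. The value of `lora_alpha` and the ranks are hypothetical.

```python
lora_alpha = 16  # hypothetical value, held fixed while r varies

# Same formula as `lora_alpha / r` in the LoRA layer code.
scalings = {r: lora_alpha / r for r in (4, 8, 16, 32)}

for r, scaling in scalings.items():
    print(f"r={r:>2}  scaling={scaling}")
# r= 4  scaling=4.0
# r= 8  scaling=2.0
# r=16  scaling=1.0
# r=32  scaling=0.5
```

Doubling the rank halves the factor, which counteracts the tendency of a higher-rank product $\mathbf{B}\mathbf{A}$ to produce larger updates.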

### 3. Implementation

Let's break down the code. Please take a look at the `lora_layer.py` file. The main components are:

- **LoraLayer** class: This is a base class that provides common functionality for both linear and embedding layers using the LoRA technique. It keeps track of LoRA parameters including the rank **r** and two sets of weights **lora_A** and **lora_B** (or **lora_embedding_A** and **lora_embedding_B** for the embedding layer). Two methods, **update_layer** and **update_layer_embedding**, are defined to update these parameters for linear and embedding layers, respectively.

- **Linear** and **Embedding** classes: These classes extend their corresponding PyTorch classes (**nn.Linear** and **nn.Embedding**) and the **LoraLayer** class. They initialize their superclasses as well as the LoRA parameters, and override the **merge**, **unmerge**, and **forward** methods to implement the LoRA technique. The **merge** method combines the original weights of the layer with the LoRA weights, and **unmerge** undoes this operation. The **forward** method applies the layer operation either with or without the LoRA technique, depending on whether LoRA is enabled.
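
The interplay between **merge**, **unmerge**, and the `merged` flag can be sketched with a toy class. This is a plain-Python stand-in written for illustration only: `ToyLoraLinear` is hypothetical, uses lists of rows instead of tensors, and supports a single adapter.

```python
import warnings

class ToyLoraLinear:
    """Plain-Python stand-in for a LoRA linear layer (illustration only)."""

    def __init__(self, weight, lora_A, lora_B, scaling):
        self.weight = [row[:] for row in weight]  # d x k, frozen base weight
        self.lora_A = lora_A                      # r x k
        self.lora_B = lora_B                      # d x r
        self.scaling = scaling                    # alpha / r
        self.merged = False

    def _delta(self):
        # scaling * (B @ A) for matrices stored as lists of rows
        return [
            [self.scaling * sum(b * a for b, a in zip(b_row, a_col))
             for a_col in zip(*self.lora_A)]
            for b_row in self.lora_B
        ]

    def _apply(self, sign):
        for w_row, d_row in zip(self.weight, self._delta()):
            for j, d in enumerate(d_row):
                w_row[j] += sign * d

    def merge(self):
        if self.merged:
            warnings.warn("Already merged. Nothing to do.")
            return
        self._apply(+1)
        self.merged = True

    def unmerge(self):
        if not self.merged:
            warnings.warn("Already unmerged. Nothing to do.")
            return
        self._apply(-1)
        self.merged = False

layer = ToyLoraLinear(
    weight=[[1.0, 2.0], [3.0, 4.0]],
    lora_A=[[0.5, -1.0]],
    lora_B=[[2.0], [-1.0]],
    scaling=2.0,
)
layer.merge()
print(layer.weight)  # [[3.0, -2.0], [2.0, 6.0]]
layer.unmerge()
print(layer.weight)  # [[1.0, 2.0], [3.0, 4.0]]
```

The flag is what prevents the update from being added twice: a second call to `merge()` only emits a warning and leaves the weights untouched.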

This assignment is heavily based on the internal codebase of the 🤗 PEFT library. 🤗 PEFT (**Parameter-Efficient Fine-Tuning**) is a library for efficiently adapting pre-trained language models (PLMs) to various downstream applications without fine-tuning all the model's parameters. Recent state-of-the-art PEFT techniques achieve performance comparable to that of full fine-tuning.

If you are new to PEFT, get started by reading the [Quicktour](https://huggingface.co/docs/peft/quicktour) guide and conceptual guides for [LoRA](https://huggingface.co/docs/peft/main/en/conceptual_guides/lora) methods.


**!!!** If you want to modify the .py files in Kaggle, you may need to use [Magic Commands](https://www.kaggle.com/code/matinmahmoudi/complete-guide-to-magic-commands-a-to-z) such as **%load**, **%%writefile**.

#### Q1: Implement the Merge operation in LoRA (15 points)
In the provided `lora_layer.py` file, your task is to complete the `merge` method within the `Linear` class. As a useful reference, consider the already implemented `merge` method in the `Embedding` class. This should provide a clear guide on how to approach this task.

#### Q2: Implement the Forward Pass in LoRA (15 points)
In the provided `lora_layer.py` file, your task is to complete the `forward` method within the `Linear` class. As a useful reference, consider the already implemented `forward` method in the `Embedding` class. This should provide a clear guide on how to approach this task.

#### Q3: Construct the LoRA Model and Dataloaders for Training (20 points)
In the provided `train.py` file, your task is to complete the `load_pretrained_model` and `prepare_dataloader` functions. The aforementioned PEFT Quicktour guide and LoRA conceptual guide are useful references. Note that the details specific to distributed training can be ignored at this stage.

Once you've finished the implementation, it's time to train your LoRA model. Congratulations!

In [4]:
# %load lora_layer.py
import math
import warnings

import torch
import torch.nn as nn
import torch.nn.functional as F

# The base class for LoRA layers
class LoraLayer:
    def __init__(
        self,
        in_features: int,  # The number of input features
        out_features: int,  # The number of output features
    ):
        # Initializes dictionaries to store various parameters for each adapter in the layer
        self.r = {}  # The rank of the low-rank matrix
        self.lora_alpha = {}  # The scaling factor
        self.scaling = {}  # The calculated scaling factor (lora_alpha / r)

        # Dropout layers for each adapter
        self.lora_dropout = nn.ModuleDict({})

        # Weight matrices for the linear layers
        self.lora_A = nn.ModuleDict({})
        self.lora_B = nn.ModuleDict({})

        # Weight matrices for the embedding layers
        self.lora_embedding_A = nn.ParameterDict({})
        self.lora_embedding_B = nn.ParameterDict({})

        # Boolean flag indicating whether the weights have been merged
        self.merged = False

        # Boolean flag indicating whether the adapters are disabled
        self.disable_adapters = False

        # Stores the number of input and output features
        self.in_features = in_features
        self.out_features = out_features
    
    # Method to update the parameters of the layer with a new adapter
    def update_layer(self, adapter_name, r, lora_alpha, lora_dropout, init_lora_weights):
        # Updates the rank and scaling factor for the adapter
        self.r[adapter_name] = r
        self.lora_alpha[adapter_name] = lora_alpha

        # If dropout rate is greater than 0, creates a dropout layer, otherwise creates an identity layer
        if lora_dropout > 0.0:
            lora_dropout_layer = nn.Dropout(p=lora_dropout)
        else:
            lora_dropout_layer = nn.Identity()

        # Updates the dropout layer for the adapter
        self.lora_dropout.update(nn.ModuleDict({adapter_name: lora_dropout_layer}))

        # If rank is greater than 0, creates trainable parameters for the adapter
        if r > 0:
            self.lora_A.update(nn.ModuleDict({adapter_name: nn.Linear(self.in_features, r, bias=False)}))
            self.lora_B.update(nn.ModuleDict({adapter_name: nn.Linear(r, self.out_features, bias=False)}))
            self.scaling[adapter_name] = lora_alpha / r

        # If init_lora_weights is True, resets the parameters of the adapter
        if init_lora_weights:
            self.reset_lora_parameters(adapter_name)

        # Moves the layer to the same device as the weight tensor
        self.to(self.weight.device)

    # Method to update the parameters of the embedding layer with a new adapter
    def update_layer_embedding(self, adapter_name, r, lora_alpha, lora_dropout, init_lora_weights):
        # Updates the rank and scaling factor for the adapter
        self.r[adapter_name] = r
        self.lora_alpha[adapter_name] = lora_alpha

        # If dropout rate is greater than 0, creates a dropout layer, otherwise creates an identity layer
        if lora_dropout > 0.0:
            lora_dropout_layer = nn.Dropout(p=lora_dropout)
        else:
            lora_dropout_layer = nn.Identity()

        # Updates the dropout layer for the adapter
        self.lora_dropout.update(nn.ModuleDict({adapter_name: lora_dropout_layer}))

        # If rank is greater than 0, creates trainable parameters for the adapter
        if r > 0:
            self.lora_embedding_A.update(
                nn.ParameterDict({adapter_name: nn.Parameter(self.weight.new_zeros((r, self.in_features)))})
            )
            self.lora_embedding_B.update(
                nn.ParameterDict({adapter_name: nn.Parameter(self.weight.new_zeros((self.out_features, r)))})
            )
            self.scaling[adapter_name] = lora_alpha / r

        # If init_lora_weights is True, resets the parameters of the adapter
        if init_lora_weights:
            self.reset_lora_parameters(adapter_name)

        # Moves the layer to the same device as the weight tensor
        self.to(self.weight.device)

    # Method to reset the parameters of an adapter
    def reset_lora_parameters(self, adapter_name):
        if adapter_name in self.lora_A.keys():
            # initialize A the same way as the default for nn.Linear and B to zero
            nn.init.kaiming_uniform_(self.lora_A[adapter_name].weight, a=math.sqrt(5))
            nn.init.zeros_(self.lora_B[adapter_name].weight)
        if adapter_name in self.lora_embedding_A.keys():
            # for embeddings, initialize A to zero and B from a normal distribution
            nn.init.zeros_(self.lora_embedding_A[adapter_name])
            nn.init.normal_(self.lora_embedding_B[adapter_name])

# LoRA implemented in an Embedding layer
class Embedding(nn.Embedding, LoraLayer):
    """
    The Embedding class is an extension of the PyTorch nn.Embedding class 
    and LoraLayer class to incorporate the LoRA method.
    """
    def __init__(
        self,
        adapter_name: str,
        num_embeddings: int,
        embedding_dim: int,
        r: int = 0,
        lora_alpha: int = 1,
        lora_dropout: float = 0.0,
        **kwargs,
    ):
        # Pop the init_lora_weights flag from kwargs
        init_lora_weights = kwargs.pop("init_lora_weights", True)

        # Call the constructors of the parent classes
        nn.Embedding.__init__(self, num_embeddings, embedding_dim, **kwargs)
        LoraLayer.__init__(self, in_features=num_embeddings, out_features=embedding_dim)

        # Freezing the pre-trained weight matrix
        self.weight.requires_grad = False

        # Reset the parameters of the Embedding layer and update it with the adapter
        nn.Embedding.reset_parameters(self)
        self.update_layer_embedding(adapter_name, r, lora_alpha, lora_dropout, init_lora_weights)

        # Set the active adapter
        self.active_adapter = adapter_name

    # Separate low-rank approximation from original weight
    def unmerge(self, mode: bool = True):
        # If the weights are already unmerged, raise a warning
        if not self.merged:
            warnings.warn("Already unmerged. Nothing to do.")
            return
        # If the rank of the active adapter is greater than 0, subtract the product of the LoRA weights
        # from the weights of the embedding
        if self.r[self.active_adapter] > 0:
            self.weight.data -= (
                transpose(
                    self.lora_embedding_B[self.active_adapter] @ self.lora_embedding_A[self.active_adapter], True
                )
                * self.scaling[self.active_adapter]
            )
            self.merged = False

    # Merge low-rank approximation with original weights
    def merge(self):
        # If the weights are already merged, raise a warning
        if self.merged:
            warnings.warn("Already merged. Nothing to do.")
            return
        # If the rank of the active adapter is greater than 0, add the product of the LoRA weights
        # to the weights of the embedding
        if self.r[self.active_adapter] > 0:
            self.weight.data += (
                transpose(
                    self.lora_embedding_B[self.active_adapter] @ self.lora_embedding_A[self.active_adapter], True
                )
                * self.scaling[self.active_adapter]
            )
            self.merged = True

    # Defines the computation performed at every call.
    def forward(self, x: torch.Tensor):
        # If adapters are disabled and the active adapter (rank > 0) is currently merged,
        # subtract the LoRA weights from the original weights and set merged to False
        if self.disable_adapters:
            if self.r[self.active_adapter] > 0 and self.merged:
                self.weight.data -= (
                    transpose(
                        # lora_embedding_A/B hold nn.Parameter objects, so they are used directly
                        self.lora_embedding_B[self.active_adapter]
                        @ self.lora_embedding_A[self.active_adapter],
                        True,
                    )
                    * self.scaling[self.active_adapter]
                )
                self.merged = False
            # Forward pass with the original weights
            return nn.Embedding.forward(self, x)

        # If there is an active adapter with rank > 0 and it is not merged
        elif self.r[self.active_adapter] > 0 and not self.merged:
            result = nn.Embedding.forward(self, x)
            # Compute the forward pass with the LoRA weights and add it to the result
            if self.r[self.active_adapter] > 0:
                after_A = F.embedding(
                    x,
                    self.lora_embedding_A[self.active_adapter].T,
                    self.padding_idx,
                    self.max_norm,
                    self.norm_type,
                    self.scale_grad_by_freq,
                    self.sparse,
                )
                result += (after_A @ self.lora_embedding_B[self.active_adapter].T) * self.scaling[self.active_adapter]
            return result
        else:
            return nn.Embedding.forward(self, x)


# Lora is implemented in a dense (Linear) layer
class Linear(nn.Linear, LoraLayer):
    
    def __init__(
        self,
        adapter_name: str,
        in_features: int,
        out_features: int,
        r: int = 0,
        lora_alpha: int = 1,
        lora_dropout: float = 0.0,
        fan_in_fan_out: bool = False,  # Set this to True if the layer to replace stores weight like (fan_in, fan_out)
        **kwargs,
    ):
        # Initialize weights for LoRA layer
        init_lora_weights = kwargs.pop("init_lora_weights", True)

        # Initialize linear and LoRA layers
        nn.Linear.__init__(self, in_features, out_features, **kwargs)
        LoraLayer.__init__(self, in_features=in_features, out_features=out_features)

        # Freezing the pre-trained weight matrix
        self.weight.requires_grad = False

        # Transpose the weight if the layer to replace stores weight like (fan_in, fan_out)
        self.fan_in_fan_out = fan_in_fan_out
        if fan_in_fan_out:
            self.weight.data = self.weight.data.T

        # Reset linear layer parameters and update LoRA layer
        nn.Linear.reset_parameters(self)
        self.update_layer(adapter_name, r, lora_alpha, lora_dropout, init_lora_weights)
        self.active_adapter = adapter_name

    def merge(self):
        # Merge low-rank approximation with original weights
        if self.active_adapter not in self.lora_A.keys():
            return
        if self.merged:
            warnings.warn("Already merged. Nothing to do.")
            return
        if self.r[self.active_adapter] > 0:
            # TODO: Merge the LoRA parameters by adding the product of lora_B weights and lora_A weights (after transposing 
            # if necessary) to the original weights, scaled by the LoRA scaling factor. After this operation, set the merged
            # flag to True.
    
            # The following is my answer to Question 1:
            # BEGIN CODE
            # Add the scaled product of the LoRA weights to the original weight
            self.weight.data += (
                transpose(
                    self.lora_B[self.active_adapter].weight @ self.lora_A[self.active_adapter].weight,
                    self.fan_in_fan_out,
                )
                * self.scaling[self.active_adapter]
            )
            self.merged = True
            # END CODE

    def unmerge(self):
        # Separate low-rank approximation from original weights
        if self.active_adapter not in self.lora_A.keys():
            return
        if not self.merged:
            warnings.warn("Already unmerged. Nothing to do.")
            return
        if self.r[self.active_adapter] > 0:
            self.weight.data -= (
                transpose(
                    self.lora_B[self.active_adapter].weight @ self.lora_A[self.active_adapter].weight,
                    self.fan_in_fan_out,
                )
                * self.scaling[self.active_adapter]
            )
            self.merged = False

    def forward(self, x: torch.Tensor):
        previous_dtype = x.dtype
        if self.active_adapter not in self.lora_A.keys():
            return F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
        if self.disable_adapters:
            if self.r[self.active_adapter] > 0 and self.merged:
                self.unmerge()
            result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
        elif self.r[self.active_adapter] > 0 and not self.merged:
            result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
            # Changing data type for ensuring consistency
            x = x.to(self.lora_A[self.active_adapter].weight.dtype)
            
            # TODO: If the LoRA adapter is active and not merged, add the output of the LoRA layers to the result. This involves
            # passing the input through lora_A, applying dropout, then passing it through lora_B. The output is scaled by the
            # LoRA scaling factor and added to the result.

            # The following is my answer to Question 2:
            # BEGIN CODE
            # Low-rank path: lora_A -> dropout -> lora_B, scaled and added to the frozen output
            lora_output = self.lora_B[self.active_adapter](
                self.lora_dropout[self.active_adapter](self.lora_A[self.active_adapter](x))
            )
            result += lora_output * self.scaling[self.active_adapter]
            # END CODE
        else:
            result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
        
        # Reverting to the previous data type
        result = result.to(previous_dtype)
        return result
    
def transpose(weight, fan_in_fan_out):
    # Helper function to transpose weights if required
    return weight.T if fan_in_fan_out else weight


In [None]:
# %load train.py
import os
import torch
from tqdm import tqdm

from peft import get_peft_model, LoraConfig
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, DataCollatorForSeq2Seq

from contextlib import nullcontext

from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group
import torch.distributed as dist
from torch.utils.data.distributed import DistributedSampler
from torch.utils.data import DataLoader, SequentialSampler


from lora_model import LoraModelForCasualLM
from utils.common import download_from_driver
from prepare_data import create_datasets

import warnings
warnings.filterwarnings('ignore')
torch.manual_seed(42)
torch.backends.cudnn.deterministic = True


class Trainer:
    def __init__(self,
                 model,
                 tokenizer,
                 gpu_id: int,
                 is_ddp_training: bool = True,
                 output_dir: str = 'checkpoints/',
                 num_epochs: int = 10,
                 max_length: int = 128,
                 batch_size: int = 8,
                 mixed_precision_dtype=None,
                 gradient_accumulation_steps: int = 16):
        """
        Initialize the Trainer class.

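        Args:
            model: Pretrained model object.
            tokenizer: Tokenizer object for text processing.
            gpu_id: GPU ID for training.
            is_ddp_training: Whether to train with DistributedDataParallel.
            output_dir: Output directory (defaults to 'checkpoints/').
            num_epochs: Number of training epochs.
            max_length: Maximum sequence length.
            batch_size: Training batch size.
            mixed_precision_dtype: dtype for mixed-precision training (None disables it).
            gradient_accumulation_steps: Number of batches over which gradients are accumulated.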
        """

        self.num_epochs = num_epochs
        self.max_length = max_length
        self.batch_size = batch_size
        self.output_dir = output_dir
        self.tokenizer = tokenizer
        self.is_ddp_training = is_ddp_training

        self.gpu_id = gpu_id
        self.model = model.to(f"cuda:{self.gpu_id}")
        self.gradient_accumulation_steps = gradient_accumulation_steps

        self.mixed_precision_dtype = mixed_precision_dtype
        self.ctx = None
        self.gradscaler = None

        # set mixed precision context
        self.set_mixed_precision_context(mixed_precision_dtype)

    def set_mixed_precision_context(self, mixed_precision_dtype):
        
        # TODO: Setup mixed precision training context
        if mixed_precision_dtype is None:
            # If 'mixed_precision_dtype' is None, use 'nullcontext',
            self.ctx = nullcontext()
        else:
            # TODO: Otherwise, use the 'torch.amp.autocast' context with the specified dtype,
            # and initialize a GradScaler if mixed_precision_dtype is float16.
            # Submission code
            self.ctx = torch.amp.autocast(device_type="cuda", dtype=mixed_precision_dtype)
            if mixed_precision_dtype == torch.float16:
                # Loss scaling is only needed for float16, whose narrow range underflows easily
                self.gradscaler = torch.cuda.amp.GradScaler()
        


    def _set_ddp_training(self):
        # TODO: Initialize the DistributedDataParallel wrapper for the model.
        # You would need to pass the model and specify the device IDs
        # and output device for the data parallelism.
        self.model = DDP(self.model, device_ids=[self.gpu_id], output_device=self.gpu_id) #Submission Code


    def _run_batch(self, batch):
        """
        Run a single training batch.

        Args:
            batch: Batch data.

        Returns:
            Loss value for the batch.
        """

        with self.ctx:
            outputs = self.model(**batch)
            loss = outputs.loss / self.gradient_accumulation_steps  # Normalize loss
        loss_val = loss.item()

        # TODO: If 'mixed_precision_dtype' is torch.float16, you have to modify the backward using the gradscaler.
        if self.mixed_precision_dtype == torch.float16:
            self.gradscaler.scale(loss).backward() #Submission code

        else:
            loss.backward()

        return loss_val

    def _run_epoch(self, train_dataloader, epoch):
        """
        Run a single training epoch.

        Args:
            train_dataloader: Training data loader.
            epoch: Current epoch number.

        Returns:
            Total loss value for the epoch.
        """

        epoch_loss = 0
        self.model.train()

        if _is_master_process():
            train_progress_bar = tqdm(
                train_dataloader, desc=f"Epoch {epoch + 1} [Training]", position=0, leave=False)
        else:
            train_progress_bar = train_dataloader

        # Add counter for gradient accumulation
        steps = 0
        self.optimizer.zero_grad()  # Reset gradients at the beginning of each epoch
        for step, batch in enumerate(train_progress_bar):
            steps += 1
            batch = {key: value.to(self.gpu_id)
                     for key, value in batch.items()}
            loss = self._run_batch(batch)
            epoch_loss += loss

            # Perform optimizer step and reset gradients after accumulating enough gradients
            if steps % self.gradient_accumulation_steps == 0:

                # If 'mixed_precision_dtype' is torch.float16, the gradient update step must go through the gradscaler.
                if self.mixed_precision_dtype == torch.float16:
                    self.gradscaler.step(self.optimizer)  # submission code: optimizer step
                    self.gradscaler.update()  # submission code: update the scale factor
                else:
                    self.optimizer.step()
                self.optimizer.zero_grad()

                torch.cuda.empty_cache()
        epoch_loss /= (len(train_dataloader) /
                       self.gradient_accumulation_steps)
        return epoch_loss

    def _save_checkpoint(self, epoch):
        path_dir = os.path.join(self.output_dir, f"epoch_{epoch}")

        # Create the checkpoint directory if it does not exist yet
        os.makedirs(path_dir, exist_ok=True)

        # Save the checkpoint; under DDP the wrapped model lives in 'self.model.module'
        if self.is_ddp_training:
            self.model.module.save_pretrained(path_dir)
        else:
            self.model.save_pretrained(path_dir)

        print("Checkpoint saved at", path_dir)

    def prepare_dataloader(self, train_dataset, eval_dataset):

        # TODO: Prepare the training DataLoader. Initialize 'DataLoader' with 'train_dataset'
        # and the appropriate 'batch_size'.
        # Depending on whether the training is distributed (is_ddp_training),
        # use 'DistributedSampler' for 'sampler' argument, else use 'None'.
        # Use 'DataCollatorForSeq2Seq' for 'collate_fn', passing 'tokenizer', padding settings and pad_to_multiple_of to 8, and return_tensors="pt"
        # Also add drop_last to True.

        data_trainloader = DataLoader(
            train_dataset,
            batch_size=self.batch_size,
            sampler=DistributedSampler(train_dataset) if self.is_ddp_training else None,
            collate_fn=DataCollatorForSeq2Seq(
                tokenizer=self.tokenizer, padding=True, pad_to_multiple_of=8, return_tensors="pt"
            ),
            drop_last=True,
        )  # Submission code

        # TODO: Prepare the evaluation DataLoader. Initialize 'DataLoader' with 'eval_dataset',
        # the appropriate 'batch_size', and 'SequentialSampler' for 'sampler'.
        # Use 'DataCollatorForSeq2Seq' for 'collate_fn', passing 'tokenizer', padding settings and pad_to_multiple_of to 8, and return_tensors="pt".
        # Also add drop_last to True.

        data_testloader = DataLoader(
            eval_dataset,
            batch_size=self.batch_size,
            sampler=SequentialSampler(eval_dataset),
            collate_fn=DataCollatorForSeq2Seq(
                tokenizer=self.tokenizer, padding=True, pad_to_multiple_of=8, return_tensors="pt"
            ),
            drop_last=True,
        )  # Submission code

        return data_trainloader, data_testloader

    def _eval(self, eval_dataloader, epoch: int):
        avg_loss = 0
        self.model.eval()
        if _is_master_process():
            eval_progress_bar = tqdm(
                eval_dataloader, desc=f"Epoch {epoch + 1} [Evaluation]", position=0, leave=False)
        else:
            eval_progress_bar = eval_dataloader


        for batch in eval_progress_bar:
            # Move the batch to this process's GPU, as in '_run_epoch'
            batch = {key: value.to(self.gpu_id)
                     for key, value in batch.items()}
            with self.ctx:
                with torch.no_grad():
                    outputs = self.model(**batch)
            avg_loss += outputs.loss.item()
        avg_loss = avg_loss/(len(eval_dataloader))
        return avg_loss

    def run(self, data_path: str, size_valid_set: float,  seed: int = 123):
        """
        Run the training process.

        Returns:
            None
        """
        # Load dataset
        train_dataset, eval_dataset = create_datasets(
            tokenizer=self.tokenizer,
            max_length=self.max_length,
            data_path=data_path,
            size_valid_set=size_valid_set,
            seed=seed,
        )

        train_dataloader, eval_dataloader = self.prepare_dataloader(
            train_dataset, eval_dataset)
    

        if self.is_ddp_training:
            self._set_ddp_training()

        self.optimizer = torch.optim.AdamW(
            self.model.parameters(), lr=learning_rate)

        for epoch in range(self.num_epochs):

            if self.is_ddp_training:
                train_dataloader.sampler.set_epoch(epoch)

            train_loss = self._run_epoch(train_dataloader, epoch)
            if self.is_ddp_training:
                dist.barrier() 
            if _is_master_process() or (epoch == self.num_epochs - 1):
                eval_loss = self._eval(
                    eval_dataloader=eval_dataloader, epoch=epoch)

                print(
                    f"epoch = {epoch+1} | avg_train_loss = {train_loss} | eval_loss = {eval_loss}")
            
            if _is_master_process():
                self._save_checkpoint(epoch=epoch+1)

def load_tokenizer_from_pretrained_model(model_path):

    config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
    architecture = config.architectures[0]
    # Tokenizers run on the CPU, so no device placement is needed here
    tokenizer = AutoTokenizer.from_pretrained(
        model_path, trust_remote_code=True, padding_side='right')
    tokenizer.pad_token = tokenizer.eos_token
    if _is_master_process():
        print('Completed to load config & tokenizer')

    if "Llama" in architecture:
        if _is_master_process():
            print("Setting EOS, BOS, UNK, and PAD tokens for LLama tokenizer")
        tokenizer.add_special_tokens(
            {
                "eos_token": "</s>",
                "bos_token": "</s>",
                "unk_token": "</s>",
            }
        )
        tokenizer.pad_token_id = (
            0  # unk. we want this to be different from the eos token
        )

    return tokenizer


def _is_master_process():
    ddp_rank = int(os.environ['RANK'])
    return ddp_rank == 0


def load_pretrained_model(local_rank, model_path: str = ""):
    # TODO: Load a pretrained AutoModelForCausalLM from the 'model_path'.
    # Make sure to set 'device_map' to '{"": torch.device(f"cuda:{local_rank}")}' for DDP training
    # and trust_remote_code=True.

    model = AutoModelForCausalLM.from_pretrained(
        model_path, trust_remote_code=True, device_map={"": torch.device(f"cuda:{local_rank}")}
    ) #submission code

    # TODO: Create a LoraConfig with the parameters: 
    # r=4, lora_alpha=8, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM"
    # We will then use the config to initialize a LoraModelForCasualLM with the loaded model.

    lora_config = LoraConfig(r=4, lora_alpha=8, lora_dropout=0.05, bias="none", task_type="CAUSAL_LM")  # submission code

    # TODO: Create LoRA model

    model = get_peft_model(model, lora_config) #submission code

    if _is_master_process():
        model.print_trainable_parameters()

    return model


if __name__ == "__main__":
    OUTPUT_DIR = "checkpoints/"

    backend = "nccl"
    model_path = 'Qwen/Qwen2.5-1.5B'
    
    if os.environ.get("DEBUG"):
        data_path = 'test_data.json'

    else:
        data_path = 'alpaca_gpt4_data.json'


    size_valid_set = 0.15
    max_length = 128
    num_epochs = 3
    batch_size = 2
    gradient_accumulation_steps = 8

    learning_rate = 3e-4
    lr_scheduler_type = 'cosine'
    num_warmup_steps = 100
    weight_decay = 0.06

    seed = 0
    log_freq = 1
    eval_freq = 150

    distributed_strategy = "ddp" if os.environ.get("ON_DDP") else "no"

    if distributed_strategy == "ddp":
        # TODO: Initialize the process group for distributed data parallelism with nccl backend.
        # After that, you should set the 'local_rank' from the environment variable 'LOCAL_RANK'.

        # Initialize the process group
        init_process_group(backend=backend) #submission code
        local_rank = int(os.environ['LOCAL_RANK']) #submission code
        
    else:
        os.environ['RANK'] = '0'
        local_rank = 0

    # Prepare model
    model = load_pretrained_model(local_rank, model_path=model_path)
    
    # Get tokenizer
    tokenizer = load_tokenizer_from_pretrained_model(model_path=model_path)

    # prepare trainer
    trainer = Trainer(
        model=model,
        num_epochs=num_epochs,
        max_length=max_length,
        batch_size=batch_size,
        gpu_id=local_rank,
        
        mixed_precision_dtype=torch.float16 if os.environ.get("ON_MP") else None,
        
        tokenizer=tokenizer,
        output_dir=OUTPUT_DIR,
        is_ddp_training=distributed_strategy == "ddp",
        gradient_accumulation_steps=gradient_accumulation_steps,
    )

    # execute trainer (the model is wrapped with DDP inside 'run' when DDP is enabled)
    trainer.run(
        data_path=data_path,
        size_valid_set=size_valid_set,
        seed=seed,
        
    )

    if distributed_strategy == "ddp":
        destroy_process_group()

trainable params: 544,768 || all params: 1,544,259,072 || trainable%: 0.0353


Completed to load config & tokenizer
Load dataset....



##### Train with sample data

In [None]:
# train with sample dataset
!DEBUG=true python train.py

##### Challenge 1: Will LoRA enhance inference speed? (5 points)

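*[No. LoRA mainly focuses on reducing the compute and storage cost of fine-tuning. Once the low-rank update is merged into the base weights, inference costs the same as for the original model; if the adapters are kept unmerged, they add a small overhead.]*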
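
A toy example can make the inference-cost argument concrete. The sketch below uses plain Python lists rather than the `peft` API, and the matrices, rank, and scaling are made-up values chosen only for illustration. It checks that folding the low-rank update into the frozen weight gives the same output as running the base path and the adapter path separately.

```python
# Minimal sketch: a LoRA layer computes W x + (alpha / r) * B (A x).
# Merging folds the update into W once, so inference needs a single matmul.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]

r, alpha = 1, 2                  # toy rank and scaling (hypothetical values)
W = [[1.0, 2.0], [3.0, 4.0]]     # frozen pretrained weight, shape (2, 2)
A = [[0.5, -1.0]]                # LoRA down-projection, shape (r, 2)
B = [[2.0], [1.0]]               # LoRA up-projection, shape (2, r)
x = [[1.0], [2.0]]               # input column vector

# Unmerged: base path plus adapter path (two extra small matmuls per call).
unmerged = add(matmul(W, x), scale(matmul(B, matmul(A, x)), alpha / r))

# Merged: fold the low-rank update into W once, then one matmul per call.
W_merged = add(W, scale(matmul(B, A), alpha / r))
merged = matmul(W_merged, x)

print(unmerged, merged)  # both are [[-1.0], [8.0]]
```

The merged layer has exactly the shape of the original one, which is why LoRA by itself neither speeds up nor slows down inference after merging.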

##### Challenge 2: Will LoRA improve training speed? (5 points)

*[Yes. LoRA improves training speed by freezing the pretrained weights and training only a small set of low-rank matrices, so far fewer gradients and optimizer states have to be computed and stored.]*
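
To get a feel for how small the trainable part is, the sketch below recomputes the `trainable params: 544,768` figure from the training log. It rests on assumptions that are not stated in this notebook: that LoRA is attached only to the `q_proj` and `v_proj` layers, and that Qwen2.5-1.5B has 28 layers, a hidden size of 1536, and a key/value projection width of 256. `lora_param_count` is a hypothetical helper, not part of `peft`.

```python
def lora_param_count(in_features, out_features, r):
    """LoRA adds A with shape (r, in_features) and B with shape (out_features, r)."""
    return r * (in_features + out_features)

# Assumed model shapes (see the note above); r matches the LoraConfig in train.py.
hidden_size = 1536
kv_dim = 256
num_layers = 28
r = 4

per_layer = (
    lora_param_count(hidden_size, hidden_size, r)   # q_proj adapter
    + lora_param_count(hidden_size, kv_dim, r)      # v_proj adapter
)
trainable = per_layer * num_layers
total = 1_544_259_072  # "all params" value reported in the training log

print(f"trainable params: {trainable:,} || trainable%: {100 * trainable / total:.4f}")
```

Under these assumptions the result matches the log line, and the optimizer only has to track about 0.035% of the parameters.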

## Part 2: Mixed precision training

The paper "Mixed Precision Training" has had a major impact on deep learning. It introduces a method that combines different numerical precisions (such as 32-bit and 16-bit) during model training. By using lower precision for the most expensive parts of training, such as the forward and backward passes, while keeping a 32-bit master copy of the weights for the updates, we can speed up computation and reduce memory requirements with little or no loss in accuracy. This technique takes advantage of the fast 16-bit arithmetic of modern GPUs and accelerators.
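
One reason `train.py` pairs `torch.float16` with a `GradScaler` is that half precision has a narrow range, so very small gradients round to zero. The sketch below illustrates the idea with the standard library only: `to_fp16` is a hypothetical helper that rounds a Python float to half precision, and the gradient value and scale factor are made up for the example (a real `GradScaler` adjusts its scale dynamically).

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]

grad = 1e-8            # hypothetical tiny gradient value
scale = 2.0 ** 16      # hypothetical loss-scale factor

# Without scaling, the gradient is below the smallest float16 value and underflows.
unscaled = to_fp16(grad)

# Scaling the loss scales every gradient by the same factor, lifting it into range.
scaled = to_fp16(grad * scale)

# Before the optimizer step the gradient is divided by the scale again,
# in higher precision, which recovers (approximately) the original value.
recovered = scaled / scale

print(unscaled)   # 0.0
print(recovered)  # close to 1e-8
```

This mirrors the order of operations in the trainer: scale the loss before `backward()`, then let the scaler unscale the gradients when it performs the optimizer step.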

### Implementation

#### Q4: Implement Mixed Precision Training (15 points)
In the provided `train.py` file, your objective is to enable mixed precision training. To achieve this, complete the assignment of `mixed_precision_dtype`, `self.ctx` and `self.gradscaler`. You may also have to modify the `_run_batch` and `_run_epoch` methods to use `self.gradscaler` when `mixed_precision_dtype` is `torch.float16`. If you paid close attention to the coding session during week 6, you should find this task straightforward.

Once you have carried out these steps, execute the following cell to train your LoRA model with mixed precision training. You should observe a significant speed improvement in training.

##### Train with sample data

In [None]:
# mixed precision training with sample dataset
!DEBUG=true ON_MP=true python train.py 

## Part 3: Distributed Training with DistributedDataParallel

When it comes to training large language models, like those used for NLP tasks, the computational requirements can be ridiculously expensive. These models often have billions of parameters and require vast amounts of data to train effectively. This is where distributed training, and more specifically DistributedDataParallel (DDP), comes into play.

Training large language models on a single GPU can be extremely time-consuming. DDP lets us train across multiple GPUs, and even across several machines, which shortens training and increases the effective batch size. Note that DDP keeps a full copy of the model on every GPU, so the model itself must still fit in the memory of a single device.

DDP replicates the model on each GPU and gives every replica its own subset of the data, so the replicas train in parallel. This significantly reduces the time required to train large models. During each backward pass, the gradients are averaged across all replicas, so every replica applies the same update and the copies of the model stay identical.
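
The synchronization step can be illustrated without any GPUs. The sketch below is a pure-Python simulation, not a use of `torch.distributed`: two simulated ranks each compute the gradient of a mean-squared-error loss on their own shard of a toy dataset, and averaging the two local gradients gives the same value as the gradient over the full batch. The model (a single scalar weight) and the data are made up for the example.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5                       # every replica starts from the same weight
xs = [1.0, 2.0, 3.0, 4.0]     # toy inputs
ys = [2.0, 4.0, 6.0, 8.0]     # toy targets

# Each of the two simulated ranks receives an equal-sized shard of the data.
world_size = 2
shards = [(xs[rank::world_size], ys[rank::world_size]) for rank in range(world_size)]

# Local backward pass on each rank.
local_grads = [grad_mse(w, shard_x, shard_y) for shard_x, shard_y in shards]

# Averaging across ranks plays the role of the gradient all-reduce.
synced_grad = sum(local_grads) / world_size

# Reference: gradient computed on the full batch by a single process.
full_grad = grad_mse(w, xs, ys)

print(local_grads, synced_grad, full_grad)  # [-15.0, -30.0] -22.5 -22.5
```

Because every rank ends up with the same averaged gradient, applying the same optimizer step on each rank keeps the replicas in sync. The equality holds exactly here because the shards have equal size.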

In this section, we will utilize DistributedDataParallel for training large language models. Let's dive in!

### Implementation

#### Q5: Setup environment for DDP (25 points)

In the provided `train.py` file, your mission is to enable distributed training utilizing the `DistributedDataParallel` (DDP) module from PyTorch. This task involves completing the `distributed_strategy == "ddp"` branch (initializing the process group and setting the local rank) and filling out the `_set_ddp_training` function. Furthermore, you are required to adapt the `load_pretrained_model` function and the `prepare_dataloader` method to be compatible with DDP training.
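
As a rough illustration of what `DistributedSampler` contributes in `prepare_dataloader`, the sketch below mimics its index split for the non-shuffled case. `shard_indices` is a hypothetical helper written for this example; it is a simplification and leaves out shuffling, `set_epoch`, and the sampler's own `drop_last` option.

```python
import math

def shard_indices(dataset_len, num_replicas, rank):
    """Return the dataset indices that one rank would iterate over."""
    # Every rank must see the same number of samples.
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas

    # Pad by repeating indices from the start so the total divides evenly.
    indices = list(range(dataset_len))
    indices += indices[: total_size - dataset_len]

    # Rank k takes every num_replicas-th index, starting at position k.
    return indices[rank:total_size:num_replicas]

for rank in range(2):
    print(rank, shard_indices(5, num_replicas=2, rank=rank))
# 0 [0, 2, 4]
# 1 [1, 3, 0]
```

Each rank therefore sees a different slice of the data, which is why the training `DataLoader` uses this sampler under DDP and why `set_epoch` is called in `run` to vary the shuffle between epochs.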

Once you have carried out these steps, complete the `torchrun` command in the following cell and execute it to train your LoRA model on Kaggle's GPU T4 x2.
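
`torchrun` tells each process who it is through environment variables such as `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`, which is what `_is_master_process` and the `LOCAL_RANK` lookup in `train.py` rely on. The sketch below shows one way to read them with single-process fallbacks. `read_dist_env` is a hypothetical helper, and the dictionary stands in for the environment of the second of two processes.

```python
import os

def read_dist_env(environ=os.environ):
    """Read rank information set by torchrun, falling back to a single process."""
    rank = int(environ.get("RANK", 0))              # global rank across all nodes
    local_rank = int(environ.get("LOCAL_RANK", 0))  # rank within this node (GPU index)
    world_size = int(environ.get("WORLD_SIZE", 1))  # total number of processes
    return rank, local_rank, world_size

# Simulated environment of the second process on a single node with two GPUs.
fake_env = {"RANK": "1", "LOCAL_RANK": "1", "WORLD_SIZE": "2"}
rank, local_rank, world_size = read_dist_env(fake_env)

# Values from train.py: per-process batch size and gradient accumulation steps.
batch_size = 2
gradient_accumulation_steps = 8
effective_batch_size = batch_size * gradient_accumulation_steps * world_size

print(rank, local_rank, world_size, effective_batch_size)  # 1 1 2 32
```

With two processes, each optimizer step in this configuration aggregates gradients from 32 examples, twice as many as a single-GPU run with the same settings.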

In [None]:
# distributed training with sample dataset

# TODO Fill in blank "..."
#!DEBUG=true ON_DDP=true ON_MP=true torchrun ... train.py

# Single node (node rank 0), two processes (one per T4 GPU), rendezvous on 127.0.0.1:29500
!DEBUG=true ON_DDP=true ON_MP=true torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port=29500 train.py


In [None]:
# distributed training with full dataset

# TODO Fill in blank "..."
#!ON_DDP=true ON_MP=true torchrun ... train.py

# Single node (node rank 0), two processes (one per T4 GPU), rendezvous on 127.0.0.1:29500
!ON_DDP=true ON_MP=true torchrun --nproc_per_node=2 --nnodes=1 --node_rank=0 --master_addr="127.0.0.1" --master_port=29500 train.py


### Inference

Once the training phase concludes, we can use the following code to try out our model and generate responses to some instructions. Let's give it a try!

from inference import generate_inference

model_path = "Qwen/Qwen2.5-1.5B"
lora_weights_path = ...  # TODO: fill folder path
instruction = ...  # TODO: fill instruction
user_inp = ...  # TODO: fill input

outputs = generate_inference(instruction=instruction, user_inp=user_inp, model_path=model_path, lora_weights_path=lora_weights_path)
print(outputs)

In [None]:
from inference import generate_inference

# Path to the model
model_path = "Qwen/Qwen2.5-1.5B"

# TODO: Fill folder path for the LoRA weights
lora_weights_path = "/kaggle/working/Advanced-NLP05/assignment_02"  # Replace with the actual path where LoRA weights are stored

# TODO: Define the instruction for the model
instruction = "Provide a detailed response to the user's query."

# TODO: Provide the user input
user_inp = "Explain the process of distributed training with PyTorch's DistributedDataParallel (DDP)."

# Generate inference
outputs = generate_inference(
    instruction=instruction, 
    user_inp=user_inp, 
    model_path=model_path, 
    lora_weights_path=lora_weights_path
)

print(outputs)
