# **Parameter Efficient Fine-Tuning**
In this hands-on tutorial, we explore Parameter-Efficient Fine-Tuning (PEFT) and see how it can make fine-tuning faster and less memory-hungry. PEFT trains only a subset of the model's parameters, or a small adapter added to the model, instead of updating every weight.

To demonstrate the effectiveness of PEFT, we reuse the training scenarios from the first SpeLLM hands-on tutorial. Comparing training time and GPU memory consumption with and without PEFT gives insight into the benefits and trade-offs of each approach.

In [None]:
from pathlib import Path
import os
import datasets
from transformers import AutoTokenizer, AutoModelForSequenceClassification, AutoModelForCausalLM, AutoModelForSeq2SeqLM
from torch.utils.data import Dataset, DataLoader, IterableDataset
import torch
from tqdm.notebook import tqdm
import random
from utils import seed_everything
import re
from peft import LoraConfig, get_peft_model

DSDIR = Path(os.environ['DSDIR'])
os.environ['TOKENIZERS_PARALLELISM'] = 'false'
seed_everything(53)

## **Freezing Layers: RoBERTa on classification task**
In this section, we freeze layers of the RoBERTa model on a classification task (the first part of the first SpeLLM hands-on). Freezing a layer means preventing its parameters from being updated during training. This is useful when we want to fine-tune only specific parts of the model or when computational resources are limited.

In [None]:
# Initialize the model and its tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    DSDIR / "HuggingFace_Models/FacebookAI/roberta-base", num_labels=2
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(DSDIR / "HuggingFace_Models/FacebookAI/roberta-base")

In [None]:
# Load the dataset
imdb_dataset = datasets.load_from_disk(DSDIR / "HuggingFace/imdb/plain_text")

class IMDBDataset(Dataset):
    def __init__(self, hf_dataset):
        self.hf_dataset = hf_dataset
        
    def __len__(self) -> int:
        """Return the number of elements in the dataset"""
        return len(self.hf_dataset)
    
    def __getitem__(self, idx) -> tuple[str, int]:
        """Return the input for the model and the label for the loss"""
        hf_element = self.hf_dataset[idx]
        
        model_inp = hf_element["text"]
        label = hf_element["label"]
        
        return model_inp, label

def collate_fn(batch):
    text_list = [element[0] for element in batch]
    label_list = [element[1] for element in batch]
    
    model_inp = tokenizer(
        text_list, return_tensors="pt", padding=True, truncation=True, max_length=512
    )
    label_tens = torch.LongTensor(label_list)
    return model_inp, label_tens

dataset = IMDBDataset(imdb_dataset["train"])
dataloader = DataLoader(
    dataset,
    batch_size=16,
    num_workers=4,
    prefetch_factor=2,
    shuffle=True,
    collate_fn=collate_fn
)

In [None]:
# Initialize training
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)


def train_loop(model, dataloader, criterion, optimizer, test=False):
    model.train()
    # tqdm for a nice progress bar
    loop = tqdm(dataloader)

    for i, (model_inp, labels) in enumerate(loop):
        optimizer.zero_grad()

        model_inp = model_inp.to("cuda")
        labels = labels.to("cuda")

        out = model(**model_inp)

        loss = criterion(out.logits, labels)

        loss.backward()
        optimizer.step()

        # print next to progress bar
        loop.set_postfix(loss=loss.item())

        if i >= 50 and test:
            loop.close()
            break

    return model

To estimate the duration of one epoch and the GPU memory consumption of full fine-tuning, as in the first hands-on, let's run a few iterations.

In [None]:
train_loop(model, dataloader, criterion, optimizer, test=True)
print(f"Max memory: {torch.cuda.max_memory_allocated(device='cuda') / 1024**3:.2f} GiB")
torch.cuda.reset_peak_memory_stats()
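
The short run above can be extrapolated to a full epoch. The following is a minimal sketch, assuming the time per iteration stays roughly constant over the epoch. The helper `estimate_epoch_seconds` and the elapsed time are hypothetical; 25,000 is the size of the IMDB training split, and 51 is the number of iterations executed when `test=True`.

```python
import math


def estimate_epoch_seconds(elapsed_seconds, iterations_done, dataset_size, batch_size):
    """Extrapolate the duration of one epoch from a short timed run."""
    batches_per_epoch = math.ceil(dataset_size / batch_size)
    seconds_per_iteration = elapsed_seconds / iterations_done
    return batches_per_epoch * seconds_per_iteration


# Hypothetical measurement: the 51 test iterations took 20.4 seconds.
estimate = estimate_epoch_seconds(
    elapsed_seconds=20.4, iterations_done=51, dataset_size=25_000, batch_size=16
)
print(f"Estimated epoch time: {estimate / 60:.1f} min")
```

In practice the elapsed time can be read from the tqdm progress bar at the end of the short run.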

To optimize GPU memory usage, we can freeze specific parts of the model. Before doing so, let's examine the model's architecture.

In [None]:
model

In [None]:
model.roberta.encoder.layer[0]

We can inspect the parameters of a specific layer using the following code snippet:

In [None]:
model.roberta.encoder.layer[0].attention.self.query.weight

`requires_grad=True` indicates that gradients will be computed for this tensor during training, which allows its values to be updated. To freeze a parameter, we set its `requires_grad` attribute to `False`.

In [None]:
model.roberta.encoder.layer[0].attention.self.query.weight.requires_grad = False

We can define a function that freezes all parameters of a HuggingFace model up to a specified layer. This function utilizes regular expressions to target the embedding parameters and all layer parameters leading up to the specified layer.

In [None]:
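def freeze_layers(model, nb_freeze_layer=4):
    """Freeze the embeddings and the first `nb_freeze_layer` layers of the model."""
    for name, params in model.named_parameters():
        if re.search(r"embed", name) is not None:
            # Embedding parameters are always frozen
            params.requires_grad = False
            continue
        # Layer parameters carry their (zero-based) index in their name
        match = re.search(r"\.(\d+)\.", name)
        if match is not None and int(match.group(1)) < nb_freeze_layer:
            params.requires_grad = False

    return model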
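
To see which parameters the naming rule selects, the same two regular expressions can be applied to plain strings. This is a sketch only: `should_freeze` is a hypothetical helper, and the names below are illustrative examples written in the style of RoBERTa parameter names, not a dump of the real model.

```python
import re


def should_freeze(name, nb_freeze_layer):
    """Return True if a parameter with this name falls under the freezing rule."""
    if re.search(r"embed", name) is not None:
        return True
    match = re.search(r"\.(\d+)\.", name)
    return match is not None and int(match.group(1)) < nb_freeze_layer


# Illustrative names (assumed), following the RoBERTa naming style.
names = [
    "roberta.embeddings.word_embeddings.weight",
    "roberta.encoder.layer.0.attention.self.query.weight",
    "roberta.encoder.layer.7.output.dense.bias",
    "roberta.encoder.layer.8.attention.self.query.weight",
    "classifier.dense.weight",
]

for name in names:
    status = "frozen" if should_freeze(name, nb_freeze_layer=8) else "trainable"
    print(f"{status:9s} {name}")
```

Layer indices are zero-based, so `nb_freeze_layer=8` covers layers 0 to 7 and leaves layer 8 and the classification head trainable.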

In [None]:
model = freeze_layers(model, nb_freeze_layer=8)

In [None]:
model.roberta.encoder.layer[0].attention.self.query.weight.requires_grad

In [None]:
model.roberta.encoder.layer[7].attention.self.query.weight.requires_grad

In [None]:
model.roberta.encoder.layer[8].attention.self.query.weight.requires_grad

To estimate the time required for one epoch and the GPU memory consumption during training, let's run a few iterations again.

In [None]:
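# Only hand the parameters that are still trainable to the optimizer
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)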

In [None]:
train_loop(model, dataloader, criterion, optimizer, test=True)
print(f"Max memory: {torch.cuda.max_memory_allocated(device='cuda') / 1024**3:.2f} GiB")
torch.cuda.reset_peak_memory_stats()
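
A rough back-of-the-envelope estimate helps explain why freezing lowers memory usage. With Adam, each trainable parameter needs its weight, its gradient, and two moment estimates, whereas a frozen parameter only needs its weight. The sketch below assumes fp32 storage and ignores activations, which also contribute to the measured peak. The function `static_training_bytes` and both parameter counts are hypothetical round numbers, not measurements.

```python
def static_training_bytes(total_params, trainable_params, bytes_per_value=4):
    """Rough memory for weights, gradients and Adam states (activations excluded)."""
    weights = total_params * bytes_per_value
    gradients = trainable_params * bytes_per_value
    adam_states = 2 * trainable_params * bytes_per_value  # first and second moments
    return weights + gradients + adam_states


GIB = 1024**3
total = 125_000_000  # assumed order of magnitude for a base-size encoder

full = static_training_bytes(total, trainable_params=total)
partial = static_training_bytes(total, trainable_params=30_000_000)  # hypothetical

print(f"All parameters trainable: {full / GIB:.2f} GiB")
print(f"Most layers frozen:       {partial / GIB:.2f} GiB")
```

The measured difference is usually larger than this estimate suggests, because autograd does not need to keep activations for layers located before the first trainable one.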

## **LoRA: Phi-2 as a chatbot**
In this section, we apply LoRA (Low-Rank Adaptation) to Phi-2 to fine-tune it efficiently on the roleplay dataset. LoRA freezes the pretrained weights and trains only small low-rank adapter matrices added to selected layers, which makes training faster and more resource-efficient. Applying LoRA to Phi-2 lets us keep its pretrained language capabilities while adapting it to a chatbot task.
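
The idea behind a LoRA adapter can be sketched in plain Python. For a frozen weight matrix `W`, the adapted layer computes `W x + (alpha / r) * B (A x)`, where `A` and `B` are the only trainable matrices. The code below is an illustrative toy with hypothetical dimensions, not the PEFT implementation; `B` starts at zero, as in the LoRA paper, so the adapted layer initially reproduces the frozen one.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha):
    """Compute W x + (alpha / r) * B (A x), with r the number of rows of A."""
    r = len(A)
    base = matvec(W, x)               # frozen pretrained layer
    update = matvec(B, matvec(A, x))  # low-rank path: d_in -> r -> d_out
    return [b + (alpha / r) * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r, alpha = 16, 16, 2, 8  # toy sizes, chosen for illustration

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]      # trainable
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, starts at zero
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B at zero the adapter contributes nothing yet.
print(lora_forward(W, A, B, x, alpha) == matvec(W, x))

full_params = d_in * d_out
lora_params = r * (d_in + d_out)
print(f"full layer: {full_params} weights, adapter: {lora_params} weights")
```

The saving grows with the layer size: the adapter scales with `d_in + d_out`, while the full matrix scales with `d_in * d_out`.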

In [None]:
# Initialize the model and its tokenizer
model = AutoModelForCausalLM.from_pretrained(
    DSDIR / "HuggingFace_Models/microsoft/phi-2",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # Allow using code that was not written by HuggingFace
    attn_implementation="flash_attention_2"  # Optimize the model with Flash Attention
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(DSDIR / "HuggingFace_Models/microsoft/phi-2")

In [None]:
# Load the dataset
roleplay_dataset = datasets.load_from_disk(DSDIR / "HuggingFace/hieunguyenminh/roleplay")


def count_tokens(hf_dataset, tokenizer):
    total_tokens = 0
    loop = tqdm(hf_dataset)
    for element in loop:
        nb_token_element = len(tokenizer(element['text'])["input_ids"])
        total_tokens += nb_token_element
        
        loop.set_postfix(tokens_count=total_tokens)
        
    return total_tokens


nb_tokens = count_tokens(roleplay_dataset['train'], tokenizer)


class RoleplayDataset(IterableDataset):

    def __init__(self, tokenizer, hf_dataset, seq_length=1024, nb_tokens=3160542):
        self.tokenizer = tokenizer
        self.separator = tokenizer.eos_token_id  # The token that will separate different samples
        self.hf_dataset = hf_dataset
        self.idx_iterator = iter(random.sample(range(len(hf_dataset)), len(hf_dataset)))
        self.seq_length = seq_length
        self.nb_tokens = nb_tokens
    
    def __len__(self):
        return self.nb_tokens // self.seq_length

    def get_next_sample(self):
        """Retrieve the next sample from the dataset and tokenize it."""
        idx = next(self.idx_iterator)
        text = self.hf_dataset[idx]["text"]
        return self.tokenizer(text)['input_ids'] + [self.separator]

    def __iter__(self):
        next_sample_ids = None
        all_token_ids = []
        idx = 0

        while idx < self.__len__():
            if next_sample_ids is None:
                next_sample_ids = self.get_next_sample()

            if len(all_token_ids) + len(next_sample_ids) <= self.seq_length:
                # if the next HF_dataset sample can fit in the current dataset sample
                # we add it
                all_token_ids += next_sample_ids
                next_sample_ids = None
                
            else:
                # if the next HF_dataset sample can't fit in the current dataset
                # sample, we add what we can in the dataset sample and then we yield it
                # note: we take one token more than seq_length so that the shifted
                # inputs ([:-1]) and labels ([1:]) below both have seq_length tokens
                idx_break = self.seq_length - len(all_token_ids)
                all_token_ids += next_sample_ids[: idx_break + 1]
                next_sample_ids = next_sample_ids[idx_break + 1 :]
                
                model_inp = torch.tensor(all_token_ids[:-1], dtype=torch.int64)
                labels = torch.tensor(all_token_ids[1:], dtype=torch.int64) 
                yield model_inp, labels

                all_token_ids = []
                idx += 1


dataset = RoleplayDataset(tokenizer, roleplay_dataset['train'], seq_length=512, nb_tokens=nb_tokens)
dataloader = DataLoader(
    dataset,
    batch_size=4,
    num_workers=1,
    prefetch_factor=2,
)
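
The packing logic of `RoleplayDataset` can be illustrated on plain lists of integers. This is a simplified sketch: it skips shuffling and the cap on the number of yielded sequences, and `pack_sequences` is a hypothetical helper. Tokenized samples are concatenated with a separator, then cut into chunks of `seq_length + 1` tokens, from which inputs and labels are obtained by a one-token shift.

```python
def pack_sequences(samples, seq_length, separator):
    """Pack tokenized samples into (inputs, labels) pairs of length seq_length."""
    stream = []
    for sample in samples:
        stream.extend(sample + [separator])

    pairs = []
    # Each chunk holds seq_length + 1 tokens so that the shifted inputs and
    # labels both have exactly seq_length tokens. Leftover tokens are dropped.
    for start in range(0, len(stream) - seq_length, seq_length + 1):
        chunk = stream[start : start + seq_length + 1]
        pairs.append((chunk[:-1], chunk[1:]))
    return pairs


# Toy "token ids"; 0 plays the role of the end-of-sequence separator.
samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
pairs = pack_sequences(samples, seq_length=4, separator=0)

for inputs, labels in pairs:
    print(inputs, "->", labels)
```

Each label is the token that follows the corresponding input position, which is what a causal language model is trained to predict.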

In [None]:
# Prepare training
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-6)


def prepare_for_loss(logits, labels):
    """Flatten the batch and sequence dimensions so that CrossEntropyLoss receives (N, vocab) logits and (N,) labels"""
    batch_size, seq_length, vocab_size = logits.shape
    logits = logits.view(batch_size * seq_length, vocab_size)
    labels = labels.view(batch_size * seq_length)
    return logits, labels


def train_loop(model, dataloader, criterion, optimizer, test=False):
    model.train()
    # tqdm for a nice progress bar
    loop = tqdm(dataloader)

    for i, (model_inp, labels) in enumerate(loop):
        optimizer.zero_grad()

        model_inp = model_inp.to("cuda")
        labels = labels.to("cuda")

        logits = model(model_inp).logits

        logits, labels = prepare_for_loss(logits, labels)
        loss = criterion(logits, labels)

        loss.backward()
        optimizer.step()

        # print next to progress bar
        loop.set_postfix(loss=loss.item())

        if i >= 50 and test:
            loop.close()
            break

    return model
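
What `prepare_for_loss` followed by `CrossEntropyLoss` computes can be written out in plain Python: the batch and sequence dimensions are merged, and the negative log-probability of the correct next token is averaged over all positions. This is a sketch for intuition, assuming the default mean reduction; `mean_next_token_loss` is a hypothetical helper operating on nested lists instead of tensors.

```python
import math


def mean_next_token_loss(logits, labels):
    """Mean cross-entropy over every (batch, position) pair.

    logits: nested list of shape [batch][seq][vocab]
    labels: nested list of shape [batch][seq]
    """
    # Merge batch and sequence dimensions, as view() does in prepare_for_loss.
    flat_logits = [position for sequence in logits for position in sequence]
    flat_labels = [label for sequence in labels for label in sequence]

    total = 0.0
    for scores, label in zip(flat_logits, flat_labels):
        log_norm = math.log(sum(math.exp(s) for s in scores))
        total += log_norm - scores[label]  # -log softmax(scores)[label]
    return total / len(flat_labels)


# Uniform scores over a vocabulary of 4 tokens give a loss of log(4).
logits = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]
labels = [[0, 1, 2], [3, 0, 1]]
print(mean_next_token_loss(logits, labels), math.log(4))
```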

In [None]:
train_loop(model, dataloader, criterion, optimizer, test=True)
print(f"Max memory: {torch.cuda.max_memory_allocated(device='cuda') / 1024**3:.2f} GiB")
torch.cuda.reset_peak_memory_stats()

Before applying LoRA to Phi-2, we need to identify the linear layers we want to target. A common choice is the linear layers of the attention mechanism that project the input into the query (Q), key (K), and value (V) matrices.

In [None]:
model
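
Once the module names are visible in the printed architecture, choosing targets amounts to matching name suffixes. The sketch below works on hypothetical module names for a single decoder layer (the real names come from the printout) and mimics, as an assumption about PEFT's behavior, how a list of target names is compared with full module names: exact match, or match on the last dotted component.

```python
def select_targets(module_names, target_suffixes):
    """Keep the module names equal to a target or ending with '.<target>'."""
    return [
        name
        for name in module_names
        if any(name == t or name.endswith("." + t) for t in target_suffixes)
    ]


# Hypothetical module names for one decoder layer.
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.dense",
    "model.layers.0.mlp.fc1",
    "model.layers.0.mlp.fc2",
]

selected = select_targets(module_names, ["q_proj", "k_proj", "v_proj"])
print(selected)
```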

Now we can apply LoRA to the model. Here, `r` is the rank of the LoRA adapters, `lora_alpha` scales the adapter output, `lora_dropout` is the dropout applied on the adapter path, and `bias` selects which bias terms are trained (we use common values here).

In [None]:
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.config.use_cache = False  # The KV cache only speeds up generation; it is not needed during training
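
The number of weights added by this configuration can be estimated by hand. The sketch below assumes a hidden size of 2560 and 32 decoder layers for Phi-2, and square Q, K, and V projections; these dimensions are assumptions to check against the printed architecture. `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(d_in, d_out, r):
    """Number of weights in one LoRA adapter: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)


# Assumed dimensions: hidden size 2560 and 32 decoder layers.
hidden_size, num_layers, r = 2560, 32, 8
targets_per_layer = 3  # q_proj, k_proj, v_proj

adapter_params = (
    num_layers * targets_per_layer * lora_param_count(hidden_size, hidden_size, r)
)
# Weights of the targeted base layers (biases not counted)
targeted_params = num_layers * targets_per_layer * hidden_size * hidden_size

print(f"LoRA weights:          {adapter_params:,}")
print(f"Targeted base weights: {targeted_params:,}")
print(f"Ratio: {adapter_params / targeted_params:.2%}")
```

With `r=8` and `lora_alpha=32`, the adapter output is scaled by `lora_alpha / r = 4`. On the actual model, `model.print_trainable_parameters()` reports the exact trainable and total parameter counts.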

To estimate the time required for one epoch and the GPU memory consumption during training, let's run a few more iterations.

In [None]:
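# Only the LoRA adapter parameters are trainable; the base model is frozen
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=5e-6
)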

In [None]:
train_loop(model, dataloader, criterion, optimizer, test=True)
print(f"Max memory: {torch.cuda.max_memory_allocated(device='cuda') / 1024**3:.2f} GiB")
torch.cuda.reset_peak_memory_stats()