## Setup and Imports

This cell sets up the environment and imports necessary libraries for the project.

- Installs required packages (accelerate, transformers, datasets, peft, pyboxen)
- Imports necessary Python libraries (os, torch, datetime, datasets, warnings)
- Imports specific modules from transformers and peft
- Suppresses warnings to keep the output clean

## Introduction to Parameter Efficient Fine-Tuning


As large language models (LLMs) like GPT-3.5, LLaMA2, and PaLM2 increase in size, fine-tuning them for specific NLP tasks becomes more resource-intensive.

Parameter-Efficient Fine-Tuning (PEFT) reduces computational and memory demands by adjusting only a small set of parameters while keeping most of the model frozen. Because the pretrained weights stay fixed, the risk of overwriting what the model learned during pretraining is lower, and fine-tuning is possible with modest compute.

The modular design of PEFT allows the same pretrained model to be used for multiple tasks by adding small, task-specific parameters, avoiding the need for storing full model copies.

The PEFT library simplifies the process by integrating techniques such as LoRA, Prefix Tuning, AdaLoRA, and more with popular tools like Transformers and Accelerate, enabling scalable fine-tuning of large models.


In [2]:
! pip install --upgrade transformers



In [3]:
! pip install -q accelerate pyboxen datasets==2.17.0 peft==0.4.0


In [4]:
import os
import torch
from datetime import datetime
from datasets import load_dataset
import warnings
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import prepare_model_for_kbit_training, PeftModel, LoraConfig, get_peft_model

warnings.filterwarnings('ignore')

## Download Dataset

In [5]:
import requests

# URLs for the raw files
test_url = "https://raw.githubusercontent.com/initmahesh/MLAI-community-labs/main/Class-Labs/Lab-3(Fine-tuning-PEFT-LoRA)/formatted_test_set.jsonl"
train_url = "https://raw.githubusercontent.com/initmahesh/MLAI-community-labs/main/Class-Labs/Lab-3(Fine-tuning-PEFT-LoRA)/formatted_train_set.jsonl"

# Download the files
test_file_path = "/content/formatted_test_set.jsonl"
train_file_path = "/content/formatted_train_set.jsonl"

# Download and save the test file
response_test = requests.get(test_url)
response_test.raise_for_status()  # stop early if the download failed
with open(test_file_path, "wb") as f_test:
    f_test.write(response_test.content)

# Download and save the train file
response_train = requests.get(train_url)
response_train.raise_for_status()  # stop early if the download failed
with open(train_file_path, "wb") as f_train:
    f_train.write(response_train.content)

## Load Model and Tokenizer

This cell loads the pre-trained model and tokenizer.

- Specifies the model name: "microsoft/phi-1_5" (a lightweight model)
- Loads the pre-trained causal language model
- Loads the corresponding tokenizer
- Sets the pad token to be the same as the end-of-sequence token

In [6]:
# Load the model and tokenizer
model_name = "microsoft/phi-1_5"  # Lightweight model

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token


## Evaluate Out-of-the-Box Model Performance

This cell sets up and runs inference using the original pretrained model, before any fine-tuning:
- Defines a function to generate responses from the model
- Loads questions from the test set
- Extracts relevant parts related to "Governing Law"
- Constructs a focused prompt for the model
- Runs inference with the out-of-the-box model and displays the result

In [7]:
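# Function to generate response
def generate_response(prompt, model, tokenizer):
    # Tokenize the prompt, truncating it to at most 2048 tokens.
    # Note: a full 2048-token prompt plus max_new_tokens=100 exceeds the 2048-token
    # limit named in the warning below; a smaller max_length would leave room.
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model.generate(
            inputs.input_ids,
            max_new_tokens=100,
            num_return_sequences=1,
            temperature=0.7,
            do_sample=True
        )
    # Decode the full sequence (prompt followed by the generated continuation)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)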

# Load test question
import json
with open(test_file_path, "r") as f:
    questions = [json.loads(line) for line in f]

question = "What is the Governing Law? Provide just the name of the governing law"

relevant_parts = [
    part['content'] for part in questions
    if 'Governing Law' in part['content'] or 'governed by' in part['content'].lower()
]

prompt = f"""
As a legal expert, analyze the following excerpts from a Master Service Agreement:

{' '.join(relevant_parts)}

{question}

Provide a concise answer.
"""

print("Out-of-the-box model response:")
print(generate_response(prompt, model, tokenizer))

Out-of-the-box model response:


This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (2048). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.



As a legal expert, analyze the following excerpts from a Master Service Agreement:

You are a seasoned lawyer with a strong background in Master Service Agreement agreement.\ 
    Your expertise is required to analyze a Master Service Agreement agreement and answer a question based on that Master Service Agreement agreement.    
    The Master Service Agreement agreementagreement is provided in JSON format where each object two keys 'page_number' and 'content', 'page_number' key contains the page number of the page of the Master Service Agreement agreementagreement and 'content' key contains the content on that page.
    
        
    The Master Service Agreement agreement is mentioned below in triple quotes.    Master Service Agreement agreement: '''{'page_number': 7, 'content': 'IN WITNESS WHEREOF, the parties to this Agreement acknowledge they have read this Agreement and understand and agree to be bound by its terms and conditions and hereby execute it through their duly authorize

## Prepare Dataset

This cell loads and prepares the dataset for training and validation.

- Loads train and validation datasets from JSON files
- Defines functions to format prompts and tokenize inputs
- Applies tokenization to both train and validation datasets

**Tokenization:**

* The tokenizer function is used to convert the input prompt into a sequence of tokens (numerical representations of words or characters).
* format_prompt(prompt) ensures that the prompt is in string format before being tokenized.

**Truncation and Padding:**

* truncation=True ensures that if the prompt is longer than the maximum length, it gets truncated.
* max_length=512 sets the maximum length of the tokenized input to 512 tokens.
* padding="max_length" pads shorter sequences to 512 tokens, ensuring all inputs are of equal length.

**Labels Creation:**

* The line result["labels"] = result["input_ids"].copy() makes a copy of the input_ids (the tokenized version of the prompt). The labels are typically used for training language models, where the goal is to predict the next token.

**Return:**

* The function returns the tokenized result, which includes both input_ids and labels needed for model training.
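
To make the truncation, padding, and label steps concrete, here is a minimal sketch that uses a hypothetical whitespace tokenizer in place of the real one. The names `PAD_ID` and `toy_tokenize` and the on-the-fly vocabulary are illustrative assumptions, not part of the Transformers API; only the shape of the result mirrors what generate_and_tokenize_prompt returns.

```python
# Toy stand-in for tokenizer(..., truncation=True, max_length=N, padding="max_length").
# PAD_ID and the vocabulary built here are hypothetical, chosen only for illustration.
PAD_ID = 0

def toy_tokenize(text, max_length, vocab):
    # Map each whitespace-separated word to an integer id, growing the vocab as needed.
    ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]
    ids = ids[:max_length]                        # truncation
    n_real = len(ids)
    ids = ids + [PAD_ID] * (max_length - n_real)  # padding to a fixed length
    mask = [1] * n_real + [0] * (max_length - n_real)
    return {"input_ids": ids, "attention_mask": mask}

vocab = {}
result = toy_tokenize("the agreement is governed by state law", max_length=10, vocab=vocab)
result["labels"] = result["input_ids"].copy()     # same step as in the notebook

print(result["input_ids"])
print(result["attention_mask"])
print(result["labels"] == result["input_ids"])
```

The labels can be an unshifted copy of input_ids because Hugging Face causal language models apply the one-position shift between inputs and targets inside their loss computation.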

In [8]:
import pandas as pd

# Load the training data file into a DataFrame
train_df = pd.read_json(train_file_path, lines=True)

train_df.head()

| | role | content |
|---|---|---|
| 0 | system | You are a seasoned lawyer with a strong backgr... |
| 1 | user | Read the question properly and analyze the Ma... |
| 2 | assistant | {“page_number”: 5, “content”: “more than thirt... |
| 3 | system | You are a seasoned lawyer with a strong backgr... |
| 4 | user | Read the question properly and analyze the Ma... |


In [9]:
# Load and prepare the dataset
train_dataset = load_dataset('json', data_files=train_file_path, split='train')
validation_dataset = load_dataset('json', data_files=test_file_path, split='train')

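def format_prompt(mess):
    # Each dataset row is a dict (with keys such as 'role' and 'content').
    # Convert it to its string representation so the tokenizer receives plain text.
    return str(mess)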


def generate_and_tokenize_prompt(prompt):
    result = tokenizer(
        format_prompt(prompt),
        truncation=True,
        max_length=512,
        padding="max_length",
    )
    result["labels"] = result["input_ids"].copy()
    return result

# Set the padding token
tokenizer.pad_token = tokenizer.eos_token

tokenized_train_dataset = train_dataset.map(generate_and_tokenize_prompt)
tokenized_validation_dataset = validation_dataset.map(generate_and_tokenize_prompt)


## Prepare Model for Training

This cell prepares the model for fine-tuning using LoRA (Low-Rank Adaptation).

- Enables gradient checkpointing to save memory
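- Calls prepare_model_for_kbit_training, a PEFT helper designed for 8-bit or 4-bit quantized models (the model in this notebook is loaded in float32, not quantized)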
- Configures and applies LoRA settings

## What is LoRA


Large Language Models (LLMs) are powerful tools for processing and understanding language, but fine-tuning them for specific tasks can be challenging because of their enormous size and computational demands. This is where Low-Rank Adaptation (LoRA) comes in, offering an efficient solution for fine-tuning LLMs without needing to adjust every parameter.

Instead of modifying the entire model, LoRA focuses on a small, manageable subset of parameters. Here’s a simplified breakdown of how it works:

1. Normally, each layer of an LLM uses a large weight matrix (W0) to transform its input. This matrix is huge and computationally expensive to adjust.

2. LoRA introduces two smaller matrices, A and B, whose shared inner dimension r (the rank) is much smaller than the dimensions of W0. Their product BA has the same shape as W0 and represents a low-rank update to it.

3. Instead of retraining the entire matrix W0, LoRA trains only these smaller matrices, making the fine-tuning process much faster and more efficient. The resulting update can come close to full fine-tuning in quality while requiring significantly fewer computational resources.

4. In a typical LLM layer, the output is calculated as output = W0x + b0. LoRA adds a new term, BAx, so the output becomes output = W0x + b0 + BAx, where A and B are the smaller matrices. This allows the model to adapt to new tasks without modifying the original large matrix W0.
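
The four steps above can be sketched numerically. This is a toy NumPy illustration, not the PEFT implementation: the sizes, the `lora_alpha` value, and the helper `lora_forward` are assumptions made for the example. It follows the LoRA paper's initialization (A random, B zero), so the adapted layer starts out identical to the original one.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r = 8, 8, 2        # toy sizes; real layers are far larger
lora_alpha = 4                  # illustrative value
scaling = lora_alpha / r        # scale applied to the low-rank update

W0 = rng.normal(size=(d_out, d_in))   # frozen pretrained weight
b0 = rng.normal(size=d_out)           # frozen bias
A = rng.normal(size=(r, d_in))        # trainable, random initialization
B_init = np.zeros((d_out, r))         # trainable, zero initialization

def lora_forward(x, W0, b0, A, B, scaling):
    # Original path plus the low-rank correction B(Ax), scaled by alpha / r.
    return W0 @ x + b0 + scaling * (B @ (A @ x))

x = rng.normal(size=d_in)
base_out = W0 @ x + b0

# With B = 0 the adapted layer reproduces the original layer.
same_at_start = np.allclose(lora_forward(x, W0, b0, A, B_init, scaling), base_out)

# Once B has been trained away from zero, the output shifts while W0 is untouched.
B_trained = rng.normal(size=(d_out, r))
same_after = np.allclose(lora_forward(x, W0, b0, A, B_trained, scaling), base_out)

print(same_at_start, same_after)
```

Computing B(Ax) rather than (BA)x keeps the intermediate result at size r, which is why the extra cost during training stays small.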

![Image Description 2](https://drive.google.com/uc?export=view&id=1XnPMJzKwHun6SGkoUgxDtAzozcIxRRTA)


*Source: [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)*


# Hyper-Parameters for LoraConfig

## r (Rank):

This defines the rank of the low-rank decomposition matrices A and B. A higher rank means more parameters to fine-tune and potentially better performance, but at the cost of increased memory and compute.
Value: 16, a moderate rank that balances efficiency and expressiveness.
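
As a rough worked example of what the rank costs, the sketch below counts trainable parameters for a single 2048 x 2048 projection, the size that appears in the model printout later in this notebook. The helper `lora_param_count` is a hypothetical name introduced only for this illustration.

```python
def lora_param_count(d_in, d_out, r):
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * d_in + d_out * r

d_in = d_out = 2048   # size of the attention projections in the model printout
r = 16

full = d_in * d_out                      # weights in the frozen matrix W0
lora = lora_param_count(d_in, d_out, r)  # trainable weights added by LoRA

print(f"full matrix: {full:,}")
print(f"LoRA (r={r}): {lora:,}")
print(f"ratio: {lora / full:.2%}")
```

For this layer the LoRA factors hold 65,536 values against 4,194,304 in the full matrix, roughly 1.6 percent. Doubling r doubles the adapter size.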

## lora_alpha:

This is a scaling factor applied to the updates from the low-rank matrices A and B before adding them to the original weight matrix W0. It controls how much influence the LoRA layers have over the original model. In PEFT the update is multiplied by lora_alpha / r.
Value: 32, which with r = 16 gives a scaling of 2 and moderate influence to the LoRA updates.

## target_modules:

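These are the names of the layers in the model where LoRA is applied. Only these layers receive LoRA adapters. The names must match module names in the loaded model; the model printout later in this notebook shows q_proj, k_proj, v_proj, dense, fc1, and fc2, so most of the names below do not appear in it. The names used here are:
1. "o_proj": The output projection layer.
2. "qkv_proj": The fused query, key, and value projection in the transformer.
3. "gate_up_proj", "up_proj", "down_proj", "lm_head".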
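
A quick way to see which of these names would actually receive adapters is to compare them with the model's module names. The sketch below does that in plain Python for one decoder layer, using the names from the model printout later in this notebook. Two things are assumptions: the `lm_head` entry (the printout is cut off before the output head) and the matching rule in `matches`, which is a simplification of PEFT's own logic.

```python
# Linear-layer names for one decoder layer, as shown in the model printout.
# "lm_head" is an assumption: the printout is truncated before the output head.
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.dense",
    "model.layers.0.mlp.fc1",
    "model.layers.0.mlp.fc2",
    "lm_head",
]

target_modules = ["o_proj", "qkv_proj", "gate_up_proj", "up_proj", "down_proj", "lm_head"]

def matches(name, target):
    # Simplified rule: the last component of the module path equals the target.
    return name.split(".")[-1] == target

matched = [n for n in module_names if any(matches(n, t) for t in target_modules)]
unmatched_targets = [
    t for t in target_modules if not any(matches(n, t) for n in module_names)
]

print("matched modules:", matched)
print("targets with no match:", unmatched_targets)
```

Under these assumptions only `lm_head` is matched. On a loaded model, the real names can be listed with `model.named_modules()` before choosing target_modules.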

## bias:

Determines whether LoRA will also adjust the bias terms in the model. In this case, "none" indicates that the bias terms are not fine-tuned, meaning only weights are updated.


## lora_dropout:

The dropout rate applied to LoRA layers during training. Dropout helps regularize the model by randomly zeroing part of the signal during training, reducing overfitting.
Value: 0.05 (5% dropout), meaning that each element of the input to the LoRA branch is zeroed with probability 0.05 during fine-tuning.

## task_type:

The task type that the model is being fine-tuned for. In this case, "CAUSAL_LM" means the task is Causal Language Modeling, where the model is predicting the next word or token in a sequence.



In [10]:
# Prepare model for training
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

# Configure LoRA
config = LoraConfig(
    r=16,
    lora_alpha=32,
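    # Names must match modules in the loaded model. The model printout later in this
    # notebook lists q_proj, k_proj, v_proj, dense, fc1 and fc2, so most of the names
    # below may match nothing there; check model.named_modules() before training.
    target_modules=[
        "o_proj",
        "qkv_proj",
        "gate_up_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],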
    bias="none",
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)

## Set Up Trainer and Start Training

This cell sets up the training configuration and starts the fine-tuning process.

- Sets up TrainingArguments with various hyperparameters
- Creates a Trainer instance with the model, datasets, and training arguments
- Starts the training process

![Image Description](https://drive.google.com/uc?export=view&id=1evbDx1GhJy907b1BEs5SbMcLSXl7hiYE)

*Source: [Guide to Fine-Tuning LLMs with LoRA and QLoRA](https://www.mercity.ai/blog-post/guide-to-fine-tuning-llms-with-lora-and-qlora)*


In [11]:
# Set up the trainer
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

output_dir = "./phi-1_5-finetune"

trainer = Trainer(
    model=model,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_validation_dataset,
    args=TrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,              # overridden by max_steps below
        warmup_steps=1,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=1,
        gradient_checkpointing=True,
        max_steps=51,                    # takes precedence over num_train_epochs
        learning_rate=2e-4,
        logging_dir="./logs",
        save_strategy="steps",
        save_steps=50,                   # writes checkpoint-50, which is loaded later
        evaluation_strategy="epoch",     # evaluate at the end of each epoch
        eval_steps=51,                   # only used when evaluation_strategy="steps"
        do_eval=True,
        logging_steps=5,
        run_name=f"phi-1_5-finetune-{datetime.now().strftime('%Y-%m-%d-%H-%M')}"
    ),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# Train the model
trainer.train()

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 2.942 | 2.627045 |


TrainOutput(global_step=51, training_loss=2.778151320476158, metrics={'train_runtime': 172.2191, 'train_samples_per_second': 0.592, 'train_steps_per_second': 0.296, 'total_flos': 407779657383936.0, 'train_loss': 2.778151320476158, 'epoch': 1.9615384615384617})
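
The fractional epoch value in this output follows from the step settings. The arithmetic below is a sketch that assumes 51 training examples and a single device; both are assumptions about this run, not values read from the Trainer.

```python
import math

num_examples = 51      # assumed size of the training set
batch_size = 2         # per_device_train_batch_size
grad_accum = 1         # gradient_accumulation_steps
max_steps = 51         # max_steps from TrainingArguments

# One optimizer step consumes batch_size * grad_accum examples;
# the last batch of an epoch may be smaller, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / (batch_size * grad_accum))
epochs_covered = max_steps / steps_per_epoch

print(steps_per_epoch)
print(round(epochs_covered, 4))
```

With 26 steps per epoch, 51 steps cover a little under two passes over the data, which agrees with the reported epoch of about 1.96 even though num_train_epochs was set to 1.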

## Memory Cleanup

This cell defines and runs a function to clean up memory after training.

- Defines a clean_memory function to clear Python garbage and CUDA cache
- Prints memory usage information
- Calls the clean_memory function


In [12]:
import gc
import torch

def clean_memory():
    # Clear Python garbage
    gc.collect()

    # Clear CUDA cache if CUDA is available
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

        # Only use ipc_collect if it's available (it's not available in all PyTorch versions)
        if hasattr(torch.cuda, 'ipc_collect'):
            torch.cuda.ipc_collect()

    # Print memory usage info
    if torch.cuda.is_available():
        print(f"CUDA memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
        print(f"CUDA memory reserved: {torch.cuda.memory_reserved() / 1e9:.2f} GB")

    print("Memory cleaned")

# Run the cleaning function
clean_memory()

CUDA memory allocated: 5.70 GB
CUDA memory reserved: 5.73 GB
Memory cleaned


## Load Fine-tuned Model

This cell loads the fine-tuned model and prepares it for inference.

- Reloads the base model and tokenizer
- Loads the fine-tuned weights
- Merges the base model with fine-tuned weights
- Prepares an evaluation prompt from the test set
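
Conceptually, merging folds the low-rank update into the base weight so that inference needs a single matrix again. The NumPy sketch below illustrates that idea with toy sizes and illustrative values; it is not PEFT's actual merge code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, lora_alpha = 8, 8, 2, 4   # toy sizes and values, for illustration only
scaling = lora_alpha / r

W0 = rng.normal(size=(d_out, d_in))   # base weight
A = rng.normal(size=(r, d_in))        # trained LoRA factors
B = rng.normal(size=(d_out, r))

# Merging adds the scaled low-rank product to the base weight once.
W_merged = W0 + scaling * (B @ A)

x = rng.normal(size=d_in)
adapter_out = W0 @ x + scaling * (B @ (A @ x))   # base path plus LoRA path
merged_out = W_merged @ x                        # single matmul after merging

print(np.allclose(adapter_out, merged_out))
```

Because the merged matrix has the same shape as W0, the merged model has the same architecture and inference cost as the base model.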

In [13]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the base model and tokenizer
model_name = "microsoft/phi-1_5"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the fine-tuned weights
from peft import PeftModel
ft_model = PeftModel.from_pretrained(model, "./phi-1_5-finetune/checkpoint-50")

# Merge the base model with the fine-tuned weights
merged_model = ft_model.merge_and_unload()

# Prepare for inference
import json
with open(test_file_path, "r") as f:
    question = [json.loads(line) for line in f]

eval_prompt = question[6]["content"] + '\n' + question[7]["content"]

merged_model.eval()

PhiForCausalLM(
  (model): PhiModel(
    (embed_tokens): Embedding(51200, 2048)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-23): 24 x PhiDecoderLayer(
        (self_attn): PhiSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=True)
          (k_proj): Linear(in_features=2048, out_features=2048, bias=True)
          (v_proj): Linear(in_features=2048, out_features=2048, bias=True)
          (dense): Linear(in_features=2048, out_features=2048, bias=True)
          (rotary_emb): PhiRotaryEmbedding()
        )
        (mlp): PhiMLP(
          (activation_fn): NewGELUActivation()
          (fc1): Linear(in_features=2048, out_features=8192, bias=True)
          (fc2): Linear(in_features=8192, out_features=2048, bias=True)
        )
        (input_layernorm): LayerNorm((2048,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.0, inplace=False)
      )
    )
    (final_layernorm): LayerNorm((2048,

## Compare Out-of-the-Box and Fine-Tuned Model Responses

This cell compares the performance of the original and fine-tuned models:
- Reloads the test questions and constructs the prompt (for consistency)
- Defines a function to generate responses from any given model
- Runs inference with the original out-of-the-box model
- Runs inference with the fine-tuned model
- Displays both responses for direct comparison
- Allows for analysis of improvements in accuracy, relevance, and quality

In [None]:
import json
from pyboxen import boxen

# Load the question
with open(test_file_path, "r") as f:
    questions = [json.loads(line) for line in f]

# Extract the relevant question
question = "What is the Governing Law? Provide just the name of the governing law"

# Find the relevant parts of the agreement
relevant_parts = [
    part['content'] for part in questions
    if 'Governing Law' in part['content'] or 'governed by' in part['content'].lower()
]

# Combine the relevant parts into a shorter prompt
short_prompt = f"""
As a legal expert, analyze the following excerpts from a Master Service Agreement:

{' '.join(relevant_parts)}

{question}

Provide a concise answer.
"""

def generate_response(prompt, model):
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)

    with torch.no_grad():
        outputs = model.generate(
            inputs.input_ids,
            max_new_tokens=100,
            num_return_sequences=1,
            temperature=0.7,
            do_sample=True
        )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Print formatted outputs using pyboxen
print(boxen(
    "QUESTION",
    title="Question",
    color="blue",
    padding=1
))
print(boxen(
    question,
    padding=1
))

print(boxen(
    "PROMPT",
    title="Generated Prompt",
    color="yellow",
    padding=1
))
print(boxen(
    short_prompt,
    padding=1
))

print(boxen(
    "OUT-OF-THE-BOX MODEL RESPONSE",
    title="Original Model Output",
    color="green",
    padding=1
))
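# Assumption: `model` was wrapped by PeftModel and merged in place in the previous
# cell, so it may no longer hold the original weights. Reload a fresh base model
# to get a true out-of-the-box baseline.
base_model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float32, trust_remote_code=True
)
print(boxen(
    generate_response(short_prompt, base_model),
    padding=1
))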

print(boxen(
    "FINE-TUNED MODEL RESPONSE",
    title="Fine-tuned Model Output",
    color="magenta",
    padding=1
))
print(boxen(
    generate_response(short_prompt, merged_model),
    padding=1
))

[34m╭─[0m[34m Question [0m[34m──[0m[34m─╮[0m                                                                                                   
[34m│[0m              [34m│[0m                                                                                                   
[34m│[0m   QUESTION   [34m│[0m                                                                                                   
[34m│[0m              [34m│[0m                                                                                                   
[34m╰──────────────╯[0m                                                                                                   



[37m╭───────────────────────────────────────────────────────────────────────────╮[0m                                      
[37m│[0m                                                                           [37m│[0m                                      
[37m│[0m   What is the Governing Law? Provide just the name of the governing law   [37m│[0m                                      
[37m│[0m                                                                           [37m│[0m                                      
[37m╰───────────────────────────────────────────────────────────────────────────╯[0m                                      



[33m╭─[0m[33m Generated Prompt [0m[33m─╮[0m                                                                                             
[33m│[0m                    [33m│[0m                                                                                             
[33m│[0m       PROMPT       [33m│[0m                                                                                             
[33m│[0m                    [33m│[0m                                                                                             
[33m╰────────────────────╯[0m                                                                                             



[37m╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮[0m
[37m│[0m                                                                                                                 [37m│[0m
[37m│[0m                                                                                                                 [37m│[0m
[37m│[0m   As a legal expert, analyze the following excerpts from a Master Service Agreement:                            [37m│[0m
[37m│[0m                                                                                                                 [37m│[0m
[37m│[0m   You are a seasoned lawyer with a strong background in Master Service Agreement agreement.\                    [37m│[0m
[37m│[0m       Your expertise is required to analyze a Master Service Agreement agreement and answer a question based    [37m│[0m
[37m│[0m   on that Master Service Agreement agreement.               

[32m╭─[0m[32m Original Model Output [0m[32m──────────[0m[32m─╮[0m                                                                              
[32m│[0m                                   [32m│[0m                                                                              
[32m│[0m   OUT-OF-THE-BOX MODEL RESPONSE   [32m│[0m                                                                              
[32m│[0m                                   [32m│[0m                                                                              
[32m╰───────────────────────────────────╯[0m                                                                              

