# Efficient Loading and Fine-Tuning of Large Language Models

In this notebook, we will explore techniques to efficiently load and fine-tune Large Language Models (LLMs) using Parameter-Efficient Fine-Tuning (PEFT) methods. These approaches are essential for optimizing memory usage and computational efficiency, especially when working with limited hardware resources.

In [27]:
#%pip install trl
#%pip install -U bitsandbytes
#%pip install evaluate

In [2]:
import gc
import torch
import time
import evaluate

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
    TrainingArguments
)
from trl import SFTTrainer, get_kbit_device_map
from datasets import load_dataset
from peft import get_peft_model, LoraConfig, AutoPeftModelForCausalLM

import torch._dynamo
torch._dynamo.config.suppress_errors = True

In [3]:
# Utility function to get model size:
def get_model_size(model):
    model_size = 0
    for param in model.parameters():
        model_size += param.nelement() * param.element_size()
    for buffer in model.buffers():
        model_size += buffer.nelement() * buffer.element_size()
    return model_size

In [28]:
!huggingface-cli login
# Enter your Hugging Face access token at the prompt; never store a token in the notebook itself.

In [5]:
# Define the model name
model_name = "meta-llama/Llama-3.2-1B"
tokenizer_name = "meta-llama/Llama-3.2-1B-Instruct"

# Load tokenizer once
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [6]:
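# Load the model. No dtype is passed here, and the 4.60 GB reported below matches 32-bit weights
# for a ~1.2B-parameter model; pass torch_dtype=torch.bfloat16 to load the weights in 16-bit instead.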
model_bf16 = AutoModelForCausalLM.from_pretrained(model_name)
print(f"bf16 model GPU VRAM usage: {get_model_size(model_bf16)/1024**3:.2f} GB")


bf16 model GPU VRAM usage: 4.60 GB
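
The reported size can be sanity-checked with simple arithmetic: number of parameters times bytes per parameter. The sketch below is only an estimate. The parameter count is taken from the totals that `print_trainable_parameters()` prints later in this notebook (all parameters minus trainable ones, i.e. the frozen base model), `estimate_size_gib` is a hypothetical helper, and the 4-bit row ignores quantization metadata.

```python
# Rough memory estimate: number of parameters x bytes per parameter.
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "4bit": 0.5}

def estimate_size_gib(num_params: int, dtype: str) -> float:
    """Return the approximate weight size in GiB for the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3

# Frozen base-model parameters, derived from the LoRA summary printed later
# in this notebook (all params minus trainable params).
base_params = 1_783_695_360 - 547_880_960

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {estimate_size_gib(base_params, dtype):.2f} GiB")
```

Under these assumptions the `float32` row comes out at about 4.60 GiB and the `bfloat16` row at about 2.30 GiB, which suggests the model above is held in 32-bit precision.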


In [26]:
# Generate a response for each model
for model in [model_bf16]:
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
    start_time = time.time()
    response = pipe(
        "Write a poem on a supercomputer missing its GPUs?",
        max_new_tokens=50,
        batch_size=32,
    )
    end_time = time.time()
    print(response)
    print(f"Inference time: {end_time - start_time:.2f} seconds")

Device set to use cuda:0


[{'generated_text': 'Write a poem on a supercomputer missing its GPUs? (Solve the riddle!)\nToday is the 25th day of the year. So, this is the 25th day of the 25th day of the 25th day of the 25th day of the 25th'}]
Inference time: 3.56 seconds


In [8]:
# Free GPU memory for what we are going to do next
gc.collect()
torch.cuda.empty_cache()

## Create a conversational dataset
Here, we use `orca-math-word-problems-200k`, a set of 200K math word problems. The dataset needs to be converted into a "chat" format, that is, a sequence of messages, each with a role and its text. The roles are:
- **system**: holds the system prompt, i.e. the instructions on how the model should "behave";
- **user**: holds the user's input;
- **assistant**: holds the LLM's own output. Here we put either previous LLM responses or, at inference time, nothing, since the model still needs to respond.

In order to create a dataset of "chats", we apply a simple function using the `map` method from `datasets` to our whole dataset.
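
To make the role structure concrete, here is a minimal sketch of how a list of messages could be flattened into a single training string. `render_chat` is a hypothetical, simplified renderer written in the style of the Llama 3 chat markup; the tokenizer's real chat template may add further content, so treat the exact output as illustrative only.

```python
# Simplified, hypothetical renderer in the style of Llama 3 chat markup.
# The tokenizer's real template may add extra content, so this is illustrative.
def render_chat(messages, add_generation_prompt=False):
    text = "<|begin_of_text|>"
    for message in messages:
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content'].strip()}<|eot_id|>"
    if add_generation_prompt:
        # Open an empty assistant turn so the model continues from here.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

sample = {"question": "What is 2 + 3?", "answer": "2 + 3 = 5."}
messages = [
    {"role": "system", "content": "Solve the problem step by step."},
    {"role": "user", "content": sample["question"]},
    {"role": "assistant", "content": sample["answer"]},
]

# Training: all three turns are present, including the reference answer.
training_text = render_chat(messages)
# Inference: the assistant turn is left open for the model to fill in.
inference_text = render_chat(messages[:2], add_generation_prompt=True)

print(training_text)
print(inference_text)
```

The training string contains the full exchange, while the inference string stops right after the assistant header, which is the "blank" assistant turn described above.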

In [9]:
# Create system prompt
system_message = """Solve the given high school math problem by providing a clear explanation of each step leading to the final solution.

Provide a detailed breakdown of your calculations, beginning with an explanation of the problem and describing how you derive each formula, value, or conclusion. Use logical steps that build upon one another, to arrive at the final answer in a systematic manner.

# Steps

1. **Understand the Problem**: Restate the given math problem and clearly identify the main question and any important given values.
2. **Set Up**: Identify the key formulas or concepts that could help solve the problem (e.g., algebraic manipulation, geometry formulas, trigonometric identities).
3. **Solve Step-by-Step**: Iteratively progress through each step of the math problem, justifying why each consecutive operation brings you closer to the solution.
4. **Double Check**: If applicable, double check the work for accuracy and sense, and mention potential alternative approaches if any.
5. **Final Answer**: Provide the numerical or algebraic solution clearly, accompanied by appropriate units if relevant.

# Notes

- Always clearly define any variable or term used.
- Wherever applicable, include unit conversions or context to explain why each formula or step has been chosen.
- Assume the level of mathematics is suitable for high school, and avoid overly advanced math techniques unless they are common at that level.
"""

# convert to messages
def create_conversation(sample):
  return {
    "messages": [
      {"role": "system", "content": system_message},
      {"role": "user", "content": sample["question"]},
      {"role": "assistant", "content": sample["answer"]}
    ]
  }

# Load dataset from the hub
dataset = load_dataset(
    "microsoft/orca-math-word-problems-200k",
    split="train")

# Convert dataset to OAI messages
dataset = dataset.map(create_conversation, remove_columns=dataset.features, batched=False)

train_dataset = dataset.select(range(10000))
test_dataset = dataset.select(range(10001, 12000))

In [10]:
# Example data
dataset[0]

{'messages': [{'content': 'Solve the given high school math problem by providing a clear explanation of each step leading to the final solution.\n \nProvide a detailed breakdown of your calculations, beginning with an explanation of the problem and describing how you derive each formula, value, or conclusion. Use logical steps that build upon one another, to arrive at the final answer in a systematic manner.\n \n# Steps\n \n1. **Understand the Problem**: Restate the given math problem and clearly identify the main question and any important given values.\n2. **Set Up**: Identify the key formulas or concepts that could help solve the problem (e.g., algebraic manipulation, geometry formulas, trigonometric identities).\n3. **Solve Step-by-Step**: Iteratively progress through each step of the math problem, justifying why each consecutive operation brings you closer to the solution.\n4. **Double Check**: If applicable, double check the work for accuracy and sense, and mention potential alte

## LoRA parameters
As discussed in class, one way to make training cheaper is to skip full fine-tuning and use parameter-efficient fine-tuning instead: we attach small adapter modules to the frozen model and train only those. With the configuration used below, this cuts the number of trainable parameters by roughly 70%.

The most common way to do PEFT with adapters is the `peft` library and its implementation of Low-Rank Adaptation (LoRA).

Don't worry if the LoRA parameters are not clear yet. They will be explained in detail in the next module. For now, take them for granted; they will make more sense later.

Steps:
- define the configuration of the adapters
- load (a quantized?) model
- apply the adapters to the model
- define training arguments
- start a supervised fine-tuning training session and train the model
- save the newly trained adapters
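
As a preview of what "applying adapters" means, the sketch below implements the LoRA forward pass on a toy linear layer with plain Python lists: the frozen weight `W` is left untouched, and a low-rank update `B @ A`, scaled by `alpha / r`, is added on top. All sizes and values are made up for illustration, and `lora_forward` is a hypothetical helper, not the `peft` implementation.

```python
# Minimal LoRA sketch with plain Python lists (no framework needed).
# y = W x + (alpha / r) * B (A x); only A and B would be trained.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    base = matvec(W, x)               # frozen pretrained path
    update = matvec(B, matvec(A, x))  # low-rank path: d_in -> r -> d_out
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Toy sizes: d_in = 3, d_out = 2, rank r = 1.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]  # frozen weight, shape (d_out, d_in)
A = [[0.5, 0.5, 0.5]]                   # trainable, shape (r, d_in)
B_init = [[0.0], [0.0]]                 # trainable, shape (d_out, r), starts at zero
x = [1.0, 2.0, 3.0]

# With B at zero the adapter contributes nothing, so the output equals W x.
y_start = lora_forward(W, A, B_init, x, alpha=2, r=1)

# Hypothetical values of B after some training steps.
B_trained = [[1.0], [-1.0]]
y_after = lora_forward(W, A, B_trained, x, alpha=2, r=1)

print(y_start, y_after)
```

Because `B` starts at zero, the adapted model initially behaves exactly like the base model; training then moves only `A` and `B`.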

In [11]:
# Configure LoRA with desired hyperparameters
lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    target_modules="all-linear",
    # important as we need to train the special tokens for the chat template of llama
    # you might need to change this for qwen or other models
    modules_to_save=["lm_head", "embed_tokens"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply PEFT with LoRA
model_bf16 = get_peft_model(model_bf16, lora_config).to("cuda:0")
model_bf16.print_trainable_parameters()  # prints the summary itself and returns None
print("Applied LoRA. Model is now ready for fine-tuning.")

trainable params: 547,880,960 || all params: 1,783,695,360 || trainable%: 30.7161
Applied LoRA. Model is now ready for fine-tuning.
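
A trainable share of about 30% is high for LoRA, so it is worth checking where those parameters come from. The sketch below recomputes the count by hand. The layer dimensions are assumptions about the Llama-3.2-1B architecture (they are not stated in this notebook), and the reasoning assumes that `modules_to_save` creates fully trainable copies of `lm_head` and `embed_tokens`.

```python
# Assumed Llama-3.2-1B dimensions (not stated in this notebook).
HIDDEN, INTERMEDIATE, KV_DIM = 2048, 8192, 512
NUM_LAYERS, VOCAB, RANK = 16, 128_256, 32

# (in_features, out_features) of every linear layer inside one decoder block.
LINEAR_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(d_in: int, d_out: int, r: int) -> int:
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * (d_in + d_out)

adapter_params = NUM_LAYERS * sum(
    lora_params(d_in, d_out, RANK) for d_in, d_out in LINEAR_SHAPES.values()
)
# Full trainable copies of embed_tokens and lm_head (vocab x hidden each).
saved_module_params = 2 * VOCAB * HIDDEN

trainable = adapter_params + saved_module_params
all_params = 1_783_695_360  # total printed above

print(f"LoRA adapters:   {adapter_params:,}")
print(f"modules_to_save: {saved_module_params:,}")
print(f"trainable:       {trainable:,} ({100 * trainable / all_params:.4f}%)")
```

Under these assumptions the LoRA matrices themselves account for only about 22.5M parameters; almost all of the trainable parameters are the embedding and output-head copies kept trainable through `modules_to_save`.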


In [12]:
training_args = TrainingArguments(
    output_dir="./lora_finetune", # directory to save and repository id
    # num_train_epochs=5, # number of training epochs
    per_device_train_batch_size=1, # batch size per device during training
    gradient_accumulation_steps=1, # number of batches whose gradients are accumulated before each optimizer update
    gradient_checkpointing=True, # use gradient checkpointing to save memory
    gradient_checkpointing_kwargs={'use_reentrant': False},  # Use newer implementation that will become the default.
    optim="adamw_torch_fused", # use fused adamw optimizer
    logging_steps=10, # log every 10 steps
    save_strategy='no',  # 'no', 'epoch' or 'steps'
    logging_strategy='steps',  # 'no', 'epoch' or 'steps'
    learning_rate=2e-4, # learning rate, based on QLoRA paper
    bf16=True, # use bfloat16 precision
    max_grad_norm=0.3, # max gradient norm based on QLoRA paper
    warmup_ratio=0.03, # warmup ratio based on QLoRA paper (only takes effect with a scheduler that supports warmup)
    lr_scheduler_type="constant", # use constant learning rate scheduler
    max_steps=40,
    report_to='none',  # disable wandb
)

trainer = SFTTrainer(
    model=model_bf16,
    train_dataset=train_dataset,
    args=training_args,
    processing_class=tokenizer,
)

train_result = trainer.train()

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


| Step | Training Loss |
|------|---------------|
| 10   | 1.1774        |
| 20   | 0.5794        |
| 30   | 0.5302        |
| 40   | 0.4967        |
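
The training arguments above combine `warmup_ratio=0.03` with `lr_scheduler_type="constant"`. The sketch below shows the learning-rate multiplier over the first steps for two schedules, assuming the usual behaviour of the `transformers` schedulers: plain `constant` keeps the multiplier at 1 from the first step, while `constant_with_warmup` ramps up linearly first. The function names are hypothetical stand-ins, not library calls.

```python
import math

def constant_multiplier(step: int) -> float:
    # Plain constant schedule: full learning rate from the very first step.
    return 1.0

def constant_with_warmup_multiplier(step: int, num_warmup_steps: int) -> float:
    # Linear ramp from 0 to 1 over the warmup steps, then constant.
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    return 1.0

max_steps, warmup_ratio, base_lr = 40, 0.03, 2e-4
# Warmup steps are the ratio times the total steps, rounded up.
num_warmup_steps = math.ceil(max_steps * warmup_ratio)

for step in range(4):
    lr_constant = base_lr * constant_multiplier(step)
    lr_warmup = base_lr * constant_with_warmup_multiplier(step, num_warmup_steps)
    print(f"step {step}: constant={lr_constant:.1e}  with_warmup={lr_warmup:.1e}")
```

With only 40 steps the warmup would cover two steps at most, so the difference is small here, but it matters for longer runs.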


In [13]:
# log metrics
metrics = train_result.metrics
metrics['train_samples'] = len(train_dataset)
trainer.log_metrics('train', metrics)
trainer.save_metrics('train', metrics)
trainer.save_state()

***** train metrics *****
  total_flos               =   155961GF
  train_loss               =     0.6959
  train_runtime            = 0:01:31.36
  train_samples            =      10000
  train_samples_per_second =      0.438
  train_steps_per_second   =      0.438
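
These metrics can be cross-checked against the training arguments and the logged losses. The sketch below redoes the arithmetic, assuming a single GPU; all input numbers are copied from the cells above.

```python
# Values copied from the training arguments and outputs above.
logged_losses = [1.1774, 0.5794, 0.5302, 0.4967]  # each averaged over 10 steps
max_steps = 40
per_device_train_batch_size = 1
gradient_accumulation_steps = 1
train_runtime_s = 1 * 60 + 31.36  # 0:01:31.36
train_dataset_size = 10_000

# Samples consumed = optimizer steps x batch size x accumulation (single GPU assumed).
samples_seen = max_steps * per_device_train_batch_size * gradient_accumulation_steps
mean_loss = sum(logged_losses) / len(logged_losses)
samples_per_second = samples_seen / train_runtime_s
fraction_seen = samples_seen / train_dataset_size

print(f"samples seen:       {samples_seen}")
print(f"mean training loss: {mean_loss:.4f}")
print(f"samples per second: {samples_per_second:.3f}")
print(f"share of train set: {fraction_seen:.2%}")
```

The run therefore touches only 40 of the 10,000 training examples, so it is a short demonstration rather than a full fine-tuning pass.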


In [14]:
trainer.save_model()

### Model evaluation

In [22]:
# Load the evaluation metrics
bleu = evaluate.load('bleu')
#rouge = evaluate.load('rouge')
#bertscore = evaluate.load('bertscore')

In [23]:
model = model_bf16  # evaluate the LoRA fine-tuned model

In [24]:
# Function to format input using the system prompt
def format_input(system_message, user_prompt):
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_prompt},
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

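# Function to generate a model response
def generate_response(system_message, user_prompt):
    input_text = format_input(system_message, user_prompt)
    inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
    # max_new_tokens limits only the generated text; max_length would also count the long prompt
    outputs = model.generate(**inputs, max_new_tokens=512)
    # Decode only the newly generated tokens so the prompt is not scored as part of the prediction
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)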

In [None]:
# Evaluation dataset (test split)
eval_samples = test_dataset.select(range(100))  # Evaluate on a subset of 100 samples

references = []
predictions = []

# Generate predictions
for sample in eval_samples:
    system_prompt = system_message
    user_input = sample["messages"][1]["content"]  # User's question
    reference_output = sample["messages"][2]["content"]  # Expected assistant response

    predicted_output = generate_response(system_prompt, user_input)

    references.append(reference_output)
    predictions.append(predicted_output)

# Compute evaluation metrics
bleu_result = bleu.compute(predictions=predictions, references=references)

# Print results
print(f"BLEU Score: {bleu_result['bleu']}")
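
To get a feel for what the BLEU score measures, here is a simplified corpus-level BLEU in plain Python: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It is a sketch under simplifying assumptions (whitespace tokenization, one reference per prediction, no smoothing), so its numbers will generally differ from those of `evaluate`; `simple_bleu` is a hypothetical helper.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(predictions, references, max_order=4):
    matches = [0] * max_order
    totals = [0] * max_order
    pred_len = ref_len = 0
    for pred, ref in zip(predictions, references):
        pred_tokens, ref_tokens = pred.split(), ref.split()
        pred_len += len(pred_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_order + 1):
            pred_counts = ngrams(pred_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clipping: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum((pred_counts & ref_counts).values())
            totals[n - 1] += sum(pred_counts.values())
    if pred_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing: any order without a match gives zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_order
    # Penalize predictions that are shorter than their references.
    brevity_penalty = 1.0 if pred_len > ref_len else math.exp(1 - ref_len / pred_len)
    return brevity_penalty * math.exp(log_precision)

reference = ["the answer is 5 apples"]
exact = simple_bleu(["the answer is 5 apples"], reference)
with_extra_text = simple_bleu(["system prompt text here the answer is 5 apples"], reference)
unrelated = simple_bleu(["nothing in common at all"], reference)

print(exact, with_extra_text, unrelated)
```

The middle case shows that any extra text in a prediction, such as an echoed prompt, lowers n-gram precision even when the answer itself is correct, so predictions should contain only the generated answer.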
