## Merge Low-Rank Adapters into the base model

This notebook demonstrates how to merge the Low-Rank Adapters (LoRA) back into the base model and save the merged model in an unquantized state.

In [1]:
%%writefile merge_lora.py
# Import necessary libraries
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, PeftConfig

trained_model_path = 'output/llama-3.2-1b-instruct-guanaco-fsdp'
output_dir = trained_model_path + '_merged'

config = PeftConfig.from_pretrained(trained_model_path)
base_model_name = config.base_model_name_or_path
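# A bare expression prints nothing when run as a script, so print explicitly:
print(f'base_model_name: {base_model_name}')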

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
)

print(f'base_model: {base_model.num_parameters()} parameters')

peft_model = PeftModel.from_pretrained(base_model, trained_model_path)

print(f'peft_model: {peft_model.num_parameters()} parameters')

merged_model = peft_model.merge_and_unload()

print(f'merged_model: {merged_model.num_parameters()} parameters')

merged_model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
print(f'Merged model saved to "{output_dir}".')

Overwriting merge_lora.py
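
Conceptually, merging a LoRA adapter folds the low-rank update into the base weight: each adapted weight `W` becomes `W + (alpha / r) * B @ A`, so the adapter matrices are no longer needed at inference time. The following is a minimal, dependency-free sketch of that idea on a toy matrix. It is not the `peft` implementation; the helper names (`matmul`, `merge_lora`) and all numbers are made up for illustration, and it assumes the standard LoRA scaling `alpha / r`.

```python
# Toy illustration of a LoRA merge using plain Python lists (no torch needed).
# All names and values are hypothetical and only serve to show the arithmetic.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Base weight W: 2 outputs x 3 inputs.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, -1.0]]
# Rank-1 adapter: A is (rank x inputs), B is (outputs x rank).
A = [[0.5, -0.5, 1.0]]
B = [[2.0],
     [1.0]]
alpha, rank = 2, 1

W_merged = merge_lora(W, A, B, alpha, rank)

# Compare both paths on a column-vector input x.
x = [[1.0], [2.0], [3.0]]
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
unmerged = [[b[0] + (alpha / rank) * a[0]] for b, a in zip(base_out, adapter_out)]
merged = matmul(W_merged, x)

print(merged == unmerged)
```

Because the update is added into `W` itself, the merged weight has the same shape as the base weight. This is why the merged model is expected to report the same parameter count as the base model, while the `PeftModel` additionally counts the adapter matrices.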


In [2]:
%%writefile run_merge_lora.slurm
#!/bin/bash

#SBATCH --partition=boost_usr_prod
# #SBATCH --qos=boost_qos_dbg
#SBATCH --account=EUHPC_D20_063
#SBATCH --reservation=s_tra_ncc

## Specify resources:
## Leonardo Booster: 32 CPU cores and 4 GPUs per node => request 8 * number of GPUs CPU cores
## Leonardo Booster: 512 GB in total => request approx. 120 GB * number of GPUs requested
#SBATCH --nodes=1
#SBATCH --gpus-per-task=1  # up to 4 on Leonardo
#SBATCH --ntasks-per-node=1  # always 1
#SBATCH --mem=120GB  # should be 120GB * gpus-per-task on Leonardo
#SBATCH --cpus-per-task=8  # should be 8 * gpus-per-task on Leonardo

#SBATCH --time=0:10:00

# Include commands in output:
set -x

# Print current time and date:
date

# Print host name:
hostname

# List available GPUs:
nvidia-smi

# Construct command to run container:
export CONTAINER="singularity run --nv --home=$HOME $SINGULARITY_CONTAINER"

# Run AI scripts:
$CONTAINER python3 merge_lora.py

Overwriting run_merge_lora.slurm
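
The resource comments in the job script follow a simple rule of thumb: 8 CPU cores and about 120 GB of memory per requested GPU, with at most 4 GPUs per node. As a small sketch of that arithmetic, the hypothetical helper below (`leonardo_booster_request` is not part of SLURM or any library) computes the matching `#SBATCH` values for a given GPU count.

```python
# Hypothetical helper that applies the per-GPU rule of thumb from the job script:
# 8 CPU cores and 120 GB of memory per GPU, up to 4 GPUs per node.

def leonardo_booster_request(gpus):
    """Return the SBATCH resource values for the given number of GPUs."""
    if not 1 <= gpus <= 4:
        raise ValueError("a Leonardo Booster node provides 1 to 4 GPUs")
    return {
        "gpus-per-task": gpus,
        "cpus-per-task": 8 * gpus,
        "mem": f"{120 * gpus}GB",
    }

# Print the directives for a single-GPU job, as used in this notebook.
for key, value in leonardo_booster_request(1).items():
    print(f"#SBATCH --{key}={value}")
```

For the merge job a single GPU is sufficient, so the values in `run_merge_lora.slurm` correspond to `leonardo_booster_request(1)`.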


Now submit the SLURM job:

In [3]:
!sbatch --job-name=$TRAINEE_USERNAME run_merge_lora.slurm

Submitted batch job 19816821


Execute `squeue` to check whether your job is already running:

In [4]:
!squeue --name=$TRAINEE_USERNAME 

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          19816821 boost_usr   martin mpfister  R       0:04      1 lrdn3261


Once your job is running, look at the output of the job using the following command (replace the number with the JOBID from above):

In [5]:
!cat slurm-19816821.out

+ date
Wed Sep 10 19:35:25 CEST 2025
+ hostname
lrdn3261.leonardo.local
+ nvidia-smi
Wed Sep 10 19:35:25 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM-64GB           On  | 00000000:8F:00.0 Off |                    0 |
| N/A   42C    P0              59W / 456W |      2MiB / 65536MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+-------
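
Rather than copying the JOBID by hand, the log file name can be derived from the line that `sbatch` prints on submission. The sketch below assumes the default SLURM output naming (`slurm-<jobid>.out`, as used in this notebook, since the job script sets no `--output` option); the function name `slurm_log_path` is hypothetical.

```python
import re

def slurm_log_path(sbatch_output):
    """Extract the job ID from sbatch's output and build the default log file name."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"no job ID found in: {sbatch_output!r}")
    return f"slurm-{match.group(1)}.out"

print(slurm_log_path("Submitted batch job 19816821"))
```

In a notebook, the output of `!sbatch ...` could be captured into a variable and passed to this function, so that later cells do not need to be edited for each new job.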

If you want to, you can also delete the files that we created above:

In [7]:
!rm merge_lora.py run_merge_lora.slurm slurm-*.out