# Liger kernel example with DDP, Llama 3.2 1B Instruct and openassistant-guanaco dataset

In the first course, we demonstrated a speed-up using the *unsloth* library, which contains optimized GPU kernels created by manually deriving all compute-heavy math steps. However, *unsloth* only works for single-GPU training. Another library that offers optimized GPU kernels is *liger*, which has the advantage that it can also be used for multi-GPU training.

In this notebook, we demonstrate finetuning of *Llama 3.2 1B Instruct* on two GPUs using DDP and *liger kernels*. The same script is run twice, once without and once with *liger kernels*, in order to compare speed and memory usage.
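
The script below switches liger on with a bare `sys.argv` check. As a sketch of an alternative, the same toggle could be expressed with `argparse`, which also rejects misspelled flags. The `parse_args` helper is illustrative and not part of the original script.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; --enable-liger switches liger kernels on."""
    parser = argparse.ArgumentParser(description='Finetune with or without liger kernels.')
    parser.add_argument(
        '--enable-liger',
        action='store_true',  # False unless the flag is given
        help='Load the model with liger kernels.',
    )
    return parser.parse_args(argv)


# Passing an explicit list mimics "script.py --enable-liger".
args = parse_args(['--enable-liger'])
print('Using liger kernels.' if args.enable_liger else 'Not using liger kernels.')
```

Calling `parse_args()` without arguments reads `sys.argv`, so the launcher line in the SLURM script would not need to change.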

#### First, we write the Python code to a file:

In [1]:
%%writefile llama_guanaco_ddp_liger.py
# Import libraries
import torch
from accelerate import PartialState
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTTrainer, SFTConfig
import pynvml
import sys

if len(sys.argv) >= 2 and sys.argv[1] == '--enable-liger':
    enable_liger = True
    print('Using liger kernels.')
else:
    enable_liger = False
    print('Not using liger kernels.')

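# Import the liger drop-in replacement for AutoModelForCausalLM, which patches
# supported architectures with liger kernels when the model is loaded:
# https://github.com/linkedin/Liger-Kernel?tab=readme-ov-file#getting-started
# Alternatively, a model-specific patch function can be called before loading, e.g.:
# from liger_kernel.transformers import apply_liger_kernel_to_llama
# apply_liger_kernel_to_llama()
from liger_kernel.transformers import AutoLigerKernelForCausalLM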

def print_gpu_utilization():
    pynvml.nvmlInit()
    device_count = pynvml.nvmlDeviceGetCount()
    memory_used = []
    for device_index in range(device_count):
        device_handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        device_info = pynvml.nvmlDeviceGetMemoryInfo(device_handle)
        memory_used.append(device_info.used/1024**3)
    print('Memory occupied on GPUs: ' + ' + '.join([f'{mem:.1f}' for mem in memory_used]) + ' GB.')


# Choose a model and load tokenizer and model (using 4bit quantization):
# model_name = "meta-llama/Llama-3.2-1B-Instruct"
model_name = "/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/meta-llama--Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = 'left'
tokenizer.pad_token = tokenizer.eos_token

# For multi-GPU training, find out how many GPUs there are and which one we should use:
ps = PartialState()
num_processes = ps.num_processes
process_index = ps.process_index
local_process_index = ps.local_process_index

# Both classes accept the same arguments, so only the class itself differs:
model_class = AutoLigerKernelForCausalLM if enable_liger else AutoModelForCausalLM
model = model_class.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map={'': local_process_index},  # Changed for DDP: whole model on this process's GPU
    attn_implementation='eager',  # 'eager', 'sdpa', or 'flash_attention_2'
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)


# Load the guanaco dataset
guanaco_train = load_dataset('/leonardo_scratch/fast/EUHPC_D20_063/huggingface/datasets/timdettmers--openassistant-guanaco', split='train')
guanaco_test = load_dataset('/leonardo_scratch/fast/EUHPC_D20_063/huggingface/datasets/timdettmers--openassistant-guanaco', split='test')
# guanaco_train = load_dataset('timdettmers/openassistant-guanaco', split='train')
# guanaco_test = load_dataset('timdettmers/openassistant-guanaco', split='test')

def reformat_text(text, include_answer=True):
    """Convert the first turn of a guanaco entry to the model's chat format."""
    # Entries look like '### Human: ...### Assistant: ...'; later turns are ignored.
    parts = text.split('###')
    question1 = parts[1].removeprefix(' Human: ')
    answer1 = parts[2].removeprefix(' Assistant: ')
    messages = [{'role': 'user', 'content': question1}]
    if include_answer:
        messages.append({'role': 'assistant', 'content': answer1})
    reformatted_text = tokenizer.apply_chat_template(messages, tokenize=False)
    return reformatted_text

# Now, apply reformat_text(..) to both datasets:
guanaco_train = guanaco_train.map(lambda entry: {
    'reformatted_text': reformat_text(entry['text'])
})
guanaco_test = guanaco_test.map(lambda entry: {
    'reformatted_text': reformat_text(entry['text'])
})

model.config.use_cache = False  # KV cache can only speed up inference, but we are doing training.

# Configure low-rank adapters (LoRA) for the model:
peft_config = LoraConfig(
    task_type='CAUSAL_LM',
    r=16,
    lora_alpha=32,  # rule of thumb: lora_alpha should be 2*r
    lora_dropout=0.05,
    bias='none',
    target_modules='all-linear',
)
# The adapters are attached by SFTTrainer via its peft_config argument below,
# so get_peft_model(model, peft_config) is not called here.

training_arguments = SFTConfig(
    output_dir='output/llama-3.2-1b-instruct-guanaco-ddp-liger',
    per_device_train_batch_size=32//num_processes,  # Adjust per-device batch size for DDP
    gradient_accumulation_steps=1,
    gradient_checkpointing=True, # Gradient checkpointing improves memory efficiency, but slows down training,
        # e.g. Mistral 7B with PEFT using bitsandbytes:
        # - enabled: 11 GB GPU RAM and 8 samples/second
        # - disabled: 40 GB GPU RAM and 12 samples/second
    gradient_checkpointing_kwargs={'use_reentrant': False},  # Use newer implementation that will become the default.
    ddp_find_unused_parameters=False,  # Set to False when using gradient checkpointing to suppress warning message.
    log_level_replica='error',  # Disable warnings in all but the first process.
    optim='adamw_torch',
    learning_rate=2e-4,  # QLoRA suggestions: 2e-4 for 7B or 13B, 1e-4 for 33B or 65B
    logging_strategy='no',
    # logging_strategy='steps',  # 'no', 'epoch' or 'steps'
    # logging_steps=10,
    save_strategy='no',  # 'no', 'epoch' or 'steps'
    # save_steps=2000,
    # num_train_epochs=5,
    max_steps=20,
    bf16=True,  # mixed precision training
    report_to='none',  # disable wandb
    max_seq_length=1024,
    dataset_text_field='reformatted_text',
)

trainer = SFTTrainer(
    model=model,
    peft_config=peft_config,
    args=training_arguments,
    train_dataset=guanaco_train,
    eval_dataset=guanaco_test,
    processing_class=tokenizer,
)

if process_index == 0:  # Only print in first process.
    if hasattr(trainer.model, "print_trainable_parameters"):
        trainer.model.print_trainable_parameters()

# eval_result = trainer.evaluate()
# if process_index == 0:
#     print("Evaluation on test dataset before finetuning:")
#     print(eval_result)

train_result = trainer.train()
if process_index == 0:
    print("Training result:")
    print(train_result)

# eval_result = trainer.evaluate()
# if process_index == 0:
#     print("Evaluation on test dataset after finetuning:")
#     print(eval_result)

# Print memory usage once per node:
if local_process_index == 0:
    print_gpu_utilization()

# # Save model in first process only:
# if process_index == 0:
#     trainer.save_model()

Writing llama_guanaco_ddp_liger.py
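
`reformat_text` relies on the layout of the guanaco `text` field, in which turns are introduced by `### Human:` and `### Assistant:`. The following sketch replays the string handling on a made-up sample (the sample text is an assumption, not taken from the dataset) and leaves out the tokenizer's chat template. It also shows that only the first question and answer are kept when a conversation has several turns.

```python
def first_turn(text):
    """Return (question, answer) of the first turn of a guanaco-style string."""
    parts = text.split('###')  # parts[0] is the empty text before the first marker
    question = parts[1].removeprefix(' Human: ')
    answer = parts[2].removeprefix(' Assistant: ')
    return question, answer


# Made-up sample with a follow-up turn that will be dropped.
sample = '### Human: What is DDP?### Assistant: Distributed Data Parallel.### Human: Thanks!'
question, answer = first_turn(sample)
messages = [
    {'role': 'user', 'content': question},
    {'role': 'assistant', 'content': answer},
]
print(messages)
```

In the training script, a list like `messages` is what gets passed to `tokenizer.apply_chat_template`.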


#### Next, we write the SLURM script and submit it to the scheduler again:

In [2]:
%%writefile run_llama_guanaco_ddp_liger.slurm
#!/bin/bash

#SBATCH --partition=boost_usr_prod
# #SBATCH --qos=boost_qos_dbg
#SBATCH --account=EUHPC_D20_063
#SBATCH --reservation=s_tra_ncc

## Specify resources:
## Leonardo Booster: 32 CPU cores and 4 GPUs per node => request 8 * number of GPUs CPU cores
## Leonardo Booster: 512 GB in total => request approx. 120 GB * number of GPUs requested
#SBATCH --nodes=1
#SBATCH --gpus-per-task=2  # up to 4 on Leonardo
#SBATCH --ntasks-per-node=1  # always 1
#SBATCH --mem=240GB  # should be 120GB * gpus-per-task on Leonardo
#SBATCH --cpus-per-task=16  # should be 8 * gpus-per-task on Leonardo

#SBATCH --time=0:30:00

# Load conda:
eval "$(/leonardo/pub/userexternal/mpfister/miniforge3/bin/conda shell.bash hook)"
conda activate /leonardo/pub/userexternal/mpfister/conda_env_martin24

# Include commands in output:
set -x

# Print current time and date:
date

# Print host name:
hostname

# List available GPUs:
nvidia-smi

# Construct command to run container:
# export CONTAINER="singularity run --nv --home=$HOME $SINGULARITY_CONTAINER"
# export CONTAINER="singularity run --nv --home=$HOME /leonardo/pub/userexternal/mpfister/martin38.sif"

# Set environment variables for communication between nodes:
export MASTER_PORT=$(shuf -i 20000-30000 -n 1)  # Choose a random port
export MASTER_ADDR=$(scontrol show hostnames ${SLURM_JOB_NODELIST} | head -n 1)
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Set launcher and launcher arguments:
export LAUNCHER="torchrun \
    --nnodes=$SLURM_JOB_NUM_NODES \
    --nproc_per_node=$SLURM_GPUS_ON_NODE \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    --rdzv_backend=c10d"
# Set training script that will be executed:
export PROGRAM="llama_guanaco_ddp_liger.py"

# Run the script twice, first without and then with liger kernels:
time srun bash -c "$LAUNCHER $PROGRAM"
time srun bash -c "$LAUNCHER $PROGRAM --enable-liger"

Writing run_llama_guanaco_ddp_liger.slurm
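
The comments in the SLURM script give the sizing rule for Leonardo Booster (8 CPU cores and about 120 GB of memory per GPU), and the training script divides a global batch of 32 by the number of processes. Since `torchrun` is asked to start one process per GPU, these numbers follow from the GPU count alone. A small sketch of that arithmetic, using a hypothetical helper name:

```python
def leonardo_request(num_gpus, global_batch_size=32):
    """Derive SLURM resources and the per-device batch size from the GPU count."""
    if not 1 <= num_gpus <= 4:
        raise ValueError('A Leonardo Booster node has 4 GPUs.')
    return {
        'gpus_per_task': num_gpus,
        'cpus_per_task': 8 * num_gpus,  # 32 cores shared by 4 GPUs
        'mem_gb': 120 * num_gpus,       # roughly 512 GB shared by 4 GPUs
        'per_device_train_batch_size': global_batch_size // num_gpus,
    }


print(leonardo_request(2))
```

For two GPUs this reproduces the values used above: 16 CPU cores, 240 GB and a per-device batch size of 16. Because of the integer division, a GPU count that does not divide 32 (such as 3) yields a slightly smaller effective global batch.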


In [3]:
!sbatch --job-name=$TRAINEE_USERNAME run_llama_guanaco_ddp_liger.slurm

Submitted batch job 19826308
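
`sbatch` reports the job ID, and by default SLURM writes the job's output to `slurm-<jobid>.out` in the submission directory, which is the file read further down. A minimal sketch of deriving that file name in Python instead of copying the ID by hand (the helper name is illustrative):

```python
import re


def log_file_for(sbatch_output):
    """Map sbatch's 'Submitted batch job <id>' message to the default log file name."""
    match = re.search(r'Submitted batch job (\d+)', sbatch_output)
    if match is None:
        raise ValueError(f'Unexpected sbatch output: {sbatch_output!r}')
    return f'slurm-{match.group(1)}.out'


print(log_file_for('Submitted batch job 19826308'))  # slurm-19826308.out
```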


In [5]:
!squeue --name=$TRAINEE_USERNAME 

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          19826308 boost_usr trainee0 a08trb02  R       0:11      1 lrdn0775


In [7]:
!cat slurm-19826308.out

+ date
Thu Sep 11 08:07:33 CEST 2025
+ hostname
lrdn0775.leonardo.local
+ nvidia-smi
Thu Sep 11 08:07:33 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM-64GB           On  | 00000000:1D:00.0 Off |                    0 |
| N/A   43C    P0              64W / 476W |      2MiB / 65536MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+-------
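
Each of the two runs ends with a line of the form `Memory occupied on GPUs: a + b GB.`, printed by `print_gpu_utilization`, so memory usage with and without liger can be compared by pulling those lines out of the log. A sketch of such a parser follows. The log excerpt in it is invented for illustration and does not reproduce measured values.

```python
import re


def gpu_memory_lines(log_text):
    """Return one list of per-GPU memory values (in GB) per run found in the log."""
    runs = []
    for line in log_text.splitlines():
        match = re.search(r'Memory occupied on GPUs: (.+) GB\.', line)
        if match:
            runs.append([float(value) for value in match.group(1).split(' + ')])
    return runs


# Invented excerpt: the first run is without, the second with liger kernels.
example_log = """Not using liger kernels.
Memory occupied on GPUs: 20.0 + 20.0 GB.
Using liger kernels.
Memory occupied on GPUs: 10.0 + 10.0 GB.
"""
without_liger, with_liger = gpu_memory_lines(example_log)
print(f'Difference on GPU 0: {without_liger[0] - with_liger[0]:.1f} GB')
```

The values come from `pynvml`, so they describe memory in use when the function is called at the end of training, not necessarily the peak during the run.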

#### Finally, we can clean up and delete the files that we just created:

In [9]:
!rm llama_guanaco_ddp_liger.py run_llama_guanaco_ddp_liger.slurm slurm-*.out

rm: cannot remove 'slurm-*.out': No such file or directory
