# DDP example with Phi-3.5 mini instruct and openassistant-guanaco dataset
In this example, a network is trained on multiple GPUs using DDP (Distributed Data Parallel). This approach makes it possible to train networks that fit into the memory of a single GPU on multiple GPUs in parallel, which speeds up training.

To use multiple GPUs, we need to write the code to a file and submit the job to the SLURM scheduler, because the JupyterHub that we are using today does not have access to any GPU. This example uses two GPUs on one node, but it can be extended by adjusting the number of GPUs and nodes in the SLURM script.
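
The core idea of DDP can be illustrated without any GPU. The following is a minimal sketch in plain Python, not the actual PyTorch implementation: each process computes the gradient on its own shard of the batch, and the gradients are then averaged across processes. For a mean loss and equally sized shards, this average equals the gradient over the full batch. The toy model, the data, and the function name `mse_gradient` are assumptions made for illustration.

```python
# Toy model: predict y = w * x and train with a mean squared error loss.
def mse_gradient(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

# Single process: gradient over the full batch of 8 samples.
full_gradient = mse_gradient(w, xs, ys)

# Two simulated processes: each one sees a shard of 4 samples.
num_processes = 2
shard_size = len(xs) // num_processes
local_gradients = []
for rank in range(num_processes):
    start, stop = rank * shard_size, (rank + 1) * shard_size
    local_gradients.append(mse_gradient(w, xs[start:stop], ys[start:stop]))

# DDP averages the local gradients so that every replica applies the same update.
averaged_gradient = sum(local_gradients) / num_processes

print(full_gradient, averaged_gradient)
```

Because every replica ends up with the same averaged gradient, the model copies stay in sync while each GPU only processes a fraction of the batch.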

#### First, we write the Python code to a file:

In [1]:
%%writefile phi3_guanaco_ddp.py
# Import libraries
import torch
from accelerate import PartialState
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTTrainer, SFTConfig
import pynvml
import psutil

def set_cpu_affinity(local_rank):
    # Leonardo has two NUMA nodes, CPUs 0-15 and 16-31.
    # All four GPUs are connected to the first NUMA node.
    # To find out which GPU belongs to which NUMA node, use the following command:
    # `nvidia-smi topo -mp`
    Leonardo_GPU_CPU_map = {
        0: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
        1: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
        2: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
        3: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
    }
    cpu_list = Leonardo_GPU_CPU_map[local_rank]
    print(f"Local rank {local_rank} binding to cpus: {cpu_list}")
    psutil.Process().cpu_affinity(cpu_list)

def print_gpu_utilization():
    pynvml.nvmlInit()
    device_count = pynvml.nvmlDeviceGetCount()
    memory_used = []
    for device_index in range(device_count):
        device_handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        device_info = pynvml.nvmlDeviceGetMemoryInfo(device_handle)
        memory_used.append(device_info.used/1024**3)
    print('Memory occupied on GPUs: ' + ' + '.join([f'{mem:.1f}' for mem in memory_used]) + ' GB.')


# Choose a model and load tokenizer and model (using 4bit quantization):
model_name = '/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/microsoft--phi-3.5-mini-instruct'
# model_name = 'microsoft/Phi-3.5-mini-instruct'
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.padding_side = 'right'

# For multi-GPU training, find out how many GPUs there are and which one we should use:
ps = PartialState()
num_processes = ps.num_processes
process_index = ps.process_index
local_process_index = ps.local_process_index
set_cpu_affinity(local_process_index)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type='nf4',
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map={'':local_process_index},  # Changed for DDP
    attn_implementation='eager',  # 'eager', 'sdpa', or "flash_attention_2"
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

# Load the guanaco dataset
guanaco_train = load_dataset('/leonardo_scratch/fast/EUHPC_D20_063/huggingface/datasets/timdettmers--openassistant-guanaco', split='train')
guanaco_test = load_dataset('/leonardo_scratch/fast/EUHPC_D20_063/huggingface/datasets/timdettmers--openassistant-guanaco', split='test')
# guanaco_train = load_dataset('timdettmers/openassistant-guanaco', split='train')
# guanaco_test = load_dataset('timdettmers/openassistant-guanaco', split='test')
guanaco_train = guanaco_train.map(lambda entry: {
    'question1': entry['text'].split('###')[1].removeprefix(' Human: '),
    'answer1': entry['text'].split('###')[2].removeprefix(' Assistant: ')
})
guanaco_test = guanaco_test.map(lambda entry: {
    'question1': entry['text'].split('###')[1].removeprefix(' Human: '),
    'answer1': entry['text'].split('###')[2].removeprefix(' Assistant: ')
})
guanaco_train = guanaco_train.map(lambda entry: {'messages': [
    {'role': 'user', 'content': entry['question1']},
    {'role': 'assistant', 'content': entry['answer1']}
]})
guanaco_test = guanaco_test.map(lambda entry: {'messages': [
    {'role': 'user', 'content': entry['question1']},
    {'role': 'assistant', 'content': entry['answer1']}
]})

model.config.use_cache = False  # KV cache can only speed up inference, but we are doing training.

# Add low-rank adapters (LORA) to the model:
peft_config = LoraConfig(
    task_type='CAUSAL_LM',
    r=16,
    lora_alpha=32,  # rule of thumb: lora_alpha should be 2*r
    lora_dropout=0.05,
    bias='none',
    target_modules='all-linear',
)
model = get_peft_model(model, peft_config)

training_arguments = SFTConfig(
    output_dir='output/phi-3.5-mini-instruct-guanaco-ddp',
    per_device_train_batch_size=8//num_processes,  # Adjust per-device batch size for DDP
    gradient_accumulation_steps=1,
    gradient_checkpointing=True, # Gradient checkpointing improves memory efficiency, but slows down training,
        # e.g. Mistral 7B with PEFT using bitsandbytes:
        # - enabled: 11 GB GPU RAM and 8 samples/second
        # - disabled: 40 GB GPU RAM and 12 samples/second
    gradient_checkpointing_kwargs={'use_reentrant': False},  # Use newer implementation that will become the default.
    ddp_find_unused_parameters=False,  # Set to False when using gradient checkpointing to suppress warning message.
    log_level_replica='error',  # Disable warnings in all but the first process.
    optim='adamw_torch',
    learning_rate=2e-4,  # QLoRA suggestions: 2e-4 for 7B or 13B, 1e-4 for 33B or 65B
    logging_strategy='no',
    # logging_strategy='steps',  # 'no', 'epoch' or 'steps'
    # logging_steps=10,
    save_strategy='no',  # 'no', 'epoch' or 'steps'
    # save_steps=2000,
    # num_train_epochs=5,
    max_steps=100,
    bf16=True,  # mixed precision training
    report_to='none',  # disable wandb
    max_seq_length=1024,
)

def formatting_func(entry):
    return tokenizer.apply_chat_template(entry['messages'], tokenize=False)

trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=guanaco_train,
    eval_dataset=guanaco_test,
    processing_class=tokenizer,
    formatting_func=formatting_func,
)

if process_index == 0:  # Only print in first process.
    if hasattr(trainer.model, "print_trainable_parameters"):
        trainer.model.print_trainable_parameters()

eval_result = trainer.evaluate()
if process_index == 0:
    print("Evaluation on test dataset before finetuning:")
    print(eval_result)

train_result = trainer.train()
if process_index == 0:
    print("Training result:")
    print(train_result)

eval_result = trainer.evaluate()
if process_index == 0:
    print("Evaluation on test dataset after finetuning:")
    print(eval_result)

# Print memory usage once per node:
if local_process_index == 0:
    print_gpu_utilization()

# # Save model in first process only:
# if process_index == 0:
#     trainer.save_model()

Writing phi3_guanaco_ddp.py
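
To see what the `map` calls in the script do to a dataset entry, here is a small sketch that applies the same string operations to hand-written samples. The sample strings are made up and only mimic the `### Human: ...### Assistant: ...` layout that the script assumes; the helper name `to_messages` is hypothetical. Note that `str.removeprefix` requires Python 3.9 or newer.

```python
def to_messages(text):
    """Extract the first question/answer pair, mirroring the map calls in the script."""
    parts = text.split('###')  # parts[0] is the empty string before the first '###'
    question = parts[1].removeprefix(' Human: ')
    answer = parts[2].removeprefix(' Assistant: ')
    return [
        {'role': 'user', 'content': question},
        {'role': 'assistant', 'content': answer},
    ]

# Made-up samples (assumption): one single-turn and one multi-turn conversation.
single_turn = '### Human: What is DDP?### Assistant: Distributed Data Parallel.'
multi_turn = single_turn + '### Human: Thanks!### Assistant: You are welcome.'

messages = to_messages(single_turn)
messages_multi = to_messages(multi_turn)
print(messages)
print(messages_multi)
```

Both calls return the same two messages: with this approach only the first exchange of a conversation is used, and any later turns are dropped. An entry with fewer than two `###` markers would raise an `IndexError`.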


#### Next, we write a SLURM script (initially using 1 GPU only):

In [2]:
%%writefile run_phi3_guanaco_1gpu.slurm
#!/bin/bash

#SBATCH --partition=boost_usr_prod
# #SBATCH --qos=boost_qos_dbg
#SBATCH --account=EUHPC_D20_063
#SBATCH --reservation=s_tra_ncc

## Specify resources:
## Leonardo Booster: 32 CPU cores and 4 GPUs per node => request 8 CPU cores per requested GPU
## Leonardo Booster: 512 GB in total => request approx. 120 GB per requested GPU
#SBATCH --nodes=1
#SBATCH --gpus-per-task=1  # up to 4 on Leonardo
#SBATCH --ntasks-per-node=1  # always 1
#SBATCH --mem=120GB  # should be 120GB * gpus-per-task on Leonardo
#SBATCH --cpus-per-task=8  # should be 8 * gpus-per-task on Leonardo

#SBATCH --time=0:30:00

# Load conda:
module purge
module load anaconda3
eval "$(conda shell.bash hook)"
conda activate /leonardo/pub/userexternal/mpfister/conda_env_martin24

# Include commands in output:
set -x

# Print current time and date:
date

# Print host name:
hostname

# List available GPUs:
nvidia-smi

# Run:
time python3 phi3_guanaco_ddp.py

Writing run_phi3_guanaco_1gpu.slurm
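
The resource comments in the script follow a simple rule: 8 CPU cores and about 120 GB of memory per requested GPU. As a small sketch, the helper below (a hypothetical function, not part of SLURM or of the training script) computes the matching values for a given number of GPUs on one Leonardo Booster node.

```python
def leonardo_resources(num_gpus):
    """Return per-node SLURM resource values following the 8 cores / 120 GB per GPU rule."""
    if not 1 <= num_gpus <= 4:
        raise ValueError('A Leonardo Booster node has 4 GPUs.')
    return {
        'gpus_per_task': num_gpus,
        'cpus_per_task': 8 * num_gpus,
        'mem_gb': 120 * num_gpus,
    }

for num_gpus in (1, 2, 4):
    res = leonardo_resources(num_gpus)
    print(f"--gpus-per-task={res['gpus_per_task']} "
          f"--cpus-per-task={res['cpus_per_task']} "
          f"--mem={res['mem_gb']}GB")
```

For one GPU this gives the values used in the script above (8 cores, 120 GB); for four GPUs it requests all 32 cores and 480 GB of the 512 GB available on the node.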


#### We can now submit the SLURM script and, once the job has finished, look at the output:

In [3]:
!sbatch run_phi3_guanaco_1gpu.slurm

Submitted batch job 13539541
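
The job ID printed by `sbatch` determines the name of the output file, which is `slurm-<JOBID>.out` when no `--output` option is given, as in this script. A minimal sketch of how the file name could be derived from the `sbatch` message (the helper name `slurm_output_file` is hypothetical):

```python
import re

def slurm_output_file(sbatch_output):
    """Derive the default SLURM output file name from the message printed by sbatch."""
    match = re.search(r'Submitted batch job (\d+)', sbatch_output)
    if match is None:
        raise ValueError(f'No job ID found in: {sbatch_output!r}')
    return f'slurm-{match.group(1)}.out'

output_file = slurm_output_file('Submitted batch job 13539541')
print(output_file)
```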


In [4]:
!squeue --me

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          13539541 boost_usr run_phi3 mpfister CF       0:02      1 lrdn3249
          13532631 boost_usr jupyterl mpfister  R    1:42:09      1 lrdn1789


Change the number in the command below to the JOBID of the batch job that you just submitted:

In [9]:
!cat slurm-13539541.out

+ date
Thu Mar  6 17:40:14 CET 2025
+ hostname
lrdn3249.leonardo.local
+ nvidia-smi
Thu Mar  6 17:40:14 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM-64GB            On | 00000000:8F:00.0 Off |                    0 |
| N/A   43C    P0               61W / 464W|      0MiB / 65536MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+--------

#### Now, we write another SLURM script where we use `torchrun` to train on multiple GPUs using DDP and submit the script to the scheduler again:

In [6]:
%%writefile run_phi3_guanaco_ddp.slurm
#!/bin/bash

#SBATCH --partition=boost_usr_prod
# #SBATCH --qos=boost_qos_dbg
#SBATCH --account=EUHPC_D20_063
#SBATCH --reservation=s_tra_ncc

## Specify resources:
## Leonardo Booster: 32 CPU cores and 4 GPUs per node => request 8 CPU cores per requested GPU
## Leonardo Booster: 512 GB in total => request approx. 120 GB per requested GPU
#SBATCH --nodes=1
#SBATCH --gpus-per-task=2  # up to 4 on Leonardo
#SBATCH --ntasks-per-node=1  # always 1
#SBATCH --mem=240GB  # should be 120GB * gpus-per-task on Leonardo
#SBATCH --cpus-per-task=16  # should be 8 * gpus-per-task on Leonardo

#SBATCH --time=0:30:00

# Load conda:
module purge
module load anaconda3
eval "$(conda shell.bash hook)"
conda activate /leonardo/pub/userexternal/mpfister/conda_env_martin24

# Include commands in output:
set -x

# Print current time and date:
date

# Print host name:
hostname

# List available GPUs:
nvidia-smi

# Set environment variables for communication between nodes:
export MASTER_PORT=$(shuf -i 20000-30000 -n 1)  # Choose a random port
export MASTER_ADDR=$(scontrol show hostnames ${SLURM_JOB_NODELIST} | head -n 1)
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Set launcher and launcher arguments:
export LAUNCHER="torchrun \
    --nnodes=$SLURM_JOB_NUM_NODES \
    --nproc_per_node=$SLURM_GPUS_ON_NODE \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    --rdzv_backend=c10d"
# Set training script that will be executed:
export PROGRAM="phi3_guanaco_ddp.py"

# Run:
time srun bash -c "$LAUNCHER $PROGRAM"

Writing run_phi3_guanaco_ddp.slurm
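
`torchrun` starts one process per GPU and passes each process its identity through environment variables such as `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`. The sketch below shows how these variables relate to the `process_index`, `local_process_index`, and `num_processes` values used in the training script, and how the per-device batch size follows from them. It is an illustration under the assumption of this one-to-one mapping, not the actual `accelerate` implementation; the helper names are hypothetical.

```python
import os

def read_launcher_env(env=None):
    """Read rank information from torchrun-style environment variables."""
    env = os.environ if env is None else env
    return {
        'process_index': int(env.get('RANK', 0)),              # global rank across all nodes
        'local_process_index': int(env.get('LOCAL_RANK', 0)),  # rank within the node = GPU index
        'num_processes': int(env.get('WORLD_SIZE', 1)),        # total number of processes
    }

def per_device_batch_size(num_processes, global_batch_size=8):
    """Same integer division as in the training script: 8 // num_processes."""
    return global_batch_size // num_processes

# Simulate the two processes that torchrun would start on a node with 2 GPUs.
for local_rank in range(2):
    fake_env = {'RANK': str(local_rank), 'LOCAL_RANK': str(local_rank), 'WORLD_SIZE': '2'}
    state = read_launcher_env(fake_env)
    batch = per_device_batch_size(state['num_processes'])
    print(state, 'per-device batch size:', batch)
```

With 1, 2, 4, or 8 processes the global batch size stays at 8. For a process count that does not divide 8, the integer division silently shrinks it, for example to 6 with 3 processes.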


In [7]:
!sbatch run_phi3_guanaco_ddp.slurm

Submitted batch job 13539549


In [8]:
!squeue --me

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          13539549 boost_usr run_phi3 mpfister CF       0:02      1 lrdn3225
          13539541 boost_usr run_phi3 mpfister  R       0:42      1 lrdn3249
          13532631 boost_usr jupyterl mpfister  R    1:42:49      1 lrdn1789


In [10]:
!cat slurm-13539549.out

+ date
Thu Mar  6 17:40:54 CET 2025
+ hostname
lrdn3225.leonardo.local
+ nvidia-smi
Thu Mar  6 17:40:54 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM-64GB            On | 00000000:1D:00.0 Off |                    0 |
| N/A   42C    P0               62W / 462W|      0MiB / 65536MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+--------

#### Finally, we can clean up and delete the files that we just created:

In [6]:
!rm phi3_guanaco_ddp.py run_phi3_guanaco_1gpu.slurm run_phi3_guanaco_ddp.slurm slurm-*.out

### Summary
For models that fit into the memory of a single GPU, DDP speeds up training by using multiple GPUs in parallel.

| Number of GPUs used | Training time |
| - | - |
| 1 GPU | 172 s |
| 2 GPUs | 107 s |
| 4 GPUs | ? |
| 8 GPUs (2 nodes) | ? |
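
The two measured rows can be turned into a speedup and a parallel efficiency. The short sketch below uses only the timings from the table; the rows marked `?` are left out because no measurements are given for them.

```python
# Measured training times in seconds, taken from the table above.
timings = {1: 172.0, 2: 107.0}

baseline = timings[1]
results = {}
for num_gpus, seconds in timings.items():
    speedup = baseline / seconds    # how many times faster than 1 GPU
    efficiency = speedup / num_gpus  # 1.0 would be perfect linear scaling
    results[num_gpus] = (speedup, efficiency)
    print(f'{num_gpus} GPU(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}')
```

Two GPUs give a speedup of roughly 1.6x, which corresponds to a parallel efficiency of about 80%. The gap to perfect scaling is expected, since the processes have to synchronize their gradients in every step.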