# Hugging Face Accelerate

In this notebook, we write a Python script and a SLURM script directly from within cells and then submit the SLURM script.
Since we only have limited resources, we use a small dataset in combination with a small model. This is still enough to demonstrate how to use Hugging Face's [Accelerate](https://huggingface.co/docs/accelerate/index) library for distributed training on multiple GPUs across multiple nodes. In our case, we use 2 nodes with 2 NVIDIA A100 GPUs each.
On LEONARDO, the GPUs within a node are connected via NVLink; communication between nodes goes over the InfiniBand network.

Hugging Face Accelerate simplifies distributed training with the key benefits being:

 - Automated Distributed Setup: Accelerate automatically initializes the distributed environment. It configures process groups, sets the appropriate environment variables, and assigns GPUs to processes using PyTorch's native Distributed Data Parallel (DDP). This means you don’t have to manually set up multi-GPU execution or write boilerplate code.

 - Device and Process Management: With utilities like PartialState, Accelerate provides easy access to details such as the number of processes, process indices, and local device assignments. This information is crucial for tasks like sharding data, adjusting batch sizes per GPU, and ensuring that only one process handles logging or model saving.

 - Seamless Scaling: Accelerate allows your code to run on both single and multiple GPUs with minimal modifications. Whether you're training on one GPU or several, Accelerate handles the synchronization of model parameters and gradients across devices, making your code more portable and scalable.
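
To make "sharding data" and "adjusting batch sizes per GPU" more concrete, here is a minimal plain-Python sketch. It does not use Accelerate; the helper names `shard_indices` and `per_device_batch_size` are hypothetical and only illustrate the arithmetic. In a real script, the two inputs would come from `accelerator.num_processes` and `accelerator.process_index`.

```python
def shard_indices(num_samples: int, num_processes: int, process_index: int) -> list[int]:
    """Return the sample indices one process would handle (round-robin sharding)."""
    if not 0 <= process_index < num_processes:
        raise ValueError("process_index must be in [0, num_processes)")
    return list(range(process_index, num_samples, num_processes))


def per_device_batch_size(global_batch_size: int, num_processes: int) -> int:
    """Split a global batch size across processes using integer division."""
    return global_batch_size // num_processes


if __name__ == "__main__":
    # 2 nodes x 2 GPUs = 4 processes, as in this notebook.
    for rank in range(4):
        print(rank, shard_indices(10, 4, rank), per_device_batch_size(8, 4))
```

With 4 processes and a global batch size of 8, each process works on a batch of 2, which is the same calculation the training script performs with `8//num_processes`.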

#### Python script
Let's go through the Python script so that you know what we are launching here. <br>
Accelerate doesn't require the code to be wrapped in a `main()` function, but doing so is highly recommended, especially for distributed or multi-GPU setups. Here’s why:

- Multiprocessing Safety:
    When using distributed training, processes are spawned that may import your script. By wrapping your code in a `main()` function and using the `if __name__ == "__main__":` guard, you prevent unintended code execution in child processes.

- Accelerate Configuration:
    The Accelerate config file often includes an entry like `main_training_function: main`, which names the function that acts as the training entry point. As far as we know, `accelerate launch` only uses this entry for TPU training, but defining `main()` keeps the script consistent with the config.
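
The pattern described above can be sketched in a few lines. This is a stripped-down illustration, not the training script itself; the body of `main()` is a placeholder.

```python
def main() -> str:
    # Training setup and the training loop would go here.
    message = "training would start here"
    print(message)
    return message


# Runs only when the file is executed as a script, not when it is imported
# by another process.
if __name__ == "__main__":
    main()
```

Importing this file from another module defines `main()` but does not run it, which is the behaviour we want for spawned child processes.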

In [12]:
%%writefile run_llama_guanaco_accelerate.py
import torch
from accelerate import Accelerator
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTTrainer, SFTConfig
import pynvml

def print_gpu_utilization():
    pynvml.nvmlInit()
    device_count = pynvml.nvmlDeviceGetCount()
    memory_used = []
    for device_index in range(device_count):
        device_handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        device_info = pynvml.nvmlDeviceGetMemoryInfo(device_handle)
        memory_used.append(device_info.used / 1024**3)
    print('Memory occupied on GPUs: ' + ' + '.join([f'{mem:.1f}' for mem in memory_used]) + ' GB.')

def main():
    # Initialize Accelerator; it picks up the distributed setup from the environment variables set by the launcher
    accelerator = Accelerator()
    device = accelerator.device
    num_processes = accelerator.num_processes

    if accelerator.is_main_process:
        print(f"Running on device: {device}")

    # Define model name and load tokenizer
    # model_name = '/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/microsoft--phi-3.5-mini-instruct'
    model_name = '/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/meta-llama--Llama-3.2-1B-Instruct'
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.padding_side = 'left' # 'right'

    # Load the model with 4-bit quantization. Note that we do not specify a device map manually.
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type='nf4',
            bnb_4bit_compute_dtype=torch.bfloat16,
        ),
        attn_implementation='eager',
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
    )
    # Move the model to the device specified by Accelerator
    model.to(device)
    
    # Disable caching (only beneficial for inference)
    model.config.use_cache = False


    # Load the guanaco dataset
    guanaco_train = load_dataset('/leonardo_scratch/fast/EUHPC_D20_063/huggingface/datasets/timdettmers--openassistant-guanaco', split='train')
    guanaco_test = load_dataset('/leonardo_scratch/fast/EUHPC_D20_063/huggingface/datasets/timdettmers--openassistant-guanaco', split='test')
    # guanaco_train = load_dataset('timdettmers/openassistant-guanaco', split='train')
    # guanaco_test = load_dataset('timdettmers/openassistant-guanaco', split='test')

    def reformat_text(text, include_answer=True):
        question1 = text.split('###')[1].removeprefix(' Human: ')
        answer1 = text.split('###')[2].removeprefix(' Assistant: ')
        if include_answer:
            messages = [
                {'role': 'user', 'content': question1},
                {'role': 'assistant', 'content': answer1}
            ]
        else:
            messages = [
                {'role': 'user', 'content': question1}
            ]        
        reformatted_text = tokenizer.apply_chat_template(messages, tokenize=False)
        return reformatted_text

    # Now, apply reformat_text(..) to both datasets:
    guanaco_train = guanaco_train.map(lambda entry: {
        'reformatted_text': reformat_text(entry['text'])
    })
    guanaco_test = guanaco_test.map(lambda entry: {
        'reformatted_text': reformat_text(entry['text'])
    })

    # Add low-rank adapters (LORA) to the model:
    peft_config = LoraConfig(
        task_type='CAUSAL_LM',
        r=16,
        lora_alpha=32,  # rule of thumb: lora_alpha should be 2*r
        lora_dropout=0.05,
        bias='none',
        target_modules='all-linear',
    )


    training_arguments = SFTConfig(
        output_dir='output/llama-3.2-1b-instruct-guanaco-ddp',
        per_device_train_batch_size=8//num_processes,  # Adjust per-device batch size for DDP
        gradient_accumulation_steps=1,
        gradient_checkpointing=True, # Gradient checkpointing improves memory efficiency, but slows down training,
            # e.g. Mistral 7B with PEFT using bitsandbytes:
            # - enabled: 11 GB GPU RAM and 8 samples/second
            # - disabled: 40 GB GPU RAM and 12 samples/second
        gradient_checkpointing_kwargs={'use_reentrant': False},  # Use newer implementation that will become the default.
        ddp_find_unused_parameters=False,  # Set to False when using gradient checkpointing to suppress warning message.
        log_level_replica='error',  # Disable warnings in all but the first process.
        optim='adamw_torch',
        learning_rate=2e-4,  # QLoRA suggestions: 2e-4 for 7B or 13B, 1e-4 for 33B or 65B
        logging_strategy='no',
        # logging_strategy='steps',  # 'no', 'epoch' or 'steps'
        # logging_steps=10,
        save_strategy='no',  # 'no', 'epoch' or 'steps'
        # save_steps=2000,
        # num_train_epochs=5,
        max_steps=100,
        bf16=True,  # mixed precision training
        report_to='none',  # disable wandb
        max_length=1024,
        dataset_text_field='reformatted_text',
    )


    # Create the SFTTrainer.
    trainer = SFTTrainer(
        model=model,
        peft_config=peft_config,
        args=training_arguments,
        train_dataset=guanaco_train,
        eval_dataset=guanaco_test,
        processing_class=tokenizer,
    )

    # Optionally print trainable parameters on the main process only.
    if accelerator.is_main_process and hasattr(trainer.model, "print_trainable_parameters"):
        trainer.model.print_trainable_parameters()

    # Evaluate before training
    eval_result = trainer.evaluate()
    if accelerator.is_main_process:
        print("Evaluation on test dataset before finetuning:")
        print(eval_result)

    # Train the model
    train_result = trainer.train()
    if accelerator.is_main_process:
        print("Training result:")
        print(train_result)

    # Evaluate after training
    eval_result = trainer.evaluate()
    if accelerator.is_main_process:
        print("Evaluation on test dataset after finetuning:")
        print(eval_result)

    # Print GPU memory usage (only once per node)
    if accelerator.local_process_index == 0:
        print_gpu_utilization()

if __name__ == "__main__":
    main()


Overwriting phi3_guanaco_accelerate.py
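
To see what `reformat_text` extracts before the chat template is applied, here is a small standalone sketch of the splitting step. The function name `first_turn` and the sample string are made up for illustration; only the `### Human:` / `### Assistant:` markers are taken from the script above.

```python
def first_turn(text: str) -> list[dict[str, str]]:
    """Extract the first question/answer pair from a '### Human: ...### Assistant: ...' string."""
    parts = text.split('###')  # parts[0] is the (empty) text before the first marker
    question = parts[1].removeprefix(' Human: ')
    answer = parts[2].removeprefix(' Assistant: ')
    return [
        {'role': 'user', 'content': question},
        {'role': 'assistant', 'content': answer},
    ]


if __name__ == "__main__":
    # Hypothetical sample with a follow-up turn that is ignored by the split logic.
    sample = "### Human: What is SLURM?### Assistant: A workload manager.### Human: Thanks!"
    print(first_turn(sample))
```

Note that only the first exchange is kept: any later turns in a multi-turn conversation are dropped by this logic.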


#### Next, we write a SLURM script, initially using 1 GPU only and the exact same setup as with the DDP example:
This script runs the code with plain `python3`. It works, but it is handled differently than when it is launched with Accelerate. Note the times in the output.

In [13]:
%%writefile run_llama_guanaco_accelerate_1gpu.slurm
#!/bin/bash

#SBATCH --partition=boost_usr_prod
# #SBATCH --qos=boost_qos_dbg
#SBATCH --account=EUHPC_D20_063
#SBATCH --reservation=s_tra_ncc

## Specify resources:
## Leonardo Booster: 32 CPU cores and 4 GPUs per node => request 8 CPU cores per requested GPU
## Leonardo Booster: 512 GB in total => request approx. 120 GB per requested GPU
#SBATCH --nodes=1
#SBATCH --gpus-per-task=1  # up to 4 on Leonardo
#SBATCH --ntasks-per-node=1  # always 1
#SBATCH --mem=120GB  # should be 120GB * gpus-per-task on Leonardo
#SBATCH --cpus-per-task=8  # should be 8 * gpus-per-task on Leonardo

#SBATCH --time=0:30:00

# Include commands in output:
set -x

# Print current time and date:
date

# Print host name:
hostname

# List available GPUs:
nvidia-smi

# Construct command to run container:
export CONTAINER="singularity run --nv --home=$HOME $SINGULARITY_CONTAINER"

# Run:
time $CONTAINER python3 run_llama_guanaco_accelerate.py

Overwriting run_phi3_guanaco_accelerate_1gpu.slurm
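
The resource rule from the script comments (8 CPU cores and about 120 GB of memory per GPU) can be written down as a small helper. This is only a sketch of the arithmetic; `sbatch_resources` is a hypothetical name, and the constants are the ones stated in the comments above.

```python
CPU_CORES_PER_GPU = 8   # 32 CPU cores / 4 GPUs per node
MEM_GB_PER_GPU = 120    # approx. 512 GB / 4 GPUs per node
MAX_GPUS_PER_NODE = 4


def sbatch_resources(gpus_per_task: int) -> dict[str, str]:
    """Return the per-node SBATCH values that follow from the number of GPUs."""
    if not 1 <= gpus_per_task <= MAX_GPUS_PER_NODE:
        raise ValueError(f"gpus_per_task must be between 1 and {MAX_GPUS_PER_NODE}")
    return {
        "--gpus-per-task": str(gpus_per_task),
        "--cpus-per-task": str(CPU_CORES_PER_GPU * gpus_per_task),
        "--mem": f"{MEM_GB_PER_GPU * gpus_per_task}GB",
    }


if __name__ == "__main__":
    for option, value in sbatch_resources(1).items():
        print(f"#SBATCH {option}={value}")
```

For one GPU this reproduces the values used in the script above: 8 CPU cores and 120GB.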


#### We can now submit the SLURM script and, once the job has finished, look at the output:

In [14]:
!sbatch run_llama_guanaco_accelerate_1gpu.slurm

Submitted batch job 19819385


In [19]:
!squeue --me

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          19819102 boost_usr trainee0 sharriso  R      15:19      1 lrdn3456


Change the number in the command below to the JOBID of the batch job that you just submitted:

In [20]:
!cat slurm-19819385.out

+ date
Wed Sep 10 21:49:35 CEST 2025
+ hostname
lrdn3394.leonardo.local
+ nvidia-smi
Wed Sep 10 21:49:35 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM-64GB           On  | 00000000:8F:00.0 Off |                    0 |
| N/A   42C    P0              60W / 458W |      2MiB / 65536MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+-------

#### Now, we will create an Accelerate config file and adapt the SLURM script, so that Accelerate launches the program. We are still only using 1 GPU:
Note the times in the output again!

In [21]:
%%writefile accelerate_default_config_1gpu.yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: NO
mixed_precision: bf16
downcast_bf16: 'yes'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
use_cpu: false

Overwriting accelerate_default_config_1gpu.yaml


In [22]:
%%writefile run_llama_guanaco_accelerate_1gpu.slurm
#!/bin/bash

#SBATCH --partition=boost_usr_prod
# #SBATCH --qos=boost_qos_dbg
#SBATCH --account=EUHPC_D20_063
#SBATCH --reservation=s_tra_ncc

## Specify resources:
## Leonardo Booster: 32 CPU cores and 4 GPUs per node => request 8 CPU cores per requested GPU
## Leonardo Booster: 512 GB in total => request approx. 120 GB per requested GPU
#SBATCH --nodes=1
#SBATCH --gpus-per-task=1  # up to 4 on Leonardo
#SBATCH --ntasks-per-node=1  # always 1
#SBATCH --mem=120GB  # should be 120GB * gpus-per-task on Leonardo
#SBATCH --cpus-per-task=8  # should be 8 * gpus-per-task on Leonardo

#SBATCH --time=0:30:00

# Include commands in output:
set -x

# Print current time and date:
date

# Print host name:
hostname

# List available GPUs:
nvidia-smi

# Set environment variables for communication between nodes:
export MASTER_PORT=$(shuf -i 20000-30000 -n 1)  # Choose a random port
export MASTER_ADDR=$(scontrol show hostnames ${SLURM_JOB_NODELIST} | head -n 1)
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Set launcher and launcher arguments:
export LAUNCHER="accelerate launch \
    --num_machines $SLURM_NNODES \
    --num_processes $((SLURM_NNODES * SLURM_GPUS_ON_NODE)) \
    --num_cpu_threads_per_process 8 \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --machine_rank \$SLURM_PROCID \
    --config_file \"accelerate_default_config_1gpu.yaml\" \
    "
# Set training script that will be executed:
export PROGRAM="run_llama_guanaco_accelerate.py"

# Construct command to run container:
export CONTAINER="singularity run --nv --home=$HOME $SINGULARITY_CONTAINER"

# Run:
time srun bash -c "$CONTAINER $LAUNCHER $PROGRAM"

Overwriting run_phi3_guanaco_accelerate_1gpu.slurm
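
To make the `LAUNCHER` string easier to follow, here is a Python sketch that assembles the same `accelerate launch` arguments from the environment variables used in the script. The function `build_launch_command` is hypothetical; the flags and variable names are copied from the SLURM script above. In the shell script, `\$SLURM_PROCID` is escaped so that it is evaluated inside each `srun` task; in this sketch, `env` stands for the environment of one such task.

```python
def build_launch_command(env: dict[str, str], config_file: str, program: str,
                         threads_per_process: int = 8) -> list[str]:
    """Assemble the accelerate launch command for one srun task."""
    nnodes = int(env["SLURM_NNODES"])
    gpus_on_node = int(env["SLURM_GPUS_ON_NODE"])
    return [
        "accelerate", "launch",
        "--num_machines", str(nnodes),
        # Total number of processes = nodes * GPUs per node.
        "--num_processes", str(nnodes * gpus_on_node),
        "--num_cpu_threads_per_process", str(threads_per_process),
        "--main_process_ip", env["MASTER_ADDR"],
        "--main_process_port", env["MASTER_PORT"],
        # One task per node, so the task index doubles as the machine rank.
        "--machine_rank", env["SLURM_PROCID"],
        "--config_file", config_file,
        program,
    ]


if __name__ == "__main__":
    # Example values for a single-node, single-GPU job (made up for illustration).
    example_env = {
        "SLURM_NNODES": "1",
        "SLURM_GPUS_ON_NODE": "1",
        "MASTER_ADDR": "lrdn2761",
        "MASTER_PORT": "23456",
        "SLURM_PROCID": "0",
    }
    print(" ".join(build_launch_command(
        example_env, "accelerate_default_config_1gpu.yaml", "run_llama_guanaco_accelerate.py")))
```

Because the command-line flags are derived from the SLURM allocation, the same launcher block can be reused when the number of nodes or GPUs changes.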


#### We can now submit the SLURM script and, once the job has finished, look at the output:

In [23]:
!sbatch run_llama_guanaco_accelerate_1gpu.slurm

Submitted batch job 19819419


In [35]:
!squeue --me

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          19819102 boost_usr trainee0 sharriso  R      18:22      1 lrdn3456


In [36]:
!cat slurm-19819419.out

+ date
Wed Sep 10 21:52:47 CEST 2025
+ hostname
lrdn2761.leonardo.local
+ nvidia-smi
Wed Sep 10 21:52:47 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM-64GB           On  | 00000000:1D:00.0 Off |                    0 |
| N/A   43C    P0              68W / 488W |      2MiB / 65536MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+-------

#### Now, we write another config file and SLURM script to train on multiple GPUs using Accelerate and submit the script to the scheduler again:

In [37]:
%%writefile accelerate_default_config_multi_gpu.yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
mixed_precision: bf16
downcast_bf16: 'yes'
machine_rank: 0
main_training_function: main
num_machines: 2
num_processes: 4
rdzv_backend: static
same_network: true
use_cpu: false

Overwriting accelerate_default_config_multi_gpu.yaml
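
A quick sanity check for the config is that `num_processes` equals `num_machines` times the number of GPUs per node (here 2 x 2 = 4). The sketch below checks this with a tiny parser that only understands flat `key: value` lines, which is enough for the config files in this notebook but is not a general YAML parser. Both function names are hypothetical.

```python
def parse_flat_config(text: str) -> dict[str, str]:
    """Parse flat 'key: value' lines; not a general YAML parser."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        config[key.strip()] = value.strip().strip("'\"")
    return config


def process_count_is_consistent(config: dict[str, str], gpus_per_node: int) -> bool:
    """Check that num_processes == num_machines * gpus_per_node."""
    expected = int(config["num_machines"]) * gpus_per_node
    return int(config["num_processes"]) == expected


if __name__ == "__main__":
    config_text = """
    distributed_type: MULTI_GPU
    num_machines: 2
    num_processes: 4
    """
    config = parse_flat_config(config_text)
    print(process_count_is_consistent(config, gpus_per_node=2))
```

Note that the SLURM script in this notebook also passes `--num_machines` and `--num_processes` on the command line, so the values in the file and in the launcher should be kept in agreement.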


In [38]:
%%writefile run_llama_guanaco_accelerate_multigpu.slurm
#!/bin/bash

#SBATCH --partition=boost_usr_prod
# #SBATCH --qos=boost_qos_dbg
#SBATCH --account=EUHPC_D20_063
#SBATCH --reservation=s_tra_ncc

## Specify resources:
## Leonardo Booster: 32 CPU cores and 4 GPUs per node => request 8 CPU cores per requested GPU
## Leonardo Booster: 512 GB in total => request approx. 120 GB per requested GPU
#SBATCH --nodes=2
#SBATCH --gpus-per-task=2  # up to 4 on Leonardo
#SBATCH --ntasks-per-node=1  # always 1
#SBATCH --mem=240GB  # should be 120GB * gpus-per-task on Leonardo
#SBATCH --cpus-per-task=16  # should be 8 * gpus-per-task on Leonardo

#SBATCH --time=0:30:00

# Include commands in output:
set -x

# Print current time and date:
date

# Print host name:
hostname

# List available GPUs:
nvidia-smi

# Set environment variables for communication between nodes:
export MASTER_PORT=$(shuf -i 20000-30000 -n 1)  # Choose a random port
export MASTER_ADDR=$(scontrol show hostnames ${SLURM_JOB_NODELIST} | head -n 1)
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Set launcher and launcher arguments:
export LAUNCHER="accelerate launch \
    --num_machines $SLURM_NNODES \
    --num_processes $((SLURM_NNODES * SLURM_GPUS_ON_NODE)) \
    --num_cpu_threads_per_process 8 \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --machine_rank \$SLURM_PROCID \
    --config_file \"accelerate_default_config_multi_gpu.yaml\" \
    "
# Set training script that will be executed:
export PROGRAM="run_llama_guanaco_accelerate.py"

# Construct command to run container:
export CONTAINER="singularity run --nv --home=$HOME $SINGULARITY_CONTAINER"

# Run:
time srun bash -c "$CONTAINER $LAUNCHER $PROGRAM"

Overwriting run_phi3_guanaco_accelerate_multigpu.slurm
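
The script picks the first host of the allocation as `MASTER_ADDR` by piping `scontrol show hostnames` into `head -n 1`. To illustrate what that expansion does, here is a simplified Python sketch. It only handles a plain host name or a single bracket group such as `lrdn[3397-3398]`; the real Slurm nodelist syntax is richer, so on the cluster `scontrol` remains the right tool. The function names are hypothetical.

```python
import re


def expand_hostlist(nodelist: str) -> list[str]:
    """Expand a simple nodelist like 'lrdn[3397-3398]' (simplified, not full Slurm syntax)."""
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", nodelist)
    if match is None:
        # No bracket group: treat the string as a single host name.
        return [nodelist]
    prefix, body = match.groups()
    hosts = []
    for part in body.split(","):
        start, _, end = part.partition("-")
        end = end or start          # a single number is a range of length 1
        width = len(start)          # preserve zero padding
        for number in range(int(start), int(end) + 1):
            hosts.append(f"{prefix}{number:0{width}d}")
    return hosts


def master_addr(nodelist: str) -> str:
    """Return the first host, which the script uses as MASTER_ADDR."""
    return expand_hostlist(nodelist)[0]


if __name__ == "__main__":
    print(expand_hostlist("lrdn[3397-3398]"))
    print(master_addr("lrdn[3397-3398]"))
```

All nodes compute the same first host from the same nodelist, so every process agrees on where the main process lives.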


In [39]:
!sbatch run_llama_guanaco_accelerate_multigpu.slurm

Submitted batch job 19819444


In [49]:
!squeue --me

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          19819444 boost_usr run_phi3 sharriso CG       1:33      2 lrdn[3397-3398]
          19819102 boost_usr trainee0 sharriso  R      20:41      1 lrdn3456


In [50]:
!cat slurm-19819444.out

+ date
Wed Sep 10 21:55:57 CEST 2025
+ hostname
lrdn3397.leonardo.local
+ nvidia-smi
Wed Sep 10 21:55:57 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM-64GB           On  | 00000000:1D:00.0 Off |                    0 |
| N/A   42C    P0              64W / 483W |      2MiB / 65536MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+-------

#### Before we close the notebook, we should clean up the files created:

In [51]:
!rm run_llama_guanaco_accelerate.py run_llama_guanaco_accelerate_1gpu.slurm run_llama_guanaco_accelerate_multigpu.slurm slurm-*.out accelerate_default_config_1gpu.yaml accelerate_default_config_multi_gpu.yaml