# Hugging Face Accelerate

In this notebook, we write a Python script and a SLURM script directly from notebook cells and then submit the SLURM script.
Since we only have limited resources, we use a small dataset in combination with a small model. This is still enough to demonstrate how to use Hugging Face's [Accelerate](https://huggingface.co/docs/accelerate/index) library for distributed training on multiple GPUs across multiple nodes. In our case, we use 2 nodes with 2 NVIDIA A100 GPUs each.
Within a node, the GPUs on LEONARDO are connected via NVLink; the nodes themselves communicate over an InfiniBand network.

Hugging Face Accelerate simplifies distributed training. Its key benefits are:

 - Automated Distributed Setup: Accelerate automatically initializes the distributed environment. It configures process groups, sets the appropriate environment variables, and assigns GPUs to processes using PyTorch's native Distributed Data Parallel (DDP). This means you don’t have to manually set up multi-GPU execution or write boilerplate code.

 - Device and Process Management: With utilities like PartialState, Accelerate provides easy access to details such as the number of processes, process indices, and local device assignments. This information is crucial for tasks like sharding data, adjusting batch sizes per GPU, and ensuring that only one process handles logging or model saving.

 - Seamless Scaling: Accelerate allows your code to run on both single and multiple GPUs with minimal modifications. Whether you're training on one GPU or several, Accelerate handles the synchronization of model parameters and gradients across devices, making your code more portable and scalable.
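
The sharding and main-process-only logic mentioned in the list above can be pictured with a small pure-Python sketch. It does not call Accelerate at all: `shard_indices` and `is_main_process` are hypothetical helpers written for illustration, and the toy values are assumptions. In a real script, the two integers would come from `PartialState().num_processes` and `PartialState().process_index`.

```python
def shard_indices(num_items, num_processes, process_index):
    """Return the item indices handled by one process (round-robin split)."""
    return list(range(process_index, num_items, num_processes))


def is_main_process(process_index):
    """By convention, the process with global index 0 does logging and saving."""
    return process_index == 0


# Assumed toy values: 10 samples distributed over 4 processes.
num_items, num_processes = 10, 4
for rank in range(num_processes):
    shard = shard_indices(num_items, num_processes, rank)
    tag = " (main process)" if is_main_process(rank) else ""
    print(f"process {rank}{tag}: {shard}")
```

Every index ends up in exactly one shard, which is the property a distributed sampler has to guarantee.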

#### Python script
Let's go through the Python script, so you know what we are launching here. <br>
Accelerate doesn't require the code to be wrapped in a `main()` function, but doing so is highly recommended, especially for distributed or multi-GPU setups. Here's why:

- Multiprocessing Safety:
    Distributed launchers may start worker processes that import your script. By wrapping your code in a `main()` function and using the `if __name__ == "__main__":` guard, you prevent the training code from running unintentionally on import.

- Accelerate Configuration:
    The Accelerate config file often includes an entry like `main_training_function: main`, which names the function Accelerate should use as the entry point. According to the Accelerate documentation, this setting is only used for TPU launches, so it has no effect on the GPU runs in this notebook. Defining `main()` still keeps the script consistent with the config.

In [1]:
%%writefile phi3_guanaco_accelerate.py
import torch
from accelerate import Accelerator
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTTrainer, SFTConfig
import pynvml

def print_gpu_utilization():
    pynvml.nvmlInit()
    device_count = pynvml.nvmlDeviceGetCount()
    memory_used = []
    for device_index in range(device_count):
        device_handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        device_info = pynvml.nvmlDeviceGetMemoryInfo(device_handle)
        memory_used.append(device_info.used / 1024**3)
    print('Memory occupied on GPUs: ' + ' + '.join([f'{mem:.1f}' for mem in memory_used]) + ' GB.')

def main():
    # Initialize Accelerator; it picks up the distributed environment prepared by `accelerate launch`
    accelerator = Accelerator()
    device = accelerator.device

    if accelerator.is_main_process:
        print(f"Running on device: {device}")

    # Define model name and load tokenizer
    model_name = '/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/microsoft--phi-3.5-mini-instruct'
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.padding_side = 'right'

    # Load the model with 4-bit quantization. Note that we do not specify a device map manually.
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type='nf4',
            bnb_4bit_compute_dtype=torch.bfloat16,
        ),
        attn_implementation='eager',
        trust_remote_code=True,
        torch_dtype=torch.bfloat16,
    )
    # Move the model to the device specified by Accelerator
    model.to(device)
    
    # Disable caching (only beneficial for inference)
    model.config.use_cache = False

    # Add LoRA adapters
    peft_config = LoraConfig(
        task_type='CAUSAL_LM',
        r=16,
        lora_alpha=32,       # rule of thumb: lora_alpha should be about 2 * r
        lora_dropout=0.05,
        bias='none',
        target_modules='all-linear',
    )
    model = get_peft_model(model, peft_config)

    # Load and preprocess the dataset
    guanaco_train = load_dataset(
        '/leonardo_scratch/fast/EUHPC_D20_063/huggingface/datasets/timdettmers--openassistant-guanaco', 
        split='train'
    )
    guanaco_test = load_dataset(
        '/leonardo_scratch/fast/EUHPC_D20_063/huggingface/datasets/timdettmers--openassistant-guanaco', 
        split='test'
    )
    # Process each example to extract the user prompt and assistant response
    guanaco_train = guanaco_train.map(lambda entry: {
        'question1': entry['text'].split('###')[1].removeprefix(' Human: '),
        'answer1': entry['text'].split('###')[2].removeprefix(' Assistant: ')
    })
    guanaco_test = guanaco_test.map(lambda entry: {
        'question1': entry['text'].split('###')[1].removeprefix(' Human: '),
        'answer1': entry['text'].split('###')[2].removeprefix(' Assistant: ')
    })
    # Restructure to a chat format expected by our formatting function
    guanaco_train = guanaco_train.map(lambda entry: {'messages': [
        {'role': 'user', 'content': entry['question1']},
        {'role': 'assistant', 'content': entry['answer1']}
    ]})
    guanaco_test = guanaco_test.map(lambda entry: {'messages': [
        {'role': 'user', 'content': entry['question1']},
        {'role': 'assistant', 'content': entry['answer1']}
    ]})

    # Define training arguments with SFTConfig.
    # Note: We use accelerator.num_processes to adjust the per-device batch size.
    training_arguments = SFTConfig(
        output_dir='output/phi-3.5-mini-instruct-guanaco-ddp',
        per_device_train_batch_size=8 // accelerator.num_processes,
        gradient_accumulation_steps=1,
        gradient_checkpointing=True,
        gradient_checkpointing_kwargs={'use_reentrant': False},
        ddp_find_unused_parameters=False,
        log_level_replica='error',
        optim='adamw_torch',
        learning_rate=2e-4,
        logging_strategy='no',
        save_strategy='no',
        max_steps=100,
        bf16=True,
        report_to='none',
        max_seq_length=1024,
    )

    def formatting_func(entry):
        return tokenizer.apply_chat_template(entry['messages'], tokenize=False)

    # Create the SFTTrainer.
    trainer = SFTTrainer(
        model=model,
        args=training_arguments,
        train_dataset=guanaco_train,
        eval_dataset=guanaco_test,
        processing_class=tokenizer,
        formatting_func=formatting_func,
    )

    # Optionally print trainable parameters on the main process only.
    if accelerator.is_main_process and hasattr(trainer.model, "print_trainable_parameters"):
        trainer.model.print_trainable_parameters()

    # Evaluate before training
    eval_result = trainer.evaluate()
    if accelerator.is_main_process:
        print("Evaluation on test dataset before finetuning:")
        print(eval_result)

    # Train the model
    train_result = trainer.train()
    if accelerator.is_main_process:
        print("Training result:")
        print(train_result)

    # Evaluate after training
    eval_result = trainer.evaluate()
    if accelerator.is_main_process:
        print("Evaluation on test dataset after finetuning:")
        print(eval_result)

    # Print GPU memory usage (only once per node)
    if accelerator.local_process_index == 0:
        print_gpu_utilization()

if __name__ == "__main__":
    main()


Writing phi3_guanaco_accelerate.py
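
To make the dataset preprocessing in the script easier to follow, here is a minimal standalone sketch of the two `map` steps applied to a single string. The sample text is made up for illustration and only assumes the `### Human: ... ### Assistant: ...` layout that the script's `split('###')` logic relies on. `split_guanaco` and `to_messages` are hypothetical helper names.

```python
def split_guanaco(text):
    """Extract the first question/answer pair, mirroring the script's lambdas."""
    parts = text.split('###')  # parts[0] is the empty string before the first '###'
    question = parts[1].removeprefix(' Human: ')
    answer = parts[2].removeprefix(' Assistant: ')
    return question, answer


def to_messages(question, answer):
    """Build the chat structure that the tokenizer's chat template expects."""
    return [
        {'role': 'user', 'content': question},
        {'role': 'assistant', 'content': answer},
    ]


# Made-up sample in the assumed format.
sample = '### Human: What is SLURM?### Assistant: A workload manager for clusters.'
question, answer = split_guanaco(sample)
print(to_messages(question, answer))
```

Because only `parts[1]` and `parts[2]` are read, any further turns in a longer conversation are ignored by this preprocessing.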


#### Next, we write a SLURM script, initially using only 1 GPU and the exact same setup as in the DDP example:
This script works, but it starts the program with plain `python3`, which is handled differently from a launch via Accelerate. Note the times in the output.

In [9]:
%%writefile run_phi3_guanaco_accelerate_1gpu.slurm
#!/bin/bash

#SBATCH --partition=boost_usr_prod
# #SBATCH --qos=boost_qos_dbg
#SBATCH --account=EUHPC_D20_063
#SBATCH --reservation=s_tra_ncc

## Specify resources:
## Leonardo Booster: 32 CPU cores and 4 GPUs per node => request 8 * number of GPUs CPU cores
## Leonardo Booster: 512 GB in total => request approx. 120 GB * number of GPUs requested
#SBATCH --nodes=1
#SBATCH --gpus-per-task=1  # up to 4 on Leonardo
#SBATCH --ntasks-per-node=1  # always 1
#SBATCH --mem=120GB  # should be 120GB * gpus-per-task on Leonardo
#SBATCH --cpus-per-task=8  # should be 8 * gpus-per-task on Leonardo

#SBATCH --time=0:30:00

# Load conda:
module purge
module load anaconda3
eval "$(conda shell.bash hook)"
conda activate /leonardo/pub/userexternal/mpfister/conda_env_martin24

# Include commands in output:
set -x

# Print current time and date:
date

# Print host name:
hostname

# List available GPUs:
nvidia-smi

# Run:
time python3 phi3_guanaco_accelerate.py

Overwriting accelerate_run_phi3_guanaco_1gpu.slurm
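
The resource rule stated in the script's comments (8 CPU cores and roughly 120 GB of memory per requested GPU, with at most 4 GPUs per node) can be written down as a small helper. This is only a sketch of that arithmetic. `slurm_resources` is a hypothetical function and does not talk to SLURM.

```python
CORES_PER_GPU = 8        # 32 cores / 4 GPUs per Leonardo Booster node
MEM_GB_PER_GPU = 120     # approx. 512 GB / 4 GPUs, leaving some headroom
MAX_GPUS_PER_NODE = 4


def slurm_resources(gpus_per_node):
    """Return the per-node #SBATCH values that follow from the GPU count."""
    if not 1 <= gpus_per_node <= MAX_GPUS_PER_NODE:
        raise ValueError(f"gpus_per_node must be between 1 and {MAX_GPUS_PER_NODE}")
    return {
        'gpus-per-task': gpus_per_node,
        'cpus-per-task': CORES_PER_GPU * gpus_per_node,
        'mem': f'{MEM_GB_PER_GPU * gpus_per_node}GB',
    }


for gpus in (1, 2, 4):
    print(gpus, slurm_resources(gpus))
```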


#### We can now submit the SLURM script and, once the job has finished, look at the output:

In [10]:
!sbatch run_phi3_guanaco_accelerate_1gpu.slurm

Submitted batch job 12961431


In [14]:
!squeue --me

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          12960270 boost_usr jupyter. sharriso  R    2:05:38      1 lrdn0151
          12961431 boost_usr accelera sharriso  R       0:30      1 lrdn0675


Change the number in the command below to the JOBID of the batch job that you just submitted:

In [18]:
!cat slurm-12961431.out

+ date
Tue Feb 25 00:06:29 CET 2025
+ hostname
lrdn0675.leonardo.local
+ nvidia-smi
Tue Feb 25 00:06:29 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM-64GB            On | 00000000:1D:00.0 Off |                    0 |
| N/A   43C    P0               64W / 475W|      0MiB / 65536MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+--------

#### Now, we will create an Accelerate config file and adapt the SLURM script, so that Accelerate launches the program. We are still only using 1 GPU:
Note the times in the output again!

In [19]:
%%writefile accelerate_default_config_1gpu.yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
mixed_precision: bf16
downcast_bf16: 'yes'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
use_cpu: false

Writing ./tooling/config/accelerate_default_config_1gpu.yaml


In [32]:
%%writefile run_phi3_guanaco_accelerate_1gpu.slurm
#!/bin/bash

#SBATCH --partition=boost_usr_prod
# #SBATCH --qos=boost_qos_dbg
#SBATCH --account=EUHPC_D20_063
#SBATCH --reservation=s_tra_ncc

## Specify resources:
## Leonardo Booster: 32 CPU cores and 4 GPUs per node => request 8 * number of GPUs CPU cores
## Leonardo Booster: 512 GB in total => request approx. 120 GB * number of GPUs requested
#SBATCH --nodes=1
#SBATCH --gpus-per-task=1  # up to 4 on Leonardo
#SBATCH --ntasks-per-node=1  # always 1
#SBATCH --mem=120GB  # should be 120GB * gpus-per-task on Leonardo
#SBATCH --cpus-per-task=8  # should be 8 * gpus-per-task on Leonardo

#SBATCH --time=0:30:00

# Load conda:
module purge
module load anaconda3
eval "$(conda shell.bash hook)"
conda activate /leonardo/pub/userexternal/mpfister/conda_env_martin24

# Include commands in output:
set -x

# Print current time and date:
date

# Print host name:
hostname

# List available GPUs:
nvidia-smi

# Set environment variables for communication between nodes:
export MASTER_PORT=$(shuf -i 20000-30000 -n 1)  # Choose a random port
export MASTER_ADDR=$(scontrol show hostnames ${SLURM_JOB_NODELIST} | head -n 1)
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Set launcher and launcher arguments:
export LAUNCHER="accelerate launch \
    --num_machines $SLURM_NNODES \
    --num_processes $((SLURM_NNODES * SLURM_GPUS_ON_NODE)) \
    --num_cpu_threads_per_process 8 \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --machine_rank \$SLURM_PROCID \
    --config_file \"accelerate_default_config_1gpu.yaml\" \
    "
# Set training script that will be executed:
export PROGRAM="phi3_guanaco_accelerate.py"

# Run:
time srun bash -c "$LAUNCHER $PROGRAM"

Overwriting run_phi3_guanaco_accelerate_1gpu.slurm
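
The way the `LAUNCHER` string is assembled from SLURM variables may be easier to see in Python. The sketch below builds the same argument list from a dictionary that stands in for the environment. `build_launch_command` is a hypothetical helper, and the example values (host name, port) are assumptions, not output from a real job.

```python
import shlex


def build_launch_command(env, config_file, program):
    """Assemble the `accelerate launch` argument list used in the SLURM script."""
    num_machines = int(env['SLURM_NNODES'])
    gpus_on_node = int(env['SLURM_GPUS_ON_NODE'])
    return [
        'accelerate', 'launch',
        '--num_machines', str(num_machines),
        # One training process per GPU, summed over all nodes.
        '--num_processes', str(num_machines * gpus_on_node),
        '--main_process_ip', env['MASTER_ADDR'],
        '--main_process_port', env['MASTER_PORT'],
        # Each srun task passes its own rank so the nodes can tell each other apart.
        '--machine_rank', env['SLURM_PROCID'],
        '--config_file', config_file,
        program,
    ]


# Assumed example values for a single-node, single-GPU job.
example_env = {
    'SLURM_NNODES': '1',
    'SLURM_GPUS_ON_NODE': '1',
    'SLURM_PROCID': '0',
    'MASTER_ADDR': 'lrdn0151',
    'MASTER_PORT': '25000',
}
command = build_launch_command(
    example_env, 'accelerate_default_config_1gpu.yaml', 'phi3_guanaco_accelerate.py'
)
print(shlex.join(command))
```

In the shell script, `\$SLURM_PROCID` is escaped so that it is expanded inside each `srun` task rather than once in the batch script; the dictionary lookup here plays the role of that per-task expansion.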


#### We can now submit the SLURM script and, once the job has finished, look at the output:

In [33]:
!sbatch run_phi3_guanaco_accelerate_1gpu.slurm

Submitted batch job 12961598


In [35]:
!squeue --me

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          12961598 boost_usr run_phi3 sharriso  R       0:31      1 lrdn0151
          12960270 boost_usr jupyter. sharriso  R    2:50:54      1 lrdn0151


In [39]:
!cat slurm-12961598.out

+ date
Tue Feb 25 00:51:44 CET 2025
+ hostname
lrdn0151.leonardo.local
+ nvidia-smi
Tue Feb 25 00:51:44 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM-64GB            On | 00000000:1D:00.0 Off |                    0 |
| N/A   42C    P0               63W / 479W|      0MiB / 65536MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+--------

#### Now, we write another config file and SLURM script to train on multiple GPUs using Accelerate and submit the script to the scheduler again:

In [40]:
%%writefile accelerate_default_config.yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
mixed_precision: bf16
downcast_bf16: 'yes'
machine_rank: 0
main_training_function: main
num_machines: 2
num_processes: 4
rdzv_backend: static
same_network: true
use_cpu: false

Overwriting ./tooling/config/accelerate_default_config.yaml
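
With `num_processes: 4`, the expression `8 // accelerator.num_processes` in the training script changes the per-device batch size. The following sketch works through that arithmetic; `batch_sizes` is a hypothetical helper, and the total of 8 and the accumulation factor of 1 are taken from the script's `SFTConfig`.

```python
def batch_sizes(num_processes, total=8, gradient_accumulation_steps=1):
    """Return (per-device batch size, effective global batch size per optimizer step)."""
    per_device = total // num_processes
    effective = per_device * num_processes * gradient_accumulation_steps
    return per_device, effective


for n in (1, 2, 4, 3):
    per_device, effective = batch_sizes(n)
    print(f"{n} process(es): per-device batch size {per_device}, effective batch size {effective}")
```

For 1, 2, and 4 processes the effective batch size stays at 8, so the runs remain comparable. With a process count that does not divide 8 (3 in the last iteration), integer division silently shrinks the effective batch size.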


In [41]:
%%writefile run_phi3_guanaco_accelerate_multigpu.slurm
#!/bin/bash

#SBATCH --partition=boost_usr_prod
# #SBATCH --qos=boost_qos_dbg
#SBATCH --account=EUHPC_D20_063
#SBATCH --reservation=s_tra_ncc

## Specify resources:
## Leonardo Booster: 32 CPU cores and 4 GPUs per node => request 8 * number of GPUs CPU cores
## Leonardo Booster: 512 GB in total => request approx. 120 GB * number of GPUs requested
#SBATCH --nodes=2
#SBATCH --gpus-per-task=2  # up to 4 on Leonardo
#SBATCH --ntasks-per-node=1  # always 1
#SBATCH --mem=240GB  # should be 120GB * gpus-per-task on Leonardo
#SBATCH --cpus-per-task=16  # should be 8 * gpus-per-task on Leonardo

#SBATCH --time=0:30:00

# Load conda:
module purge
module load anaconda3
eval "$(conda shell.bash hook)"
conda activate /leonardo/pub/userexternal/mpfister/conda_env_martin24

# Include commands in output:
set -x

# Print current time and date:
date

# Print host name:
hostname

# List available GPUs:
nvidia-smi

# Set environment variables for communication between nodes:
export MASTER_PORT=$(shuf -i 20000-30000 -n 1)  # Choose a random port
export MASTER_ADDR=$(scontrol show hostnames ${SLURM_JOB_NODELIST} | head -n 1)
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Set launcher and launcher arguments:
export LAUNCHER="accelerate launch \
    --num_machines $SLURM_NNODES \
    --num_processes $((SLURM_NNODES * SLURM_GPUS_ON_NODE)) \
    --num_cpu_threads_per_process 8 \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --machine_rank \$SLURM_PROCID \
    --config_file \"accelerate_default_config.yaml\" \
    "
# Set training script that will be executed:
export PROGRAM="phi3_guanaco_accelerate.py"

# Run:
time srun bash -c "$LAUNCHER $PROGRAM"

Writing run_phi3_guanaco_accelerate_multigpu.slurm
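
In the script, `MASTER_ADDR` is the first host name that `scontrol show hostnames` produces from `SLURM_JOB_NODELIST`. To illustrate what that expansion does for a compact list such as `lrdn[1065-1066]`, here is a deliberately simplified sketch. `expand_nodelist` is a hypothetical helper that handles only a plain host name or a single bracket group; it is not a replacement for `scontrol`, which supports the full SLURM syntax.

```python
import re


def expand_nodelist(nodelist):
    """Expand a simple SLURM nodelist like 'lrdn[1065-1066]' into host names.

    Simplification: supports a plain host name or one prefix followed by a
    single bracket group containing comma-separated numbers and ranges.
    """
    match = re.fullmatch(r'([^\[\]]+)\[([^\[\]]+)\]', nodelist)
    if match is None:
        return [nodelist]  # plain host name, nothing to expand
    prefix, body = match.groups()
    hosts = []
    for part in body.split(','):
        if '-' in part:
            start, end = part.split('-')
            width = len(start)  # keep zero padding, e.g. 0151
            hosts.extend(f'{prefix}{i:0{width}d}' for i in range(int(start), int(end) + 1))
        else:
            hosts.append(prefix + part)
    return hosts


hosts = expand_nodelist('lrdn[1065-1066]')
master_addr = hosts[0]  # same role as `head -n 1` in the SLURM script
print(hosts, master_addr)
```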


In [42]:
!sbatch run_phi3_guanaco_accelerate_multigpu.slurm

Submitted batch job 12961611


In [43]:
!squeue --me

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          12961611 boost_usr run_phi3 sharriso  R       0:06      2 lrdn[1065-1066]
          12960270 boost_usr jupyter. sharriso  R    3:07:15      1 lrdn0151


In [45]:
!cat slurm-12961611.out

+ date
Tue Feb 25 01:08:31 CET 2025
+ hostname
lrdn1065.leonardo.local
+ nvidia-smi
Tue Feb 25 01:08:31 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM-64GB            On | 00000000:1D:00.0 Off |                    0 |
| N/A   42C    P0               64W / 485W|      0MiB / 65536MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+--------

#### Before we close the notebook, we should clean up the files created:

In [46]:
!rm phi3_guanaco_accelerate.py run_phi3_guanaco_accelerate_1gpu.slurm run_phi3_guanaco_accelerate_multigpu.slurm slurm-*.out *.yaml

rm: cannot remove '*.yaml': No such file or directory
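
The `rm` call above reports an error because no `*.yaml` file exists in the working directory (the earlier cell outputs show the config files being written under `./tooling/config/`). If you prefer a cleanup step that simply skips patterns without matches, something like the following sketch could be used instead. `clean_up` is a hypothetical helper, and the demonstration runs in a temporary directory with made-up file names so that it cannot delete anything of value.

```python
import tempfile
from pathlib import Path


def clean_up(directory, patterns):
    """Delete files matching the glob patterns; patterns without matches are skipped."""
    removed = []
    for pattern in patterns:
        for path in sorted(Path(directory).glob(pattern)):
            if path.is_file():
                path.unlink()
                removed.append(path.name)
    return removed


# Demonstration in a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('job.slurm', 'slurm-1.out', 'notes.txt'):
        (Path(tmp) / name).touch()
    print(clean_up(tmp, ['*.slurm', 'slurm-*.out', '*.yaml']))
    print(sorted(p.name for p in Path(tmp).iterdir()))
```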
