# Accelerate

## I. Introduction

Accelerate does not implement distributed training itself; it is an API that makes distributed training easier to set up. It supports several distributed training backends such as DDP, FSDP, and DeepSpeed.

Transformers uses Accelerate in its training process.

According to its documentation, using Accelerate takes 4 steps:

 step1. import accelerate:
 ```python
 from accelerate import Accelerator
 ```

 step2. instantiate the accelerator object
 ```python
 accelerator = Accelerator()
 ```

 step3. wrap the training objects
 ```python
 model, optimizer, dataloader, scheduler = accelerator.prepare(model, 
                                                               optimizer, 
                                                               dataloader, 
                                                               scheduler)
 ```

 step4. backpropagation

 ```python
 accelerator.backward(loss)
 ```


## II. Implementation

As an example, we take the DDP code as the starting point and modify it to use Accelerate. This helps to understand how the distributed concepts are implemented in Accelerate.

The steps to follow:

 step1. take the DDP code as the initial example: ddp_train_torch.py.

 step2. refactor the code into functions (this is optional):
 
 * get_dataloader: return tokenized data
 * get_model: return the model and the corresponding optimizer

 step3. add a main function: this helps structure the script and makes it more maintainable.
 
 step4. modify the script to use Accelerate

    A. import Accelerator
    B. initialize the accelerator
    C. wrap the objects
    D. backward through the accelerator

The 4 changes above are the basic Accelerate steps. But this won't work yet: we still have to get rid of everything specific to DDP.

    E. remove the DDP sampler from the dataloader
    F. remove the DDP wrapper
    G. remove the moves to GPU: the to operation should be removed in 2 places, the training loop and the metric function.

At this point, the script runs without error. But we still have the problem explained for DDP with the padded batch data: the accuracy is incorrect because it is computed with the wrong data length (the padded length instead of the real length).

    H. gather the predictions
    I. change the report

 step5. All modifications are marked and explained in the file accelerate_torch.py. To run it:

 ```bash
 $ torchrun --nproc_per_node=2 accelerate_torch.py
 ```
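
The padding problem described in step 4 can be reproduced without any GPU. The sketch below is a plain-Python simulation, not Accelerate code: `shard_with_padding` is a hypothetical helper that mimics how a distributed sampler repeats samples so that every process receives the same number of items, and the model is assumed to be perfect (every prediction equals its label).

```python
def shard_with_padding(num_samples, world_size):
    """Mimic a distributed sampler: pad the indices so they split evenly, then shard by rank."""
    indices = list(range(num_samples))
    padding = (-num_samples) % world_size
    indices += indices[:padding]  # repeat the first samples as padding
    return [indices[rank::world_size] for rank in range(world_size)]


labels = [1, 0, 1, 1, 0]  # 5 real samples
shards = shard_with_padding(len(labels), world_size=2)

# "Gather" from all ranks: (sample index, prediction) pairs, rank after rank.
gathered = [(i, labels[i]) for shard in shards for i in shard]

# Naive report: count every gathered prediction, divide by the real dataset size.
naive_correct = sum(pred == labels[i] for i, pred in gathered)
naive_accuracy = naive_correct / len(labels)

# Fixed report: drop the duplicated (padded) samples before counting.
unique = dict(gathered)  # keeps one entry per sample index
fixed_accuracy = sum(pred == labels[i] for i, pred in unique.items()) / len(unique)

print(shards)          # [[0, 2, 4], [1, 3, 0]]
print(naive_accuracy)  # 1.2
print(fixed_accuracy)  # 1.0
```

An accuracy above 1 is the symptom of counting padded samples. In a real script, Accelerate's `accelerator.gather_for_metrics(...)` is intended to drop such duplicates when gathering; whether accelerate_torch.py uses exactly this call is not shown here.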

## III. Command

Accelerate also provides a command to run the script. In a terminal:

```bash
$ accelerate launch your_script.py
```

If we run accelerate launch without parameters, the script runs with the default parameters. Otherwise, we can configure them.

To configure the Accelerate launcher, run:

```bash
$ accelerate config
```

After pressing enter, a series of prompts appears in the terminal; use the arrow keys to select an option and enter to validate it.

The prompts encountered in this context are shown below (they may change depending on the selections): 

    * Please select a choice using the arrow or number keys, and selecting with enter
        ➔ This machine     
            AWS (Amazon SageMaker)  
    * Which type of machine are you using?                          
        Please select a choice using the arrow or number keys, and selecting with enter
        ➔  No distributed training
            multi-CPU
            multi-XPU
            multi-GPU
            multi-NPU
            multi-MLU
            TPU
    * How many different machines will you use (use more than 1 for multi-node training)? [1]:
    * Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: 
    * Do you wish to optimize your script with torch dynamo?[yes/NO]:                        
    * Do you want to use DeepSpeed? [yes/NO]:      
    * Do you want to use FullyShardedDataParallel? [yes/NO]:   
    * Do you want to use Megatron-LM ? [yes/NO]:                                       
    * How many GPU(s) should be used for distributed training? [1]:2
    * What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:
    * Would you like to enable numa efficiency? (Currently only supported on NVIDIA hardware). [yes/NO]:
    * Do you wish to use FP16 or BF16 (mixed precision)?
    * Please select a choice using the arrow or number keys, and selecting with enter
        ➔  no
        fp16
        bf16
        fp8

At the end, a message is shown: 
```bash
accelerate configuration saved at ~/.cache/huggingface/accelerate/default_config.yaml
```

We can open this file (in VS Code, ctrl+click on the file name) to see all the options:

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

We can now run the program using: 
```bash
$ accelerate launch accelerate_torch.py 
```

To modify the config, there are several options:

     * rerun accelerate config to regenerate the config file
     * modify the config file directly
     * copy the config file, modify the copy, and pass it to the launcher with the --config_file argument

To show all accelerate launcher arguments, we can use:
```bash
$ accelerate launch --help 
```
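
When several config files are kept side by side, it can help to assemble the launch command programmatically. The sketch below only builds the argument list and does not run anything; `build_launch_command` is a hypothetical helper and `my_config.yaml` is a placeholder file name. The `--config_file`, `--num_processes` and `--mixed_precision` options are assumed to match those listed by `accelerate launch --help`.

```python
import shlex


def build_launch_command(script, config_file=None, num_processes=None, mixed_precision=None):
    """Assemble an `accelerate launch` command line as a list of arguments."""
    cmd = ["accelerate", "launch"]
    if config_file is not None:
        cmd += ["--config_file", config_file]
    if num_processes is not None:
        cmd += ["--num_processes", str(num_processes)]
    if mixed_precision is not None:
        cmd += ["--mixed_precision", mixed_precision]
    cmd.append(script)  # the training script always comes last
    return cmd


cmd = build_launch_command("accelerate_torch.py", config_file="my_config.yaml", num_processes=2)
print(shlex.join(cmd))
```

The resulting list can then be passed to `subprocess.run(cmd, check=True)` to start the training.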

For the Trainer of Transformers, nothing needs to be changed. We can run it directly using:

```bash
$ accelerate launch ddp_train_transformers.py
```


## IV. Mixed precision

Mixed precision training aims to reduce memory usage and speed up training.
It uses both FP32 and FP16/BF16 to train a model, as shown in the schema below.
    
      fp16 -> forward -> fp16  
        |                  |  
      fp32 <- backward <- fp32

However, does it really decrease memory usage and speed up training?

For the first question, let's take the example of a model with N parameters. 
According to the table below (row "total"), the total memory for mixed precision is more than 16*N bytes, while for full precision it is 16*N bytes. 
So mixed precision does not decrease the memory so far. This is due to the extra copy of the model and some overhead for gradient conversion.

However, if we count the activations (suppose there are A activations), the memory usage becomes (16+)*N + 2*A bytes for mixed precision and 16*N + 4*A bytes for full precision.
This means that the more activations there are, the more advantageous mixed precision becomes in terms of memory usage.

|                    | mixed (bytes)     | fp32 (bytes)   |
|---                 |---                |---             |
| model              | (4 + 2) * N       | 4 * N          |
| optimizer          | 8 * N             | 8 * N          |
| gradients          | (2+) * N          | 4 * N          |
| total              | (16+) * N         | 16 * N         |
| activation         | 2 * A             | 4 * A          |
| total + activation | (16+) * N + 2 * A | 16 * N + 4 * A |
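
The table can be turned into a small calculator to see where mixed precision starts to pay off. This is a minimal sketch under stated assumptions: `training_memory_bytes` is a hypothetical helper, the "+" overhead of mixed precision is ignored (so the mixed figure is a lower bound), and the values of N and A are made up for illustration.

```python
def training_memory_bytes(num_params, num_activations, mixed):
    """Lower-bound memory estimate in bytes, following the table above."""
    if mixed:
        model = (4 + 2) * num_params       # fp32 weights + half-precision copy
        gradients = 2 * num_params         # half-precision gradients, overhead ignored
        activations = 2 * num_activations
    else:
        model = 4 * num_params
        gradients = 4 * num_params
        activations = 4 * num_activations
    optimizer = 8 * num_params             # optimizer states, same in both modes
    return model + optimizer + gradients + activations


N = 100_000_000                   # hypothetical number of parameters
for A in (0, 1_000_000_000):      # hypothetical numbers of activations
    mixed = training_memory_bytes(N, A, mixed=True)
    full = training_memory_bytes(N, A, mixed=False)
    print(f"A={A:>13,}  mixed={mixed / 1e9:.1f} GB  fp32={full / 1e9:.1f} GB")
```

With no activations both modes need the same amount; the gap of 2 bytes per activation only appears as A grows, which in practice means larger batch sizes or longer sequences.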

To use mixed precision with Accelerate, there are 3 ways:

 - instantiate the accelerator with the mixed precision argument:

    ```python
    accelerator = Accelerator(mixed_precision="bf16")
    ```
 - use the config command: 
 accelerate config -> choose bf16
 - use the launcher with the mixed precision argument: 
 accelerate launch --mixed_precision bf16 train.py

However, for the Trainer of Transformers, only the last 2 of these methods can be used to run the training with mixed precision.

Conclusion:
 - the running time decreases
 - the memory usage decreases when the batch size is large

In code, see the script accelerate_torch_mixed.py. The modifications are:

 * IV.A. set mixed precision 

## V. Gradient accumulation

To add gradient accumulation with Accelerate, there are 3 steps:

 1. specify the accumulation steps when instantiating the accelerator:

    ```python
    accelerator = Accelerator(gradient_accumulation_steps=xx)
    ```

 2. add accumulation context in the training loop:

    ```python
    for batch in trainloader:
       with accelerator.accumulate(model): # <- ADD CONTEXT
           _optimizer.zero_grad()          # <- indent all lines from here
           output=_model(**batch)          # ...
           ...                             # ...
           gStep += 1                      # <- until here
    ```

 3. So far, the global step is incremented with every batch, which doesn't reflect the actual number of updates. Instead, the step should be incremented each time the gradients are updated. Accelerate provides an attribute that indicates when this happens, which we can use as follows:

      ```python
      if accelerator.sync_gradients: # <- ADD CONDITION
         gStep += 1
         if gStep % _log_step  == 0:
             ...
         ...
      ```
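
To see why the global step should follow the gradient updates rather than the batches, the plain-Python simulation below counts the updates for a given number of batches. It does not use Accelerate: `count_global_steps` is a hypothetical helper, and the synchronization rule it applies (every `accumulation_steps` batches, plus the last batch of the dataloader) is an assumption about the default behavior.

```python
def count_global_steps(num_batches, accumulation_steps):
    """Count the gradient updates for one pass over `num_batches` batches."""
    global_step = 0
    for batch_idx in range(1, num_batches + 1):
        # Assumed rule: synchronize every `accumulation_steps` batches, and
        # also on the last batch so that no accumulated gradient is left over.
        sync_gradients = batch_idx % accumulation_steps == 0 or batch_idx == num_batches
        if sync_gradients:
            global_step += 1
    return global_step


print(count_global_steps(num_batches=10, accumulation_steps=1))  # 10
print(count_global_steps(num_batches=10, accumulation_steps=4))  # 3
```

With 10 batches and 4 accumulation steps, a per-batch counter would report 10 steps while only 3 updates happened, so logging or checkpoint intervals based on it would fire at the wrong moments.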

In code, see the script accelerate_torch_mixed.py. The modifications are:

 * V.A. set gradient accumulation
 * V.B. accumulate context 
 * V.C. update accumulation step


For the trainer of Transformers, we just have to add the accumulation steps in the training arguments and run the training as before:

```python
args = TrainingArguments(
    output_dir="./checkpoints",
    ...
    gradient_accumulation_steps=32,
    ...
)
```

## VI. Logging

Accelerate supports the most common logging libs: Tensorboard, Wandb, CometML, Aim, MLFlow, Neptune, Visdom.

The steps are:

 1. specify the logging lib and project_dir when instantiating the accelerator:

    ```python
    accelerator = Accelerator(log_with="tensorboard", project_dir="checkpoints")
    ```

 2. initialize the trackers

    ```python
    accelerator.init_trackers(project_name="runs")
    ```

  3. end tracking after the training finishes

      ```python
      accelerator.end_training()
      ```

 4. log variables with the log method of the accelerator, which takes a dict and an int (the step) as inputs

      ```python
      accelerator.log({"loss": loss.item()}, Step) 
      ```

 5. open tensorboard: see the tensorboard doc for details. In VS Code, we can open tensorboard directly from the command palette (ctrl+shift+p) and choose the dir to open.


In code, see the script accelerate_torch_mixed.py. The modifications are:

 * VI.A. set logging 
 * VI.B. initiate tracker
 * VI.C. end tracking
 * VI.D. log variables

For the Trainer of Transformers, add the logging options to the training arguments as needed; the trainer takes care of the rest.

## VII. Save

The commonly used files are:
 
 * weight files (eg. .bin and .safetensors)
 * config files (eg. config.json): to describe model structure
 * misc: generation_config.json, adapter_model.safetensors

To save on a single machine, we can simply use model.save_pretrained(dir). However, in a distributed run, this method returns an error. In this case, we can use:

```python
accelerator.save_model(model, accelerator.project_dir + f"/step_{step}")
```

But there are 2 problems with this method:

 * it doesn't save the config file, only the weights.
 * if there are adapters (PEFT models), it saves the whole model instead of only the adapters.

 So another method is to use:

```python
accelerator.unwrap_model(_model).save_pretrained(...)
```


In code, see the script accelerate_torch_mixed.py. The modifications are:

 * VII.A. save


For the Trainer of Transformers, nothing extra needs to be done.


## VIII. Resume

The process of resuming training consists of:

 * save checkpoint regularly
 * in case of resume, load all resources such as weights, optimizer, learning rate, random state...
 * skip epochs and steps already done.

For accelerate, the steps to follow are:

 1. checkpoint:

    ```python
    accelerator.save_state()
    ```
    
After saving, there will be some files such as:
  - model weights: model.safetensors
  - optimizer: optimizer.bin
  - random state files: random_states_0.pkl, random_states_1.pkl
  - adapter files if any: adapter_config.json, adapter_model.safetensors
  - ...

 2. load

    ```python
    accelerator.load_state()
    ```

 3. compute epochs and steps to skip

 4. skip epochs and steps

    ```python
    accelerator.skip_first_batches(trainloader, resume_step)
    ```
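
Step 3 (computing what to skip) has no dedicated Accelerate call. A minimal sketch is given below, assuming that the global step reached at the checkpoint is known (for example, stored with the checkpoint), that one global step corresponds to `accumulation_steps` batches, and that the number of batches per epoch is a multiple of `accumulation_steps`. `compute_resume_position` is a hypothetical helper, not part of Accelerate.

```python
def compute_resume_position(global_step, batches_per_epoch, accumulation_steps=1):
    """Return (number of epochs to skip, number of batches to skip in the next epoch)."""
    # Assumption: each global step consumed exactly `accumulation_steps` batches.
    batches_done = global_step * accumulation_steps
    resume_epoch, resume_step = divmod(batches_done, batches_per_epoch)
    return resume_epoch, resume_step


resume_epoch, resume_step = compute_resume_position(global_step=250, batches_per_epoch=100)
print(resume_epoch, resume_step)  # 2 50
```

In the training loop, epochs below `resume_epoch` are skipped entirely, and in the epoch equal to `resume_epoch` the dataloader is replaced by the result of `accelerator.skip_first_batches(trainloader, resume_step)`.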

In code, see the script accelerate_torch_mixed.py. The modifications are:

 * VIII.A. save for resume
 * VIII.B. add resume option
 * VIII.C. resume training


For the Trainer in Transformers, there are 3 ways to resume the training:
 
 * We can load the trained model and train it as done for the finetuning.
 * In the training arguments, add the resume option:

      ```python
      args = TrainingArguments(
          ...
          resume_from_checkpoint=True
      )
      ```
    This restarts the saving, and thus overwrites the previously saved training.

 * In the train function of the trainer, add resume option:

    ```python
    trainer.train(resume_from_checkpoint=True)
    ```
   This continues the training until the configured number of epochs is reached. So if all epochs have already been trained and we want to train further, we should increase the number of epochs.

## IX. DeepSpeed

### A. Introduction

The DeepSpeed doc for Accelerate is at https://huggingface.co/docs/accelerate/main/en/usage_guides/deepspeed#deepspeed-resources. It explains everything about DeepSpeed, which implements the Zero Redundancy Optimizer (ZeRO) paper.

DeepSpeed makes it possible to load, train, or run inference on large models using GPUs with limited capacity. It includes:
 
 * ZeRO1: Optimizer partitioning
 * ZeRO2: ZeRO1 + gradients partitioning
 * ZeRO3: ZeRO2 + parameters partitioning
 * ZeRO-Offload: offload resources to CPU
 * ZeRO++: optimize communication of ZeRO3
 * DeepSpeed Ulysses: data partitioning

### B. Usage

Before using it, we need to pip install it.

In practice, we could use accelerate config to activate DeepSpeed and accelerate launch to run the training with the default arguments. However, since we would like to adapt the arguments to our use case, the easier way is to launch accelerate with a customized config file. 
The steps to do so are:

 * generate the accelerate config file with DeepSpeed enabled:
 
    ```bash
    $ accelerate config
    ```

   All other related options can be set to NO or None; for now, we use stage 2 optimization.

 * To avoid overwriting the original config file, we copy the default config file to another location (here, the current folder), since we will modify this copy to run DeepSpeed in different modes.
   To copy it to the current folder, we can do:

   ```bash
   $ cp ~/.cache/huggingface/accelerate/default_config.yaml  ./
   ```

#### B.1. default config

At this point, we can already run the script with ZeRO2. The corresponding settings in the config file are:

```yaml
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
```
where "zero_stage" sets DeepSpeed to ZeRO2.

Currently, we can either use the config file generated by accelerate, or we can use the copied config file. For the latter, we have to use an argument:

  ```bash
  $ accelerate launch --config_file default_config.yaml accelerate_torch_mixed.py
  ```

#### B.2. custom config

In the previous steps, when generating the accelerate config file, we were asked to provide a config file (for DeepSpeed, not to be confused with the accelerate config file) and we entered none. In the following steps, we will use a DeepSpeed config file.
The content of this file can be taken from the DeepSpeed doc page (the link at the top of this section) and adapted to the use case.
The steps to follow are:

 1. comment out the child fields of the "deepspeed_config" section shown previously in the accelerate config file. Also comment out the line "mixed_precision: bf16".
 2. add a new entry in this section: 

    ```yaml
    deepspeed_config:
      deepspeed_config_file: zero2_config.json
    ```
 3. create the "zero2_config.json" file: copy the example from the "ZeRO Stage-2 DeepSpeed Config File Example" section of the doc into it. Note that the DeepSpeed config file is JSON, whereas the accelerate config file is YAML. We should adapt the content to our training case (see the json file for the changes).
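
As an illustration of step 3, the JSON file can also be generated from Python, which avoids syntax slips when editing it by hand. This is only a sketch: the values below are illustrative and not the content of the author's zero2_config.json, the key names are assumed to follow the DeepSpeed config schema used in the doc example, and the file is written to a temporary folder rather than next to the accelerate config file.

```python
import json
import tempfile
from pathlib import Path

# Illustrative values only: adapt them to the training case.
zero2_config = {
    "bf16": {"enabled": True},              # mixed precision is set here, not in the accelerate config
    "zero_optimization": {
        "stage": 2,                         # ZeRO2
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": 1,
    "train_micro_batch_size_per_gpu": "auto",
}

# Written to a temporary folder here; in practice, write it where the
# "deepspeed_config_file" entry of the accelerate config points to.
config_path = Path(tempfile.mkdtemp()) / "zero2_config.json"
config_path.write_text(json.dumps(zero2_config, indent=2))

# Read the file back to check that it is valid JSON.
loaded = json.loads(config_path.read_text())
print(loaded["zero_optimization"]["stage"])  # 2
```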


 For ZeRO3, we can also use a config file, as for ZeRO2.

### C. Trainer

For the Trainer of Transformers, there is almost nothing extra to do to run it with DeepSpeed.
However, we may encounter errors. They are mostly caused by differences between the trainer arguments and the DeepSpeed arguments, such as mixed precision being set for DeepSpeed but not in the trainer arguments, or the gradient accumulation steps having different values in the two.

To run the example, do:

```bash
$ accelerate launch --config_file default_config.yaml ddp_train_transformers.py
```


### D. Error Handling

DeepSpeed gives very obscure error messages. Here is a list of possible errors; most of them are caused by argument differences:

 1. The error message "deepspeed.ops.op_builder.builder.MissingCUDAException: CUDA_HOME does not exist, unable to compile CUDA op(s)" usually means that the CUDA toolkit or its path is missing. Several things can be done:
    - On Linux, check whether the CUDA compiler can be accessed (alternatively, which nvcc is supposed to return its installation path):
        ```bash
        $ nvcc --version
        ```
    - If you get an error message such as "command not found", check whether CUDA is actually installed by looking for its installation folder (this path can vary depending on the OS; in my case, using Gentoo, look in opt instead of usr):
        ```bash
        $ ls /usr/local
        ```
    - If the CUDA installation folder exists, CUDA is installed but its path is not registered in the system. So we just need to set the corresponding environment variable:
        ```bash
        $ export CUDA_HOME=/the/path/to/cuda
        ```
    - If CUDA is not installed, you have to install it first.

    For conda, refer to its doc for installation.

  2. There must be only one mixed precision argument. Since we set "bf16" to enabled in the DeepSpeed config file, the same option in the accelerate config file should be commented out. Otherwise, an error is raised.

  3. The accumulation steps should be set to 1 in the training code.

  4. If the model is trained in 16-bit precision and we use the DeepSpeed config file, the "mixed_precision" option should be commented out in the accelerate config file.

  5. For ZeRO3, if the model is trained using 16 bits and loading and saving are involved, the option "zero3_save_16bit_model: true" should be added to the deepspeed section.

  6. For ZeRO3, make sure to use no_grad() instead of inference_mode() in the evaluation function.
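
The manual checks of item 1 (is nvcc reachable, is CUDA_HOME set) can be gathered in a small diagnostic script. The sketch below uses only the Python standard library and changes nothing on the system; `check_cuda_toolkit` is a hypothetical helper name.

```python
import os
import shutil


def check_cuda_toolkit():
    """Report where nvcc and CUDA_HOME point, without modifying anything."""
    nvcc_path = shutil.which("nvcc")         # None if nvcc is not on the PATH
    cuda_home = os.environ.get("CUDA_HOME")  # None if the variable is not set
    return {
        "nvcc": nvcc_path,
        "CUDA_HOME": cuda_home,
        "cuda_home_exists": bool(cuda_home) and os.path.isdir(cuda_home),
    }


report = check_cuda_toolkit()
for key, value in report.items():
    print(f"{key}: {value}")
```

If `nvcc` is found but `CUDA_HOME` is empty, exporting `CUDA_HOME` as described in item 1 is the likely fix; if both are empty, CUDA is probably not installed or not on the PATH.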