# DeepSpeed with Accelerate

## DeepSpeed Overview
DeepSpeed is an optimization framework developed by Microsoft to simplify and enhance distributed deep learning training. The key components include:

### ZeRO (Zero Redundancy Optimizer)
DeepSpeed uses **ZeRO** to reduce per-GPU memory usage by partitioning model states across data-parallel GPUs instead of replicating them on every GPU. ZeRO has three stages:

1. **ZeRO-1 (Optimizer States Optimization)**
   - Reduces memory by partitioning optimizer states across GPUs.
   
2. **ZeRO-2 (Optimizer States + Gradient Optimization)**
   - Further reduces memory by partitioning gradients along with optimizer states.
   - Communication volume remains the same compared to **DDP (Distributed Data Parallel)**.

3. **ZeRO-3 (Optimizer States + Gradients + Parameters Optimization)**
   - Fully shards model states across GPUs, reducing memory overhead.
   - Requires additional communication to gather the sharded parameters, raising overall communication volume to about **1.5x** that of DDP.
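
To make the three stages concrete, here is a rough estimate of per-GPU memory for model states. It is only a sketch: it assumes mixed-precision Adam (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states) and ignores activations and buffers. The function names are made up for illustration.

```python
def zero_memory_bytes(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU bytes for model states under mixed-precision Adam.

    Assumption: 2 bytes/param for fp16 weights, 2 for fp16 gradients and
    12 for fp32 optimizer states (master weights, momentum, variance).
    """
    params = 2 * num_params
    grads = 2 * num_params
    optim = 12 * num_params
    if stage >= 1:
        optim /= num_gpus   # ZeRO-1: partition optimizer states
    if stage >= 2:
        grads /= num_gpus   # ZeRO-2: also partition gradients
    if stage >= 3:
        params /= num_gpus  # ZeRO-3: also partition parameters
    return params + grads + optim


def comm_volume_factor(stage: int) -> float:
    """Per-step communication volume relative to plain DDP."""
    return 1.5 if stage == 3 else 1.0


# Example: a 7.5B-parameter model on 64 GPUs (stage 0 means plain DDP).
for stage in range(4):
    gb = zero_memory_bytes(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU, {comm_volume_factor(stage)}x DDP traffic")
```

Under these assumptions, most of the saving in ZeRO-1 comes from the optimizer states, which are the largest of the three components.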

### Advanced ZeRO Features
- **ZeRO Offload**: Moves optimizer states and, with ZeRO-3, model parameters out of GPU memory to CPU memory or NVMe storage (the latter via ZeRO-Infinity) for further memory savings.
- **ZeRO++**: Improves ZeRO-3 communication efficiency using quantized weights and gradients together with hierarchical partitioning.
- **DeepSpeed Ulysses**: Sequence parallelism for long-sequence training; each sample is split along the sequence dimension across GPUs.
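
As an illustration of offloading, the snippet below builds the `zero_optimization` section of a DeepSpeed config with optimizer states and parameters offloaded. The helper name `zero3_offload_section` is hypothetical, and the values are a starting point rather than tuned settings.

```python
import json
from typing import Optional


def zero3_offload_section(device: str = "cpu", nvme_path: Optional[str] = None) -> dict:
    """Build a ZeRO-3 config section that offloads optimizer states and parameters."""
    if device not in ("cpu", "nvme"):
        raise ValueError("device must be 'cpu' or 'nvme'")
    if device == "nvme" and nvme_path is None:
        raise ValueError("nvme_path is required when offloading to NVMe")

    target = {"device": device, "pin_memory": True}
    if device == "nvme":
        target["nvme_path"] = nvme_path  # directory on the NVMe drive

    return {
        "zero_optimization": {
            "stage": 3,  # parameter offload requires ZeRO-3
            "offload_optimizer": dict(target),
            "offload_param": dict(target),
        }
    }


print(json.dumps(zero3_offload_section("cpu"), indent=2))
```

The printed JSON can be merged into a larger DeepSpeed config file.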

## DeepSpeed with Accelerate Integration
Accelerate provides a seamless way to configure and run DeepSpeed in distributed environments.

### Method 1: Basic Integration
1. Configure DeepSpeed settings using the `accelerate config` command.
2. Launch the script with `accelerate launch`.
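
For reference, the file that `accelerate config` writes for Method 1 looks roughly like the YAML rendered below. This is a sketch assuming one machine with two GPUs and fp16; `render_accelerate_config` is a hypothetical helper, and your generated file may contain additional keys.

```python
from textwrap import dedent


def render_accelerate_config(zero_stage: int = 2, num_processes: int = 2,
                             mixed_precision: str = "fp16") -> str:
    """Render a single-machine Accelerate config with an inline DeepSpeed section."""
    return dedent(f"""\
        compute_environment: LOCAL_MACHINE
        distributed_type: DEEPSPEED
        deepspeed_config:
          gradient_accumulation_steps: 1
          gradient_clipping: 1.0
          offload_optimizer_device: none
          offload_param_device: none
          zero3_init_flag: false
          zero_stage: {zero_stage}
        machine_rank: 0
        main_training_function: main
        mixed_precision: {mixed_precision}
        num_machines: 1
        num_processes: {num_processes}
        use_cpu: false
        """)


print(render_accelerate_config())
```

Saved to a file, this text can be passed to `accelerate launch` through `--config_file`.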

### Method 2: Full DeepSpeed Configuration
1. Run `accelerate config` and, when prompted, point it at a DeepSpeed config file.
2. Edit that **DeepSpeed config file** (`config.json`) as needed.
3. Launch training with `accelerate launch`.
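
A minimal DeepSpeed config for Method 2 might look like the dictionary below, written out as JSON. This is a sketch, not the exact file used in this notebook: the `"auto"` values are meant to be filled in by Accelerate at runtime, and the Accelerate YAML would reference the resulting file through `deepspeed_config_file` inside its `deepspeed_config` section. The function name `build_zero2_config` is hypothetical.

```python
import json


def build_zero2_config() -> dict:
    """Return a small ZeRO-2 DeepSpeed config as a plain dict."""
    return {
        "fp16": {"enabled": True},
        "zero_optimization": {
            "stage": 2,
            "overlap_comm": True,          # overlap gradient reduction with backward
            "contiguous_gradients": True,  # reduce memory fragmentation
        },
        "gradient_accumulation_steps": 1,
        "gradient_clipping": "auto",
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
    }


# Print the JSON; redirect or write it to a file such as zero_stage2_config.json.
print(json.dumps(build_zero2_config(), indent=2))
```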

### ZeRO-3 Considerations
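- `zero3_init_flag`: Enables `deepspeed.zero.Init`, so large models are constructed already partitioned instead of being fully materialized on every process.
- `zero3_save_16bit_model`: Saves consolidated **16-bit** model weights when using ZeRO-3.
- If a DeepSpeed `config.json` is used instead, `stage3_gather_16bit_weights_on_model_save` plays the same role: it gathers the sharded 16-bit weights before saving.
- Evaluation should be performed within a no-gradient context (for example `torch.no_grad()`).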
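
Saving under ZeRO-3 needs care because each GPU holds only a shard of the weights. The helper below follows the pattern from the Accelerate DeepSpeed guide; it is a sketch that assumes `model` is a Transformers-style model with `save_pretrained` and that one of the 16-bit save options above is enabled. The function name `save_zero3_model` is made up for this example.

```python
def save_zero3_model(accelerator, model, output_dir: str) -> None:
    """Gather ZeRO-3 shards and save a consolidated checkpoint.

    `accelerator` is an `accelerate.Accelerator`; `model` is the object
    returned by `accelerator.prepare(...)`.
    """
    accelerator.wait_for_everyone()
    unwrapped = accelerator.unwrap_model(model)

    # Call this on every process: under ZeRO-3 it gathers the sharded weights.
    state_dict = accelerator.get_state_dict(model)

    unwrapped.save_pretrained(
        output_dir,
        is_main_process=accelerator.is_main_process,  # only rank 0 writes files
        save_function=accelerator.save,
        state_dict=state_dict,
    )
```

It would typically be called once at the end of training, on all processes.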

### Multi-Machine & Multi-GPU Training
- In **direct launch mode**, users must specify settings such as `deepspeed_hostfile`, `num_machines`, and `main_process_port`.
- Under **SLURM management**, `accelerate launch` is run on each node and behaves much like `torchrun`, taking parameters such as `num_machines`, `machine_rank`, and `main_process_ip`.
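
For the SLURM case, the per-node command can be assembled from SLURM's environment variables. The sketch below assumes `SLURM_NNODES` and `SLURM_NODEID` are set by the scheduler and that the caller already knows the IP of the first node; `build_launch_command` and the example IP are hypothetical.

```python
import os
import shlex
from typing import List, Mapping, Optional


def build_launch_command(script: str, main_process_ip: str, gpus_per_node: int,
                         main_process_port: int = 29500,
                         config_file: Optional[str] = None,
                         env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Build the `accelerate launch` command for the current SLURM node."""
    env = os.environ if env is None else env
    num_machines = int(env.get("SLURM_NNODES", "1"))
    machine_rank = int(env.get("SLURM_NODEID", "0"))

    cmd = ["accelerate", "launch"]
    if config_file is not None:
        cmd += ["--config_file", config_file]
    cmd += [
        "--num_machines", str(num_machines),
        "--machine_rank", str(machine_rank),
        "--main_process_ip", main_process_ip,
        "--main_process_port", str(main_process_port),
        # num_processes is the total across all machines, not per node.
        "--num_processes", str(num_machines * gpus_per_node),
        script,
    ]
    return cmd


print(shlex.join(build_launch_command("ddp_accelerator.py", "10.0.0.1", gpus_per_node=4)))
```

Each node runs the same command, differing only in `--machine_rank`.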

## Key Takeaways
- **DeepSpeed** optimizes large-scale training using **ZeRO** techniques.
- **Accelerate** simplifies DeepSpeed configuration and deployment.
- **ZeRO-3 + Offloading** can significantly reduce GPU memory usage.
- **Accelerate works with both local and multi-machine training.**

This integration allows efficient training of large-scale models on multiple GPUs with minimal manual setup.


## Launch ZeRO-2 with Method 1
Here we can reuse the __ddp_accelerator.py__ script created in the previous notebooks. Running `ddp_accelerator2.py` fails for now, because DeepSpeed does not work with `accelerator.accumulate()` in this setup, but you can try to build on it yourself.  
I copied my config file to `deepspeed_config.yaml` for this run.

In [1]:
# !accelerate config
# !accelerate launch --config_file deepspeed_config.yaml ddp_accelerator.py

## Launch ZeRO-2 with Method 2
In this method, you prepare the config files yourself, which allows more customization. For the available options and details, see this [document](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed).  
Here we use the config files `deepspeed_config2.yaml` and `zero_stage2_config.json`.

In [2]:
# !accelerate launch --config_file deepspeed_config2.yaml ddp_accelerator.py

## Launch ZeRO-3 with both Methods
For ZeRO-3, a few config settings also need to be changed.

In [3]:
# !accelerate launch --config_file deepspeed_config3.yaml ddp_accelerator.py
# !accelerate launch --config_file deepspeed_config4.yaml ddp_accelerator.py
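
The contents of `deepspeed_config3.yaml` and `deepspeed_config4.yaml` are not reproduced here. As a rough sketch of the kind of changes involved, the snippet below lists the Accelerate-side keys for Method 1 and upgrades a ZeRO-2 DeepSpeed config dict for Method 2. Both `ACCELERATE_ZERO3_OVERRIDES` and `to_zero3` are hypothetical names, and the actual files may differ.

```python
import copy

# Method 1: keys to change inside the `deepspeed_config` section of the YAML.
ACCELERATE_ZERO3_OVERRIDES = {
    "zero_stage": 3,
    "zero3_init_flag": True,
    "zero3_save_16bit_model": True,
}


def to_zero3(ds_config: dict) -> dict:
    """Return a copy of a DeepSpeed config dict switched to ZeRO-3 (Method 2)."""
    cfg = copy.deepcopy(ds_config)
    zero = cfg.setdefault("zero_optimization", {})
    zero["stage"] = 3
    # Gather sharded 16-bit weights so a full checkpoint can be saved.
    zero["stage3_gather_16bit_weights_on_model_save"] = True
    return cfg


zero2 = {"zero_optimization": {"stage": 2, "overlap_comm": True}}
print(to_zero3(zero2))
```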

## Launch ZeRO-2 with Trainer
We can also run the previous __ddp_trainer.py__ script with the same DeepSpeed config.

In [4]:
# !accelerate launch --config_file deepspeed_config.yaml ddp_trainer.py