# DeepSpeed QLoRA ZeRO-3 Use Case


The key parameters of the DeepSpeed config file are described below, highlighting what happens during the training process.


`zero_stage: 3`

This enables ZeRO Stage 3, which partitions the optimizer states, gradients, and model parameters across GPUs to minimize memory consumption. Splitting these memory-intensive components across GPUs is what makes it possible to train very large models. It resembles model parallelism in that parameters are distributed rather than duplicated on each device, although each GPU still runs the whole model on its own slice of the data.
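
To make the memory effect concrete, here is a rough back-of-the-envelope sketch. It assumes full fine-tuning with mixed-precision Adam (2 bytes per parameter for 16-bit weights, 2 for 16-bit gradients, 12 for the fp32 optimizer state) and ignores activations. The function name `zero_bytes_per_gpu` is made up for illustration. With QLoRA the base weights are frozen and stored in 4 bits, so real numbers for this notebook would be much smaller; the point is only how each ZeRO stage divides the model states.

```python
def zero_bytes_per_gpu(num_params: int, num_gpus: int, stage: int, k: int = 12) -> float:
    """Approximate model-state memory per GPU, in bytes, for mixed-precision Adam.

    Assumptions: 2 bytes/param for 16-bit weights, 2 bytes/param for 16-bit
    gradients, and k bytes/param of optimizer state (k=12 covers fp32 weights,
    momentum, and variance). Activations and temporary buffers are ignored.
    """
    params = 2 * num_params
    grads = 2 * num_params
    optim = k * num_params
    if stage >= 1:  # Stage 1 partitions the optimizer state
        optim /= num_gpus
    if stage >= 2:  # Stage 2 also partitions the gradients
        grads /= num_gpus
    if stage >= 3:  # Stage 3 also partitions the parameters
        params /= num_gpus
    return params + grads + optim


if __name__ == "__main__":
    # 1.1B parameters (the size in the TinyLlama model name) on 2 GPUs.
    for stage in (0, 1, 2, 3):
        gb = zero_bytes_per_gpu(1_100_000_000, num_gpus=2, stage=stage) / 1e9
        print(f"ZeRO stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions a 1.1B-parameter model needs about 17.6 GB of model states per GPU without ZeRO and about 8.8 GB per GPU with Stage 3 on two GPUs.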


`offload_optimizer_device: none` & `offload_param_device: none`

These options specify *where* the optimizer states and parameters are stored. In this case, nothing is offloaded to CPU or NVMe, meaning *all optimizer states and parameters stay in GPU memory*.
Offloading is normally used to free up GPU memory, but here it is disabled to keep things simpler.

`zero3_save_16bit_model: true`

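With ZeRO Stage 3 the weights are sharded across GPUs, so this option gathers them at save time and writes a single consolidated checkpoint in 16-bit precision. A 16-bit checkpoint is about half the size of a 32-bit one and matches the precision used during mixed precision training.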

`mixed_precision: bf16`

Indicates that training will use bfloat16 (bf16) mixed precision. This *reduces the required memory and computation load while maintaining enough numerical accuracy* for training deep models. The memory savings are especially useful in multi-GPU setups.
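
The trade-off bf16 makes can be seen in a small pure-Python sketch. bfloat16 keeps the sign bit, the 8 exponent bits, and the top 7 mantissa bits of a float32, so it covers the same range as float32 with less precision. The helper `to_bf16` below is hypothetical and only emulates the rounding with `struct`; it does not handle NaN or infinity, and real training relies on the framework's own bf16 kernels.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a value to the nearest bfloat16 and return it as a Python float.

    Illustrative only: NaN and infinity are not handled.
    """
    # Reinterpret the float32 encoding of x as an unsigned 32-bit integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 low bits that bf16 drops.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    print(to_bf16(3.14159))  # only about 2-3 significant decimal digits survive
    print(to_bf16(1e38))     # still finite: bf16 shares float32's exponent range
```

Here `to_bf16(3.14159)` returns `3.140625`, while a value as large as `1e38` remains representable. fp16, by comparison, overflows above 65504, which is one reason bf16 is often preferred for training.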

`num_processes: 2`

Specifies that *2 processes will be used, typically one for each GPU*. This is the data-parallel part of the setup: each process handles a different portion of the data. In plain data parallelism the model would also be duplicated on every GPU; here ZeRO Stage 3 shards it instead.

`zero3_init_flag: true`

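This enables `deepspeed.zero.Init`, which partitions the model's parameters across GPUs while the model is being constructed, so the full model never has to fit on a single device. It applies only to ZeRO Stage 3 and matters most for models that are too large to load on one GPU.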

`distributed_type: DEEPSPEED`

This sets the distributed backend to DeepSpeed, which handles *data parallelism and the sharding of model states*, optimizing memory usage and computation across GPUs. It works closely with the ZeRO stages to distribute workloads efficiently.

`num_machines: 1`

Specifies that only *one machine is used, but with multiple GPUs*. This is a typical setup for local distributed training, where parallelism occurs across multiple GPUs on the same machine.
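
Putting the parameters above together, the config file might look roughly like the output of the sketch below. The actual `deepspeed_config_z3_qlora.yaml` is not shown here, so this is an assumption: only the keys described above are included, and nesting the ZeRO options under `deepspeed_config` follows the layout that `accelerate config` usually writes. The helper `to_yaml_lines` is a hypothetical minimal renderer, not a general YAML serializer.

```python
# Keys and values are taken from the descriptions above; the nesting is assumed.
config = {
    "deepspeed_config": {
        "offload_optimizer_device": "none",
        "offload_param_device": "none",
        "zero3_init_flag": True,
        "zero3_save_16bit_model": True,
        "zero_stage": 3,
    },
    "distributed_type": "DEEPSPEED",
    "mixed_precision": "bf16",
    "num_machines": 1,
    "num_processes": 2,
}


def to_yaml_lines(mapping: dict, indent: int = 0) -> list:
    """Render a nested dict of scalars as YAML-style lines (illustrative only)."""
    lines = []
    pad = " " * indent
    for key, value in mapping.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml_lines(value, indent + 2))
        elif isinstance(value, bool):
            # YAML booleans are lowercase.
            lines.append(f"{pad}{key}: {str(value).lower()}")
        else:
            lines.append(f"{pad}{key}: {value}")
    return lines


if __name__ == "__main__":
    print("\n".join(to_yaml_lines(config)))
```

A real config file typically contains more keys than the ones discussed here, so treat the output as a starting point rather than a complete file.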

## How This Relates to Data and Model Parallelism

### Data Parallelism
With `num_processes: 2`, each GPU handles a different portion of the data. In plain data parallelism, such as PyTorch's DistributedDataParallel (DDP), every GPU would also hold a full copy of the model. DeepSpeed keeps the same data-parallel training loop, synchronizing gradients across GPUs during backpropagation, but under ZeRO Stage 3 the model states are sharded rather than replicated. [DEEPSPEED zero3](https://deepspeed.readthedocs.io/en/latest/zero3.html)
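
The gradient synchronization step can be illustrated with a toy example in plain Python. It is a conceptual sketch, not DeepSpeed code: the one-parameter model, the data, and the function `grad_mse` are invented for illustration, and the "all-reduce" is just an average over a list. It also computes the effective batch size implied by the launch command in the Run Training section (`per_device_train_batch_size 2`, `gradient_accumulation_steps 2`) together with `num_processes: 2`.

```python
import math


def grad_mse(w: float, batch: list) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2 * x * (w * x - y) for x, y in batch) / len(batch)


# Toy data and weight (illustrative values only).
batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.5), (4.0, 8.0)]
w = 1.5

# Each of the 2 processes sees a different, equally sized slice of the batch.
shards = [batch[0::2], batch[1::2]]
local_grads = [grad_mse(w, shard) for shard in shards]

# Gradient synchronization: average the per-process gradients.
synced_grad = sum(local_grads) / len(local_grads)

# With equal shard sizes, this matches the gradient over the whole batch.
full_grad = grad_mse(w, batch)

# Effective batch size per optimizer step for the settings in this notebook.
per_device_train_batch_size = 2
gradient_accumulation_steps = 2
num_processes = 2
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_processes
)

if __name__ == "__main__":
    print(local_grads, synced_grad, full_grad)
    print("effective batch size:", effective_batch_size)
```

Because the averaged gradient equals the full-batch gradient, every process applies the same update, and each optimizer step in this setup effectively sees 8 sequences.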

### Model Parallelism
ZeRO Stage 3 comes into play here. Instead of duplicating the model across GPUs (which consumes a lot of memory), the model parameters, gradients, and optimizer states are partitioned across GPUs. Each GPU gathers the parameters it needs just before using them and releases them afterwards. This allows you to train much larger models without running out of GPU memory. [DEEPSPEED JSON CONFIG](https://www.deepspeed.ai/docs/config-json/)
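
As a conceptual sketch of the partitioning idea, the snippet below splits a flat list of parameters into one shard per GPU and reassembles the full list on demand, which is the role an all-gather plays in ZeRO Stage 3. The functions `partition` and `all_gather` are hypothetical stand-ins written in plain Python; real DeepSpeed partitions tensors with padding and uses collective communication between devices.

```python
def partition(flat_params: list, world_size: int) -> list:
    """Split parameters into contiguous shards, one per rank."""
    shard_size = -(-len(flat_params) // world_size)  # ceiling division
    return [
        flat_params[rank * shard_size:(rank + 1) * shard_size]
        for rank in range(world_size)
    ]


def all_gather(shards: list) -> list:
    """Reassemble the full parameter list from every rank's shard."""
    return [p for shard in shards for p in shard]


if __name__ == "__main__":
    params = [0.1 * i for i in range(10)]  # stand-in for a layer's weights
    shards = partition(params, world_size=2)

    # At rest, each rank stores only its own shard (half the parameters here).
    print([len(shard) for shard in shards])

    # Just before the layer runs, the full parameters are gathered,
    # used for the computation, and then discarded again.
    full = all_gather(shards)
    print(full == params)
```

Between layers only the local shard stays resident, which is why per-GPU memory for model states shrinks roughly in proportion to the number of GPUs.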

## Run Training

    !accelerate launch --config_file "deepspeed_config_z3_qlora.yaml" train.py \
    --seed 100 \
    --model_name_or_path "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
    --dataset_name "DFKI-SLT/cross_ner" \
    --splits "train,test,validation" \
    --max_seq_len 2000 \
    --num_train_epochs 1 \
    --logging_steps 5 \
    --log_level "info" \
    --logging_strategy "steps" \
    --evaluation_strategy "epoch" \
    --save_strategy "epoch" \
    --bf16 True \
    --learning_rate 1e-4 \
    --lr_scheduler_type "cosine" \
    --weight_decay 1e-4 \
    --warmup_ratio 0.0 \
    --max_grad_norm 1.0 \
    --output_dir "llama-sft-qlora-dsz3" \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 2 \
    --gradient_checkpointing True \
    --use_reentrant True \
    --dataset_text_field "content" \
    --use_flash_attn True \
    --use_peft_lora True \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.1 \
    --lora_target_modules "all-linear" \
    --use_4bit_quantization True \
    --use_nested_quant True \
    --bnb_4bit_compute_dtype "bfloat16" \
    --bnb_4bit_quant_storage_dtype "bfloat16"


W1023 12:42:25.554000 96779 torch/distributed/elastic/multiprocessing/redirects.py:29] NOTE: Redirects are currently not supported in Windows or MacOs.
[2024-10-23 12:42:25,607] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to mps (auto detect)
W1023 12:42:26.319000 96779 torch/distributed/run.py:793] 
W1023 12:42:26.319000 96779 torch/distributed/run.py:793] *****************************************
W1023 12:42:26.319000 96779 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1023 12:42:26.319000 96779 torch/distributed/run.py:793] *****************************************
[2024-10-23 12:42:27,993] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to mps (auto detect)
[2024-10-23 12:42:27,993] [INFO] [real_accelerator.py:219:get_accelerator] Setting 

    # !pip install -q torch transformers datasets peft trl accelerate deepspeed bitsandbytes flash-attn --no-build-isolation
