Documentation:

1. [DeepSpeed training documentation](https://www.deepspeed.ai/training/)
1. [DeepSpeed HuggingFace integration](https://huggingface.co/docs/transformers/deepspeed)
1. [DeepSpeed BERT example](https://www.deepspeed.ai/tutorials/bert-pretraining/)
1. [DeepSpeed examples](https://github.com/microsoft/DeepSpeedExamples/tree/master/training)
1. [DeepSpeed Megatron BERT example](https://github.com/microsoft/Megatron-DeepSpeed/tree/main/examples_deepspeed/bert_with_pile)

For DeepSpeed to work, install the `libaio` development package on the system:
```bash
sudo apt-get update
sudo apt-get install libaio-dev
```
Then, inside the conda env, install `libstdcxx-ng`:
```bash
conda install -c conda-forge libstdcxx-ng
```
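
A quick sanity check that both native libraries are visible from the notebook's Python process can be sketched with the standard library alone. The helper name `check_native_deps` is hypothetical. `find_library` only locates the runtime shared object, not the development headers, so a hit is a hint rather than proof that DeepSpeed's ops will build. DeepSpeed's own `ds_report` command gives a fuller picture.

```python
import ctypes.util


def check_native_deps(names=("aio", "stdc++")):
    """Map each library short name to the file the loader search finds, or None."""
    return {name: ctypes.util.find_library(name) for name in names}


for name, found in check_native_deps().items():
    # A missing entry suggests the install steps above have not taken effect yet.
    print(f"lib{name}: {found if found else 'not found'}")
```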

In [1]:
import os
from transformers import (
    AutoConfig,
    AutoModel,
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
from datasets import load_dataset
import deepspeed

[2024-11-14 14:51:06,117] [INFO] [real_accelerator.py:219:get_accelerator] Setting ds_accelerator to cuda (auto detect)


/home/olokshyn/anaconda3/envs/tu/compiler_compat/ld: /usr/local/cuda/lib64/libcufile.so: undefined reference to `std::runtime_error::~runtime_error()@GLIBCXX_3.4'
/home/olokshyn/anaconda3/envs/tu/compiler_compat/ld: /usr/local/cuda/lib64/libcufile.so: undefined reference to `__gxx_personality_v0@CXXABI_1.3'
/home/olokshyn/anaconda3/envs/tu/compiler_compat/ld: /usr/local/cuda/lib64/libcufile.so: undefined reference to `std::ostream::tellp()@GLIBCXX_3.4'
/home/olokshyn/anaconda3/envs/tu/compiler_compat/ld: /usr/local/cuda/lib64/libcufile.so: undefined reference to `std::chrono::_V2::steady_clock::now()@GLIBCXX_3.4.19'
/home/olokshyn/anaconda3/envs/tu/compiler_compat/ld: /usr/local/cuda/lib64/libcufile.so: undefined reference to `std::string::_M_replace_aux(unsigned long, unsigned long, unsigned long, char)@GLIBCXX_3.4'
/home/olokshyn/anaconda3/envs/tu/compiler_compat/ld: /usr/local/cuda/lib64/libcufile.so: undefined reference to `typeinfo for bool@CXXABI_1.3'
/home/olokshyn/anaconda3/env

In [2]:
!free -m
!nvidia-smi

               total        used        free      shared  buff/cache   available
Mem:           31879        5240       20094         106        7113       26639
Swap:          15258        2715       12543
Thu Nov 14 14:51:07 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4070 ...    Off |   00000000:01:00.0  On |                  N/A |
| N/A   55C    P5             12W /  115W |     520MiB /   8188MiB |     26%      Default |
|                        

In [3]:
def estimate_memory_requirements(model_id: str, num_gpus_per_node: tuple[int, ...] = (1,)) -> None:
    """Print DeepSpeed's ZeRO-2 and ZeRO-3 model-state memory estimates for `model_id`."""
    # Instantiate from the config only: random weights, no checkpoint download.
    config = AutoConfig.from_pretrained(model_id)
    model = AutoModel.from_config(config)
    for n_gpus in num_gpus_per_node:
        print(f"ZERO2 with {n_gpus} GPUs")
        deepspeed.runtime.zero.stage_1_and_2.estimate_zero2_model_states_mem_needs_all_live(model, num_gpus_per_node=n_gpus)
        print(f"ZERO3 with {n_gpus} GPUs")
        deepspeed.runtime.zero.stage3.estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=n_gpus)
        print("\n\n")

In [4]:
estimate_memory_requirements("gpt2", num_gpus_per_node=(1, 2))

ZERO2 with 1 GPUs
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 1 GPU per node.
SW: Model with 124M total params.
  per CPU  |  per GPU |   Options
    2.78GB |   0.23GB | offload_optimizer=OffloadDeviceEnum.cpu
    0.70GB |   2.32GB | offload_optimizer=none
ZERO3 with 1 GPUs
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 1 GPU per node.
SW: Model with 124M total params, 38M largest layer params.
  per CPU  |  per GPU |   Options
    3.13GB |   0.14GB | offload_param=OffloadDeviceEnum.cpu, offload_optimizer=OffloadDeviceEnum.cpu, zero_init=1
    3.13GB |   0.14GB | offload_param=OffloadDeviceEnum.cpu, offload_optimizer=OffloadDeviceEnum.cpu, zero_init=0
    2.78GB |   0.38GB | offload_param=none, offload_optimizer=OffloadDeviceEnum.cpu, zero_init=1
    2.78GB |   0.38GB | offload_param=none, offload_optimizer=OffloadDeviceEnum.cpu, zero_init=0
    0.22GB |   2.23GB | offload_param=none, offload

In [5]:
estimate_memory_requirements("gpt2-large", num_gpus_per_node=(1, 2))

ZERO2 with 1 GPUs
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 1 GPU per node.
SW: Model with 774M total params.
  per CPU  |  per GPU |   Options
   17.30GB |   1.44GB | offload_optimizer=OffloadDeviceEnum.cpu
    4.33GB |  14.42GB | offload_optimizer=none
ZERO3 with 1 GPUs
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 1 GPU per node.
SW: Model with 774M total params, 64M largest layer params.
  per CPU  |  per GPU |   Options
   19.46GB |   0.24GB | offload_param=OffloadDeviceEnum.cpu, offload_optimizer=OffloadDeviceEnum.cpu, zero_init=1
   19.46GB |   0.24GB | offload_param=OffloadDeviceEnum.cpu, offload_optimizer=OffloadDeviceEnum.cpu, zero_init=0
   17.30GB |   1.68GB | offload_param=none, offload_optimizer=OffloadDeviceEnum.cpu, zero_init=1
   17.30GB |   1.68GB | offload_param=none, offload_optimizer=OffloadDeviceEnum.cpu, zero_init=0
    0.36GB |  13.22GB | offload_param=none, offload
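
The ZeRO-2 rows above can be reproduced with simple per-parameter byte counts, which also shows why optimizer offload is needed on this machine. The sketch below is a single-node approximation. The multipliers and the 1.5 buffer factor are assumptions about how the estimator arrives at its figures, not a quote of DeepSpeed's source, and the parameter counts are assumed from the public checkpoints. The helper name `zero2_model_state_gib` is hypothetical.

```python
GIB = 2**30  # the estimator's "GB" column matches bytes / 2**30


def zero2_model_state_gib(total_params, total_gpus=1, cpu_offload=True, buffer_factor=1.5):
    """Rough ZeRO-2 model-state estimate as (cpu_gib, gpu_gib); multipliers are assumed."""
    if cpu_offload:
        # GPU keeps only the fp16 params; fp32 weights and Adam states sit in CPU RAM.
        gpu = 2 * total_params
        cpu = total_params * max(4 * total_gpus, 16) * buffer_factor
    else:
        # 2 bytes of fp16 params, plus 18 bytes of gradients, fp32 params
        # and Adam states that are sharded across the GPUs.
        gpu = 2 * total_params + 18 * total_params // total_gpus
        cpu = total_params * 4 * total_gpus * buffer_factor
    return cpu / GIB, gpu / GIB


PARAMS = {"gpt2": 124_439_808, "gpt2-large": 774_030_080}  # assumed parameter counts
GPU_GIB = 8188 / 1024  # total GPU memory reported by nvidia-smi above

for name, n_params in PARAMS.items():
    for offload in (True, False):
        cpu, gpu = zero2_model_state_gib(n_params, cpu_offload=offload)
        verdict = "fits" if gpu < GPU_GIB else "does NOT fit"
        print(f"{name:10s} offload={offload!s:5s} cpu={cpu:5.2f}GB gpu={gpu:5.2f}GB -> {verdict} on the GPU")
```

Under these assumptions `gpt2-large` without offload needs about 14.42GB for model states alone, more than the 8GB card has, while optimizer offload moves most of that into roughly 17.30GB of CPU RAM. Activations and temporary buffers come on top of these numbers.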

In [6]:
#  estimate_memory_requirements("gpt2-xl", num_gpus_per_node=(1, 2))

In [7]:
model_id = "gpt2-large"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

In [8]:
dataset = load_dataset("allenai/c4", "realnewslike")
train_subset = dataset["train"].select(range(10000))
val_subset = dataset["validation"].select(range(10000))

Resolving data files:   0%|          | 0/1024 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/512 [00:00<?, ?it/s]

Loading dataset shards:   0%|          | 0/76 [00:00<?, ?it/s]

In [9]:
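def tokenize_data(examples):
    # Pad or truncate every example to the tokenizer's maximum length
    # so that all sequences have the same shape.
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
    )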

In [10]:
train_tokenized = train_subset.map(tokenize_data, batched=True)
val_tokenized = val_subset.map(tokenize_data, batched=True)

In [11]:
len(train_tokenized[0]['input_ids'])

1024

In [12]:
model.config.max_position_embeddings

1024
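
Both values are 1024 because `padding="max_length"` together with `truncation=True` forces every example to the tokenizer's maximum length, which for GPT-2 equals the model's context window. The following is a plain-Python illustration of that behaviour, not the tokenizer's implementation. The helper name `pad_or_truncate` is hypothetical, and it assumes right-side padding with the EOS id as pad id, as set up earlier in this notebook.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    ids = list(token_ids[:max_length])   # mirrors truncation=True
    mask = [1] * len(ids)                # 1 marks real tokens
    n_pad = max_length - len(ids)        # mirrors padding="max_length"
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


PAD_ID = 50256  # GPT-2's EOS id, reused as the pad token in this notebook

short_ids, short_mask = pad_or_truncate([10, 11, 12], max_length=1024, pad_id=PAD_ID)
long_ids, long_mask = pad_or_truncate(range(3000), max_length=1024, pad_id=PAD_ID)

print(len(short_ids), sum(short_mask))  # 1024 3
print(len(long_ids), sum(long_mask))    # 1024 1024
```

For short documents most of the 1024 positions are padding, so padding per batch instead may be cheaper. That is a trade-off to measure, not something this notebook tests.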

In [13]:
%%bash
mkdir -p out/deepspeed
cat <<'EOT' > out/deepspeed/ds_config_zero2.json
{
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true,
        "round_robin_gradients": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
   },
   "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "total_num_steps": "auto",
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "communication_data_type": "fp32",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto"
}
EOT
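
The `"auto"` entries in the config above are placeholders that the Hugging Face Trainer integration fills in from `TrainingArguments`, which keeps the two sources of settings from drifting apart. A small sketch for listing them is shown below. `find_auto_keys` is a hypothetical helper, and the dictionary is a trimmed copy of the config, not the full file.

```python
def find_auto_keys(node, prefix=""):
    """Return dotted paths of every entry whose value is the string "auto"."""
    paths = []
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            paths.extend(find_auto_keys(value, path))
        elif value == "auto":
            paths.append(path)
    return paths


# Trimmed copy of the config above, just enough to show the idea.
ds_config = {
    "zero_optimization": {"stage": 2, "offload_optimizer": {"device": "cpu", "pin_memory": True}},
    "optimizer": {"type": "AdamW", "params": {"lr": "auto", "weight_decay": "auto"}},
    "fp16": {"enabled": True, "loss_scale": 0},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

for path in find_auto_keys(ds_config):
    print(path)
```

Any entry that is set explicitly instead of `"auto"` has to agree with the corresponding `TrainingArguments` value.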

In [None]:
# Emulate a single-process distributed launch so DeepSpeed can initialize inside the notebook.
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "9994"  # modify if RuntimeError: Address already in use
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"


training_args = TrainingArguments(
    output_dir="out/deepspeed/model",
    eval_strategy="epoch",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    learning_rate=1e-3,
    num_train_epochs=3,
    save_steps=3000,
    fp16=True,
    fp16_backend="amp",
    logging_dir="out/deepspeed/model/logs",
    logging_strategy="steps",
    logging_steps=10,  # Log every 10 steps
    deepspeed="out/deepspeed/ds_config_zero2.json",
    gradient_accumulation_steps=4,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=val_tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()

[2024-11-14 14:51:30,577] [INFO] [comm.py:652:init_distributed] cdb=None
[2024-11-14 14:51:30,577] [INFO] [comm.py:683:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl




Installed CUDA version 12.6 does not match the version torch was compiled with 12.1 but since the APIs are compatible, accepting this combination


Using /home/olokshyn/.cache/torch_extensions/py311_cu121 as PyTorch extensions root...
Emitting ninja build file /home/olokshyn/.cache/torch_extensions/py311_cu121/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)


ninja: no work to do.
Time to load cpu_adam op: 2.2847511768341064 seconds


Loading extension module cpu_adam...




  0%|          | 0/7500 [00:00<?, ?it/s]

{'loss': 5.1326, 'grad_norm': 3.1552038192749023, 'learning_rate': 0.0009990664177113896, 'epoch': 0.0}
{'loss': 5.4104, 'grad_norm': 1.8196520805358887, 'learning_rate': 0.0009977327287276608, 'epoch': 0.01}
{'loss': 5.1246, 'grad_norm': 1.1459987163543701, 'learning_rate': 0.0009965324086423047, 'epoch': 0.01}
{'loss': 5.1964, 'grad_norm': 2.217322826385498, 'learning_rate': 0.0009951987196585757, 'epoch': 0.02}
{'loss': 5.1244, 'grad_norm': 1.3525241613388062, 'learning_rate': 0.0009938650306748466, 'epoch': 0.02}
{'loss': 5.1413, 'grad_norm': 1.3980519771575928, 'learning_rate': 0.0009925313416911177, 'epoch': 0.02}
{'loss': 5.1371, 'grad_norm': 1.1017670631408691, 'learning_rate': 0.0009911976527073887, 'epoch': 0.03}
{'loss': 5.1987, 'grad_norm': 1.5952155590057373, 'learning_rate': 0.0009898639637236596, 'epoch': 0.03}
{'loss': 5.2399, 'grad_norm': 2.1169748306274414, 'learning_rate': 0.0009885302747399307, 'epoch': 0.04}
{'loss': 5.3187, 'grad_norm': 1.5867396593093872, 'learni
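
The `0/7500` in the progress bar follows from the batch settings: 10000 training examples, a per-device batch of 1, 4 gradient accumulation steps, one process and 3 epochs. The arithmetic is sketched below. `total_optimizer_steps` is a hypothetical helper that approximates how the Trainer counts steps, and the exact handling of a partial final accumulation window may differ between library versions.

```python
import math


def total_optimizer_steps(n_examples, per_device_batch, grad_accum, world_size, epochs):
    """Approximate number of optimizer updates over the whole run."""
    micro_batches_per_epoch = math.ceil(n_examples / (per_device_batch * world_size))
    # One optimizer update happens every `grad_accum` micro-batches.
    steps_per_epoch = max(micro_batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs


steps = total_optimizer_steps(n_examples=10_000, per_device_batch=1,
                              grad_accum=4, world_size=1, epochs=3)
print(steps)                 # 7500
print(10 / (steps / 3))      # epoch fraction at the first log line: 0.004
```

With `logging_steps=10`, one log line covers 0.004 of an epoch, which is why the first entry shows `'epoch': 0.0` and the second `0.01`.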