In [3]:
from transformers import AutoModel
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

# Specify the model you want to train on your device
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
# Estimate the memory cost (both CPU and GPU)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=4, num_nodes=1)

Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.


Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 4 GPUs per node.
SW: Model with 6173M total params, 534M largest layer params.
  per CPU  |  per GPU |   Options
  155.23GB |   1.99GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
  155.23GB |   1.99GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
  137.98GB |   4.87GB | offload_param=none, offload_optimizer=cpu , zero_init=1
  137.98GB |   4.87GB | offload_param=none, offload_optimizer=cpu , zero_init=0
   11.95GB |  27.86GB | offload_param=none, offload_optimizer=none, zero_init=1
  137.98GB |  27.86GB | offload_param=none, offload_optimizer=none, zero_init=0
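
The rows of the table can be approximately reproduced from the two parameter counts on the `SW:` line. The sketch below is an assumption about the arithmetic rather than a copy of DeepSpeed's code: it takes 18 bytes per parameter for weights, gradients and optimizer states (16 when the fp16 weights stay on the GPU), 4 bytes per parameter of the largest layer as a GPU working buffer, and a 1.5x CPU buffer factor. These constants were picked because they match the printed numbers; the helper name `zero3_estimate_gb` is hypothetical.

```python
GIB = 2 ** 30  # the table's "GB" figures line up with binary gigabytes


def zero3_estimate_gb(total_params, largest_layer_params, num_gpus_per_node=4,
                      num_nodes=1, offload_param=False, offload_optimizer=False,
                      zero_init=True, buffer_factor=1.5):
    """Return (cpu_gb, gpu_gb) for one row of the table. Hypothetical helper.

    Only the three combinations shown in the table are modelled; parameter
    offload without optimizer offload falls through to the no-offload case.
    """
    total_gpus = num_gpus_per_node * num_nodes
    node_share = 1 / num_nodes
    largest_layer_bytes = 4 * largest_layer_params

    if offload_optimizer and offload_param:
        # Everything lives on the CPU; the GPU only holds the largest layer.
        gpu = largest_layer_bytes
        per_param = 18 * node_share
        if not zero_init:
            per_param = max(4 * num_gpus_per_node, per_param)
        cpu = total_params * per_param * buffer_factor
    elif offload_optimizer:
        # fp16 weights are sharded across GPUs; the rest is on the CPU.
        gpu = largest_layer_bytes + 2 * total_params / total_gpus
        per_param = 16 * node_share
        if not zero_init:
            per_param = max(4 * num_gpus_per_node, per_param)
        cpu = total_params * per_param * buffer_factor
    else:
        # No offload: all model states are sharded across the GPUs.
        gpu = largest_layer_bytes + 18 * total_params / total_gpus
        cpu_params = largest_layer_params if zero_init else total_params
        cpu = cpu_params * 4 * num_gpus_per_node * buffer_factor
    return cpu / GIB, gpu / GIB


# Rounded counts from the "SW:" line: 6173M total, 534M in the largest layer.
cpu_gb, gpu_gb = zero3_estimate_gb(6173e6, 534e6)
print(f"{cpu_gb:.2f}GB | {gpu_gb:.2f}GB")
```

Because the inputs are the rounded counts from the log, the results can differ from the table in the second decimal place.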


# Elapsed time summary

In [None]:
1. Single GPU                                          --- elapsed: 0.8616263747215271 min
2. DataParallel                                        --- elapsed: 1.4458719611167907 min
3. Distributed training                                --- elapsed: 0.2986536264419556 min
4. Distributed training, multiprocess launch           --- elapsed: 0.3739401698112488 min
5. AMP mixed-precision training                        --- elapsed: 0.2881103197733561 min
6. DeepSpeed distributed training                      --- elapsed: 1.2790814956029257 min
7. Accelerate                                          --- elapsed: 0.3006397406260172 min
8. Distributed training with the transformers Trainer  --- elapsed: 43 s
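
The summary mixes minutes and seconds, so a small conversion makes the runs easier to compare. The sketch below only re-expresses the numbers listed above in seconds and as a speedup relative to the single-GPU run; the dictionary keys are labels chosen here, and the comparison assumes every run measured the same amount of work.

```python
# Elapsed times copied from the summary; all but the Trainer run are in minutes.
elapsed_minutes = {
    "single GPU": 0.8616263747215271,
    "DataParallel": 1.4458719611167907,
    "Distributed": 0.2986536264419556,
    "Distributed (multiprocess launch)": 0.3739401698112488,
    "AMP mixed precision": 0.2881103197733561,
    "DeepSpeed": 1.2790814956029257,
    "Accelerate": 0.3006397406260172,
}

elapsed_seconds = {name: minutes * 60 for name, minutes in elapsed_minutes.items()}
elapsed_seconds["transformers Trainer"] = 43.0  # reported directly in seconds

# Speedup > 1 means faster than the single-GPU baseline.
baseline = elapsed_seconds["single GPU"]
speedup = {name: baseline / seconds for name, seconds in elapsed_seconds.items()}

for name, seconds in sorted(elapsed_seconds.items(), key=lambda item: item[1]):
    print(f"{name:<36} {seconds:6.1f} s   {speedup[name]:.2f}x")
```

On these numbers the AMP run is the fastest, while DataParallel and DeepSpeed are slower than the single-GPU baseline.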

# Training methods

## Single-GPU training

In [2]:
!~/anaconda3/envs/python39_p13/bin/python single-gpu-cls.py

[2023-06-21 02:37:21,958] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Downloading (…)solve/main/vocab.txt: 100%|███| 110k/110k [00:00<00:00, 1.62MB/s]
Downloading (…)in/added_tokens.json: 100%|█████| 2.00/2.00 [00:00<00:00, 816B/s]
Downloading (…)cial_tokens_map.json: 100%|██████| 112/112 [00:00<00:00, 150kB/s]
Downloading (…)okenizer_config.json: 100%|███| 19.0/19.0 [00:00<00:00, 25.6kB/s]
Downloading (…)lve/main/config.json: 100%|██████| 647/647 [00:00<00:00, 902kB/s]
Downloading pytorch_model.bin: 100%|██████████| 412M/412M [00:01<00:00, 356MB/s]
Some weights of the model checkpoint at hfl/chinese-bert-wwm-ext were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls

## DataParallel distributed training

In [7]:
!~/anaconda3/envs/python39_p13/bin/python multi-gpu-dataparallel-cls.py

Some weights of the model checkpoint at hfl/chinese-bert-wwm-ext were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkp

## Distributed training

In [4]:
!~/anaconda3/envs/python39_p13/bin/python -m torch.distributed.launch --nnodes=1 --node_rank=0 --nproc_per_node=4 multi-gpu-distributed-cls.py --local_world_size=4

and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[24386] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '2', 'WORLD_SIZE': '4', 'LOCAL_RANK': '2'}
[24385] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '1', 'WORLD_SIZE': '4', 'LOCAL_RANK': '1'}
MASTER_ADDR 127.0.0.1
MASTER_PORT 29500
RANK 0
WORLD_SIZE 4
LOCAL_RANK 0
[24384] Initializing proces
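
The deprecation notice in the output asks scripts to read the local rank from `os.environ['LOCAL_RANK']` instead of a `--local-rank` argument. A minimal sketch of that pattern follows; `resolve_local_rank` is a hypothetical helper, not part of `multi-gpu-distributed-cls.py`, and the fallback to a command-line flag is an assumption added for launchers that still pass one.

```python
import argparse
import os


def resolve_local_rank(argv=None):
    """Prefer the LOCAL_RANK environment variable, fall back to a CLI flag."""
    parser = argparse.ArgumentParser()
    # Accept both spellings that launchers have used for this flag.
    parser.add_argument("--local-rank", "--local_rank", dest="local_rank",
                        type=int, default=0)
    # parse_known_args ignores script-specific flags such as --local_world_size.
    args, _unknown = parser.parse_known_args(argv)
    return int(os.environ.get("LOCAL_RANK", args.local_rank))


if __name__ == "__main__":
    print("local rank:", resolve_local_rank())
```

The process-group lines in the log already show `LOCAL_RANK` in the environment, so a script written this way should also work under `torchrun --nnodes=1 --nproc_per_node=4 ...`, though that launch was not run here.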

## Distributed training, launched with multiprocess

In [6]:
!~/anaconda3/envs/python39_p13/bin/python multi-gpu-distributed-mp-cls.py --local_world_size=4

[33473] rank = 0, world_size = 4, n = 1, device_ids = [0] 
[33474] rank = 1, world_size = 4, n = 1, device_ids = [1] 
[33475] rank = 2, world_size = 4, n = 1, device_ids = [2] 
[33476] rank = 3, world_size = 4, n = 1, device_ids = [3] 
Some weights of the model checkpoint at hfl/chinese-bert-wwm-ext were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint 
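
The `rank`, `n` and `device_ids` values at the top of this output are consistent with splitting the visible GPUs evenly between the local processes. The sketch below reproduces that arithmetic under that assumption; the script itself is not shown in this notebook, so `device_ids_for_rank` is a hypothetical stand-in for whatever it does.

```python
def device_ids_for_rank(local_rank, visible_gpus, local_world_size):
    """Split the visible GPUs evenly between the processes on one node.

    Returns (n, device_ids), where n is the number of GPUs per process.
    """
    if local_world_size <= 0 or visible_gpus < local_world_size:
        raise ValueError("need at least one GPU per process")
    n = visible_gpus // local_world_size
    return n, list(range(local_rank * n, (local_rank + 1) * n))


# Four GPUs and --local_world_size=4, as in the run above.
for rank in range(4):
    n, device_ids = device_ids_for_rank(rank, visible_gpus=4, local_world_size=4)
    print(f"rank = {rank}, world_size = 4, n = {n}, device_ids = {device_ids}")
```

With four GPUs and four processes each rank gets exactly one device, which matches the log. With eight GPUs and the same world size, each rank would get two.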

## AMP mixed-precision training

In [9]:
!~/anaconda3/envs/python39_p13/bin/python multi-gpu-distributed-mp-amp-cls.py --local_world_size=4

[44505] rank = 0, world_size = 4, n = 1, device_ids = [0] 
[44507] rank = 2, world_size = 4, n = 1, device_ids = [2] 
[44508] rank = 3, world_size = 4, n = 1, device_ids = [3] 
[44506] rank = 1, world_size = 4, n = 1, device_ids = [1] 
Some weights of the model checkpoint at hfl/chinese-bert-wwm-ext were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint 

## DeepSpeed distributed training

In [11]:
!~/anaconda3/envs/python39_p13/bin/deepspeed --master_port 11222 multi-gpu-deepspeed-cls.py

[2023-06-20 09:27:40,262] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-06-20 09:27:40,664] [INFO] [runner.py:555:main] cmd = /home/ec2-user/anaconda3/envs/python39_p13/bin/python3.9 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=11222 --enable_each_rank_log=None multi-gpu-deepspeed-cls.py
[2023-06-20 09:27:41,802] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-06-20 09:27:42,172] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2023-06-20 09:27:42,173] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2023-06-20 09:27:42,173] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2023-06-20 09:27:42,173] [INFO] [launch.py:163:main] dist_world_size=4
[2023-06-20 09:27:42,173] [INFO] [launch.py:165:main] Setting CUD
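
The `--world_info` value in the launcher command is base64-encoded JSON, and decoding it gives the same mapping that the `WORLD INFO DICT` line prints. A small sketch using only the standard library (`decode_world_info` is a hypothetical helper name):

```python
import base64
import json


def decode_world_info(encoded):
    """Decode a base64-encoded JSON mapping of hostname -> GPU indices."""
    return json.loads(base64.urlsafe_b64decode(encoded).decode("utf-8"))


# Value copied from the runner.py log line above.
world_info = decode_world_info("eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119")
print(world_info)  # {'localhost': [0, 1, 2, 3]}
```

This is handy when checking which hosts and GPU indices a launch actually targeted.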

## Accelerate distributed training

Launch command: `accelerate launch multi-gpu-accelerate-cls.py`

In [15]:
!~/anaconda3/envs/python39_p13/bin/accelerate launch multi-gpu-accelerate-cls.py

[2023-06-20 09:37:42,981] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `4`
		More than one GPU was found, enabling multi-GPU training.
		If this was unintended please pass in `--num_processes=1`.
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
[2023-06-20 09:37:44,894] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-06-20 09:37:44,895] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-06-20 09:37:44,906] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-06-20 09:37:44,916] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)


## Distributed training with the transformers Trainer

In [19]:
%%time
!~/anaconda3/envs/python39_p13/bin/python multi-gpu-transformers-cls.py

[2023-06-20 09:51:07,596] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Some weights of the model checkpoint at hfl/chinese-bert-wwm-ext were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertFo

# accelerator

In [3]:
from accelerate.utils import write_basic_config

write_basic_config()  # write a default Accelerate config file


ModuleNotFoundError: No module named 'accelerate'
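
The earlier cells call `accelerate` through the `python39_p13` environment's `bin` directory, so this error most likely means the notebook kernel runs a different interpreter that lacks the package. That is an inference from the paths in this notebook, not something the output confirms. A minimal check, with `is_installed` as a hypothetical helper:

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if the running interpreter can find a top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Show which interpreter the kernel is actually using.
print("kernel interpreter:", sys.executable)
print("accelerate importable:", is_installed("accelerate"))
```

If the second line prints `False`, either install `accelerate` into the interpreter shown on the first line or switch the notebook to a kernel backed by the environment where it is already installed.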