support deepspeed #1101

Merged: 12 commits merged into kohya-ss:deep-speed on Feb 27, 2024
Conversation

@BootsofLagrangian (Contributor) commented Feb 3, 2024

Test Done!!

Introduction

This PR adds DeepSpeed support to sd-scripts via Accelerate, aiming to improve multi-GPU training with ZeRO stages. I've made these changes in my fork under the deepspeed branch, and I'm open to any feedback!

0. Environment

  • Linux (kernel 3.10.0-1160.15.2.el7.x86_64)
  • Anaconda environment
  • Windows is only partially supported by DeepSpeed (according to Microsoft), so it is NOT TESTED here!

1. Install DeepSpeed

First, activate your virtual environment and install DeepSpeed with the following command:

DS_BUILD_OPS=0 pip install deepspeed
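
If the installation succeeded, DeepSpeed's bundled environment report should run without errors; this is just an optional sanity check (it assumes a CUDA-enabled PyTorch is already installed in the same environment):

ds_report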

2. Configure Accelerate

You can easily set up your environment for DeepSpeed with accelerate config, which lets you control the basic DeepSpeed settings. You can also configure everything with command-line arguments. Here's how to set up the ZeRO-2 stage using Accelerate:

(deepspeed) accelerate config
In which compute environment are you running? **This machine**
Which type of machine are you using? **multi-GPU**
How many different machines will you use (use more than 1 for multi-node training)? [1]: **1**
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: **NO**
Do you wish to optimize your script with torch dynamo?[yes/NO]: **NO**
Do you want to use DeepSpeed? [yes/NO]: **yes**
Do you want to specify a json file to a DeepSpeed config? [yes/NO]: **NO**
What should be your DeepSpeed's ZeRO optimization stage? **2**
How many gradient accumulation steps you're passing in your script? [1]: **1**
Do you want to use gradient clipping? [yes/NO]: **NO**
How many GPU(s) should be used for distributed training? [1]: **4**
Do you wish to use FP16 or BF16 (mixed precision)? **bf16**
accelerate configuration saved at ~/.cache/huggingface/accelerate/default_config.yaml

Follow the prompts to select your environment settings, including multi-GPU, enabling DeepSpeed, and setting the ZeRO optimization stage to 2.

Your configuration will be saved in a YAML file, similar to the following example (path and values may vary):

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  zero_stage: 2
...

3. Use in Your Scripts

toml Configuration File

  • Add deepspeed=true and zero_stage=[zero_stage] to your toml config file; a minimal example is shown below. Refer to the ZeRO-stage and Accelerate DeepSpeed documentation for more details.
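
For example, a minimal excerpt of a training toml with DeepSpeed enabled might look like this (only the two DeepSpeed keys are new; the rest of your config stays unchanged):

deepspeed = true
zero_stage = 2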

Bash Argument

  • Simply add --deepspeed --zero_stage=[zero_stage] to your script's command-line arguments; [zero_stage] can be 1, 2, or 3. See the example command below.
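
As an illustration, a ZeRO-2 launch might look roughly like this (the paths and process count are hypothetical placeholders; adapt them to your setup):

accelerate launch --num_processes=2 --num_machines=1 --multi_gpu \
        ./sdxl_train.py --config_file=/path/to/config.toml --deepspeed --zero_stage=2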

CPU/NVMe offloading

  • offload_optimizer_device = "cpu|nvme"
  • offload_param_device = "cpu|nvme"
  • offload_optimizer_nvme_path = "/path/to/offloading"
  • offload_param_nvme_path = "/path/to/offloading"

Add these options to your toml file or to your bash/batch script's arguments; a sketch is shown below.
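
For instance, a minimal sketch for ZeRO-2 with the optimizer state offloaded to CPU (the values are illustrative, not a tested recipe):

deepspeed = true
zero_stage = 2
offload_optimizer_device = "cpu"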

full_fp16 training

  • DeepSpeed supports fp16_master_weights_and_gradients during training, but I think it is not recommended, and it only runs under a restricted configuration: it is activated only when the optimizer is CPUAdam and the ZeRO stage is 2. See the sketch below.
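
If you do want to try it, the combination would look roughly like the following sketch (a hedged example based on the options above; full_fp16 is the usual sd-scripts precision flag, and offloading the optimizer to CPU is what typically makes DeepSpeed use its CPU Adam optimizer):

deepspeed = true
zero_stage = 2
offload_optimizer_device = "cpu"
full_fp16 = true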

Note

This PR aims to improve training efficiency in multi-GPU setups. It has been tested only on Linux and specifically for multi-GPU configurations. DeepSpeed support in Accelerate is still experimental, so please keep this in mind, and feel free to provide feedback or comments on this PR.

Test Done!!

@FurkanGozukara

Hello.

So we add --deepspeed --zero_stage=[your_stage], but what is your_stage here? Thank you.

What is zero_stage?

@BootsofLagrangian (Contributor, Author) commented Feb 5, 2024

> So we add --deepspeed --zero_stage=[your_stage], but what is your_stage here?
> What is zero_stage?

[your_stage] is one of the ZeRO stages: 0, 1, 2, or 3. Details are in the ZeRO documentation. In short:

  1. ZeRO stage 1: the optimizer states
  2. ZeRO stage 2: the optimizer states + the gradient states
  3. ZeRO stage 3: the optimizer states + the gradient states + the parameters

These states are sharded across multiple GPUs.

@FurkanGozukara

> [quotes the ZeRO stage explanation above]

For 2 GPUs, which one do you suggest? Like --deepspeed --zero_stage=1?

What about 3 GPUs?

Currently it clones the entire training onto each GPU, as far as I know.

So can you give some suggested guidelines?

@BootsofLagrangian (Contributor, Author) commented Feb 5, 2024

> For 2 GPUs, which one do you suggest? Like --deepspeed --zero_stage=1? What about 3 GPUs? […] So can you give some suggested guidelines?

I think ZeRO stage 2 is the optimal stage if you have enough VRAM, i.e. just --deepspeed --zero_stage=2.

Without offloading, the total amount of VRAM is still the major factor in deciding your training parameters.

You have to choose between training speed and saving VRAM; it is a trade-off.

The number of GPUs matters, but deciding on the ZeRO stage is a strategy question: choose the method you like better.

And if you want to use CPU/NVMe offload, you might hit some kind of error in this commit. I will fix it soon.

But you can use ZeRO stage 2 without offloading in this commit.

@FurkanGozukara

When I tested fp16 and bf16 on SD 1.5, I had horrible results. Are you able to get any decent results?

@mchdks commented Feb 10, 2024

any update here? @kohya-ss

@BootsofLagrangian (Contributor, Author) commented Feb 19, 2024

Here is a report on DeepSpeed in sd-scripts.

Environment

Model

  • Stable Diffusion XL 1.0

GPUs

  • 24GB VRAM. No NVLink
    1. 2 x RTX 3090, 4 x RTX 3090, 6 x RTX 3090, and 8 x RTX 3090
    2. 2 x RTX 4090, 4 x RTX 4090, and 6 x RTX 4090
  • 48GB VRAM. No NVLink
    1. 2 x A6000, 4 x A6000, 6 x A6000, and 8 x A6000

The lack of NVLink creates a bottleneck in GPU-to-GPU communication.

Requirements

  1. torch 2.2.0 (cu121)
  2. bitsandbytes 0.0.42 (AdamW8bit)
  3. accelerate 0.25.0
  4. deepspeed 0.13.1
  5. lycoris 2.0.2 (LoCon)

Training Settings

  • Methodology
    1. Full fine-tuning (FT): sdxl_train.py
    2. LoCon with rank=16 (PEFT): sdxl_train_network.py
  • Training Precision
    1. full_fp16, only active for DDP
    2. full_bf16
    3. bf16
  • Distributed Method
    1. Distributed Data Parallel (DDP): the sd-scripts and Accelerate default
    2. ZeRO: stages 1, 2, and 3
    • In both full_bf16 and bf16 training, I also tested ZeRO stage 2 with optimizer CPU offloading.
    • If the DeepSpeed engine hits OOM, it tries to move variables from VRAM to RAM; i.e., if there is no OOM, it behaves the same as plain ZeRO-2.
  • Resolution
    1. 1024x1024 with aspect ratio bucketing (ARB)
  • Batch and Optimizer
    1. batch size = 4
    2. gradient accumulation steps = 16
    3. AdamW8bit

With 24GB of VRAM, sd-scripts can barely run PEFT, and cannot run FT with DDP.

Experiment

First, I'm sorry for some missing elements in the tables; my budget is limited.

Lower is better.

full_fp16, FT

Average VRAM usage(MB)

| GPU Name | # of GPUs | DDP | ZeRO-1 | ZeRO-2 | ZeRO-3 |
|:---:|:---:|:---:|:---:|:---:|:---:|
| RTX 3090 | 2 | OOM | OOM | OOM | OOM |
| | 4 | OOM | OOM | OOM | 21539 |
| | 6 | OOM | OOM | OOM | 21569 |
| | 8 | OOM | OOM | 23400 | 21013 |
| RTX 4090 | 2 | OOM | OOM | OOM | OOM |
| | 4 | OOM | OOM | OOM | 21154 |
| | 6 | OOM | OOM | OOM | 21571 |
| A6000 | 2 | 42672 | 46391 | 45084 | 45277 |
| | 4 | 43007 | 40060 | 36713 | 40912 |
| | 6 | 41655 | 39553 | 36725 | 41537 |
| | 8 | 42609 | 33300 | 29475 | 30474 |

Training Speed(s/it)

| GPU Name | # of GPUs | DDP | ZeRO-1 | ZeRO-2 | ZeRO-3 |
|:---:|:---:|:---:|:---:|:---:|:---:|
| RTX 3090 | 2 | - | - | - | - |
| | 4 | - | - | - | 191.73 |
| | 6 | - | - | - | 477.72 |
| | 8 | - | - | 143.58 | 190.09 |
| RTX 4090 | 2 | - | - | - | - |
| | 4 | - | - | - | 50.02 |
| | 6 | - | - | - | 104.65 |
| A6000 | 2 | 42.03 | 37.35 | 43.61 | 45.80 |
| | 4 | 46.18 | 37.90 | 45.64 | 49.48 |
| | 6 | 47.98 | 38.42 | 48.32 | 104.51 |
| | 8 | 40.97 | 35.51 | 45.59 | 46.63 |

full_fp16, PEFT

Average VRAM usage(MB)

| GPU Name | # of GPUs | DDP | ZeRO-1 | ZeRO-2 | ZeRO-3 |
|:---:|:---:|:---:|:---:|:---:|:---:|
| RTX 3090 | 2 | 23282 | 22097 | 22026 | 21045 |
| | 4 | 23113 | 23206 | 23154 | 20968 |
| | 6 | 22764 | 23280 | 23034 | 21949 |
| | 8 | 23109 | 22582 | SIGABRT* | 21955 |
| RTX 4090 | 2 | 22450 | 22163 | 22175 | 20791 |
| | 4 | 22720 | 23059 | 23528 | 22209 |
| | 6 | 21767 | OOM* | 23167 | 22142 |
| A6000 | 2 | 37697 | 44323 | 43847 | 26219 |
| | 4 | 35612 | 41004 | 38782 | 31543 |
| | 6 | 28099 | 34084 | 34044 | 31085 |
| | 8 | 30579 | 31521 | 31513 | 30701 |

Training Speed(s/it)

| GPU Name | # of GPUs | DDP | ZeRO-1 | ZeRO-2 | ZeRO-3 |
|:---:|:---:|:---:|:---:|:---:|:---:|
| RTX 3090 | 2 | 58.64 | 51.70 | 56.25 | 114.71 |
| | 4 | 59.27 | 55.92 | 58.37 | 167.77 |
| | 6 | 58.32 | 52.61 | 57.07 | 375.46 |
| | 8 | 53.08 | 48.88 | SIGABRT* | 165.85 |
| RTX 4090 | 2 | 30.33 | 29.23 | 29.76 | 66.43 |
| | 4 | 32.36 | 29.42 | 31.17 | 75.34 |
| | 6 | 32.91 | - | 31.91 | 112.83 |
| A6000 | 2 | 45.89 | 41.96 | 42.80 | 70.17 |
| | 4 | 48.72 | 42.47 | 43.39 | 79.54 |
| | 6 | 46.86 | 41.24 | 44.93 | 117.68 |
| | 8 | 45.11 | 39.91 | 39.58 | 84.05 |

*SIGABRT occurred; I don't know why.
*Ideally this OOM should not happen.

full_bf16, FT

Average VRAM usage(MB)

| GPU Name | # of GPUs | DDP | ZeRO-1 | ZeRO-2 | ZeRO-3 | ZeRO-2 + optimizer CPU offloading |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| RTX 3090 | 2 | OOM | OOM | OOM | OOM | 23553 |
| | 4 | OOM | OOM | OOM | 22206 | 23534 |
| | 6 | OOM | OOM | OOM | 22059 | 23435 |
| | 8 | OOM | OOM | OOM* | 21185 | 23388 |
| RTX 4090 | 2 | OOM | OOM | OOM | OOM | 23709 |
| | 4 | OOM | OOM | OOM | 21646 | 23689 |
| | 6 | OOM | OOM | OOM | 20867 | 23605 |
| A6000 | 2 | 42673 | 46368 | 45117 | 42914 | 42320 |
| | 4 | 42995 | 39963 | 36662 | 39542 | 33359 |
| | 6 | 41653 | 39357 | 36639 | 40631 | 31007 |
| | 8 | 41953 | 33268 | 29594 | 30988 | 32420 |

Training Speed(s/it)

| GPU Name | # of GPUs | DDP | ZeRO-1 | ZeRO-2 | ZeRO-3 | ZeRO-2 + optimizer CPU offloading |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| RTX 3090 | 2 | - | - | - | - | 112.18 |
| | 4 | - | - | - | 113.79 | 123.54 |
| | 6 | - | - | - | 481.29 | 183.95 |
| | 8 | - | - | OOM* | 192.69 | 167.70 |
| RTX 4090 | 2 | - | - | - | - | 45.24 |
| | 4 | - | - | - | 50.01 | 48.54 |
| | 6 | - | - | - | 105.13 | 48.30 |
| A6000 | 2 | 42.79 | 35.99 | 42.76 | 45.39 | 57.25 |
| | 4 | 42.61 | 38.42 | 47.54 | 46.59 | 60.92 |
| | 6 | 43.09 | 40.05 | 49.84 | 103.77 | 60.39 |
| | 8 | 47.67 | 36.09 | 48.59 | 50.04 | 60.16 |

*Ideally this OOM should not happen.

full_bf16, PEFT

Average VRAM usage(MB)

| GPU Name | # of GPUs | DDP | ZeRO-1 | ZeRO-2 | ZeRO-3 | ZeRO-2 + optimizer CPU offloading |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| RTX 3090 | 2 | 22413 | 22099 | 22016 | 20786 | 21830 |
| | 4 | 23118 | 23204 | 23139 | 20668 | 23010 |
| | 6 | 22736 | 23377 | 22971 | 21960 | 23056 |
| | 8 | 23192 | 22576 | 22575 | 21767 | 23056 |
| RTX 4090 | 2 | 22432 | 22146 | 22175 | 20896 | 22001 |
| | 4 | 22715 | 23077 | 23247 | 21329 | 23150 |
| | 6 | 21570 | OOM* | 23170 | 22618 | 23175 |
| A6000 | 2 | 37674 | 43901 | 43813 | 26004 | 43606 |
| | 4 | 35439 | 40972 | 38718 | 32233 | 38611 |
| | 6 | 28098 | 34106 | 34011 | 31077 | 33971 |
| | 8 | 30590 | 31429 | 32644 | 30472 | 33516 |

Training Speed(s/it)

| GPU Name | # of GPUs | DDP | ZeRO-1 | ZeRO-2 | ZeRO-3 | ZeRO-2 + optimizer CPU offloading |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| RTX 3090 | 2 | 57.72 | 52.67 | 54.85 | 108.90 | 55.82 |
| | 4 | 57.01 | 52.24 | 55.58 | 123.89 | 56.76 |
| | 6 | 57.33 | 52.95 | 57.02 | 409.32 | 57.65 |
| | 8 | 53.62 | 49.53 | 52.23 | 167.19 | 52.57 |
| RTX 4090 | 2 | 34.38 | 28.74 | 29.97 | 69.83 | 34.20 |
| | 4 | 31.66 | 30.12 | 30.46 | 75.44 | 30.93 |
| | 6 | 32.48 | - | 32.68 | 110.62 | 34.28 |
| A6000 | 2 | 46.19 | 40.35 | 41.49 | 67.89 | 43.10 |
| | 4 | 45.88 | 41.03 | 42.82 | 76.99 | 43.41 |
| | 6 | 46.30 | 42.03 | 43.75 | 117.78 | 43.91 |
| | 8 | 45.93 | 37.86 | 54.25 | 87.34 | 42.61 |

*Ideally this OOM should not happen.

bf16, FT

Average VRAM usage(MB)

| GPU Name | # of GPUs | DDP | ZeRO-1 | ZeRO-2 | ZeRO-3 | ZeRO-2 + optimizer CPU offloading |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| RTX 3090 | 2 | - | - | - | - | 23549 |
| | 4 | - | - | - | 22041 | 23509 |
| | 6 | - | - | - | 21477 | 23454 |
| | 8 | - | - | - | 20672 | 23398 |
| RTX 4090 | 2 | - | - | - | - | 23712 |
| | 4 | - | - | - | 21620 | 23627 |
| | 6 | - | - | - | 21234 | 23660 |
| A6000 | 2 | - | 46397 | 45082 | 42974 | 42320 |
| | 4 | - | 40096 | 36711 | 39840 | 33405 |
| | 6 | - | 39357 | 36680 | 40632 | 31034 |
| | 8 | - | 33289 | 29604 | 30599 | 32457 |

Training Speed(s/it)

| GPU Name | # of GPUs | DDP | ZeRO-1 | ZeRO-2 | ZeRO-3 | ZeRO-2 + optimizer CPU offloading |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| RTX 3090 | 2 | - | - | - | - | 172.08 |
| | 4 | - | - | - | 216.22 | 195.97 |
| | 6 | - | - | - | 506.07 | 189.02 |
| | 8 | - | - | - | 189.56 | 165.13 |
| RTX 4090 | 2 | - | - | - | - | 44.25 |
| | 4 | - | - | - | 53.28 | 52.91 |
| | 6 | - | - | - | 110.34 | 56.24 |
| A6000 | 2 | - | 36.68 | 46.37 | 45.86 | 61.53 |
| | 4 | - | 41.75 | 46.25 | 48.04 | 61.95 |
| | 6 | - | 41.84 | 49.43 | 104.32 | 65.88 |
| | 8 | - | 35.50 | 45.23 | 47.68 | 56.39 |

bf16, PEFT

Average VRAM usage(MB)

| GPU Name | # of GPUs | DDP | ZeRO-1 | ZeRO-2 | ZeRO-3 | ZeRO-2 + optimizer CPU offloading |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| RTX 3090 | 2 | 22684 | 22090 | 21817 | 21773 | 21816 |
| | 4 | 22086 | 23207 | 23064 | 21082 | 22988 |
| | 6 | 21531 | 23396 | 23130 | 22124 | 23131 |
| | 8 | 22141 | 22568 | 22743 | 22038 | 22763 |
| RTX 4090 | 2 | 21948 | 22156 | 21993 | 21705 | 21987 |
| | 4 | 22224 | 23055 | 23199 | 20992 | 23181 |
| | 6 | 21944 | MISSING* | 23695 | 22003 | 23138 |
| A6000 | 2 | 38117 | 43926 | 43641 | 26064 | 43641 |
| | 4 | 36059 | 40986 | 38665 | 32496 | 38642 |
| | 6 | 28546 | 34141 | 33961 | 31079 | 33981 |
| | 8 | 31020 | 31440 | 33448 | 31245 | 33499 |

Training Speed(s/it)

| GPU Name | # of GPUs | DDP | ZeRO-1 | ZeRO-2 | ZeRO-3 | ZeRO-2 + optimizer CPU offloading |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| RTX 3090 | 2 | 57.31 | 52.40 | 56.29 | 156.13 | 56.48 |
| | 4 | 58.78 | 54.07 | 57.65 | 179.51 | 56.78 |
| | 6 | 59.57 | 55.31 | 60.12 | 426.70 | 61.33 |
| | 8 | 53.92 | 49.99 | 55.16 | 165.99 | 54.26 |
| RTX 4090 | 2 | 30.16 | 27.90 | 29.36 | 66.05 | 29.41 |
| | 4 | 31.36 | 29.43 | 31.11 | 77.10 | 32.79 |
| | 6 | 31.79 | MISSING* | 41.31 | 117.37 | 31.95 |
| A6000 | 2 | 48.37 | 42.36 | 44.73 | 71.43 | 43.77 |
| | 4 | 54.13 | 44.26 | 44.34 | 78.75 | 45.91 |
| | 6 | 49.05 | 43.56 | 44.79 | 121.62 | 45.53 |
| | 8 | 46.27 | 37.56 | 42.29 | 85.80 | 44.37 |

*This measurement was lost.

Results

  1. Wrapping the models (U-Net, text encoders, and network) just works.
  2. ZeRO stage 1 is the most capable strategy, IMO.
  3. If you use a GPU count that is a multiple of an odd number (e.g., 6) with ZeRO stage 3, the script becomes very slow.
  4. 2 x 24GB VRAM GPUs can run FT on ZeRO stage 2 with CPU offloading.
  5. Ada Lovelace is super fast.

@FurkanGozukara

The only way to utilize multiple consumer GPUs, I think, is cloning the training, if you don't have pro GPUs.

@tinbtb commented Feb 19, 2024

> full_fp16, only active for DDP

Why not use bf16 for these cards?

> Training Speed (s/it). Lower is better.
> 29.23

Could you please compare it with training on a single card? As far as I remember, I get roughly the same speeds with just one card.

@BootsofLagrangian (Contributor, Author)

> Why not use bf16 for these cards? […] Could you please compare it with training on a single card?

The full_bf16 and bf16 runs are in progress.

For a single card, the training speed is almost the same as DDP; DDP is slightly slower.

@mchdks commented Feb 19, 2024

> [quotes the single-card discussion above]

The results look very promising. I can help by providing you with different multi-GPU machines if it would help with your tests.

@FurkanGozukara

What is your effective batch size?

Is this cloned on each GPU?

Like, does 2 GPUs mean 4 * 16 * 2?

@BootsofLagrangian (Contributor, Author)

> The results look very promising. I can help by providing you with different multi-GPU machines if it would help with your tests.

Thanks for your suggestion. But I think that nowadays diffusion models are not so big that they need to be run on multiple machines or multiple nodes.

@BootsofLagrangian (Contributor, Author)

> What is your effective batch size?
> Is this cloned on each GPU?
> Like, does 2 GPUs mean 4 * 16 * 2?

The effective batch size in sd-scripts is calculated as:

[effective batch] = [number of machines] x [number of GPUs] x [train_batch_size] x [gradient_accumulation_steps]

For example, in my 2-GPU setting:

effective batch = 1 machine x 2 GPUs x 4 train batch size x 16 gradient accumulation steps = 1 x 2 x 4 x 16 = 128

@mchdks commented Feb 22, 2024

> Thanks for your suggestion. But I think that nowadays diffusion models are not so big that they need to be run on multiple machines or multiple nodes.

Actually, I was talking about multi-GPU, not multi-machine. If you want to test on 8x A100 or A10G, please send a message.

@mchdks commented Feb 23, 2024

> [quotes the full DeepSpeed benchmark report above]

I tried it with a clean installation on a new machine to do a test, but I received warnings and errors that were too long to include here; unfortunately the result was unsuccessful. Could there be a requirement you missed?

@BootsofLagrangian (Contributor, Author) commented Feb 23, 2024

@mchdks

Here is a simple but complete installation guide.

Installation

I recommend using a dedicated deepspeed Anaconda environment. First, you need to clone my deepspeed branch.

  1. Install Anaconda.
    • I recommend python=3.10.
  2. Create the deepspeed conda env from the yml file.
    • conda create -n deepspeed --file=/path/to/YAML_FILE.yml
  3. Move into the sd-scripts deepspeed branch directory.
    • cd /path/to/deepspeed/branch
  4. Activate the Anaconda environment.
    • conda activate deepspeed
  5. Install PyTorch and the sd-scripts requirements.txt.
    • pip install torch==2.2.0 torchvision==0.17.0 --index-url https://download.pytorch.org/whl/cu121
    • pip install -r requirements.txt
    • The tested torch version is 2.2.0, but it should also run on other torch versions >= 2.0.1.
  6. Install bitsandbytes, xformers, and lycoris.
    • pip install bitsandbytes xformers lycoris_lora
  7. Install DeepSpeed.
    • DS_BUILD_OPS=0 pip install deepspeed==0.13.1
  8. Run accelerate config.
  9. Prepare a test image set and a script configuration file like the following:

CONFIG_FILE.toml
pretrained_model_name_or_path = "./training/base_model/sd_xl_base_1.0.safetensors"
xformers = true
deepspeed = true
zero_stage = 1
mixed_precision = "bf16"
save_precision = "bf16"
full_bf16 = true
output_name = "full_bf16_ff_zero_1"
output_dir = "./training/ds_test/model"
train_data_dir = "./training/test_img"
shuffle_caption = true
caption_extension = ".txt"
random_crop = true
resolution = "1024,1024"
enable_bucket = true
bucket_no_upscale = true
save_every_n_epochs = 1
train_batch_size = 4
max_token_length = 225
max_train_epochs = 1
max_data_loader_n_workers = 4
persistent_data_loader_workers = true
seed = 42
gradient_checkpointing = true
gradient_accumulation_steps = 16
logging_dir = "./training/ds_test/logs"
caption_separator = ". "
noise_offset = 0.0357
learning_rate = 1e-4
unet_lr = 1e-4
learning_rate_te1 = 5e-5
learning_rate_te2 = 5e-5
train_text_encoder = true
max_grad_norm = 1.0
optimizer_type = "AdamW8bit"
save_model_as = "safetensors"
optimizer_args = [ "weight_decay=1e-1", ]
lr_scheduler = "constant_with_warmup"
lr_warmup_steps = 340
no_half_vae = true

  10. Run the script like this:
accelerate launch --mixed_precision=bf16 \
        --num_processes=8 --num_machines=1 --multi_gpu \
        --main_process_ip=localhost --main_process_port=29555 \
        --num_cpu_threads_per_process=4 \
        ./sdxl_train.py --config_file=$CONFIG_FILE

CONFIG_FILE is the path to the toml file above.

  11. You will see very long warnings and logs, but just ignore them; they don't affect training.

@kohya-ss (Owner)

Thank you for this great PR! It looks very nice.

However, I don't have an environment to test DeepSpeed. I know that I can test it in cloud environments, but I'd rather develop other features than test DeepSpeed.

In addition, the update to the scripts is not small, so it will be a little hard to maintain.

Therefore, is it OK if I move the features into a single script which supports DeepSpeed as much as possible after merging? I will make a new branch for it, and I'd be happy if you would test and review the branch.

I think that if the script works well, it will not be necessary for me to maintain the script in the future, and someone would update the script if necessary.

@BootsofLagrangian (Contributor, Author)

> Therefore, is it OK if I move the features into a single script which supports DeepSpeed as much as possible after merging? […]

Sounds good. It is fine to move the DeepSpeed features into a dev branch and to postpone merging.

@kohya-ss changed the base branch from main to deep-speed on February 27, 2024
@kohya-ss merged commit 0e4a573 into kohya-ss:deep-speed on Feb 27, 2024 (1 check passed)
@kohya-ss (Owner)

@BootsofLagrangian
Hi! I've merged the PR into the new branch, and I'm refactoring the code a bit.

I have a question about DeepSpeedWrapper. In my understanding, the Accelerate library does not support multiple models with DeepSpeed, so we need to wrap the multiple models into a single model.

If this is correct, then when we pass the wrapper to accelerator.accumulate, the argument for accumulate should be the wrapper instead of the list of models, because accumulate takes the prepared model.

Therefore, training_models = [ds_model] might be OK. Is this correct?

@kohya-ss mentioned this pull request on Feb 27, 2024
@BootsofLagrangian (Contributor, Author)

> Therefore, training_models = [ds_model] might be OK. Is this correct?

Yep, that is correct. Accelerate does something magical: the accumulate method accepts Accelerate-compatible modules.

I tested training_models = [ds_model] and it works.

@kohya-ss (Owner)

Thank you for the clarification! I opened a new PR, #1139; I would appreciate your comments and suggestions.

@storuky commented Mar 13, 2024

I got very optimistic results!
I have only 3x RTX 4090 in my PC, and I was able to run the AdamW (not Adam8bit) optimizer in BF16 (without full_bf16) with batch size 16 and training of the 1st text encoder.
I used ZeRO stage 2 with CPU offloading and cached the latents to disk.
The speed is incredible: 10 s/it 😱
It took 123 GB of RAM.
Just to note: to run AdamW with batch size 16 you need ~75 GB of VRAM (A100 or H100), and it runs at 3.5 s/it (H100) and 5 s/it (A100), but the effective batch is 16, while for 3x4090 it's 48!

My dataset: 353 images
Repeats: 40
Epochs: 5

Training time
3x4090: 4 hours
1xH100: 4.2 hours
1xA100: 6.1 hours

Unbelievable, but 3x4090 is faster than 1x H100 PCIe.

@BootsofLagrangian, you are a wizard!

@FurkanGozukara

@storuky wow nice results

@BootsofLagrangian deleted the deepspeed branch on March 20, 2024