deepspeed #288

Merged: merged 67 commits from deepspeed into main on Oct 24, 2023

Conversation

@haqishen (Collaborator) commented Jul 18, 2023

Feature

  • DeepSpeed ZeRO-3 training
  • DeepSpeed ZeRO-3 w/ LoRA training
  • DeepSpeed ZeRO-3 w/ offload optimizer training
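
For reference, the three modes above map onto the DeepSpeed ZeRO config roughly as in the sketch below. This is an illustrative Python dict, not the exact config this PR builds; `ds_config` is a placeholder name, the `stage3_max_live_parameters` value mirrors the "live param 1e9 / 1e10" settings in the benchmark tables, and LoRA itself is applied on the model side (e.g. via PEFT) rather than through this config.

```python
# Hedged sketch of a ZeRO-3 config covering the modes in this PR (illustrative,
# not the PR's actual code). All keys are standard DeepSpeed config options.
ds_config = {
    "bf16": {"enabled": True},  # or "fp16": {"enabled": True} for the float16 runs
    "zero_optimization": {
        "stage": 3,
        # Corresponds to the "live param 1e9" column in the benchmark tables.
        "stage3_max_live_parameters": 1e9,
        # Uncomment for the "offload optimizer" experiments (less GPU RAM, slower):
        # "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
}
```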

Experiment Result

Using 3 x RTX6000 (24GB), Batchsize = 1

Full params experiments

| -- | backbone | dtype | LORA | deepspeed | offload optimizer (live param 1e9) | RAM usage per GPU | runtime (data sample 0.1) | Perplexity |
|----|----------|-------|------|-----------|------------------------------------|-------------------|---------------------------|------------|
| exp1 | EleutherAI/pythia-1b-deduped | bfloat16 | False | False | False | 15GB | 00:03:28 | 11.9258 |
| exp2 | EleutherAI/pythia-1b-deduped (lr 0.00001) | float16 | False | True | False | 11.5GB | 00:04:31 | 10.1721 |
| exp3 | EleutherAI/pythia-1b-deduped (lr 0.00001) | float16 | False | True | True | 5GB | 00:29:07 | 10.1965 |
| exp4 | EleutherAI/pythia-2.8b-deduped | bfloat16 | False | False | False | OOM | N/A | N/A |
| exp5 | EleutherAI/pythia-2.8b-deduped | float16 | False | True | False | OOM | N/A | N/A |
| exp6 | EleutherAI/pythia-2.8b-deduped | float16 | False | True | True | 10.5GB | 01:18:44 | 8.7803 |
| exp7 | EleutherAI/pythia-6.9b-deduped | float16 | False | True | True (live param 1e10) | 23GB | OOM (cpu) | OOM (cpu) |

LORA experiments

| -- | backbone | dtype | LORA | deepspeed | RAM usage per GPU | runtime (data sample 0.1) | Perplexity |
|----|----------|-------|------|-----------|-------------------|---------------------------|------------|
| exp1 | EleutherAI/pythia-2.8b-deduped | float16 | True | False | 11.5GB | 00:00:57 | 9.7002 |
| exp2 | EleutherAI/pythia-2.8b-deduped | float16 | True | True | 4.5GB | 00:08:19 | 9.8184 |
| exp3 | EleutherAI/pythia-6.9b-deduped | float16 | True | False | 16.5GB | 00:01:40 | 9.4829 |
| exp4 | EleutherAI/pythia-6.9b-deduped | float16 | True | True | 8.5GB | 00:23:35 | 9.6707 |
| exp5 | EleutherAI/pythia-12b-deduped | float16 | True | False | OOM | N/A | N/A |
| exp6 | EleutherAI/pythia-12b-deduped | float16 | True | True | 12.5GB | 00:46:32 | 9.1973 |
| exp7 | EleutherAI/pythia-12b-deduped | int8 | True | False | 17GB | 00:06:58 | 10.1232 |
| exp8 | EleutherAI/pythia-20b-deduped | float16 | True | True | 18.5GB | 00:56:51 | 8.4031 |

Using 8 x V100 w/ NVLink (16GB), Batchsize = 1

| -- | backbone | dtype | LORA | deepspeed | RAM usage per GPU | runtime (data sample 0.1) | Perplexity |
|----|----------|-------|------|-----------|-------------------|---------------------------|------------|
| exp1 | EleutherAI/pythia-20b-deduped | int4 | True | False | 15.5GB | 00:02:29 | 9.3201 |
| exp2 | EleutherAI/pythia-20b-deduped | float16 | True | True | 10.5GB | 00:04:19 | 8.7182 |

Using 8 x A6000 w/o NVLink (48GB), Batchsize = 1

| -- | backbone | dtype | LORA | deepspeed | RAM usage per GPU | runtime (data sample 0.1) | Perplexity |
|----|----------|-------|------|-----------|-------------------|---------------------------|------------|
| exp1 | tiiuae/falcon-40b | int4 | True | False | 45GB | 00:12:41 | 5.7722 |
| exp2 | tiiuae/falcon-40b | float16 | True | True | 22GB | 02:30:52 | 5.7743 |
| exp3 | TheBloke/Llama-2-70B-Chat-fp16 | int4 | True | False | 46GB | 00:20:56 | 4.4524 |
| exp4 | TheBloke/Llama-2-70B-Chat-fp16 | float16 | True | True | 28GB | 04:30:58 | 4.4221 |

NVLINK works:

| -- | backbone | dtype | LORA | deepspeed | RAM usage per GPU | runtime (data sample 0.1) |
|----|----------|-------|------|-----------|-------------------|---------------------------|
| w/ NVLINK | EleutherAI/pythia-20b-deduped | float16 | True | True | 10.5GB | 00:04:19 |
| w/o NVLINK | EleutherAI/pythia-20b-deduped | float16 | True | True | 10.5GB | 00:45:37 |

Using 8 x A100 SXM4

| -- | backbone | dtype | LORA | deepspeed | RAM usage per GPU | runtime |
|----|----------|-------|------|-----------|-------------------|---------|
| exp1 | TheBloke/Llama-2-70B-Chat-fp16 (4k) | int4 | True | False | ~80GB | 35h |
| exp2 | TheBloke/Llama-2-70B-Chat-fp16 (4k) | float16 | True | True | ~80GB | 6.5h |
| exp3 | h2oai/h2ogpt-4096-llama2-13b-chat | float16 | True | True | 11GB | 16min |
| exp4 | h2oai/h2ogpt-4096-llama2-13b-chat | float16 | True | False | 38GB | 13min |

Check

  • Chat
  • Upload model weight

Future Work

  • ZeRO++
  • ZeRO-3 w/ LoRA and offload optimizer
  • ZeRO-3 w/ offload params

@pascal-pfeiffer (Collaborator) left a comment

Thanks a lot for working on this tricky topic, @haqishen

I think we can remove offloading to CPU for the time being, as the speed apparently drops way too much.

I already added a few comments; I will still need to do some proper testing on a multi-GPU setup, and also with continued training, upload of weights, etc. I assume you have checked these already?

Review comments were left on train.py, llm_studio/src/utils/modeling_utils.py, and Pipfile (all resolved).

@psinger (Collaborator) commented Sep 11, 2023

One more find, could this be useful for saving?
https://deepspeed.readthedocs.io/en/stable/zero3.html#deepspeed.runtime.zero.config.DeepSpeedZeroConfig.gather_16bit_weights_on_model_save
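
For context, that option lives under `zero_optimization` in the DeepSpeed config. A minimal, hypothetical sketch (not this PR's actual code; `ds_config` is a placeholder name):

```python
# Hypothetical sketch: with ZeRO-3 the 16-bit weights are sharded across ranks,
# and this flag makes DeepSpeed gather them into a full state dict on save.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}
# With this set, the DeepSpeed engine can write one consolidated (un-sharded)
# checkpoint, e.g.:
#   engine.save_16bit_model("output_dir", "pytorch_model.bin")
```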

@haqishen (Collaborator, Author) commented:

> Do I read the table right, that after NVLINK / SXM4 the runtime goes down a lot with deepspeed? Does it match ddp runtime then?

I just updated the table, adding some experiment results to compare DDP and DeepSpeed runtimes. It shows that DeepSpeed is still slower than DDP fp16 by around 15~20%, but much faster than DDP int8 or int4.

> Can we make these sliders?

How do we do this? I searched for the keyword data_sample but cannot find why it is rendered as a slider bar in the web UI.

> Let's fully remove FSDP in favor of Deepspeed

Let's do it in a new PR.

@psinger (Collaborator) commented Oct 9, 2023

@haqishen I believe we still have not solved the desync issue when a long-running generate is followed by checkpoint saving. I believe we should change the order there as discussed earlier: save the checkpoint before running eval, if best-epoch saving is not enabled.
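
A rough sketch of the ordering being proposed here. The function and flag names are illustrative, not H2O LLM Studio's actual API:

```python
# Illustrative only: save before the long-running evaluation so all ranks reach
# the collective checkpoint-saving code together and cannot desync during generate().
def end_of_epoch(cfg, model, save_checkpoint, run_eval):
    if not cfg.save_best_checkpoint:  # hypothetical flag name
        save_checkpoint(model)   # every rank saves/synchronizes here first
        run_eval(model)          # a long generate() afterwards no longer desyncs saving
    else:
        # Best-epoch saving needs the metric first, so eval has to run before save.
        val_metric = run_eval(model)
        save_checkpoint(model, metric=val_metric)
```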

@haqishen (Collaborator, Author) commented:

> @haqishen I believe we still have not solved the desync issue when a long-running generate is followed by checkpoint saving. I believe we should change the order there as discussed earlier: save the checkpoint before running eval, if best-epoch saving is not enabled.

btw, what's your experiment setting?

@psinger (Collaborator) commented Oct 12, 2023

> > @haqishen I believe we still have not solved the desync issue when a long-running generate is followed by checkpoint saving. I believe we should change the order there as discussed earlier: save the checkpoint before running eval, if best-epoch saving is not enabled.
>
> btw, what's your experiment setting?

Default, with the GPT metric.

@psinger (Collaborator) left a comment

I fixed another issue.

We need to merge main and resolve conflicts, and then we can merge.

After that, please open follow-up issues for the things not yet tackled in this PR and for potential future improvements.

We also potentially need a section in the README / docs, and it might also be useful to share your benchmarks there.

Thanks!

@haqishen merged commit 67d3a3c into main on Oct 24, 2023 (5 checks passed).
@haqishen deleted the deepspeed branch on October 24, 2023 at 07:54.