Improving memory utilization of Z2+MoE #2079

siddharth9820 · 2022-07-08T00:45:30Z

Summary

While running experiments, I found that the gradient upscaling step uses an inordinate amount of memory at high expert-to-GPU ratios (such as 1 or 1/2).

This PR does two things:

It creates an upper limit on the expert parameter group size (see deepspeed/moe/utils.py).
Instead of first upscaling all the gradients and then running the optimizer once, it runs a combined upscaling + optimizer step on each parameter group, one group at a time. Thus, at any given time, the fp32 gradients of only one parameter group exist in GPU memory (see deepspeed/runtime/zero/stage_1_and_2.py).
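
The two changes can be illustrated with a minimal sketch. This is not the code in this PR: the function names, the greedy packing strategy, and the use of plain element counts in place of tensors are all assumptions made for illustration. It caps the size of each parameter group and then compares the peak fp32 gradient footprint of "upscale everything, then step once" against "upscale and step one group at a time".

```python
from typing import List

FP32_BYTES = 4  # bytes per fp32 gradient element


def split_into_capped_groups(param_sizes: List[int],
                             max_group_size: int) -> List[List[int]]:
    """Greedily pack parameter sizes (element counts) into groups whose
    total stays within max_group_size (hypothetical helper).

    A single parameter larger than the cap still gets a group of its own.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_total = 0
    for size in param_sizes:
        # Close the current group if adding this parameter would exceed the cap.
        if current and current_total + size > max_group_size:
            groups.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        groups.append(current)
    return groups


def peak_fp32_grad_bytes_upscale_all(groups: List[List[int]]) -> int:
    """All groups are upscaled before the single optimizer step,
    so the fp32 gradients of every group coexist in memory."""
    return sum(sum(group) for group in groups) * FP32_BYTES


def peak_fp32_grad_bytes_per_group(groups: List[List[int]]) -> int:
    """Each group is upscaled, stepped, and freed before the next one,
    so the peak is set by the largest single group."""
    return max((sum(group) for group in groups), default=0) * FP32_BYTES


# Six expert parameters of 4 elements each, capped at 8 elements per group.
groups = split_into_capped_groups([4, 4, 4, 4, 4, 4], max_group_size=8)
print(groups)
print(peak_fp32_grad_bytes_upscale_all(groups))
print(peak_fp32_grad_bytes_per_group(groups))
```

In this toy setting the per-group peak depends only on the cap, not on the total number of expert parameters, which is why bounding the group size and stepping group by group work together.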

Highlight Result

Prior to this PR, a 6.7B base model with 16 experts ran out of memory (OOM) on 32 A100 GPUs (40GB).
With the changes, I am able to run the same model with a peak memory utilization of 31.3 GB. Since the earlier run needed more than the 40 GB available per GPU, we are saving at least 21.75% memory for this model.
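
The 21.75% lower bound can be reproduced with a short calculation. It assumes that the pre-PR run needed at least the full 40 GB of an A100, since it ran out of memory; the variable names are illustrative.

```python
capacity_gb = 40.0    # A100 memory; the pre-PR run exceeded this
peak_after_gb = 31.3  # peak utilization reported with the changes

# Lower bound on the fraction of memory saved.
min_saving_fraction = (capacity_gb - peak_after_gb) / capacity_gb
print(f"{min_saving_fraction:.2%}")
```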

Train loss curve for reference

Sanity Checks

Train loss curves before and after the changes match

Batch Times

To create the most pathological case, I set the global batch size to 8. Even so, there is no penalty in batch times.

Memory Consumption

The memory saved is expected to increase with increasing model sizes.

siddharth9820 · 2022-07-12T00:03:13Z

deepspeed/moe/utils.py

@@ -59,8 +59,9 @@ def split_params_grads_into_shared_and_expert_params(
    return shared_grads, expert_grads


-def split_params_into_different_moe_groups_for_optimizer(
-        param_groups: Tuple[Dict]) -> Tuple[Dict]:
+def split_params_into_different_moe_groups_for_optimizer(param_groups: Tuple[Dict],


@tjruwase, please let me know how we would want to offer the user a way to set the max_group_size

As discussed, let's add an moe section in ds_config. Perhaps, @awan-10 could help with the design.

@awan-10 Perhaps the moe section could also contain a flag to toggle expert slicing?

siddharth9820 added 4 commits July 8, 2022 01:49

add maximum param group size to moe

47af728

process optimizer groups individually

8d31edb

correct timer placement

e6702d8

tested with 1.3B 2.7B and 6.7B

7dd3da2

siddharth9820 requested review from jeffra, samyam, tjruwase, ShadenSmith, conglongli, awan-10, cli99, eltonzheng, minjiaz, RezaYazdaniAminabadi, duli2012, mrwyattii, yaozhewei, arashb and xiaoxiawu-microsoft as code owners July 8, 2022 00:45

siddharth9820 added 4 commits July 7, 2022 20:50

Merge branch 'master' into zero2_optim_tiling

be0d2fe

correction in DeepSpeedCPUAdam

b02a34f

torch 1.8.0 backwards compatibility

57f64c1

Merge branch 'master' into zero2_optim_tiling

563d3eb

siddharth9820 requested a review from samadejacobs as a code owner July 11, 2022 22:02

siddharth9820 added 2 commits July 12, 2022 04:05

modify optimizer groups dynamically

1d9852c

correction for DSCpuAdam

af4da67

siddharth9820 commented Jul 12, 2022

tjruwase approved these changes Jul 12, 2022

tjruwase and others added 2 commits July 13, 2022 11:42

Merge branch 'master' into zero2_optim_tiling

2ff9812

restored comments

41a7d16

siddharth9820 added 2 commits July 13, 2022 23:53

Merge branch 'zero2_optim_tiling' of github.com:microsoft/DeepSpeed into zero2_optim_tiling

e9a1cd3

remove print statements

3209c3b

siddharth9820 merged commit c1af73f into master Jul 13, 2022

siddharth9820 deleted the zero2_optim_tiling branch July 13, 2022 23:02
