Improving memory utilization of Z2+MoE #2079

Merged: 14 commits into master on Jul 13, 2022
Conversation

@siddharth9820 (Contributor) commented on Jul 8, 2022:

Summary

While running experiments, I found that an inordinate amount of memory is used in the gradient upscaling (fp16 → fp32) step for high expert-to-GPU ratios (such as 1 or 1/2).

This PR does two things:

  • It places an upper limit on the size of each expert parameter group (see deepspeed/moe/utils.py).
  • Instead of first upscaling all the gradients and then running the optimizer once, it executes a combined upscaling + optimizer step on each parameter group, one group at a time. As a result, the fp32 gradients of only one parameter group exist in GPU memory at any given time (see deepspeed/runtime/zero/stage_1_and_2.py). A minimal sketch of both ideas follows this list.
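The sketch below is a minimal, hypothetical illustration of the two ideas above, not the actual DeepSpeed implementation. The helper names (split_experts_into_capped_groups, step_one_group_at_a_time), the unit of max_group_size (number of elements), and the 'moe' group key are assumptions made for illustration; it also assumes the optimizer's param_groups hold fp32 master copies of the parameters, as in mixed-precision training.

```python
# Hypothetical sketch of the two ideas above -- helper names, the unit of
# max_group_size (number of elements), and the 'moe' group key are invented
# for illustration; this is not the actual DeepSpeed implementation.
from typing import Dict, List

import torch


def split_experts_into_capped_groups(expert_params: List[torch.nn.Parameter],
                                     max_group_size: int) -> List[Dict]:
    """Chunk expert parameters so that no optimizer parameter group
    exceeds max_group_size elements in total."""
    groups: List[Dict] = []
    current: List[torch.nn.Parameter] = []
    current_numel = 0
    for p in expert_params:
        if current and current_numel + p.numel() > max_group_size:
            groups.append({'params': current, 'moe': True})
            current, current_numel = [], 0
        current.append(p)
        current_numel += p.numel()
    if current:
        groups.append({'params': current, 'moe': True})
    return groups


def step_one_group_at_a_time(optimizer: torch.optim.Optimizer,
                             fp16_grads_per_group: List[List[torch.Tensor]]) -> None:
    """Upscale (fp16 -> fp32) and step one parameter group at a time, freeing
    the fp32 gradients before moving on, so that the fp32 gradients of only
    one group are alive at any point. Assumes optimizer.param_groups hold the
    fp32 master copies of the parameters, as in mixed-precision training."""
    for group, grads in zip(optimizer.param_groups, fp16_grads_per_group):
        for p, g in zip(group['params'], grads):
            p.grad = g.float()  # fp32 copy created only for this group
        # Standard PyTorch optimizers skip parameters whose .grad is None,
        # so this step effectively updates only the current group.
        optimizer.step()
        for p in group['params']:
            p.grad = None       # release this group's fp32 gradients
```

In the actual PR, the group-size cap lives in split_params_into_different_moe_groups_for_optimizer (deepspeed/moe/utils.py) and the per-group upscale + step loop in the ZeRO stage 1/2 optimizer (deepspeed/runtime/zero/stage_1_and_2.py).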

Highlight Result

Prior to this PR, a 6.7B base model with 16 experts ran out of memory (OOM) on 32 A100 GPUs (40 GB).
With the changes, the same model runs with a peak memory utilization of 31.3 GB. That is, at a bare minimum, (40 - 31.3) / 40 ≈ 21.75% of memory is saved for this model.

Train loss curve for reference:
[image: train loss curve]

Sanity Checks

Train loss curves before and after the changes match.

[images: train loss curves before and after the changes]

Batch Times

To create the most pathological case, I set the global batch size to 8. Even so, there is no penalty in batch times.

[image: batch-time comparison]

Memory Consumption

The memory saved is expected to increase with model size.
[image: peak memory consumption comparison]

Review thread on deepspeed/moe/utils.py:

```diff
@@ -59,8 +59,9 @@ def split_params_grads_into_shared_and_expert_params(
     return shared_grads, expert_grads


-def split_params_into_different_moe_groups_for_optimizer(
-        param_groups: Tuple[Dict]) -> Tuple[Dict]:
+def split_params_into_different_moe_groups_for_optimizer(param_groups: Tuple[Dict],
```
@siddharth9820 (Contributor, author) commented:
@tjruwase, please let me know how we would want to offer the user a way to set the max_group_size

Reply from a contributor:
As discussed, let's add an moe section to ds_config. Perhaps @awan-10 could help with the design.

@siddharth9820 (Contributor, author) replied:
@awan-10 Perhaps the moe section could also contain a flag to toggle expert slicing?
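Purely as an illustration of the direction discussed in this thread (the actual design is left open here), such a ds_config section might look like the sketch below. The moe block and its keys max_group_size and enable_expert_slicing are invented for this example and are not a confirmed DeepSpeed API.

```python
# Hypothetical `moe` section in a DeepSpeed config dict -- the `moe` block and
# its key names are invented for illustration and do not reflect a finalized API.
ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "moe": {
        "max_group_size": 500_000_000,   # cap (in elements) on each expert parameter group
        "enable_expert_slicing": True,   # the toggle suggested above; design still open
    },
}
```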

@siddharth9820 merged commit c1af73f into master on Jul 13, 2022.
@siddharth9820 deleted the zero2_optim_tiling branch on July 13, 2022 at 23:02.