
Optimizing efficiency using deepspeed #1939

Answered by tjruwase
base-y asked this question in Q&A


The ZeRO optimizations in DeepSpeed are most helpful when:

  1. The model is too large to train with data parallelism alone.
  2. Larger batch sizes can improve compute efficiency without hurting model performance.

ZeRO's memory savings are a trade-off against increased communication, and both the savings and the communication overhead grow with the ZeRO stage. That overhead can hurt the throughput of smaller models like t5-base, which don't gain much from the memory savings. In such cases, it is probably better to disable ZeRO by setting the stage to 0. You might find the Flops Profiler or Autotuner helpful for your investigation.
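
For concreteness, here is a minimal sketch of a DeepSpeed config that disables ZeRO (stage 0) and turns on the Flops Profiler. The batch size, fp16, and profiler values are illustrative assumptions, not a recommendation; check the DeepSpeed config documentation for the authoritative schema.

```python
# Minimal sketch of a DeepSpeed config (values below are placeholder assumptions).
ds_config = {
    "train_micro_batch_size_per_gpu": 8,   # placeholder batch size
    "fp16": {"enabled": True},             # illustrative; use whatever precision you train with

    # Stage 0 disables ZeRO partitioning; stages 1-3 trade memory savings
    # for increasing communication overhead.
    "zero_optimization": {"stage": 0},

    # Optional: profile FLOPs and per-module latency to compare stages.
    "flops_profiler": {
        "enabled": True,
        "profile_step": 5,   # which training step to profile (assumed value)
    },
}

# The dict can be written out as ds_config.json and passed to the launcher via
# --deepspeed_config, or (in recent DeepSpeed versions) passed directly, e.g.
# deepspeed.initialize(model=model, config=ds_config, ...).
```

To compare stages for a model like t5-base, you could run the same training script once with `"stage": 0` and once with `"stage": 2` and compare the profiler's throughput numbers.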
