[doc] pipeline doc typos/improvements #659
docs/_tutorials/pipeline.md
Outdated
DeepSpeed provides a `LayerSpec` class that delays the construction of
modules until the model layers have been partitioned across workers. Then,
the modules are built on the GPU that owns the layer.
each GPU allocates only the modules assigned to it.
Is this supposed to communicate that when `LayerSpec` is used, not only that each GPU allocates only the layers it's assigned to but also that no CPU allocation happens at all?
Or should I have rewritten it to say:
Then each worker will allocate only the layers it's assigned to. So continuing the example from the previous paragraph, a machine with 16 GPUs will need to allocate a total of 1x model size on its CPU, compared to 16x in the `LayerSpec` example.
I think the two paras have GPU and CPU mixed up.
That's correct, `LayerSpec` does not allocate modules until the partitioning is complete. At that point, each rank allocates only the layers assigned to it and they are then moved to the GPU. This lets us build distributed models that wouldn't fit in CPU memory.
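To make the deferred construction concrete, here is a minimal, framework-free sketch. `DeferredLayer`, `partition`, and `build_local_layers` are hypothetical names used only for illustration; they are not DeepSpeed APIs, and the real `LayerSpec` and partitioning logic may differ (the uniform split in particular is an assumption).

```python
class DeferredLayer:
    """Hypothetical stand-in: records how to build a layer without building it."""

    built = 0  # counts constructions in this process, for illustration only

    def __init__(self, factory, *args, **kwargs):
        self.factory = factory
        self.args = args
        self.kwargs = kwargs

    def build(self):
        DeferredLayer.built += 1
        return self.factory(*self.args, **self.kwargs)


def partition(num_layers, num_stages):
    """Split layer indices into contiguous, near-equal ranges (assumed policy)."""
    base, extra = divmod(num_layers, num_stages)
    bounds, start = [], 0
    for stage in range(num_stages):
        stop = start + base + (1 if stage < extra else 0)
        bounds.append(range(start, stop))
        start = stop
    return bounds


def build_local_layers(specs, num_stages, stage_id):
    """Build only the layers owned by `stage_id`; the others are never allocated."""
    owned = partition(len(specs), num_stages)[stage_id]
    return [specs[i].build() for i in owned]


# 32 "layers", each a 1024-element list standing in for module weights.
specs = [DeferredLayer(list, range(1024)) for _ in range(32)]
local = build_local_layers(specs, num_stages=16, stage_id=3)
print(len(local), DeferredLayer.built)
```

In this sketch, creating the 32 specs allocates no layer data at all; the process playing stage 3 constructs only the two layers it owns.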
So is it better to leave the original modification of this PR, or replace with the more exemplifying:
Then each worker will allocate only the layers it's assigned to. So continuing the example from the previous paragraph, a machine with 16 GPUs will need to allocate a total of 1x model size on its CPU, compared to 16x in the LayerSpec example.
OK, I went ahead with the longer version. Please double check that it's still correct.
Thanks so much for your ongoing contributions to DeepSpeed! Good questions!
Pipeline parallelism is somewhat rigid in that the input to each layer's From a usability perspective, I'd love to be able to specify that some inputs are static and globally available (e.g.,
Agreed that this is a limitation. As a starting place, we wanted to match
Definitely. I'm trying to release our Megatron GPT implementation asap. The word embeddings are tied because they are used on both the first and last stages.
Thank you for all the other answers in the previous comment, @ShadenSmith.
OK, can then this example be expanded to show how the I think it'd help if all that code is explicit in the open so that the user can see what's going on and then expand it to support other inputs. I hope you can see how the provided example isn't helping with a general case. We have a huge list of inputs as can be seen here in just one example of t5: and you can also see that some of them aren't tensors, but Booleans. I see fairscale has the same limitation:
Is this the case where the pipeline implementation tries to protect itself from having an object as input, since it won't know how to
I do understand that we need each stage in the pipeline to expect
After studying this more I finally wrapped my head around it. I propose the following change to https://www.deepspeed.ai/tutorials/pipeline/#inputs-and-outputs to make it crystal clear from the get-go:
before:
after:
and add a note:
to:
because otherwise Please let me know if you feel this makes things easier to understand and I will expand this PR.
And as I'm trying to port t5 transformers to
?
Currently in transformers t5 I have 15 distinct input parts, some tensors of BS, some I think I have an idea how to work around this by creating a Pipe wrapper class at run time and initialize it with a late
Hi @stas00, thanks a ton for your contributions and comments. Unfortunately I lost my first reply in a power outage. Here are a few quick comments and clarifications and I'll double back over the day.

DeepSpeed does not slice inputs into micro-batches, we just rely on the data loader to do that. We already have the concept of micro-batches for gradient accumulation and build off of that instead. Instead, we just

The

    def forward(self, inputs):
        x = inputs
        for layer in self.layers:
            x = layer(x)
        return x

This contract lets us split
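A small sketch of why the sequential contract above makes splitting possible: running contiguous chunks of the layer list one after another gives the same result as running the whole list. `run_sequential` and `split_into_stages` are hypothetical helpers for illustration, not DeepSpeed functions, and the plain callables stand in for real modules.

```python
def run_sequential(layers, inputs):
    """The contract quoted above: each layer's output feeds the next layer."""
    x = inputs
    for layer in layers:
        x = layer(x)
    return x


def split_into_stages(layers, num_stages):
    """Cut the layer list into contiguous chunks.

    Assumed policy: ceiling-sized chunks, which can yield fewer chunks than
    requested when the layers do not divide evenly.
    """
    size = -(-len(layers) // num_stages)  # ceiling division
    return [layers[i:i + size] for i in range(0, len(layers), size)]


layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
stages = split_into_stages(layers, num_stages=2)

# Each stage runs its own chunk; only the activation crosses a stage boundary.
activation = 5
for stage in stages:
    activation = run_sequential(stage, activation)

print(activation == run_sequential(layers, 5))
```

Because the only value passed from one chunk to the next is the activation, anything else a layer needs has to travel inside that activation or be fixed ahead of time.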
Yes, this one I have already figured out and basically created a wrapper class to which I pass the flow True/False args via its
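One way such a wrapper could look, as a rough sketch only: static, non-tensor options are bound when the wrapper is constructed, so the call between stages carries just the activation. `FlagBoundStage` and `toy_block` are hypothetical names, and this is not necessarily how the wrapper mentioned above was written.

```python
class FlagBoundStage:
    """Hypothetical wrapper: static, non-tensor options are fixed at construction."""

    def __init__(self, block, **static_flags):
        self.block = block
        self.static_flags = static_flags

    def __call__(self, hidden):
        # Only the activation flows between stages; flags come from the wrapper.
        return self.block(hidden, **self.static_flags)


def toy_block(hidden, negate=False, scale=1):
    """Stand-in for a block whose forward takes extra Boolean/scalar arguments."""
    scaled = [h * scale for h in hidden]
    return [-h for h in scaled] if negate else scaled


stage = FlagBoundStage(toy_block, negate=True, scale=2)
print(stage([1, 2, 3]))
```

This only covers options that stay constant for the whole run; values that change per batch would still need to be passed along with the activation.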
I think what would help you to help me is to show you exactly what I'm dealing with:
The original logic runs a simple loop, picks the right segment for each block in the loop from the first tuple, then takes chunks of that segment via the nested tuple and eventually gets a 2 tensor slice which it adds to hidden_state in self-attention. here is the structure I need to pass around: In this case we have BS=3, 6 blocks, and each block has 4 past key values stored. As you can see the batch dimension (bs=3 here) that PP needs to slice on is buried 3 levels. So I need to invert that if I want to pass it to PP as part of Also note that none of these have an issue of being switched to the right device, they are all tensors and simple variables, but aren't necessarily w/o a nested python structure around them (tuple or a tuple of tuples). Hope this helps to understand what I'm dealing with and I hope you have some practical ideas. Thank you!
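A sketch of one possible way to move such a nested structure across a tuple-only interface: flatten the nesting into a flat tuple plus a small spec, then rebuild it on the other side. `flatten_nested` and `unflatten_nested` are hypothetical helpers, and strings stand in for tensors here. This handles only the Python nesting; each leaf tensor would still need its batch dimension arranged as the pipeline expects.

```python
def flatten_nested(obj):
    """Return (leaves, spec): a flat tuple of leaves and a spec to rebuild the nesting."""
    if isinstance(obj, tuple):
        leaves, specs = [], []
        for item in obj:
            sub_leaves, sub_spec = flatten_nested(item)
            leaves.extend(sub_leaves)
            specs.append(sub_spec)
        return tuple(leaves), tuple(specs)
    return (obj,), None  # None marks a leaf


def unflatten_nested(leaves, spec):
    """Inverse of flatten_nested: consume leaves in order, following the spec."""
    it = iter(leaves)

    def rebuild(node):
        if node is None:
            return next(it)
        return tuple(rebuild(child) for child in node)

    return rebuild(spec)


# 6 blocks, each holding 4 past key/value entries, as in the description above.
past = tuple(tuple(f"block{b}_kv{k}" for k in range(4)) for b in range(6))
flat, spec = flatten_nested(past)
print(len(flat), unflatten_nested(flat, spec) == past)
```

The spec contains no tensors, so it could be computed once and kept on every stage rather than sent with each micro-batch.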
I haven't thought of this - thank you! Oh, I think in fairscale/pytorch pipeline it does slice - i.e. expects the inputs to be of batch size in the first dimension. Let me double check.
Hmm, ok, then I don't think this would work at all in the specific case I'm attempting - I'm currently trying to use the newly added Pipeline in pytorch which I understand is coming from fairscale. The thing is again trying to make all of
so 2 pipes! and it sounds that this approach won't work with DeepSpeed pipeline. Am I correct?
let's merge this as is?
* [WarmupDecayLR] fix log(0) & 1/log(1) bugs (microsoft#772) * fix log(0) & 1/log(1) bugs * simplify Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> Co-authored-by: Cheng Li <pistasable@gmail.com> * bump to v0.3.12 * Bug fix: Remove client optimizer param_group list item that does not have 'params' (microsoft#827) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * [doc] pipeline doc typos/improvements (microsoft#659) Admin merging for pure-doc PR that does not trigger build. Co-authored-by: Stas Bekman <stas00@users.noreply.github.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com> Co-authored-by: Cheng Li <pistasable@gmail.com>
modules until the model layers have been partitioned across workers.
Then each worker will allocate only the layers it's assigned to. So, continuing the
example from the previous paragraph, a machine with 16 GPUs will need to allocate a
total of 1x model size on its CPU, compared to 16x in the LayerSpec example.
@stas00 @ShadenSmith I think there's a typo in this sentence, it should be "compared to 16x in the Sequential example"
Good catch, @g-karthik
How about:
So, comparing to the
example from the previous paragraph, a machine with 16 GPUs will need to allocate a
total of 1x model size on its CPU and not 16x.
Sounds fine to me!
As @g-karthik flagged in microsoft#659 (comment) my previous correction PR had one sentence that said the wrong thing. So this PR attempts to rectify that. Thank you!
This PR fixes a few typos and rewrites a few sentences to hopefully be easier to understand.
And after reading this doc I have a few questions:
It's not clear from the tutorial why `TransformerBlockPipe` needs to return `mask`. And if there is some hidden reason, shouldn't it then return `hidden` too?
Why do you need the vars to be a single tensor or a tuple? This surely could be a problematic limitation since some layers could have forward args which aren't tensors. Perhaps a more dynamic approach could be used instead? e.g. with the automatic data remapper I posted here: rfc: automating the switching of inputs to the device of the params pytorch/pytorch#49961 (comment)
wrt Tied Layers - could you please give an example? or at least a pointer to where it can be seen in action?
Please let me know if it'd be better to separate the questions into an Issue and not mix it with the PR. I was just thinking that perhaps the answers could be integrated into the tutorials, hence posted them here.
Thank you!