[doc] pipeline doc typos/improvements #659

Merged · 3 commits · Mar 14, 2021
Conversation

stas00 (Contributor) commented Jan 11, 2021:

This PR fixes a few typos and rewrites a few sentences to hopefully make them easier to understand.

And after reading this doc I have a few questions:

  1. It's not clear from the tutorial why TransformerBlockPipe needs to return mask. And if there is some hidden reason, shouldn't it then return hidden too?

  2. Why do you need the vars to be a single tensor or a tuple? This surely could be a problematic limitation since some layers could have forward args which aren't tensors. Perhaps a more dynamic approach could be used instead? e.g. with the automatic data remapper I posted here: rfc: automating the switching of inputs to the device of the params pytorch/pytorch#49961 (comment)

  3. wrt Tied Layers - could you please give an example? or at least a pointer to where it can be seen in action?

Please let me know if it'd be better to separate the questions into an Issue and not mix it with the PR. I was just thinking that perhaps the answers could be integrated into the tutorials, hence posted them here.

Thank you!

Comment on lines 278 to 280
DeepSpeed provides a `LayerSpec` class that delays the construction of
modules until the model layers have been partitioned across workers. Then,
-the modules are built on the GPU that owns the layer.
+each GPU allocates only the modules assigned to it.
stas00 (Contributor, Author) commented Jan 11, 2021:

Is this supposed to communicate that when LayerSpec is used, not only does each GPU allocate only the layers it's assigned to, but also that no CPU allocation happens at all?

Or should I have rewritten it to say:

Then each worker will allocate only the layers it's assigned to. So continuing the example from the previous paragraph, a machine with 16 GPUs will need to allocate a total of 1x model size on its CPU, compared to 16x in the LayerSpec example.

I think the two paras have GPU and CPU mixed up.

Contributor:

That's correct, LayerSpec does not allocate modules until the partitioning is complete. At that point, each rank allocates only the layers assigned to it and they are then moved to the GPU. This lets us build distributed models that wouldn't fit in CPU memory.

stas00 (Contributor, Author):

So is it better to leave the original modification of this PR, or replace it with the more illustrative:

Then each worker will allocate only the layers it's assigned to. So continuing the example from the previous paragraph, a machine with 16 GPUs will need to allocate a total of 1x model size on its CPU, compared to 16x in the LayerSpec example.

stas00 (Contributor, Author):

OK, I went ahead with the longer version. Please double check that it's still correct.

docs/_tutorials/pipeline.md (outdated review thread, resolved)
ShadenSmith (Contributor):

Thanks so much for your ongoing contributions to DeepSpeed!

Good questions!

It's not clear from the tutorial why TransformerBlockPipe needs to return mask. And if there is some hidden reason, shouldn't it then return hidden too?

Pipeline parallelism is somewhat rigid in that the input to each layer's forward() is always the output of the previous layer. When you have a stack of transformers, we need to return the mask in order to provide it to the next layer. We don't return hidden because it's not needed by the next layer. The next layer uses output as its hidden input.

From a usability perspective, I'd love to be able to specify that some inputs are static and globally available (e.g., mask) so that we wouldn't need to forward it down the pipeline. No concrete plans for it yet, though.

Why do you need the vars to be a single tensor or a tuple?

Agreed that this is a limitation. As a starting place, we wanted to match torch.nn.Sequential functionality and that container has the same restriction. I'd like to be more flexible. The primary limitation is the communication code at stage boundaries. I've been working on a more flexible communication manager that would allow something like this and also remove some hacky code in the pipeline engine.

wrt Tied Layers - could you please give an example?

Definitely. I'm trying to release our Megatron GPT implementation asap. The word embeddings are tied because they are used on both the first and last stages.

stas00 (Contributor, Author) commented Jan 15, 2021:

Thank you for all the other answers in the previous comment, @ShadenSmith.

It's not clear from the tutorial why TransformerBlockPipe needs to return mask. And if there is some hidden reason, shouldn't it then return hidden too?

Pipeline parallelism is somewhat rigid in that the input to each layer's forward() is always the output of the previous layer. When you have a stack of transformers, we need to return the mask in order to provide it to the next layer. We don't return hidden because it's not needed by the next layer. The next layer uses output as its hidden input.

OK, can this example then be expanded to show how the mask is fed to forward on subsequent calls in the pipeline?
https://www.deepspeed.ai/tutorials/pipeline/#inputs-and-outputs
The issue here is that there could be other inputs and it's not clear at all how the user should handle those. I.e., you're saying "The next layer uses output as its hidden input", but this is not generic enough.

I think it'd help if all that code were explicit and out in the open, so that the user can see what's going on and can then expand it to support other inputs.

I hope you can see how the provided example isn't helping with the general case.

We have a huge list of inputs, as can be seen here in just one example from t5:

https://github.com/huggingface/transformers/blob/c60e0e1ee45f4bf1017736b146c51729f120bb83/src/transformers/models/t5/modeling_t5.py#L600-L612

and you can also see that some of them aren't tensors, but booleans.

I see fairscale has the same limitation:
https://fairscale.readthedocs.io/en/latest/api/nn/pipe.html

Input and output have to be a Tensor or a tuple of tensors. This restriction is applied at partition boundaries too.

Is this the case where the pipeline implementation tries to protect itself from having an object as input, since it wouldn't know how to .to() such an input to the correct device? But it shouldn't be a problem with a simple non-tensor variable or a list/dict of tensors and/or simple variables/lists/dicts/tuples. Am I missing something? As I mentioned, I wrote a helper function that handles any such types transparently:

import torch

def recursive_to(device, item):
    """
    Switch any tensors found in `item` to `device`.
    Currently can handle a single tensor, or any nested list, tuple and dict structure.
    """

    if torch.is_tensor(item):
        return item.to(device)

    elif isinstance(item, list):
        for i, x in enumerate(item):
            item[i] = recursive_to(device, x)
        return item

    elif isinstance(item, tuple):
        return tuple(recursive_to(device, list(item)))

    elif isinstance(item, dict):
        for k, v in item.items():
            item[k] = recursive_to(device, v)
        return item

    else:
        return item
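
For example (just a small illustration of the helper above):

# small usage illustration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch = {"input_ids": torch.ones(2, 8, dtype=torch.long),
         "use_cache": True,
         "past": [torch.zeros(2, 4), None]}
batch = recursive_to(device, batch)   # tensors move, the bool and None pass through unchanged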

I do understand that we need each stage in the pipeline to expect self + input and return a single output, so that it stacks up. The key is that, if the input or output is a tuple, it should be able to contain non-tensor elements, I believe.

stas00 (Contributor, Author) commented Jan 16, 2021:

After studying this more I finally wrapped my head around it.

I propose the following change to https://www.deepspeed.ai/tutorials/pipeline/#inputs-and-outputs to make it crystal clear from the get-go:

  1. Rewrite the code as:

before:

class TransformerBlock(nn.Module):
    ...
    def forward(self, hidden, mask):
        hidden = self.compute(hidden, mask)
        return hidden

after:

class TransformerBlockPipe(TransformerBlock):
    def forward(self, inputs):
        hidden, mask = inputs
        hidden = super().forward(hidden, mask)
        return (hidden, mask)

and add a note:

It's easy to see how the output of the previous stage in the pipeline becomes the input to the next stage, since the previous stage now returns a tuple of (hidden, mask) and the next stage expects, besides self, a tuple of (hidden, mask).

  2. One more piece is missing from that section of the tutorial. Since the original code expected just one non-tuple variable to be returned from TransformerBlock.forward, you have to modify the outermost pre-pipe caller from:
hidden = self.transformer_block(hidden, mask)

to:

hidden, _ = self.transformer_block_pipe((hidden, mask))

because otherwise hidden will contain a tuple, and not the original single variable.
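
To make the flow concrete, here's a minimal self-contained sketch (plain PyTorch, no DeepSpeed, with a toy TransformerBlock standing in for the real one) showing how the (hidden, mask) tuple travels through a stack of such blocks:

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, hidden, mask):
        return self.linear(hidden) * mask   # placeholder for the real computation

class TransformerBlockPipe(TransformerBlock):
    def forward(self, inputs):
        hidden, mask = inputs               # the previous stage's output tuple
        hidden = super().forward(hidden, mask)
        return (hidden, mask)               # becomes the next stage's input tuple

blocks = nn.Sequential(TransformerBlockPipe(8), TransformerBlockPipe(8))
hidden = torch.randn(4, 8)
mask = torch.ones(4, 8)
hidden, _ = blocks((hidden, mask))          # the caller passes and unpacks the tuple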

Please let me know if you feel this makes things easier to understand and I will expand this PR.

stas00 (Contributor, Author) commented Jan 18, 2021:

And as I'm trying to port t5 transformers to Pipe, I see why there is a requirement for all inputs to be tensors - because they have to be sliced into micro-batches.

  1. If I got it right - they shouldn't just be tensors, they should be tensors with the first dimension being the batch size. I actually had a tensor that was tracking things, but it's a no-go, since it's of the wrong shape. So this prerequisite can and probably should be clarified.

  2. Surely, there should be a way to handle booleans here. So the input parts that are of batch size will be sliced into micro-batches, but parts that are on/off switches can probably be left alone - hmm, but then it might be difficult to put them back into the outputs correctly - perhaps the inputs should have an optional leave_me_alone_inputs which should be replicated as-is on micro-batch slicing? So the sig would be:

def forward(self, inputs, leave_me_alone_inputs):
    ...
    return (inputs, leave_me_alone_inputs)

?
This would also allow passing random other inputs that don't need to be micro-batch-sliced.

  3. The same issue happens with optional input parts, which could be None or a tensor. The Pipe needs to slice the tensor, but leave None alone.

Currently in transformers t5 I have 15 distinct input parts: some batch-sized tensors, some None, some booleans, some tensors that aren't batch-sized.

I think I have an idea how to work around this by creating a Pipe wrapper class at run time and initializing it with a late __init__, to overcome the need to pass flags and states via forward. I will probably have to come up with some closure to keep the global state of the pipe. I will keep you posted.
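
Something along these lines (a rough sketch with hypothetical names, not working transformers or DeepSpeed code):

import torch.nn as nn

def make_pipe_block(block, output_attentions=False):
    # closure-held accumulator, shared with the caller outside the pipe's tensor-only flow
    collected_attentions = []

    class BlockPipe(nn.Module):
        def __init__(self):
            super().__init__()
            self.block = block             # the wrapped original layer

        def forward(self, inputs):
            hidden, mask = inputs
            # flags come from the closure, not from the pipe inputs
            # (the (hidden, attn) return signature here is hypothetical)
            hidden, attn = self.block(hidden, mask, output_attentions=output_attentions)
            if output_attentions:
                collected_attentions.append(attn)
            return (hidden, mask)

    return BlockPipe(), collected_attentions

# usage sketch: pipe_block, attentions = make_pipe_block(t5_block, output_attentions=True)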

ShadenSmith (Contributor):

Hi @stas00, thanks a ton for your contributions and comments. Unfortunately I lost my first reply in a power outage. Here are a few quick comments and clarifications, and I'll double back later in the day.

DeepSpeed does not slice inputs into micro-batches; we just rely on the data loader to do that. We already have the concept of micro-batches for gradient accumulation and build off of that instead: we just next() the provided data iterator gradient_accumulation_steps times.
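
Conceptually (a simplified sketch, not the actual engine code):

def train_batch_sketch(forward_backward_step, data_iter, gradient_accumulation_steps):
    # micro-batches come from repeatedly advancing the data iterator, one per
    # gradient-accumulation step, rather than from slicing the batch that was passed in
    for _ in range(gradient_accumulation_steps):
        micro_batch = next(data_iter)   # the loader already yields micro-batch-sized tensors
        forward_backward_step(micro_batch)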

The forward() input limitations are partly a trade-off of pipeline parallelism, and partly engineering limitations. Pipeline modules use this forward implementation implicitly:

def forward(self, inputs):
    x = inputs
    for layer in self.layers:
        x = layer(x)
    return x

This contract lets us split self.layers and place them on different machines. There are two consequences here:

  • Inputs like booleans, etc. still need to "flow" into the layer unless they are default-valued keyword args. That's where something like the TransformerBlockPipe subclass comes in: we can wrap the functionality of an existing module and just specialize the call into forward with any additional args. Within the pipeline, there's no call site since it's done implicitly by the engine.
  • Successive layers could reside on different machines, and so any inputs must be communicated by NCCL. This is why we are limited to tensor inputs/outputs today. I agree that your recursive to() would help in a non-distributed environment, but we need a bit more in order to send/recv (also, the receiving processes do not know the shape/type of data to receive without an initial handshake). Maybe we could do something like pickle more general data structures and send them as ByteTensors?
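
A rough sketch of that pickle idea (just an assumption of how it might look, not anything DeepSpeed implements; a real version would also need to exchange the payload length before the payload itself so the receiver knows how much to expect):

import pickle
import torch

def to_byte_tensor(obj):
    payload = pickle.dumps(obj)
    return torch.tensor(list(payload), dtype=torch.uint8)

def from_byte_tensor(byte_tensor):
    return pickle.loads(bytes(byte_tensor.tolist()))

# round-trip example; in a pipeline the uint8 tensor would be what gets sent/received
msg = {"use_cache": True, "hidden": torch.randn(2, 4)}
restored = from_byte_tensor(to_byte_tensor(msg))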

stas00 (Contributor, Author) commented Jan 20, 2021:

  • Inputs like booleans, etc. still need to "flow" into the layer unless they are default-valued keyword args. That's where something like the TransformerBlockPipe subclass comes in: we can wrap the functionality of an existing module and just specialize the call into forward with any additional args. Within the pipeline, there's no call site since it's done implicitly by the engine.

Yes, this one I have already figured out and basically created a wrapper class to which I pass the flow True/False args via its __init__. So we are good here. Thank you!

  • Successive layers could reside on different machines, and so any inputs must be communicated by NCCL. This is why we are limited to tensor inputs/outputs today. I agree that your recursive to() would help in a non-distributed environment, but we need a bit more in order to send/recv (also, the receiving processes do not know the shape/type of data to receive without an initial handshake). Maybe we could do something like pickle more general data structures and send them as ByteTensors?

I think what would help you to help me is to show you exactly what I'm dealing with:

  1. I have a few simple things like:

    None, which may become a tensor inside the stages of the pipe - but it shouldn't be sliced or reconstructed on the way out.

    There really should be a None type of tensor

  2. Then I have medium-complexity things like:

    (), which gets filled out by each stage - e.g. it accumulates all hidden_states, attentions, etc. - I have 5 of these in t5.

    This one is difficult since now, instead of accumulating say 6 values (for a stack of 6 blocks), I end up with 12, 18 or more depending on the number of chunks, and so I'm attempting to re-construct them once we are out of the pipeline. Not simple, but doable.

    Same here - it shouldn't be sliced or reconstructed on the way out.

    What I'm attempting at the moment is a closure that I pass to the wrapper's __init__, growing these accumulators via the closure.

  3. And the one I'm fighting with at the moment is this:

    past_key_values, which is a tuple of tuples of tensors, and the batch dimension is not even at the top level.

    This one needs to be sliced, but on the inside of a tuple; and since it's a tuple of tuples I can't pass it with the input - I'm trying to pass it to the wrapper's __init__ instead and do my own slicing from inside the wrapper, by deriving the size of the micro-batch and counting each invocation (and hoping they are not async).

    I'm thinking of adding some encoding logic where I convert the tuple of tuples into a new stack of tensors, pushing the batch dimension to the front so that it gets sliced correctly, and then inside forward pick out the section that is relevant to that block. Pretty nuts.

The original logic runs a simple loop, picks the right segment for each block in the loop from the first tuple, then takes chunks of that segment via the nested tuple and eventually gets a 2-tensor slice which it adds to hidden_state in self-attention. Here is the structure I need to pass around:

(screenshot of the nested past_key_values structure)

In this case we have BS=3, 6 blocks, and each block has 4 past key values stored. As you can see, the batch dimension (bs=3 here) that PP needs to slice on is buried 3 levels deep. So I need to invert that if I want to pass it to PP as part of the input.
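
To make that inversion concrete, here is a rough sketch of the kind of re-packing helper I have in mind (hypothetical names, plain PyTorch, and assuming every block holds the same number of past key/value tensors of identical shape [bs, heads, seq, dim]):

import torch

def pack_past_key_values(past_key_values):
    # past_key_values: tuple(num_blocks) of tuple(num_kv) of [bs, heads, seq, dim]
    stacked = torch.stack([torch.stack(block) for block in past_key_values])
    # [num_blocks, num_kv, bs, heads, seq, dim] -> move the batch dim to the front
    return stacked.permute(2, 0, 1, 3, 4, 5).contiguous()

def unpack_past_key_values(packed):
    # invert the permutation and rebuild the per-block nested tuples
    stacked = packed.permute(1, 2, 0, 3, 4, 5)
    return tuple(tuple(kv for kv in block) for block in stacked)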

Also note that none of these have an issue with being switched to the right device - they are all tensors and simple variables - but they aren't necessarily without a nested Python structure around them (a tuple or a tuple of tuples).

Hope this helps you understand what I'm dealing with, and I hope you have some practical ideas.

Thank you!

Maybe we could do something like pickle more general data structures and send them as ByteTensors?

I haven't thought of this - thank you!

Oh, I think the fairscale/pytorch pipeline does slice - i.e. it expects the inputs to have the batch size as the first dimension. Let me double check.

DeepSpeed does not slice inputs into micro-batches, we just rely on the data loader to do that.

Hmm, OK, then I don't think this would work at all in the specific case I'm attempting - I'm currently trying to use the newly added Pipeline in pytorch, which I understand is coming from fairscale. The thing is, again, making all of transformers' t5 layers fully a pipeline would be very difficult, so what I'm trying right now is a hybrid approach of using the pipe in 2 places, specifically where we have stacks. So it looks something like:

T5Model
  Encoder -> T5Stack -> Pipe
  Decoder -> T5Stack -> Pipe

so 2 pipes!

and it sounds like this approach won't work with the DeepSpeed pipeline.

Am I correct?

stas00 (Contributor, Author) commented Mar 13, 2021:

let's merge this as is?

@ShadenSmith ShadenSmith merged commit 73d762c into microsoft:master Mar 14, 2021
StellaAthena added a commit to EleutherAI/DeeperSpeed that referenced this pull request Mar 15, 2021
Review comment on the updated doc text:

modules until the model layers have been partitioned across workers.
Then each worker will allocate only the layers it's assigned to. So, continuing the
example from the previous paragraph, a machine with 16 GPUs will need to allocate a
total of 1x model size on its CPU, compared to 16x in the LayerSpec example.

g-karthik:

@stas00 @ShadenSmith I think there's a typo in this sentence, it should be "compared to 16x in the Sequential example"

stas00 (Contributor, Author):

Good catch, @g-karthik

How about:

So, comparing to the
example from the previous paragraph, a machine with 16 GPUs will need to allocate a
total of 1x model size on its CPU and not 16x.

Reply:

Sounds fine to me!


@stas00 stas00 deleted the typos branch March 24, 2021 05:08
stas00 added a commit to stas00/DeepSpeed that referenced this pull request Mar 24, 2021
As @g-karthik flagged in microsoft#659 (comment) my previous correction PR had one sentence that said the wrong thing. So this PR attempts to rectify that. 

Thank you!
@stas00 stas00 mentioned this pull request Mar 24, 2021
ShadenSmith pushed a commit that referenced this pull request Mar 24, 2021
* [doc] pipeline

As @g-karthik flagged in #659 (comment) my previous correction PR had one sentence that said the wrong thing. So this PR attempts to rectify that. 

Thank you!

* tweak
sdtblck added a commit to EleutherAI/DeeperSpeed that referenced this pull request Apr 6, 2021
sdtblck added a commit to EleutherAI/DeeperSpeed that referenced this pull request Apr 22, 2021
jeffra added a commit to jeffra/DeepSpeed that referenced this pull request Aug 25, 2021