
CUDA_VISIBLE_DEVICES isn't being respected / hostfile doesn't quite work for one node #662

Closed
stas00 opened this issue Jan 12, 2021 · 6 comments · Fixed by #868

stas00 (Collaborator) commented Jan 12, 2021

I'm trying to experiment with DeepSpeed on a single GPU, and it's not respecting CUDA_VISIBLE_DEVICES.

I run the script as:

CUDA_VISIBLE_DEVICES=1 deepspeed --num_gpus=1 ./finetune_trainer.py ...

but it runs on GPU 0, ignoring CUDA_VISIBLE_DEVICES=1.

Then I tried to use the deepspeed launcher flags as explained here: https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node and encountered multiple issues:

  1. I think the --hostfile cl arg in the example is in the wrong place. Shouldn't it come right after deepspeed rather than among the client's args? That is, instead of:
deepspeed <client_entry.py> <client args> \
  --deepspeed --deepspeed_config ds_config.json --hostfile=myhostfile

it should be:

deepspeed --hostfile=myhostfile <client_entry.py> <client args> \
  --deepspeed --deepspeed_config ds_config.json

This is a launcher arg, not a client arg.

  2. It can't handle a hostfile with a single entry:
$ cat hostfile
worker-1 slots=2
$ deepspeed --hostfile hostfile  ./finetune_trainer.py ...
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/bin/deepspeed", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/bin/deepspeed", line 6, in <module>
    main()
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 259, in main
    resource_pool = fetch_hostfile(args.hostfile)
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 133, in fetch_hostfile
    raise err
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 127, in fetch_hostfile
    hostname, slots = line.split()
ValueError: not enough values to unpack (expected 2, got 0)
  3. It can't handle exclusions or inclusions without a hostfile (misleading docs).
    Copy-n-pasting from the docs - the very last code example:
$ deepspeed --exclude="worker-1:0" ./finetune_trainer.py
Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/bin/deepspeed", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/bin/deepspeed", line 6, in <module>
    main()
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 272, in main
    active_resources = parse_inclusion_exclusion(resource_pool,
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 240, in parse_inclusion_exclusion
    return parse_resource_filter(active_resources,
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 187, in parse_resource_filter
    raise ValueError("Hostname '{}' not found in hostfile".format(hostname))
ValueError: Hostname 'worker-1' not found in hostfile

I think the docs are wrong/misleading - they suggest:

You can instead include or exclude specific resources using the --include and --exclude flags. For example, to use all available resources except GPU 0 on node worker-2 and GPUs 0 and 1 on worker-3:

  • but they don't specify that a hostfile is actually required.
  • and the error message is misleading: which hostfile is it talking about? I haven't passed it any hostfile in this experiment, and if it found one in the current dir, that hostfile does have worker-1 in it (see cat hostfile earlier). So it should not just say "in hostfile" but "in /path/to/hostfile".
  • I think in this particular situation it should say: "hostfile hasn't been provided and it's required"
  4. Using the hostfile is not the right solution anyway, since it tries to ssh to worker-1:
subprocess.CalledProcessError: Command '['ssh worker-1 hostname -I']' returned non-zero exit status 255.

So how does one configure deepspeed to use a specific GPU on a single node?

Thank you!

jeffra (Collaborator) commented Jan 14, 2021

Thanks for reporting this @stas00. There are a few different things going on; let me try and address all of them here (if I miss one please let me know haha).

If you do not provide the deepspeed launcher with a hostfile (via -H/--hostfile), it will only launch processes on the local node. We try to discover all the available GPUs on the box via torch.cuda.device_count(); see here for more details on this logic. Going forward, the local node is referred to as localhost.
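
Roughly, that fallback amounts to something like the following (a minimal sketch with illustrative names, not the actual runner code):

```python
# Minimal sketch of the no-hostfile fallback described above (illustrative
# names, not DeepSpeed's actual internals): every GPU torch can see on the
# local box becomes a slot under the "localhost" key.
import torch

def infer_local_resource_pool():
    num_gpus = torch.cuda.device_count()  # GPUs visible on this machine
    return {"localhost": list(range(num_gpus))}
```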

The deepspeed launcher was primarily written to simplify multi-node training in interactive training environments. It supports arbitrary inclusion/exclusion of gpus/nodes. For example, I may want to exclude just the 3rd gpu on node 5 (maybe it has ECC errors) out of a total of 128 gpus and 32 nodes. This would be achieved via deepspeed --exclude worker-5:3 train.py.

In order to support this arbitrary inclusion/exclusion, our launcher sets the appropriate CUDA_VISIBLE_DEVICES at process launch time on each node. This means that if the user sets their own CUDA_VISIBLE_DEVICES on the launching node, it's not clear whether they want it applied to the local node or to all nodes. We should update our docs to make this clearer, though.
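
For illustration, once the include/exclude filters have been applied, a node's selected GPU ids become the CUDA_VISIBLE_DEVICES value its processes are launched with, along these lines (a sketch, not DeepSpeed's code):

```python
# Sketch only: turning a node's selected GPU ids into the CUDA_VISIBLE_DEVICES
# value its processes are launched with, e.g. [1] -> "1".
def visible_devices_env(selected_gpu_ids):
    return {"CUDA_VISIBLE_DEVICES": ",".join(str(i) for i in selected_gpu_ids)}

print(visible_devices_env([1]))  # {'CUDA_VISIBLE_DEVICES': '1'}
```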

If you wanted to do something like CUDA_VISIBLE_DEVICES=1 deepspeed --num_gpus=1 ./finetune_trainer.py ... I would recommend running this as deepspeed --include localhost:1 ./finetune_trainer.py ....

Now, if you want to use a hostfile to define your node list (even just one node), you can do that as you have. However, after seeing your ValueError stack trace, I think you may have a trailing newline at the end of the file. It seems our code was not tested with this case, and it causes a crash. I've submitted a PR to handle this case gracefully (see #669).
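
For reference, hostfile parsing that tolerates such files might look roughly like this (a sketch, not the actual fetch_hostfile implementation; it also skips '#' comment lines):

```python
# Sketch of tolerant hostfile parsing: blank lines and '#' comments are
# skipped, so a trailing newline no longer triggers the ValueError above.
def parse_hostfile(path):
    resource_pool = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # ignore empty and commented-out lines
            hostname, slots = line.split()
            resource_pool[hostname] = int(slots.split("=")[1])
    return resource_pool
```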

I think maybe one of the key missing pieces in our doc is our decision to refer to the local node as localhost if no hostfile is given. So in the case of deepspeed --exclude="worker-1:0" ./finetune_trainer.py, if you added your above hostfile, I think it should work the way you want.

stas00 (Collaborator, Author) commented Jan 15, 2021

As I don't have access to any clusters yet, I'm happy to just focus on the immediate need: being able to specify which GPU to run on.

I would recommend running this as deepspeed --include localhost:1 ./finetune_trainer.py ....

Yes, that worked! Awesome!

Could we please document that? Or I could make a PR to add it to the launcher doc, if that will save you time.


BTW, I tried first:

CUDA_VISIBLE_DEVICES=1  deepspeed --include localhost:1 ./finetune_trainer.py ...

as I already had it in the script, and it crashed with:

Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/main-38/bin/deepspeed", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/bin/deepspeed", line 6, in <module>
    main()
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 272, in main
    active_resources = parse_inclusion_exclusion(resource_pool,
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 240, in parse_inclusion_exclusion
    return parse_resource_filter(active_resources,
  File "/mnt/nvme1/code/github/00optimize/DeepSpeed/deepspeed/launcher/runner.py", line 190, in parse_resource_filter
    raise ValueError("No slot '{}' specified on host '{}'".format(
ValueError: No slot '1' specified on host 'localhost'

So ideally the launcher should detect such a situation and tell the user not to set CUDA_VISIBLE_DEVICES.

I also tried adding a hostfile with localhost slots=2, and it tries to ssh to it.

I'd say documenting:

deepspeed --include localhost:1

as the way to run on a specific single GPU on the local host is all that is needed.
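
For example (the script and config names here are placeholders, and the second command is my extrapolation of the same --include syntax):

```bash
# run only on GPU 1 of the local node
deepspeed --include localhost:1 train.py --deepspeed --deepspeed_config ds_config.json

# run on GPUs 0 and 2 of the local node
deepspeed --include localhost:0,2 train.py --deepspeed --deepspeed_config ds_config.json
```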


I think you may have a trailing new-line at the end of the file

You're spot on! I used nano and it can't help but leave extraneous newlines. I fixed that and it worked, but your PR is definitely a good addition. It's not really needed at the moment, but I'd also add skipping commented-out lines.


let me try and address all of them here (if I miss one please let me know haha).

And as I mentioned in item 1 of the OP, I think the --hostfile cl arg in the documentation example is in the wrong place.

stas00 added a commit to stas00/DeepSpeed that referenced this issue Mar 18, 2021
As discussed in microsoft#662 this PR modifies the doc:
* explains what to use instead of CUDA_VISIBLE_DEVICES
* puts the `--hostfile` cl arg in the correct place in the invocation script

Fixes: microsoft#662
stas00 mentioned this issue Mar 18, 2021
jeffra added a commit that referenced this issue Mar 18, 2021
As discussed in #662 this PR modifies the doc:
* explains what to use instead of CUDA_VISIBLE_DEVICES
* puts the `--hostfile` cl arg in the correct place in the invocation script

Fixes: #662

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
sdtblck added a commit to EleutherAI/DeeperSpeed that referenced this issue Apr 6, 2021
(squashed merge of upstream DeepSpeed changes, including the [doc] launcher fix from microsoft#868 which closes this issue)
sdtblck added a commit to EleutherAI/DeeperSpeed that referenced this issue Apr 22, 2021
(squashed merge of upstream DeepSpeed changes, including the [doc] launcher fix from microsoft#868 which closes this issue)
skpig (Contributor) commented Feb 10, 2022

@jeffra Thanks. --include localhost:3 does set CUDA_VISIBLE_DEVICES=3 properly. However, the engine returned by deepspeed.initialize is on the wrong device.

Here is my code:

import argparse
import os

import deepspeed
import torch

def debug():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", default=0, type=int,
                        help="local_rank for distributed training on gpus")
    parser = deepspeed.add_config_arguments(parser)
    args = parser.parse_args()
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12345'
    # note: this overwrites whatever CUDA_VISIBLE_DEVICES the launcher set
    os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3'
    args.local_rank = int(os.environ['LOCAL_RANK'])  # LOCAL_RANK is set by the launcher

    model = torch.nn.Linear(10, 10)
    engine, _, _, _ = deepspeed.initialize(args=args, model=model,
                                           model_parameters=model.parameters())
    print("Engine is on device: ", next(engine.parameters()).device)

if __name__ == "__main__":
    debug()

And I launch the code above with deepspeed --include="localhost:3" train.py --deepspeed --deepspeed_config config.json
I want to run the model on device 3, but the engine is on device 0.

stas00 (Collaborator, Author) commented Feb 10, 2022

@skpig, if I may recommend opening a new issue, since this one has been closed: it'd help with tracking and making sure your bug report doesn't fall through the cracks.

skpig (Contributor) commented Feb 11, 2022

@stas00 Thank you for your advice. I have already opened a new issue #1761.

oushu1zhangxiangxuan1 (Contributor) commented:

I fixed the Autotune hostfile-not-found bug (#4993) in PR #4996. Can this help you?

oushu1zhangxiangxuan1 pushed a commit to oushu1zhangxiangxuan1/DeepSpeed that referenced this issue Jan 24, 2024
github-merge-queue bot pushed a commit that referenced this issue Jan 26, 2024
Test result in single node
![Pasted Graphic](https://github.com/microsoft/DeepSpeed/assets/39079736/08de35eb-396b-4437-9b76-42962509ef3a)

Co-authored-by: John-Saxon <zhang.xiangxuan@oushu.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this issue Feb 17, 2024
microsoft#4996)

Test result in single node
![Pasted Graphic](https://github.com/microsoft/DeepSpeed/assets/39079736/08de35eb-396b-4437-9b76-42962509ef3a)

Co-authored-by: John-Saxon <zhang.xiangxuan@oushu.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
rraminen pushed a commit to ROCm/DeepSpeed that referenced this issue May 9, 2024
microsoft#4996)

Test result in single node
![Pasted Graphic](https://github.com/microsoft/DeepSpeed/assets/39079736/08de35eb-396b-4437-9b76-42962509ef3a)

Co-authored-by: John-Saxon <zhang.xiangxuan@oushu.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>