Handle missing optional configuration fields correctly #24

tjruwase · 2020-02-05T22:00:49Z

Avoid crashing when optimizer field is missing
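A minimal sketch of the idea, not the actual change in this PR: treat the `optimizer` section of the JSON config as optional and return `None` when it is absent instead of indexing into it directly. The helper names `get_optimizer_name` and `get_optimizer_params` and the sample config strings are assumptions made for illustration only.

```python
import json

# Key names used for illustration; the helpers below are hypothetical.
OPTIMIZER = "optimizer"
TYPE = "type"
PARAMS = "params"


def get_optimizer_name(config: dict):
    """Return the optimizer type, or None when the optimizer field is missing."""
    optimizer = config.get(OPTIMIZER)  # .get avoids a KeyError on a missing field
    if optimizer is None:
        return None
    return optimizer.get(TYPE)


def get_optimizer_params(config: dict):
    """Return the optimizer params dict, or None when the optimizer field is missing."""
    optimizer = config.get(OPTIMIZER)
    if optimizer is None:
        return None
    return optimizer.get(PARAMS)


# One config that declares an optimizer and one that omits it entirely.
with_optimizer = json.loads(
    '{"train_batch_size": 8, "optimizer": {"type": "Adam", "params": {"lr": 0.001}}}'
)
without_optimizer = json.loads('{"train_batch_size": 8}')

print(get_optimizer_name(with_optimizer))
print(get_optimizer_name(without_optimizer))
```

With this shape, a caller can check for `None` and fall back to an optimizer supplied by the client code rather than failing while parsing the config.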

Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>





Handle missing optional configuration fields correctly

43adb8a

tjruwase added the bug Something isn't working label Feb 5, 2020

tjruwase requested review from ShadenSmith and jeffra February 5, 2020 22:00

tjruwase linked an issue Feb 5, 2020 that may be closed by this pull request

Make optimizer field optional in JSON config #16

Closed

ShadenSmith approved these changes Feb 5, 2020


Merge branch 'master' into olruwase/optional_optimizer

c50a553

jeffra approved these changes Feb 6, 2020


Merge branch 'master' into olruwase/optional_optimizer

8c0a4ee

tjruwase merged commit af81f6f into master Feb 6, 2020

ShadenSmith deleted the olruwase/optional_optimizer branch February 7, 2020 21:36

kouml pushed a commit to kouml/DeepSpeed that referenced this pull request Apr 3, 2020

Handle missing optional configuration fields correctly (microsoft#24)

bcb4ec7

Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

jeffra pushed a commit to jeffra/DeepSpeed that referenced this pull request May 15, 2020

Merge pull request microsoft#24 from microsoft/olruwase/checkpoint_lr…

96b2224

…_scheduler Optional loading optimizer and lr scheduler states

gongwei-130 mentioned this pull request Aug 7, 2020

'CUDA error: an illegal memory access was encountered' in forward #308

Open

GrvLeo mentioned this pull request Oct 22, 2020

Fail to use Zero-offload: "ModuleNotFoundError: No module named 'deepspeed.ops.adam.cpu_adam_op'" #483

Closed

rraminen pushed a commit to rraminen/DeepSpeed that referenced this pull request Apr 28, 2021

adding stochastic mode (microsoft#24)

a8a85a9

garvct mentioned this pull request Jun 29, 2021

Bert training model failed when add --deepspeed_transformer_kernel #1155

Open

rraminen pushed a commit to rraminen/DeepSpeed that referenced this pull request Jul 19, 2021

Merge pull request microsoft#24 from rraminen/IFU-master-2021-07-02

536d0bb

IFU-master-2021-07-02

delock referenced this pull request in delock/DeepSpeedSYCLSupport Sep 21, 2022

resolving conflict "add sycl kernesl repo to submodule. (#24)"

9923467

* add sycl kernesl repo to submodule. * update submodule sycl kernel

lambda7xx mentioned this pull request Feb 24, 2023

[BUG] Zero-Inference usage error with .init_inference() #2372

Closed

phalexo mentioned this pull request Oct 11, 2023

[BUG] The code for deepspeed.comm.comm.monitored_barrier() #4488

Open
