Update Sparse Attention Tutorial #357

arashashari · 2020-09-04T00:10:35Z

No description provided.

* ZeRO-Offload v1 (squash) (#345) * update DSE to point to ZeRO-Offload staging * ZeRO-2 enable CPU offload (#313) * cpu-offload * update * deleted: deepspeed/pt/deepspeed_zero_optimizer_cpuoffload.py modified: deepspeed/pt/fp16_unfused_optimizer.py new file: install_output.txt modified: tests/unit/test_dynamic_loss_scale.py * modified: deepspeed/pt/deepspeed_zero_optimizer.py * update * modified: deepspeed/pt/deepspeed_cpu_adam.py modified: deepspeed/pt/deepspeed_zero_optimizer.py modified: tests/unit/test_checkpointing.py modified: tests/unit/test_fp16.py * deleted: install_output.txt * modified: deepspeed/pt/fp16_unfused_optimizer.py modified: tests/unit/test_dynamic_loss_scale.py * modified: deepspeed/pt/deepspeed_cpu_adam.py * modified: deepspeed/pt/deepspeed_zero_optimizer.py * modified: deepspeed/pt/deepspeed_cpu_adam.py modified: deepspeed/pt/deepspeed_zero_optimizer.py * deleted: deepspeed_cpu_adam.py modified: deepspeed_light.py modified: deepspeed_zero_optimizer.py ../../deepspeed_zero_optimizer_cpu_offload.py * modified: deepspeed/pt/deepspeed_light.py * modified: deepspeed/pt/deepspeed_light.py modified: deepspeed/pt/deepspeed_zero_optimizer.py modified: deepspeed/pt/deepspeed_zero_utils.py modified: tests/unit/test_fp16.py * modified: deepspeed/pt/deepspeed_config.py modified: deepspeed/pt/deepspeed_light.py modified: deepspeed/pt/deepspeed_zero_optimizer.py modified: tests/unit/test_checkpointing.py modified: tests/unit/test_fp16.py * modified: deepspeed/pt/deepspeed_checkpointing.py * update DSE to ZeRO-Offload commit Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * Enable ZeRO checkpointing for ZeRO-Offload (#337) * Enable ZeRO checkpointing for ZeRO-Offload Fix unit tests Bump DSE to 33b9fb77c8cecdb49118188890f662526d8e9397 * Fix accidental revert * Add ZeRO-Offload checkpointing model tests (#344) * Enable ZeRO checkpointing for ZeRO-Offload Fix unit tests Bump DSE to 33b9fb77c8cecdb49118188890f662526d8e9397 * Fix accidental revert * Fix 
ZeRO-Offload checkpointing bug when change gpu count Add checkpointing model tests for ZeRO-Offload Remove optimizer key from Megatron model tests Use different deepspeed master port for Megatron model tests Co-authored-by: Jie <37380896+jren73@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * update DSE to staging for zero-dual * Update test_sparse_attention.py * Assert ZeRO-Offload+gradient accumulation (#347) * Adding link to Sparse Attention in Navigation page (#355) * adding link to Sparse Attention in Navigation page * Correctness and perf fixes (#354) * Update test_sparse_attention.py * jren changes * Merge with correctness/perf fixes * Formatting fixes Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * add cpu adam optimizer (#356) * add cpu adam optimizer * run precommit * clean adam_test * add accuracy test for adam * make the adam unit test work with random params and grads and for more steps * Samyamr/zero offload correctness (#359) * fixing gradient accumulation for zero offload * Bug fixes. 
ZeRO Stage 1,2 and Offload all produce the same loss with gradient accumulation step of 2 * Import path fixes + conditional imports (#358) * use relative imports and add support for conditional op imports * formatting and llvm command check change * fix remaining absolute import * hide the isntalled ops var * fix unit tests Co-authored-by: Reza Yazdani <reyazda@microsoft.com> * Enable contiguous gradients for cpu_offload * Allocating CPU memory directly on CPU without transfering them from GPU (#360) * Allocating CPU memory directly on CPU without transfering them from GPU * formatting fixes * change gpt2 pretrain to have DeepSpeed adam (#361) Co-authored-by: Reza Yazdani <reyazda@microsoft.com> * Jekyll installation instructions (#351) * Generalize detection of ZeRO supported optimizers (#349) * Improve test for ZeRO supported optimizers * Rename test function * Format fixes * Add model tests that wraps client FusedAdam with fused fp16 optimizer * Format fixes * everything is working * fixing the cpu_adam API and add deepspeed_adam flag in config.py (#365) * fixing the cpu_adam API and add deepspeed_adam flag in config.py * run precommit * fixing adam copy fp16-param-add more compile flags for cpu_adam * run precommit * fix variance indexes * fix array-sizes * ZeRO-Offload passing model functionality tests (#366) * cpu_offload enables overlap_comm and contiguous_gradients Remove non-portable tensor.mul_() * Model functionality tests now passing * Move to perf tests folder * move adam_test * rename perf test * fixing adam copy fp16-param and add more compile flags for cpu_adam (#367) * fixing adam copy fp16-param-add more compile flags for cpu_adam * run precommit * fix variance indexes * fix array-sizes * move adam_test * rename perf test * Perf tests * BumpDSE * fixed a typo; this was fixed before but seems like it has been lost in the refactor (#364) * Move code quality tests to Azure-hosted agents. 
(#368) * add casting kernel * run precommit * revert changes * revert changes * ZeRO-Offload: Integration code fixes (#370) * Various correctness fixes * Format fixes * Update installation instructions (#362) * Update Sparse Attention Tutorial (#357) * adding BingSqaud e2e test * updating the draft test; bring final step under try section * finalizinf test for base deepspeed and deepspeed with ZeRO * applying the comment (thanks Jeff); fixed formatting * update Sparse Attention Tutorial * fixed few issues and applied comments for better organization and readability * updated sparse attention tutorial with making how to use section incremental; applying more comments Co-authored-by: arashashari <arashashari@ArashMSLaptop.redmond.corp.microsoft.com> * fixing corner cases (#371) * fix adam perormance (#372) * fixing corner cases * revert to the previous perf for adam * adam high performance * run precommit * ZeRO-Offload passing model tests (#374) * Add ZeRO-Offload model tests Restrict optimizer update+copy to DeepSpeedCPUAdam * Format fixes * Increate bucket size scaler * fix cpu adam compilation for AVX2 (#378) * fixing the compilation error for AVX2 architecture * running precommit * adding cpufeature to requirements * Update install.sh * Update install.sh * include cpu-adam in the features * update features * update features Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * Move code quality tests to Azure-hosted agents. 
(#368) * Bump DSE * adding sparse attention to feature index page (#377) * support avx2 by default (#383) * add DS_BUILD_AVX512 flag and update the feature part accordingly * run precommit Co-authored-by: Jie <37380896+jren73@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Arash Ashari <arashari@microsoft.com> Co-authored-by: RezaYazdaniAminabadi <44502768+RezaYazdaniAminabadi@users.noreply.github.com> Co-authored-by: Reza Yazdani <reyazda@microsoft.com> Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com> Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com> Co-authored-by: arashashari <arashashari@ArashMSLaptop.redmond.corp.microsoft.com>

* Integrate NVIDIA GPUDirect Storage into nvme library * 1) Remove debug prints 2) Create write file with random data 3) Delete target file before new writes * Workaround gds perf issue by leaking buffers * DGX2 mount/unmount utililties * Formatting * Add torch save/load * Add torch save/load * Remove gds * Add torch legacy save * Update to new cli * Add function signatures Add file_offset arg to read/write apis * Remove redundant asserts * Add DeepSpeedFileWriter * Add mock and python file writers * Format fixes * More perf counters * Fix pinned_offset bug; Show as not real python file object * Buffer copy speed * Add torch_fastio option * Format fixes * Measure torch_fastio perf * Force flush * Formatting * Renamings * Fix device bug * Disable torch.distributed requirement * Renaming * Integrate fast model checkpointing * Double I/O buffer optimization * Support larger sizes * Refactoring; save_storage api * Cast to byte tensor * Handle storage object saves * Remove mysterious import * Api to save storage object list; refactor stats * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation * Yangli2/fastio double buffer pytorch optimized (#291) * Double I/O buffer optimization * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * making deepspeed/runtime/fp16/loss_scaler/dynamiclossscale serializable * Dump fast_writer stats only on rank 0 * Configuration option for fused fp16 optimizer * Update to new API * Format fixes * Update to master (#340) * Integrate NVIDIA GPUDirect Storage into nvme library * 1) Remove debug prints 2) Create write file with random data 3) Delete target file before new writes * Workaround gds perf issue by leaking buffers * DGX2 mount/unmount utililties * 
Formatting * Add torch save/load * Add torch save/load * Remove gds * Add torch legacy save * Update to new cli * Add function signatures Add file_offset arg to read/write apis * Remove redundant asserts * Add DeepSpeedFileWriter * Add mock and python file writers * Format fixes * More perf counters * Fix pinned_offset bug; Show as not real python file object * Buffer copy speed * Add torch_fastio option * Format fixes * Measure torch_fastio perf * Force flush * Formatting * Renamings * Fix device bug * Disable torch.distributed requirement * Renaming * Integrate fast model checkpointing * Double I/O buffer optimization * Support larger sizes * Refactoring; save_storage api * Cast to byte tensor * Handle storage object saves * Remove mysterious import * Api to save storage object list; refactor stats * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation * Yangli2/fastio double buffer pytorch optimized (#291) * Double I/O buffer optimization * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * making deepspeed/runtime/fp16/loss_scaler/dynamiclossscale serializable * Dump fast_writer stats only on rank 0 * Configuration option for fused fp16 optimizer * Update to new API * Format fixes Co-authored-by: jerryyangli <jerryyangli@gmail.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * Support torch* optimization for version 1.12 * Formatting * Versioned torch* optimization * Versioned torch* optimizations (#341) * Integrate NVIDIA GPUDirect Storage into nvme library * 1) Remove debug prints 2) Create write file with random data 3) Delete target file before new writes * Workaround gds perf issue by leaking buffers * DGX2 mount/unmount utililties * Formatting * Add torch save/load * 
Add torch save/load * Remove gds * Add torch legacy save * Update to new cli * Add function signatures Add file_offset arg to read/write apis * Remove redundant asserts * Add DeepSpeedFileWriter * Add mock and python file writers * Format fixes * More perf counters * Fix pinned_offset bug; Show as not real python file object * Buffer copy speed * Add torch_fastio option * Format fixes * Measure torch_fastio perf * Force flush * Formatting * Renamings * Fix device bug * Disable torch.distributed requirement * Renaming * Integrate fast model checkpointing * Double I/O buffer optimization * Support larger sizes * Refactoring; save_storage api * Cast to byte tensor * Handle storage object saves * Remove mysterious import * Api to save storage object list; refactor stats * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation * Yangli2/fastio double buffer pytorch optimized (#291) * Double I/O buffer optimization * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * making deepspeed/runtime/fp16/loss_scaler/dynamiclossscale serializable * Dump fast_writer stats only on rank 0 * Configuration option for fused fp16 optimizer * Update to new API * Format fixes * Support torch* optimization for version 1.12 * Formatting * Versioned torch* optimization Co-authored-by: jerryyangli <jerryyangli@gmail.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * fp16 fused mode * fp16 fused mode (#342) * Integrate NVIDIA GPUDirect Storage into nvme library * 1) Remove debug prints 2) Create write file with random data 3) Delete target file before new writes * Workaround gds perf issue by leaking buffers * DGX2 mount/unmount utililties * Formatting * Add torch save/load * Add torch save/load * Remove gds * 
Add torch legacy save * Update to new cli * Add function signatures Add file_offset arg to read/write apis * Remove redundant asserts * Add DeepSpeedFileWriter * Add mock and python file writers * Format fixes * More perf counters * Fix pinned_offset bug; Show as not real python file object * Buffer copy speed * Add torch_fastio option * Format fixes * Measure torch_fastio perf * Force flush * Formatting * Renamings * Fix device bug * Disable torch.distributed requirement * Renaming * Integrate fast model checkpointing * Double I/O buffer optimization * Support larger sizes * Refactoring; save_storage api * Cast to byte tensor * Handle storage object saves * Remove mysterious import * Api to save storage object list; refactor stats * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation * Yangli2/fastio double buffer pytorch optimized (#291) * Double I/O buffer optimization * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * making deepspeed/runtime/fp16/loss_scaler/dynamiclossscale serializable * Dump fast_writer stats only on rank 0 * Configuration option for fused fp16 optimizer * Update to new API * Format fixes * Support torch* optimization for version 1.12 * Formatting * Versioned torch* optimization * fp16 fused mode Co-authored-by: jerryyangli <jerryyangli@gmail.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * Support serialization versions * Support serialization of different torch versions (#343) * Integrate NVIDIA GPUDirect Storage into nvme library * 1) Remove debug prints 2) Create write file with random data 3) Delete target file before new writes * Workaround gds perf issue by leaking buffers * DGX2 mount/unmount utililties * Formatting * Add torch save/load * 
Add torch save/load * Remove gds * Add torch legacy save * Update to new cli * Add function signatures Add file_offset arg to read/write apis * Remove redundant asserts * Add DeepSpeedFileWriter * Add mock and python file writers * Format fixes * More perf counters * Fix pinned_offset bug; Show as not real python file object * Buffer copy speed * Add torch_fastio option * Format fixes * Measure torch_fastio perf * Force flush * Formatting * Renamings * Fix device bug * Disable torch.distributed requirement * Renaming * Integrate fast model checkpointing * Double I/O buffer optimization * Support larger sizes * Refactoring; save_storage api * Cast to byte tensor * Handle storage object saves * Remove mysterious import * Api to save storage object list; refactor stats * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation * Yangli2/fastio double buffer pytorch optimized (#291) * Double I/O buffer optimization * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * making deepspeed/runtime/fp16/loss_scaler/dynamiclossscale serializable * Dump fast_writer stats only on rank 0 * Configuration option for fused fp16 optimizer * Update to new API * Format fixes * Support torch* optimization for version 1.12 * Formatting * Versioned torch* optimization * fp16 fused mode * Support serialization versions Co-authored-by: jerryyangli <jerryyangli@gmail.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * distributed ckpt draft (#349) * inject parallel write * Support serialization of different torch versions (#343) (#345) * Integrate NVIDIA GPUDirect Storage into nvme library * 1) Remove debug prints 2) Create write file with random data 3) Delete target file before new writes * Workaround gds 
perf issue by leaking buffers * DGX2 mount/unmount utililties * Formatting * Add torch save/load * Add torch save/load * Remove gds * Add torch legacy save * Update to new cli * Add function signatures Add file_offset arg to read/write apis * Remove redundant asserts * Add DeepSpeedFileWriter * Add mock and python file writers * Format fixes * More perf counters * Fix pinned_offset bug; Show as not real python file object * Buffer copy speed * Add torch_fastio option * Format fixes * Measure torch_fastio perf * Force flush * Formatting * Renamings * Fix device bug * Disable torch.distributed requirement * Renaming * Integrate fast model checkpointing * Double I/O buffer optimization * Support larger sizes * Refactoring; save_storage api * Cast to byte tensor * Handle storage object saves * Remove mysterious import * Api to save storage object list; refactor stats * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation * Yangli2/fastio double buffer pytorch optimized (#291) * Double I/O buffer optimization * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * making deepspeed/runtime/fp16/loss_scaler/dynamiclossscale serializable * Dump fast_writer stats only on rank 0 * Configuration option for fused fp16 optimizer * Update to new API * Format fixes * Support torch* optimization for version 1.12 * Formatting * Versioned torch* optimization * fp16 fused mode * Support serialization versions Co-authored-by: jerryyangli <jerryyangli@gmail.com> Co-authored-by: Yang Li <yangli2@microsoft.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: jerryyangli <jerryyangli@gmail.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * finish split distributed write * split 
based-on num_bytes * resolving single node python test * remove irrelavent prints * format Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: jerryyangli <jerryyangli@gmail.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * torch serialization options * Configurable torch serialization (#350) * Integrate NVIDIA GPUDirect Storage into nvme library * 1) Remove debug prints 2) Create write file with random data 3) Delete target file before new writes * Workaround gds perf issue by leaking buffers * DGX2 mount/unmount utililties * Formatting * Add torch save/load * Add torch save/load * Remove gds * Add torch legacy save * Update to new cli * Add function signatures Add file_offset arg to read/write apis * Remove redundant asserts * Add DeepSpeedFileWriter * Add mock and python file writers * Format fixes * More perf counters * Fix pinned_offset bug; Show as not real python file object * Buffer copy speed * Add torch_fastio option * Format fixes * Measure torch_fastio perf * Force flush * Formatting * Renamings * Fix device bug * Disable torch.distributed requirement * Renaming * Integrate fast model checkpointing * Double I/O buffer optimization * Support larger sizes * Refactoring; save_storage api * Cast to byte tensor * Handle storage object saves * Remove mysterious import * Api to save storage object list; refactor stats * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation * Yangli2/fastio double buffer pytorch optimized (#291) * Double I/O buffer optimization * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * making deepspeed/runtime/fp16/loss_scaler/dynamiclossscale serializable * Dump fast_writer stats only on rank 0 * Configuration option for fused 
fp16 optimizer * Update to new API * Format fixes * Support torch* optimization for version 1.12 * Formatting * Versioned torch* optimization * fp16 fused mode * Support serialization versions * torch serialization options Co-authored-by: jerryyangli <jerryyangli@gmail.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * Distributed writer slicing on byte boundary * Fix typo * FastFileWriter Config; Parallel writer nodes * Minor fix * remove warning from fast-io-ckpt (#354) * Relocate debug print * Parallel writing through byte boundary slicing (#351) * Integrate NVIDIA GPUDirect Storage into nvme library * 1) Remove debug prints 2) Create write file with random data 3) Delete target file before new writes * Workaround gds perf issue by leaking buffers * DGX2 mount/unmount utililties * Formatting * Add torch save/load * Add torch save/load * Remove gds * Add torch legacy save * Update to new cli * Add function signatures Add file_offset arg to read/write apis * Remove redundant asserts * Add DeepSpeedFileWriter * Add mock and python file writers * Format fixes * More perf counters * Fix pinned_offset bug; Show as not real python file object * Buffer copy speed * Add torch_fastio option * Format fixes * Measure torch_fastio perf * Force flush * Formatting * Renamings * Fix device bug * Disable torch.distributed requirement * Renaming * Integrate fast model checkpointing * Double I/O buffer optimization * Support larger sizes * Refactoring; save_storage api * Cast to byte tensor * Handle storage object saves * Remove mysterious import * Api to save storage object list; refactor stats * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation * Yangli2/fastio double buffer pytorch optimized (#291) * Double I/O buffer optimization * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation 
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * making deepspeed/runtime/fp16/loss_scaler/dynamiclossscale serializable * Dump fast_writer stats only on rank 0 * Configuration option for fused fp16 optimizer * Update to new API * Format fixes * Support torch* optimization for version 1.12 * Formatting * Versioned torch* optimization * fp16 fused mode * Support serialization versions * torch serialization options * Distributed writer slicing on byte boundary * Fix typo * FastFileWriter Config; Parallel writer nodes * Minor fix * remove warning from fast-io-ckpt (#354) * Relocate debug print Co-authored-by: jerryyangli <jerryyangli@gmail.com> Co-authored-by: Yang Li <yangli2@microsoft.com> Co-authored-by: Guanhua Wang <alexwgh333@gmail.com> * fix broken mock_file_writer (#357) * Report write speed * DP writing * DP MoE checkpoints Generalize DP dense checkpoints for socket/machine options * Various improvements (#376) * Integrate NVIDIA GPUDirect Storage into nvme library * 1) Remove debug prints 2) Create write file with random data 3) Delete target file before new writes * Workaround gds perf issue by leaking buffers * DGX2 mount/unmount utililties * Formatting * Add torch save/load * Add torch save/load * Remove gds * Add torch legacy save * Update to new cli * Add function signatures Add file_offset arg to read/write apis * Remove redundant asserts * Add DeepSpeedFileWriter * Add mock and python file writers * Format fixes * More perf counters * Fix pinned_offset bug; Show as not real python file object * Buffer copy speed * Add torch_fastio option * Format fixes * Measure torch_fastio perf * Force flush * Formatting * Renamings * Fix device bug * Disable torch.distributed requirement * Renaming * Integrate fast model checkpointing * Double I/O buffer optimization * Support larger sizes * Refactoring; save_storage api * Cast to byte tensor * Handle storage object saves * Remove mysterious import * Api to 
save storage object list; refactor stats * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation * Yangli2/fastio double buffer pytorch optimized (#291) * Double I/O buffer optimization * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * making deepspeed/runtime/fp16/loss_scaler/dynamiclossscale serializable * Dump fast_writer stats only on rank 0 * Configuration option for fused fp16 optimizer * Update to new API * Format fixes * Support torch* optimization for version 1.12 * Formatting * Versioned torch* optimization * fp16 fused mode * Support serialization versions * torch serialization options * Distributed writer slicing on byte boundary * Fix typo * FastFileWriter Config; Parallel writer nodes * Minor fix * remove warning from fast-io-ckpt (#354) * Relocate debug print * Report write speed * DP writing * DP MoE checkpoints Generalize DP dense checkpoints for socket/machine options Co-authored-by: jerryyangli <jerryyangli@gmail.com> Co-authored-by: Yang Li <yangli2@microsoft.com> Co-authored-by: Guanhua Wang <alexwgh333@gmail.com> * Decoupled checkpointing * New MP slicing algorithm * Format fixes * Decoupled checkpointing support (#384) * Integrate NVIDIA GPUDirect Storage into nvme library * 1) Remove debug prints 2) Create write file with random data 3) Delete target file before new writes * Workaround gds perf issue by leaking buffers * DGX2 mount/unmount utililties * Formatting * Add torch save/load * Add torch save/load * Remove gds * Add torch legacy save * Update to new cli * Add function signatures Add file_offset arg to read/write apis * Remove redundant asserts * Add DeepSpeedFileWriter * Add mock and python file writers * Format fixes * More perf counters * Fix 
pinned_offset bug; Show as not real python file object * Buffer copy speed * Add torch_fastio option * Format fixes * Measure torch_fastio perf * Force flush * Formatting * Renamings * Fix device bug * Disable torch.distributed requirement * Renaming * Integrate fast model checkpointing * Double I/O buffer optimization * Support larger sizes * Refactoring; save_storage api * Cast to byte tensor * Handle storage object saves * Remove mysterious import * Api to save storage object list; refactor stats * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation * Yangli2/fastio double buffer pytorch optimized (#291) * Double I/O buffer optimization * add pytorch optimization * fixed some syntax errors * comment out save_storage for mock * uncomment save storage for mock * fixed indentation Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Yang Li <yangli2@microsoft.com> * making deepspeed/runtime/fp16/loss_scaler/dynamiclossscale serializable * Dump fast_writer stats only on rank 0 * Configuration option for fused fp16 optimizer * Update to new API * Format fixes * Support torch* optimization for version 1.12 * Formatting * Versioned torch* optimization * fp16 fused mode * Support serialization versions * torch serialization options * Distributed writer slicing on byte boundary * Fix typo * FastFileWriter Config; Parallel writer nodes * Minor fix * remove warning from fast-io-ckpt (#354) * Relocate debug print * Report write speed * DP writing * DP MoE checkpoints Generalize DP dense checkpoints for socket/machine options * Decoupled checkpointing * New MP slicing algorithm * Format fixes Co-authored-by: jerryyangli <jerryyangli@gmail.com> Co-authored-by: Yang Li <yangli2@microsoft.com> Co-authored-by: Guanhua Wang <alexwgh333@gmail.com> * add io multiplier for larger scale simulation (#411) * add io multiplier config for simulation * remove prints and test 
correctness * format * Merge with master * Format fixes * Guanhua/fast io clean v5 (#435) * Add environment variable to make nvcc compilation more verbose (#2759) * Bing/formatting correction (#2764) * modify engine.py for formatting * commit formatting changes on engine.py * Add links to new azureML examples (#2756) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * Fix hardcoded instances to fp16 in optimizer creation log messages to the correct dtype. (#2743) * Remove hardcoded instances to fp16 in log messages. * Add model_dtype to print the correct format * Respond to PR feedback --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Refactor/Pydantify monitoring config (#2640) * pydantify monitoring configs --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> * Pin minimum `packaging` requirement (#2771) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * Fix for diffusers v0.12.0 (#2753) Co-authored-by: Jeff Rasley <jerasley@microsoft.com> * update copy right in aio * type fix in ds_py_aio_handle * update year in aio/py_test * fix description in util pybind * update and remove prints in fast_file_writer * remove del print * remove dist barrier in engine.py * update year in runtime/model_ckpt * add todo in runtime/model_ckpt/util.py * update year * reverse pip3 * update opbuilder * format * modify print for python * fix print capability * fix print * some fix in flops_profiler (#2068) * bugs in profiler: 1. Tensor.bmm missed in _patch_tensor_methods function 2. missed funtions in _reload_functionals and _reload_tensor_methods functions 3. torch.mm and torch.Tensor.mm will have same __name__ in wrapFunc, my suggustion is use __str__ instead. 
* formatting --------- Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Cheng Li <pistasable@gmail.com> * fix upsample flops compute by skipping unused kargs (#2773) * fix upsample flops compute by skipping unused kargs * fix format * format * Fix broken kernel inject bug (#2776) * format * remove zero change * fix engine issue --------- Co-authored-by: Connor Holmes <connorholmes@microsoft.com> Co-authored-by: Bing Xie <67908712+xiexbing@users.noreply.github.com> Co-authored-by: cassieesvelt <73311224+cassieesvelt@users.noreply.github.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: swli <47371259+lucasleesw@users.noreply.github.com> Co-authored-by: Cheng Li <pistasable@gmail.com> Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com> * Formatting * Formatting * Debug file delete slowdown * Investigate write perf * Investigate write perf * Fix mising args * Fix microbenchmark and unit tests (#450) * Debug file delete slowdown * Investigate write perf * Investigate write perf * Fix mising args * Formatting * Rebase attempts * updates for running with newest dependencies * Pydantic fixes * Rebase fixes * Fix rebase bugs * Add DS utils for tensor casting * Fomat fixes * Fix GDS * Update with io_engine API * Continued rebase * Integrate GDS into writer factory * Add --venv_script option * Formatting fix Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com> --------- Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com> Co-authored-by: jerryyangli <jerryyangli@gmail.com> Co-authored-by: Yang Li <yangli2@microsoft.com> Co-authored-by: Guanhua Wang <alexwgh333@gmail.com> Co-authored-by: Connor Holmes <connorholmes@microsoft.com> Co-authored-by: Bing Xie 
<67908712+xiexbing@users.noreply.github.com> Co-authored-by: cassieesvelt <73311224+cassieesvelt@users.noreply.github.com> Co-authored-by: Jeff Rasley <jerasley@microsoft.com> Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com> Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com> Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com> Co-authored-by: swli <47371259+lucasleesw@users.noreply.github.com> Co-authored-by: Cheng Li <pistasable@gmail.com> Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com> Co-authored-by: Ubuntu <jomayeri@microsoft.com>

arashashari and others added 7 commits May 15, 2020 19:05

adding BingSqaud e2e test

3ebab8f

updating the draft test; bring final step under try section

a637812

finalizinf test for base deepspeed and deepspeed with ZeRO

4212e3d

applying the comment (thanks Jeff); fixed formatting

5590a44

Merge pull request #1 from microsoft/master

a2984d0

Pulling upstream

Merge branch 'master' of https://github.com/microsoft/DeepSpeed into SA_tutorial_update

b7cf7f6

update Sparse Attention Tutorial

cc92cfa

arashashari requested review from RezaYazdaniAminabadi, ShadenSmith, awan-10, cli99, conglongli, eltonzheng, jeffra, minjiaz, niumanar, samyam and tjruwase as code owners September 4, 2020 00:10

arashashari and others added 2 commits September 4, 2020 11:32

Merge branch 'master' of https://github.com/microsoft/DeepSpeed into SA_tutorial_update

fed1634

fixed few issues and applied comments for better organization and readability

85bad45

ShadenSmith approved these changes Sep 4, 2020

arashashari added 2 commits September 4, 2020 17:58

updated sparse attention tutorial with making how to use section incremental; applying more comments

d7c5ff6

Merge branch 'master' of https://github.com/microsoft/DeepSpeed into SA_tutorial_update

79eaa1c

jeffra approved these changes Sep 6, 2020

jeffra merged commit 9dadf38 into master Sep 6, 2020

bobisapotato mentioned this pull request Jan 24, 2021

Another thing to merge. (MY EYES HURT) bobisai/DeepSpeed#1

Merged

4 participants
