[zero] faster flatten/unflatten (cpp version) #910

Merged
merged 10 commits into microsoft:master from stas00:faster-flatten
Apr 14, 2021

Conversation

stas00
Contributor

@stas00 stas00 commented Apr 1, 2021

(this PR has evolved and has been edited from its original version which proposed using apex's functions)

This PR switches to the fastest version of flatten/unflatten and uses it consistently across all ZeRO stages.

As discovered in later comments of this PR (#910 (comment)), UtilsBuilder().load().flatten is on par with apex_C.flatten speed-wise and 2-3x faster than torch._utils._flatten_dense_tensors (about the same holds for unflatten).

So this PR was modified to switch to UtilsBuilder().load().flatten and UtilsBuilder().load().unflatten

Let's first try loading it at import level, since that triggers the JIT build the first time. If it doesn't work we can fold it into each class's __init__, as zero2.py had it prior to this PR.
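
For illustration, the two options look roughly like this (a minimal sketch; the stage-3 class name is just a placeholder, only UtilsBuilder and its flatten/unflatten attributes are taken from the existing code):

# option 1: load at import time, module level - the JIT build is triggered here
# on first use if no prebuilt op is available
from deepspeed.ops.op_builder import UtilsBuilder

util_ops = UtilsBuilder().load()
flatten = util_ops.flatten
unflatten = util_ops.unflatten

# option 2: fold it into each class's __init__, as zero2.py did prior to this PR
class Stage3Optimizer:  # placeholder name for illustration
    def __init__(self):
        util_ops = UtilsBuilder().load()
        self.flatten = util_ops.flatten
        self.unflatten = util_ops.unflatten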

Fixes: #877

@jeffra

@jeffra
Contributor

jeffra commented Apr 1, 2021

Good point @stas00, and curious about @samyam's thoughts here too, but we do have apex's (un)flatten ops ported into our JIT-compatible ops. We use them in zero-2, take a look here:

# Load pre-installed or JIT compile (un)flatten ops
util_ops = UtilsBuilder().load()
self.flatten = util_ops.flatten
self.unflatten = util_ops.unflatten

We might be able to use these in z3 instead so we don't have to depend on apex.
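
A minimal sketch of what that swap might look like (hedged - the exact import sites in stage3.py may differ; the apex_C import shown is the dependency being replaced):

# before: depends on apex being built with --cpp_ext
# from apex_C import flatten, unflatten

# after: use DeepSpeed's own JIT-compatible op instead
from deepspeed.ops.op_builder import UtilsBuilder

util_ops = UtilsBuilder().load()
flatten = util_ops.flatten      # flatten(tensor_list) -> flat 1-D tensor
unflatten = util_ops.unflatten  # unflatten(flat, tensor_list) -> list of tensors shaped like the originals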

@stas00
Contributor Author

stas00 commented Apr 1, 2021

Oh! Then why doesn't zero3 use those? Not fully optimized yet?

Yes, apex is being phased out, so it's definitely a good idea to either ask pytorch to port this into the core or to copy it over.

I haven't done the benchmarks, so I suppose the first question is whether someone has compared pytorch vs. apex vs. your version.

@jeffra
Contributor

jeffra commented Apr 1, 2021

We should be able to use it as a drop-in replacement, I suspect. The (un)flatten code in our op is taken directly from apex, which in turn was taken from pytorch, haha. Take a look, it's not doing anything fancy:

https://github.com/microsoft/DeepSpeed/blob/master/csrc/utils/flatten_unflatten.cpp

@stas00
Contributor Author

stas00 commented Apr 1, 2021

This is totally weird then because they don't use the c++ implementation in the core:
https://github.com/pytorch/pytorch/blob/33b95c6bac5cf9c52fa646bf7335664b0556263d/torch/_utils.py#L248

@stas00
Contributor Author

stas00 commented Apr 1, 2021

Also, there is a related issue: #877. If this apex code is removed, it'll close that issue as well.

@jeffra
Contributor

jeffra commented Apr 1, 2021

This is totally weird then because they don't use the c++ implementation in the core:
https://github.com/pytorch/pytorch/blob/33b95c6bac5cf9c52fa646bf7335664b0556263d/torch/_utils.py#L248

haha yep, I've also dug into this and don't understand why torch doesn't just use these ops directly. Seems odd...

I believe @RezaYazdaniAminabadi did some small benchmarking at one point between the torch (un)flatten functions vs the cpp op and the cpp op performed significantly better. But I think we sometimes forget to use it in practice.

@stas00
Contributor Author

stas00 commented Apr 1, 2021

It looks like the code is identical. C++:

https://github.com/pytorch/pytorch/blob/547346d66350b4ac325941752f53fa92f756f6a9/torch/csrc/utils/tensor_flatten.h#L10

inline at::Tensor flatten_dense_tensors(at::TensorList tensors) {
  static auto flatten = [](const at::Tensor &t) { return t.contiguous().view({-1}); };
  if (tensors.size() == 1)
    return flatten(tensors[0]);
  return at::cat(fmap(tensors, flatten));
}

Python:

https://github.com/pytorch/pytorch/blob/33b95c6bac5cf9c52fa646bf7335664b0556263d/torch/_utils.py#L248

def _flatten_dense_tensors(tensors):
    if len(tensors) == 1:
        return tensors[0].contiguous().view(-1)
    flat = torch.cat([t.contiguous().view(-1) for t in tensors], dim=0)
    return flat

I wonder why it'd be faster, other than bypassing the Python machinery for the intermediate results.
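
One plausible explanation is call overhead rather than the math itself: the Python version makes several Python-level dispatches per tensor, while the cpp op crosses into C++ once per call. A rough sketch of the difference (util_ops loaded the same way as in the benchmark below):

import torch
from torch._utils import _flatten_dense_tensors
from deepspeed.ops.op_builder import UtilsBuilder

util_ops = UtilsBuilder().load()
tensors = [torch.rand(512, 512) for _ in range(90)]

# python: one .contiguous() + one .view(-1) per tensor, a list build, then cat -
# every one of those is a separate Python-level dispatch
flat_py = _flatten_dense_tensors(tensors)

# cpp: a single extension call; the same contiguous/view/cat loop runs in C++
flat_cpp = util_ops.flatten(tensors)

assert torch.eq(flat_py, flat_cpp).all()

The cProfile runs below are consistent with this: ~90000 contiguous/view calls for the Python version vs 1000 total calls for the cpp op over 1000 iterations.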

@stas00
Contributor Author

stas00 commented Apr 1, 2021

I believe @RezaYazdaniAminabadi did some small benchmarking at one point between the torch (un)flatten functions vs the cpp op and the cpp op performed significantly better. But I think we sometimes forget to use it in practice.

@RezaYazdaniAminabadi, do you by chance have the benchmark still? If it's faster we should notify the pytorch devs to rectify this omission - perhaps this function pair is not widely used.

@stas00
Contributor Author

stas00 commented Apr 1, 2021

I did some flatten benchmarking and the results are in:

I compared:

  1. torch._utils._flatten_dense_tensors
  2. UtilsBuilder().load().flatten
  3. apex_C.flatten

Results:

  1. UtilsBuilder().load().flatten and apex_C.flatten are on par
  2. torch._utils._flatten_dense_tensors is 2 times slower than the first 2

For unflatten it's about the same, with the pytorch version about 2-3x slower than the other two.

This was on an RTX-3090, with prebuilt deepspeed.

I verified with cProfile, line_profiler and timeit - all give the same results.

Setup:

pip install line_profiler

Benchmark for flatten

#!/usr/bin/env python

import argparse

import gc

import torch
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors
from deepspeed.ops.op_builder import UtilsBuilder

from apex_C import flatten as flatten_apex

util_ops = UtilsBuilder().load()
flatten = util_ops.flatten
unflatten = util_ops.unflatten

torch.manual_seed(0)
# emulate a small typical model weights
x = [torch.rand((512,512)).cuda(), torch.rand((512,1024)).cuda(), torch.rand((512,30000)).cuda()]
t = x * 30

# warm up and check that the same output is produced
flat_py = _flatten_dense_tensors(t)
flat_cpp = flatten(t)
flat_apex = flatten_apex(t)
#numel = flat_cpp.numel()
assert torch.eq(flat_py, flat_cpp).all(), "both produce the same tensor"
assert torch.eq(flat_py, flat_apex).all(), "both produce the same tensor"

TIMES = 1000

# the programs being tested
def py():
    for i in range(TIMES):
        flat = _flatten_dense_tensors(t)

def cpp():
    for i in range(TIMES):
        flat = flatten(t)

def apex():
    for i in range(TIMES):
        flat = flatten_apex(t)

#### cProfile ####

import cProfile

def cprofileme():
    print("--------------- cProfile -----------------")
    print("py")
    cProfile.run("py()", sort=-1)
    gc.collect(); torch.cuda.empty_cache()
    print("cpp")
    cProfile.run("cpp()", sort=-1)
    gc.collect(); torch.cuda.empty_cache()
    print("apex")
    cProfile.run("apex()", sort=-1)
    gc.collect(); torch.cuda.empty_cache()

#### timeit ####

import timeit

def timeme():
    print("--------------- timeit -----------------")
    print(f'py  ={timeit.Timer("py()", globals=globals()).timeit(number=1)}')
    gc.collect(); torch.cuda.empty_cache()
    print(f'cpp ={timeit.Timer("cpp()", globals=globals()).timeit(number=1)}')
    gc.collect(); torch.cuda.empty_cache()
    print(f'apex={timeit.Timer("apex()", globals=globals()).timeit(number=1)}')
    gc.collect(); torch.cuda.empty_cache()

#### line_profiler ####
# this one requires a special way to be called
# pip install line_profiler
# kernprof -l flatten_bench.py -l; python -m line_profiler  flatten_bench.py.lprof

def line_profileme():
    print("--------------- line_profier -----------------")
    print("py")
    profile(py)()
    gc.collect(); torch.cuda.empty_cache()
    print("cpp")
    profile(cpp)()
    gc.collect(); torch.cuda.empty_cache()
    print("apex")
    profile(apex)()
    gc.collect(); torch.cuda.empty_cache()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-l", action='store_true')
    parser.add_argument("-c", action='store_true')
    parser.add_argument("-t", action='store_true')
    args = parser.parse_args()
    if args.l:
        line_profileme()
    elif args.c:
        cprofileme()
    elif args.t:
        timeme()

It looks like if I mix cProfile with timeit, the latter gets invalid results, so each profiler has to be run separately.

Also, before I added empty_cache the results were quite invalid - whichever ran first was getting a 32x faster outcome!

Output:

(main-38) ✘ /hf/deepspeed> ./flatten_bench.py -t
--------------- timeit -----------------
py  =0.1116456389427185
cpp =0.06466162856668234
apex=0.06333659961819649
(main-38) /hf/deepspeed> ./flatten_bench.py -c
--------------- cProfile -----------------
py
         184004 function calls in 0.129 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.129    0.129 <string>:1(<module>)
     1000    0.009    0.000    0.128    0.000 _utils.py:248(_flatten_dense_tensors)
     1000    0.017    0.000    0.094    0.000 _utils.py:264(<listcomp>)
        1    0.001    0.001    0.129    0.129 flatten_bench.py:33(py)
        1    0.000    0.000    0.129    0.129 {built-in method builtins.exec}
     1000    0.000    0.000    0.000    0.000 {built-in method builtins.len}
     1000    0.024    0.000    0.024    0.000 {built-in method cat}
    90000    0.012    0.000    0.012    0.000 {method 'contiguous' of 'torch._C._TensorBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    90000    0.065    0.000    0.065    0.000 {method 'view' of 'torch._C._TensorBase' objects}


cpp
         1004 function calls in 0.063 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.063    0.063 <string>:1(<module>)
        1    0.001    0.001    0.063    0.063 flatten_bench.py:37(cpp)
        1    0.000    0.000    0.063    0.063 {built-in method builtins.exec}
     1000    0.062    0.000    0.062    0.000 {built-in method deepspeed.ops.utils_op.flatten}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


apex
         1004 function calls in 0.072 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.072    0.072 <string>:1(<module>)
        1    0.001    0.001    0.072    0.072 flatten_bench.py:41(apex)
     1000    0.071    0.000    0.071    0.000 {built-in method apex_C.flatten}
        1    0.000    0.000    0.072    0.072 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Benchmark for unflatten

#!/usr/bin/env python

import argparse
import gc
import torch
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors
from deepspeed.ops.op_builder import UtilsBuilder

from apex_C import flatten as flatten_apex
from apex_C import unflatten as unflatten_apex

util_ops = UtilsBuilder().load()
flatten = util_ops.flatten
unflatten = util_ops.unflatten

torch.manual_seed(0)
# emulate a small typical model weights
x = [torch.rand((512,512)).cuda(), torch.rand((512,1024)).cuda(), torch.rand((512,30000)).cuda()]
unflat_t = x * 30

# warm up and check that the same output is produced
flat_py = _flatten_dense_tensors(unflat_t)
flat_cpp = flatten(unflat_t)
flat_apex = flatten_apex(unflat_t)
#numel = flat_cpp.numel()
assert torch.eq(flat_py, flat_cpp).all(), "both produce the same tensor"
assert torch.eq(flat_py, flat_apex).all(), "both produce the same tensor"

flat_t = flat_py
unflat_py = _unflatten_dense_tensors(flat_py, unflat_t)
for i in range(len(unflat_t)): assert torch.eq(unflat_t[i], unflat_py[i]).all()
unflat_cpp = unflatten(flat_cpp, unflat_t)
for i in range(len(unflat_t)): assert torch.eq(unflat_t[i], unflat_cpp[i]).all()
unflat_apex = unflatten_apex(flat_apex, unflat_t)
for i in range(len(unflat_t)): assert torch.eq(unflat_t[i], unflat_apex[i]).all()

# the programs being tested
def py():
    for i in range(1000):
        unflat = _unflatten_dense_tensors(flat_t, unflat_t)

def cpp():
    for i in range(1000):
        unflat = unflatten(flat_t, unflat_t)

def apex():
    for i in range(1000):
        unflat = unflatten_apex(flat_t, unflat_t)


#### cProfile ####

import cProfile

def cprofileme():
    print("--------------- cProfile -----------------")
    print("py")
    cProfile.run("py()", sort=-1)
    gc.collect(); torch.cuda.empty_cache()
    print("cpp")
    cProfile.run("cpp()", sort=-1)
    gc.collect(); torch.cuda.empty_cache()
    print("apex")
    cProfile.run("apex()", sort=-1)
    gc.collect(); torch.cuda.empty_cache()

#### timeit ####

import timeit

def timeme():
    print("--------------- timeit -----------------")
    print(f'py  ={timeit.Timer("py()", globals=globals()).timeit(number=1)}')
    gc.collect(); torch.cuda.empty_cache()
    print(f'cpp ={timeit.Timer("cpp()", globals=globals()).timeit(number=1)}')
    gc.collect(); torch.cuda.empty_cache()
    print(f'apex={timeit.Timer("apex()", globals=globals()).timeit(number=1)}')
    gc.collect(); torch.cuda.empty_cache()

#### line_profiler ####
# this one requires a special way to be called
# pip install line_profiler
# kernprof -l unflatten_bench.py -l; python -m line_profiler unflatten_bench.py.lprof

def line_profileme():
    print("--------------- line_profier -----------------")
    print("py")
    profile(py)()
    gc.collect(); torch.cuda.empty_cache()
    print("cpp")
    profile(cpp)()
    gc.collect(); torch.cuda.empty_cache()
    print("apex")
    profile(apex)()
    gc.collect(); torch.cuda.empty_cache()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-l", action='store_true')
    parser.add_argument("-c", action='store_true')
    parser.add_argument("-t", action='store_true')
    args = parser.parse_args()
    if args.l:
        line_profileme()
    elif args.c:
        cprofileme()
    elif args.t:
        timeme()

Sample output:

(main-38) /hf/deepspeed> ./unflatten_bench.py -c
--------------- cProfile -----------------
py
         361004 function calls in 0.255 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.255    0.255 <string>:1(<module>)
     1000    0.063    0.000    0.245    0.000 _utils.py:284(_unflatten_dense_tensors)
        1    0.009    0.009    0.254    0.254 unflatten_bench.py:38(py)
        1    0.000    0.000    0.255    0.255 {built-in method builtins.exec}
    90000    0.005    0.000    0.005    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    90000    0.094    0.000    0.094    0.000 {method 'narrow' of 'torch._C._TensorBase' objects}
    90000    0.011    0.000    0.011    0.000 {method 'numel' of 'torch._C._TensorBase' objects}
    90000    0.072    0.000    0.072    0.000 {method 'view_as' of 'torch._C._TensorBase' objects}


cpp
         1004 function calls in 0.082 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.082    0.082 <string>:1(<module>)
        1    0.009    0.009    0.082    0.082 unflatten_bench.py:42(cpp)
        1    0.000    0.000    0.082    0.082 {built-in method builtins.exec}
     1000    0.073    0.000    0.073    0.000 {built-in method deepspeed.ops.utils_op.unflatten}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


apex
         1004 function calls in 0.081 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.081    0.081 <string>:1(<module>)
        1    0.009    0.009    0.081    0.081 unflatten_bench.py:46(apex)
     1000    0.073    0.000    0.073    0.000 {built-in method apex_C.unflatten}
        1    0.000    0.000    0.081    0.081 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

kernprof -l unflatten_bench.py -l; python -m line_profiler  unflatten_bench.py.lprof
--------------- line_profier -----------------
py
cpp
apex
Wrote profile results to unflatten_bench.py.lprof
Timer unit: 1e-06 s

Total time: 0.254009 s
File: unflatten_bench.py
Function: py at line 38

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    38                                           def py():
    39      1001        356.0      0.4      0.1      for i in range(1000):
    40      1000     253653.0    253.7     99.9          unflat = _unflatten_dense_tensors(flat_t, unflat_t)

Total time: 0.088684 s
File: unflatten_bench.py
Function: cpp at line 42

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    42                                           def cpp():
    43      1001        304.0      0.3      0.3      for i in range(1000):
    44      1000      88380.0     88.4     99.7          unflat = unflatten(flat_t, unflat_t)

Total time: 0.087492 s
File: unflatten_bench.py
Function: apex at line 46

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    46                                           def apex():
    47      1001        334.0      0.3      0.4      for i in range(1000):
    48      1000      87158.0     87.2     99.6          unflat = unflatten_apex(flat_t, unflat_t)

@stas00 stas00 changed the title from "[zero] faster flatten/unflatten with apex" to "[zero] faster flatten/unflatten (cpp version)" on Apr 2, 2021
@stas00
Contributor Author

stas00 commented Apr 2, 2021

Proposed to pytorch that they switch to the cpp version (pytorch/pytorch#55240), so down the road we won't need the workaround.

@stas00
Contributor Author

stas00 commented Apr 3, 2021

It looks good now. As I had to turn some functions into methods to make this work, I'm not sure whether I placed them in the most intuitive spot among the other existing methods - please feel free to move them or tell me where to move them. Either way works.

@tjruwase tjruwase merged commit 8b8ed2a into microsoft:master Apr 14, 2021
@stas00 stas00 deleted the faster-flatten branch April 14, 2021 19:40
sdtblck added a commit to EleutherAI/DeeperSpeed that referenced this pull request Apr 22, 2021
* test sparse self_attn fix

* [WarmupDecayLR] fix log(0) & 1/log(1) bugs (microsoft#772)

* fix log(0) & 1/log(1) bugs

* simplify

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>

* bump to v0.3.12

* Bug fix: Remove client optimizer param_group list item that does not have 'params' (microsoft#827)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [doc] pipeline doc typos/improvements (microsoft#659)

Admin merging for pure-doc PR that does not trigger build.

* Samyamr/inference hook fix (microsoft#851)

* Fix mis-aligned-grad

When a parameter is not divisible by world size, the partitioned gradients are mis-aligned due to incorrect padding handling. This PR should fix for that.

* Formatting fix

* Adding static_scale test back for Z3, and also changing hidden size to be not divisile by world_size

* also removing alignment from flat fp16 buffers

* Testing for hidden dim alignment

* inference hook fix

* Update stage3.py

* formatting

* [bug-fix] move params to gpu if offload params is turned off

Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* ZeRO Stage 2: Clear reduced gradients (microsoft#856)

* Ensure gradients of other partitions are cleared after reduction

* Remove redundant code

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [runner/launch] propagate the error (microsoft#854)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* docs: minor spelling tweaks (microsoft#858)

* Allow args to be optional in deepspeed.initialize (microsoft#825)

* Fix ZeRO3 save_checkpoint (microsoft#857)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Make config objects json serializable (microsoft#862)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* bump version 0.3.13

* 1-bit Adam v2 (microsoft#817)

Authors: @awan-10 @conglongli @samyam @jeffra

What's new:

NCCL-based implementation which provides better performance and usability compared to the MPI-based implementation.
Add support to momentum masks for those parameters with constant zero gradients during training.
Bug fixes (e.g., microsoft#813).

* NCCL-based 1-bit Adam + Code Refactor for Comm. Backends (microsoft#594)

* NCCL based 1-bit Implementation + Refactor to add communication backends (microsoft#593)

* add nccl 1-bit optim.

* temporary commit to save stuff.

* Use dist collectives instead of mpi routines.

* remove old code for comm.

* Fix bugs. still does not work.

* modify to test the nccl side code path

* Initial gather impl. Works intra-node.

* Updates to comm. phase 2. nccl comm. passed the tests.

* refactor code to introduce nccl/mpi as backends for onebit adam.

* Refactor updates to test/engine.

* Fix compile/runtime errors.

* simplify support for nccl/mpi backends.

* Add missign file

* Add compression backend in constructor. Revert later.

* modify test with some perf counting.

* Implement a true non-blocking gather for nccl side.

* Revert "Add compression backend in constructor. Revert later."

This reverts commit df8c40d.

* improve the 1-bit adam test.

* Refactor comm. and compression backend in 1-bit adam.

* Fix the test.

* Fix runtime errors and typos in nccl backend

* fix mpi backend. modify tests.

* modify nccl perf test.

* fix mpi side errors.

* Add an mpi perf test

* Sync DSE.

* Remove old collectives file.

* Undo a typo.

* Graceful failure for torch versions that don't support nccl pt2pt.

* Revert "Merge branch 'master' into staging-1bit-nccl-v2"

This reverts commit 7840085, reversing
changes made to a6dba72.

* Revert "Revert "Merge branch 'master' into staging-1bit-nccl-v2""

This reverts commit 6dbdd98.

* comm optimization + 1-bit lamb

* Saving/debugging commit.

* finalizing 1-bit lamb

* finalizing 1-bit lamb

* add momentum mask and chkpt handling for 1-bit adam

* Cleanup and modify nccl test to be runnable with deepspeed launcher.

* Fix format.

* fix formatting again.

* make test runnable without mpi4py

* Add dist.alltoall and dist.allgather instead of custom functions.

* remove debug prints.

* formatting and renaming

* renaming

* renaming

* add unit test, fix existing tests

* skip unit test when torch < 1.8

* revert 1-bit lamb

* flatten momentum when dimension is more than 1

* add warning message for 1-bit adam under fp32

* improve version check

* add fp32 test

* 1-bit adam doc

* fix file name

* doc fix

* torch 1.8 is released

* doc fix

* fix tests

* update news

* add doc for momentum mask

* fix checkpoing handling, add unit test

* checkpoint handling doc

* doc final cleanup

* bump dates

* update tests

* url change

* doc fix

* fix test

* doc update

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* consistent checkpoint filenaming (microsoft#865)

* consistent checkpoint filenaming

* backward compatible rename

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* [doc] launcher (microsoft#868)

As discussed in microsoft#662 this PR modifies the doc:
* explains what to use instead of CUDA_VISIBLE_DEVICES
* puts the `--hostfile` cl arg in the correct place in the invocation script

Fixes: microsoft#662

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [doc] pipeline (microsoft#888)

* [doc] pipeline

As @g-karthik flagged in microsoft#659 (comment) my previous correction PR had one sentence that said the wrong thing. So this PR attempts to rectify that. 

Thank you!

* tweak

* [debug utils] see_memory_usage fixes (microsoft#890)

* see_memory_usage fixes

* didn't expect pt-1.2

* fix the order of things

* fix the order of things

* full fp32 weights reconstruction for zero 2+3 (microsoft#892)

* save_fp16_model consolidated for zero3 (microsoft#893)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Fix zero stage2 cpu_offload when some model trainable parameters skipped in training (microsoft#861)

* Fix zero stage2 cpu_offload when some model trainable parameters skipped in training, as in microsoft#707

As some model trainable parameters skipped in training,
their backward hooks in self.create_reduce_and_remove_grad_hooks() will not run, 
so they have no norm_for_param_grads

* Trim space

* Trim space

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* mlperf attn initial commit

* update kramdown (microsoft#901)

security alert related to older kramdown version

* update backward api doc (microsoft#903)

* Bump kramdown from 2.3.0 to 2.3.1 in /docs (microsoft#905)

Bumps [kramdown](https://github.com/gettalong/kramdown) from 2.3.0 to 2.3.1.
- [Release notes](https://github.com/gettalong/kramdown/releases)
- [Changelog](https://github.com/gettalong/kramdown/blob/master/doc/news.page)
- [Commits](https://github.com/gettalong/kramdown/commits)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* We're hiring! + integration posts

* [website] We're hiring! + integration posts

* [website] we're hiring!

* zero.Init() clarification (microsoft#880)

* zero.Init() clarification

clarify that if `model.half()` can't fit into gpu memory `zero.Init()` is a must.

this proposal is via @samyam's clarification shared elsewhere.

Thank you.

* style

* add clarity

* style

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* disable pipe test (microsoft#915)

This test has been giving us trouble for a bit, seeing nondeterministic failures, skipping for now to not break out CI. Need to revisit soon though.

* Add link to AML examples. (microsoft#916)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* add inference_batch fn

* Add space in help string (microsoft#926)

* Fix for fragmented linear inputs in ZeRO 3 Linear layers where reshap… (microsoft#881)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [zero3] GatheredParameters can now handle a list of params (microsoft#884)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* fix cpu_adam memory leak on deepspeed re-use in the same process (microsoft#896)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [benchmarks] flatten/unflatten benchmarks (microsoft#919)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* improved readability + typos (microsoft#895)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [zero doc] fix misspelled param (microsoft#878)

We really really really need those params to be validated...

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Samyamr/stage 3 skip modules without parameters (microsoft#867)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* docs (microsoft#909)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Supporting different hidden dimensions for transformer kernels-v2 (microsoft#934)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Pull changes from DeepSpeed

* Pull changes from DeepSpeed

* Pull changes from DeepSpeed

* Pull changes from DeepSpeed

* Pull changes from DeepSpeed

* Pull changes from DeepSpeed

* cleanup, reinstantiate sending of logits / layer_past

* cleanup, reinstantiate sending of logits / layer_past

* bump to 0.3.14

* add pypi badge

* Delete check of pdsh (microsoft#941)

* fix double linear override; spelling (microsoft#954)

* [config] turn exponential notation back on for config dump (microsoft#955)

* e-notation for large floats

* handle ints too

* readability

* handle bool

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* document how to override ~/.cache/torch_extensions (microsoft#959)

* [zero] faster flatten/unflatten (cpp version)  (microsoft#910)

* faster flatten/unflatten with apex

* switch to cpp flatten/unflatten

* style

* better comment

* missing import

* switch to build ops at run time

* fixes

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* update lr scheduler doc for doing per step or epoch update (microsoft#913)

* update lr scheduler doc for doing per step or epoch update

* work

* trigger build

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Fix ZeRO-3 UnboundLocalError (microsoft#968)

* Fix UnboundLocalError

* Get full partition size

* ZeRO-Infinity (microsoft#976)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>

* revert zero-inf change to launcher

* [docs] zero-inf updates

* bump to 0.3.15

* ZeRO-Infinity tutorial additions (microsoft#978)

* zinf tutorial

* more megatron integration docs

* [docs] add ZeRO-Inf news items

* refactor

* ZeRO-Infinity docs (microsoft#979)

* zinf tutorial

* more megatron integration docs

* ZInf + tiling docs

* [docs] zero-inf updates

* assert no Z2/Z3 with pipeline and fix some docs links (microsoft#980)

* add option to force multi-node launcher mode (microsoft#977)

* [ZeRO Infinity] Allow Init to take a dict for the deepspeed config  (microsoft#983)

* Add check to see if json file is already loaded

* Update doc

* Address review

* Remove doc comment

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* make bold+italic work without escaping _ (microsoft#775)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* remove debug prints: (microsoft#986)

* 1-bit LAMB optimizer (microsoft#970)

1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed.
Author: @conglongli, @awan-10, @samyam, Hanlin Tang, Yuxiong He
Paper: https://arxiv.org/abs/2104.06069

Co-authored-by: sdtblck <46172032+sdtblck@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Use odd shape tensor to represent parameter data in partitioned state (microsoft#981)

* use wierd shaped tensor to avoid silent failures when not registering externel params

* fix typo

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Make reduce scatter optional for ZeRO-1 as workaround (microsoft#971)

* Make reduce scatter optional for ZeRO-1 as workaround

* Make allreduce default for ZeRO 1

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Fix all Pipeline Module Parameters being sent to cuda:0 (microsoft#687)

* remove communicate overflow (already in utils.CheckOverflow)

Co-authored-by: sid <sidney.black@aleph-alpha.de>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: brett koonce <koonce@gmail.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: hamlet <gvvvv@163.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Takuya Makino <takuyamakino15@gmail.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Sean Naren <sean@grid.ai>
@stas00
Contributor Author

stas00 commented May 13, 2021

FYI, pytorch has now replaced the slower Python version with the cpp version: pytorch/pytorch#58006
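
A quick way to confirm what a given torch install does is to print the source of the helper (just a sanity check, using nothing beyond the standard inspect module):

import inspect
from torch._utils import _flatten_dense_tensors

# on versions that include pytorch/pytorch#58006 this should show a thin wrapper
# around the C++ op instead of the pure-python contiguous/view/cat loop
print(inspect.getsource(_flatten_dense_tensors))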
