[zero] faster flatten/unflatten (cpp version) #910

Merged
merged 10 commits into microsoft:master from stas00:faster-flatten
Apr 14, 2021

Conversation

stas00
Contributor

@stas00 stas00 commented Apr 1, 2021

(this PR has evolved and has been edited from its original version which proposed using apex's functions)

This PR switches to the fastest version of flatten/unflatten and uses it consistently across all ZeRO stages.

As discovered in later comments of this PR (#910 (comment)), UtilsBuilder().load().flatten is on par with apex_C.flatten speed-wise and 2-3x faster than torch._utils._flatten_dense_tensors (about the same holds for unflatten).

So this PR was modified to switch to UtilsBuilder().load().flatten and UtilsBuilder().load().unflatten

Let's first try loading it at import level, since that triggers the JIT build the first time. If it doesn't work we can fold it into each class's __init__, as zero2.py had it prior to this PR.
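
For illustration, the two options look roughly like this (a minimal sketch; the stage-3 class name is just a placeholder, only UtilsBuilder and its flatten/unflatten attributes are taken from the existing code):

# option 1: load at import time, module level - the JIT build is triggered here
# on first use if no prebuilt op is available
from deepspeed.ops.op_builder import UtilsBuilder

util_ops = UtilsBuilder().load()
flatten = util_ops.flatten
unflatten = util_ops.unflatten

# option 2: fold it into each class's __init__, as zero2.py did prior to this PR
class Stage3Optimizer:  # placeholder name for illustration
    def __init__(self):
        util_ops = UtilsBuilder().load()
        self.flatten = util_ops.flatten
        self.unflatten = util_ops.unflatten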

Fixes: #877

@jeffra

@jeffra
Contributor

jeffra commented Apr 1, 2021

Good point @stas00, and curious about @samyam's thoughts here too, but we do have apex's (un)flatten ops ported into our JIT-compatible ops. We use them in zero-2, take a look here:

# Load pre-installed or JIT compile (un)flatten ops
util_ops = UtilsBuilder().load()
self.flatten = util_ops.flatten
self.unflatten = util_ops.unflatten

We might be able to use these in z3 instead so we don't have to depend on apex.
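
A minimal sketch of what that swap might look like (hedged - the exact import sites in stage3.py may differ; the apex_C import shown is the dependency being replaced):

# before: depends on apex being built with --cpp_ext
# from apex_C import flatten, unflatten

# after: use DeepSpeed's own JIT-compatible op instead
from deepspeed.ops.op_builder import UtilsBuilder

util_ops = UtilsBuilder().load()
flatten = util_ops.flatten      # flatten(tensor_list) -> flat 1-D tensor
unflatten = util_ops.unflatten  # unflatten(flat, tensor_list) -> list of tensors shaped like the originals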

@stas00
Contributor Author

stas00 commented Apr 1, 2021

Oh! Then why doesn't zero3 use those? Not fully optimized yet?

Yes, apex is being phased out, so it's definitely a good idea to either ask pytorch to port this into the core or to copy it over.

I haven't done the benchmarks, so I suppose the first question is whether someone has compared pytorch vs. apex vs. your version.

@jeffra
Contributor

jeffra commented Apr 1, 2021

We should be able to use it as a drop-in replacement, I suspect. The (un)flatten code in our op is taken directly from apex, which in turn was taken from pytorch, haha. Take a look, it's not doing anything fancy:

https://github.com/microsoft/DeepSpeed/blob/master/csrc/utils/flatten_unflatten.cpp

@stas00
Contributor Author

stas00 commented Apr 1, 2021

This is totally weird then because they don't use the c++ implementation in the core:
https://github.com/pytorch/pytorch/blob/33b95c6bac5cf9c52fa646bf7335664b0556263d/torch/_utils.py#L248

@stas00
Contributor Author

stas00 commented Apr 1, 2021

Also, there is a related issue: #877. If this apex code is removed, it'll close that issue as well.

@jeffra
Contributor

jeffra commented Apr 1, 2021

This is totally weird then because they don't use the c++ implementation in the core:
https://github.com/pytorch/pytorch/blob/33b95c6bac5cf9c52fa646bf7335664b0556263d/torch/_utils.py#L248

haha yep, I've also dug into this and don't understand why torch doesn't just use these ops directly. Seems odd...

I believe @RezaYazdaniAminabadi did some small benchmarking at one point between the torch (un)flatten functions vs the cpp op and the cpp op performed significantly better. But I think we sometimes forget to use it in practice.

@stas00
Contributor Author

stas00 commented Apr 1, 2021

It looks like the code is identical. C++:

https://github.com/pytorch/pytorch/blob/547346d66350b4ac325941752f53fa92f756f6a9/torch/csrc/utils/tensor_flatten.h#L10

inline at::Tensor flatten_dense_tensors(at::TensorList tensors) {
  static auto flatten = [](const at::Tensor &t) { return t.contiguous().view({-1}); };
  if (tensors.size() == 1)
    return flatten(tensors[0]);
  return at::cat(fmap(tensors, flatten));
}

Python:

https://github.com/pytorch/pytorch/blob/33b95c6bac5cf9c52fa646bf7335664b0556263d/torch/_utils.py#L248

def _flatten_dense_tensors(tensors):
    if len(tensors) == 1:
        return tensors[0].contiguous().view(-1)
    flat = torch.cat([t.contiguous().view(-1) for t in tensors], dim=0)
    return flat

I wonder why it'd be faster, other than bypassing the Python machinery for the intermediate results.
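
One plausible explanation is call overhead rather than the math itself: the Python version makes several Python-level dispatches per tensor, while the cpp op crosses into C++ once per call. A rough sketch of the difference (util_ops loaded the same way as in the benchmark below):

import torch
from torch._utils import _flatten_dense_tensors
from deepspeed.ops.op_builder import UtilsBuilder

util_ops = UtilsBuilder().load()
tensors = [torch.rand(512, 512) for _ in range(90)]

# python: one .contiguous() + one .view(-1) per tensor, a list build, then cat -
# every one of those is a separate Python-level dispatch
flat_py = _flatten_dense_tensors(tensors)

# cpp: a single extension call; the same contiguous/view/cat loop runs in C++
flat_cpp = util_ops.flatten(tensors)

assert torch.eq(flat_py, flat_cpp).all()

The cProfile runs below are consistent with this: ~90000 contiguous/view calls for the Python version vs 1000 total calls for the cpp op over 1000 iterations.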

@stas00
Contributor Author

stas00 commented Apr 1, 2021

I believe @RezaYazdaniAminabadi did some small benchmarking at one point between the torch (un)flatten functions vs the cpp op and the cpp op performed significantly better. But I think we sometimes forget to use it in practice.

@RezaYazdaniAminabadi, do you by chance have the benchmark still? If it's faster we should notify the pytorch devs to rectify this omission - perhaps this function pair is not widely used.

@stas00
Contributor Author

stas00 commented Apr 1, 2021

I did some flatten benchmarking and the results are in:

I compared:

  1. torch._utils._flatten_dense_tensors
  2. UtilsBuilder().load().flatten
  3. apex_C.flatten

Results:

  1. UtilsBuilder().load().flatten and apex_C.flatten are on par
  2. torch._utils._flatten_dense_tensors is 2 times slower than the first 2

For unflatten it's about the same, with the pytorch version about 2-3x slower than the other two.

This was on an RTX-3090, with prebuilt deepspeed.

I verified with cProfile, line_profiler and timeit - all give the same results.

Setup:

pip install line_profiler

Benchmark for flatten

#!/usr/bin/env python

import argparse

import gc

import torch
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors
from deepspeed.ops.op_builder import UtilsBuilder

from apex_C import flatten as flatten_apex

util_ops = UtilsBuilder().load()
flatten = util_ops.flatten
unflatten = util_ops.unflatten

torch.manual_seed(0)
# emulate a small typical model weights
x = [torch.rand((512,512)).cuda(), torch.rand((512,1024)).cuda(), torch.rand((512,30000)).cuda()]
t = x * 30

# warm up and check that the same output is produced
flat_py = _flatten_dense_tensors(t)
flat_cpp = flatten(t)
flat_apex = flatten_apex(t)
#numel = flat_cpp.numel()
assert torch.eq(flat_py, flat_cpp).all(), "both produce the same tensor"
assert torch.eq(flat_py, flat_apex).all(), "both produce the same tensor"

TIMES = 1000

# the programs being tested
def py():
    for i in range(TIMES):
        flat = _flatten_dense_tensors(t)

def cpp():
    for i in range(TIMES):
        flat = flatten(t)

def apex():
    for i in range(TIMES):
        flat = flatten_apex(t)

#### cProfile ####

import cProfile

def cprofileme():
    print("--------------- cProfile -----------------")
    print("py")
    cProfile.run("py()", sort=-1)
    gc.collect(); torch.cuda.empty_cache()
    print("cpp")
    cProfile.run("cpp()", sort=-1)
    gc.collect(); torch.cuda.empty_cache()
    print("apex")
    cProfile.run("apex()", sort=-1)
    gc.collect(); torch.cuda.empty_cache()

#### timeit ####

import timeit

def timeme():
    print("--------------- timeit -----------------")
    print(f'py  ={timeit.Timer("py()", globals=globals()).timeit(number=1)}')
    gc.collect(); torch.cuda.empty_cache()
    print(f'cpp ={timeit.Timer("cpp()", globals=globals()).timeit(number=1)}')
    gc.collect(); torch.cuda.empty_cache()
    print(f'apex={timeit.Timer("apex()", globals=globals()).timeit(number=1)}')
    gc.collect(); torch.cuda.empty_cache()

#### line_profiler ####
# this one requires a special way to be called
# pip install line_profiler
# kernprof -l flatten_bench.py -l; python -m line_profiler  flatten_bench.py.lprof

def line_profileme():
    print("--------------- line_profier -----------------")
    print("py")
    profile(py)()
    gc.collect(); torch.cuda.empty_cache()
    print("cpp")
    profile(cpp)()
    gc.collect(); torch.cuda.empty_cache()
    print("apex")
    profile(apex)()
    gc.collect(); torch.cuda.empty_cache()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-l", action='store_true')
    parser.add_argument("-c", action='store_true')
    parser.add_argument("-t", action='store_true')
    args = parser.parse_args()
    if args.l:
        line_profileme()
    elif args.c:
        cprofileme()
    elif args.t:
        timeme()

It looks like if I mix cProfile with timeit, the latter gets invalid results, so each profiler has to be run separately.

Also, before I added empty_cache the results were quite invalid - whichever ran first was getting a 32x faster outcome!

Output:

(main-38) ✘ /hf/deepspeed> ./flatten_bench.py -t
--------------- timeit -----------------
py  =0.1116456389427185
cpp =0.06466162856668234
apex=0.06333659961819649
(main-38) /hf/deepspeed> ./flatten_bench.py -c
--------------- cProfile -----------------
py
         184004 function calls in 0.129 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.129    0.129 <string>:1(<module>)
     1000    0.009    0.000    0.128    0.000 _utils.py:248(_flatten_dense_tensors)
     1000    0.017    0.000    0.094    0.000 _utils.py:264(<listcomp>)
        1    0.001    0.001    0.129    0.129 flatten_bench.py:33(py)
        1    0.000    0.000    0.129    0.129 {built-in method builtins.exec}
     1000    0.000    0.000    0.000    0.000 {built-in method builtins.len}
     1000    0.024    0.000    0.024    0.000 {built-in method cat}
    90000    0.012    0.000    0.012    0.000 {method 'contiguous' of 'torch._C._TensorBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    90000    0.065    0.000    0.065    0.000 {method 'view' of 'torch._C._TensorBase' objects}


cpp
         1004 function calls in 0.063 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.063    0.063 <string>:1(<module>)
        1    0.001    0.001    0.063    0.063 flatten_bench.py:37(cpp)
        1    0.000    0.000    0.063    0.063 {built-in method builtins.exec}
     1000    0.062    0.000    0.062    0.000 {built-in method deepspeed.ops.utils_op.flatten}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


apex
         1004 function calls in 0.072 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.072    0.072 <string>:1(<module>)
        1    0.001    0.001    0.072    0.072 flatten_bench.py:41(apex)
     1000    0.071    0.000    0.071    0.000 {built-in method apex_C.flatten}
        1    0.000    0.000    0.072    0.072 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

Benchmark for unflatten

#!/usr/bin/env python

import argparse
import gc
import torch
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors
from deepspeed.ops.op_builder import UtilsBuilder

from apex_C import flatten as flatten_apex
from apex_C import unflatten as unflatten_apex

util_ops = UtilsBuilder().load()
flatten = util_ops.flatten
unflatten = util_ops.unflatten

torch.manual_seed(0)
# emulate a small typical model weights
x = [torch.rand((512,512)).cuda(), torch.rand((512,1024)).cuda(), torch.rand((512,30000)).cuda()]
unflat_t = x * 30

# warm up and check that the same output is produced
flat_py = _flatten_dense_tensors(unflat_t)
flat_cpp = flatten(unflat_t)
flat_apex = flatten_apex(unflat_t)
#numel = flat_cpp.numel()
assert torch.eq(flat_py, flat_cpp).all(), "both produce the same tensor"
assert torch.eq(flat_py, flat_apex).all(), "both produce the same tensor"

flat_t = flat_py
unflat_py = _unflatten_dense_tensors(flat_py, unflat_t)
for i in range(len(unflat_t)): assert torch.eq(unflat_t[i], unflat_py[i]).all()
unflat_cpp = unflatten(flat_cpp, unflat_t)
for i in range(len(unflat_t)): assert torch.eq(unflat_t[i], unflat_cpp[i]).all()
unflat_apex = unflatten_apex(flat_apex, unflat_t)
for i in range(len(unflat_t)): assert torch.eq(unflat_t[i], unflat_apex[i]).all()

# the programs being tested
def py():
    for i in range(1000):
        unflat = _unflatten_dense_tensors(flat_t, unflat_t)

def cpp():
    for i in range(1000):
        unflat = unflatten(flat_t, unflat_t)

def apex():
    for i in range(1000):
        unflat = unflatten_apex(flat_t, unflat_t)


#### cProfile ####

import cProfile

def cprofileme():
    print("--------------- cProfile -----------------")
    print("py")
    cProfile.run("py()", sort=-1)
    gc.collect(); torch.cuda.empty_cache()
    print("cpp")
    cProfile.run("cpp()", sort=-1)
    gc.collect(); torch.cuda.empty_cache()
    print("apex")
    cProfile.run("apex()", sort=-1)
    gc.collect(); torch.cuda.empty_cache()

#### timeit ####

import timeit

def timeme():
    print("--------------- timeit -----------------")
    print(f'py  ={timeit.Timer("py()", globals=globals()).timeit(number=1)}')
    gc.collect(); torch.cuda.empty_cache()
    print(f'cpp ={timeit.Timer("cpp()", globals=globals()).timeit(number=1)}')
    gc.collect(); torch.cuda.empty_cache()
    print(f'apex={timeit.Timer("apex()", globals=globals()).timeit(number=1)}')
    gc.collect(); torch.cuda.empty_cache()

#### line_profiler ####
# this one requires a special way to be called
# pip install line_profiler
# kernprof -l unflatten_bench.py -l; python -m line_profiler unflatten_bench.py.lprof

def line_profileme():
    print("--------------- line_profier -----------------")
    print("py")
    profile(py)()
    gc.collect(); torch.cuda.empty_cache()
    print("cpp")
    profile(cpp)()
    gc.collect(); torch.cuda.empty_cache()
    print("apex")
    profile(apex)()
    gc.collect(); torch.cuda.empty_cache()

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("-l", action='store_true')
    parser.add_argument("-c", action='store_true')
    parser.add_argument("-t", action='store_true')
    args = parser.parse_args()
    if args.l:
        line_profileme()
    elif args.c:
        cprofileme()
    elif args.t:
        timeme()

Sample output:

(main-38) /hf/deepspeed> ./unflatten_bench.py -c
--------------- cProfile -----------------
py
         361004 function calls in 0.255 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.255    0.255 <string>:1(<module>)
     1000    0.063    0.000    0.245    0.000 _utils.py:284(_unflatten_dense_tensors)
        1    0.009    0.009    0.254    0.254 unflatten_bench.py:38(py)
        1    0.000    0.000    0.255    0.255 {built-in method builtins.exec}
    90000    0.005    0.000    0.005    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    90000    0.094    0.000    0.094    0.000 {method 'narrow' of 'torch._C._TensorBase' objects}
    90000    0.011    0.000    0.011    0.000 {method 'numel' of 'torch._C._TensorBase' objects}
    90000    0.072    0.000    0.072    0.000 {method 'view_as' of 'torch._C._TensorBase' objects}


cpp
         1004 function calls in 0.082 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.082    0.082 <string>:1(<module>)
        1    0.009    0.009    0.082    0.082 unflatten_bench.py:42(cpp)
        1    0.000    0.000    0.082    0.082 {built-in method builtins.exec}
     1000    0.073    0.000    0.073    0.000 {built-in method deepspeed.ops.utils_op.unflatten}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}


apex
         1004 function calls in 0.081 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.081    0.081 <string>:1(<module>)
        1    0.009    0.009    0.081    0.081 unflatten_bench.py:46(apex)
     1000    0.073    0.000    0.073    0.000 {built-in method apex_C.unflatten}
        1    0.000    0.000    0.081    0.081 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

kernprof -l unflatten_bench.py -l; python -m line_profiler  unflatten_bench.py.lprof
--------------- line_profier -----------------
py
cpp
apex
Wrote profile results to unflatten_bench.py.lprof
Timer unit: 1e-06 s

Total time: 0.254009 s
File: unflatten_bench.py
Function: py at line 38

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    38                                           def py():
    39      1001        356.0      0.4      0.1      for i in range(1000):
    40      1000     253653.0    253.7     99.9          unflat = _unflatten_dense_tensors(flat_t, unflat_t)

Total time: 0.088684 s
File: unflatten_bench.py
Function: cpp at line 42

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    42                                           def cpp():
    43      1001        304.0      0.3      0.3      for i in range(1000):
    44      1000      88380.0     88.4     99.7          unflat = unflatten(flat_t, unflat_t)

Total time: 0.087492 s
File: unflatten_bench.py
Function: apex at line 46

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    46                                           def apex():
    47      1001        334.0      0.3      0.4      for i in range(1000):
    48      1000      87158.0     87.2     99.6          unflat = unflatten_apex(flat_t, unflat_t)

@stas00 stas00 changed the title from "[zero] faster flatten/unflatten with apex" to "[zero] faster flatten/unflatten (cpp version)" on Apr 2, 2021
@stas00
Contributor Author

stas00 commented Apr 2, 2021

Proposed to pytorch that they switch to the cpp version (pytorch/pytorch#55240), so down the road we won't need the workaround.

@stas00
Contributor Author

stas00 commented Apr 3, 2021

It looks good now. As I had to turn some functions into methods to make this work, I'm not sure whether I placed them in the most intuitive spot among the other existing methods - please feel free to move them or tell me where to move them. Either way works.

@tjruwase tjruwase merged commit 8b8ed2a into microsoft:master Apr 14, 2021
@stas00 stas00 deleted the faster-flatten branch April 14, 2021 19:40
sdtblck added a commit to EleutherAI/DeeperSpeed that referenced this pull request Apr 22, 2021
* test sparse self_attn fix

* [WarmupDecayLR] fix log(0) & 1/log(1) bugs (microsoft#772)

* fix log(0) & 1/log(1) bugs

* simplify

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>

* bump to v0.3.12

* Bug fix: Remove client optimizer param_group list item that does not have 'params' (microsoft#827)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [doc] pipeline doc typos/improvements (microsoft#659)

Admin merging for pure-doc PR that does not trigger build.

* Samyamr/inference hook fix (microsoft#851)

* Fix mis-aligned-grad

When a parameter is not divisible by world size, the partitioned gradients are mis-aligned due to incorrect padding handling. This PR should fix for that.

* Formatting fix

* Adding static_scale test back for Z3, and also changing hidden size to be not divisile by world_size

* also removing alignment from flat fp16 buffers

* Testing for hidden dim alignment

* inference hook fix

* Update stage3.py

* formatting

* [bug-fix] move params to gpu if offload params is turned off

Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* ZeRO Stage 2: Clear reduced gradients (microsoft#856)

* Ensure gradients of other partitions are cleared after reduction

* Remove redundant code

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [runner/launch] propagate the error (microsoft#854)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* docs: minor spelling tweaks (microsoft#858)

* Allow args to be optional in deepspeed.initialize (microsoft#825)

* Fix ZeRO3 save_checkpoint (microsoft#857)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Make config objects json serializable (microsoft#862)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* bump version 0.3.13

* 1-bit Adam v2 (microsoft#817)

Authors: @awan-10 @conglongli @samyam @jeffra

What's new:

NCCL-based implementation which provides better performance and usability compared to the MPI-based implementation.
Add support to momentum masks for those parameters with constant zero gradients during training.
Bug fixes (e.g., microsoft#813).

* NCCL-based 1-bit Adam + Code Refactor for Comm. Backends (microsoft#594)

* NCCL based 1-bit Implementation + Refactor to add communication backends (microsoft#593)

* add nccl 1-bit optim.

* temporary commit to save stuff.

* Use dist collectives instead of mpi routines.

* remove old code for comm.

* Fix bugs. still does not work.

* modify to test the nccl side code path

* Initial gather impl. Works intra-node.

* Updates to comm. phase 2. nccl comm. passed the tests.

* refactor code to introduce nccl/mpi as backends for onebit adam.

* Refactor updates to test/engine.

* Fix compile/runtime errors.

* simplify support for nccl/mpi backends.

* Add missign file

* Add compression backend in constructor. Revert later.

* modify test with some perf counting.

* Implement a true non-blocking gather for nccl side.

* Revert "Add compression backend in constructor. Revert later."

This reverts commit df8c40d.

* improve the 1-bit adam test.

* Refactor comm. and compression backend in 1-bit adam.

* Fix the test.

* Fix runtime errors and typos in nccl backend

* fix mpi backend. modify tests.

* modify nccl perf test.

* fix mpi side errors.

* Add an mpi perf test

* Sync DSE.

* Remove old collectives file.

* Undo a typo.

* Graceful failure for torch versions that don't support nccl pt2pt.

* Revert "Merge branch 'master' into staging-1bit-nccl-v2"

This reverts commit 7840085, reversing
changes made to a6dba72.

* Revert "Revert "Merge branch 'master' into staging-1bit-nccl-v2""

This reverts commit 6dbdd98.

* comm optimization + 1-bit lamb

* Saving/debugging commit.

* finalizing 1-bit lamb

* finalizing 1-bit lamb

* add momentum mask and chkpt handling for 1-bit adam

* Cleanup and modify nccl test to be runnable with deepspeed launcher.

* Fix format.

* fix formatting again.

* make test runnable without mpi4py

* Add dist.alltoall and dist.allgather instead of custom functions.

* remove debug prints.

* formatting and renaming

* renaming

* renaming

* add unit test, fix existing tests

* skip unit test when torch < 1.8

* revert 1-bit lamb

* flatten momentum when dimension is more than 1

* add warning message for 1-bit adam under fp32

* improve version check

* add fp32 test

* 1-bit adam doc

* fix file name

* doc fix

* torch 1.8 is released

* doc fix

* fix tests

* update news

* add doc for momentum mask

* fix checkpoing handling, add unit test

* checkpoint handling doc

* doc final cleanup

* bump dates

* update tests

* url change

* doc fix

* fix test

* doc update

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* consistent checkpoint filenaming (microsoft#865)

* consistent checkpoint filenaming

* backward compatible rename

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* [doc] launcher (microsoft#868)

As discussed in microsoft#662 this PR modifies the doc:
* explains what to use instead of CUDA_VISIBLE_DEVICES
* puts the `--hostfile` cl arg in the correct place in the invocation script

Fixes: microsoft#662

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [doc] pipeline (microsoft#888)

* [doc] pipeline

As @g-karthik flagged in microsoft#659 (comment) my previous correction PR had one sentence that said the wrong thing. So this PR attempts to rectify that. 

Thank you!

* tweak

* [debug utils] see_memory_usage fixes (microsoft#890)

* see_memory_usage fixes

* didn't expect pt-1.2

* fix the order of things

* fix the order of things

* full fp32 weights reconstruction for zero 2+3 (microsoft#892)

* save_fp16_model consolidated for zero3 (microsoft#893)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Fix zero stage2 cpu_offload when some model trainable parameters skipped in training (microsoft#861)

* Fix zero stage2 cpu_offload when some model trainable parameters skipped in training, as in microsoft#707

As some model trainable parameters skipped in training,
their backward hooks in self.create_reduce_and_remove_grad_hooks() will not run, 
so they have no norm_for_param_grads

* Trim space

* Trim space

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* mlperf attn initial commit

* update kramdown (microsoft#901)

security alert related to older kramdown version

* update backward api doc (microsoft#903)

* Bump kramdown from 2.3.0 to 2.3.1 in /docs (microsoft#905)

Bumps [kramdown](https://github.com/gettalong/kramdown) from 2.3.0 to 2.3.1.
- [Release notes](https://github.com/gettalong/kramdown/releases)
- [Changelog](https://github.com/gettalong/kramdown/blob/master/doc/news.page)
- [Commits](https://github.com/gettalong/kramdown/commits)

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* We're hiring! + integration posts

* [website] We're hiring! + integration posts

* [website] we're hiring!

* zero.Init() clarification (microsoft#880)

* zero.Init() clarification

clarify that if `model.half()` can't fit into gpu memory `zero.Init()` is a must.

this proposal is via @samyam's clarification shared elsewhere.

Thank you.

* style

* add clarity

* style

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* disable pipe test (microsoft#915)

This test has been giving us trouble for a bit, seeing nondeterministic failures, skipping for now to not break out CI. Need to revisit soon though.

* Add link to AML examples. (microsoft#916)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* add inference_batch fn

* Add space in help string (microsoft#926)

* Fix for fragmented linear inputs in ZeRO 3 Linear layers where reshap… (microsoft#881)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [zero3] GatheredParameters can now handle a list of params (microsoft#884)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* fix cpu_adam memory leak on deepspeed re-use in the same process (microsoft#896)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [benchmarks] flatten/unflatten benchmarks (microsoft#919)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* improved readability + typos (microsoft#895)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* [zero doc] fix misspelled param (microsoft#878)

We really really really need those params to be validated...

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Samyamr/stage 3 skip modules without parameters (microsoft#867)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* docs (microsoft#909)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Supporting different hidden dimensions for transformer kernels-v2 (microsoft#934)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Pull changes from DeepSpeed

* Pull changes from DeepSpeed

* Pull changes from DeepSpeed

* Pull changes from DeepSpeed

* Pull changes from DeepSpeed

* Pull changes from DeepSpeed

* cleanup, reinstantiate sending of logits / layer_past

* cleanup, reinstantiate sending of logits / layer_past

* bump to 0.3.14

* add pypi badge

* Delete check of pdsh (microsoft#941)

* fix double linear override; spelling (microsoft#954)

* [config] turn exponential notation back on for config dump (microsoft#955)

* e-notation for large floats

* handle ints too

* readability

* handle bool

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* document how to override ~/.cache/torch_extensions (microsoft#959)

* [zero] faster flatten/unflatten (cpp version)  (microsoft#910)

* faster flatten/unflatten with apex

* switch to cpp flatten/unflatten

* style

* better comment

* missing import

* switch to build ops at run time

* fixes

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* update lr scheduler doc for doing per step or epoch update (microsoft#913)

* update lr scheduler doc for doing per step or epoch update

* work

* trigger build

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Fix ZeRO-3 UnboundLocalError (microsoft#968)

* Fix UnboundLocalError

* Get full partition size

* ZeRO-Infinity (microsoft#976)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>

* revert zero-inf change to launcher

* [docs] zero-inf updates

* bump to 0.3.15

* ZeRO-Infinity tutorial additions (microsoft#978)

* zinf tutorial

* more megatron integration docs

* [docs] add ZeRO-Inf news items

* refactor

* ZeRO-Infinity docs (microsoft#979)

* zinf tutorial

* more megatron integration docs

* ZInf + tiling docs

* [docs] zero-inf updates

* assert no Z2/Z3 with pipeline and fix some docs links (microsoft#980)

* add option to force multi-node launcher mode (microsoft#977)

* [ZeRO Infinity] Allow Init to take a dict for the deepspeed config  (microsoft#983)

* Add check to see if json file is already loaded

* Update doc

* Address review

* Remove doc comment

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* make bold+italic work without escaping _ (microsoft#775)

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* remove debug prints: (microsoft#986)

* 1-bit LAMB optimizer (microsoft#970)

1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed.
Author: @conglongli, @awan-10, @samyam, Hanlin Tang, Yuxiong He
Paper: https://arxiv.org/abs/2104.06069

Co-authored-by: sdtblck <46172032+sdtblck@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Use odd shape tensor to represent parameter data in partitioned state (microsoft#981)

* use wierd shaped tensor to avoid silent failures when not registering externel params

* fix typo

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

* Make reduce scatter optional for ZeRO-1 as workaround (microsoft#971)

* Make reduce scatter optional for ZeRO-1 as workaround

* Make allreduce default for ZeRO 1

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* Fix all Pipeline Module Parameters being sent to cuda:0 (microsoft#687)

* remove communicate overflow (already in utils.CheckOverflow)

Co-authored-by: sid <sidney.black@aleph-alpha.de>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Reza Yazdani <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: brett koonce <koonce@gmail.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: hamlet <gvvvv@163.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Takuya Makino <takuyamakino15@gmail.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: Sean Naren <sean@grid.ai>
@stas00
Contributor Author

stas00 commented May 13, 2021

FYI, pytorch has now replaced the slower Python version with the cpp version: pytorch/pytorch#58006
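
A quick way to confirm what a given torch install does is to print the source of the helper (just a sanity check, using nothing beyond the standard inspect module):

import inspect
from torch._utils import _flatten_dense_tensors

# on versions that include pytorch/pytorch#58006 this should show a thin wrapper
# around the C++ op instead of the pure-python contiguous/view/cat loop
print(inspect.getsource(_flatten_dense_tensors))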
