Branch: master
Commits on Mar 20, 2019
  1. Shutdown horovod if some rank stalls for too long (#920)

    thuningxu committed Mar 20, 2019
    * Shutdown horovod if some rank stalls for too long
    
    Signed-off-by: Xu Ning <nx@uber.com>
    
    * Make stall check/shutdown timing configurable via env
    
    1. Changed `CheckForStalledTensors()` to return a bool indicating whether any rank has lagged longer than `HOROVOD_STALL_SHUTDOWN_TIME_SECONDS`.
    2. Reformulated the log printout about stalled ranks and tensors.
    3. In `RunLoopOnce()`, shut down if `CheckForStalledTensors()` returns true.
    
    Tested offline with a script. Example: https://gist.github.com/thuningxu/6311503ec04b3f38ee70d7a99b9bfa4c
    
    Signed-off-by: Xu Ning <nx@uber.com>
    
    * added unit test
    
    Signed-off-by: Xu Ning <nx@uber.com>
    
    * fix test
    
    Signed-off-by: Xu Ning <nx@uber.com>
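
The stall handling described in #920 can be illustrated with a small Python sketch. This is not the Horovod implementation (which lives in the C++ core); the function names, the `pending` data layout, the warning threshold parameter, and the convention that a shutdown threshold of 0 disables shutdown are all assumptions made for illustration. Only the `HOROVOD_STALL_SHUTDOWN_TIME_SECONDS` variable name comes from the commit message.

```python
import os


def read_seconds(name, default):
    """Read a non-negative number of seconds from the environment."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        return default
    return value if value >= 0 else default


def check_for_stalled_tensors(pending, now, warn_after, shutdown_after):
    """Return True if any tensor has waited longer than shutdown_after.

    pending maps a tensor name to (first_request_time, missing_ranks).
    Assumption: shutdown_after == 0 means "never shut down".
    """
    should_shutdown = False
    for name, (first_seen, missing_ranks) in sorted(pending.items()):
        lag = now - first_seen
        if lag > warn_after:
            # One line per stalled tensor, listing the ranks that have not joined.
            print("stalled %.0fs: %s, missing ranks: %s"
                  % (lag, name, sorted(missing_ranks)))
        if shutdown_after > 0 and lag > shutdown_after:
            should_shutdown = True
    return should_shutdown


def run_loop_once(pending, now, warn_after=60.0):
    """Hypothetical loop step: returns False when the loop should stop."""
    shutdown_after = read_seconds("HOROVOD_STALL_SHUTDOWN_TIME_SECONDS", 0.0)
    return not check_for_stalled_tensors(pending, now, warn_after, shutdown_after)
```

Returning a bool from the check keeps the decision to stop in the loop itself, so the check stays a pure reporting step that is easy to unit test.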
Commits on Mar 19, 2019
  1. Bugfix Dockerfile (#933)

    alsrgv committed Mar 19, 2019
    Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
  2. Bump Horovod version to 0.16.1 (#928)

    alsrgv committed Mar 19, 2019
    Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
  3. horovodrun: use os.execve() for mpirun (#931)

    alsrgv committed Mar 19, 2019
    safe_shell_exec uses line buffering which does not play very well with
    Keras and tqdm -- libraries that make use of progress bars.
    
    Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
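
A minimal sketch of the idea behind #931: instead of running `mpirun` as a child and relaying its output line by line, the launcher replaces its own process image with `mpirun`, so progress bars write straight to the terminal. The helper names below are hypothetical and the argument list is simplified; the actual horovodrun command line is not shown in this log. `-np` and `-H` are standard Open MPI `mpirun` options.

```python
import os
import shutil


def build_mpirun_argv(num_proc, hosts, command):
    """Build a simplified mpirun argument vector (hypothetical helper)."""
    argv = ["mpirun", "-np", str(num_proc)]
    if hosts:
        argv += ["-H", hosts]
    return argv + list(command)


def exec_mpirun(num_proc, hosts, command, env=None):
    """Replace the current process with mpirun; does not return on success."""
    argv = build_mpirun_argv(num_proc, hosts, command)
    path = shutil.which(argv[0])
    if path is None:
        raise FileNotFoundError("mpirun not found on PATH")
    # os.execve keeps the inherited stdin/stdout/stderr, so no buffering
    # layer sits between the training script and the terminal.
    os.execve(path, argv, dict(os.environ if env is None else env))
```

For example, `exec_mpirun(4, "server1:2,server2:2", ["python", "train.py"])` would hand control to `mpirun` with four processes across two hosts.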
  4. Bugfix module name in horovodrun (#929)

    alsrgv committed Mar 19, 2019
    Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
Commits on Mar 18, 2019
  1. Fixed two typos in horovodrun PR (#925)

    abditag2 committed Mar 18, 2019
    * Added more info to the horovodrun help and made np arg required
    
    Signed-off-by: fardin <fardin@uber.com>
  2. Adopt horovodrun (#924)

    alsrgv committed Mar 18, 2019
    * Adopt horovodrun
    
    Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
    
    * Update docs
    
    Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
    
    * Add newline to split paragraphs
    
    Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
  3. Added horovodrun. (#869)

    abditag2 committed Mar 18, 2019
    * Added horovodrun
    
    Signed-off-by: fardin <fardin@uber.com>
  4. from elementSizeInBytes to element_size, following upstream commit ht… (#919)

    labor00 authored and alsrgv committed Mar 18, 2019
    * from elementSizeInBytes to element_size, following upstream commit pytorch/pytorch#17785
    
    Signed-off-by: labor00 <abrvb@outlook.com>
    
    * uses TORCH_VERSION macro to ensure backward compatibility
    
    Signed-off-by: labor00 <abrvb@outlook.com>
    
    * if pytorch version string contains dev return a very big number
    
    Signed-off-by: labor00 <abrvb@outlook.com>
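
The version handling mentioned in #919 (treat a development build as newer than any release) might look like the sketch below. The function name, the sentinel value, and the `major * 10000 + minor * 100 + patch` encoding are assumptions for illustration, not the code from the commit.

```python
import re

# Assumed sentinel: larger than any encoded release version.
DEV_VERSION = 9999999999


def version_to_int(version_str):
    """Encode a version string as an integer for compile-time comparison."""
    if "dev" in version_str:
        # Development builds track upstream master, so treat them as newest.
        return DEV_VERSION
    match = re.match(r"^(\d+)(?:\.(\d+))?(?:\.(\d+))?", version_str)
    if match is None:
        return None
    major, minor, patch = (int(g) if g else 0 for g in match.groups())
    return major * 10000 + minor * 100 + patch
```

An integer produced this way can be passed to the compiler as a macro value (the commit mentions a `TORCH_VERSION` macro), letting the C++ code choose between `elementSizeInBytes` and `element_size` with a preprocessor comparison.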
Commits on Mar 15, 2019
  1. Switch safe_shell_exec() from byte stdout/stderr to text stdout/stderr (#917)

    alsrgv committed Mar 15, 2019
    Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
  2. Use os.setsid() instead of os.setpgid() (#916)

    alsrgv committed Mar 15, 2019
    We've encountered an issue with launching ssh inside safe_shell_exec.
    OpenSSH makes use of tcsetattr() to set terminal properties, which get
    propagated to the whole process group. setsid() provides better isolation
    of the newly spawned process.
    
    Additional reading: https://en.wikipedia.org/wiki/Process_group
    
    Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
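
To illustrate the isolation that #916 relies on, here is a small POSIX-only sketch that starts a child in its own session. It uses `subprocess.Popen(..., start_new_session=True)`, which calls `setsid()` in the child before exec; this is a stand-in for demonstration, and `safe_shell_exec` itself may be structured differently. The helper names are hypothetical.

```python
import os
import subprocess
import sys


def spawn_in_new_session(argv):
    """Start argv as the leader of a new session (and new process group)."""
    return subprocess.Popen(
        argv,
        start_new_session=True,  # child calls setsid() before exec
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )


def child_session_info():
    """Return (child_pid, child_session_id, parent_session_id)."""
    code = "import os; print(os.getsid(0))"
    proc = spawn_in_new_session([sys.executable, "-c", code])
    out, _ = proc.communicate()
    return proc.pid, int(out.strip()), os.getsid(0)
```

Because the child is a session leader, its session ID equals its own PID and differs from the parent's, and it starts with no controlling terminal. With `setpgid()` alone the child would get a new process group but stay in the parent's session.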
Commits on Mar 13, 2019
  1. Add HOROVOD_CUDA_HOME documentation (#911)

    alsrgv committed Mar 13, 2019
    * Add HOROVOD_CUDA_HOME documentation
    
    Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
    
    * Add `HOROVOD_CUDA_INCLUDE` and `HOROVOD_CUDA_LIB`
    
    Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
    
    * Copyedits
    
    Signed-off-by: Alex Sergeev <alexander.sergeev@live.com>
Commits on Mar 12, 2019
Commits on Mar 9, 2019
  1. Refactor operations into separate components by framework (#826)

    tgaddair committed Mar 9, 2019
Commits on Mar 8, 2019
  1. Simplify mnist example using Gluon API and update README (#886)

    apeforest authored and alsrgv committed Mar 8, 2019
    * Simplify mnist example using Gluon API and update README
    
    Signed-off-by: Lin Yuan <apeforest@gmail.com>
    
    * replace no-cuda argument by -use-gpu
    
    Signed-off-by: Lin Yuan <apeforest@gmail.com>
    
    * change argument name
    
    Signed-off-by: Lin Yuan <apeforest@gmail.com>
  2. Fixed parameter manager to use the previously set value for Bayesian parameters when changing a free parameter to a constant (#888)

    tgaddair committed Mar 8, 2019
Commits on Mar 5, 2019
  1. Fix a bug in detecting MKLDNN (#879)

    yuxihu authored and alsrgv committed Mar 5, 2019
    Signed-off-by: Yuxi Hu <darrenyxhu@gmail.com>
Commits on Mar 4, 2019
  1. detect if MKLDNN is enabled in MXNet build (#868)

    wuxun-zhang authored and alsrgv committed Mar 4, 2019
    Signed-off-by: wuxun-zhang <wuxun.zhang@intel.com>
Commits on Mar 2, 2019
  1. Add PyPI download statistics (#873)

    alsrgv committed Mar 2, 2019
    Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
  2. Update README.md (#872)

    eric-haibin-lin authored and alsrgv committed Mar 2, 2019
    update imagenet example
    
    Signed-off-by: Haibin Lin <linhaibin.eric@gmail.com>
Commits on Feb 27, 2019
  1. Add issue templates (#865)

    alsrgv committed Feb 27, 2019
Commits on Feb 22, 2019
  1. Bugfix PyTorch ImageNet example (#853)

    alsrgv committed Feb 22, 2019
    Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
Commits on Feb 21, 2019
  1. Add option to pass include/lib dirs for building with PowerAI DDL (#847)

    nvcastet authored and alsrgv committed Feb 21, 2019
    * Add option to pass include/lib dirs for building with PowerAI DDL
    
    IBM PowerAI DDL will be installed in Anaconda environments. The path
    will not always be '/opt/DL/ddl' anymore.
    HOROVOD_DDL_HOME or HOROVOD_DDL_INCLUDE/HOROVOD_DDL_LIB can be used to
    specify include/lib dirs.
    
    Signed-off-by: Nicolas V Castet <nvcastet@us.ibm.com>
    
    * Add legacy DDL include/lib dirs only when none are specified
    
    Signed-off-by: Nicolas V Castet <nvcastet@us.ibm.com>
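
The lookup order described in #847 can be sketched as follows. The environment variable names and the legacy `/opt/DL/ddl` location come from the commit message; the function name and the `include`/`lib` subdirectory layout are assumptions, and the real build script may differ.

```python
import os

LEGACY_DDL_HOME = "/opt/DL/ddl"


def resolve_ddl_dirs(env):
    """Return (include_dirs, lib_dirs) for building with PowerAI DDL."""
    include_dirs, lib_dirs = [], []

    home = env.get("HOROVOD_DDL_HOME")
    if home:
        # Assumed layout: <home>/include and <home>/lib.
        include_dirs.append(os.path.join(home, "include"))
        lib_dirs.append(os.path.join(home, "lib"))

    include = env.get("HOROVOD_DDL_INCLUDE")
    if include:
        include_dirs.append(include)

    lib = env.get("HOROVOD_DDL_LIB")
    if lib:
        lib_dirs.append(lib)

    # Fall back to the legacy location only when nothing was specified.
    if not include_dirs and not lib_dirs:
        include_dirs.append(os.path.join(LEGACY_DDL_HOME, "include"))
        lib_dirs.append(os.path.join(LEGACY_DDL_HOME, "lib"))

    return include_dirs, lib_dirs
```

Calling `resolve_ddl_dirs(os.environ)` inside an Anaconda environment with `HOROVOD_DDL_HOME` set would therefore skip the legacy path entirely.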
  2. Make tensorflow_mnist_eager.py Python2-friendly (#846)

    alsrgv committed Feb 21, 2019
    Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
  3. Docker: add mxnet to build tag, switch to horovod/horovod (#845)

    alsrgv committed Feb 21, 2019
    * Docker: add mxnet to build tag, switch to horovod/horovod
    
    Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
    
    * Update docs as well
    
    Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
    
    * Add HOROVOD_WITH_MXNET=1
    
    Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
Commits on Feb 20, 2019
  1. Bump version to 0.16.0 (#838)

    alsrgv committed Feb 20, 2019
    Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
  2. Fixed PyTorch with CUDA to use new DataType enum (#842)

    tgaddair committed Feb 20, 2019
    Signed-off-by: Travis Addair <taddair@uber.com>
Commits on Feb 19, 2019
  1. Add MXNet 1.4.0 to Dockerfile (#840)

    alsrgv committed Feb 19, 2019
    Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
  2. Update Dockerfile NCCL to 2.4.2 (#837)

    alsrgv committed Feb 19, 2019
    Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
  3. Switch Travis CI URL (#835)

    alsrgv committed Feb 19, 2019
    Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
Commits on Feb 7, 2019
  1. MXNet: support Gluon Trainer API (#809)

    yuxihu authored and alsrgv committed Feb 7, 2019
Commits on Feb 6, 2019
  1. Improved autotuning scoring and parameter search process (#813)

    tgaddair authored and alsrgv committed Feb 6, 2019
Commits on Feb 4, 2019
  1. Align broadcast_variables() API with broadcast() (#807)

    alsrgv committed Feb 4, 2019