Branch: master
thuningxu Shutdown horovod if some rank stalls for too long (#920)
* Shutdown horovod if some rank stalls for too long

Signed-off-by: Xu Ning <nx@uber.com>

* Make stall check/shutdown timing configurable via env

1. Changed `CheckForStalledTensors()` to return a bool indicating whether any rank has lagged longer than `HOROVOD_STALL_SHUTDOWN_TIME_SECONDS`.
2. Reformulated the log printout about stalled ranks and tensors.
3. In `RunLoopOnce()`, shut down if `CheckForStalledTensors()` returns true.
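
The control flow in the three points above can be sketched roughly as follows. This is an illustrative Python model, not the actual C++ code from this commit: the `pending` table layout, the `warn_after` parameter, the log wording, the convention that `run_loop_once` returns False to stop, and the default of `0` meaning "never shut down" are all assumptions made for the sketch. Only the names `CheckForStalledTensors()`, `RunLoopOnce()`, and `HOROVOD_STALL_SHUTDOWN_TIME_SECONDS` come from the commit message.

```python
import os
import time

# Env variable named in the commit message. Treating "0" as "never shut down"
# is an assumption of this sketch.
SHUTDOWN_ENV = "HOROVOD_STALL_SHUTDOWN_TIME_SECONDS"


def check_for_stalled_tensors(pending, world_size, now, warn_after, shutdown_after):
    """Return True if some tensor has waited longer than shutdown_after seconds.

    pending maps tensor name -> (first_request_time, set of ranks that are ready).
    This layout is hypothetical; it only models the idea of per-tensor lag.
    """
    should_shutdown = False
    for name, (start, ready_ranks) in pending.items():
        lag = now - start
        if lag <= warn_after:
            continue  # not stalled long enough to report
        missing = sorted(set(range(world_size)) - set(ready_ranks))
        print(f"stalled tensor {name}: waiting {lag:.0f}s for ranks {missing}")
        if shutdown_after > 0 and lag > shutdown_after:
            should_shutdown = True
    return should_shutdown


def run_loop_once(pending, world_size, now=None, warn_after=60.0):
    """Return False when the loop should stop, True to keep running."""
    shutdown_after = float(os.environ.get(SHUTDOWN_ENV, "0"))
    now = time.monotonic() if now is None else now
    if check_for_stalled_tensors(pending, world_size, now, warn_after, shutdown_after):
        return False
    return True
```

With the variable set to `120`, a tensor that has waited 90 seconds is only reported, while one that has waited 200 seconds makes `run_loop_once` return False.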

Tested offline with a script. Example: https://gist.github.com/thuningxu/6311503ec04b3f38ee70d7a99b9bfa4c

Signed-off-by: Xu Ning <nx@uber.com>

* added unit test

Signed-off-by: Xu Ning <nx@uber.com>

* fix test

Signed-off-by: Xu Ning <nx@uber.com>
Latest commit 4f7319e Mar 20, 2019
| Name | Latest commit message | Commit time |
| --- | --- | --- |
| _keras | | |
| common | Shutdown horovod if some rank stalls for too long (#920) | Mar 20, 2019 |
| keras | | |
| mxnet | Removed references to MPI from messages and moved flatbuffers to subm… | Feb 19, 2019 |
| run | horovodrun: use os.execve() for mpirun (#931) | Mar 19, 2019 |
| spark | Added horovodrun. (#869) | Mar 18, 2019 |
| tensorflow | Removed references to MPI from messages and moved flatbuffers to subm… | Feb 19, 2019 |
| torch | from elementSizeInBytes to element_size, following upstream commit ht… ( | Mar 18, 2019 |
| __init__.py | | |