thuningxu Shutdown horovod if some rank stalls for too long (#920)
* Shutdown horovod if some rank stalls for too long

Signed-off-by: Xu Ning <nx@uber.com>

* Make stall check/shutdown timing configurable via env

1. Changed `CheckForStalledTensors()` to return a bool indicating whether any rank has lagged for longer than `HOROVOD_STALL_SHUTDOWN_TIME_SECONDS`.
2. Reworked the log output that lists stalled ranks and tensors.
3. In `RunLoopOnce()`, shut down if `CheckForStalledTensors()` returns true.
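
The control flow described in the three points above can be sketched roughly as follows. This is an illustrative Python model, not the actual Horovod source: the function names `check_for_stalled_tensors` and `run_loop_once`, the `first_seen` map, the `warn_after` parameter, and the convention that a threshold of 0 disables shutdown are all assumptions made for the example. Only the environment variable name `HOROVOD_STALL_SHUTDOWN_TIME_SECONDS` comes from the commit message.

```python
import os


def check_for_stalled_tensors(first_seen, now, warn_after, shutdown_after):
    """Return True if any pending tensor has waited longer than shutdown_after.

    first_seen maps a tensor name to the time (in seconds) at which the
    first rank requested it. Treating shutdown_after == 0 as "never shut
    down" is an assumption of this sketch.
    """
    should_shut_down = False
    for name, started in first_seen.items():
        lag = now - started
        if lag > warn_after:
            # Stand-in for the log output about stalled ranks and tensors.
            print(f"Stalled tensor {name}: waiting for {lag:.0f}s")
        if shutdown_after > 0 and lag > shutdown_after:
            should_shut_down = True
    return should_shut_down


def run_loop_once(first_seen, now, warn_after=60.0, shutdown_after=None):
    """Return False when the background loop should stop, True otherwise."""
    if shutdown_after is None:
        # Read the threshold from the environment, defaulting to disabled.
        shutdown_after = float(
            os.environ.get("HOROVOD_STALL_SHUTDOWN_TIME_SECONDS", "0")
        )
    if check_for_stalled_tensors(first_seen, now, warn_after, shutdown_after):
        return False
    return True
```

Returning a bool from the check keeps the shutdown decision in the loop itself, so the check only reports and the caller decides what to do.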

Tested offline with a script. Example: https://gist.github.com/thuningxu/6311503ec04b3f38ee70d7a99b9bfa4c

Signed-off-by: Xu Ning <nx@uber.com>

* Add unit test

Signed-off-by: Xu Ning <nx@uber.com>

* Fix test

Signed-off-by: Xu Ning <nx@uber.com>
Latest commit 4f7319e Mar 20, 2019
Type Name Latest commit message Commit time
common.py
test_keras.py
test_mxnet.py
test_spark.py
test_stall.py Shutdown horovod if some rank stalls for too long (#920) Mar 20, 2019
test_tensorflow.py Filter warnings to avoid Travis CI failure (#751) Jan 14, 2019
test_tensorflow_keras.py Filter warnings to avoid Travis CI failure (#751) Jan 14, 2019
test_timeline.py Optimize Horovod Timeline and add Cycle Markers (#782) Jan 26, 2019
test_torch.py