Support resurrecting blacklisted hosts #3319

This adds support for resurrecting blacklisted hosts in elastic mode. Currently, hosts that get blacklisted remain in the blacklist for the lifetime of the job. That behavior cannot handle transient host failures or a scale-up after a scale-down. The problem is especially visible with the Kubeflow mpi-operator on Kubernetes, as it always gives pods known hostnames from its hostfile.

This patch allows blacklisted hosts to become whitelisted again after a configured cooldown period. Cooldown periods can be configured with the ``--blacklist-cooldown-range`` parameter like this:

.. code-block:: bash

    $ horovodrun -np 8 --blacklist-cooldown-range 10 100 --min-np 4 --max-np 12 --host-discovery-script discover_hosts.py python train.py

The above example configures the minimum cooldown period to 10 seconds and the maximum cooldown period to 100 seconds. The initial cooldown period would be 10 seconds. For repeat failures the cooldown period would grow with an exponential backoff delay: 10s, 20s, 40s, and so on. However, the cooldown period would be capped at the 100-second maximum, regardless of failure count.

The default behavior is to have no cooldown period, so blacklisted hosts remain in the blacklist.

Signed-off-by: Abin Shahab <ashahab@linkedin.com>
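The capped backoff described above can be sketched as follows. This is only an illustration, not the code in this patch: the function name ``cooldown_period`` is hypothetical, and the sketch assumes the delay doubles on every repeat failure, starting from the lower bound of the range. The real implementation may differ, for example by adding random jitter.

```python
def cooldown_period(failure_count, lower=10, upper=100):
    """Return the cooldown in seconds after `failure_count` failures (1-based).

    Hypothetical helper: assumes the delay doubles with each repeat failure
    and is capped at `upper`. Defaults mirror `--blacklist-cooldown-range 10 100`.
    """
    if failure_count < 1:
        raise ValueError("failure_count must be at least 1")
    # First failure waits `lower` seconds; each further failure doubles the wait.
    delay = lower * 2 ** (failure_count - 1)
    # Never exceed the configured maximum.
    return min(delay, upper)


if __name__ == "__main__":
    # Cooldowns for the first six failures of the same host.
    print([cooldown_period(n) for n in range(1, 7)])
    # prints [10, 20, 40, 80, 100, 100]
```

Under these assumptions, a host that keeps failing is retried less and less often, but is still reconsidered at least every 100 seconds.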

Commits on Jan 22, 2022