Support resurrecting blacklisted hosts #3319

Merged
merged 1 commit into from Jan 24, 2022

Commits on Jan 22, 2022

  1. Support resurrecting blacklisted hosts

    This adds support for resurrecting blacklisted hosts in elastic mode.
    Currently, hosts that get blacklisted remain in the blacklist for the lifetime of the job.
    As a result, the job cannot recover from a transient host failure or scale up after a scale-down.
    This is especially the case for the Kubeflow mpi-operator on Kubernetes, as it always
    gives pods known hostnames from its hostfile.
    
    This patch allows blacklisted hosts to become whitelisted again after a configured cooldown period.
    Cooldown periods can be configured with the ``--blacklist-cooldown-range`` parameter like this:
    
    .. code-block:: bash

        $ horovodrun -np 8 --blacklist-cooldown-range 10 100 --min-np 4 --max-np 12 --host-discovery-script discover_hosts.py python train.py
    
    The above example configures the minimum cooldown period to 10 seconds and the maximum cooldown period to 100 seconds.
    The initial cooldown period would be 10 seconds. For repeat failures the cooldown period would grow with an exponential
    backoff delay: 10s, 20s, 40s, and so on. However, the maximum cooldown period would be capped at 100 seconds, regardless
    of failure count. The default behavior is to have no cooldown period, so blacklisted hosts remain in the blacklist.
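
The capped backoff described above can be sketched as a small function. This is only an illustration, not the code in this patch: the name ``cooldown_seconds``, its parameters, and the assumption that the delay doubles on each repeat failure are all hypothetical.

```python
def cooldown_seconds(failure_count, cooldown_min=10, cooldown_max=100, base=2):
    """Hypothetical helper: cooldown (in seconds) after the Nth failure of a host.

    Assumes the delay starts at cooldown_min, is multiplied by `base` on each
    repeat failure, and never exceeds cooldown_max.
    """
    if failure_count < 1:
        # A host that has not failed needs no cooldown.
        return 0
    delay = cooldown_min * base ** (failure_count - 1)
    return min(delay, cooldown_max)


# Cooldowns for the first six failures of the same host,
# using the 10 / 100 range from the example command.
schedule = [cooldown_seconds(n) for n in range(1, 7)]
print(schedule)
```

Under these assumptions the cap takes effect on the fifth failure, after which every further failure of the same host waits the full 100 seconds.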
    
    Signed-off-by: Abin Shahab <ashahab@linkedin.com>
    ashahab committed Jan 22, 2022
    20e1977