Support resurrecting blacklisted hosts #3319
Conversation
Force-pushed from 9424a7f to ae0ad53
Unit Test Results (with flaky tests): 865 files (−78), 865 suites (−78), 9h 44m 28s ⏱️ (−1h 46m 55s). For more details on these failures, see this check. Results for commit 20e1977. ± Comparison against base commit cce4207. ♻️ This comment has been updated with latest results.
Force-pushed from ae0ad53 to cc76612
Some of the tests are failing with exit code 124 / SIGKILL. I'll look at the CI scripts to see if they time-bound the tests (some of my tests take long because they wait for the cooldown).
Force-pushed from cc76612 to a3a5d62
Generally LGTM. Just want to make sure the tests go well.
Force-pushed from e46dcf0 to c7876d3
@EnricoMi I'm getting a flaky error in one of the macOS tests. Do you know if this has something to do with the time limits on the tests?
@maxhgerlach is the above macOS flakiness related to your fixes in #3300 / #3291? With timestamps:
The job times out because the test does not finish.
Hmm, we saw a similar problem with join and allgather (#3291 (comment)), which apparently went away with @Tixxx's fix in #3313. If this one is with allreduce though, the cause may be different. We observed another hang with torch 1.10 that went away by skipping a specific test: #3314. Something's fishy here.
Force-pushed from fc0f6d8 to cf1df3e
@maxhgerlach I see these GPU Buildkite failures in other PRs too; what can be done about this?
Which other PRs have the same failures? You can ignore the "Unit Test Results (with flaky tests)" failures, but there are no "Unit Test Results" failures in master (https://github.com/horovod/horovod/runs/4696939351) and there haven't been for the last four commits (https://github.com/horovod/horovod/actions/workflows/ci.yaml?query=branch%3Amaster+event%3Apush). It looks like your branch is already rebased onto master HEAD. Maybe this PR is triggering some hidden issue, or introducing it.
Force-pushed from cf1df3e to 6172d1e
LGTM. An additional question: if the hostname of a worker is resurrected after the cooldown period but the worker is still down, then, since the worker is not reporting itself, I suppose this worker will not be added to the group until it starts and reports itself. Am I right?
@zw0610 No, the nodes would still be added even if they are down. The discovery script is supposed to figure out which nodes are up and return only healthy nodes. If we want to separate the discovery of healthy nodes from the discovery script, we could design a health-check hook, but that is not included in this PR.
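For context, an elastic host discovery script prints one `hostname:slots` line per available host. Below is a minimal sketch of such a script that returns only healthy nodes; the candidate pool and the SSH-based health check are illustrative assumptions, not part of this PR:

```python
#!/usr/bin/env python
# Hypothetical discover_hosts.py: print one "hostname:slots" line per
# healthy node; Horovod elastic re-runs this script to find available hosts.
import subprocess

# Assumed static candidate pool; a real script might query Kubernetes instead.
CANDIDATE_HOSTS = {'worker-0': 4, 'worker-1': 4, 'worker-2': 4}

def is_healthy(host):
    # Illustrative health check: run a no-op command over SSH with a timeout.
    return subprocess.call(
        ['ssh', '-o', 'ConnectTimeout=2', host, 'true'],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

if __name__ == '__main__':
    for host, slots in CANDIDATE_HOSTS.items():
        if is_healthy(host):
            print('{}:{}'.format(host, slots))
```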
@ashahab Sorry for not explaining clearly. What I meant is that such a worker is not reported by the discovery script.
@zw0610 that is an interesting test case, I'll add that in.

```python
# Helpers as sketched in this PR; assumed to be nested in a scope where
# both `self` and `host_slots` are defined.
def whitelist_all_hosts():
    # Whitelist every host whose cooldown period has elapsed.
    for host in host_slots.keys():
        if self._hosts_state[host].is_resurrected():
            self._hosts_state[host].whitelist()

def has_resurrected_hosts():
    # True if at least one host has come off the blacklist after its cooldown.
    resurrected_hosts = [host for host in host_slots.keys()
                         if self._hosts_state[host].is_resurrected()]
    return len(resurrected_hosts) > 0
```
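As a rough illustration of how such helpers might be driven, here is a hypothetical polling loop; the `manager` object and the method placement are assumptions for illustration, not the PR's actual driver code:

```python
import time

def poll_for_resurrections(manager, poll_interval=5.0, rounds=10):
    # Hypothetical loop: once any blacklisted host's cooldown has elapsed,
    # whitelist it so the next discovery pass can re-add it to the job.
    for _ in range(rounds):
        if manager.has_resurrected_hosts():
            manager.whitelist_all_hosts()
        time.sleep(poll_interval)
```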
@ashahab That's great! It's exactly the behavior I expect from Horovod elastic.
LGTM. Need @tgaddair to take one more look.
Force-pushed from 6172d1e to 86920a8
@chongxiaoc I just pushed the changes, let's wait for the tests. Thank you!
@ashahab Updated my review. Can you please update the PR description, especially this part:
I think you are referring to the default here, which is different, I think. And the example is not exponential.
Force-pushed from 13c2134 to cf5dd5c
Force-pushed from cf5dd5c to 6996e87
This adds support for resurrecting blacklisted hosts in elastic mode. Currently, hosts that get blacklisted remain on the blacklist for the lifetime of the job. This cannot handle transient host failures or a scale-up after a scale-down. This is especially the case for the Kubeflow mpi-operator on Kubernetes, as it always gives pods known hostnames from its hostfile. This patch allows blacklisted hosts to become whitelisted after a configured cooldown period. Cooldown periods can be configured with the ``--blacklist-cooldown-range`` parameter like this:

.. code-block:: bash

    $ horovodrun -np 8 --blacklist-cooldown-range 10 100 --min-np 4 --max-np 12 --host-discovery-script discover_hosts.py python train.py

The above example configures the minimum cooldown period to 10 seconds and the maximum cooldown period to 100 seconds. The initial cooldown period would be 10 seconds. For repeat failures the cooldown period would grow with an exponential backoff delay: 10s, 20s, 40s, and so on. However, the maximum cooldown period would be capped at 100 seconds, regardless of failure count. The default behavior is to have no cooldown period, and blacklisted hosts would remain blacklisted.

Signed-off-by: Abin Shahab <ashahab@linkedin.com>
Force-pushed from 6996e87 to 20e1977
LGTM!
@ashahab the docs are failing: https://readthedocs.org/projects/horovod/builds/15848706/
Never mind, it looks like this already exists in master. It has been fixed in #3377.
This adds support for resurrecting blacklisted hosts in elastic mode.
Currently, hosts that get blacklisted remain on the blacklist for the lifetime of the job.
This cannot handle transient host failures or a scale-up after a scale-down.
This is especially the case for the Kubeflow mpi-operator on Kubernetes, as it always
gives pods known hostnames from its hostfile.
This patch will allow blacklisted hosts to become whitelisted after a configured cooldown period.
Users can configure the cooldown range by providing the ``--blacklist-cooldown-range`` parameter like this:

.. code-block:: bash

    $ horovodrun -np 8 --blacklist-cooldown-range 10 100 --min-np 4 --max-np 12 --host-discovery-script discover_hosts.py python train.py

The above example configures the minimum cooldown period to 10 seconds and the maximum cooldown period to 100 seconds. The initial cooldown period would be 10 seconds. For repeat failures the cooldown period would grow with an exponential backoff delay (with a constant exponent of 2): 10s, 20s, 40s, and so on. However, the maximum cooldown period would be capped at 100 seconds, regardless of failure count. A random backoff fraction of the cooldown lower limit is added to the cooldown delay. The default behavior is to have no cooldown period, and blacklisted hosts would remain blacklisted.
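To make the backoff concrete, here is a minimal sketch of the delay computation as described above; the function and parameter names are illustrative assumptions, not Horovod's internal API:

.. code-block:: python

    import random

    def cooldown_delay(failure_count, lower=10.0, upper=100.0):
        # Exponential backoff with base 2: 10s, 20s, 40s, ... for repeat
        # failures, capped at the configured upper limit.
        backoff = min(lower * (2 ** (failure_count - 1)), upper)
        # Plus a random fraction of the cooldown lower limit as jitter.
        jitter = random.uniform(0.0, 1.0) * lower
        return backoff + jitter

For the example range of 10 to 100 seconds, the base delay (ignoring jitter) would be 10s, 20s, 40s, 80s, and then stay at the 100s cap from the fifth failure onward.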
Description
Fixes #1926 (issue).