Support resurrecting blacklisted hosts #3319

Merged
merged 1 commit into horovod:master on Jan 24, 2022

Conversation

ashahab
Collaborator

@ashahab ashahab commented Dec 13, 2021

This adds support for resurrecting blacklisted hosts in elastic mode.
Currently hosts that get blacklisted remain in the blacklist for the lifetime of the job.
This cannot handle transient host failures or a scale-up after a scale-down.
This is especially the case for the Kubeflow mpi-operator on Kubernetes, as it always
gives pods known hostnames from its hostfile.

This patch will allow blacklisted hosts to become whitelisted after a configured cooldown period.
Users can configure the cooldown range by providing the --blacklist-cooldown-range parameter like this:

$ horovodrun -np 8 --blacklist-cooldown-range 10 100 --min-np 4 --max-np 12 --host-discovery-script discover_hosts.py python train.py

The above example configures the minimum cooldown period to 10 seconds and the maximum cooldown period to 100 seconds.
The initial cooldown period would be 10 seconds. For repeated failures the cooldown period would grow with an exponential
backoff delay (with a constant exponent of 2): 10s, 20s, 40s, and so on. However, the maximum cooldown period would be
capped at 100 seconds, regardless of failure count. A random backoff fraction of the cooldown lower limit is added
to the cooldown delay.
The default behavior is to have no cooldown period, and blacklisted hosts would remain in the blacklist.
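
To make the backoff arithmetic concrete, here is a small sketch of how such a cooldown delay could be computed. This is not the code in this PR; the function name, signature, and jitter formula are illustrative assumptions based on the description above.

import random

def cooldown_delay(failure_count, lower_limit=10.0, upper_limit=100.0):
    # Exponential backoff with a constant exponent of 2, capped at the upper limit.
    delay = min(lower_limit * (2 ** (failure_count - 1)), upper_limit)
    # A random fraction of the cooldown lower limit is added as jitter.
    return delay + random.uniform(0.0, 1.0) * lower_limit

# With --blacklist-cooldown-range 10 100, successive failures yield roughly
# 10s, 20s, 40s, 80s, 100s, 100s, ... plus up to 10s of jitter.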

Checklist before submitting

  • [x] Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Fixes #1926 (issue).

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

@ashahab
Collaborator Author

ashahab commented Dec 13, 2021

@tgaddair @zw0610 This implements #1926. Please take a look.

@github-actions

github-actions bot commented Dec 14, 2021

Unit Test Results

773 files (-28)   773 suites (-28)   8h 55m 22s ⏱️ (-1m 46s)
722 tests (+5)   674 ✔️ (+4)   48 💤 (+1)   0 failed (±0)
16 693 runs (-617)   11 772 ✔️ (-392)   4 921 💤 (-225)   0 failed (±0)

Results for commit 20e1977. ± Comparison against base commit cce4207.

♻️ This comment has been updated with latest results.

@github-actions

github-actions bot commented Dec 14, 2021

Unit Test Results (with flaky tests)

865 files (-78)   865 suites (-78)   9h 44m 28s ⏱️ (-1h 46m 55s)
722 tests (+5)   673 ✔️ (+7)   48 💤 (+1)   1 failed (-3)
18 913 runs (-1 561)   13 192 ✔️ (-1 187)   5 719 💤 (-365)   2 failed (-9)

For more details on these failures, see this check.

Results for commit 20e1977. ± Comparison against base commit cce4207.

♻️ This comment has been updated with latest results.

@ashahab
Collaborator Author

ashahab commented Dec 14, 2021

Some of the tests are failing with exit code 124/SIGKILL. I'll look at the CI scripts to see whether they time-bound the tests (some of my tests take a long time because they wait for the cooldown).

@zw0610
Contributor

zw0610 commented Dec 15, 2021

Generally LGTM. Just want to make sure the tests go well.

@ashahab ashahab force-pushed the abin-blacklist-resurrection branch 4 times, most recently from e46dcf0 to c7876d3 on December 20, 2021 04:09
@ashahab
Collaborator Author

ashahab commented Dec 20, 2021

@EnricoMi I'm getting a flaky error in one of the macOS tests. Do you know if this has something to do with the time limits on the tests?

[1,0]<stdout>:test_torch.py::TorchTests::test_horovod_join_allreduce [1,1]<stdout>:
[1,1]<stdout>:test_torch.py::TorchTests::test_horovod_join_allreduce 
Error: Process completed with exit code 124.

@EnricoMi
Collaborator

EnricoMi commented Dec 20, 2021

@maxhgerlach is the above macOS flakiness related to your fixes in #3300 / #3291?

With timestamps:

Mon, 20 Dec 2021 06:33:33 GMT [1,0]<stdout>:test_torch.py::TorchTests::test_horovod_join_allgather [1,0]<stdout>:PASSED[1,1]<stdout>:PASSED[1,0]<stdout>:
Mon, 20 Dec 2021 06:33:33 GMT [1,0]<stdout>:test_torch.py::TorchTests::test_horovod_join_allreduce [1,1]<stdout>:
Mon, 20 Dec 2021 06:38:31 GMT [1,1]<stdout>:test_torch.py::TorchTests::test_horovod_join_allreduce 

The job times out because the test does not finish.

@maxhgerlach
Collaborator

Hmm, we saw a similar problem with join and allgather (#3291 (comment)), which apparently went away with @Tixxx's fix in #3313. If this one is with allreduce though, the cause may be different.

We observed another hang with torch 1.10 that went away by skipping a specific test: #3314

Something's fishy here.

@ashahab ashahab force-pushed the abin-blacklist-resurrection branch 2 times, most recently from fc0f6d8 to cf1df3e on January 4, 2022 17:27
@ashahab
Collaborator Author

ashahab commented Jan 5, 2022

@maxhgerlach I see these GPU Buildkite failures in other PRs too; what can be done about this?

@EnricoMi
Collaborator

EnricoMi commented Jan 5, 2022

Which other PRs have the same failures? You can ignore the "Unit Test Results (with flaky tests)" failures, but there are no "Unit Test Results" failures in master (https://github.com/horovod/horovod/runs/4696939351), and there haven't been any for the last four commits (https://github.com/horovod/horovod/actions/workflows/ci.yaml?query=branch%3Amaster+event%3Apush).

It looks like your branch is already rebased onto master HEAD. Maybe your changes are triggering some hidden issue, or introducing one.

@ashahab ashahab force-pushed the abin-blacklist-resurrection branch from cf1df3e to 6172d1e on January 5, 2022 17:18
@ashahab
Collaborator Author

ashahab commented Jan 6, 2022

Thanks for the tip, @EnricoMi. I figured out the issue; we need to look at why Ray tests become flaky on GPU heads. I created the following issue: #3347

@tgaddair @zw0610 can you review this PR?

@zw0610
Contributor

zw0610 commented Jan 6, 2022

LGTM. An additional question: if the hostname of a worker is resurrected after the cooldown period but the worker is still down, then since the worker is not reporting itself, I suppose this worker will not be added to the group until it starts and reports itself. Am I right?

@ashahab
Collaborator Author

ashahab commented Jan 6, 2022

@zw0610 No, the nodes would still be added even if they are down. The discovery script is supposed to figure out which nodes are up and return only healthy nodes. If we want to separate the discovery of healthy nodes from the discovery script, we can design in a health check hook, but that is not included in this PR.
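
For illustration, here is a minimal, hypothetical discovery script of the kind described above: it prints one "hostname:slots" line per host that is currently reachable, so hosts that are down are simply never reported (and therefore never re-added, even after their cooldown expires). The host names, slot count, and ping-based health probe are assumptions, not part of this PR.

#!/usr/bin/env python
# Hypothetical discover_hosts.py: report only healthy hosts to Horovod elastic.
import subprocess

CANDIDATE_HOSTS = ['worker-0', 'worker-1', 'worker-2']  # illustrative host names
SLOTS_PER_HOST = 4                                      # illustrative slot count

for host in CANDIDATE_HOSTS:
    # Any health probe will do; a single ping is used here for brevity.
    reachable = subprocess.call(
        ['ping', '-c', '1', '-W', '1', host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0
    if reachable:
        print('{}:{}'.format(host, SLOTS_PER_HOST))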

@zw0610
Contributor

zw0610 commented Jan 6, 2022

@zw0610 No, the nodes would still be added even if they are down. The discovery script is supposed to figure out which nodes are up and return only healthy nodes. If we want to separate the discovery of healthy nodes from the discovery script, we can design in a health check hook, but that is not included in this PR.

@ashahab Sorry for not explaining clearly. What I meant is a worker that is not reported by the discover_host.sh script but is already past its cooldown time. I suppose the HostManager checks whether a worker should still be blacklisted, or be whitelisted again, only after this worker is reported by the discover_host.sh script again.

@ashahab
Collaborator Author

ashahab commented Jan 11, 2022

@zw0610 that is an interesting test case; I'll add it in.
However, the answer to that is no: hosts not in discovery script would never get added, as they all come from discovery.host_slots:

        def whitelist_all_hosts():
            # Whitelist every host reported by discovery whose cooldown has expired.
            for host in host_slots.keys():
                if self._hosts_state[host].is_resurrected():
                    self._hosts_state[host].whitelist()

        def has_resurrected_hosts():
            # True if any host reported by discovery has passed its cooldown.
            resurrected_hosts = [host for host in host_slots.keys() if self._hosts_state[host].is_resurrected()]
            return len(resurrected_hosts) > 0
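
To connect these helpers to the cooldown discussion, here is a minimal, hypothetical sketch of a per-host state object with blacklist/resurrection semantics like those described in this thread. It is not the actual discovery.py implementation; the class name, the fixed cooldown, and the method bodies are illustrative assumptions.

import time

class SimpleHostState:
    # Illustrative per-host state; not Horovod's actual implementation.
    def __init__(self, cooldown_seconds=10.0):
        self._cooldown = cooldown_seconds
        self._blacklisted = False
        self._blacklist_until = 0.0
        self._resurrected = False

    def blacklist(self):
        # Called when the host fails; it stays blacklisted for the cooldown.
        self._blacklisted = True
        self._resurrected = False
        self._blacklist_until = time.time() + self._cooldown

    def is_blacklisted(self):
        if self._blacklisted and time.time() >= self._blacklist_until:
            # Cooldown expired: mark the host as "resurrected". It is only
            # whitelisted again once the discovery script reports it.
            self._blacklisted = False
            self._resurrected = True
        return self._blacklisted

    def is_resurrected(self):
        return self._resurrected

    def whitelist(self):
        self._blacklisted = False
        self._resurrected = False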

@zw0610
Contributor

zw0610 commented Jan 12, 2022

hosts not in discovery script would never get added

@ashahab That's great! It's exactly the behavior I expect from horovod elastic.

Collaborator

@chongxiaoc chongxiaoc left a comment

LGTM. Need @tgaddair to take one more look.

Review comments on: docs/elastic.rst, horovod/ray/elastic_v2.py, horovod/runner/elastic/discovery.py (all resolved)
@chongxiaoc
Collaborator

Hi @ashahab, have you addressed @EnricoMi's change request? We can land this PR soon.

@ashahab
Collaborator Author

ashahab commented Jan 20, 2022

@chongxiaoc I just pushed the changes; let's wait for the tests. Thank you!

Review comments on: docs/elastic.rst, horovod/ray/elastic_v2.py, horovod/runner/elastic/discovery.py (all resolved)
@EnricoMi
Collaborator

@ashahab Updated my review. Can you please update the PR description, especially this part:

For repeat failures the cooldown period grows with an exponential backoff delay: 10s, 20s, 30s. Cooldown period is capped
at 5 minutes.

I think you are referring to the default here, which is different. Also, the example is not exponential.

Review comment on: horovod/ray/elastic_v2.py (resolved)
@ashahab ashahab force-pushed the abin-blacklist-resurrection branch 2 times, most recently from 13c2134 to cf5dd5c on January 21, 2022 07:49
Review comment on: horovod/runner/elastic/discovery.py (resolved)
This adds support for resurrecting blacklisted hosts in elastic mode.
Currently hosts that get blacklisted remain in the blacklist for the lifetime of the job.
This cannot handle transient host failures or a scale-up after a scale-down.
This is especially the case for the Kubeflow mpi-operator on Kubernetes, as it always
gives pods known hostnames from its hostfile.

This patch will allow blacklisted hosts to become whitelisted after a configured cooldown period.
Cooldown periods can be configured with the ``--blacklist-cooldown-range`` parameter like this:

.. code-block:: bash

    $ horovodrun -np 8 --blacklist-cooldown-range 10 100 --min-np 4 --max-np 12 --host-discovery-script discover_hosts.py python train.py

The above example configures the minimum cooldown period to 10 seconds and the maximum cooldown period to 100 seconds.
The initial cooldown period would be 10 seconds. For repeated failures the cooldown period would grow with an exponential
backoff delay: 10s, 20s, 40s, and so on. However, the maximum cooldown period would be capped at 100 seconds, regardless
of failure count. The default behavior is to have no cooldown period, and blacklisted hosts would remain in the blacklist.

Signed-off-by: Abin Shahab <ashahab@linkedin.com>
Collaborator

@EnricoMi EnricoMi left a comment

LGTM!

@EnricoMi
Collaborator

EnricoMi commented Jan 22, 2022

@ashahab the docs are failing: https://readthedocs.org/projects/horovod/builds/15848706/

Never mind, looks like this exists in master. Has been fixed in #3377.

@EnricoMi EnricoMi merged commit d261b5e into horovod:master Jan 24, 2022
Successfully merging this pull request may close these issues.

Implement more forgiving host blacklist policy in elastic mode