Support resurrecting blacklisted hosts #3319

Merged
merged 1 commit into horovod:master on Jan 24, 2022

Conversation

ashahab
Collaborator

@ashahab ashahab commented Dec 13, 2021

This adds support for resurrecting blacklisted hosts in elastic mode.
Currently hosts that get blacklisted remain in the blacklist for the lifetime of the job.
This cannot handle transient host failures or a scale-up after a scale-down.
This is especially the case for the Kubeflow mpi-operator on Kubernetes, as it always
gives pods known hostnames from its hostfile.

This patch will allow blacklisted hosts to become whitelisted after a configured cooldown period.
Users can configure the cooldown range by providing the --blacklist-cooldown-range parameter like this:

$ horovodrun -np 8 --blacklist-cooldown-range 10 100 --min-np 4 --max-np 12 --host-discovery-script discover_hosts.py python train.py

The above example configures the minimum cooldown period to 10 seconds and the maximum cooldown period to 100 seconds.
The initial cooldown period would be 10 seconds. For repeated failures the cooldown period would grow with an exponential
backoff delay (with a constant exponent of 2): 10s, 20s, 40s, and so on. However, the maximum cooldown period would be
capped at 100 seconds, regardless of failure count. A random backoff fraction of the cooldown lower limit is added
to the cooldown delay.
The default behavior is to have no cooldown period, and blacklisted hosts would remain in the blacklist.
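
To make the backoff arithmetic concrete, here is a small sketch of how such a cooldown delay could be computed. This is not the code in this PR; the function name, signature, and jitter formula are illustrative assumptions based on the description above.

import random

def cooldown_delay(failure_count, lower_limit=10.0, upper_limit=100.0):
    # Exponential backoff with a constant exponent of 2, capped at the upper limit.
    delay = min(lower_limit * (2 ** (failure_count - 1)), upper_limit)
    # A random fraction of the cooldown lower limit is added as jitter.
    return delay + random.uniform(0.0, 1.0) * lower_limit

# With --blacklist-cooldown-range 10 100, successive failures yield roughly
# 10s, 20s, 40s, 80s, 100s, 100s, ... plus up to 10s of jitter.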

Checklist before submitting

  • [x] Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Fixes #1926 (issue).

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

@ashahab
Collaborator Author

ashahab commented Dec 13, 2021

@tgaddair @zw0610 This implements #1926. Please take a look.

@github-actions

github-actions bot commented Dec 14, 2021

Unit Test Results

773 files (-28)   773 suites (-28)   8h 55m 22s ⏱️ (-1m 46s)
722 tests (+5)   674 ✔️ (+4)   48 💤 (+1)   0 failed (±0)
16 693 runs (-617)   11 772 ✔️ (-392)   4 921 💤 (-225)   0 failed (±0)

Results for commit 20e1977. ± Comparison against base commit cce4207.

♻️ This comment has been updated with latest results.

@github-actions

github-actions bot commented Dec 14, 2021

Unit Test Results (with flaky tests)

865 files (-78)   865 suites (-78)   9h 44m 28s ⏱️ (-1h 46m 55s)
722 tests (+5)   673 ✔️ (+7)   48 💤 (+1)   1 failed (-3)
18 913 runs (-1 561)   13 192 ✔️ (-1 187)   5 719 💤 (-365)   2 failed (-9)

For more details on these failures, see this check.

Results for commit 20e1977. ± Comparison against base commit cce4207.

♻️ This comment has been updated with latest results.

@ashahab
Collaborator Author

ashahab commented Dec 14, 2021

Some of the tests are failing with exit code 124/SIGKILL. I'll look at the CI scripts to see whether they time-bound the tests (some of my tests take a long time because they wait for the cooldown).

@zw0610
Contributor

zw0610 commented Dec 15, 2021

Generally LGTM. Just want to make sure the tests go well.

@ashahab ashahab force-pushed the abin-blacklist-resurrection branch 4 times, most recently from e46dcf0 to c7876d3 on December 20, 2021 04:09
@ashahab
Collaborator Author

ashahab commented Dec 20, 2021

@EnricoMi I'm getting a flaky error in one of the macOS tests. Do you know if this has something to do with the time limits on the tests?

[1,0]<stdout>:test_torch.py::TorchTests::test_horovod_join_allreduce [1,1]<stdout>:
[1,1]<stdout>:test_torch.py::TorchTests::test_horovod_join_allreduce 
Error: Process completed with exit code 124.

@EnricoMi
Collaborator

EnricoMi commented Dec 20, 2021

@maxhgerlach is the above macOS flakiness related to your fixes in #3300 / #3291?

With timestamps:

Mon, 20 Dec 2021 06:33:33 GMT [1,0]<stdout>:test_torch.py::TorchTests::test_horovod_join_allgather [1,0]<stdout>:PASSED[1,1]<stdout>:PASSED[1,0]<stdout>:
Mon, 20 Dec 2021 06:33:33 GMT [1,0]<stdout>:test_torch.py::TorchTests::test_horovod_join_allreduce [1,1]<stdout>:
Mon, 20 Dec 2021 06:38:31 GMT [1,1]<stdout>:test_torch.py::TorchTests::test_horovod_join_allreduce 

The job times out because the test does not finish.

@maxhgerlach
Collaborator

Hmm, we saw a similar problem with join and allgather (#3291 (comment)), which apparently went away with @Tixxx's fix in #3313. If this one is with allreduce though, the cause may be different.

We observed another hang with torch 1.10 that went away by skipping a specific test: #3314

Something's fishy here.

@ashahab ashahab force-pushed the abin-blacklist-resurrection branch 2 times, most recently from fc0f6d8 to cf1df3e on January 4, 2022 17:27
@ashahab
Collaborator Author

ashahab commented Jan 5, 2022

@maxhgerlach I see these GPU Buildkite failures in other PRs too; what can be done about this?

@EnricoMi
Collaborator

EnricoMi commented Jan 5, 2022

Which other PRs have the same failures? You can ignore the "Unit Test Results (with flaky tests)" failures, but there are no "Unit Test Results" failures in master (https://github.com/horovod/horovod/runs/4696939351), and there haven't been any for the last four commits (https://github.com/horovod/horovod/actions/workflows/ci.yaml?query=branch%3Amaster+event%3Apush).

It looks like your branch is already rebased onto master HEAD. Maybe your changes are triggering some hidden issue, or introducing one.

@ashahab ashahab force-pushed the abin-blacklist-resurrection branch from cf1df3e to 6172d1e on January 5, 2022 17:18
@ashahab
Collaborator Author

ashahab commented Jan 6, 2022

Thanks for the tip, @EnricoMi. I figured out the issue; we need to look at why Ray tests become flaky on GPU heads. I created the following issue: #3347

@tgaddair @zw0610 can you review this PR?

@zw0610
Contributor

zw0610 commented Jan 6, 2022

LGTM. An additional question: if the hostname of a worker is resurrected after the cooldown period but the worker is still down, then since the worker is not reporting itself, I suppose this worker will not be added to the group until it starts and reports itself. Am I right?

@ashahab
Collaborator Author

ashahab commented Jan 6, 2022

@zw0610 No, the nodes would still be added even if they are down. The discovery script is supposed to figure out which nodes are up and return only healthy nodes. If we want to separate the discovery of healthy nodes from the discovery script, we can design in a health check hook, but that is not included in this PR.
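
For illustration, here is a minimal, hypothetical discovery script of the kind described above: it prints one "hostname:slots" line per host that is currently reachable, so hosts that are down are simply never reported (and therefore never re-added, even after their cooldown expires). The host names, slot count, and ping-based health probe are assumptions, not part of this PR.

#!/usr/bin/env python
# Hypothetical discover_hosts.py: report only healthy hosts to Horovod elastic.
import subprocess

CANDIDATE_HOSTS = ['worker-0', 'worker-1', 'worker-2']  # illustrative host names
SLOTS_PER_HOST = 4                                      # illustrative slot count

for host in CANDIDATE_HOSTS:
    # Any health probe will do; a single ping is used here for brevity.
    reachable = subprocess.call(
        ['ping', '-c', '1', '-W', '1', host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0
    if reachable:
        print('{}:{}'.format(host, SLOTS_PER_HOST))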

@zw0610
Contributor

zw0610 commented Jan 6, 2022

@zw0610 No, the nodes would still be added even if they are down. The discovery script is supposed to figure out which nodes are up and return only healthy nodes. If we want to separate the discovery of healthy nodes from the discovery script, we can design in a health check hook, but that is not included in this PR.

@ashahab Sorry for not explaining clearly. What I meant is a worker that is not reported by the discover_host.sh script but is already past its cooldown time. I suppose the HostManager checks whether a worker should still be blacklisted, or be whitelisted again, only after this worker is reported by the discover_host.sh script again.

@ashahab
Collaborator Author

ashahab commented Jan 11, 2022

@zw0610 that is an interesting test case; I'll add it in.
However, the answer to that is no: hosts not in discovery script would never get added, as they all come from discovery.host_slots:

        def whitelist_all_hosts():
            # Whitelist every host reported by discovery whose cooldown has expired.
            for host in host_slots.keys():
                if self._hosts_state[host].is_resurrected():
                    self._hosts_state[host].whitelist()

        def has_resurrected_hosts():
            # True if any host reported by discovery has passed its cooldown.
            resurrected_hosts = [host for host in host_slots.keys() if self._hosts_state[host].is_resurrected()]
            return len(resurrected_hosts) > 0
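
To connect these helpers to the cooldown discussion, here is a minimal, hypothetical sketch of a per-host state object with blacklist/resurrection semantics like those described in this thread. It is not the actual discovery.py implementation; the class name, the fixed cooldown, and the method bodies are illustrative assumptions.

import time

class SimpleHostState:
    # Illustrative per-host state; not Horovod's actual implementation.
    def __init__(self, cooldown_seconds=10.0):
        self._cooldown = cooldown_seconds
        self._blacklisted = False
        self._blacklist_until = 0.0
        self._resurrected = False

    def blacklist(self):
        # Called when the host fails; it stays blacklisted for the cooldown.
        self._blacklisted = True
        self._resurrected = False
        self._blacklist_until = time.time() + self._cooldown

    def is_blacklisted(self):
        if self._blacklisted and time.time() >= self._blacklist_until:
            # Cooldown expired: mark the host as "resurrected". It is only
            # whitelisted again once the discovery script reports it.
            self._blacklisted = False
            self._resurrected = True
        return self._blacklisted

    def is_resurrected(self):
        return self._resurrected

    def whitelist(self):
        self._blacklisted = False
        self._resurrected = False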

@zw0610
Contributor

zw0610 commented Jan 12, 2022

hosts not in discovery script would never get added

@ashahab That's great! It's exactly the behavior I expect from horovod elastic.

Collaborator

@chongxiaoc chongxiaoc left a comment

LGTM. Need @tgaddair to take one more look.

Review comments on: docs/elastic.rst, horovod/ray/elastic_v2.py, horovod/runner/elastic/discovery.py (all resolved)
@chongxiaoc
Collaborator

Hi @ashahab, have you addressed @EnricoMi's change request? We can land this PR soon.

@ashahab
Collaborator Author

ashahab commented Jan 20, 2022

@chongxiaoc I just pushed the changes; let's wait for the tests. Thank you!

Review comments on: docs/elastic.rst, horovod/ray/elastic_v2.py, horovod/runner/elastic/discovery.py (all resolved)
@EnricoMi
Collaborator

@ashahab Updated my review. Can you please update the PR description, especially this part:

For repeat failures the cooldown period grows with an exponential backoff delay: 10s, 20s, 30s. Cooldown period is capped
at 5 minutes.

I think you are referring to the default here, which is different. Also, the example is not exponential.

Review comment on: horovod/ray/elastic_v2.py (resolved)
@ashahab ashahab force-pushed the abin-blacklist-resurrection branch 2 times, most recently from 13c2134 to cf5dd5c on January 21, 2022 07:49
Review comment on: horovod/runner/elastic/discovery.py (resolved)
This adds support for resurrecting blacklisted hosts in elastic mode.
Currently hosts that get blacklisted remain in the blacklist for the lifetime of the job.
This cannot handle transient host failures or a scale-up after a scale-down.
This is especially the case for the Kubeflow mpi-operator on Kubernetes, as it always
gives pods known hostnames from its hostfile.

This patch will allow blacklisted hosts to become whitelisted after a configured cooldown period.
Cooldown periods can be configured with the ``--blacklist-cooldown-range`` parameter like this:

.. code-block:: bash

    $ horovodrun -np 8 --blacklist-cooldown-range 10 100 --min-np 4 --max-np 12 --host-discovery-script discover_hosts.py python train.py

The above example configures the minimum cooldown period to 10 seconds and the maximum cooldown period to 100 seconds.
The initial cooldown period would be 10 seconds. For repeated failures the cooldown period would grow with an exponential
backoff delay: 10s, 20s, 40s, and so on. However, the maximum cooldown period would be capped at 100 seconds, regardless
of failure count. The default behavior is to have no cooldown period, and blacklisted hosts would remain in the blacklist.

Signed-off-by: Abin Shahab <ashahab@linkedin.com>
Collaborator

@EnricoMi EnricoMi left a comment

LGTM!

@EnricoMi
Collaborator

EnricoMi commented Jan 22, 2022

@ashahab the docs are failing: https://readthedocs.org/projects/horovod/builds/15848706/

Never mind, looks like this exists in master. Has been fixed in #3377.

@EnricoMi EnricoMi merged commit d261b5e into horovod:master Jan 24, 2022
Successfully merging this pull request may close these issues.

Implement more forgiving host blacklist policy in elastic mode