[dask] find all needed ports in each host at once (fixes #4458) #4498
Conversation
This is a great improvement, thank you! I totally support the approach you took here, and I think it'll improve the stability and speed of the Dask package and its tests.
I've added (fixes #4458)
because I think this should fix #4458.
I just left a few minor comments for your consideration. I'd also like to test this on a true distributed cluster (like with `dask-cloudprovider`) using `n_procs=2`, just to be sure there isn't any only-works-with-`LocalCluster` stuff we're missing. I can do that in the next day or two.
I tested this tonight. How I tested this: I used the example code and setup at https://github.com/jameslamb/lightgbm-dask-testing/blob/3837f4038ce7393396e11d2c1f3cc78547c28f1a/notebooks/demo-aws.ipynb, but with the cluster construction modified like this:

```python
from dask.distributed import Client
from dask_cloudprovider.aws import FargateCluster

n_workers = 3
cluster = FargateCluster(
    image=CONTAINER_IMAGE,
    worker_cpu=512,
    worker_mem=4096,
    n_workers=n_workers,
    fargate_use_private_ip=False,
    scheduler_timeout="40 minutes",
    find_address_timeout=60 * 10,
    worker_extra_args=["--nprocs=2"]
)
client = Client(cluster)
client.wait_for_workers(n_workers)
```

Here's the value of [screenshot not captured], which worked and didn't conflict with the ports Dask was using for communication between workers and the scheduler. Great work!!!
Haha thanks. I hope we don't ever see a collision in the ports again.
I'll manually restart the failing Azure DevOps jobs. Looks like they failed to clone some of the submodules.
```python
workers = client.scheduler_info()['workers'].keys()
n_workers = len(workers)
host_to_workers = lgb.dask._group_workers_by_host(workers)
for _ in range(1_000):
```
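For context on what the loop above is stress-testing, here is a minimal sketch of the "find all needed ports on a host at once" idea from the PR title. The function name and details are illustrative, not LightGBM's actual implementation: the key point is that every socket stays open until all ports have been collected, so the OS cannot hand out the same port twice on one host.

```python
import socket
from contextlib import ExitStack

def find_n_open_ports(n):
    """Illustrative sketch: reserve n distinct free TCP ports on this host.

    All sockets remain open until every port has been collected, so the
    operating system cannot assign the same ephemeral port twice.
    """
    ports = []
    with ExitStack() as stack:
        for _ in range(n):
            s = stack.enter_context(
                socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            )
            s.bind(("", 0))  # port 0: ask the OS for any free port
            ports.append(s.getsockname()[1])
    return ports
```

Because the sockets are only closed after the full list is built, the returned ports are guaranteed to be pairwise distinct, which is exactly the property the per-worker search lacked.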
This line was probably the reason the QEMU CI job on `master` just exceeded its time limit. Will keep an eye on it.
I think you're probably right!
I didn't consider it, since I've gotten into the habit of just thinking "well, sometimes setup on QEMU builds is really slow, and it's unpredictable whether that job will take 90 minutes or 120 minutes".
But I just compared the logs on latest `master` (with this PR merged) to the logs from the previous `master` build, and the Dask tests took about 22 minutes longer to run than they did previously.
I'd support a PR to drop this to `range(100)` or even `range(25)`.
commit with this PR (5fe27d5)
- total runtime: 2h7m
- time for Dask tests: 58m08s
- link: https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=10646&view=logs&j=380bc479-1396-5123-5f0f-4e5f44cf253e&t=d4eba53f-d934-5d3f-b687-2f058f88f797

```text
2021-08-04T12:35:22.0970407Z ../tests/python_package_test/test_consistency.py ...... [  9%]
2021-08-04T12:40:11.5075701Z ../tests/python_package_test/test_dask.py .............................. [ 14%]
2021-08-04T12:50:29.7834725Z ........................................................................ [ 26%]
2021-08-04T13:08:08.7403444Z ........................................................................ [ 37%]
2021-08-04T13:26:40.1951149Z ......s...............s...............s...............s................. [ 49%]
2021-08-04T13:36:51.6836691Z ..................................s..................................... [ 61%]
2021-08-04T13:38:19.0058818Z s.......................s........ [ 66%]
2021-08-04T13:38:19.0156018Z ../tests/python_package_test/test_dual.py s [ 66%]
```
previous commit on `master` (1dbf438)
- total runtime: 1h15m
- time for Dask tests: 36m22s
- link: https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=10642&view=logs&j=c2f9361f-3c13-57db-3206-cee89820d5e3&t=bf5b33f9-9072-549e-d8b1-cab79a1dd2a9

```text
2021-08-03T21:00:22.0875288Z ../tests/python_package_test/test_consistency.py ...... [  9%]
2021-08-03T21:03:26.1542925Z ../tests/python_package_test/test_dask.py .............................. [ 14%]
2021-08-03T21:09:08.6432850Z ........................................................................ [ 26%]
2021-08-03T21:18:07.7971801Z ........................................................................ [ 37%]
2021-08-03T21:32:02.3425272Z ......s...............s...............s...............s................. [ 49%]
2021-08-03T21:38:51.5159320Z ..................................s..................................... [ 61%]
2021-08-03T21:39:48.7253040Z s.......................s........ [ 66%]
2021-08-03T21:39:48.7338523Z ../tests/python_package_test/test_dual.py s [ 66%]
```
I just added @jameslamb's suggestion in #4501
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
This replaces my initial proposal (#3823) of using `client.run` to find an open port in each worker, since that could produce collisions (#4057, #4458) when one host had more than one worker process running in it (as is the case with `LocalCluster`). The collisions problem was solved in #4133; however, it would be desirable to not have any collisions at all, which is what this (hopefully) achieves.
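To illustrate why the per-worker search could collide, here is a minimal sketch of that style of port discovery (illustrative only, not the actual code from #3823): each worker binds to port 0, records the port the OS picked, and closes the socket immediately, so nothing stops the OS from handing the same port to a second worker process on the same host.

```python
import socket

def find_open_port():
    """Sketch of a per-worker port search: bind to port 0, record the
    port the OS picked, then close the socket immediately."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

# Two worker processes on the same host each call this independently.
# Because each socket is closed before the port is actually used, the OS
# is free to return the same port to both calls, producing a collision.
port_a = find_open_port()
port_b = find_open_port()
```

Finding all of a host's ports in one pass, while holding every socket open until the full set is collected, removes that race entirely.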