
[dask] find all needed ports in each host at once (fixes #4458) #4498

Merged: 4 commits merged into master from fix/port-collisions on Aug 3, 2021

Conversation

@jmoralez (Collaborator) commented on Aug 1, 2021:

This replaces my initial proposal (#3823) of using client.run to find an open port in each worker, since that could produce collisions (#4057, #4458) when one host had more than one worker process running in it (as is the case with LocalCluster). The collisions problem was solved in #4133; however, it would be desirable to not have any collisions at all, which is what this (hopefully) achieves.
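A rough sketch of the approach (illustrative only, not the merged code; the actual helpers, _group_workers_by_host() and _find_n_open_ports(), are referenced later in this thread): group the workers by host, then run one task per host that binds n sockets to port 0 at the same time, so the operating system hands back n distinct free ports in a single call.

import socket
from collections import defaultdict
from urllib.parse import urlparse


def group_workers_by_host(worker_addresses):
    """Map each host to the Dask worker addresses running on it."""
    host_to_workers = defaultdict(list)
    for address in worker_addresses:
        host_to_workers[urlparse(address).hostname].append(address)
    return host_to_workers


def find_n_open_ports(n):
    """Find n currently-free ports by binding n sockets before closing any of them."""
    open_sockets = [socket.socket(socket.AF_INET, socket.SOCK_STREAM) for _ in range(n)]
    try:
        for s in open_sockets:
            s.bind(("", 0))
        return [s.getsockname()[1] for s in open_sockets]
    finally:
        for s in open_sockets:
            s.close()


# One task per host, pinned to any worker on that host (client is a dask.distributed.Client):
# ports_per_host = {
#     host: client.submit(find_n_open_ports, len(addresses), workers=addresses, pure=False)
#     for host, addresses in group_workers_by_host(worker_addresses).items()
# }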

@jmoralez changed the title from "[dask] find all needed ports in each worker at once" to "[dask] find all needed ports in each host at once" on Aug 1, 2021
@jameslamb added the fix label on Aug 2, 2021
@jameslamb (Collaborator) left a comment:

This is a great improvement, thank you! I totally support the approach you took here, and I think it'll improve the stability and speed of the Dask package and its tests.

I've added (fixes #4458) because I think this should fix #4458.

I just left a few minor comments for your consideration. I'd also like to test this on a true distributed cluster (like with dask-cloudprovider) using nprocs=2, just to be sure there isn't anything we're missing that only works with LocalCluster. I can do that in the next day or two.

(7 resolved review comment threads on python-package/lightgbm/dask.py, now outdated)
@jameslamb changed the title from "[dask] find all needed ports in each host at once" to "[dask] find all needed ports in each host at once (fixes #4458)" on Aug 2, 2021
@jameslamb (Collaborator) commented:

I tested this tonight with dask-cloudprovider, and it worked great! I ran a FargateCluster with n_workers=3 (three ECS tasks) and nprocs=2 (two worker processes per ECS task).

The image below shows that one _find_n_open_ports() task ran per ECS task, but one _train_part() task ran per worker process.

How I tested this:

I used the example code and setup at https://github.com/jameslamb/lightgbm-dask-testing/blob/3837f4038ce7393396e11d2c1f3cc78547c28f1a/notebooks/demo-aws.ipynb, but with the cluster construction modified like this:

from dask.distributed import Client
from dask_cloudprovider.aws import FargateCluster

n_workers = 3
cluster = FargateCluster(
    image=CONTAINER_IMAGE,
    worker_cpu=512,
    worker_mem=4096,
    n_workers=n_workers,
    fargate_use_private_ip=False,
    scheduler_timeout="40 minutes",
    find_address_timeout=60 * 10,
    worker_extra_args=["--nprocs=2"]
)
client = Client(cluster)
client.wait_for_workers(n_workers)
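
For context, a training call against this cluster then looks roughly like the following; this is a minimal sketch, not the code from the linked notebook, and the data and estimator settings are made up (only the lightgbm.dask usage pattern matters):

import dask.array as da
import lightgbm as lgb

# Illustrative data; any Dask collection works. fit() is what triggers the
# per-host port search added in this PR.
X = da.random.random((10_000, 20), chunks=(1_000, 20))
y = da.random.random((10_000,), chunks=(1_000,))

model = lgb.DaskLGBMRegressor(client=client, n_estimators=10)
model.fit(X, y)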

[screenshot: one _find_n_open_ports() task per ECS task, one _train_part() task per worker process]

Here's the value of machines after running training (split onto one item per line).

172.31.63.34:37953
172.31.63.34:49763
172.31.5.137:43329
172.31.5.137:39591
172.31.2.25:48257
172.31.2.25:37347

These worked and didn't conflict with the ports Dask was using for communication between the workers and the scheduler.
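
For reference, the machines parameter itself is just those entries joined with commas, one host:port pair per worker process. A small illustration using the addresses above (not code from this PR):

# Each worker process gets one of the ports found for its host; `machines`
# is the comma-separated list of the resulting host:port pairs.
host_to_ports = {
    "172.31.63.34": [37953, 49763],
    "172.31.5.137": [43329, 39591],
    "172.31.2.25": [48257, 37347],
}
machines = ",".join(
    f"{host}:{port}" for host, ports in host_to_ports.items() for port in ports
)
# machines == "172.31.63.34:37953,172.31.63.34:49763,172.31.5.137:43329,..."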


Great work!!!

@jmoralez (Collaborator, Author) commented on Aug 3, 2021:

Haha thanks. I hope we don't ever see a collision in the ports again.

@jameslamb (Collaborator) commented:

I'll manually restart the failing Azure DevOps jobs. Looks like they failed to clone some of the submodules.

e.g. https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=10637&view=logs&j=c2f9361f-3c13-57db-3206-cee89820d5e3&t=7ece8bab-5221-51ef-06d4-fdfda5931fc0

HEAD is now at e4ba7832 Merge 7fdf8aec5a62447cf2a7c00dcc35c54051b5d266 into 75e486a6fa2a02a76024b4622d7aba3e13084ad4
git submodule sync
git -c http.https://github.com.extraheader="AUTHORIZATION: basic ***" submodule update --init --force
Submodule 'include/boost/compute' (https://github.com/boostorg/compute) registered for path 'external_libs/compute'
Submodule 'eigen' (https://gitlab.com/libeigen/eigen.git) registered for path 'external_libs/eigen'
Submodule 'external_libs/fast_double_parser' (https://github.com/lemire/fast_double_parser.git) registered for path 'external_libs/fast_double_parser'
Submodule 'external_libs/fmt' (https://github.com/fmtlib/fmt.git) registered for path 'external_libs/fmt'
Cloning into '/home/vsts/work/1/s/external_libs/compute'...
Cloning into '/home/vsts/work/1/s/external_libs/eigen'...
error: 2465 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output
fatal: clone of 'https://gitlab.com/libeigen/eigen.git' into submodule path '/home/vsts/work/1/s/external_libs/eigen' failed
Failed to clone 'external_libs/eigen'. Retry scheduled
Cloning into '/home/vsts/work/1/s/external_libs/fast_double_parser'...
Cloning into '/home/vsts/work/1/s/external_libs/fmt'...
Cloning into '/home/vsts/work/1/s/external_libs/eigen'...
error: 5919 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output
fatal: clone of 'https://gitlab.com/libeigen/eigen.git' into submodule path '/home/vsts/work/1/s/external_libs/eigen' failed
Failed to clone 'external_libs/eigen' a second time, aborting

@jameslamb self-requested a review on Aug 3, 2021 at 15:13
@jameslamb merged commit 5fe27d5 into microsoft:master on Aug 3, 2021
The following post-merge review comments refer to this snippet from the Dask tests (loop body omitted here):

workers = client.scheduler_info()['workers'].keys()
n_workers = len(workers)
host_to_workers = lgb.dask._group_workers_by_host(workers)
for _ in range(1_000):
A collaborator commented:

This line was probably the reason the QEMU CI job on master just exceeded its time limit. Will keep an eye on it.

A collaborator replied:

I think you're probably right!

I didn't consider it, since I've gotten into the habit of just thinking "well, sometimes setup on QEMU builds is really slow, and it's unpredictable whether that job will take 90 minutes or 120 minutes".

But I just compared the logs on latest master (with this PR merged) to the logs from the previous master build, and the Dask tests took about 22 minutes longer to run than they did previously.

I'd support a PR to drop this to range(100) or even range(25).

commit with this PR (5fe27d5)

2021-08-04T12:35:22.0970407Z ../tests/python_package_test/test_consistency.py ......                  [  9%]
2021-08-04T12:40:11.5075701Z ../tests/python_package_test/test_dask.py .............................. [ 14%]
2021-08-04T12:50:29.7834725Z ........................................................................ [ 26%]
2021-08-04T13:08:08.7403444Z ........................................................................ [ 37%]
2021-08-04T13:26:40.1951149Z ......s...............s...............s...............s................. [ 49%]
2021-08-04T13:36:51.6836691Z ..................................s..................................... [ 61%]
2021-08-04T13:38:19.0058818Z s.......................s........                                        [ 66%]
2021-08-04T13:38:19.0156018Z ../tests/python_package_test/test_dual.py s                              [ 66%]

previous commit on master (1dbf438)

2021-08-03T21:00:22.0875288Z ../tests/python_package_test/test_consistency.py ......                  [  9%]
2021-08-03T21:03:26.1542925Z ../tests/python_package_test/test_dask.py .............................. [ 14%]
2021-08-03T21:09:08.6432850Z ........................................................................ [ 26%]
2021-08-03T21:18:07.7971801Z ........................................................................ [ 37%]
2021-08-03T21:32:02.3425272Z ......s...............s...............s...............s................. [ 49%]
2021-08-03T21:38:51.5159320Z ..................................s..................................... [ 61%]
2021-08-03T21:39:48.7253040Z s.......................s........                                        [ 66%]
2021-08-03T21:39:48.7338523Z ../tests/python_package_test/test_dual.py s                              [ 66%]

@jmoralez (Collaborator, Author) replied:

I just added @jameslamb's suggestion in #4501

@jmoralez deleted the fix/port-collisions branch on Aug 4, 2021 at 15:26
@github-actions (bot):

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions bot locked this pull request as resolved and limited conversation to collaborators on Aug 23, 2023