
[tests][dask] add scikit-learn compatibility tests (fixes #3894) #3947

Merged: 13 commits, Feb 18, 2021

Conversation

@imjwang (Contributor) commented Feb 12, 2021

sklearn integration tests for dask. This fixes #3894

@ghost commented Feb 12, 2021

CLA assistant check: all CLA requirements met.

@jameslamb changed the title from "tests" to "[dask] add scikit-learn compatibility tests (fixes #3894)" Feb 12, 2021
@jameslamb (Collaborator) left a comment

Thanks very much for working on this! I've updated the PR title and description so that they're descriptive of the work. Linking to the relevant issue in the description helps reviewers to understand the context of why a pull request has been opened. Using an informative title makes the PR easier to understand in the git history and as a changelog entry: https://github.com/microsoft/LightGBM/releases/tag/untagged-edc0a3c487f27c6fbdb9

I'd be happy to review this code once you sign the Contributor License Agreement. Please let me know if you have any questions about it.

@imjwang (Contributor, Author) commented Feb 12, 2021

No problem, I had a lot of fun!
The above link is not working for me.
I also saw that sklearn's compatibility tests deep copy the estimator passed in, and that may be causing problems with the client.

https://github.com/scikit-learn/scikit-learn/blob/31b34b560de57a049dd435dccc55112271322370/sklearn/utils/estimator_checks.py#L2308-L2316

https://github.com/scikit-learn/scikit-learn/blob/5403e9fdaee6d4982c887ce2ae9a62ccd3955fbb/sklearn/base.py#L29

@jameslamb (Collaborator)

> The above link is not working for me.

I'm surprised by that, but don't worry about it. I was just showing that we use some automation that automatically adds one changelog entry for each pull request, with the pull request title as the text.

[screenshot: an auto-generated draft-release changelog entry built from the pull request title]

> I also saw that sklearn's compatibility tests deep copy the estimator passed in, and that may be causing problems with the client.

Can you please explain specifically what you mean by "causing problems with the client"?

@jameslamb (Collaborator) left a comment

I left some suggestions that might help with the failing tests in CI.

Are you able to test this locally? That might be faster for you than relying on our CI for feedback. If you have trouble building lightgbm locally or running the tests locally, let me know and I can help.
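
For reference, assuming lightgbm has been built and installed locally, running the Dask tests is typically something like this from the repository root:

pytest tests/python_package_test/test_dask.py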

"estimator, check",
_generate_checks_per_estimator(_yield_all_checks, _tested_estimators()),
)
def test_sklearn_integration(estimator, check):
@jameslamb (Collaborator)

please pass the client fixture into this test and the one below. It should fix the errors like this that I see in CI.

E ValueError: No clients found
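
For anyone reading along, a minimal sketch of the fixture pattern being suggested here (the actual client fixture used by test_dask.py may be defined differently; the cluster parameters below are illustrative):

import pytest
from distributed import Client, LocalCluster


@pytest.fixture()
def client():
    # yield a live Dask client so the estimators under test can find
    # a scheduler instead of failing with "No clients found"
    with LocalCluster(n_workers=2, threads_per_worker=1) as cluster:
        with Client(cluster) as dask_client:
            yield dask_client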

@imjwang (Contributor, Author) Feb 12, 2021

Are you referring to test_dask.py? I have built lightgbm on WSL Ubuntu 18.04 and I can run test_dask.py and test_sklearn.py

tests/python_package_test/test_dask.py — review thread (outdated, resolved)
imjwang and others added 2 commits February 12, 2021 13:12
Co-authored-by: James Lamb <jaylamb20@gmail.com>
@imjwang
Copy link
Contributor Author

imjwang commented Feb 12, 2021

Several of the tests output this error when I run locally
x = <Task pending name='Task-2875' coro=<Client._handle_report() running at /home/jwng/anaconda3/lib/python3.8/site-packages/distributed/client.py:1223> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f6cc43a72b0>()]>> memo = {140104918447248: <[AttributeError("'Client' object has no attribute '_scheduler_identity'") raised in repr()] Client ...04896: [{}, defaultdict(<function Client.__init__.<locals>.<lambda> at 0x7f6cc43afa60>, {})], 140105126224448: [], ...} _nil = []

    def deepcopy(x, memo=None, _nil=[]):
        """Deep copy operation on arbitrary Python objects.

        See the module's __doc__ string for more info.
        """

        if memo is None:
            memo = {}

        d = id(x)
        y = memo.get(d, _nil)
        if y is not _nil:
            return y

        cls = type(x)

        copier = _deepcopy_dispatch.get(cls)
        if copier is not None:
            y = copier(x, memo)
        else:
            if issubclass(cls, type):
                y = _deepcopy_atomic(x, memo)
            else:
                copier = getattr(x, "__deepcopy__", None)
                if copier is not None:
                    y = copier(memo)
                else:
                    reductor = dispatch_table.get(cls)
                    if reductor:
                        rv = reductor(x)
                    else:
                        reductor = getattr(x, "__reduce_ex__", None)
                        if reductor is not None:
>                           rv = reductor(4)

E                           TypeError: cannot pickle '_asyncio.Task' object

../anaconda3/lib/python3.8/copy.py:161: TypeError
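
A minimal standalone reproduction of that TypeError, independent of Dask and LightGBM: copy.deepcopy() falls back to pickling via __reduce_ex__, and asyncio tasks (like the Client._handle_report task visible in the locals above) refuse to be pickled.

import asyncio
import copy


async def main():
    # any pending asyncio task will do; a distributed.Client holds several
    task = asyncio.create_task(asyncio.sleep(0))
    try:
        copy.deepcopy(task)
    except TypeError as err:
        print(err)  # cannot pickle '_asyncio.Task' object
    await task


asyncio.run(main())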

Comment on lines 1118 to 1119
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(('127.0.0.1', 12400))
@jameslamb (Collaborator)

what is the purpose of these socket.socket() lines? Are you just checking that this port is available?

lightgbm.dask already does that automatically, so this is not necessary. Can you please remove these?

@StrikerRUS (Collaborator)

@jameslamb

> The above link is not working for me.
> I'm surprised by that

Draft Releases can be seen only by maintainers.
Try to open your link in Incognito mode to check that.

@jameslamb (Collaborator)

> Draft Releases can be seen only by maintainers.
> Try to open your link in Incognito mode to check that.

Oh, didn't realize! I thought they could only be EDITED by maintainers. OK, makes sense.

@jameslamb (Collaborator)

> Several of the tests output this error when I run locally

Ok, do you need help debugging?

@imjwang (Contributor, Author) commented Feb 12, 2021

> Ok, do you need help debugging?

@jameslamb Yes please, I have not been able to fix the errors. Am I doing something wrong with the clients?

@jameslamb (Collaborator)

> Am I doing something wrong with the clients?

I'll try running this myself a bit. I don't know what the internals of those checks imported from sklearn do, so I have to do some research. A Dask distributed.Client object is not pickle-able, so if it's returned in any result that those tests then try to pickle, we'll have to work around that.

@imjwang (Contributor, Author) commented Feb 13, 2021

> I also saw that sklearn's compatibility tests deep copy the estimator passed in and that may be causing problems with the client.
>
> Can you please explain specifically what you mean by "causing problems with the client"?

My understanding is that one of these functions is run in test_dask.py, and both yield the sklearn tests for different sklearn versions.

@parametrize_with_checks(list(_tested_estimators()))

@pytest.mark.parametrize(
    "estimator, check",
    _generate_checks_per_estimator(_yield_all_checks, _tested_estimators()),
)

I looked at parametrize_with_checks() and the checks on their GitHub here, and it seems like most of them call clone(), which copies the passed-in estimator and also calls copy.deepcopy() on the original estimator's params before running each test on a cloned estimator.
I think they are using copy.deepcopy() from the Python standard library, and I think that is behind the TypeError: cannot pickle '_asyncio.Task' object.
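
For reference, a rough sketch of what clone() does; the real implementation in sklearn/base.py has more validation, but the deep copy of each constructor parameter is the relevant part here:

import copy


def clone_sketch(estimator):
    # simplified sklearn.base.clone(): rebuild the estimator from a
    # deep copy of its constructor parameters
    params = estimator.get_params(deep=False)
    new_params = {name: copy.deepcopy(value) for name, value in params.items()}
    return type(estimator)(**new_params)

So if something non-picklable, such as a distributed.Client, ends up among those parameters, the deepcopy step raises exactly the TypeError above.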

@jameslamb (Collaborator) commented Feb 15, 2021

ok @imjwang I did some investigation. There is a lot here, sorry. Please don't change anything in this PR until @StrikerRUS also gives an opinion on my comments below.


Why so many of the tests in this PR are failing

I think we have to skip any of the tests from https://github.com/scikit-learn/scikit-learn/blob/31b34b560de57a049dd435dccc55112271322370/sklearn/utils/estimator_checks.py that involve calling .fit() or .predict(). Those methods have code in them that passes in numpy arrays, and the Dask estimators can only take in data inputs that are Dask collections.

Many of the tests fail with errors like this

E AttributeError: 'numpy.ndarray' object has no attribute 'to_delayed'

or

E TypeError: Data must be either Dask Array or Dask DataFrame. Got <class 'numpy.ndarray'>.
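
To make that restriction concrete, a small sketch, assuming a local distributed cluster and a LightGBM build that includes the dask module (parameter values are arbitrary):

import dask.array as da
import numpy as np
from distributed import Client, LocalCluster

import lightgbm as lgb

if __name__ == "__main__":
    with Client(LocalCluster(n_workers=2)) as client:
        X = np.random.rand(100, 4)
        y = np.random.randint(2, size=100)

        # Dask collections are accepted
        dX = da.from_array(X, chunks=(50, 4))
        dy = da.from_array(y, chunks=50)
        clf = lgb.DaskLGBMClassifier(n_estimators=5).fit(dX, dy)

        # raw numpy arrays are not, which is what the failing checks hit
        try:
            clf.fit(X, y)
        except TypeError as err:
            print(err)  # Data must be either Dask Array or Dask DataFrame. ...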

What I think should be done in this PR

So I think that for now, the best we can do is specifically select the checks that don't have .fit() or .predict() calls in them. I went through that module, sklearn.utils.estimator_checks, and picked these out:

  • check_parameters_default_constructible()
  • check_get_params_invariance()
  • check_set_params()
  • check_estimator_get_tags_default_keys()

It's not a lot, but at least it will catch a few things.

One nice side-effect of this is that the code for different versions of scikit-learn can be removed.

import sklearn.utils.estimator_checks as sklearn_checks

_check_names = [
    "check_estimator_get_tags_default_keys",
    "check_get_params_invariance",
    "check_set_params"
]
sklearn_checks_to_run = []
for check_name in _check_names:
    check_func = getattr(sklearn_checks, check_name, None)
    if check_func:
        sklearn_checks_to_run.append(check_func)
 
def _tested_estimators():
    for Estimator in [lgb.DaskLGBMClassifier, lgb.DaskLGBMRegressor]:
        yield Estimator()


@pytest.mark.parametrize("estimator", _tested_estimators())
@pytest.mark.parametrize("check", sklearn_checks_to_run)
def test_sklearn_integration(estimator, check, client):
    estimator.set_params(local_listen_port=18000, time_out=5)
    name = type(estimator).__name__
    check(name, estimator)
    client.close(timeout=CLIENT_CLOSE_TIMEOUT)


# this test is separate because it takes a not-yet-constructed estimator
@pytest.mark.parametrize("estimator", list(_tested_estimators()))
def test_parameters_default_constructible(estimator):
    name, Estimator = estimator.__class__.__name__, estimator.__class__
    sklearn_checks.check_parameters_default_constructible(name, Estimator)

I tested this and the tests passed for both scikit-learn 0.23.2 and scikit-learn 0.22.2.
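
If anyone wants to reproduce that locally, one way is selecting just these tests with pytest's -k filter, e.g.:

pytest tests/python_package_test/test_dask.py -k "sklearn_integration or parameters_default_constructible"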

How we can get better coverage of scikit-learn compatibility in the future

I've opened dask/dask-ml#796 in dask-ml, a feature request describing a Dask equivalent of these checks.


Separate Question

I explicitly omitted check_no_attributes_set_in_init() because it fails for the Dask classifiers. @StrikerRUS how is this not also breaking in the sklearn tests?

import lightgbm as lgb
from sklearn.utils.estimator_checks import check_no_attributes_set_in_init
check_no_attributes_set_in_init("LGBMClassifier", lgb.LGBMClassifier())

AssertionError: Estimator LGBMClassifier should not set any attribute apart from parameters during init. Found attributes ['_Booster', '_best_iteration', '_best_score', '_class_map', '_class_weight', '_classes', '_evals_result', '_n_classes', '_n_features', '_n_features_in', '_objective', '_other_params']

@StrikerRUS (Collaborator)

@jameslamb

> I explicitly omitted check_no_attributes_set_in_init() because it fails for the Dask classifiers. @StrikerRUS how is this not also breaking in the sklearn tests?

This test is skipped.

'_xfail_checks': {
    'check_no_attributes_set_in_init':
        'scikit-learn incorrectly asserts that private attributes '
        'cannot be set in __init__: '
        '(see https://github.com/microsoft/LightGBM/issues/2628)'
}

scikit-learn/scikit-learn#16241
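
For context, a sketch of how an estimator can advertise such a skip through scikit-learn's estimator-tags mechanism; the class below is illustrative, not LightGBM's actual code:

from sklearn.base import BaseEstimator


class MyEstimator(BaseEstimator):
    def _more_tags(self):
        # parametrize_with_checks() reads this tag and marks the named
        # checks as expected failures instead of running them
        return {
            "_xfail_checks": {
                "check_no_attributes_set_in_init": "private attributes are set in __init__"
            }
        }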

Ah, I see! The majority of the tests pass numpy arrays into fit(), which is prohibited by the lightgbm.dask API. I should have thought about it earlier, sorry!
It's great that you made the dask/dask-ml#796 feature request! I love it!

TBH, I don't think that applying a few separate tests from the whole scikit-learn test suite makes a lot of sense... Checking "compatibility" in such a way may be confusing and may give us a false sense of security. I believe it's better to think that Dask estimators are not compatible with the scikit-learn API for now.
However, it is not a strong opinion and I'm quite OK with having those few tests in our CI.

@jameslamb (Collaborator) commented Feb 16, 2021

> This test is skipped.

Ah, missed that! Didn't think to look for testing-specific stuff in the sklearn module itself. Got it.

> However, it is not a strong opinion and I'm quite OK with having those few tests in our CI.

I think that we should keep this handful of tests. Some of them capture things that you had to spend time and energy teaching me on #3883. Having those checks in tests might save us similar reviewing effort in the future. If contributors propose a pull request that fails one of those checks, then the tests will tell them what went wrong and what they should fix.

@imjwang could you please update this PR to latest master and change this PR using my suggestion from #3947 (comment)?

@imjwang (Contributor, Author) commented Feb 16, 2021

@jameslamb Yes, sure thing

@imjwang (Contributor, Author) commented Feb 16, 2021

The current code fails the lint test for imports and I'm very confused. The only line I added was import sklearn.utils.estimator_checks as sklearn_checks and I put it in alphabetical order. Is there a guide I can look at?

@jameslamb (Collaborator)

Oh, sorry! We very recently added isort to the linting job.

Run the following to fix it.

pip install isort
isort .
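
If you want to preview the changes before rewriting any files, isort also supports a dry run:

isort --check-only --diff .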

@jameslamb jameslamb self-requested a review February 17, 2021 06:34
@jameslamb (Collaborator) left a comment

looks great, thanks for the contribution! I'd like to request one more review from @StrikerRUS before we merge this.

Comment on lines 1085 to 1094
_check_names = [
    "check_estimator_get_tags_default_keys",
    "check_get_params_invariance",
    "check_set_params"
]
sklearn_checks_to_run = []
for check_name in _check_names:
    check_func = getattr(sklearn_checks, check_name, None)
    if check_func:
        sklearn_checks_to_run.append(check_func)
@StrikerRUS (Collaborator)

Please convert this piece of code into a function. Let's keep code organized.

def sklearn_checks_to_run():
    check_names = [
        "check_estimator_get_tags_default_keys",
        "check_get_params_invariance",
        "check_set_params"
    ]
    for check_name in check_names:
        check_func = getattr(sklearn_checks, check_name, None)
        if check_func:
            yield check_func
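
With that generator, the decorator on the test would presumably become @pytest.mark.parametrize("check", sklearn_checks_to_run()), calling the function so that pytest receives the yielded check functions.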

@imjwang (Contributor, Author) commented Feb 17, 2021

@StrikerRUS Does this work?

@StrikerRUS (Collaborator) left a comment

@imjwang Thank you for your contribution!

@StrikerRUS changed the title from "[dask] add scikit-learn compatibility tests (fixes #3894)" to "[tests][dask] add scikit-learn compatibility tests (fixes #3894)" Feb 18, 2021
@StrikerRUS StrikerRUS merged commit eb5f471 into microsoft:master Feb 18, 2021
@imjwang (Contributor, Author) commented Feb 18, 2021

@jameslamb @StrikerRUS Thanks for the opportunity and support! I've learned a good bit about testing and CI and I am happy to contribute to this project.

@github-actions (bot)

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023
Linked issue: [dask] Add a scikit-learn compatibility test (#3894)