
[tests][dask] add scikit-learn compatibility tests (fixes #3894) #3947

Merged: 13 commits, Feb 18, 2021

Conversation

@imjwang (Contributor) commented Feb 12, 2021

sklearn integration tests for dask. This fixes #3894

@ghost commented Feb 12, 2021

CLA assistant check: all CLA requirements met.

@jameslamb changed the title from "tests" to "[dask] add scikit-learn compatibility tests (fixes #3894)" Feb 12, 2021
@jameslamb (Collaborator) left a comment

Thanks very much for working on this! I've updated the PR title and description so that they're descriptive of the work. Linking to the relevant issue in the description helps reviewers to understand the context of why a pull request has been opened. Using an informative title makes the PR easier to understand in the git history and as a changelog entry: https://github.com/microsoft/LightGBM/releases/tag/untagged-edc0a3c487f27c6fbdb9

I'd be happy to review this code once you sign the Contributor License Agreement. Please let me know if you have any questions about it.

@imjwang (Contributor, Author) commented Feb 12, 2021

No problem, I had a lot of fun!
The above link is not working for me.
I also saw that sklearn's compatibility tests deep copy the estimator passed in, and that may be causing problems with the client.

https://github.com/scikit-learn/scikit-learn/blob/31b34b560de57a049dd435dccc55112271322370/sklearn/utils/estimator_checks.py#L2308-L2316

https://github.com/scikit-learn/scikit-learn/blob/5403e9fdaee6d4982c887ce2ae9a62ccd3955fbb/sklearn/base.py#L29

@jameslamb (Collaborator)

> The above link is not working for me.

I'm surprised by that, but don't worry about it. I was just showing that we use some automation that automatically adds one changelog entry for each pull request, with the pull request title as the text.

[screenshot: an auto-generated draft-release changelog entry built from the pull request title]

> I also saw that sklearn's compatibility tests deep copy the estimator passed in, and that may be causing problems with the client.

Can you please explain specifically what you mean by "causing problems with the client"?

@jameslamb (Collaborator) left a comment

I left some suggestions that might help with the failing tests in CI.

Are you able to test this locally? That might be faster for you than relying on our CI for feedback. If you have trouble building lightgbm locally or running the tests locally, let me know and I can help.
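
For reference, assuming lightgbm has been built and installed locally, running the Dask tests is typically something like this from the repository root:

pytest tests/python_package_test/test_dask.py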

"estimator, check",
_generate_checks_per_estimator(_yield_all_checks, _tested_estimators()),
)
def test_sklearn_integration(estimator, check):
@jameslamb (Collaborator)

please pass the client fixture into this test and the one below. It should fix the errors like this that I see in CI.

E ValueError: No clients found
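
For anyone reading along, a minimal sketch of the fixture pattern being suggested here (the actual client fixture used by test_dask.py may be defined differently; the cluster parameters below are illustrative):

import pytest
from distributed import Client, LocalCluster


@pytest.fixture()
def client():
    # yield a live Dask client so the estimators under test can find
    # a scheduler instead of failing with "No clients found"
    with LocalCluster(n_workers=2, threads_per_worker=1) as cluster:
        with Client(cluster) as dask_client:
            yield dask_client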

@imjwang (Contributor, Author) Feb 12, 2021

Are you referring to test_dask.py? I have built lightgbm on WSL Ubuntu 18.04 and I can run test_dask.py and test_sklearn.py

tests/python_package_test/test_dask.py — review thread (outdated, resolved)
imjwang and others added 2 commits February 12, 2021 13:12
Co-authored-by: James Lamb <jaylamb20@gmail.com>
@imjwang
Copy link
Contributor Author

imjwang commented Feb 12, 2021

Several of the tests output this error when I run locally
x = <Task pending name='Task-2875' coro=<Client._handle_report() running at /home/jwng/anaconda3/lib/python3.8/site-packages/distributed/client.py:1223> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f6cc43a72b0>()]>> memo = {140104918447248: <[AttributeError("'Client' object has no attribute '_scheduler_identity'") raised in repr()] Client ...04896: [{}, defaultdict(<function Client.__init__.<locals>.<lambda> at 0x7f6cc43afa60>, {})], 140105126224448: [], ...} _nil = []

    def deepcopy(x, memo=None, _nil=[]):
        """Deep copy operation on arbitrary Python objects.

        See the module's __doc__ string for more info.
        """

        if memo is None:
            memo = {}

        d = id(x)
        y = memo.get(d, _nil)
        if y is not _nil:
            return y

        cls = type(x)

        copier = _deepcopy_dispatch.get(cls)
        if copier is not None:
            y = copier(x, memo)
        else:
            if issubclass(cls, type):
                y = _deepcopy_atomic(x, memo)
            else:
                copier = getattr(x, "__deepcopy__", None)
                if copier is not None:
                    y = copier(memo)
                else:
                    reductor = dispatch_table.get(cls)
                    if reductor:
                        rv = reductor(x)
                    else:
                        reductor = getattr(x, "__reduce_ex__", None)
                        if reductor is not None:
>                           rv = reductor(4)

E                           TypeError: cannot pickle '_asyncio.Task' object

../anaconda3/lib/python3.8/copy.py:161: TypeError
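
A minimal standalone reproduction of that TypeError, independent of Dask and LightGBM: copy.deepcopy() falls back to pickling via __reduce_ex__, and asyncio tasks (like the Client._handle_report task visible in the locals above) refuse to be pickled.

import asyncio
import copy


async def main():
    # any pending asyncio task will do; a distributed.Client holds several
    task = asyncio.create_task(asyncio.sleep(0))
    try:
        copy.deepcopy(task)
    except TypeError as err:
        print(err)  # cannot pickle '_asyncio.Task' object
    await task


asyncio.run(main())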

Comment on lines 1118 to 1119
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(('127.0.0.1', 12400))
@jameslamb (Collaborator)

what is the purpose of these socket.socket() lines? Are you just checking that this port is available?

lightgbm.dask already does that automatically, so this is not necessary. Can you please remove these?

@StrikerRUS (Collaborator)

@jameslamb

> The above link is not working for me.
> I'm surprised by that

Draft Releases can be seen only by maintainers.
Try to open your link in Incognito mode to check that.

@jameslamb (Collaborator)

> Draft Releases can be seen only by maintainers.
> Try to open your link in Incognito mode to check that.

Oh, didn't realize! I thought they could only be EDITED by maintainers. OK, makes sense.

@jameslamb (Collaborator)

> Several of the tests output this error when I run locally

Ok, do you need help debugging?

@imjwang (Contributor, Author) commented Feb 12, 2021

> Ok, do you need help debugging?

@jameslamb Yes please, I have not been able to fix the errors. Am I doing something wrong with the clients?

@jameslamb (Collaborator)

> Am I doing something wrong with the clients?

I'll try running this myself a bit. I don't know what the internals of those checks imported from sklearn do, so I have to do some research. A Dask distributed.Client object is not pickle-able, so if it's returned in any result that those tests then try to pickle, we'll have to work around that.

@imjwang (Contributor, Author) commented Feb 13, 2021

> I also saw that sklearn's compatibility tests deep copy the estimator passed in and that may be causing problems with the client.
>
> Can you please explain specifically what you mean by "causing problems with the client"?

My understanding is that one of these functions is run in test_dask.py, and both yield the sklearn tests for different sklearn versions.

@parametrize_with_checks(list(_tested_estimators()))

@pytest.mark.parametrize(
    "estimator, check",
    _generate_checks_per_estimator(_yield_all_checks, _tested_estimators()),
)

I looked at parametrize_with_checks() and the checks on their GitHub here, and it seems like most of them call clone(), which copies the passed-in estimator and also calls copy.deepcopy() on the original estimator's params before running each test on a cloned estimator.
I think they are using copy.deepcopy() from the Python standard library, and I think that is behind the TypeError: cannot pickle '_asyncio.Task' object.
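
For reference, a rough sketch of what clone() does; the real implementation in sklearn/base.py has more validation, but the deep copy of each constructor parameter is the relevant part here:

import copy


def clone_sketch(estimator):
    # simplified sklearn.base.clone(): rebuild the estimator from a
    # deep copy of its constructor parameters
    params = estimator.get_params(deep=False)
    new_params = {name: copy.deepcopy(value) for name, value in params.items()}
    return type(estimator)(**new_params)

So if something non-picklable, such as a distributed.Client, ends up among those parameters, the deepcopy step raises exactly the TypeError above.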

@jameslamb (Collaborator) commented Feb 15, 2021

ok @imjwang I did some investigation. There is a lot here, sorry. Please don't change anything in this PR until @StrikerRUS also gives an opinion on my comments below.


Why so many of the tests in this PR are failing

I think we have to skip any of the tests from https://github.com/scikit-learn/scikit-learn/blob/31b34b560de57a049dd435dccc55112271322370/sklearn/utils/estimator_checks.py that involve calling .fit() or .predict(). Those methods have code in them that passes in numpy arrays, and the Dask estimators can only take in data inputs that are Dask collections.

Many of the tests fail with errors like this

E AttributeError: 'numpy.ndarray' object has no attribute 'to_delayed'

or

E TypeError: Data must be either Dask Array or Dask DataFrame. Got <class 'numpy.ndarray'>.
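
To make that restriction concrete, a small sketch, assuming a local distributed cluster and a LightGBM build that includes the dask module (parameter values are arbitrary):

import dask.array as da
import numpy as np
from distributed import Client, LocalCluster

import lightgbm as lgb

if __name__ == "__main__":
    with Client(LocalCluster(n_workers=2)) as client:
        X = np.random.rand(100, 4)
        y = np.random.randint(2, size=100)

        # Dask collections are accepted
        dX = da.from_array(X, chunks=(50, 4))
        dy = da.from_array(y, chunks=50)
        clf = lgb.DaskLGBMClassifier(n_estimators=5).fit(dX, dy)

        # raw numpy arrays are not, which is what the failing checks hit
        try:
            clf.fit(X, y)
        except TypeError as err:
            print(err)  # Data must be either Dask Array or Dask DataFrame. ...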

What I think should be done in this PR

So I think that for now, the best we can do is specifically select the checks that don't have .fit() or .predict() calls in them. I went through that module, sklearn.utils.estimator_checks, and picked these out:

  • check_parameters_default_constructible()
  • check_get_params_invariance()
  • check_set_params()
  • check_estimator_get_tags_default_keys()

It's not a lot, but at least it will catch a few things.

One nice side-effect of this is that the code for different versions of scikit-learn can be removed.

import sklearn.utils.estimator_checks as sklearn_checks

_check_names = [
    "check_estimator_get_tags_default_keys",
    "check_get_params_invariance",
    "check_set_params"
]
sklearn_checks_to_run = []
for check_name in _check_names:
    check_func = getattr(sklearn_checks, check_name, None)
    if check_func:
        sklearn_checks_to_run.append(check_func)
 
def _tested_estimators():
    for Estimator in [lgb.DaskLGBMClassifier, lgb.DaskLGBMRegressor]:
        yield Estimator()


@pytest.mark.parametrize("estimator", _tested_estimators())
@pytest.mark.parametrize("check", sklearn_checks_to_run)
def test_sklearn_integration(estimator, check, client):
    estimator.set_params(local_listen_port=18000, time_out=5)
    name = type(estimator).__name__
    check(name, estimator)
    client.close(timeout=CLIENT_CLOSE_TIMEOUT)


# this test is separate because it takes a not-yet-constructed estimator
@pytest.mark.parametrize("estimator", list(_tested_estimators()))
def test_parameters_default_constructible(estimator):
    name, Estimator = estimator.__class__.__name__, estimator.__class__
    sklearn_checks.check_parameters_default_constructible(name, Estimator)

I tested this and the tests passed for both scikit-learn 0.23.2 and scikit-learn 0.22.2.
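
If anyone wants to reproduce that locally, one way is selecting just these tests with pytest's -k filter, e.g.:

pytest tests/python_package_test/test_dask.py -k "sklearn_integration or parameters_default_constructible"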

How we can get better coverage of scikit-learn compatibility in the future

I've opened dask/dask-ml#796 in dask-ml, a feature request describing a Dask equivalent of these checks.


Separate Question

I explicitly omitted check_no_attributes_set_in_init() because it fails for the Dask classifiers. @StrikerRUS how is this not also breaking in the sklearn tests?

import lightgbm as lgb
from sklearn.utils.estimator_checks import check_no_attributes_set_in_init
check_no_attributes_set_in_init("LGBMClassifier", lgb.LGBMClassifier())

AssertionError: Estimator LGBMClassifier should not set any attribute apart from parameters during init. Found attributes ['_Booster', '_best_iteration', '_best_score', '_class_map', '_class_weight', '_classes', '_evals_result', '_n_classes', '_n_features', '_n_features_in', '_objective', '_other_params']

@StrikerRUS (Collaborator)

@jameslamb

> I explicitly omitted check_no_attributes_set_in_init() because it fails for the Dask classifiers. @StrikerRUS how is this not also breaking in the sklearn tests?

This test is skipped.

'_xfail_checks': {
    'check_no_attributes_set_in_init':
        'scikit-learn incorrectly asserts that private attributes '
        'cannot be set in __init__: '
        '(see https://github.com/microsoft/LightGBM/issues/2628)'
}

scikit-learn/scikit-learn#16241
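
For context, a sketch of how an estimator can advertise such a skip through scikit-learn's estimator-tags mechanism; the class below is illustrative, not LightGBM's actual code:

from sklearn.base import BaseEstimator


class MyEstimator(BaseEstimator):
    def _more_tags(self):
        # parametrize_with_checks() reads this tag and marks the named
        # checks as expected failures instead of running them
        return {
            "_xfail_checks": {
                "check_no_attributes_set_in_init": "private attributes are set in __init__"
            }
        }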

Ah, I see! The majority of the tests pass numpy arrays into fit(), which is prohibited by the lightgbm.dask API. I should have thought about it earlier, sorry!
It's great that you made the dask/dask-ml#796 feature request! I love it!

TBH, I don't think that applying a few separate tests from the whole scikit-learn test suite makes a lot of sense... Checking "compatibility" in such a way may be confusing and may give us a false sense of security. I believe it's better to think that Dask estimators are not compatible with the scikit-learn API for now.
However, it is not a strong opinion and I'm quite OK with having those few tests in our CI.

@jameslamb (Collaborator) commented Feb 16, 2021

> This test is skipped.

Ah, missed that! Didn't think to look for testing-specific stuff in the sklearn module itself. Got it.

> However, it is not a strong opinion and I'm quite OK with having those few tests in our CI.

I think that we should keep this handful of tests. Some of them capture things that you had to spend time and energy teaching me on #3883. Having those checks in tests might save us similar reviewing effort in the future. If contributors propose a pull request that fails one of those checks, then the tests will tell them what went wrong and what they should fix.

@imjwang could you please update this PR to latest master and change this PR using my suggestion from #3947 (comment)?

@imjwang (Contributor, Author) commented Feb 16, 2021

@jameslamb Yes, sure thing

@imjwang (Contributor, Author) commented Feb 16, 2021

The current code fails the lint test for imports and I'm very confused. The only line I added was import sklearn.utils.estimator_checks as sklearn_checks and I put it in alphabetical order. Is there a guide I can look at?

@jameslamb (Collaborator)

Oh, sorry! We very recently added isort to the linting job.

Run the following to fix it.

pip install isort
isort .
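
If you want to preview the changes before rewriting any files, isort also supports a dry run:

isort --check-only --diff .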

@jameslamb jameslamb self-requested a review February 17, 2021 06:34
@jameslamb (Collaborator) left a comment

looks great, thanks for the contribution! I'd like to request one more review from @StrikerRUS before we merge this.

Comment on lines 1085 to 1094
_check_names = [
    "check_estimator_get_tags_default_keys",
    "check_get_params_invariance",
    "check_set_params"
]
sklearn_checks_to_run = []
for check_name in _check_names:
    check_func = getattr(sklearn_checks, check_name, None)
    if check_func:
        sklearn_checks_to_run.append(check_func)
@StrikerRUS (Collaborator)

Please convert this piece of code into a function. Let's keep code organized.

def sklearn_checks_to_run():
    check_names = [
        "check_estimator_get_tags_default_keys",
        "check_get_params_invariance",
        "check_set_params"
    ]
    for check_name in check_names:
        check_func = getattr(sklearn_checks, check_name, None)
        if check_func:
            yield check_func
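
With that generator, the decorator on the test would presumably become @pytest.mark.parametrize("check", sklearn_checks_to_run()), calling the function so that pytest receives the yielded check functions.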

@imjwang (Contributor, Author) commented Feb 17, 2021

@StrikerRUS Does this work?

@StrikerRUS (Collaborator) left a comment

@imjwang Thank you for your contribution!

@StrikerRUS changed the title from "[dask] add scikit-learn compatibility tests (fixes #3894)" to "[tests][dask] add scikit-learn compatibility tests (fixes #3894)" Feb 18, 2021
@StrikerRUS StrikerRUS merged commit eb5f471 into microsoft:master Feb 18, 2021
@imjwang (Contributor, Author) commented Feb 18, 2021

@jameslamb @StrikerRUS Thanks for the opportunity and support! I've learned a good bit about testing and CI and I am happy to contribute to this project.

@github-actions (bot)

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023
Linked issue: [dask] Add a scikit-learn compatibility test (#3894)