
[dask] Add a scikit-learn compatibility test #3894

Closed
jameslamb opened this issue Feb 2, 2021 · 9 comments · Fixed by #3947

@jameslamb
Collaborator

Summary

scikit-learn supports a test that people writing scikit-learn extensions can use to check API compatibility with the rest of the scikit-learn ecosystem. Such a test should be added for the code in python-package/lightgbm/dask.py.

Motivation

The discussion in #3883 focused largely on the compatibility of the Dask estimators, like DaskLGBMClassifier, with the broader scikit-learn ecosystem. It was proposed in #3883 (comment) that the Dask module should have a test similar to

def test_sklearn_integration(estimator, check):

I think that actually several of the tests from that module should have Dask equivalents, to test compatibility with the scikit-learn ecosystem.

References

More information on the scikit-learn API is available in "Developing scikit-learn estimators". In general, changes to lightgbm.dask that make it align more closely with that spec are welcome.
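For reference, the compatibility test that scikit-learn provides can be driven through check_estimator. A minimal sketch, using a built-in scikit-learn estimator as a stand-in (the issue proposes running the same suite against the lightgbm.dask estimators):

```python
from sklearn.linear_model import LinearRegression
from sklearn.utils.estimator_checks import check_estimator


def run_compat_suite(estimator):
    """Run scikit-learn's full API-compatibility suite.

    Returns True if every check passes; check_estimator() raises
    on the first failing check.
    """
    check_estimator(estimator)
    return True


print(run_compat_suite(LinearRegression()))  # True
```

In a test suite this is typically wired up with pytest's parametrization (scikit-learn >= 0.23 ships parametrize_with_checks for exactly that), so each individual check shows up as its own test case.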

@imjwang
Contributor

imjwang commented Feb 2, 2021

Hey, I would like to be assigned this issue. I am in a software methodology class and am passionate about data science.

@jameslamb
Collaborator Author

Thanks @imjwang! You're welcome to take it.

A few things you should know:

  1. the Dask tests will only work on Linux, so you will have to develop in a Linux environment
  2. you can build the package from source by running cd python-package && python setup.py install
  3. you can run just the Dask tests to test your code by running pytest tests/python_package_test/test_dask.py

Feel free to ask questions here if you get stuck. Before beginning, please read the items that are linked in the issue's description.

@imjwang
Contributor

imjwang commented Feb 7, 2021

Hey @jameslamb, I'm going through test_sklearn.py and I'm curious why lgb.sklearn.LGBMRanker is not included here?

def _tested_estimators():
    for Estimator in [lgb.sklearn.LGBMClassifier, lgb.sklearn.LGBMRegressor]:
        yield Estimator()

@jameslamb
Collaborator Author

Good question! I do not actually know. @StrikerRUS do you know?

By the way, in the future a link to the code is more helpful than just copying it:

def _tested_estimators():

@StrikerRUS
Collaborator

Scikit-learn doesn't have learning-to-rank applications, so there is no point in testing that LGBMRanker is "compatible" with something that doesn't support ranking.
From a technical point of view, LGBMRanker requires a group argument to be passed to the fit() method, and the scikit-learn API tests know nothing about it.
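To see why this rules LGBMRanker out, here is a toy illustration (NeedsGroup is a hypothetical class, not anything in lightgbm): scikit-learn's generic checks only ever call fit(X, y), so an estimator whose fit() requires an extra group argument fails the suite immediately.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.utils.estimator_checks import check_estimator


class NeedsGroup(BaseEstimator):
    """Toy estimator whose fit() requires an extra 'group' argument,
    mimicking LGBMRanker's interface (hypothetical, for illustration)."""

    def fit(self, X, y, group):
        self.n_features_in_ = np.asarray(X).shape[1]
        return self

    def predict(self, X):
        return np.zeros(np.asarray(X).shape[0])


def passes_sklearn_checks(estimator):
    """Return True only if the estimator survives the whole check suite."""
    try:
        check_estimator(estimator)
        return True
    except Exception:
        return False


print(passes_sklearn_checks(NeedsGroup()))  # False: the checks call fit(X, y) only
```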

@imjwang
Contributor

imjwang commented Feb 9, 2021

I copied these lines of code into test_dask.py along with their dependencies, but used lgb.DaskLGBMClassifier and lgb.DaskLGBMRegressor in _tested_estimators():

def _tested_estimators():
    for Estimator in [lgb.sklearn.LGBMClassifier, lgb.sklearn.LGBMRegressor]:
        yield Estimator()


if sk_version < parse_version("0.23"):
    def _generate_checks_per_estimator(check_generator, estimators):
        for estimator in estimators:
            name = estimator.__class__.__name__
            for check in check_generator(name, estimator):
                yield estimator, check

    @pytest.mark.skipif(
        sk_version < parse_version("0.21"), reason="scikit-learn version is less than 0.21"
    )
    @pytest.mark.parametrize(
        "estimator, check",
        _generate_checks_per_estimator(_yield_all_checks, _tested_estimators()),
    )
    def test_sklearn_integration(estimator, check):
        xfail_checks = estimator._get_tags()["_xfail_checks"]
        check_name = check.__name__ if hasattr(check, "__name__") else check.func.__name__
        if xfail_checks and check_name in xfail_checks:
            warnings.warn(xfail_checks[check_name], SkipTestWarning)
            raise SkipTest
        estimator.set_params(min_child_samples=1, min_data_in_bin=1)
        name = estimator.__class__.__name__
        check(name, estimator)
else:
    @parametrize_with_checks(list(_tested_estimators()))
    def test_sklearn_integration(estimator, check, request):
        estimator.set_params(min_child_samples=1, min_data_in_bin=1)
        check(estimator)

But running pytest tests/python_package_test/test_dask.py -k test_sklearn_integration gives me a bunch of failed tests, mostly with ValueError: No clients found.

Does this satisfy the task, or am I missing something? @jameslamb

@jameslamb
Collaborator Author

Seems like a good approach! If you feel like you have something that is mostly working, please open a pull request and I can give more specific help there.

For the No clients found error...

Creating an instance of the Dask estimators requires an active Dask client. We do this with a client fixture from dask/distributed.

You can use this by passing the keyword argument client into the test function, like

def test_training_does_not_fail_on_port_conflicts(client):
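Outside of the fixture, what the fixture provides can be sketched directly: an active Dask client that estimators discover automatically. A minimal sketch, assuming dask and distributed are installed (the single-worker, in-process settings here are illustrative choices, not what the test suite uses):

```python
from distributed import Client, LocalCluster

# Spin up a throwaway local cluster; in the test suite this is what the
# 'client' fixture from distributed does for you, with proper teardown.
cluster = LocalCluster(
    n_workers=1, threads_per_worker=1, processes=False, dashboard_address=None
)
client = Client(cluster)

# With a client active, work can be submitted to the cluster.
result = client.submit(lambda x: x + 1, 41).result()
print(result)  # 42

# The fixture also handles this cleanup step after each test.
client.close()
cluster.close()
```

The fixture approach is preferred over constructing a Client inside each test because it guarantees teardown even when the test body raises.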

StrikerRUS pushed a commit that referenced this issue Feb 18, 2021
* add test_dask.py

* Update tests/python_package_test/test_dask.py

Co-authored-by: James Lamb <jaylamb20@gmail.com>

* clients

* remove ports

* safe sklearn checks

* safe sklearn checks

* fix whitespace

* fix whitespace-try 2

* fix whitespace-try 3

* isort

* isort

* sklearn_checks_to_learn

Co-authored-by: James Lamb <jaylamb20@gmail.com>
@StrikerRUS
Collaborator

We may want to get back to this after resolving the upstream issue dask/dask-ml#796.

@github-actions

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023