Intel(R) Distribution for Python's sklearn patches #15

Merged
16 commits merged into master on Jan 10, 2019

Conversation

oleksandr-pavlyk
Contributor

This PR provides patches of scikit-learn classes as a stand-alone module.

@fschlimb @anton-malakhov @ogrisel

They can be invoked via

python -m daal4py.sklearn_patches  script.py args

or by explicitly enabling them via

import daal4py.sklearn_patches.dispatcher
daal4py.sklearn_patches.dispatcher.enable()

Names, design, etc. are up for discussion.

Contributor

@ogrisel ogrisel left a comment

This looks great.

May I suggest changing the package hierarchy as follows:

- daal4py/
    - [other daal4py packages]
    - sklearn/
        - decision_forest.py  # provides sklearn compatible estimator without having to patch
        - monkeypatch/  # has all the tools to monkeypatch the top level sklearn namespace.

# disclosure or delivery of the Materials, either expressly, by
# implication, inducement, estoppel or otherwise. Any license under such
# intellectual property rights must be express and approved by Intel in
# writing.
Contributor

This license header seems to contradict the ASL license of the daal4py repo.

Contributor

Yes, this should become Apache.


def enable(name=None):
    if sklearn_version != "0.20.0":
        raise NotImplementedError("daal4sklearn is for scikit-learn 0.20.0 only, found version {0}".format(sklearn_version))
Contributor

Maybe you could raise NotImplementedError only if the scikit-learn version is lower than 0.20.0, and only issue a UserWarning when scikit-learn is more recent?

This would make it easier to test the patch on the development version of scikit-learn.

Contributor

How difficult is it to include our patches for 0.19?

Contributor

@ogrisel ogrisel Nov 21, 2018

The following may be handy for this:

from distutils.version import LooseVersion


if LooseVersion(sklearn_version) < LooseVersion("0.20.0"):
    raise NotImplementedError("daal4sklearn is for scikit-learn 0.20.0 only ...")
elif LooseVersion(sklearn_version) > LooseVersion("0.20.0"):
    warnings.warn("daal4sklearn {daal4py_version} has only been tested with scikit-learn 0.20.0, found version...")

@fschlimb
Contributor

fschlimb commented Nov 21, 2018

Very cool.

I suggest extending @ogrisel's structure to

- daal4py/
    - [native daal4py stuff]
    - __main__.cpp
    - sklearn/
        - decision_forest.py  # provides sklearn compatible estimator without having to patch
        - monkeypatch/  # has all the tools to monkeypatch the top level sklearn namespace.

This makes the '-m' feature more convenient: it becomes '-m daal4py'.

Also, I suggest adding a function to the daal4py package so users can do

import daal4py
daal4py.patch_sklearn()

Additionally or alternatively, we might want to add patch_skl.py so that the following would be enough:

import daal4py.patch_skl

Also, please find a better name than 'daal4py_utils.py', maybe simply 'utils.py'.
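A minimal sketch of what such a patch_skl.py could look like; the module path and function name used below (daal4py.sklearn.monkeypatch.dispatcher.enable) are assumptions for illustration, not the layout this PR necessarily ends up with:

# daal4py/patch_skl.py (hypothetical sketch): enabling the patches is the
# module's only side effect, so "import daal4py.patch_skl" is sufficient.
from daal4py.sklearn.monkeypatch.dispatcher import enable  # assumed location

enable()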

@fschlimb
Contributor

We also need tests and documentation.
Documentation is generated through Sphinx (and its autodoc) and lives on the gh-pages branch.

@oleksandr-pavlyk
Contributor Author

I have restructured the code as suggested. Monkey patching can now be enabled via python -m daal4py script.py.

I have used LooseVersion and updated the license.
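For reference, a minimal sketch of how a daal4py/__main__.py could drive python -m daal4py script.py; the patch entry-point location is an assumption, and this sketch does not handle the -m pytest form used below:

# Hypothetical daal4py/__main__.py sketch: enable the patches, then run the
# target script as if it had been launched directly.
import runpy
import sys

from daal4py.sklearn.monkeypatch.dispatcher import enable  # assumed location


def main():
    enable()                      # apply the scikit-learn monkey patches
    script = sys.argv[1]
    sys.argv = sys.argv[1:]       # make argv look like the script was run directly
    runpy.run_path(script, run_name="__main__")


if __name__ == "__main__":
    main()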

@oleksandr-pavlyk
Contributor Author

Running

python -m daal4py -m pytest --disable-warnings --pyargs sklearn

in an environment with 0.20.1 installed currently produces 8 failures. One of them is in test_parallel_classification, which I cannot explain today.

@anton-malakhov

Why not rename the module to just daal? It would follow the same suit as mkl-service and tbb4py.

@oleksandr-pavlyk
Contributor Author

@anton-malakhov That would require renaming the daal4py package to daal. That name has already been claimed by the installed pydaal.

@oleksandr-pavlyk
Contributor Author

@ogrisel I looked into why test_parallel_classification fails.

Consider the following snippet:

# test_parallel_classifier.py
import sklearn
import sklearn.datasets
import sklearn.model_selection
import sklearn.utils
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

iris = sklearn.datasets.load_iris()
rng = sklearn.utils.check_random_state(0)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    iris.data, iris.target, random_state=rng)

backend = 'loky'
# backend = 'multiprocessing'
with sklearn.utils.parallel_backend(backend):
    base_est = SVC(gamma='scale', decision_function_shape='ovr')
    ensemble = BaggingClassifier(base_est, n_jobs = 2, random_state=0).fit(X_train, y_train)

print(ensemble.estimators_[0]._dual_coef_)

Executing with vanilla scikit-learn via python test_parallel_classifier.py prints the matrix of dual coefficients as expected.

Running the above snippet with the monkey-patched code, python -m daal4py test_parallel_classifier.py,
results in an error with n_jobs=2 and backend='loky', but not with n_jobs=1, or with n_jobs=2 and backend='multiprocessing':

  File "test_parallel_classifier.py", line 19, in <module>
    print(ensemble.estimators_[0]._dual_coef_)
  File "./daal4py/sklearn/monkeypatch/svm.py", line 33, in _dual_coef_getter
    return self._internal_dual_coef_
AttributeError: 'SVC' object has no attribute '_internal_dual_coef_'

The following is my account of what transpires to the best of my understanding.

  1. Monkey patching adds a _dual_coef_ attribute to the class SVC, set to property(_dual_coef_getter, _dual_coef_setter). These are implemented in svm.py#L32-53 and serve the purpose of deleting daal_model from the instance if someone attempts to modify _dual_coef_ (see svm/tests/test_svm::test_tweak_params).

  2. With 'loky', the instance of the class on the worker ends up not being monkey-patched. The fitted instances have the attribute _dual_coef_ set to an explicit array. (I checked this by adding prints to _parallel_build_estimators in ensemble/bagging.py.)

  3. When results are gathered at the host, somehow the fitted instances are those of the patched class, and the value of _dual_coef_ computed on the worker does not overwrite the property, but gets discarded.
    The value of _internal_dual_coef_ was never set by the unpatched fit on the worker, resulting in the observed behavior.

This problem is not observable if daal4py.sklearn.patch_sklearn() is called from sklearn/__init__.py, since instances of SVC on workers appear to be patched.

Hence questions:

  • If monkey patching is not done inside sklearn/__init__.py, is there a way to ensure that patching is also applied to the workers spawned by loky?
  • If that proves impossible, could you suggest a way to adjust _internal_dual_coef_getter to have a reasonable fallback when the _internal_dual_coef_ attribute is unavailable (see the sketch after this list)? Can the value of dual_coef_ be reused?
  • The only reason these setter and getter were added is to pass the test_tweak_params test. Perhaps it is better/easier to mark this test as expected to fail with monkey patching?
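To make the second question concrete, here is a minimal sketch of a fallback getter; it is an illustration based on the attribute names described in item 1 above, not the PR's actual code:

def _dual_coef_getter(self):
    # Value maintained by the patched setter.
    try:
        return self._internal_dual_coef_
    except AttributeError:
        # Fall back to whatever an unpatched fit() on a worker may have stored
        # directly in the instance __dict__ (shadowed by the class property).
        return self.__dict__.get('_dual_coef_')


def _dual_coef_setter(self, value):
    self._internal_dual_coef_ = value
    # Modifying the dual coefficients invalidates the cached DAAL model,
    # as described in item 1 above (attribute name assumed).
    self.__dict__.pop('daal_model', None)


# Item 1: the property is installed on the class, so instance access goes
# through the getter/setter above.
from sklearn.svm import SVC
SVC._dual_coef_ = property(_dual_coef_getter, _dual_coef_setter)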

Thanks for your input.

@oleksandr-pavlyk
Contributor Author

oleksandr-pavlyk commented Dec 6, 2018

If the monkey-patching line is added to site-packages/sklearn/__init__.py, a run of pytest --disable-warnings --pyargs sklearn completes with 6 failures, 5 of which are related to the use of SVC. These are

  1. sklearn.ensemble.tests.test_bagging::test_sparse_classification
  2. sklearn.svm.tests.test_sparse::test_svc_iris
  3. sklearn.svm.tests.test_sparse::test_sparse_realdata
  4. sklearn.svm.tests.test_svm::test_precomputed
  5. sklearn.svm.tests.test_svm::test_sample_weights

The other failing test is sklearn.cluster.tests.test_k_means::test_k_means_fit_predict. This is resolved if n_init is set to 1 instead of the default 10.

The cluster of 5 SVC-related failures all have one thing in common: the dataset contains duplicate samples, or bagging results in choosing a subset of features for which such duplicate samples appear.

I feel compelled to point out that such inputs lead to a quadratic optimization problem with a non-positive-definite matrix Q, leading to ambiguity, typically resolved via use of the parameter tau (see src/libsvm/svm.cpp#L89). While it is important for the implementation to gracefully handle such inputs, they cannot be clearly separated into clusters, and they lead to ambiguities and sensitivity of the solution to platform/compiler-specific quirks.
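For background, this is the standard C-SVM dual (not specific to this PR): duplicate samples give the kernel matrix repeated rows and columns, so Q below is singular, and the optimal dual coefficients need not be unique even when the decision function is.

$$\max_{\alpha}\ \sum_i \alpha_i \;-\; \tfrac{1}{2}\,\alpha^{\top} Q\,\alpha,
\qquad Q_{ij} = y_i\, y_j\, K(x_i, x_j),
\qquad 0 \le \alpha_i \le C,\quad \sum_i \alpha_i y_i = 0.$$

Different solvers, or the same solver built with different compilers, can then legitimately return different dual_coef_, which is where libsvm's tau tie-breaking comes in.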

I was able to resolve some of these failures by tightening the tolerance parameter:

diff --git a/sklearn/ensemble/tests/test_bagging.py b/sklearn/ensemble/tests/test_bagging.py
index 608df3dc43..5dcfd4dbea 100644
--- a/sklearn/ensemble/tests/test_bagging.py
+++ b/sklearn/ensemble/tests/test_bagging.py
@@ -118,7 +118,7 @@ def test_sparse_classification():
             for f in ['predict', 'predict_proba', 'predict_log_proba', 'decision_function']:
                 # Trained on sparse format
                 sparse_classifier = BaggingClassifier(
-                    base_estimator=CustomSVC(gamma='scale',
+                    base_estimator=CustomSVC(gamma='scale', tol=1e-8,
                                              decision_function_shape='ovr'),
                     random_state=1,
                     **params
@@ -127,7 +127,7 @@ def test_sparse_classification():

                 # Trained on dense format
                 dense_classifier = BaggingClassifier(
-                    base_estimator=CustomSVC(gamma='scale',
+                    base_estimator=CustomSVC(gamma='scale', tol=1e-8,
                                              decision_function_shape='ovr'),
                     random_state=1,
                     **params

and sometimes by additionally loosening the comparison tolerance:

diff --git a/sklearn/svm/tests/test_sparse.py b/sklearn/svm/tests/test_sparse.py
index ce14bda1db..33f695538a 100644
--- a/sklearn/svm/tests/test_sparse.py
+++ b/sklearn/svm/tests/test_sparse.py
@@ -138,17 +138,17 @@ def test_svc_with_custom_kernel():
 def test_svc_iris():
     # Test the sparse SVC with the iris dataset
     for k in ('linear', 'poly', 'rbf'):
-        sp_clf = svm.SVC(gamma='scale', kernel=k).fit(iris.data, iris.target)
-        clf = svm.SVC(gamma='scale', kernel=k).fit(iris.data.toarray(),
+        sp_clf = svm.SVC(gamma='scale', kernel=k, tol=1e-10).fit(iris.data, iris.target)
+        clf = svm.SVC(gamma='scale', kernel=k, tol=1e-10).fit(iris.data.toarray(),
                                                    iris.target)
 
         assert_array_almost_equal(clf.support_vectors_,
                                   sp_clf.support_vectors_.toarray())
-        assert_array_almost_equal(clf.dual_coef_, sp_clf.dual_coef_.toarray())
+        assert_array_almost_equal(clf.dual_coef_, sp_clf.dual_coef_.toarray(), decimal=4, err_msg=k)
         assert_array_almost_equal(
             clf.predict(iris.data.toarray()), sp_clf.predict(iris.data))
         if k == 'linear':
-            assert_array_almost_equal(clf.coef_, sp_clf.coef_.toarray())
+            assert_array_almost_equal(clf.coef_, sp_clf.coef_.toarray(), decimal=4)
 
 
 def test_sparse_decision_function():
@@ -310,11 +310,12 @@ def test_sparse_realdata():
          3., 0., 0., 2., 2., 1., 3., 1., 1., 0., 1., 2., 1.,
          1., 3.])
 
-    clf = svm.SVC(kernel='linear').fit(X.toarray(), y)
-    sp_clf = svm.SVC(kernel='linear').fit(sparse.coo_matrix(X), y)
+    clf = svm.SVC(kernel='linear', tol=1e-10).fit(X.toarray(), y)
+    sp_clf = svm.SVC(kernel='linear', tol=1e-10).fit(sparse.coo_matrix(X), y)
 
+    assert_array_equal(clf.support_, sp_clf.support_)
     assert_array_equal(clf.support_vectors_, sp_clf.support_vectors_.toarray())
-    assert_array_equal(clf.dual_coef_, sp_clf.dual_coef_.toarray())
+    assert_array_almost_equal(clf.dual_coef_, sp_clf.dual_coef_.toarray())
 
 
 def test_sparse_svc_clone_with_callable_kernel():

I'd like to argue that such failures are not meaningful, and I'd suggest adding noise to the input data used in the SVM tests to eliminate duplicate samples.

Perhaps I should open a separate discussion issue to this effect in the scikit-learn project.

@ogrisel
Contributor

ogrisel commented Dec 7, 2018

For the first issue (the monkeypatch that does not work with loky workers and arguably also with multiprocessing workers under Windows), I will try to think about a possible solution, maybe using class inheritance. I have some ideas but it requires some experimentation.

For the support vector machine identifiability issue with duplicate samples, thank you very much for your analysis. I am currently building daal4py to reproduce the issue on my laptop, see the problem in more detail, and devise the best course of action (decreasing tol or changing the dataset to avoid duplicated samples).

@ogrisel
Contributor

ogrisel commented Dec 7, 2018

Before merging this PR, I think it would be good to change the travis configuration to launch the sklearn tests for the last stable vanilla release of scikit-learn against the patched daal4py code from this repo.

Looking at the current travis config, it seems that the daal4py tests themselves are not run by travis. Is this intentional? Or are they run automatically by conda build?

Later we can also add a travis cron job that does the same against scikit-learn master.

@fschlimb
Contributor

fschlimb commented Dec 7, 2018

Yes, the tests are run by conda build which is called in travis CI.

@fschlimb
Contributor

fschlimb commented Dec 7, 2018

I'd like to have travis run only the subset of sklearn tests that are affected by daal4py. It should finish in reasonable time.
A cron job for regular full sklearn testing makes sense to me.

if LooseVersion(sklearn_version) < LooseVersion("0.20.0"):
    raise NotImplementedError("daal4sklearn is for scikit-learn 0.20.0 only ...")
elif LooseVersion(sklearn_version) > LooseVersion("0.20.1"):
    warnings.warn("daal4sklearn {daal4py_version} has only been tested with scikit-learn 0.20.0, found version...")
Contributor

The warnings module is not imported.

Contributor Author

Thanks! I'll fix it in a moment.

…ould not be setting them, but setting _internal_dual_coef_ and _internal_intercept_ instead
@oleksandr-pavlyk
Contributor Author

@ogrisel I tried to implement a workaround for the issue of interaction between loky and dynamic monkey-patching, where workers end up being unpatched and issues arise when combining worker results (trained instances) with clones of the patched class on the host.

That can be found in the branch features/sklearn-patches-try-subclassing. With this change, new failures appear, like this one:

python -m daal4py -m pytest --disable-warnings --pyargs sklearn.svm.tests.test_svm::test_unfitted

stemming from exception messages now containing the name of the class used to replace SVC instead of SVC itself:

>       raise self.test_case.failureException(msg)
E       AssertionError: ".*\bSVC\b.*\bnot\b.*\bfitted\b" does not match "This SVC_daal4py instance is not fitted yet. Call 'fit' with appropriate arguments before using this method."

Do you have any suggestions on how to work around these issues, or should the tests perhaps be changed so that the patched classes can pass them?

Thank you

@ogrisel
Contributor

ogrisel commented Dec 21, 2018

Wouldn't it be possible to keep the same class names, using a strategy that looks like the following:

from sklearn.svm import SVC as SVC_sklearn


class SVC(SVC_sklearn):
    # put the patched methods here
    pass


def install_patch():
    from sklearn import svm
    svm.SVC = SVC

@oleksandr-pavlyk
Contributor Author

@ogrisel This approach works well in general. It runs into trouble with pytest and Ridge though, perhaps because it cannot handle uses of super(Ridge, self): it claims that self must be a subtype or an instance of Ridge, which it is. So I am leaving Ridge as before. There was also a problem with pickling LinearRegression when it was monkey-patched via subclassing.

I also updated pairwise distance computation to incorporate fixes from upstream.

copy_X=copy_X, n_jobs=n_jobs)


setattr(LinearRegression, 'fit', fit)
Contributor

Any particular reason why you do not define the functions inline?
The 'normal' way is to define class-methods inline.

Contributor Author

Less copy-pasting is the only reason.

@fschlimb
Contributor

I merged with trunk to separate the fix for low order moments, coming up next.

@oleksandr-pavlyk
Contributor Author

Thanks @fschlimb. I can confirm that after merging tc/parse_moments into this branch, the problem I was experiencing with low_order_moments on DAAL 2019.1 goes away.

@fschlimb fschlimb merged commit cdc1c85 into master Jan 10, 2019
@oleksandr-pavlyk oleksandr-pavlyk deleted the feature/sklearn-patches branch January 10, 2019 21:36
@oleksandr-pavlyk oleksandr-pavlyk restored the feature/sklearn-patches branch January 10, 2019 21:36
@oleksandr-pavlyk
Contributor Author

This branch was merged in error.

Master and the branch behind this pull request were force-pushed to revert that.

Regrettably, there is no way to reopen this PR, so I reopened it as #35.

Vika-F pushed a commit to Vika-F/scikit-learn-intelex that referenced this pull request Apr 4, 2024
Native KMeans: fix wrong parameters passed to DAAL prediction object

Native logistic regression: scale DAAL's loss function value and gradient by n_samples

daal4py logistic regression: don't compute loss function value and gradient twice

* Align kmeans native benchmark to sklearn

* Don't evaluate func, grad if not necessary

* Scale loss value and gradient like sklearn does

* Specify maxIterations, accuracyThreshold for multi_class_classifier

* Fix indentation in Makefiles

* Fix logic to select n_features_per_node in daal4py df_regr

* Specifically say memorySavingMode=false in native df_regr

* Reformat native df_regr bench