Intel(R) Distribution for Python's sklearn patches #15

Merged
16 commits merged into master on Jan 10, 2019

Conversation

oleksandr-pavlyk
Contributor

This PR provides patches of scikit-learn classes as a stand-alone module.

@fschlimb @anton-malakhov @ogrisel

They can be invoked via

python -m daal4py.sklearn_patches  script.py args

or by explicitly enabling them via

import daal4py.sklearn_patches.dispatcher
daal4py.sklearn_patches.dispatcher.enable()

Names, design, etc. are up for discussion.

Contributor

@ogrisel ogrisel left a comment

This looks great.

May I suggest changing the package hierarchy as follows:

- daal4py/
    - [other daal4py packages]
    - sklearn/
        - decision_forest.py  # provides sklearn compatible estimator without having to patch
        - monkeypatch/  # has all the tools to monkeypatch the top level sklearn namespace.

# disclosure or delivery of the Materials, either expressly, by
# implication, inducement, estoppel or otherwise. Any license under such
# intellectual property rights must be express and approved by Intel in
# writing.
Contributor

This license header seems to contradict the ASL license of the daal4py repo.

Contributor

Yes, this should become Apache.


def enable(name=None):
    if sklearn_version != "0.20.0":
        raise NotImplementedError("daal4sklearn is for scikit-learn 0.20.0 only, found version {0}".format(sklearn_version))
Contributor

Maybe you could raise NotImplementedError only if the scikit-learn version is lower than 0.20.0, and only issue a UserWarning when scikit-learn is more recent?

This would make it easier to test the patch on the development version of scikit-learn.

Contributor

How difficult is it to include our patches for 0.19?

Contributor

@ogrisel ogrisel Nov 21, 2018

The following may be handy for this:

from distutils.version import LooseVersion


if LooseVersion(sklearn_version) < LooseVersion("0.20.0"):
    raise NotImplementedError("daal4sklearn is for scikit-learn 0.20.0 only ...")
elif LooseVersion(sklearn_version) > LooseVersion("0.20.0"):
    warnings.warn("daal4sklearn {daal4py_version} has only been tested with scikit-learn 0.20.0, found version...")

@fschlimb
Contributor

fschlimb commented Nov 21, 2018

Very cool.

I suggest extending @ogrisel's structure to

- daal4py/
    - [native daal4py stuff]
    - __main__.cpp
    - sklearn/
        - decision_forest.py  # provides sklearn compatible estimator without having to patch
        - monkeypatch/  # has all the tools to monkeypatch the top level sklearn namespace.

This makes the '-m' feature more convenient: it becomes '-m daal4py'.

Also, I suggest adding a function to the daal4py package so users can do

import daal4py
daal4py.patch_sklearn()

Additionally or alternatively, we might want to add patch_skl.py so that the following would be enough:

import daal4py.patch_skl

Also, please find a better name than 'daal4py_utils.py', maybe simply 'utils.py'.
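A minimal sketch of what such a patch_skl.py could look like; the module path and function name used below (daal4py.sklearn.monkeypatch.dispatcher.enable) are assumptions for illustration, not the layout this PR necessarily ends up with:

# daal4py/patch_skl.py (hypothetical sketch): enabling the patches is the
# module's only side effect, so "import daal4py.patch_skl" is sufficient.
from daal4py.sklearn.monkeypatch.dispatcher import enable  # assumed location

enable()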

@fschlimb
Contributor

We also need tests and documentation.
Documentation is generated through Sphinx (and its autodoc) and lives on the gh-pages branch.

@oleksandr-pavlyk
Contributor Author

I have restructured the code as suggested. Monkey patching can now be enabled via python -m daal4py script.py.

I have used LooseVersion and updated the license.
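For reference, a minimal sketch of how a daal4py/__main__.py could drive python -m daal4py script.py; the patch entry-point location is an assumption, and this sketch does not handle the -m pytest form used below:

# Hypothetical daal4py/__main__.py sketch: enable the patches, then run the
# target script as if it had been launched directly.
import runpy
import sys

from daal4py.sklearn.monkeypatch.dispatcher import enable  # assumed location


def main():
    enable()                      # apply the scikit-learn monkey patches
    script = sys.argv[1]
    sys.argv = sys.argv[1:]       # make argv look like the script was run directly
    runpy.run_path(script, run_name="__main__")


if __name__ == "__main__":
    main()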

@oleksandr-pavlyk
Contributor Author

Running

python -m daal4py -m pytest --disable-warnings --pyargs sklearn

in an environment with 0.20.1 installed currently produces 8 failures. One of them is in test_parallel_classification, which I cannot explain today.

@anton-malakhov

Why not rename the module to just daal? It would follow the same suit as mkl-service and tbb4py.

@oleksandr-pavlyk
Contributor Author

@anton-malakhov That would require renaming the daal4py package to daal. That name has already been claimed by the installed pydaal.

@oleksandr-pavlyk
Contributor Author

@ogrisel I looked into why test_parallel_classification fails.

Consider the following snippet:

# test_parallel_classifier.py
import sklearn
import sklearn.datasets
import sklearn.model_selection
import sklearn.utils
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

iris = sklearn.datasets.load_iris()
rng = sklearn.utils.check_random_state(0)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    iris.data, iris.target, random_state=rng)

backend = 'loky'
# backend = 'multiprocessing'
with sklearn.utils.parallel_backend(backend):
    base_est = SVC(gamma='scale', decision_function_shape='ovr')
    ensemble = BaggingClassifier(base_est, n_jobs = 2, random_state=0).fit(X_train, y_train)

print(ensemble.estimators_[0]._dual_coef_)

Executing with vanilla scikit-learn via python test_parallel_classifier.py prints the matrix of dual coefficients as expected.

Running the above snippet with the monkey-patched code, python -m daal4py test_parallel_classifier.py,
results in an error with n_jobs=2 and backend='loky', but not with n_jobs=1, or with n_jobs=2 and backend='multiprocessing':

  File "test_parallel_classifier.py", line 19, in <module>
    print(ensemble.estimators_[0]._dual_coef_)
  File "./daal4py/sklearn/monkeypatch/svm.py", line 33, in _dual_coef_getter
    return self._internal_dual_coef_
AttributeError: 'SVC' object has no attribute '_internal_dual_coef_'

The following is my account of what transpires to the best of my understanding.

  1. Monkey patching adds a _dual_coef_ attribute to the class SVC, set to property(_dual_coef_getter, _dual_coef_setter). These are implemented in svm.py#L32-53 and serve the purpose of deleting daal_model from the instance if someone attempts to modify _dual_coef_ (see svm/tests/test_svm::test_tweak_params).

  2. With 'loky', the instance of the class on the worker ends up not being monkey-patched. The fitted instances have the attribute _dual_coef_ set to an explicit array. (I checked this by adding prints to _parallel_build_estimators in ensemble/bagging.py.)

  3. When results are gathered at the host, somehow the fitted instances are those of the patched class, and the value of _dual_coef_ computed on the worker does not overwrite the property, but gets discarded.
    The value of _internal_dual_coef_ was never set by the unpatched fit on the worker, resulting in the observed behavior.

This problem is not observable if daal4py.sklearn.patch_sklearn() is called from sklearn/__init__.py, since instances of SVC on workers appear to be patched.

Hence questions:

  • If monkey patching is not done inside sklearn/__init__.py, is there a way to ensure that patching is also applied to the workers spawned by loky?
  • If that proves impossible, could you suggest a way to adjust _internal_dual_coef_getter to have a reasonable fallback when the _internal_dual_coef_ attribute is unavailable (see the sketch after this list)? Can the value of dual_coef_ be reused?
  • The only reason these setter and getter were added is to pass the test_tweak_params test. Perhaps it is better/easier to mark this test as expected to fail with monkey patching?
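To make the second question concrete, here is a minimal sketch of a fallback getter; it is an illustration based on the attribute names described in item 1 above, not the PR's actual code:

def _dual_coef_getter(self):
    # Value maintained by the patched setter.
    try:
        return self._internal_dual_coef_
    except AttributeError:
        # Fall back to whatever an unpatched fit() on a worker may have stored
        # directly in the instance __dict__ (shadowed by the class property).
        return self.__dict__.get('_dual_coef_')


def _dual_coef_setter(self, value):
    self._internal_dual_coef_ = value
    # Modifying the dual coefficients invalidates the cached DAAL model,
    # as described in item 1 above (attribute name assumed).
    self.__dict__.pop('daal_model', None)


# Item 1: the property is installed on the class, so instance access goes
# through the getter/setter above.
from sklearn.svm import SVC
SVC._dual_coef_ = property(_dual_coef_getter, _dual_coef_setter)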

Thanks for your input.

@oleksandr-pavlyk
Contributor Author

oleksandr-pavlyk commented Dec 6, 2018

If the monkey-patching line is added to site-packages/sklearn/__init__.py, a run of pytest --disable-warnings --pyargs sklearn completes with 6 failures, 5 of which are related to the use of SVC. These are

  1. sklearn.ensemble.tests.test_bagging::test_sparse_classification
  2. sklearn.svm.tests.test_sparse::test_svc_iris
  3. sklearn.svm.tests.test_sparse::test_sparse_realdata
  4. sklearn.svm.tests.test_svm::test_precomputed
  5. sklearn.svm.tests.test_svm::test_sample_weights

The other failing test is sklearn.cluster.tests.test_k_means::test_k_means_fit_predict. This is resolved if n_init is set to 1 instead of the default 10.

The cluster of 5 SVC-related failures all have one thing in common: the dataset contains duplicate samples, or bagging results in choosing a subset of features for which such duplicate samples appear.

I feel compelled to point out that such inputs lead to a quadratic optimization problem with a non-positive-definite matrix Q, leading to ambiguity, typically resolved via use of the parameter tau (see src/libsvm/svm.cpp#L89). While it is important for the implementation to gracefully handle such inputs, they cannot be clearly separated into clusters, and they lead to ambiguities and sensitivity of the solution to platform/compiler-specific quirks.
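For background, this is the standard C-SVM dual (not specific to this PR): duplicate samples give the kernel matrix repeated rows and columns, so Q below is singular, and the optimal dual coefficients need not be unique even when the decision function is.

$$\max_{\alpha}\ \sum_i \alpha_i \;-\; \tfrac{1}{2}\,\alpha^{\top} Q\,\alpha,
\qquad Q_{ij} = y_i\, y_j\, K(x_i, x_j),
\qquad 0 \le \alpha_i \le C,\quad \sum_i \alpha_i y_i = 0.$$

Different solvers, or the same solver built with different compilers, can then legitimately return different dual_coef_, which is where libsvm's tau tie-breaking comes in.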

I was able to resolve some of these failures by tightening the tolerance parameter:

diff --git a/sklearn/ensemble/tests/test_bagging.py b/sklearn/ensemble/tests/test_bagging.py
index 608df3dc43..5dcfd4dbea 100644
--- a/sklearn/ensemble/tests/test_bagging.py
+++ b/sklearn/ensemble/tests/test_bagging.py
@@ -118,7 +118,7 @@ def test_sparse_classification():
             for f in ['predict', 'predict_proba', 'predict_log_proba', 'decision_function']:
                 # Trained on sparse format
                 sparse_classifier = BaggingClassifier(
-                    base_estimator=CustomSVC(gamma='scale',
+                    base_estimator=CustomSVC(gamma='scale', tol=1e-8,
                                              decision_function_shape='ovr'),
                     random_state=1,
                     **params
@@ -127,7 +127,7 @@ def test_sparse_classification():

                 # Trained on dense format
                 dense_classifier = BaggingClassifier(
-                    base_estimator=CustomSVC(gamma='scale',
+                    base_estimator=CustomSVC(gamma='scale', tol=1e-8,
                                              decision_function_shape='ovr'),
                     random_state=1,
                     **params

and sometimes by additionally loosening the comparison tolerance:

diff --git a/sklearn/svm/tests/test_sparse.py b/sklearn/svm/tests/test_sparse.py
index ce14bda1db..33f695538a 100644
--- a/sklearn/svm/tests/test_sparse.py
+++ b/sklearn/svm/tests/test_sparse.py
@@ -138,17 +138,17 @@ def test_svc_with_custom_kernel():
 def test_svc_iris():
     # Test the sparse SVC with the iris dataset
     for k in ('linear', 'poly', 'rbf'):
-        sp_clf = svm.SVC(gamma='scale', kernel=k).fit(iris.data, iris.target)
-        clf = svm.SVC(gamma='scale', kernel=k).fit(iris.data.toarray(),
+        sp_clf = svm.SVC(gamma='scale', kernel=k, tol=1e-10).fit(iris.data, iris.target)
+        clf = svm.SVC(gamma='scale', kernel=k, tol=1e-10).fit(iris.data.toarray(),
                                                    iris.target)
 
         assert_array_almost_equal(clf.support_vectors_,
                                   sp_clf.support_vectors_.toarray())
-        assert_array_almost_equal(clf.dual_coef_, sp_clf.dual_coef_.toarray())
+        assert_array_almost_equal(clf.dual_coef_, sp_clf.dual_coef_.toarray(), decimal=4, err_msg=k)
         assert_array_almost_equal(
             clf.predict(iris.data.toarray()), sp_clf.predict(iris.data))
         if k == 'linear':
-            assert_array_almost_equal(clf.coef_, sp_clf.coef_.toarray())
+            assert_array_almost_equal(clf.coef_, sp_clf.coef_.toarray(), decimal=4)
 
 
 def test_sparse_decision_function():
@@ -310,11 +310,12 @@ def test_sparse_realdata():
          3., 0., 0., 2., 2., 1., 3., 1., 1., 0., 1., 2., 1.,
          1., 3.])
 
-    clf = svm.SVC(kernel='linear').fit(X.toarray(), y)
-    sp_clf = svm.SVC(kernel='linear').fit(sparse.coo_matrix(X), y)
+    clf = svm.SVC(kernel='linear', tol=1e-10).fit(X.toarray(), y)
+    sp_clf = svm.SVC(kernel='linear', tol=1e-10).fit(sparse.coo_matrix(X), y)
 
+    assert_array_equal(clf.support_, sp_clf.support_)
     assert_array_equal(clf.support_vectors_, sp_clf.support_vectors_.toarray())
-    assert_array_equal(clf.dual_coef_, sp_clf.dual_coef_.toarray())
+    assert_array_almost_equal(clf.dual_coef_, sp_clf.dual_coef_.toarray())
 
 
 def test_sparse_svc_clone_with_callable_kernel():

I'd like to argue that such failures are not meaningful, and I'd suggest adding noise to the input data used in the SVM tests to eliminate duplicate samples.

Perhaps I should open a separate discussion issue to this effect in the scikit-learn project.

@ogrisel
Contributor

ogrisel commented Dec 7, 2018

For the first issue (the monkeypatch that does not work with loky workers and arguably also with multiprocessing workers under Windows), I will try to think about a possible solution, maybe using class inheritance. I have some ideas but it requires some experimentation.

For the support vector machine identifiability issue with duplicate samples, thank you very much for your analysis. I am currently building daal4py to reproduce the issue on my laptop, see the problem in more detail, and devise the best course of action (decreasing tol or changing the dataset to avoid duplicated samples).

@ogrisel
Contributor

ogrisel commented Dec 7, 2018

Before merging this PR, I think it would be good to change the travis configuration to launch the sklearn tests for the last stable vanilla release of scikit-learn against the patched daal4py code from this repo.

Looking at the current travis config, it seems that the daal4py tests themselves are not run by travis. Is this intentional? Or are they run automatically by conda build?

Later we can also add a travis cron job that does the same against scikit-learn master.

@fschlimb
Contributor

fschlimb commented Dec 7, 2018

Yes, the tests are run by conda build which is called in travis CI.

@fschlimb
Contributor

fschlimb commented Dec 7, 2018

I'd like to have travis run only the subset of sklearn tests that are affected by daal4py. It should finish in reasonable time.
A cron job for regular full sklearn testing makes sense to me.

if LooseVersion(sklearn_version) < LooseVersion("0.20.0"):
    raise NotImplementedError("daal4sklearn is for scikit-learn 0.20.0 only ...")
elif LooseVersion(sklearn_version) > LooseVersion("0.20.1"):
    warnings.warn("daal4sklearn {daal4py_version} has only been tested with scikit-learn 0.20.0, found version...")
Contributor

The warnings module is not imported.

Contributor Author

Thanks! I'll fix it in a moment.

…ould not be setting them, but setting _internal_dual_coef_ and _internal_intercept_ instead
@oleksandr-pavlyk
Contributor Author

@ogrisel I tried to implement a workaround for the issue of interaction between loky and dynamic monkey-patching, where workers end up being unpatched and issues arise when combining worker results (trained instances) with clones of the patched class on the host.

That can be found in the branch features/sklearn-patches-try-subclassing. With this change, new failures appear, like this one:

python -m daal4py -m pytest --disable-warnings --pyargs sklearn.svm.tests.test_svm::test_unfitted

stemming from exception messages now containing the name of the class used to replace SVC instead of SVC itself:

>       raise self.test_case.failureException(msg)
E       AssertionError: ".*\bSVC\b.*\bnot\b.*\bfitted\b" does not match "This SVC_daal4py instance is not fitted yet. Call 'fit' with appropriate arguments before using this method."

Do you have any suggestions on how to work around these issues, or should the tests perhaps be changed so that the patched classes can pass them?

Thank you

@ogrisel
Contributor

ogrisel commented Dec 21, 2018

Wouldn't it be possible to keep the same class names, using a strategy that looks like the following:

from sklearn.svm import SVC as SVC_sklearn


class SVC(SVC_sklearn):
    # put the patched methods here
    pass


def install_patch():
    from sklearn import svm
    svm.SVC = SVC

@oleksandr-pavlyk
Contributor Author

@ogrisel This approach works well in general. It runs into trouble with pytest and Ridge though, perhaps because it cannot handle uses of super(Ridge, self): it claims that self must be a subtype or an instance of Ridge, which it is. So I am leaving Ridge as before. There was also a problem with pickling LinearRegression when it was monkey-patched via subclassing.

I also updated pairwise distance computation to incorporate fixes from upstream.

copy_X=copy_X, n_jobs=n_jobs)


setattr(LinearRegression, 'fit', fit)
Contributor

Any particular reason why you do not define the functions inline?
The 'normal' way is to define class-methods inline.

Contributor Author

Less copy-pasting is the only reason.

@fschlimb
Contributor

I merged with trunk to separate the fix for low order moments, coming up next.

@oleksandr-pavlyk
Contributor Author

Thanks @fschlimb. I can confirm that after merging tc/parse_moments into this branch, the problem I was experiencing with low_order_moments on DAAL 2019.1 goes away.

@fschlimb fschlimb merged commit cdc1c85 into master Jan 10, 2019
@oleksandr-pavlyk oleksandr-pavlyk deleted the feature/sklearn-patches branch January 10, 2019 21:36
@oleksandr-pavlyk oleksandr-pavlyk restored the feature/sklearn-patches branch January 10, 2019 21:36
@oleksandr-pavlyk
Contributor Author

This branch was merged in error.

Master and the branch behind this pull request were force-pushed to revert that.

Regrettably, there is no way to reopen this PR, so I reopened it as #35.

Vika-F pushed a commit to Vika-F/scikit-learn-intelex that referenced this pull request Apr 4, 2024
Native KMeans: fix wrong parameters passed to DAAL prediction object

Native logistic regression: scale DAAL's loss function value and gradient by n_samples

daal4py logistic regression: don't compute loss function value and gradient twice

* Align kmeans native benchmark to sklearn

* Don't evaluate func, grad if not necessary

* Scale loss value and gradient like sklearn does

* Specify maxIterations, accuracyThreshold for multi_class_classifier

* Fix indentation in Makefiles

* Fix logic to select n_features_per_node in daal4py df_regr

* Specifically say memorySavingMode=false in native df_regr

* Reformat native df_regr bench