Intel(R) Distribution for Python's sklearn patches #15
Conversation
This looks great.
May I suggest to change the package hierarchy as follows:
```
daal4py/
  [other daal4py packages]
  sklearn/
    decision_forest.py  # provides sklearn compatible estimator without having to patch
    monkeypatch/        # has all the tools to monkeypatch the top level sklearn namespace
```
daal4py/sklearn_patches/__main__.py (outdated)

```python
# disclosure or delivery of the Materials, either expressly, by
# implication, inducement, estoppel or otherwise. Any license under such
# intellectual property rights must be express and approved by Intel in
# writing.
```
This license header seems to contradict the ASL license of the daal4py repo.
Yes, this should become Apache.
```python
def enable(name=None):
    if sklearn_version != "0.20.0":
        raise NotImplementedError("daal4sklearn is for scikit-learn 0.20.0 only, found version {0}".format(sklearn_version))
```
Maybe you could raise NotImplementedError only if the scikit-learn version is lower than 0.20.0, and only issue a UserWarning when scikit-learn is more recent?
This would make it easier to test the patch on the development version of scikit-learn.
How difficult is it to include our patches for 0.19?
The following may be handy for this:

```python
import warnings
from distutils.version import LooseVersion

if LooseVersion(sklearn_version) < LooseVersion("0.20.0"):
    raise NotImplementedError("daal4sklearn is for scikit-learn 0.20.0 only ...")
elif LooseVersion(sklearn_version) > LooseVersion("0.20.0"):
    warnings.warn("daal4sklearn {daal4py_version} has only been tested with scikit-learn 0.20.0, found version...")
```
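A self-contained sketch of this gating logic, using a plain tuple comparison instead of `distutils.version.LooseVersion` (which is deprecated in recent Pythons); the function and version strings are illustrative, not part of daal4py:

```python
import warnings

TESTED_VERSION = "0.20.0"

def _as_tuple(version):
    # "0.20.1" -> (0, 20, 1); non-numeric suffixes such as "rc1" are ignored
    parts = []
    for p in version.split("."):
        digits = "".join(ch for ch in p if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

def check_sklearn_version(found):
    # Older than the supported release: hard error.
    if _as_tuple(found) < _as_tuple(TESTED_VERSION):
        raise NotImplementedError(
            "daal4sklearn requires scikit-learn >= %s, found %s"
            % (TESTED_VERSION, found))
    # Newer than the tested release: warn but continue, so the patches
    # can still be tried against scikit-learn development versions.
    if _as_tuple(found) > _as_tuple(TESTED_VERSION):
        warnings.warn(
            "daal4sklearn has only been tested with scikit-learn %s, found %s"
            % (TESTED_VERSION, found), UserWarning)
```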
Very cool. I suggest extending @ogrisel's structure to …

This makes the '-m' feature more convenient: it becomes '-m daal4py'. Also, I suggest adding a function to the daal4py package so users can do …

Additionally or alternatively, we might want to add patch_skl.py so the following would be enough: …
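The `-m daal4py` feature suggested above relies on the package shipping a `__main__.py`. Here is a self-contained sketch of how `python -m <package>` resolves to that file, using a throwaway package name (`mypatcher` is made up for illustration):

```python
import os
import runpy
import sys
import tempfile

# Build a throwaway package with a __main__.py, mimicking a package
# that enables its patches when run with `python -m <package>`.
tmp = tempfile.mkdtemp()
pkg = os.path.join(tmp, "mypatcher")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "__main__.py"), "w") as f:
    f.write("MESSAGE = 'patches enabled'\nprint(MESSAGE)\n")

# Running the *package* executes its __main__ submodule, exactly as
# `python -m mypatcher` would on the command line.
sys.path.insert(0, tmp)
globals_after = runpy.run_module("mypatcher", run_name="__main__")
# prints: patches enabled
```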
Also, please find a better name than 'daal4py_utils.py', maybe simply 'utils.py'.
We also need tests and documentation.
Force-pushed from 20affc6 to a4d5673.
I have restructured the code as suggested. Monkey patching can now be enabled via … I have used …
Force-pushed from a4d5673 to 903977d.
Running … in an environment with 0.20.1 installed currently produces 8 failures. One of them is in …
why not rename the module to just …?

@anton-malakhov That would require renaming …
Force-pushed from 903977d to f0a2134.
@ogrisel I looked into why … Consider the following snippet:

```python
# test_parallel_classifier.py
import sklearn
import sklearn.datasets
import sklearn.model_selection
import sklearn.utils
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

iris = sklearn.datasets.load_iris()
rng = sklearn.utils.check_random_state(0)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    iris.data, iris.target, random_state=rng)

backend = 'loky'
# backend = 'multiprocessing'
with sklearn.utils.parallel_backend(backend):
    base_est = SVC(gamma='scale', decision_function_shape='ovr')
    ensemble = BaggingClassifier(base_est, n_jobs=2, random_state=0).fit(X_train, y_train)

print(ensemble.estimators_[0]._dual_coef_)
```

Executing with vanilla scikit-learn … Running the above snippet with the monkey-patched code, …
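The failure mode discussed here comes down to how pickle transports estimators between processes: an instance is serialized by *reference* to its class (module path and name), not by value, so a worker process that never executed the monkeypatch reconstructs the object against the original, unpatched class. A self-contained illustration with a toy class (the names are made up; the "fresh worker" is simulated by undoing the patch before unpickling):

```python
import pickle

class Estimator:
    def fit_message(self):
        return "original fit"

_original_fit = Estimator.fit_message

def patched_fit(self):
    return "patched fit"

# Monkeypatch the class in the "parent" process.
Estimator.fit_message = patched_fit
obj = Estimator()
payload = pickle.dumps(obj)  # stores a reference to Estimator, not its methods

# Simulate a worker that imported the module but never ran the patch.
Estimator.fit_message = _original_fit
restored = pickle.loads(payload)
print(restored.fit_message())  # prints: original fit -- the patch did not survive
```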
The following is my account of what transpires to the best of my understanding.
This problem is not observable if … Hence my questions: …
Thanks for your input.
If the monkey patching line is added to …
The other failing test is …

The cluster of 5 SVC-related failures all have one thing in common: the dataset contains duplicate samples, or bagging results in choosing a subset of features for which such duplicate samples appear. I feel compelled to point out that such inputs lead to a quadratic optimization problem with a non-positive-definite matrix …

I was able to resolve some of these failures by tightening the tolerance parameter:

```diff
diff --git a/sklearn/ensemble/tests/test_bagging.py b/sklearn/ensemble/tests/test_bagging.py
index 608df3dc43..5dcfd4dbea 100644
--- a/sklearn/ensemble/tests/test_bagging.py
+++ b/sklearn/ensemble/tests/test_bagging.py
@@ -118,7 +118,7 @@ def test_sparse_classification():
     for f in ['predict', 'predict_proba', 'predict_log_proba', 'decision_function']:
         # Trained on sparse format
         sparse_classifier = BaggingClassifier(
-            base_estimator=CustomSVC(gamma='scale',
+            base_estimator=CustomSVC(gamma='scale', tol=1e-8,
                                      decision_function_shape='ovr'),
             random_state=1,
             **params
@@ -127,7 +127,7 @@ def test_sparse_classification():
         # Trained on dense format
         dense_classifier = BaggingClassifier(
-            base_estimator=CustomSVC(gamma='scale',
+            base_estimator=CustomSVC(gamma='scale', tol=1e-8,
                                      decision_function_shape='ovr'),
             random_state=1,
             **params
```

and sometimes additionally loosening the comparison fuzz:

```diff
diff --git a/sklearn/svm/tests/test_sparse.py b/sklearn/svm/tests/test_sparse.py
index ce14bda1db..33f695538a 100644
--- a/sklearn/svm/tests/test_sparse.py
+++ b/sklearn/svm/tests/test_sparse.py
@@ -138,17 +138,17 @@ def test_svc_with_custom_kernel():
 def test_svc_iris():
     # Test the sparse SVC with the iris dataset
     for k in ('linear', 'poly', 'rbf'):
-        sp_clf = svm.SVC(gamma='scale', kernel=k).fit(iris.data, iris.target)
-        clf = svm.SVC(gamma='scale', kernel=k).fit(iris.data.toarray(),
+        sp_clf = svm.SVC(gamma='scale', kernel=k, tol=1e-10).fit(iris.data, iris.target)
+        clf = svm.SVC(gamma='scale', kernel=k, tol=1e-10).fit(iris.data.toarray(),
                                                    iris.target)
         assert_array_almost_equal(clf.support_vectors_,
                                   sp_clf.support_vectors_.toarray())
-        assert_array_almost_equal(clf.dual_coef_, sp_clf.dual_coef_.toarray())
+        assert_array_almost_equal(clf.dual_coef_, sp_clf.dual_coef_.toarray(), decimal=4, err_msg=k)
         assert_array_almost_equal(
             clf.predict(iris.data.toarray()), sp_clf.predict(iris.data))
         if k == 'linear':
-            assert_array_almost_equal(clf.coef_, sp_clf.coef_.toarray())
+            assert_array_almost_equal(clf.coef_, sp_clf.coef_.toarray(), decimal=4)

 def test_sparse_decision_function():
@@ -310,11 +310,12 @@ def test_sparse_realdata():
                   3., 0., 0., 2., 2., 1., 3., 1., 1., 0., 1., 2., 1.,
                   1., 3.])

-    clf = svm.SVC(kernel='linear').fit(X.toarray(), y)
-    sp_clf = svm.SVC(kernel='linear').fit(sparse.coo_matrix(X), y)
+    clf = svm.SVC(kernel='linear', tol=1e-10).fit(X.toarray(), y)
+    sp_clf = svm.SVC(kernel='linear', tol=1e-10).fit(sparse.coo_matrix(X), y)
+    assert_array_equal(clf.support_, sp_clf.support_)

     assert_array_equal(clf.support_vectors_, sp_clf.support_vectors_.toarray())
-    assert_array_equal(clf.dual_coef_, sp_clf.dual_coef_.toarray())
+    assert_array_almost_equal(clf.dual_coef_, sp_clf.dual_coef_.toarray())

 def test_sparse_svc_clone_with_callable_kernel():
```

I'd like to argue that such failures are not misleading, and I'd suggest adding noise to the input data used for SVM tests to resolve any sample duplicates. Perhaps I should open a separate discussion issue to this effect in the scikit-learn project.
For the first issue (the monkeypatch that does not work with loky workers and arguably also with multiprocessing workers under Windows), I will try to think about a possible solution, maybe using class inheritance. I have some ideas, but they require some experimentation.

For the support vector machine identifiability issue with duplicate samples, thank you very much for your analysis. I am currently building daal4py to reproduce the issue on my laptop, see the problem in more detail, and decide on the best course of action (decreasing tol or changing the dataset to avoid duplicated samples).
Before merging this PR, I think it would be good to change the travis configuration to launch the sklearn tests for the last stable vanilla release of scikit-learn with the patched daal4py code from this repo. Looking at the current travis config, it seems that the daal4py tests themselves are not run by travis. Is this intentional? Or are they run automatically by conda build? Later we can also add a travis cron job that does the same against scikit-learn master.
Yes, the tests are run by conda build, which is called in travis CI.

I'd like to have travis run only the subset of sklearn tests that are affected by daal4py. It should finish in reasonable time.
```python
if LooseVersion(sklearn_version) < LooseVersion("0.20.0"):
    raise NotImplementedError("daal4sklearn is for scikit-learn 0.20.0 only ...")
elif LooseVersion(sklearn_version) > LooseVersion("0.20.1"):
    warnings.warn("daal4sklearn {daal4py_version} has only been tested with scikit-learn 0.20.0, found version...")
```
The `warnings` module is not imported.
Thanks! I'll fix it in a moment.
of scikit-learn classes as a stand-alone module.
Also formatting changes to make lines shorter
…ould not be setting them, but setting _internal_dual_coef_ and _internal_intercept_ instead
Force-pushed from 8797f04 to 8689d0b.
@ogrisel I tried to implement a work-around for the issue of interaction between loky and dynamic monkey-patching, where workers end up being unpatched and issues arise when combining worker results (trained instances) with clones of the patched class on the host. That can be found in branch …

… stemming from exception messages now containing the name of the class used to replace …

Do you have any suggestions on how to work around these issues, or perhaps on changing the tests to allow the patched code to pass? Thank you
Wouldn't it be possible to keep the same class names using a strategy that looks like the following:

```python
from sklearn.svm import SVC as SVC_sklearn

class SVC(SVC_sklearn):
    # put the patched methods here
    pass

def install_patch():
    import sklearn.svm
    sklearn.svm.SVC = SVC
```
@ogrisel This approach works well in general. It runs into trouble with pytest and Ridge though, perhaps unable to handle uses of …

I also updated pairwise distance computation to incorporate fixes from upstream.
```python
copy_X=copy_X, n_jobs=n_jobs)

setattr(LinearRegression, 'fit', fit)
```
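The `setattr` pattern under discussion (define a standalone function, then attach it as a method) can be sketched on a toy class; the names here are illustrative, not daal4py's:

```python
class LinearModel:
    def fit(self, X, y):
        return "reference fit"

# A standalone replacement, defined out-of-line. The same function object
# could be attached to several similar classes, which reduces copy-pasting
# compared to defining the method inline in each patched class.
def fit(self, X, y):
    return "fast fit on %d samples" % len(X)

setattr(LinearModel, 'fit', fit)

model = LinearModel()
print(model.fit([[1.0], [2.0]], [0, 1]))  # prints: fast fit on 2 samples
```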
Any particular reason why you do not define the functions inline?
The 'normal' way is to define class-methods inline.
Less copy-pasting is the only reason.
I merged with trunk to separate the fix for low order moments, coming up next.
Thanks @fschlimb. I can confirm that after merging tc/parse_moments into this branch, the problem I was experiencing with … went away.
This branch was merged in error. Master and the branch behind this request were force-pushed to revert that. Regrettably, there is no way to reopen this PR, so I reopened it as #35.
- Native KMeans: fix wrong parameters passed to DAAL prediction object
- Native logistic regression: scale DAAL's loss function value and gradient by n_samples
- daal4py logistic regression: don't compute loss function value and gradient twice
- Align kmeans native benchmark to sklearn
- Don't evaluate func, grad if not necessary
- Scale loss value and gradient like sklearn does
- Specify maxIterations, accuracyThreshold for multi_class_classifier
- Fix indentation in Makefiles
- Fix logic to select n_features_per_node in daal4py df_regr
- Specifically say memorySavingMode=false in native df_regr
- Reformat native df_regr bench
@fschlimb @anton-malakhov @ogrisel

They can be invoked via …, or by explicitly enabling them via ….

Names, design, etc. are up for discussion.