[testing] Expand sklearnex testing to all methods taking "X" or "y" as input #1865

icfaust · 2024-06-13T05:06:19Z

Description

All tests under the sklearnex/tests folder use centralized estimator and method lists, which were previously limited to methods derived from the various inherited sklearn Mixins. This covered all 'core' functionality like 'fit', 'predict', 'transform', etc., which is directly patched by oneDAL functions. Some next-level methods (those which use this core functionality with a small number of additional calculations), like score, were included. However, this list was incomplete, as some estimators (like PCA and KNeighbors) implement special additional methods like inverse_transform and kneighbors_graph, which require special inputs and/or outputs. Up until now, these methods have not been tested using the centralized testing suite (and may not have been tested at all). This PR dynamically collects all methods of all patched estimators which contain "X" or "y" in their inputs and adds them to testing, increasing flexibility and reducing maintenance.
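
As a rough illustration of the dynamic collection described above, the sketch below inspects method signatures with the standard library. It is not the code from this PR: the helper name `methods_taking_x_or_y` and the `ToyPCA` class are hypothetical, and the real test suite may filter methods differently.

```python
import inspect


def methods_taking_x_or_y(estimator_cls):
    """Return sorted names of public methods with an "X" or "y" parameter.

    Hypothetical helper for illustration; not part of sklearnex.
    """
    found = []
    for name, member in inspect.getmembers(estimator_cls, callable):
        if name.startswith("_"):
            continue  # skip private and dunder methods
        try:
            params = inspect.signature(member).parameters
        except (TypeError, ValueError):
            continue  # some callables expose no introspectable signature
        if "X" in params or "y" in params:
            found.append(name)
    return sorted(found)


class ToyPCA:
    """Stand-in estimator (assumption), used only to exercise the helper."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X

    def inverse_transform(self, X):
        return X

    def get_params(self, deep=True):
        return {}

    def _private(self, X):
        return X


collected = methods_taking_x_or_y(ToyPCA)
print(collected)
```

Under this approach a method such as inverse_transform is picked up without being named in a hand-maintained list, while get_params and private helpers are skipped.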

The additional code comes from having to fix issues related to taking in dpnp and dpctl inputs, since they were previously untested.

Rather than using _device_offload.dispatch, which is code-heavy, cumbersome, and carries significant overhead, a limited use of get_namespace was added to the failing methods to provide the necessary pre- or post-processing around 'core' oneDAL methods.

With these additions, dpnp and dpctl inputs are almost fully covered in the codebase. One deselection was made in radius_neighbors and radius_neighbors_graph in the NearestNeighbors algorithm, as these would require significant work (i.e. the implementation of a new estimator) in order to properly support dpnp/dpctl inputs (unless we knowingly only allow CPU computation).

The listed changes are as follows:

Rename the ensemble algorithms' check_sample_weight method to _check_sample_weight to make it a private method
Change IncrementalEmpiricalCovariance mahalanobis and score methods to accept dpnp/dpctl inputs (score was previously missed as this estimator has no Mixins)
Raise an error in the NearestNeighbors algorithm when radius_neighbors and radius_neighbors_graph are used with non-numpy/pandas inputs, which are unsupported
Remove radius_neighbors from KNeighborsRegressor and KNeighborsClassifier, as they do not support it in sklearn, and it was purely using sklearn functionality
Fix the kneighbors_graph method of NearestNeighbors, KNeighborsRegressor, and KNeighborsClassifier to take in dpnp/dpctl inputs and return a scipy.sparse.csr_matrix (removed wrap_output_data, etc.)
Disable score_samples and score methods in stability testing for PCA (will add tickets to address these issues)
Introduce PCA.inverse_transform using get_namespace to ensure components_ and mean_ are on the same device as X

NOTE: the current private CI fails for PCA in the inverse_transform method in dpnp testing. This is due to an out-of-date version of dpnp and is currently being addressed. Private CI will be re-run as soon as it is updated.

icfaust · 2024-06-13T05:12:43Z

/azp run CI

azure-pipelines · 2024-06-13T05:12:53Z

Azure Pipelines successfully started running 1 pipeline(s).

sklearnex/covariance/incremental_covariance.py

icfaust · 2024-06-25T10:07:51Z

/intelci: run

icfaust · 2024-06-25T21:42:38Z

/intelci: run

samir-nasibli

initially reviewed

samir-nasibli · 2024-06-26T14:06:59Z

sklearnex/covariance/incremental_covariance.py

+    @wrap_output_data
+    def score(self, X_test, y=None):
+        xp, _ = get_namespace(X_test)
+
+        location = self.location_
+        if sklearn_check_version("1.0"):
+            X = self._validate_data(
+                X_test,
+                dtype=[np.float64, np.float32],
+                reset=False,
+                copy=self.copy,
+            )
+        else:
+            X = check_array(
+                X_test,
+                dtype=[np.float64, np.float32],
+                copy=self.copy,
+            )
+
+        if "numpy" not in xp.__name__:
+            location = xp.asarray(location, device=X_test.device)
+            if isinstance(X, np.ndarray):
+                X = X_test
+
+        est = clone(self)
+        est.set_params(**{"assume_centered": True})
+


Score is wrapped via the wrap_output_data decorator. Only numpy is used here; are these non-numpy checks and branches still required? I recommend removing them since they are not used.

The purpose of this PR is to verify all estimators with "X" or "y" will properly evaluate using the various dataframes and queues in all of our various tests (datatypes and stability currently, possibly memory leaks later). Score is implemented in the way it is in order to guarantee dpnp/dpctl tensor conformance.

The non-numpy branch is necessary because of the behavior of _validate_data: it may return a numpy array, or it may use array_api depending on the array_api_dispatch config, and there are also issues with dpnp not supporting array_api. This is very much dependent on the sklearn version as well.

samir-nasibli · 2024-06-26T14:33:44Z

sklearnex/covariance/incremental_covariance.py

+    @wrap_output_data
+    def score(self, X_test, y=None):


Generally, I would like to see these methods covered by tests with dpnp/dpctl inputs.

I have now added a test which validates numerical conformance. However, the purpose of this PR was to extend checking with dpnp/dpctl to everything possible.

samir-nasibli · 2024-06-26T14:36:25Z

sklearnex/covariance/incremental_covariance.py

    def mahalanobis(self, X):
        if sklearn_check_version("1.0"):
-            self._validate_data(X, reset=False, copy=self.copy)
-        else:
-            check_array(X, copy=self.copy)
+            self._check_feature_names(X, reset=False)


Does it work properly with sycl usm ndarray inputs? Could you please point out tests where it is called with dpnp/dpctl inputs?

This is the case throughout the LogisticRegression implementation:
It has been included before the dispatch, meaning it operates on the sycl usm ndarrays when provided.

https://github.com/intel/scikit-learn-intelex/blob/main/sklearnex/linear_model/logistic_regression.py#L111
Testing of fit in LogisticRegression:
https://github.com/intel/scikit-learn-intelex/blob/main/sklearnex/linear_model/tests/test_logreg.py#L50

_check_feature_names is called in sklearn's BaseEstimator _validate_data: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/base.py#L654 It purposefully works with data which hasn't been touched by check_array, meaning it operates on the raw input.

The purpose of this method is to validate aspects of pandas and pandas-like dataframes, which can have named features: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/base.py#L479 of which the most important part is https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/validation.py#L2226

sklearnex/ensemble/_forest.py

samir-nasibli · 2024-06-26T15:43:42Z

sklearnex/neighbors/common.py

+        # construct CSR matrix representation of the k-NN graph
+        if mode == "connectivity":
+            A_ind = self.kneighbors(X, n_neighbors, return_distance=False)
+            xp, _ = get_namespace(A_ind)


With Array API style coding, I think it makes sense to call get_namespace once, early, on the input.
Wouldn't it be better to move the get_namespace call to line 272 and apply it to the input X?

What is difficult here is that X is a kwarg with a default value of None. In that case, it would depend on the array type of the values stored in onedal_estimator: https://github.com/intel/scikit-learn-intelex/blob/main/onedal/neighbors/neighbors.py#L298 They are currently numpy arrays, but that's not guaranteed going forward. The most dependable way is to check the output of kneighbors, which is why it's checked inside the if statement. I don't like it either, but it's the only guaranteed safe way I could think of.

sklearnex/covariance/incremental_covariance.py

icfaust · 2024-06-26T22:53:49Z

/intelci: run

icfaust · 2024-06-27T07:11:24Z

sklearnex/decomposition/pca.py

@@ -210,6 +210,24 @@ def fit_transform(self, X, y=None):
                # Scikit-learn PCA["covariance_eigh"] was fit
                return self._transform(X_fit, xp, x_is_centered=x_is_centered)

+        @wrap_output_data
+        def inverse_transform(X):
+            xp, _ = get_namespace(X)


The weakness in input checking matches issues in sklearn; a PR to fix it is currently open: scikit-learn/scikit-learn#29310

icfaust · 2024-06-27T07:12:47Z

/intelci: run

icfaust · 2024-06-27T07:26:57Z

http://intel-ci.intel.com/ef345679-7188-f16b-a099-a4bf010d0e2e

icfaust · 2024-06-27T08:32:23Z

/intelci: run

icfaust · 2024-06-27T08:52:59Z

http://intel-ci.intel.com/ef346297-5405-f1e1-ad44-a4bf010d0e2e

Vika-F

Thank you for improving dpnp/dpctl support and increasing the product's code coverage.
See a couple of my comments below.

Vika-F · 2024-06-27T09:46:09Z

sklearnex/decomposition/pca.py

+            if "numpy" not in xp.__name__:
+                components = xp.asarray(components, device=X.device)
+                mean = xp.asarray(mean, device=X.device)
+
+            return X @ components + mean


Please add a comment explaining why the check and the asarray conversion are required here.

Vika-F · 2024-06-27T09:50:13Z

sklearnex/neighbors/common.py

+        n_nonzero = n_queries * n_neighbors
+        A_indptr = xp.arange(0, n_nonzero + 1, n_neighbors)
+
+        kneighbors_graph = sp.csr_matrix(


I think it would be better to use sp.csr_array if possible, because scipy is in the process of switching to the array interface. See the note here: https://docs.scipy.org/doc/scipy/reference/sparse.html
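
For readers unfamiliar with the CSR layout built in the hunk above, here is a minimal pure-Python sketch of the (data, indices, indptr) triple for a connectivity-mode k-NN graph. It is a stand-in only: the actual code uses xp.arange and sp.csr_matrix, and the function names `knn_graph_csr` and `csr_to_dense` are hypothetical.

```python
def knn_graph_csr(neighbor_indices, n_samples_fit):
    """Build CSR components for a connectivity k-NN graph.

    neighbor_indices: one list of neighbor ids per query row, all of the
    same length n_neighbors >= 1 (assumed, as kneighbors returns).
    """
    n_queries = len(neighbor_indices)
    n_neighbors = len(neighbor_indices[0])
    n_nonzero = n_queries * n_neighbors
    # Each row holds exactly n_neighbors entries, so row starts are evenly spaced.
    indptr = list(range(0, n_nonzero + 1, n_neighbors))
    indices = [j for row in neighbor_indices for j in row]
    data = [1.0] * n_nonzero  # connectivity mode stores ones
    return data, indices, indptr, (n_queries, n_samples_fit)


def csr_to_dense(data, indices, indptr, shape):
    """Expand CSR components into a dense list-of-lists for inspection."""
    dense = [[0.0] * shape[1] for _ in range(shape[0])]
    for i in range(shape[0]):
        for k in range(indptr[i], indptr[i + 1]):
            dense[i][indices[k]] = data[k]
    return dense


neighbors = [[0, 1], [1, 2], [2, 0]]
data, indices, indptr, shape = knn_graph_csr(neighbors, n_samples_fit=3)
dense = csr_to_dense(data, indices, indptr, shape)
```

Because every query has the same number of neighbors, indptr is just an arithmetic progression, which is why a single arange call is enough in the real implementation.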

Vika-F · 2024-06-27T09:55:15Z

sklearnex/tests/test_run_to_run_stability.py

+        attr = getattr(est, method)
+        if method == "inverse_transform":
+            # PCA's inverse_transform takes (n_samples, n_components)
+            data = (
+                (X[:, : est.n_components_],) if X.shape[1] != est.n_components_ else (X,)
+            )
+        elif method not in ["score", "partial_fit", "path"]:
+            data = (X,)
        else:
-            res = est.score(X, y)
+            data = (X, y)


This code is duplicated here and in sklearnex/tests/test_patching.py.
Is it possible to have it in some common place?
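
One possible shape for such a shared helper is sketched below. This is an assumption about how the duplicated branch could be factored out, not code from the PR: `select_method_args` and the `TinyArray` stand-in are hypothetical, and the method names are taken from the hunk above.

```python
from types import SimpleNamespace


def select_method_args(est, method, X, y):
    """Choose positional arguments for calling est.<method> in a test.

    Hypothetical shared helper mirroring the branch in the hunk above.
    """
    if method == "inverse_transform":
        # PCA's inverse_transform takes (n_samples, n_components)
        n = est.n_components_
        return (X[:, :n],) if X.shape[1] != n else (X,)
    if method not in ("score", "partial_fit", "path"):
        return (X,)
    return (X, y)


class TinyArray:
    """Minimal 2-D array stand-in (illustration only, not numpy/dpnp)."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    @property
    def shape(self):
        return (len(self.rows), len(self.rows[0]) if self.rows else 0)

    def __getitem__(self, key):
        row_sel, col_sel = key  # supports X[:, :n] style slicing only
        return TinyArray([r[col_sel] for r in self.rows[row_sel]])


est = SimpleNamespace(n_components_=2)
X = TinyArray([[1, 2, 3], [4, 5, 6]])
y = [0, 1]
inv_args = select_method_args(est, "inverse_transform", X, y)
```

Both test_run_to_run_stability.py and test_patching.py could then call one function and unpack the result with `attr(*data)`, assuming the helper lives in a shared test utility module.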

icfaust added 4 commits June 13, 2024 06:44

Update test_n_jobs_support.py

a7cafc0

Update _utils.py

b1b18d0

Update _utils.py

43566fa

linting

c7f8c48

icfaust added 24 commits June 13, 2024 13:07

Update _utils.py

96a11fe

Update _forest.py

8a3838d

formatting

9c987a1

fix mistake

e5be1b8

Update _utils.py

04abf05

Update knn_classification.py

5245145

Update knn_regression.py

8054555

Update test_patching.py

1e0fc72

Update knn_regression.py

39d4733

Update knn_classification.py

65d9f4d

Update common.py

21d084a

Update knn_classification.py

ac21049

Update knn_regression.py

1aebb52

Update knn_unsupervised.py

ffed40b

formatting

4348fcd

Update common.py

333f753

formatting

3d41af0

Update incremental_covariance.py

3b29ac6

Update incremental_covariance.py

aea8ae7

Update common.py

8c8de8f

Update common.py

f40b421

Update incremental_covariance.py

5b19a6a

Update incremental_covariance.py

5347e3e

Update common.py

6773e68

icfaust commented Jun 25, 2024

sklearnex/covariance/incremental_covariance.py

icfaust added 2 commits June 25, 2024 11:37

Update incremental_covariance.py

d998de4

firstpass to false

30471f4

Update incremental_covariance.py

1421f3e

samir-nasibli reviewed Jun 26, 2024

md-shafiul-alam reviewed Jun 26, 2024

sklearnex/covariance/incremental_covariance.py

md-shafiul-alam reviewed Jun 26, 2024

sklearnex/covariance/incremental_covariance.py (outdated)

icfaust added 11 commits June 26, 2024 13:49

add requested test

f89b40d

forgotten import

b8fc51c

remove regex

f42e40a

fix assert_allclose

366ddff

forgotten removal

5db8f84

fix test

6cb9c04

fix test

e124dc8

fix tests

b8be3fa

fix tests

cd20703

fix tests

9a6ffc6

add necessary comment

9748496

inverse_transform changes for newest dpctl/dpnp

8e525e0

icfaust commented Jun 27, 2024

forgotten self

c7954b9

location to components

d5c4f75

Vika-F reviewed Jun 27, 2024
