Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[testing] Expand sklearnex testing to all methods taking "X" or "y" as input #1865

Open
wants to merge 97 commits into
base: main
Choose a base branch
from

Conversation

icfaust
Copy link
Contributor

@icfaust icfaust commented Jun 13, 2024

Description

All testing under the sklearnex/tests folder use centralized estimator and method lists which were previously limited to methods based off of the various inherited sklearn Mixins. This tested all 'core' functionality like 'fit', 'predict', 'transform', etc. which are directly patched out by oneDAL functions. Some next level methods (those which use these core functionality with a small number of additional calculations) like score were included. However, this list was incomplete, as some estimators (like PCA and KNeighbors) implement special additional methods like inverse_transform and kneighbors_graph, which require special inputs and/or outputs. Up until now, these methods have not been tested using the centralized testing suite (and may have not been tested at all). This PR dynamically checks and collects all methods to all patched estimators which contain "X" or "y" in the input into testing, increasing flexibility and reducing maintenance.

The additional code comes from having to fix issues related to taking in dpnp and dptcl inputs since they were previously untested.

Rather than using _device_offload.dispatch which is code heavy, cumbersome and with significant overhead, a limited use of get_namespace was added to these failing methods which provide the necessary pre- or post-processing of 'core' oneDAL methods.

With these additions, dpnp and dpctl inputs are almost fully covered in the codebase. One deselection was made in radius_neighbors and radius_neighbors_graph in the NearestNeighbors algorithm, as these would require significant work (i.e. the implementation of a new estimator) in order to properly support dpnp/dpctl inputs (unless we knowingly only allow CPU computation).

The listed changes are as follows:

  • rename ensemble algorithms check_sample_weight method to _check_sample_weight to make it a private method
  • Change IncrementalEmpiricalCovariance mahalanobis and score methods to accept dpnp/dpctl inputs (score was previously missed as this estimator has no Mixins)
  • Provide error to NearestNeighbors algorithm for use of radius_neighbors and radius_neighbors_graph for non-numpy/pandas inputs, which are unsupported.
  • Remove radius_neighbors from KNeighborsRegressor and KNeighorsClassifier, as they do not support it from sklearn, and it was purely using sklearn functionality
  • Fix kneighbors_graph method of NearestNeighbors, KNeighborsRegressor, and KNeighborsClassifier to take in dpnp/dpctl inputs, and return a scipy.csr_matrix (removed wrap output data, etc.)
  • Disable score_samples and score methods in stability testing for PCA (will add tickets to address these issues).
  • Introduce PCA.inverse_transform using get_namespace to insure components_ and mean_ are on the same device as X

NOTE: current private CI fails for PCA in the inverse_transform method in dpnp testing, this is due to an out-of-date version of dpnp, and is currently getting addressed. Private CI will be re-run as soon as it is updated.

@icfaust
Copy link
Contributor Author

icfaust commented Jun 13, 2024

/azp run CI

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@icfaust
Copy link
Contributor Author

icfaust commented Jun 25, 2024

/intelci: run

@icfaust
Copy link
Contributor Author

icfaust commented Jun 25, 2024

/intelci: run

Copy link
Contributor

@samir-nasibli samir-nasibli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initially reviewed

Comment on lines 202 to 228
@wrap_output_data
def score(self, X_test, y=None):
xp, _ = get_namespace(X_test)

location = self.location_
if sklearn_check_version("1.0"):
X = self._validate_data(
X_test,
dtype=[np.float64, np.float32],
reset=False,
copy=self.copy,
)
else:
X = check_array(
X_test,
dtype=[np.float64, np.float32],
copy=self.copy,
)

if "numpy" not in xp.__name__:
location = xp.asarray(location, device=X_test.device)
if isinstance(X, np.ndarray):
X = X_test

est = clone(self)
est.set_params(**{"assume_centered": True})

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Score it wrapped out via wrap_output_data decorator. Here only numpy is used, are this non-numpy checks and branches are still required? I recommend remove them since not used.

Copy link
Contributor Author

@icfaust icfaust Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of this PR is to verify all estimators with "X" or "y" will properly evaluate using the various dataframes and queues in all of our various tests (datatypes and stability currently, possibly memory leaks later). Score is implemented in the way it is in order to guarantee dpnp/dpctl tensor conformance.

The non-numpy branch is necessary because of the actions of _validate_data, which may return a numpy array, it may use array_api depending on the config of array_api_dispatch, as well as issues with dpnp not supporting array_api. This is very much dependent on the sklearn version as well.

Comment on lines +202 to +203
@wrap_output_data
def score(self, X_test, y=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, I would like to see this methods covered with tests with dpnp/dpctl inputs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have now added a test which validates numerical conformance. However, the purpose of this PR was to extend checking with dpnp/dpctl to everything possible.

Comment on lines 331 to +333
def mahalanobis(self, X):
if sklearn_check_version("1.0"):
self._validate_data(X, reset=False, copy=self.copy)
else:
check_array(X, copy=self.copy)
self._check_feature_names(X, reset=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it properly works with sycl usm ndarray inputs? Could you please point out tests where it called with dpnp/dpctl inputs?

Copy link
Contributor Author

@icfaust icfaust Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the case throughout the LogisticRegression implementation:
It has been included before the dispatch, meaning it operates on the sycl usm ndarrays when provided.

https://github.com/intel/scikit-learn-intelex/blob/main/sklearnex/linear_model/logistic_regression.py#L111
Testing of fit in LogisticRegression:
https://github.com/intel/scikit-learn-intelex/blob/main/sklearnex/linear_model/tests/test_logreg.py#L50

_check_feature_names is in sklearn's BaseEstimator _validate_data: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/base.py#L654 It purposefully works with the data which hasn't been touched by check_array, meaning it is working on the raw input.

The purpose of this method is to validate aspects of pandas, and pandas-like dataframes, which could have names to features: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/base.py#L479 which the most important part is https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/validation.py#L2226

sklearnex/ensemble/_forest.py Show resolved Hide resolved
# construct CSR matrix representation of the k-NN graph
if mode == "connectivity":
A_ind = self.kneighbors(X, n_neighbors, return_distance=False)
xp, _ = get_namespace(A_ind)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of Array API style coding, I think it make sense early call at once get namespace for the input.
Wouldn't be it better to move on line 272 get namespace for input X.

Copy link
Contributor Author

@icfaust icfaust Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is difficult here is that X is a kwarg with a default value of None. In that case, it would depend on array type of the stored values in onedal_estimator: https://github.com/intel/scikit-learn-intelex/blob/main/onedal/neighbors/neighbors.py#L298 It is currently numpy arrays, but that's not guaranteed going forward. The most dependable way is to check the output of kneighbors, hence why its checked inside the if statement. I don't like it either, but its the only guaranteed safe way I could think of.

@icfaust
Copy link
Contributor Author

icfaust commented Jun 26, 2024

/intelci: run

@@ -210,6 +210,24 @@ def fit_transform(self, X, y=None):
# Scikit-learn PCA["covariance_eigh"] was fit
return self._transform(X_fit, xp, x_is_centered=x_is_centered)

@wrap_output_data
def inverse_transform(X):
xp, _ = get_namespace(X)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weakness in input checking matches issues in sklearn, currently a PR to fix it is in: scikit-learn/scikit-learn#29310

@icfaust
Copy link
Contributor Author

icfaust commented Jun 27, 2024

/intelci: run

@icfaust
Copy link
Contributor Author

icfaust commented Jun 27, 2024

@icfaust
Copy link
Contributor Author

icfaust commented Jun 27, 2024

/intelci: run

@icfaust
Copy link
Contributor Author

icfaust commented Jun 27, 2024

Copy link
Contributor

@Vika-F Vika-F left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for improving dpnp/dpctl support and increasing the product's code coverage.
See a couple of my comments below.

Comment on lines +225 to +229
if "numpy" not in xp.__name__:
components = xp.asarray(components, device=X.device)
mean = xp.asarray(mean, device=X.device)

return X @ components + mean
Copy link
Contributor

@Vika-F Vika-F Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment why the check and asarray conversion are required here.

n_nonzero = n_queries * n_neighbors
A_indptr = xp.arange(0, n_nonzero + 1, n_neighbors)

kneighbors_graph = sp.csr_matrix(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to use sp.csr_array if possible.
Because scipy is in the process of switching to array interface. See the note here: https://docs.scipy.org/doc/scipy/reference/sparse.html

Comment on lines +76 to +85
attr = getattr(est, method)
if method == "inverse_transform":
# PCA's inverse_transform takes (n_samples, n_components)
data = (
(X[:, : est.n_components_],) if X.shape[1] != est.n_components_ else (X,)
)
elif method not in ["score", "partial_fit", "path"]:
data = (X,)
else:
res = est.score(X, y)
data = (X, y)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is duplicated here and in sklearnex/tests/test_patching.py.
Is it possible to have it in some common place?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants