-
Notifications
You must be signed in to change notification settings - Fork 171
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[testing] Expand sklearnex testing to all methods taking "X" or "y" as input #1865
base: main
Are you sure you want to change the base?
Conversation
/azp run CI |
Azure Pipelines successfully started running 1 pipeline(s). |
/intelci: run |
/intelci: run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
initially reviewed
@wrap_output_data | ||
def score(self, X_test, y=None): | ||
xp, _ = get_namespace(X_test) | ||
|
||
location = self.location_ | ||
if sklearn_check_version("1.0"): | ||
X = self._validate_data( | ||
X_test, | ||
dtype=[np.float64, np.float32], | ||
reset=False, | ||
copy=self.copy, | ||
) | ||
else: | ||
X = check_array( | ||
X_test, | ||
dtype=[np.float64, np.float32], | ||
copy=self.copy, | ||
) | ||
|
||
if "numpy" not in xp.__name__: | ||
location = xp.asarray(location, device=X_test.device) | ||
if isinstance(X, np.ndarray): | ||
X = X_test | ||
|
||
est = clone(self) | ||
est.set_params(**{"assume_centered": True}) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Score it wrapped out via wrap_output_data
decorator. Here only numpy is used, are this non-numpy checks and branches are still required? I recommend remove them since not used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose of this PR is to verify all estimators with "X" or "y" will properly evaluate using the various dataframes and queues in all of our various tests (datatypes and stability currently, possibly memory leaks later). Score is implemented in the way it is in order to guarantee dpnp/dpctl tensor conformance.
The non-numpy branch is necessary because of the actions of _validate_data, which may return a numpy array, it may use array_api depending on the config of array_api_dispatch, as well as issues with dpnp not supporting array_api. This is very much dependent on the sklearn version as well.
@wrap_output_data | ||
def score(self, X_test, y=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally, I would like to see this methods covered with tests with dpnp/dpctl inputs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have now added a test which validates numerical conformance. However, the purpose of this PR was to extend checking with dpnp/dpctl to everything possible.
def mahalanobis(self, X): | ||
if sklearn_check_version("1.0"): | ||
self._validate_data(X, reset=False, copy=self.copy) | ||
else: | ||
check_array(X, copy=self.copy) | ||
self._check_feature_names(X, reset=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it properly works with sycl usm ndarray inputs? Could you please point out tests where it called with dpnp/dpctl inputs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the case throughout the LogisticRegression implementation:
It has been included before the dispatch, meaning it operates on the sycl usm ndarrays when provided.
https://github.com/intel/scikit-learn-intelex/blob/main/sklearnex/linear_model/logistic_regression.py#L111
Testing of fit in LogisticRegression:
https://github.com/intel/scikit-learn-intelex/blob/main/sklearnex/linear_model/tests/test_logreg.py#L50
_check_feature_names is in sklearn's BaseEstimator _validate_data: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/base.py#L654 It purposefully works with the data which hasn't been touched by check_array, meaning it is working on the raw input.
The purpose of this method is to validate aspects of pandas, and pandas-like dataframes, which could have names to features: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/base.py#L479 which the most important part is https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/validation.py#L2226
# construct CSR matrix representation of the k-NN graph | ||
if mode == "connectivity": | ||
A_ind = self.kneighbors(X, n_neighbors, return_distance=False) | ||
xp, _ = get_namespace(A_ind) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of Array API style coding, I think it make sense early call at once get namespace for the input.
Wouldn't be it better to move on line 272 get namespace for input X.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is difficult here is that X is a kwarg with a default value of None. In that case, it would depend on array type of the stored values in onedal_estimator: https://github.com/intel/scikit-learn-intelex/blob/main/onedal/neighbors/neighbors.py#L298 It is currently numpy arrays, but that's not guaranteed going forward. The most dependable way is to check the output of kneighbors, hence why its checked inside the if statement. I don't like it either, but its the only guaranteed safe way I could think of.
/intelci: run |
@@ -210,6 +210,24 @@ def fit_transform(self, X, y=None): | |||
# Scikit-learn PCA["covariance_eigh"] was fit | |||
return self._transform(X_fit, xp, x_is_centered=x_is_centered) | |||
|
|||
@wrap_output_data | |||
def inverse_transform(X): | |||
xp, _ = get_namespace(X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weakness in input checking matches issues in sklearn, currently a PR to fix it is in: scikit-learn/scikit-learn#29310
/intelci: run |
/intelci: run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for improving dpnp/dpctl support and increasing the product's code coverage.
See a couple of my comments below.
if "numpy" not in xp.__name__: | ||
components = xp.asarray(components, device=X.device) | ||
mean = xp.asarray(mean, device=X.device) | ||
|
||
return X @ components + mean |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment why the check and asarray conversion are required here.
n_nonzero = n_queries * n_neighbors | ||
A_indptr = xp.arange(0, n_nonzero + 1, n_neighbors) | ||
|
||
kneighbors_graph = sp.csr_matrix( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to use sp.csr_array
if possible.
Because scipy is in the process of switching to array interface. See the note here: https://docs.scipy.org/doc/scipy/reference/sparse.html
attr = getattr(est, method) | ||
if method == "inverse_transform": | ||
# PCA's inverse_transform takes (n_samples, n_components) | ||
data = ( | ||
(X[:, : est.n_components_],) if X.shape[1] != est.n_components_ else (X,) | ||
) | ||
elif method not in ["score", "partial_fit", "path"]: | ||
data = (X,) | ||
else: | ||
res = est.score(X, y) | ||
data = (X, y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is duplicated here and in sklearnex/tests/test_patching.py
.
Is it possible to have it in some common place?
Description
All testing under the
sklearnex/tests
folder use centralized estimator and method lists which were previously limited to methods based off of the various inherited sklearn Mixins. This tested all 'core' functionality like 'fit', 'predict', 'transform', etc. which are directly patched out by oneDAL functions. Some next level methods (those which use these core functionality with a small number of additional calculations) likescore
were included. However, this list was incomplete, as some estimators (like PCA and KNeighbors) implement special additional methods likeinverse_transform
andkneighbors_graph
, which require special inputs and/or outputs. Up until now, these methods have not been tested using the centralized testing suite (and may have not been tested at all). This PR dynamically checks and collects all methods to all patched estimators which contain "X" or "y" in the input into testing, increasing flexibility and reducing maintenance.The additional code comes from having to fix issues related to taking in dpnp and dptcl inputs since they were previously untested.
Rather than using
_device_offload.dispatch
which is code heavy, cumbersome and with significant overhead, a limited use ofget_namespace
was added to these failing methods which provide the necessary pre- or post-processing of 'core' oneDAL methods.With these additions, dpnp and dpctl inputs are almost fully covered in the codebase. One deselection was made in
radius_neighbors
andradius_neighbors_graph
in the NearestNeighbors algorithm, as these would require significant work (i.e. the implementation of a new estimator) in order to properly support dpnp/dpctl inputs (unless we knowingly only allow CPU computation).The listed changes are as follows:
check_sample_weight
method to_check_sample_weight
to make it a private methodmahalanobis
andscore
methods to accept dpnp/dpctl inputs (score
was previously missed as this estimator has no Mixins)radius_neighbors
andradius_neighbors_graph
for non-numpy/pandas inputs, which are unsupported.kneighbors_graph
method ofNearestNeighbors
,KNeighborsRegressor
, andKNeighborsClassifier
to take in dpnp/dpctl inputs, and return a scipy.csr_matrix (removed wrap output data, etc.)score_samples
andscore
methods in stability testing for PCA (will add tickets to address these issues).NOTE: current private CI fails for PCA in the
inverse_transform
method in dpnp testing, this is due to an out-of-date version of dpnp, and is currently getting addressed. Private CI will be re-run as soon as it is updated.