
Batch predictions for nearest neighbors #293

Merged: 26 commits merged into onnx:master from the knn branch on Nov 8, 2019
Conversation

xadupre (Collaborator) commented Oct 18, 2019

No description provided.

@@ -5,498 +5,236 @@
# --------------------------------------------------------------------------

Contributor:

Why do we need to change the entire KNN converter implementation to support batch prediction?

Collaborator Author:

I wanted to reuse the cdist optimisation I made for the GaussianProcess. It was faster that way.
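For intuition, here is a minimal numpy sketch of the batched-distance idea (my own illustration, not the PR's actual ONNX graph): all query/training distances are computed in one broadcasted operation instead of looping over test rows, which is what makes batch prediction possible.

# Hypothetical illustration of batched pairwise distances, not the converter code.
import numpy as np

XA = np.random.rand(5, 3)   # queries (a batch of test samples)
XB = np.random.rand(7, 3)   # training samples
diff = XA[:, None, :] - XB[None, :, :]   # broadcast to shape (5, 7, 3)
sqdist = (diff ** 2).sum(axis=2)         # squared euclidean, shape (5, 7)
print(sqdist.shape)                      # one row of distances per query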


if training_labels.dtype == np.int32:
    training_labels = training_labels.astype(np.int64)
extracted = OnnxArrayFeatureExtractor(
Contributor:

Is OnnxArrayFeatureExtractor different from ArrayFeatureExtractor? The reason the KNN converter couldn't handle batch prediction was ArrayFeatureExtractor: it can only handle one homogeneous set of indices (Y is a tensor of int), which means you can't extract different values for different test examples (based on their nearest neighbours).

Collaborator Author:

It's the same operator. It is used to extract the neighbours' labels. It is equivalent to GatherElements when the target dimension is one, but it is still needed when the target dimension is more than one. The previous version had more than one ArrayFeatureExtractor.
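For reference, a small numpy sketch of the label extraction being discussed (my own illustration; the variable names and values are made up): with a 1-D label vector, picking the labels of the top-k neighbour indices is plain fancy indexing.

# Illustration only; training_labels and topk_indices are hypothetical values.
import numpy as np

training_labels = np.array([0, 1, 0, 2, 1])   # one label per training sample
topk_indices = np.array([[0, 3],              # k=2 neighbour indices
                         [4, 1]])             # for two test samples

neighbour_labels = training_labels[topk_indices]   # shape (n_queries, k)
print(neighbour_labels)   # [[0 2]
                          #  [1 1]]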

@@ -150,3 +153,33 @@ def _onnx_cdist_minkowski(X, Y, dtype=None, op_version=None, p=2, **kwargs):
                            op_version=op_version)
    return OnnxTranspose(node[1], perm=[1, 0], op_version=op_version,
                         **kwargs)


def _onnx_cdist_manhattan(X, Y, dtype=None, op_version=None, **kwargs):
Contributor:

Do we need to implement such a function for every metric we want to support?

Collaborator Author:

That seemed easy to do. I refactored to avoid duplicated code.
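As a sanity check of why the per-metric helpers are mostly specialisations of the same pattern (a sketch relying on scipy's documented behaviour, with arbitrary values): Manhattan distance is just the p=1 case of Minkowski.

# Sketch using scipy's documented metrics; values chosen arbitrarily.
import numpy as np
from scipy.spatial.distance import cdist

XA = np.array([[0.0, 0.0], [1.0, 1.0]])
XB = np.array([[1.0, 0.0]])

print(cdist(XA, XB, metric='cityblock'))        # Manhattan: [[1.], [1.]]
print(cdist(XA, XB, metric='minkowski', p=1))   # identical result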


def _onnx_cdist_manhattan(X, Y, dtype=None, op_version=None, **kwargs):
    """
    Returns the ONNX graph which computes the Minkowski distance
Contributor:

Minkowski distance?

Collaborator Author:

Manhattan. Sorry, wrong copy paste.

from ..common.data_types import Int64TensorType
from ..algebra.onnx_ops import (
    OnnxTopK, OnnxMul, OnnxArrayFeatureExtractor,
Contributor:

It's easier to locate these if sorted, given there are so many.

Collaborator Author:

Done.

def _get_weights(scope, container, topk_values_name, distance_power):

    Retrieves the nearest neigbours *ONNX*.
    :param X: features or *OnnxOperatorMixin*
    :param Y: neighbours or *OnnxOperatorMixin*
Contributor:

Y can be confusing here. Can you name them appropriately instead of X and Y?

Collaborator Author:

Replaced by XA, XB.

opv = container.target_opset
dtype = container.dtype

if X.type.__class__ == Int64TensorType:
Contributor:

Can't we use isinstance()?

Collaborator Author:

Not all cdist variants return integers; it seems easier to cast now than after.

Contributor:

I meant: can't you write

    if isinstance(X.type, Int64TensorType)

instead of

    if X.type.__class__ == Int64TensorType

Collaborator Author:

Done.

k = op.n_neighbors
training_labels = op._y if hasattr(op, '_y') else None
distance_kwargs = {}
if metric == 'minkowski':
Contributor:

Aren't we handling this with cdist?

Collaborator Author:

I preferred to have a cdist function which follows the scipy implementation, and to let the converter choose a shorter path when it is more appropriate.

@@ -61,14 +61,14 @@ def _onnx_squareform_pdist_sqeuclidean(X, dtype=None, op_version=None,
return node[1]


-def onnx_cdist(X, Y, metric='sqeuclidean', dtype=None,
+def onnx_cdist(XA, XB, metric='sqeuclidean', dtype=None,
Contributor:

Any reason for having the names in capitals? pep8 recommends variable names be in lower case.

Collaborator Author:

I reused the same names scipy is using.
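For reference, scipy's own documentation names the arguments of cdist XA and XB, which this signature mirrors:

# scipy.spatial.distance.cdist(XA, XB, metric='euclidean', ...) per the scipy docs.
import numpy as np
from scipy.spatial.distance import cdist

XA = np.random.rand(3, 2)   # m_A observations in n dimensions
XB = np.random.rand(4, 2)   # m_B observations in n dimensions
assert cdist(XA, XB, metric='sqeuclidean').shape == (3, 4)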

lgtm-com bot commented Nov 5, 2019

This pull request introduces 4 alerts when merging 47f0afa into 4b769c5 - view on LGTM.com

new alerts:

  • 3 for Module is imported with 'import' and 'import from'
  • 1 for 'import *' may pollute namespace

prabhat00155 (Contributor):

I get an error with int labels in the regressor:

# Imports added for completeness (not in the original comment); the source of
# save_model is an assumption (onnx.save_model).
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import Int64TensorType
from onnx import save_model
from onnxruntime import InferenceSession

data = load_digits()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = KNeighborsRegressor().fit(X_train, y_train)

model_onnx = convert_sklearn(model, 'knn', [('input', Int64TensorType([None, X_test.shape[1]]))])
save_model(model_onnx, 'knn.onnx')
sess = InferenceSession('knn.onnx')


Fail                                      Traceback (most recent call last)
<ipython-input> in <module>
      7 model_onnx = convert_sklearn(model, 'knn', [('input', Int64TensorType([None, X_test.shape[1]]))])
      8 save_model(model_onnx, 'knn.onnx')
----> 9 sess = InferenceSession('knn.onnx')

~/Documents/MachineLearning/onnx_projects/tmp_env/lib/python3.6/site-packages/onnxruntime/capi/session.py in __init__(self, path_or_bytes, sess_options)
     21         self._path_or_bytes = path_or_bytes
     22         self._sess_options = sess_options
---> 23         self._load_model()
     24         self._enable_fallback = True
     25

~/Documents/MachineLearning/onnx_projects/tmp_env/lib/python3.6/site-packages/onnxruntime/capi/session.py in _load_model(self, providers)
     33
     34         if isinstance(self._path_or_bytes, str):
---> 35             self._sess.load_model(self._path_or_bytes, providers)
     36         elif isinstance(self._path_or_bytes, bytes):
     37             self._sess.read_bytes(self._path_or_bytes, providers)

Fail: [ONNXRuntimeError] : 1 : FAIL : Load model from knn.onnx failed: Type Error: Type (tensor(float)) of output arg (variable) of node (Re_ReduceMean) does not match expected type (tensor(int64)).
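One reading of that traceback (a sketch of my interpretation, not a confirmed diagnosis): KNeighborsRegressor predicts the mean of the k neighbour targets, so the graph ends in a ReduceMean whose output is a float tensor even when the training targets are int64, and the declared int64 output no longer matches.

# Minimal illustration of the dtype promotion; values are made up.
import numpy as np

neighbour_targets = np.array([[3, 4, 5]], dtype=np.int64)   # int64 labels
pred = neighbour_targets.mean(axis=1)   # mean promotes to float64
print(pred.dtype)                       # float64, not int64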

training_labels = training_labels.ravel()
axis = 1

if training_labels.dtype == np.int32:
Contributor:

Where do you handle string labels?

Collaborator Author:

Apparently I did not, which also means it is not covered by any unit test. I'll need to add more tests tomorrow.

Contributor:

Yeah, we don't have unit tests for that. Also, if you are updating the unit tests, can you make them use fit_classification_model() and fit_regression_model() from tests_helper.py?

Collaborator Author:

All fixed.

Collaborator Author:

The last failure comes from the neighbours themselves. When several neighbours are at the exact same distance, scikit-learn and ONNX don't necessarily select the same ones.

Contributor:

Okay.
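A minimal sketch of that tie problem (my illustration, not from the PR): when several training points sit at exactly the same distance, any k-subset of them is a valid answer, so scikit-learn's argpartition and an ONNX runtime's TopK may legitimately disagree.

# Four training points, all at distance 1 from the query: any 3 of them are
# correct 3-nearest-neighbours, so the selection is an implementation detail.
import numpy as np

query = np.array([0.0, 0.0])
train = np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], dtype=float)
dist = np.linalg.norm(train - query, axis=1)   # all equal to 1.0

k = 3
chosen = np.argpartition(dist, k - 1)[:k]   # scikit-learn-style selection
print(sorted(chosen))   # which 3 indices come back is not specified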

prabhat00155 (Contributor):

Also, I see mismatches with integer features in the KNN regressor:

# Imports added for completeness (not in the original comment); save_model source assumed to be onnx.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import Int64TensorType
from onnx import save_model
from onnxruntime import InferenceSession

X, y = make_regression(n_samples=1000, n_features=100, random_state=42)
X = X.astype(np.int64)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)
model = KNeighborsRegressor().fit(X_train, y_train)
model_onnx = convert_sklearn(model, 'knn', [('input', Int64TensorType([None, X_test.shape[1]]))])
save_model(model_onnx, 'knn.onnx')
sess = InferenceSession('knn.onnx')
res = sess.run(None, {'input': np.array(X_test)})
np.mean(np.isclose(res[0], model.predict(X_test)))

0.73
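A possible connection to the tie issue above (speculation on my part, not a confirmed root cause): casting standard-normal features to int64 collapses them to a few small values, so many training points end up at exactly the same integer distance from a query, and the tie-breaking difference described earlier turns into prediction mismatches.

# Speculative illustration: integer-cast features produce many exact distance ties.
import numpy as np
from sklearn.datasets import make_regression

X, _ = make_regression(n_samples=1000, n_features=100, random_state=42)
X = X.astype(np.int64)                   # values collapse to roughly -3..3
d = ((X[1:] - X[0]) ** 2).sum(axis=1)    # integer squared distances to X[0]
print(len(d), "distances,", len(np.unique(d)), "distinct values")  # many ties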

lgtm-com bot commented Nov 7, 2019

This pull request introduces 1 alert when merging 578e789 into dd9159c - view on LGTM.com

new alerts:

  • 1 for Unused import


if np.issubdtype(op.classes_.dtype, np.floating):
    classes = op.classes_.astype(np.int32)
elif np.issubdtype(op.classes_.dtype, np.signedinteger):
Contributor:

Both the elif and else branches are the same, so you can just remove the elif part.

Converts *KNeighborsRegressor* into *ONNX*.
The converted model may return different predictions depending
on how the runtime select the topk element.
*sciki-learn* uses function `argpartition
Contributor:

scikit

Contributor:

Okay.

        StrictVersion(onnxruntime.__version__) < StrictVersion("0.5.0"),
        reason="not available")
    def test_model_knn_classifier_multi_class_string(self):
        model, X = self._fit_model_multiclass_classification(
Contributor:

We may want to clean up this unit test later and use the utility functions defined in test_utils.py.

xadupre merged commit 7a68ad1 into onnx:master on Nov 8, 2019.
xadupre deleted the knn branch on November 14, 2019.