Issue using OneHotEncoder with drop="if_binary" #1017

Closed
dahlbaek opened this issue Aug 21, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@dahlbaek

dahlbaek commented Aug 21, 2023

We've hit what appears to be a bug that triggers when OneHotEncoder is used with drop="if_binary". It can be reproduced by adding the following test to https://github.com/onnx/sklearn-onnx/blob/main/tests/test_sklearn_one_hot_encoder_converter.py

    @unittest.skipIf(
        pv.Version(ort_version) <= pv.Version("0.4.0"), reason="issues with shapes"
    )
    @unittest.skipIf(
        not one_hot_encoder_supports_drop(),
        reason="OneHotEncoder does not support drop in scikit versions < 0.21",
    )
    def test_one_hot_encoder_drop_if_binary(self):
        data = [
            ["c0.4", "c0.2", 0],
            ["c1.4", "c1.2", 0],
            ["c0.2", "c2.2", 1],
            ["c0.2", "c2.2", 1],
            ["c0.2", "c2.2", 1],
            ["c0.2", "c2.2", 1],
        ]
        test = [["c0.2", "c2.2", 1]]
        model = OneHotEncoder(categories="auto", drop="if_binary")
        model.fit(data)
        inputs = [
            ("input1", StringTensorType([None, 2])),
            ("input2", Int64TensorType([None, 1])),
        ]
        model_onnx = convert_sklearn(
            model, "one-hot encoder", inputs, target_opset=TARGET_OPSET
        )
        self.assertTrue(model_onnx is not None)
        dump_data_and_model(
            test,
            model,
            model_onnx,
            verbose=False,
            basename="SklearnOneHotEncoderMixedStringIntDrop",
        )

Following the stack trace, one ends up here:

    indices_to_keep = np.delete(
        np.arange(len(categories)), ohe_op.drop_idx_[index]
    )

with ohe_op.drop_idx_ = [None, None, 0], so the call for the first feature becomes np.delete(np.arange(3), None).

@dahlbaek
Author

The test suite passes if one changes

    indices_to_keep = np.delete(
        np.arange(len(categories)), ohe_op.drop_idx_[index]
    )

to

    if ohe_op.drop_idx_[index] is not None:
        indices_to_keep = np.delete(
            np.arange(len(categories)), ohe_op.drop_idx_[index]
        )
    else:
        indices_to_keep = np.arange(len(categories))
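
For anyone who wants to see the effect of that guard in isolation, here is a minimal, dependency-free sketch. It uses plain Python lists instead of NumPy arrays, and the helper name `indices_to_keep_for` is made up for illustration; it is not part of sklearn-onnx. The `drop_idx` values are the ones reported above, and the category counts (3, 3, 2) are read off the test data.

```python
from typing import List, Optional


def indices_to_keep_for(n_categories: int, drop_index: Optional[int]) -> List[int]:
    """Return the category indices kept for one feature.

    Hypothetical helper: mirrors the guarded np.delete call with plain lists.
    """
    all_indices = list(range(n_categories))
    if drop_index is None:
        # No category is dropped for this feature
        # (e.g. a non-binary feature under drop="if_binary").
        return all_indices
    # Drop exactly one category, as np.delete does for a
    # non-negative integer index.
    return [i for i in all_indices if i != drop_index]


# drop_idx_ as reported in this issue: two 3-category features, one binary feature.
drop_idx = [None, None, 0]
n_categories_per_feature = [3, 3, 2]

kept = [
    indices_to_keep_for(n, d)
    for n, d in zip(n_categories_per_feature, drop_idx)
]
print(kept)  # [[0, 1, 2], [0, 1, 2], [1]]
```

Only the binary feature loses a category; the two string features keep all three, which is what the `else` branch in the patch is meant to produce.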

@xadupre
Collaborator

xadupre commented Oct 3, 2023

Closing this issue as the fix was merged into the main branch.

@xadupre xadupre closed this as completed Oct 3, 2023