Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: pandas.core.ops.dispatch_to_extension_op fails with UnboundLocalError #22414

Closed
xhochy opened this issue Aug 19, 2018 · 10 comments
Closed
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays.

Comments

@xhochy
Copy link
Contributor

xhochy commented Aug 19, 2018

Code Sample, a copy-pastable example if possible

This bug requires an ExtensionArray that is not backed with a numpy array to occur.

import pandas as pd
import fletcher as fr

s = pd.Series(fr.FletcherArray(TEST_ARRAY))
assert (s.T == s).all()

Problem description

Looking at the code the value of new_right is only set when the the values of the ExtensionArray are a np.ndarray. With fletcher, this is not the case.

pandas/pandas/core/ops.py

Lines 1142 to 1156 in 8bb2cc1

if is_extension_array_dtype(left):
new_left = left.values
if isinstance(right, np.ndarray):
# handle numpy scalars, this is a PITA
# TODO(jreback)
new_right = lib.item_from_zerodim(right)
if is_scalar(new_right):
new_right = [new_right]
new_right = list(new_right)
elif is_extension_array_dtype(right) and type(left) != type(right):
new_right = list(new_right)
else:
new_right = right

Error message

>       assert (s.T == s).all()

tests/test_pandas_integration.py:139:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../pandas/pandas/core/ops.py:1433: in wrapper
    return dispatch_to_extension_op(op, self, other)
../pandas/pandas/core/ops.py:1163: in dispatch_to_extension_op
    res_values = op(new_left, new_right)
../pandas/pandas/core/ops.py:1433: in wrapper
    return dispatch_to_extension_op(op, self, other)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

op = <built-in function eq>, left = 0      Test
1    string
2      None
dtype: fletcher[string], right = <fletcher.base.FletcherArray object at 0x112ce7518>

    def dispatch_to_extension_op(op, left, right):
        """
        Assume that left or right is a Series backed by an ExtensionArray,
        apply the operator defined by op.
        """

        # The op calls will raise TypeError if the op is not defined
        # on the ExtensionArray
        # TODO(jreback)
        # we need to listify to avoid ndarray, or non-same-type extension array
        # dispatching

        if is_extension_array_dtype(left):

            new_left = left.values
            if isinstance(right, np.ndarray):

                # handle numpy scalars, this is a PITA
                # TODO(jreback)
                new_right = lib.item_from_zerodim(right)
                if is_scalar(new_right):
                    new_right = [new_right]
                new_right = list(new_right)
            elif is_extension_array_dtype(right) and type(left) != type(right):
>               new_right = list(new_right)
E               UnboundLocalError: local variable 'new_right' referenced before assignment

../pandas/pandas/core/ops.py:1154: UnboundLocalError
@gfyoung gfyoung added the ExtensionArray Extending pandas with custom dtypes or arrays. label Aug 19, 2018
@gfyoung
Copy link
Member

gfyoung commented Aug 19, 2018

@xhochy : Thanks for reporting this! Two things:

  • What error message do you get?
  • What do you expect the output to be?

cc @TomAugspurger

@xhochy
Copy link
Contributor Author

xhochy commented Aug 19, 2018

Added the exception including the traceback to the initial post.

My expected output would be that the __eq__ operation is called on one of the underlying ExtensionArrays.

@gfyoung gfyoung added the Bug label Aug 19, 2018
@gfyoung
Copy link
Member

gfyoung commented Aug 19, 2018

@xhochy : How to best to incorporate support for non-ndarray is TBD, but the UnboundedLocalError is an outright bug. new_right is not defined before we hit that condition! 😮

@xhochy
Copy link
Contributor Author

xhochy commented Aug 19, 2018

How to best to incorporate support for non-ndarray is TBD

Yes, this is a thing I need to write an example ExtensionArray for in the Pandas tests. I've been reporting these bugs but it is always hard to make reproducible tests. I hope to find enough time in a single block soon to write such an example ExtensionArray. Then it will be simpler to fix them with a reproducing test.

@TomAugspurger
Copy link
Contributor

IIRC, I have a fix for this in my big Sparse PR. I can split it off.

@jorisvandenbossche
Copy link
Member

but it is always hard to make reproducible tests

Basically we need a test ExtensionArray that does not use an ndarray as its backing values?
That should not be too hard, but what I don't fully understand is how this matters. Because pandas should in principle be unaware of how the values are stored (it does not know how to access the underlying values of an EA, because there is no consistent attribute for that).
(or is the fact that it matters now the bug?)

@xhochy
Copy link
Contributor Author

xhochy commented Aug 19, 2018

Because pandas should in principle be unaware of how the values are stored

Pandas looks quite often in its internal code into pd.Series.values. Sometimes it checks whether this is an ndarray and does different actions depending on that and in other cases, it is simply assumed values is an ndarray-like. While in principle this should not be done, this is done quite often and a source of bugs I expect to run into quite often. I don't expect that fletcher will be usable anytime in the next 1-3 months but my main motivation to work on it currently is to exactly discover these things in the pandas code base. I hope that helps and I hope that I can contribute some unit tests and/or fixes in that direction soon.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Aug 19, 2018

Yes, but what you say is the distinction between ndarray and EA for Series.values (and this is, for now, unavoidable I think as long as not all values are stored with EA-like containers. And for sure there will still be bugs caused by this, that we need to fix).
But I thought you meant above the distinction between EA backed by ndarray and EA backed by other array. Because for simply the case of Series.values being an EA, I don't understand why it would be hard to write a test case, there should be something special about the EA you have that is not reflected by the EA test cases we have.

but my main motivation to work on it currently is to exactly discover these things in the pandas code base. I hope that helps

And for sure it does!

@TomAugspurger
Copy link
Contributor

This was fixed by #22479 (and perf will be improved by #23155)

@jorisvandenbossche
Copy link
Member

@xhochy BTW, in the meantime, there is a non-ndarray backed example ExtensionArray (based on a pyarrow array) that is used for testing (since this was a problem for you to write a test).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

No branches or pull requests

4 participants