
Shall we create a QuantityArray class? #1128

Open
hgrecco opened this issue Jul 1, 2020 · 15 comments
@hgrecco
Owner

hgrecco commented Jul 1, 2020

I have just released Pint-Pandas 0.1 So first, let me publicly thank @andrewgsavage, @znicholls and @sigvaldm for the awesome work you have done to code, discuss and push this package. Congratulations!

I have also released Pint 0.14 to support a few last minute changes in Pint-Pandas.

Pint-Pandas has exposed (one more time) an old discussion: do we need QuantityScalar and QuantityArray classes? This was on the table at the beginning of Pint. The conclusion was to avoid it for simplicity, and we found a very nice way to do so without sacrificing functionality. From very early releases we were able to support a large number of NumPy functions without requiring NumPy, and to have a nice way to write array quantities ([1, 2, 3] * ureg.meter). We also made the choice to support, but not require, NumPy. I still think this was the right call and we should try to stay on that path.

Pint-Pandas has exposed some cases in which having a QuantityArray seems unavoidable to fully support Pandas functionality with Pint. The question is ... shall we do it? If yes, how do we do it in a way that is transparent to the normal user?

PS.- Thanks again to the people behind Pint-Pandas.

@znicholls
Contributor

znicholls commented Jul 2, 2020

I read through #875, #764 and #753 yesterday and found them super helpful as background resources. Unfortunately, they made things seem very complicated and I couldn't see an obvious way forward.

The big conflict I saw was that, on the one hand, numpy's separation of scalar and array classes is seen as a 'design mistake', whereas on the other hand, not having the separation is causing headaches with Pandas (particularly with how Pandas decides whether something is iterable or not).

At risk of poking the hornet's nest, perhaps the simplest solution is to enquire in Pandas about whether https://github.com/pandas-dev/pandas/blob/4743cc91719146167eefb43865413fc0b0aef1d4/pandas/_libs/lib.pyx#L992 could be updated so that it checks ndim of any classes which have such an attribute, rather than only checking ndim of numpy arrays? (Only checking if the object is actually a numpy array seems an odd choice to me which restricts the ability of Pandas to be extended...)
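To make the suggestion concrete, here is a minimal sketch of what such a relaxed check might look like in plain Python (this is illustrative only, not actual pandas code, and `is_list_like_relaxed` is a hypothetical name):

```python
from collections import abc

# Hypothetical, simplified version of pandas' is_list_like that checks
# ndim on any object exposing that attribute, not only on numpy arrays.
def is_list_like_relaxed(obj, allow_sets=True):
    return (
        isinstance(obj, abc.Iterable)
        # strings/bytes are not counted as list-like
        and not isinstance(obj, (str, bytes))
        # any object reporting ndim == 0 is treated as a scalar
        and not (getattr(obj, "ndim", None) == 0)
        # exclude sets if allow_sets is False
        and not (allow_sets is False and isinstance(obj, abc.Set))
    )
```

With this version, any duck-typed array (a pint Quantity, for instance) could opt out of being list-like simply by exposing `ndim == 0`, with no special-casing on `np.ndarray`.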

@hgrecco
Owner Author

hgrecco commented Jul 3, 2020

@znicholls Those discussions are indeed useful. I am pinging the people that were involved. @jthielen @crusaderky @shoyer @keewis

@hgrecco
Owner Author

hgrecco commented Jul 3, 2020

Please feel free to invite more. Anyone with contacts in the pandas community?

@crusaderky
Contributor

As discussed before, I am against splitting the class in two. I simply see no benefit.
Could you explain these cases with pandas that you believe can't fit in the current design?

@znicholls
Contributor

znicholls commented Jul 5, 2020

@andrewgsavage has done much more work on this than me, but maybe these thoughts are helpful nonetheless...

The big problem is with how pandas decides whether something is iterable or not. As @hgrecco shows here, pandas determines whether something is_list_like here (code copied below).

cdef inline bint c_is_list_like(object obj, bint allow_sets) except -1:
    return (
        isinstance(obj, abc.Iterable)
        # we do not count strings/unicode/bytes as list-like
        and not isinstance(obj, (str, bytes))
        # exclude zero-dimensional numpy arrays, effectively scalars
        and not (util.is_array(obj) and obj.ndim == 0)
        # exclude sets if allow_sets is False
        and not (allow_sets is False and isinstance(obj, abc.Set))
    )

At the moment, all Quantity instances have an __iter__ method. It is designed to raise a TypeError if the quantity isn't actually iterable (code here).

The problem is that the __iter__ method is defined regardless of whether the Quantity instance is actually iterable. This means that, in all cases, isinstance(quantity_instance, abc.Iterable) returns True (even if quantity_instance isn't actually iterable and would raise a TypeError as soon as one tried to iterate over it).

A numpy array suffers from the same issue, which is why the not (util.is_array(obj) and obj.ndim == 0) check in pandas' is_list_like is important: it excludes zero-dimensional arrays, which are effectively scalars. The problem is that util.is_array(obj) returns False if obj is anything other than np.ndarray. As a result, we can't just add an ndim attribute to Quantity, because the obj.ndim == 0 check will never be evaluated.
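To see the mismatch concretely, here is a small stand-in mimicking the behaviour described above (FakeScalarQuantity is a made-up class, not pint's actual implementation):

```python
from collections import abc

# Made-up stand-in for a scalar pint Quantity: __iter__ exists on the
# class but raises, mirroring the design described above.
class FakeScalarQuantity:
    def __init__(self, magnitude):
        self._magnitude = magnitude

    def __iter__(self):
        # Scalar magnitude: iterating over it makes no sense.
        raise TypeError("'FakeScalarQuantity' object is not iterable")

q = FakeScalarQuantity(1.0)

# The abc check only looks for the presence of __iter__ on the class:
print(isinstance(q, abc.Iterable))  # True

# ...but actually iterating raises:
try:
    list(q)
except TypeError as exc:
    print(exc)
```

This is exactly the combination that trips up pandas: the isinstance check says "iterable", while iteration itself fails.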

So, the options seem to be:

  1. Split Quantity into scalar and array so that scalars don't have an __iter__ method and Pandas correctly recognises them as not being list-like.
  2. See if the Pandas API can be updated in some way so that zero-dimensional, non-numpy arrays are also correctly recognised as not being iterable.

I'm happy to open an issue in Pandas if people think that's a good option (it certainly looks the easier path to me).

@shoyer

shoyer commented Jul 5, 2020 via email

@znicholls
Contributor

Thanks @shoyer. Whilst others haven't commented yet, for reference I think the most relevant pandas issue is pandas-dev/pandas#24702, but it's not the same issue as the one we're having here.

@znicholls
Contributor

I had a quick play to see what it would take to fix this. It looks like a simple change, but more experienced eyes would be very welcome! See draft PR here pandas-dev/pandas#35127

@hgrecco
Owner Author

hgrecco commented Jul 5, 2020

I just want to add that magic methods are always looked up on the class (not on the instance), and therefore it is not possible to add or remove __iter__ depending on the type of _magnitude.
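A quick illustration of that language rule (generic Python, unrelated to pint's internals):

```python
class C:
    pass

c = C()
# Attach __iter__ to the instance only:
c.__iter__ = lambda: iter([1, 2, 3])

# iter() performs the lookup on type(c), not on the instance,
# so the instance attribute is ignored and iteration fails:
try:
    iter(c)
    print("iterable")
except TypeError:
    print("not iterable")  # this branch runs
```

So even if pint set or deleted __iter__ per instance based on the magnitude's type, Python would never see it; only the class attribute counts.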

@andrewgsavage
Collaborator

> I'm happy to open an issue in Pandas if people think that's a good option (it certainly looks the easier path to me).

Ya that seems the easiest path and is preferable to splitting classes up, go for it!

@znicholls
Contributor

Ok done in pandas-dev/pandas#35131 and pandas-dev/pandas#35127. I'm not sure if we need to ping anyone to get those moving?

@znicholls
Contributor

Looks like they're moving. @andrewgsavage we got a question about places where is_list_like is used if you have the time/energy to answer (I'm not familiar enough with the code base to easily know where they are).

@crusaderky
Contributor

We should remain aligned with the decision of the core numpy developers, who chose to use the same class for scalar and vectorized ndarrays.

Pandas should be changed to use numpy.iterable:

>>> from collections import abc
>>> import numpy

>>> class C:
...     def __iter__(self):
...         # Match behaviour of numpy.ndarray.__iter__
...         raise TypeError("iteration over scalar object")

>>> isinstance(C(), abc.Iterable)
True
>>> numpy.iterable(C())
False
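numpy.iterable is documented to work by simply attempting iter() and treating a TypeError as "not iterable"; a dependency-free sketch of the same check (the function name here is made up):

```python
# Equivalent of numpy.iterable without the numpy dependency:
# attempt iter() and treat TypeError as "not iterable".
def iterable_like_numpy(obj):
    try:
        iter(obj)
    except TypeError:
        return False
    return True

class Scalarish:
    def __iter__(self):
        # Match behaviour of numpy.ndarray.__iter__ on 0-d arrays
        raise TypeError("iteration over scalar object")

print(iterable_like_numpy([1, 2, 3]))   # True
print(iterable_like_numpy(Scalarish())) # False
```

Because the check actually attempts iteration rather than inspecting the class, a scalar Quantity whose __iter__ raises would be classified correctly.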

@znicholls
Contributor

znicholls commented Jul 7, 2020

Having fiddled around with pandas-dev/pandas#35127 a bit more, and tried using that branch with pint-pandas, it's clear that we'd need to update the behaviour of pandas' is_list_like, is_scalar and item_from_zerodim to get things working. I'm not really sure how everything is meant to behave, but as I see it the options are:

  1. update pandas' is_list_like, is_scalar and item_from_zerodim so that pint quantities would give the right behaviour (pros: don't have to change pint, cons: maybe the required updates would hurt pandas' performance too much given how much they rely on the C implementations of numpy? I don't know enough about these functions to comment on whether the changes would actually hurt performance or not)
  2. identify the cases in pandas where less-strict versions of is_list_like, is_scalar and item_from_zerodim would be required for pint-pandas to pass all the tests (pros: don't have to change pint, shouldn't hurt pandas performance too much, cons: requires some effort/dedication to make it happen if pandas devs aren't keen)
  3. update pint or pint-pandas so the quantities used in our pandas interface are sub-classes of numpy arrays (pros: don't have to update pandas, cons: I have no idea if this is even possible or desirable)

@burnpanck

burnpanck commented Nov 25, 2021

While there seem to be some arguments against a split Quantity, there might be one in favour of it, related to __array__ and its warnings. IMHO, for a units package, stripping units inside __array__ is the worst possible choice, even when it emits a warning. The preferred behaviour, however, differs between the proposed QuantityScalar and QuantityArray: the former should not have __array__ at all; the latter should raise.

Why shouldn't we have __array__ on QuantityScalar? Because it prevents the reliable creation of dtype=object arrays of scalar quantities. Consider the following four cases, assuming a scalar quantity q = QuantityScalar(1.):

  1. np.asarray(q): We expect array(QuantityScalar(1.), dtype=object). If __array__ raises, then the call fails, if __array__ succeeds, then we get array(1., dtype=float).
  2. np.asarray([q]): We expect array([QuantityScalar(1.)], dtype=object). If __array__ raises, then the call fails, if __array__ succeeds, then numpy proceeds to call float(q), leading to array([1.], dtype=float) if float(q) succeeds too.
  3. np.asarray(q,dtype=object): We expect array(QuantityScalar(1.), dtype=object). If __array__ raises, then the call fails, if __array__ succeeds, then we get array(1., dtype=object).
  4. np.asarray([q],dtype=object): We expect array([QuantityScalar(1.)], dtype=object). If __array__ raises, then the call fails, if __array__ succeeds, then we do actually get the expected output!

Note in particular that a failure in __array__ causes np.asarray to fail, even though the result is in fact discarded (except for case 1). I presume that numpy needs to check the dimensionality of the result to determine the shape of the resulting array. Because of this behaviour, turning UnitStrippedWarning into errors will break all code that relies on object arrays of quantity scalars, something that is exceedingly common within (pint-)pandas.

To clarify, if QuantityScalar has no __array__ we get the expected output in all cases. For QuantityArray, raising is probably the correct behaviour.

Here is a bit of code to verify those findings above:

import numpy as np


class MockArray:
    def __init__(self, values):
        self._values = values

    def __repr__(self):
        return f"{type(self).__name__}({self._values})"


class MockArrayStrip(MockArray):
    # Strips the wrapper and returns the bare values.
    def __array__(self, t=None):
        return np.asarray(self._values)


class MockArrayRaise(MockArray):
    # Refuses conversion outright.
    def __array__(self, t=None):
        raise TypeError("forbidden")


for mock in [
    MockArray(1.),
    MockArrayStrip(1.),
    MockArrayRaise(1.),
]:
    for orig in [mock, [mock]]:
        for dt in [None, "float", "object"]:
            try:
                out = np.asarray(orig, dtype=dt)
            except Exception as ex:
                out = str(ex)
            print(f"np.asarray({orig!r:21}, dtype={dt!r:8}) -> {out!r}")
