Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Series.round behavior with ExtensionArrays #26730

Closed
ghost opened this issue Jun 8, 2019 · 17 comments
Closed

Series.round behavior with ExtensionArrays #26730

ghost opened this issue Jun 8, 2019 · 17 comments
Labels
ExtensionArray Extending pandas with custom dtypes or arrays.

Comments

@ghost
Copy link

ghost commented Jun 8, 2019

@TomAugspurger, in your blog post Moral Philosophy for pandas or: What is .values? you wrote

In pandas 0.24, we'll (hopefully) have a good answer for what series.values is: a NumPy array or an ExtensionArray. For regular data types represented by NumPy, you'll get an ndarray. For extension types (implemented in pandas or elsewhere) you'll get an ExtensionArray. If you're using Series.values, you can rely on the set of methods common to each.

I just ran into an issue because this doesn't happen in some places.

pint is a popular package for working with units, useful for engineering/sci calculations. Recently they introduced an EA called pint-pandas for better interaction with pandas,
it's still a little rough, but very promising.

Here's an failure example:

In [2]: from pandas import Series 
   ...: import pint
   ...: from chemtools import Ch, UREG as ureg
   ...: from pint import UnitRegistry
   ...: 
   ...: ureg=UnitRegistry()
   ...: grams=ureg.g
   ...: 
   ...: s=Series((1.5*grams,2.5*grams),dtype='pint[g]')
   ...: s.round()
<...>
AttributeError: 'numpy.ndarray' object has no attribute 'rint'

The problem is that Series.round() calls com.values_from_object(self) instead of the EA-aware self.values.

pandas/pandas/core/series.py

Lines 2054 to 2085 in cf25c5c

def round(self, decimals=0, *args, **kwargs):
"""
Round each value in a Series to the given number of decimals.
Parameters
----------
decimals : int
Number of decimal places to round to (default: 0).
If decimals is negative, it specifies the number of
positions to the left of the decimal point.
Returns
-------
Series
Rounded values of the Series.
See Also
--------
numpy.around : Round values of an np.array.
DataFrame.round : Round values of a DataFrame.
Examples
--------
>>> s = pd.Series([0.1, 1.3, 2.7])
>>> s.round()
0 0.0
1 1.0
2 3.0
dtype: float64
"""
nv.validate_round(args, kwargs)
result = com.values_from_object(self).round(decimals)

com.values_from_object(self) percolates down to ExtensionBlock.to_dense():

def to_dense(self):
return np.asarray(self.values)

Which calls np.asarray(self.values), this convert the underlying PintArray in to a numpy array with object dtype, which is not numeric, and so numpy complains that rint isn't implemented. If Series.round() would invoked round() on self.values directly, the underlying EA implementation would get a chance to implement a suitable method, or delegate to numpy in an appropriate way.

You may want to make life easier for EA developers, by providing a default implementation for things like round(), I suppose it may work in some cases, but then there still needs to be a way for an EA to provide its own implementation of functions like round. Many Series methods use com.values_from_object(self), and its the the same problem everywhere.

cc @hgrecco (pint developer)

@TomAugspurger
Copy link
Contributor

I think NumPy's new __array_interface__ provides an elegant solution here. If an EA implements __array_interface__ then Series.round just needs to call np.round(self.array) and things will be fine.

In the meantime, I'm not sure what's best. I'm not sure that adding an ExtensionArray.round method is warranted.

@TomAugspurger
Copy link
Contributor

Sorry, I miswrote earlier. I said __array_interface__ when it should have been __array_function__: https://www.numpy.org/neps/nep-0018-array-function-protocol.html

In that case, Series.round simply becomes return np.round(self.array). If the backing array defines __array_function__ then they get to take control, otherwise NumPy's implementation (including coercing to an ndarray) takes over.

@ghost
Copy link
Author

ghost commented Jun 11, 2019

yeah, that looks like the right solution. Still requires changes on pandas side though, no?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jun 11, 2019 via email

@ghost
Copy link
Author

ghost commented Jun 11, 2019

I just coded it up, turned on experimental support in numpy 1.16.4 and it works (:balloon:).
s/com.values_from_object/self.values in Series.round doesn't seem to break any tests.

However, com.values_from_object basically just calls self.get_values() and the docstring says

Docstring: `get_values` is the same as values (but handles sparseness conversions); is a view.

and yet:

In [25]: s.get_values()
Out[25]: array([<Quantity(2.0, 'gram')>, <Quantity(3.0, 'gram')>], dtype=object)

In [26]: s.values
Out[26]: PintArray([2.0 gram, 3.0 gram], dtype='pint[gram]')

so maybe the get_values impl is out of date after the EA changes and needs to be fixed instead of changing the calling methods?

@TomAugspurger
Copy link
Contributor

so maybe the get_values impl is out of date after the EA changes and needs to be fixed instead of changing the calling methods?

Indeed, @jorisvandenbossche does your recent work on get_values affect Series.round?

@jorisvandenbossche
Copy link
Member

I think it is the docstring of get_values that is out of date, as it is assumed to always return some numpy array representation of the values.

does your recent work on get_values affect Series.round?

No, as that is using com.values_from_object. My work was to let that behave the same as it does today (it is used in too many places), but to refactor its implementation to no longer rely on get_values.

I agree the long term solution is that we want to use the numpy function protocol. In principle, I think we could already switch to np.round(self.array) for EAs now, to enable people to experiment with this.

@jorisvandenbossche jorisvandenbossche added the ExtensionArray Extending pandas with custom dtypes or arrays. label Jun 12, 2019
@jorisvandenbossche
Copy link
Member

The work I have been doing that Tom referenced is about getting rid of (first deprecating) get_values, exactly to avoid such confusion .. ;)
See #26409

com.values_from_object is used internally where we need numpy arrays.

get_values is a confusing mix of behaviour (and also different depending on the calling object (eg Series vs Block)), and moreover unneeded in the public API, so therefore I was trying to deprecate that. A main usage is currently in com.values_from_object, so part of the referenced PR is updating com.values_from_object to no longer rely on get_values.

get_values was not updated to handle the new EA, so now they are not compatible, and that's why this bug occurs.

The reason of the bug is not necessarily get_values not working with EAs (it is doing what it is expected to do, returning a numpy array), but that we simply never checked round with EAs, and so that was not updated to work with EAs.

@ghost
Copy link
Author

ghost commented Jun 12, 2019

That makes things clearer, thanks. So I understand get_values will eventually disappear.

The get_values docstring suggests it handles sparse arrays while values doesn't. Is that still the case?

@jorisvandenbossche
Copy link
Member

The get_values docstring suggests it handles sparse arrays while values doesn't. Is that still the case?

Yes, .values will return the SparseArray, while get_values returns the densified numpy array.

Similarly, for most EAs, .values will also return the EA and not a numpy array. But since this is not fully consistent, .values is no longer recommended but kept for backwards compatibility.

@ghost
Copy link
Author

ghost commented Jun 12, 2019

wait what. get_values is on its way out and .values is no longer recommended? what's left?

@jorisvandenbossche
Copy link
Member

See http://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.24.0.html#whatsnew-0240-values-api, which was kind of the follow-up on the blog post you mentioned in the first post.

Want an EA -> use .array. Want to get a numpy array -> use np.asarray(..) or .to_numpy()

@ghost
Copy link
Author

ghost commented Jun 12, 2019

I think we could already switch to np.round(self.array) for EAs now, to enable people to experiment with this.

ok. time for a PR?

@ghost
Copy link
Author

ghost commented Jun 12, 2019

aside, shoyer in numpy/numpy#13759 (comment) pointed out this section in NEP-18.

Quote:

With the current design, classes that implement __array_function__ to overload at least one function
implicitly declare an intent to implement the entire NumPy API. It’s not possible to implement only
np.concatenate() on a type, but fall back to NumPy’s default behavior of casting with np.asarray() 
for all other functions.

That's a lot of boilerplate burden on EA developers, which tom mentioned earlier.

@jorisvandenbossche jorisvandenbossche changed the title ExtensionArray: com.values_from_object considered harmful Support ExtensionArray in methods wrapping numpy functions (eg round) Jun 12, 2019
@jorisvandenbossche
Copy link
Member

But I suppose that you can still "fallback" inside __array_function__ to the numpy implementation by manually doing a np.asarray and calling the numpy function on that?

@TomAugspurger
Copy link
Contributor

If there's too much boilerplate for EA authors, we could perhaps provide a Mixin class implementing the fallback behavior.

@ghost ghost changed the title Support ExtensionArray in methods wrapping numpy functions (eg round) Series.round behavior with ExtensionArrays Jun 13, 2019
@ghost ghost closed this as completed Jul 8, 2019
@MichaelTiemannOSC
Copy link
Contributor

Ugh. I read through this thread 2x and followed a few of the links. Am I to understand that the solution is "don't use round"? Or is there a groovy idiomatic way to do so that I've simply missed?

cc @hgrecco (pint developer)

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants