BUG: Series.combine() fails with ExtensionArray inside of Series #21183

Merged (18 commits, Jun 8, 2018)
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.23.1.txt
@@ -47,6 +47,7 @@ Bug Fixes
~~~~~~~~~

- tab completion on :class:`Index` in IPython no longer outputs deprecation warnings (:issue:`21125`)
- :meth:`Series.combine()` no longer fails with :class:`~pandas.api.extensions.ExtensionArray` inside of :class:`Series` (:issue:`20825`)
Contributor: move to 0.24

Contributor Author: I've moved it to 0.24.


Groupby/Resample/Rolling
^^^^^^^^^^^^^^^^^^^^^^^^
11 changes: 11 additions & 0 deletions pandas/core/arrays/base.py
@@ -610,3 +610,14 @@ def _ndarray_values(self):
used for interacting with our indexers.
"""
return np.array(self)

# ------------------------------------------------------------------------
# Utilities for use by subclasses
# ------------------------------------------------------------------------
def is_sequence_of_dtype(self, seq):
"""
Given a sequence, determine whether all members have the appropriate
type for this instance of an ExtensionArray
"""
thistype = self.dtype.type
return all(isinstance(i, thistype) for i in seq)
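As a standalone illustration (outside pandas, with `Decimal` standing in for the ExtensionArray's scalar type), the helper's check amounts to the following sketch:

```python
from decimal import Decimal

def is_sequence_of_dtype(seq, scalar_type):
    # Mirror of the proposed helper: True only if every member of the
    # sequence is an instance of the array's scalar type.
    return all(isinstance(i, scalar_type) for i in seq)

print(is_sequence_of_dtype([Decimal("1.5"), Decimal("2")], Decimal))  # True
print(is_sequence_of_dtype([Decimal("1.5"), True], Decimal))          # False
```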
48 changes: 39 additions & 9 deletions pandas/core/series.py
@@ -2185,18 +2185,32 @@ def _binop(self, other, func, level=None, fill_value=None):

this_vals, other_vals = ops.fill_binop(this.values, other.values,
fill_value)

with np.errstate(all='ignore'):
result = func(this_vals, other_vals)
name = ops.get_op_result_name(self, other)

if is_extension_array_dtype(this) or is_extension_array_dtype(other):
try:
result = func(this_vals, other_vals)
except TypeError:
result = NotImplemented

if result is NotImplemented:
result = [func(a, b) for a, b in zip(this_vals, other_vals)]
if is_extension_array_dtype(this):
excons = type(this_vals)._from_sequence
else:
excons = type(other_vals)._from_sequence
result = excons(result)
else:
with np.errstate(all='ignore'):
result = func(this_vals, other_vals)
result = self._constructor(result, index=new_index, name=name)
result = result.__finalize__(self)
if name is None:
# When name is None, __finalize__ overwrites current name
result.name = None
return result
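The fallback logic in the diff above (attempt the vectorized op, and on TypeError retry element by element) can be sketched in isolation; plain lists stand in here for the array values:

```python
def binop_with_fallback(lvals, rvals, func):
    # Try the vectorized application first; if the underlying types don't
    # support it (TypeError), fall back to an element-by-element loop.
    try:
        return func(lvals, rvals)
    except TypeError:
        return [func(a, b) for a, b in zip(lvals, rvals)]

# list - list raises TypeError, so the element-wise fallback kicks in.
print(binop_with_fallback([1, 2], [3, 4], lambda a, b: a - b))  # [-2, -2]
```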

def combine(self, other, func, fill_value=np.nan):
def combine(self, other, func, fill_value=None):
"""
Perform elementwise binary operation on two Series using given function
with optional fill value when an index is missing from one Series or
@@ -2208,6 +2222,9 @@ def combine(self, other, func, fill_value=np.nan):
func : function
Function that takes two scalars as inputs and returns a scalar
fill_value : scalar value
The default specifies to use np.nan unless self is
backed by ExtensionArray, in which case the ExtensionArray
na_value is used.

Returns
-------
@@ -2227,20 +2244,33 @@
Series.combine_first : Combine Series values, choosing the calling
Series's values first
"""
self_is_ext = is_extension_array_dtype(self)
if fill_value is None:
if self_is_ext:
Contributor: I think na_value_for_dtype is appropriate here, probably with compat=False.

Contributor Author: I made that change.

fill_value = self.dtype.na_value
else:
fill_value = np.nan
if isinstance(other, Series):
new_index = self.index.union(other.index)
new_name = ops.get_op_result_name(self, other)
new_values = np.empty(len(new_index), dtype=self.dtype)
for i, idx in enumerate(new_index):
new_values = []
for idx in new_index:
lv = self.get(idx, fill_value)
rv = other.get(idx, fill_value)
with np.errstate(all='ignore'):
new_values[i] = func(lv, rv)
new_values.append(func(lv, rv))
else:
new_index = self.index
with np.errstate(all='ignore'):
new_values = func(self._values, other)
if not self_is_ext:
with np.errstate(all='ignore'):
new_values = func(self._values, other)
Member: I also don't really understand (though this relates to the current implementation, not your changes) why we don't do it element-wise here (no loop over the values, as is the case when other is a Series).

To me, this seems like a bug in the current implementation.

Contributor Author: @jorisvandenbossche You're correct. I created a new issue #21248. I will fix that here.

else:
new_values = [func(lv, other) for lv in self._values]
new_name = self.name

Contributor: can you put a comment on what is going on here

Contributor Author: done

if (self_is_ext and self.values.is_sequence_of_dtype(new_values)):
new_values = self._values._from_sequence(new_values)
Contributor: Under what conditions is self.values.is_sequence_of_dtype(new_values) false?

Contributor Author: combine takes a func as an argument, and that func may not necessarily return objects of the ExtensionDtype; consider, for example, a logical operator. So here self.values is the original ExtensionArray, but new_values is the result of applying the func element by element, and those individual results aren't necessarily of the corresponding ExtensionDtype.
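The same distinction shows up with ordinary dtypes (a minimal sketch, not using an ExtensionArray): a dtype-preserving func keeps the element type, while a comparison func yields booleans.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0])
s2 = pd.Series([3.0, 1.0, 3.0])

# Element-wise minimum preserves the element type...
kept = s1.combine(s2, lambda a, b: a if a < b else b)
# ...while a logical func produces an entirely different result dtype.
flags = s1.combine(s2, lambda a, b: a < b)
print(kept.tolist())   # [1.0, 1.0, 3.0]
print(flags.tolist())  # [True, False, False]
```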

Contributor: Thanks.

How important is it to allow coercion of the output type? The previous code certainly considered dtype-preserving functions to be the expected case, since the pre-allocated new_values had the same dtype as self; only if necessary was it coerced.

Without having studied the uses of combine, can we say "extension arrays are not allowed to upcast in combine"? i.e. Series[ExtensionArray].combine(Series[Any]) -> Series[ExtensionArray]. That doesn't sound quite right...

Anyway, my aversion is to having to perform a full scan of the data just to determine the dtype. That's what types are for :) Can we ask the user to provide an output_dtype argument?

Contributor Author:
Consider the following (using the implementation in this PR):

In [1]: import pandas as pd
   ...: from pandas.tests.extension.decimal.array import DecimalArray, make_data
   ...:

In [2]: da1= make_data()
   ...: da2= make_data()
   ...: da1[:5], da2[:5]
   ...:
Out[2]:
([Decimal('0.9100374922132099531069115982973016798496246337890625'),
  Decimal('0.559003605365540945371094494475983083248138427734375'),
  Decimal('0.31951366993722529752375294265220873057842254638671875'),
  Decimal('0.238154945978455767630066475248895585536956787109375'),
  Decimal('0.5851317119327676952167394119896925985813140869140625')],
 [Decimal('0.12525543165059660477567149428068660199642181396484375'),
  Decimal('0.68213474143447905273518472313298843801021575927734375'),
  Decimal('0.29982507800002611286771525556105189025402069091796875'),
  Decimal('0.297029189226840184545608281041495501995086669921875'),
  Decimal('0.5969224093736389402664599401759915053844451904296875')])

In [3]: s1 = pd.Series(DecimalArray(da1))
   ...: s2 = pd.Series(DecimalArray(da2))
   ...: pd.DataFrame({'s1':s1, 's2':s2}).head()
   ...:
Out[3]:
                                                  s1                                                 s2
0  0.91003749221320995310691159829730167984962463...  0.12525543165059660477567149428068660199642181...
1  0.55900360536554094537109449447598308324813842...  0.68213474143447905273518472313298843801021575...
2  0.31951366993722529752375294265220873057842254...  0.29982507800002611286771525556105189025402069...
3  0.23815494597845576763006647524889558553695678...  0.29702918922684018454560828104149550199508666...
4  0.58513171193276769521673941198969259858131408...  0.59692240937363894026645994017599150538444519...

In [4]: s1.dtype, s2.dtype
Out[4]:
(<pandas.tests.extension.decimal.array.DecimalDtype at 0x15bc5ef1048>,
 <pandas.tests.extension.decimal.array.DecimalDtype at 0x15bc5ef1048>)

In [5]: cores = s1.combine(s2, lambda x1, x2: x1 if x1 < x2 else x2)
   ...: cores.head()
   ...:
Out[5]:
0    0.12525543165059660477567149428068660199642181...
1    0.55900360536554094537109449447598308324813842...
2    0.29982507800002611286771525556105189025402069...
3    0.23815494597845576763006647524889558553695678...
4    0.58513171193276769521673941198969259858131408...
dtype: decimal

In [6]: clres = s1.combine(s2, lambda x1, x2: (x1 < x2))
   ...: clres.head()
   ...:
Out[6]:
0    False
1     True
2    False
3     True
4     True
dtype: bool

Note that with the implementation as in this PR, we get a Series of dtype decimal since the results are all Decimal objects, and we get a boolean Series when the results are logical.

The previous behavior would produce object dtype, which is not what you want. Given a sequence s, if you do pd.Series(s), the internals of pandas look at the sequence s to determine the dtype, but those internals won't pick things up correctly for things that are ExtensionDtype. So somehow we need to know to call ExtensionArray._from_sequence() to get the result of combine into the right type.

The implementation is already doing this element-by-element, so we are doing a full scan of both the left and right arrays. This is an extra scan on the result.

We can add an output_dtype argument to combine, but then we'd be changing the API of combine and would need to decide what the default value of that parameter is.
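The inference gap described in this thread can be reproduced with plain Decimal scalars: pandas registers no dtype for them, so the Series constructor falls back to object (a minimal sketch):

```python
import pandas as pd
from decimal import Decimal

vals = [Decimal("0.5"), Decimal("1.25")]
# pandas' type inference doesn't know these scalars belong to any
# extension dtype, so the constructor falls back to object dtype,
# hence the need to call _from_sequence explicitly.
s = pd.Series(vals)
print(s.dtype)  # object
```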

Contributor: I would rather not do this here at all, preferring instead to dispatch to the EA itself for a .combine() method; this will be much simpler.

Member: Instead of introducing the is_sequence_of_dtype check, an alternative might be a try: ... except around _from_sequence()?

I would prefer not to add combine logic to the extension array.
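That try/except suggestion might look like the sketch below; decimal_from_sequence and coerce_result are hypothetical stand-ins (not pandas API) for an ExtensionArray's _from_sequence and the calling code in Series.combine:

```python
from decimal import Decimal

def decimal_from_sequence(scalars):
    # Hypothetical stand-in for ExtensionArray._from_sequence: reject
    # scalars that don't fit the dtype.
    if not all(isinstance(s, Decimal) for s in scalars):
        raise TypeError("not all scalars are Decimal")
    return tuple(scalars)

def coerce_result(values, from_sequence):
    # The suggested alternative to is_sequence_of_dtype: just attempt
    # _from_sequence, and fall back when the scalars no longer fit.
    try:
        return from_sequence(values)
    except TypeError:
        return list(values)

print(coerce_result([Decimal("1"), Decimal("2")], decimal_from_sequence))
print(coerce_result([True, False], decimal_from_sequence))
```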

Contributor Author:
@jorisvandenbossche My most recent commit has this change you suggested and removes is_sequence_of_dtype. It required an extra check to see if it is categorical. I also added a test which revealed a couple of other bugs.

In terms of whether combine should be in the EA class or in series.py, you guys will have to decide on that.


return self._constructor(new_values, index=new_index, name=new_name)

def combine_first(self, other):
14 changes: 14 additions & 0 deletions pandas/tests/extension/category/test_categorical.py
@@ -2,6 +2,9 @@

import pytest
import numpy as np
import pandas as pd

import pandas.util.testing as tm

from pandas.api.types import CategoricalDtype
from pandas import Categorical
@@ -154,6 +157,17 @@ class TestMethods(base.BaseMethodsTests):
def test_value_counts(self, all_data, dropna):
pass

def test_combine(self):
# GH 20825
orig_data1 = make_data()
orig_data2 = make_data()
s1 = pd.Series(Categorical(orig_data1, ordered=True))
s2 = pd.Series(Categorical(orig_data2, ordered=True))
result = s1.combine(s2, lambda x1, x2: x1 <= x2)
expected = pd.Series([a <= b for (a, b) in
zip(orig_data1, orig_data2)])
tm.assert_series_equal(result, expected)


class TestCasting(base.BaseCastingTests):
pass
11 changes: 11 additions & 0 deletions pandas/tests/extension/decimal/test_decimal.py
@@ -138,6 +138,17 @@ def test_value_counts(self, all_data, dropna):

tm.assert_series_equal(result, expected)

def test_combine(self):
Member: can you put this in the base class? (and then, if needed, skip it for the json tests)

And make use of the fixtures? (i.e. pass all_data as an argument to the test function, instead of calling DecimalArray(make_data()) in the test function)

Contributor Author: That took a bit of work to do, since I need two vectors that are different. Found discussion at pytest-dev/pytest#2703 about how to do it.

Member: that's clever .. :-)

# GH 20825
orig_data1 = make_data()
orig_data2 = make_data()
s1 = pd.Series(orig_data1)
s2 = pd.Series(orig_data2)
result = s1.combine(s2, lambda x1, x2: x1 <= x2)
expected = pd.Series([a <= b for (a, b) in
zip(orig_data1, orig_data2)])
tm.assert_series_equal(result, expected)


class TestCasting(BaseDecimal, base.BaseCastingTests):
pass