Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Series.combine() fails with ExtensionArray inside of Series #21183

Merged
merged 18 commits into from Jun 8, 2018
Merged
Show file tree
Hide file tree
Changes from 11 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
3 changes: 2 additions & 1 deletion doc/source/whatsnew/v0.24.0.txt
Expand Up @@ -181,7 +181,8 @@ Reshaping
Other
^^^^^

-
- :meth:`Series.combine()` works correctly with :class:`~pandas.api.extensions.ExtensionArray` inside of :class:`Series` (:issue:`20825`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move the ExtensionArray to a sub-section (create a new one) as lots of changes in EA (you can make a new bug fix section).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just made a bug fix section

- :meth:`Series.combine()` with scalar argument now works for any function type (:issue:`21248`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one can go in reshaping bug fixes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or into EA bug fixes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In EA bug fixes

-
-

23 changes: 18 additions & 5 deletions pandas/core/series.py
Expand Up @@ -2196,7 +2196,7 @@ def _binop(self, other, func, level=None, fill_value=None):
result.name = None
return result

def combine(self, other, func, fill_value=np.nan):
def combine(self, other, func, fill_value=None):
"""
Perform elementwise binary operation on two Series using given function
with optional fill value when an index is missing from one Series or
Expand All @@ -2208,6 +2208,8 @@ def combine(self, other, func, fill_value=np.nan):
func : function
Function that takes two scalars as inputs and return a scalar
fill_value : scalar value
The default specifies to use the appropriate NaN value for
the underlying dtype of the Series
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need a versionchanged?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be no change in behaviour for normal Series I think, as the na_value_for_dtypewill give NaN/NaT (which was the default before). It's only for extension arrays that it might give another value, depending on what the extension array defined its missing value to be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @jorisvandenbossche


Returns
-------
Expand All @@ -2227,20 +2229,31 @@ def combine(self, other, func, fill_value=np.nan):
Series.combine_first : Combine Series values, choosing the calling
Series's values first
"""
self_is_ext = is_extension_array_dtype(self.values)
if fill_value is None:
fill_value = na_value_for_dtype(self.dtype, False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the keyword for False so it’s clearer what’s being specified.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, that is done.


if isinstance(other, Series):
new_index = self.index.union(other.index)
new_name = ops.get_op_result_name(self, other)
new_values = np.empty(len(new_index), dtype=self.dtype)
for i, idx in enumerate(new_index):
new_values = []
for idx in new_index:
lv = self.get(idx, fill_value)
rv = other.get(idx, fill_value)
with np.errstate(all='ignore'):
new_values[i] = func(lv, rv)
new_values.append(func(lv, rv))
else:
new_index = self.index
with np.errstate(all='ignore'):
new_values = func(self._values, other)
new_values = [func(lv, other) for lv in self._values]
new_name = self.name

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put a comment on what is going on here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if self_is_ext and not is_categorical_dtype(self.values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am puzzled by all of these element-by-element operations. Is there a reason you don't simply define .combine on an EA array?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback combine is by definition an element-by-element operation. We do exactly the same for other dtypes, see 10 lines above.
This check here only tries to convert the scalars back to an ExtensionArray.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The not is_categorical_dtype(self.values) signals again that we need a way to pass more information on the dtype to _from_sequence. xref #20747, but not for this PR!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good to me

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls don't use self_is_ext simply use is_extension_type and write this out so its readable

if is_categorical_dtype(...):
    pass
elif is_extension_type(...):
   ....

why is the try/except here? that is very odd

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the if test as you suggested.

@jreback wrote:

why is the try/except here? that is very odd

The idea is to first try to coerce to the same type as the extension dtype, and if that doesn't work, just call the regular constructor for the Series.

try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so what kind of op hits the type error here? I find this a bit disconcerting that you need to catch a TypeError? is this a case that the combine op returns a result which is not an extension type (e.g. say its an int or someting), is that the reason? if so pls indicate via a comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Here is an example:

In [1]: import pandas as pd

In [2]: from pandas.tests.extension.decimal.array import DecimalArray, make_dat
   ...: a

In [3]: s = pd.Series(DecimalArray(make_data())).head()

In [4]: s.combine(0.5, lambda x, y: x < y)
Out[4]:
0    False
1    False
2    False
3    False
4    False
dtype: bool

Since the function passed to combine is an arbitrary function, it could return a result of any type, which may not be the type that will fit in the EA.

I'll add a comment.

new_values = self._values._from_sequence(new_values)
except TypeError:
pass

return self._constructor(new_values, index=new_index, name=new_name)

def combine_first(self, other):
Expand Down
33 changes: 33 additions & 0 deletions pandas/tests/extension/base/methods.py
Expand Up @@ -103,3 +103,36 @@ def test_factorize_equivalence(self, data_for_grouping, na_sentinel):

tm.assert_numpy_array_equal(l1, l2)
self.assert_extension_array_equal(u1, u2)

def test_combine_le(self, data_repeated):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you give a 1-liner explaining what this is testing. the name of the test is uninformative.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

# GH 20825
orig_data1, orig_data2 = data_repeated(2)
s1 = pd.Series(orig_data1)
s2 = pd.Series(orig_data2)
result = s1.combine(s2, lambda x1, x2: x1 <= x2)
expected = pd.Series([a <= b for (a, b) in
zip(list(orig_data1), list(orig_data2))])
self.assert_series_equal(result, expected)

val = s1.iloc[0]
result = s1.combine(val, lambda x1, x2: x1 <= x2)
expected = pd.Series([a <= val for a in list(orig_data1)])
self.assert_series_equal(result, expected)

def test_combine_add(self, data_repeated):
# GH 20825
orig_data1, orig_data2 = data_repeated(2)
s1 = pd.Series(orig_data1)
s2 = pd.Series(orig_data2)
result = s1.combine(s2, lambda x1, x2: x1 + x2)
expected = pd.Series(
orig_data1._from_sequence([a + b for (a, b) in
zip(list(orig_data1),
list(orig_data2))]))
self.assert_series_equal(result, expected)

val = s1.iloc[0]
result = s1.combine(val, lambda x1, x2: x1 + x2)
expected = pd.Series(
orig_data1._from_sequence([a + val for a in list(orig_data1)]))
self.assert_series_equal(result, expected)
13 changes: 13 additions & 0 deletions pandas/tests/extension/category/test_categorical.py
Expand Up @@ -29,6 +29,15 @@ def data_missing():
return Categorical([np.nan, 'A'])


@pytest.fixture
def data_repeated():
"""Return different versions of data for count times"""
def gen(count):
for _ in range(count):
yield Categorical(make_data())
yield gen


@pytest.fixture
def data_for_sorting():
return Categorical(['A', 'B', 'C'], categories=['C', 'A', 'B'],
Expand Down Expand Up @@ -154,6 +163,10 @@ class TestMethods(base.BaseMethodsTests):
def test_value_counts(self, all_data, dropna):
pass

@pytest.mark.skip(reason="combine add for categorical not supported")
def test_combine_add(self, data_repeated):
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it raise? if so pls tests that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting. Since combine by definition in the docs does an element by element operation, if you do something like:

In [1]: import pandas as pd                                                     
                                                                                
In [2]: cat1 = pd.Categorical(values=["one","two","three","three","two","one"], 
   ...:  categories=["one","two","three"], ordered=True)                        
                                                                                
In [3]: s1 = pd.Series(cat1)                                                    
                                                                                
In [4]: s1                                                                      
Out[4]:                                                                         
0      one                                                                      
1      two                                                                      
2    three                                                                      
3    three                                                                      
4      two                                                                      
5      one                                                                      
dtype: category                                                                 
Categories (3, object): [one < two < three]                                     
                                                                                
In [5]: s1.combine("abc", lambda x,y : x+y)                                     

In pandas 0.23.0, this fails. But the binary operation is defined on an element by element basis, so the new implementation will return:

Out[4]:
0      oneabc
1      twoabc
2    threeabc
3    threeabc
4      twoabc
5      oneabc
dtype: object

IMHO, this is correct, because of what combine is supposed to do. So I will test that.



class TestCasting(base.BaseCastingTests):
pass
9 changes: 9 additions & 0 deletions pandas/tests/extension/conftest.py
Expand Up @@ -30,6 +30,15 @@ def all_data(request, data, data_missing):
return data_missing


@pytest.fixture
def data_repeated():
"""Return different versions of data for count times"""
def gen(count):
for _ in range(count):
yield NotImplemented
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NotImplemented -> NotImplementedError

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

yield gen


@pytest.fixture
def data_for_sorting():
"""Length-3 array with a known sort order.
Expand Down
4 changes: 3 additions & 1 deletion pandas/tests/extension/decimal/array.py
Expand Up @@ -28,7 +28,9 @@ class DecimalArray(ExtensionArray):
dtype = DecimalDtype()

def __init__(self, values):
assert all(isinstance(v, decimal.Decimal) for v in values)
for val in values:
if not isinstance(val, self.dtype.type):
raise TypeError
values = np.asarray(values, dtype=object)

self._data = values
Expand Down
8 changes: 8 additions & 0 deletions pandas/tests/extension/decimal/test_decimal.py
Expand Up @@ -25,6 +25,14 @@ def data_missing():
return DecimalArray([decimal.Decimal('NaN'), decimal.Decimal(1)])


@pytest.fixture
def data_repeated():
def gen(count):
for _ in range(count):
yield DecimalArray(make_data())
yield gen


@pytest.fixture
def data_for_sorting():
return DecimalArray([decimal.Decimal('1'),
Expand Down
8 changes: 8 additions & 0 deletions pandas/tests/extension/json/test_json.py
Expand Up @@ -187,6 +187,14 @@ def test_sort_values_missing(self, data_missing_for_sorting, ascending):
super(TestMethods, self).test_sort_values_missing(
data_missing_for_sorting, ascending)

@pytest.mark.skip(reason="combine for JSONArray not supported")
def test_combine_le(self, data_repeated):
pass

@pytest.mark.skip(reason="combine for JSONArray not supported")
def test_combine_add(self, data_repeated):
pass


class TestCasting(BaseJSON, base.BaseCastingTests):
@pytest.mark.xfail
Expand Down
13 changes: 13 additions & 0 deletions pandas/tests/series/test_combine_concat.py
Expand Up @@ -60,6 +60,19 @@ def test_append_duplicates(self):
with tm.assert_raises_regex(ValueError, msg):
pd.concat([s1, s2], verify_integrity=True)

def test_combine_scalar(self):
# GH 21248
# Note - combine() with another Series is tested elsewhere because
# it is used when testing operators
s = pd.Series([i * 10 for i in range(5)])
result = s.combine(3, lambda x, y: x + y)
expected = pd.Series([i * 10 + 3 for i in range(5)])
tm.assert_series_equal(result, expected)

result = s.combine(22, lambda x, y: min(x, y))
expected = pd.Series([min(i * 10, 22) for i in range(5)])
tm.assert_series_equal(result, expected)

def test_combine_first(self):
values = tm.makeIntIndex(20).values.astype(float)
series = Series(values, index=tm.makeIntIndex(20))
Expand Down