
BUG: Series.combine() fails with ExtensionArray inside of Series #21183

Merged
merged 18 commits into from Jun 8, 2018

Conversation

Dr-Irv
Contributor

@Dr-Irv Dr-Irv commented May 23, 2018

This is a split of #20889 to just fix the combine issue.

@jschendel jschendel added Bug ExtensionArray Extending pandas with custom dtypes or arrays. labels May 23, 2018
@@ -47,6 +47,7 @@ Bug Fixes
~~~~~~~~~

- tab completion on :class:`Index` in IPython no longer outputs deprecation warnings (:issue:`21125`)
- Series.combine() no longer fails with ExtensionArray inside of Series (:issue:`20825`)
Member

  • Series.combine() --> :meth:`Series.combine`
  • ExtensionArray --> :class:`~pandas.api.extensions.ExtensionArray`
  • Series --> :class:`Series`

Contributor Author

Fixed

except TypeError:
result = NotImplemented
except Exception as e:
raise e
Member

I think this last except block is redundant, since non-TypeError exceptions will be raised regardless of whether that block is present or not.
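A minimal sketch of the point being made here (a hypothetical helper, not the pandas source): a trailing `except Exception as e: raise e` block is a no-op, because any non-TypeError exception propagates on its own.

```python
# Hypothetical helper illustrating the review comment: only the
# TypeError handler changes behavior; re-raising everything else
# is what Python does anyway when no handler matches.
def apply_op(func, a, b):
    try:
        return func(a, b)
    except TypeError:
        return NotImplemented
    # No `except Exception as e: raise e` needed: other exceptions
    # propagate unchanged without it.

print(apply_op(lambda x, y: x + y, 1, 2))    # 3
print(apply_op(lambda x, y: x + y, 1, "a"))  # NotImplemented
```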

Contributor Author

Fixed

if isinstance(other, Series):
new_index = self.index.union(other.index)
new_name = ops.get_op_result_name(self, other)
new_values = np.empty(len(new_index), dtype=self.dtype)
new_values = []
for i, idx in enumerate(new_index):
Member

It looks like the enumerate is no longer necessary since i isn't used, so you should be able to directly iterate over new_index.
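The suggested simplification, as a toy sketch (names are illustrative, not the PR's code):

```python
# When the positional counter from enumerate is unused,
# iterate the index directly.
new_index = ["a", "b", "c"]
new_values = []
for idx in new_index:  # instead of `for i, idx in enumerate(new_index)`
    new_values.append(idx.upper())

print(new_values)  # ['A', 'B', 'C']
```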

Contributor Author

Fixed

@@ -157,3 +160,13 @@ def test_value_counts(self, all_data, dropna):

class TestCasting(base.BaseCastingTests):
pass


def test_combine():
Member

Should this be moved to be within the TestMethods class? And can you reference the GitHub issue number in a comment?

Contributor Author

Fixed

@codecov

codecov bot commented May 23, 2018

Codecov Report

Merging #21183 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #21183      +/-   ##
==========================================
+ Coverage   91.85%   91.85%   +<.01%     
==========================================
  Files         153      153              
  Lines       49549    49573      +24     
==========================================
+ Hits        45512    45537      +25     
+ Misses       4037     4036       -1
Flag Coverage Δ
#multiple 90.25% <100%> (ø) ⬆️
#single 41.85% <7.14%> (-0.03%) ⬇️
Impacted Files Coverage Δ
pandas/core/series.py 94.17% <100%> (+0.05%) ⬆️
pandas/core/reshape/melt.py 97.34% <0%> (ø) ⬆️
pandas/core/indexes/datetimes.py 95.8% <0%> (+0.02%) ⬆️
pandas/io/formats/style.py 96.12% <0%> (+0.08%) ⬆️
pandas/core/dtypes/missing.py 92.52% <0%> (+0.57%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0c65c57...4ca28b2. Read the comment docs.

new_name = self.name

if (self_is_ext and self.values.is_sequence_of_dtype(new_values)):
new_values = self._values._from_sequence(new_values)
Contributor

Under what conditions is self.values.is_sequence_of_dtype(new_values) false?

Contributor Author

combine takes a func as an argument. The func may not necessarily return objects of the ExtensionDtype. For example, consider a logical operator. So here self.values is the original ExtensionArray, but new_values is the result of applying the func element by element, and those individual results aren't necessarily of the corresponding ExtensionDtype.

Contributor

Thanks.

How important is it to allow coercion of the output type? The previous code certainly considered dtype-preserving functions to be the expected case, since the pre-allocated new_values had the same dtype as self. Only if necessary was it coerced.

Without having studied the uses of combine, can we say "extension arrays are not allowed to upcast in combine"? i.e. Series[ExtensionArray].combine(Series[Any]) -> Series[ExtensionArray]. That doesn't sound quite right...

Anyway, my aversion is to having to perform a full scan of the data just to determine the dtype. That's what types are for :) Can we ask the user to provide an output_dtype argument?

Contributor Author

Consider the following (using the implementation in this PR):

In [1]: import pandas as pd
   ...: from pandas.tests.extension.decimal.array import DecimalArray, make_data
   ...:

In [2]: da1= make_data()
   ...: da2= make_data()
   ...: da1[:5], da2[:5]
   ...:
Out[2]:
([Decimal('0.9100374922132099531069115982973016798496246337890625'),
  Decimal('0.559003605365540945371094494475983083248138427734375'),
  Decimal('0.31951366993722529752375294265220873057842254638671875'),
  Decimal('0.238154945978455767630066475248895585536956787109375'),
  Decimal('0.5851317119327676952167394119896925985813140869140625')],
 [Decimal('0.12525543165059660477567149428068660199642181396484375'),
  Decimal('0.68213474143447905273518472313298843801021575927734375'),
  Decimal('0.29982507800002611286771525556105189025402069091796875'),
  Decimal('0.297029189226840184545608281041495501995086669921875'),
  Decimal('0.5969224093736389402664599401759915053844451904296875')])

In [3]: s1 = pd.Series(DecimalArray(da1))
   ...: s2 = pd.Series(DecimalArray(da2))
   ...: pd.DataFrame({'s1':s1, 's2':s2}).head()
   ...:
Out[3]:
                                                  s1
                     s2
0  0.91003749221320995310691159829730167984962463...  0.12525543165059660477567149428068660199642181...
1  0.55900360536554094537109449447598308324813842...  0.68213474143447905273518472313298843801021575...
2  0.31951366993722529752375294265220873057842254...  0.29982507800002611286771525556105189025402069...
3  0.23815494597845576763006647524889558553695678...  0.29702918922684018454560828104149550199508666...
4  0.58513171193276769521673941198969259858131408...  0.59692240937363894026645994017599150538444519...

In [4]: s1.dtype, s2.dtype
Out[4]:
(<pandas.tests.extension.decimal.array.DecimalDtype at 0x15bc5ef1048>,
 <pandas.tests.extension.decimal.array.DecimalDtype at 0x15bc5ef1048>)

In [5]: cores = s1.combine(s2, lambda x1, x2: x1 if x1 < x2 else x2)
   ...: cores.head()
   ...:
Out[5]:
0    0.12525543165059660477567149428068660199642181...
1    0.55900360536554094537109449447598308324813842...
2    0.29982507800002611286771525556105189025402069...
3    0.23815494597845576763006647524889558553695678...
4    0.58513171193276769521673941198969259858131408...
dtype: decimal

In [6]: clres = s1.combine(s2, lambda x1, x2: (x1 < x2))
   ...: clres.head()
   ...:
Out[6]:
0    False
1     True
2    False
3     True
4     True
dtype: bool

Note that with the implementation as in this PR, we get a Series of dtype decimal since the results are all Decimal objects, and we get a boolean Series when the results are logical.

The previous behavior would produce object dtype, which is not what you want. Given a sequence s, if you do pd.Series(s), the internals of pandas look at the sequence s to determine the dtype. But those internals won't pick things up correctly for things that are ExtensionDtype. So somehow, we need to know to call ExtensionArray._from_sequence() to get the result of combine into the right type.

The implementation is already doing this element-by-element, so we are doing a full scan of both the left and right arrays. This is an extra scan on the result.

We can add an output_dtype argument to combine, but then we'd be changing the API of combine and would need to decide what the default value of that parameter is.

Contributor

I would rather not do this here at all, preferring instead to dispatch to the EA itself for a .combine() method, this will be much simpler.

Member

Instead of introducing the is_sequence_of_dtype check, an alternative might be to try: .. except _from_sequence()?

I would prefer not to add combine logic to the extension array
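The try/except approach suggested here can be sketched in isolation. `SimpleEA` and `coerce_result` below are stand-ins, not the pandas `ExtensionArray` API: the idea is to attempt `_from_sequence` and fall back to a plain sequence when the scalars don't fit the extension type, avoiding a separate `is_sequence_of_dtype` pre-scan.

```python
class SimpleEA:
    """Toy stand-in for an extension array holding only ints."""
    def __init__(self, values):
        self.values = list(values)

    @classmethod
    def _from_sequence(cls, scalars):
        # Reject scalars that don't fit the "dtype" (e.g. floats).
        if not all(type(s) is int for s in scalars):
            raise TypeError("scalars are not all ints")
        return cls(scalars)

def coerce_result(original, new_values):
    # Try to keep the extension type; fall back to an ordinary
    # sequence if the extension type rejects the scalars.
    try:
        return type(original)._from_sequence(new_values)
    except TypeError:
        return list(new_values)

ea = SimpleEA([1, 2, 3])
print(type(coerce_result(ea, [4, 5])).__name__)  # SimpleEA
print(coerce_result(ea, [1.5, 2.5]))             # [1.5, 2.5]
```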

Contributor Author

@jorisvandenbossche My most recent commit has this change you suggested and removes is_sequence_of_dtype. It required an extra check to see if it is categorical. I also added a test which revealed a couple of other bugs.

In terms of whether combine should be in the EA class or in series.py, you guys will have to decide on that.

@@ -41,7 +41,7 @@ def win_types(request):
return request.param


@pytest.fixture(params=['kaiser', 'gaussian', 'general_gaussian', 'slepian'])
@pytest.fixture(params=['kaiser', 'gaussian', 'general_gaussian'])
Contributor

Merge conflict?

Contributor Author

Yes, something went wrong with git. I didn't touch that code, but did a rebase, and I think it pulled that in.

@@ -1118,10 +1118,12 @@ def assert_extension_array_equal(left, right):
right_na = right.isna()
assert_numpy_array_equal(left_na, right_na)

left_valid = left[~left_na].astype(object)
right_valid = right[~right_na].astype(object)
if len(left_na) > 0 and len(right_na) > 0:
Contributor

Why the changes here? left_na is an array of booleans, so it'll always be the same length as left.

Contributor Author

If left has length zero, then the statement that follows, i.e.

left_valid = left[~left_na].astype(object)

fails. There was some test case that was failing without this change (but I don't remember which one, and it may have been in the ops stuff). I will leave this change out and see if this fix for combine still works.

Contributor

Given that this works

In [9]: pd.Series()[np.array([], dtype='bool')].astype(object)
Out[9]: Series([], dtype: object)

then we should ensure that Series[extension_array] works too. If you find a reproducible example, could you make a new issue?

Contributor Author

I took this code out of this PR. It's not needed here. When we get to the ops stuff, I'll have to reintroduce it, and figure out the boundary case that required this particular case.

@@ -1224,6 +1226,9 @@ def assert_series_equal(left, right, check_dtype=True,
left = pd.IntervalIndex(left)
right = pd.IntervalIndex(right)
assert_index_equal(left, right, obj='{obj}.index'.format(obj=obj))
elif (is_extension_array_dtype(left) and not is_categorical_dtype(left) and
Contributor

is the not is_categorical_dtype needed?

Contributor Author

@Dr-Irv Dr-Irv May 23, 2018

This was left over from the ops stuff, so I have removed it.

@pep8speaks

pep8speaks commented May 23, 2018

Hello @Dr-Irv! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on June 07, 2018 at 15:45 Hours UTC

@@ -47,6 +47,7 @@ Bug Fixes
~~~~~~~~~

- tab completion on :class:`Index` in IPython no longer outputs deprecation warnings (:issue:`21125`)
- :meth:`Series.combine()` no longer fails with :class:`~pandas.api.extensions.ExtensionArray` inside of :class:`Series` (:issue:`20825`)
Contributor

move to 0.24

Contributor Author

I've moved it to 0.24

new_name = self.name

if (self_is_ext and self.values.is_sequence_of_dtype(new_values)):
new_values = self._values._from_sequence(new_values)
Contributor

I would rather not do this here at all, preferring instead to dispatch to the EA itself for a .combine() method, this will be much simpler.

@Dr-Irv
Contributor Author

Dr-Irv commented May 24, 2018

@jreback You write:

I would rather not do this here at all, preferring instead to dispatch to the EA itself for a .combine() method, this will be much simpler.

Are you suggesting that there is a default implementation of .combine() in ExtensionArray that can be overridden by implementers of the EA subclass, and that Series.combine() dispatch to that? If so, that sounds reasonable to me, and I can make that change.

@jreback
Contributor

jreback commented May 24, 2018

Are you suggesting that there is a default implementation of .combine() in ExtensionArray that can be overridden by implementers of the EA subclass, and that Series.combine() dispatch to that? If so, that sounds reasonable to me, and I can make that change.

no I would suggest the default impl is NotImplemented, this is getting too messy here with all of these if/thens

name = ops.get_op_result_name(self, other)

if (is_extension_array_dtype(this_vals) or
Contributor

Stepping back a bit here, why does ExtensionArray get a fallback to [func(a, b) for a, b in zip(...)], when ndarray doesn't? This confuses the expected signature for func: does it take two arrays a and b, or two scalars?
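The signature ambiguity can be made concrete with a toy example (illustrative only): the same `func` gives different results depending on whether it is applied element-wise or to whole sequences.

```python
import operator

def combine_elementwise(left, right, func):
    # func receives two scalars at a time
    return [func(a, b) for a, b in zip(left, right)]

def combine_arraywise(left, right, func):
    # func receives the two sequences whole
    return func(left, right)

print(combine_elementwise([1, 2], [3, 4], operator.add))  # [4, 6]
print(combine_arraywise([1, 2], [3, 4], operator.add))    # [1, 2, 3, 4]
```

For lists, `operator.add` concatenates rather than adding pairwise, which is exactly the kind of silent contract change the comment warns about.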

Member

Indeed.
And I also don't really understand how this is related to combine, as _binop does not seem to be used there. I think _binop is for implementing the flex arithmetic ops like .add() etc. But for that I think we should do this together with the actual ops change, and probably end up with the same conclusion (no explicit element-wise fallback but rely on the ExtensionArray implementation to do this)

Contributor

agree here. I am puzzled why this change at all. If the extension array supports ops this would work as is, if not it would raise a TypeError. So not sure what you are trying to do here.

Contributor Author

My mistake. When I separated out this PR from #20889, I should have put back the old _binop. Fixed in my next commit.

@@ -2227,20 +2245,36 @@ def combine(self, other, func, fill_value=np.nan):
Series.combine_first : Combine Series values, choosing the calling
Series's values first
"""
self_is_ext = is_extension_array_dtype(self.values)
if fill_value is None:
if self_is_ext:
Contributor

I think na_value_for_dtype is appropriate here, probably with compat=False.
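For reference, a hedged sketch of what this suggestion looks like. `na_value_for_dtype` is a pandas-internal helper, so its import path and exact behavior may vary across versions; with `compat=False` it should return the dtype's missing-value sentinel (NaN for floats).

```python
import numpy as np
# pandas-internal import; path may differ across pandas versions
from pandas.core.dtypes.missing import na_value_for_dtype

# Passing compat by keyword, as later suggested in this review,
# makes the call self-documenting.
fill = na_value_for_dtype(np.dtype("float64"), compat=False)
assert fill != fill  # NaN is the only value unequal to itself
```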

Contributor Author

I made that change.

new_values = func(self._values, other)
if not self_is_ext:
with np.errstate(all='ignore'):
new_values = func(self._values, other)
Member

I also don't really understand (but this is related to the current implementation, not your changes) why we don't do it element-wise here (there is no loop over the values, as there is when other is a Series).

For me, this seems like a bug in the current implementation

Contributor Author

@jorisvandenbossche You're correct. I created a new issue #21248 . I will fix that here.

@@ -174,6 +174,7 @@ Reshaping
Other
^^^^^

- :meth:`Series.combine()` no longer fails with :class:`~pandas.api.extensions.ExtensionArray` inside of :class:`Series` (:issue:`20825`)
Contributor

be positive. say it works!

Contributor Author

fixed

name = ops.get_op_result_name(self, other)

if (is_extension_array_dtype(this_vals) or
Contributor

agree here. I am puzzled why this change at all. If the extension array supports ops this would work as is, if not it would raise a TypeError. So not sure what you are trying to do here.

new_name = self.name

if self_is_ext and not is_categorical_dtype(self.values):
Contributor

I am puzzled by all of these element-by-element operations. Is there a reason you don't simply define .combine on an EA array?

Member

@jreback combine is by definition an element-by-element operation. We do exactly the same for other dtypes, see 10 lines above.
This check here only tries to convert the scalars back to an ExtensionArray.

Member

@jorisvandenbossche jorisvandenbossche left a comment

This looks much cleaner!

new_name = self.name

if self_is_ext and not is_categorical_dtype(self.values):
Member

The not is_categorical_dtype(self.values) signals again that we need a way to pass more information on the dtype to _from_sequence. xref #20747, but not for this PR!

Contributor Author

sounds good to me

Contributor

pls don't use self_is_ext, simply use is_extension_type and write this out so it's readable

if is_categorical_dtype(...):
    pass
elif is_extension_type(...):
   ....

why is the try/except here? that is very odd

Contributor Author

I changed the if test as you suggested.

@jreback wrote:

why is the try/except here? that is very odd

The idea is to first try to coerce to the same type as the extension dtype, and if that doesn't work, just call the regular constructor for the Series.

@@ -138,6 +138,22 @@ def test_value_counts(self, all_data, dropna):

tm.assert_series_equal(result, expected)

def test_combine(self):
Member

can you put this in the base class? (and then if needed skip it for the json tests)

And make use of the fixtures? (so passing all_data as an argument to the test function, instead of calling DecimalArray(make_data()) in the test function)

Contributor Author

That took a bit of work to do, since I need two vectors that are different. Found discussion at pytest-dev/pytest#2703 about how to do it.
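The fixture trick from that discussion, roughly sketched without pytest (in the real tests `gen` would be returned by a `@pytest.fixture`; `make_values` is a stand-in for the suite's data factories):

```python
import random

def data_repeated():
    """Factory whose result yields `count` independent data arrays."""
    def make_values():
        return [random.random() for _ in range(10)]

    def gen(count):
        for _ in range(count):
            yield make_values()

    return gen

gen = data_repeated()
orig, other = gen(2)          # two different vectors for combine tests
print(len(orig), len(other))  # 10 10
```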

Member

that's clever .. :-)

"""Return different versions of data for count times"""
def gen(count):
for _ in range(count):
yield NotImplemented
Member

NotImplemented -> NotImplementedError

Contributor Author

Fixed

@@ -2227,20 +2229,31 @@ def combine(self, other, func, fill_value=np.nan):
Series.combine_first : Combine Series values, choosing the calling
Series's values first
"""
self_is_ext = is_extension_array_dtype(self.values)
if fill_value is None:
fill_value = na_value_for_dtype(self.dtype, False)
Contributor

Use the keyword for False so it’s clearer what’s being specified.

Contributor Author

OK, that is done.

@TomAugspurger
Contributor

Lgtm otherwise.

Contributor

@jreback jreback left a comment

mostly minor comments. looks pretty good. this is a nicely targeted change.

new_name = self.name

if self_is_ext and not is_categorical_dtype(self.values):
Contributor

pls don't use self_is_ext, simply use is_extension_type and write this out so it's readable

if is_categorical_dtype(...):
    pass
elif is_extension_type(...):
   ....

why is the try/except here? that is very odd

@@ -154,6 +163,10 @@ class TestMethods(base.BaseMethodsTests):
def test_value_counts(self, all_data, dropna):
pass

@pytest.mark.skip(reason="combine add for categorical not supported")
def test_combine_add(self, data_repeated):
pass
Contributor

does it raise? if so pls test that

Contributor Author

This is interesting. Since combine by definition in the docs does an element by element operation, if you do something like:

In [1]: import pandas as pd                                                     
                                                                                
In [2]: cat1 = pd.Categorical(values=["one","two","three","three","two","one"], 
   ...:  categories=["one","two","three"], ordered=True)                        
                                                                                
In [3]: s1 = pd.Series(cat1)                                                    
                                                                                
In [4]: s1                                                                      
Out[4]:                                                                         
0      one                                                                      
1      two                                                                      
2    three                                                                      
3    three                                                                      
4      two                                                                      
5      one                                                                      
dtype: category                                                                 
Categories (3, object): [one < two < three]                                     
                                                                                
In [5]: s1.combine("abc", lambda x,y : x+y)                                     

In pandas 0.23.0, this fails. But the binary operation is defined on an element by element basis, so the new implementation will return:

Out[4]:
0      oneabc
1      twoabc
2    threeabc
3    threeabc
4      twoabc
5      oneabc
dtype: object

IMHO, this is correct, because of what combine is supposed to do. So I will test that.

@@ -103,3 +103,36 @@ def test_factorize_equivalence(self, data_for_grouping, na_sentinel):

tm.assert_numpy_array_equal(l1, l2)
self.assert_extension_array_equal(u1, u2)

def test_combine_le(self, data_repeated):
Contributor

can you give a 1-liner explaining what this is testing. the name of the test is uninformative.

Contributor Author

done

@@ -2208,6 +2208,8 @@ def combine(self, other, func, fill_value=np.nan):
func : function
Function that takes two scalars as inputs and return a scalar
fill_value : scalar value
The default specifies to use the appropriate NaN value for
the underlying dtype of the Series
Contributor

does this need a versionchanged?

Member

There should be no change in behaviour for normal Series I think, as na_value_for_dtype will give NaN/NaT (which was the default before). It's only for extension arrays that it might give another value, depending on what the extension array defined its missing value to be.

Contributor Author

I agree with @jorisvandenbossche

new_name = self.name

Contributor

can you put a comment on what is going on here

Contributor Author

done

@jreback jreback added this to the 0.24.0 milestone Jun 5, 2018
@jreback
Contributor

jreback commented Jun 5, 2018

@Dr-Irv also can you make sure that the cases in your OP are covered here (I think so, but pls verify)

@Dr-Irv
Contributor Author

Dr-Irv commented Jun 6, 2018

@jreback wrote:

@Dr-Irv also can you make sure that the cases in your OP are covered here (I think so, but pls verify)

Here you go:

In [1]: import pandas as pd

In [2]: from pandas.tests.extension.decimal.array import DecimalArray, make_data

In [3]: da1= make_data()

In [4]: da2 = make_data()

In [5]: s1 = pd.Series(DecimalArray(da1))

In [6]: s2 = pd.Series(DecimalArray(da2))

In [7]: s1.combine(s2, lambda x1, x2: x1 if x1 < x2 else x2)
Out[7]:
0     0.08098492909619514623642544393078424036502838...
1     0.38231358539972493115755014514434151351451873...
2     0.28683621751796128940270591556327417492866516...
3     0.68315691040978210324396968644578009843826293...
4     0.37188279135446988821200875463546253740787506...
5     0.29391070580371847498213355720508843660354614...
6     0.00281662212378319676275850724778138101100921...
7     0.68424426414243522120983698187046684324741363...
8     0.82298917655718306640721948497230187058448791...
9       0.104520052984668154749670065939426422119140625
10    0.36541026835682677287309161329176276922225952...
11    0.15450295421108961591016850434243679046630859375
12    0.28227045216414337058807859648368321359157562...
13    0.07428041682799846334717130957869812846183776...
14    0.76112360264534761888910452398704364895820617...
15    0.31991420667129122357152937183855101466178894...
16    0.43808559346706621440148410329129546880722045...
17    0.81185756018222499097447553140227682888507843...
18    0.18755848298291555309447176114190369844436645...
19    0.45064358633799639353156862853211350739002227...
20    0.01908941180337764276231382609694264829158782...
21    0.05994588216042351369594598509138450026512145...
22    0.45670931996552022180679841767414472997188568...
23    0.21696297680411980035586338999564759433269500...
24    0.77792921984939356061516946283518336713314056...
25    0.10732191741882657343154505724669434130191802...
26    0.26112839619732719498301776184234768152236938...
27    0.78917786400767853116633432364324107766151428...
28    0.11041422426646729793020540455472655594348907...
29    0.48501200782261966182318246865179389715194702...
                            ...
70    0.01573854854098000188855621672701090574264526...
71    0.44412972763803593156950455522746779024600982...
72    0.25386583135624463114510263039846904575824737...
73    0.18052436704533048050791421701433137059211730...
74    0.01768961191491325024571779067628085613250732...
75    0.19783353177328266703227654943475499749183654...
76    0.64275305832443418996291484290850348770618438...
77    0.56556945132078839666434078026213683187961578...
78    0.45692610540309075428666574225644581019878387...
79    0.54208092770195082099604633185663260519504547...
80    0.04475845682326173857745743589475750923156738...
81    0.02586933176340200368770183558808639645576477...
82    0.38495809393040258949980625402531586587429046...
83    0.09473826636927540345567422264139167964458465...
84    0.09800918702868577359055279885069467127323150...
85    0.06037114142095312274705065647140145301818847...
86    0.54181647584734216049895394462510012090206146...
87    0.39820409063224215806542360951425507664680480...
88    0.08573080156333134915769278450170531868934631...
89    0.31841090690861806322686788917053490877151489...
90    0.05991365875060672419039065061951987445354461...
91    0.20527392489879703330046822884469293057918548...
92    0.68151285114935533648861110123107209801673889...
93    0.33294634130492617440921776506002061069011688...
94    0.50598008087709389624109235228388570249080657...
95    0.31112635294260559959411693853326141834259033...
96    0.15173983856659867264227159466827288269996643...
97    0.12529127428765418628131556033622473478317260...
98    0.15472130908820258543556747099501080811023712...
99    0.36042221705147114985123835140257142484188079...
Length: 100, dtype: decimal

In [8]: cat1 = pd.Categorical(values=["one","two","three","three","two","one"],
   ...:                       categories=["one","two","three"], ordered=True)
   ...: cat2 = pd.Categorical(values=["three","two","one","one","two","three"],
   ...:                       categories=["one","two","three"], ordered=True)
   ...: s1 = pd.Series(cat1)
   ...: s2 = pd.Series(cat2)
   ...: s1, s2
Out[8]:
(0      one
 1      two
 2    three
 3    three
 4      two
 5      one
 dtype: category
 Categories (3, object): [one < two < three], 0    three
 1      two
 2      one
 3      one
 4      two
 5    three
 dtype: category
 Categories (3, object): [one < two < three])

In [9]: s1.combine(s2, lambda x1, x2: x1 <= x2)
Out[9]:
0     True
1     True
2    False
3    False
4     True
5     True
dtype: bool
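The categorical comparison above is just an elementwise zip of the two (aligned) series. Ignoring index alignment and NA handling, the semantics can be sketched in plain Python; the `order` mapping below stands in for the ordered categories and is illustrative, not pandas code:

```python
# Minimal sketch of Series.combine's elementwise semantics for two
# aligned sequences (no index alignment, no NA handling).
def combine(left, right, func):
    return [func(a, b) for a, b in zip(left, right)]

# Stand-in for the ordered categories ["one" < "two" < "three"].
order = {"one": 1, "two": 2, "three": 3}
s1 = ["one", "two", "three", "three", "two", "one"]
s2 = ["three", "two", "one", "one", "two", "three"]

result = combine(s1, s2, lambda a, b: order[a] <= order[b])
# matches Out[9]: [True, True, False, False, True, True]
```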

In [10]: s = pd.Series([i*10 for i in range(5)])

In [11]: s.combine(22, lambda x,y: min(x,y))
Out[11]:
0     0
1    10
2    20
3    22
4    22
dtype: int64
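When the second argument is a scalar rather than a Series, combine broadcasts it against every element. A minimal sketch of that behavior (the helper name is hypothetical, not a pandas API):

```python
# Sketch of combine with a scalar second argument: the scalar is
# passed as the second operand for every element of the series.
def combine_with_scalar(values, other, func):
    return [func(v, other) for v in values]

result = combine_with_scalar([0, 10, 20, 30, 40], 22, min)
# matches Out[11]: [0, 10, 20, 22, 22]
```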

In [12]: pd.__version__
Out[12]: '0.24.0.dev0+76.g2a21117c7.dirty'

@jreback
Contributor

jreback commented Jun 6, 2018

i mean can u verify that u have test coverage for those cases

@Dr-Irv
Contributor Author

Dr-Irv commented Jun 6, 2018

@jreback wrote:

i mean can u verify that u have test coverage for those cases

The new tests subsume those cases. Should I explicitly include those cases?

@jorisvandenbossche
Member

The new tests subsume those cases. Should I explicitly include those cases?

As long the case is part of the tests you added, that is fine.

@jorisvandenbossche
Member

Regarding combine now working for categorical data while in 0.23.0 it failed (#21183 (comment)), I think that is fine.
At least, that is the consequence of catching the TypeError possibly raised by ExtensionArray._from_sequence, which allows the output of combine to have a different dtype than the input.
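The fallback described here can be sketched in isolation: try to rebuild the extension array from the combined values, and fall back to the raw values when construction rejects them. `from_sequence` below is a toy stand-in mimicking DecimalArray's strictness, not actual pandas code:

```python
from decimal import Decimal

def from_sequence(scalars):
    # Toy stand-in for ExtensionArray._from_sequence: only Decimal
    # values are accepted; anything else raises TypeError.
    if not all(isinstance(s, Decimal) for s in scalars):
        raise TypeError("values must be Decimal")
    return list(scalars)

def wrap_combine_result(values):
    # Try to keep the extension dtype; on TypeError fall back to the
    # raw values, letting the result take another (inferred) dtype.
    try:
        return from_sequence(values)
    except TypeError:
        return list(values)

wrap_combine_result([Decimal("0.1"), Decimal("0.2")])  # rebuilt as decimals
wrap_combine_result([True, False])  # falls back to the plain bool values
```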

Contributor

@jreback jreback left a comment

looks good. a question.

if is_categorical_dtype(self.values):
pass
elif is_extension_array_dtype(self.values):
try:
Contributor

so what kind of op hits the TypeError here? I find it a bit disconcerting that you need to catch a TypeError. Is this the case where the combine op returns a result which is not an extension type (e.g. say it's an int or something)? If so, pls indicate via a comment.

Contributor Author

@jreback Here is an example:

In [1]: import pandas as pd

In [2]: from pandas.tests.extension.decimal.array import DecimalArray, make_data

In [3]: s = pd.Series(DecimalArray(make_data())).head()

In [4]: s.combine(0.5, lambda x, y: x < y)
Out[4]:
0    False
1    False
2    False
3    False
4    False
dtype: bool

Since the function passed to combine is an arbitrary function, it could return a result of any type, which may not be the type that will fit in the EA.

I'll add a comment.

@@ -181,6 +181,7 @@ Reshaping
Other
^^^^^

-
- :meth:`Series.combine()` works correctly with :class:`~pandas.api.extensions.ExtensionArray` inside of :class:`Series` (:issue:`20825`)
Contributor

can you move the ExtensionArray to a sub-section (create a new one) as lots of changes in EA (you can make a new bug fix section).

Contributor Author

I just made a bug fix section

@@ -181,6 +181,7 @@ Reshaping
Other
^^^^^

-
- :meth:`Series.combine()` works correctly with :class:`~pandas.api.extensions.ExtensionArray` inside of :class:`Series` (:issue:`20825`)
- :meth:`Series.combine()` with scalar argument now works for any function type (:issue:`21248`)
Contributor

this one can go in reshaping bug fixes.

Contributor

or into EA bug fixes

Contributor Author

In EA bug fixes

@jreback jreback merged commit 7f6ea67 into pandas-dev:master Jun 8, 2018
@jreback
Contributor

jreback commented Jun 8, 2018

thanks @Dr-Irv nice patch!

thanks for the patience. non-trivial things take more time :-D
