
ENH: Support EAs in Series.unstack #23284

Merged
merged 34 commits into from Nov 7, 2018

Conversation

5 participants
@TomAugspurger
Contributor

commented Oct 22, 2018

Closes #23077

This prevents ExtensionArray-backed series from being converted to object-dtype in unstack.

The strategy is to do a dummy unstack on an ndarray of integers, which provides the indices to take later on; we then concat the taken pieces together at the end. This gives decent performance and seems pretty maintainable in the long run.

I'll post some benchmarks later.

Do we want to do DataFrame.stack() in the same PR?
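The dummy-unstack strategy can be sketched roughly like this (illustrative names only, not the PR's actual helper; this uses the modern `.array` accessor and only handles the default last-level unstack):

```python
import numpy as np
import pandas as pd

def unstack_ea_series(ser):
    """Sketch: unstack a dummy Series of positional integers to learn each
    column's take-indices, then take from the extension array and concat."""
    # dummy Series of positions, sharing the original MultiIndex
    dummy = pd.Series(np.arange(len(ser)), index=ser.index)
    indices = dummy.unstack(fill_value=-1)  # -1 marks missing cells
    arr = ser.array
    # take from the extension array per column; -1 becomes the dtype's NA
    columns = {
        col: pd.Series(arr.take(indices[col].to_numpy(), allow_fill=True))
        for col in indices.columns
    }
    out = pd.concat(columns, axis=1)
    out.index = indices.index
    return out
```

Because each column is built with `ExtensionArray.take`, the result keeps the original extension dtype instead of falling back to object.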

@@ -102,7 +102,10 @@ def copy(self, deep=False):
def astype(self, dtype, copy=True):
if isinstance(dtype, type(self.dtype)):
return type(self)(self._data, context=dtype.context)
return super(DecimalArray, self).astype(dtype, copy)
# need to replace decimal NA

@TomAugspurger

TomAugspurger Oct 22, 2018

Author Contributor

Series.equals doesn't consider Series([np.nan]) equal to Series([Decimal('NaN')]). I made this change mainly to facilitate that.

@pep8speaks


commented Oct 22, 2018

Hello @TomAugspurger! Thanks for updating the PR.

Comment last updated on October 22, 2018 at 21:41 UTC
@jschendel
Member

left a comment

A couple minor comments

@@ -947,3 +950,22 @@ def make_axis_dummies(frame, axis='minor', transform=None):
values = values.take(labels, axis=0)

return DataFrame(values, columns=items, index=frame.index)


def unstack_extension_series(series, level, fill_value):

@jschendel

jschendel Oct 22, 2018

Member

Can you move this function up to around line 424? It looks like this file has all unstack related code grouped together first, followed by stack code grouped together, so having unstack_extension_series at the bottom seems a little out of place.

n = index.nlevels
levels = list(range(n))
# [0, 1, 2]
# -> [(0,), (1,), (2,) (0, 1), (1, 0)]

@jschendel

jschendel Oct 22, 2018

Member

Shouldn't this be -> [(0,), (1,), (2,), (0, 1), (0, 2), (1, 0), (1, 2), (2, 0), (2, 1)]? Not super important, but caused me a brief moment of confusion.

@TomAugspurger

TomAugspurger Oct 23, 2018

Author Contributor

Yes, you're correct.

@TomAugspurger

Contributor Author

commented Oct 23, 2018

Just fixed the decimal failures.

There will be a remaining test failure I haven't addressed yet. We had a test that did Series[categorical].unstack(fill_value=value) for a value that wasn't part of the categorical's original categories. Our take isn't correct right now (#23296), but once that's fixed there's still an API discussion: should we allow take to fill with "new" categories that weren't previously present?

In [2]: cat = pd.Categorical(['a', 'a', 'b'])

In [3]: cat.take([0, -1, -1], fill_value='d', allow_fill=True)

Should that raise? Return a Categorical with categories ['a', 'b', 'd']?

I'm having some deja vu right now; I think we've discussed this before.

I think if we were designing that today, we wouldn't have allowed that.
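For reference, the uncontroversial variants of the take above look like this; the out-of-categories fill_value is exactly the case under discussion (later pandas versions reject it):

```python
import pandas as pd

cat = pd.Categorical(['a', 'a', 'b'])

# Filling with NA is always allowed: -1 positions become NaN
na_filled = cat.take([0, -1, -1], allow_fill=True)

# Filling with an existing category is also uncontroversial
a_filled = cat.take([0, -1, -1], allow_fill=True, fill_value='a')

# cat.take([0, -1, -1], allow_fill=True, fill_value='d') is the open
# question: 'd' is not among the categories ['a', 'b']
```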

@@ -807,6 +807,7 @@ update the ``ExtensionDtype._metadata`` tuple to match the signature of your
- Updated the ``.type`` attribute for ``PeriodDtype``, ``DatetimeTZDtype``, and ``IntervalDtype`` to be instances of the dtype (``Period``, ``Timestamp``, and ``Interval`` respectively) (:issue:`22938`)
- :func:`ExtensionArray.isna` is allowed to return an ``ExtensionArray`` (:issue:`22325`).
- Support for reduction operations such as ``sum``, ``mean`` via opt-in base class method override (:issue:`22762`)
- :meth:`Series.unstack` no longer converts extension arrays to object-dtype ndarrays. The output ``DataFrame`` will now have the same dtype as the input. This changes behavior for Categorical and Sparse data (:issue:`23077`).

@jreback

jreback Oct 23, 2018

Contributor

really? what does this change for Categorical?

@TomAugspurger

TomAugspurger Oct 23, 2018

Author Contributor

Previously Series[Categorical].unstack() returned DataFrame[object].

Now it'll be a DataFrame[Categorical], i.e. unstack() preserves the CategoricalDtype.

@TomAugspurger

TomAugspurger Oct 23, 2018

Author Contributor

Ah, I forgot. Previously, we internally went Categorical -> object -> Categorical. Now we avoid the intermediate conversion to object.

So the changes from 0.23.4 will be

  1. Series[category].unstack() avoids a conversion to object
  2. Series[Sparse].unstack is sparse (no intermediate conversion to dense)

Once DatetimeTZ is an ExtensionArray, we'll presumably preserve that as well. On 0.23.4, we convert to datetime64[ns]:

In [48]: index = pd.MultiIndex.from_tuples([('A', 0), ('A', 1), ('B', 1)])

In [49]: ser = pd.Series(pd.date_range('2000', periods=3, tz="US/Central"), index=index)

In [50]: ser.unstack().dtypes
Out[50]:
0    datetime64[ns]
1    datetime64[ns]
dtype: object
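The categorical case from change (1) can be checked directly; with this PR the unstacked frame keeps the CategoricalDtype (small demo, assuming a pandas version that includes this change):

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples([('A', 0), ('A', 1), ('B', 1)])
ser = pd.Series(pd.Categorical(['a', 'b', 'a']), index=idx)

result = ser.unstack()
# Each column keeps the categorical dtype instead of becoming object
print(result.dtypes)
```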

@jreback

jreback Oct 24, 2018

Contributor

ok, this might need a larger note then

@TomAugspurger

Contributor Author

commented Oct 23, 2018

I think if we were designing that today, we wouldn't have allowed that.

I'm actually rethinking this. Maybe we would want to allow it. It's a pretty clear statement of user intent, and I could easily imagine someone wanting to do something like "take, but fill missing values (-1) with 'None' or 'other'".

@jorisvandenbossche

Member

commented Oct 23, 2018

Do you want to resolve the `fill_value` question here, or leave it for #23296?

(as I mentioned there: I would preserve the dtype, which then means only allowing a fill_value that is NaN or within the categories)

@TomAugspurger

Contributor Author

commented Oct 23, 2018

We can ignore fill_value in this PR. Though of course we should discuss the API ramifications of unstack preserving the dtype.

If we agree that Categorical.take should not allow new categories for fill_value, I think we have two options

  1. Not allow Series.unstack(..., fill_value=fill_value) to be a new category. Raise a TypeError instead.
  2. Allow Series.unstack(..., fill_value=fill_value) to be a new category by adding it to the CategoricalDtype before `take`-ing.

I'm not sure which is preferred.

This is blocked by #23296 for now.
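Option 2 can be sketched with public Categorical methods (illustrative only; the PR doesn't implement this):

```python
import pandas as pd

def take_with_new_category(cat, indexer, fill_value):
    # Option 2: widen the dtype first, then take with the new fill_value
    if fill_value not in cat.categories:
        cat = cat.add_categories([fill_value])
    return cat.take(indexer, allow_fill=True, fill_value=fill_value)

cat = pd.Categorical(['a', 'b', 'a'])
filled = take_with_new_category(cat, [0, -1, 1], 'c')
```

The result's dtype differs from the input's, which is why this option needs API discussion.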

@TomAugspurger

Contributor Author

commented Oct 23, 2018

Apparently, fill_value was broken in 0.23.4 for Categorical:

In [51]: index = pd.MultiIndex.from_tuples([('A', 0), ('A', 1), ('B', 1)])

In [52]: ser = pd.Series(pd.Categorical(['a', 'b', 'a']), index=index)

In [53]: ser.unstack(fill_value='c')
Out[53]:
     0  1
A    a  b
B  NaN  a

In [54]: ser.unstack(fill_value='a')
Out[54]:
   0  1
A  a  b
B  a  a

We just silently didn't fill the 'c'. So I guess we just raise an error there instead?

@jorisvandenbossche

Member

commented Oct 23, 2018

So I guess we just raise an error there instead?

+1

@TomAugspurger

Contributor Author

commented Oct 23, 2018

https://github.com/pandas-dev/pandas/pull/10246/files#diff-79e0785420ae1c686623848c4d561486R261 indicates that this was deliberate, but I didn't see any discussion / documentation around it, so I'm calling it a bug.

@jorisvandenbossche
Member

left a comment

Does this now also work for unstacking a DataFrame with an EA column? If so, maybe add that to the test case?

@codecov


commented Oct 24, 2018

Codecov Report

Merging #23284 into master will decrease coverage by 0.01%.
The diff coverage is 97.29%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23284      +/-   ##
==========================================
- Coverage   92.25%   92.23%   -0.02%     
==========================================
  Files         161      161              
  Lines       51186    51198      +12     
==========================================
+ Hits        47222    47224       +2     
- Misses       3964     3974      +10
Flag       Coverage Δ
#multiple  90.62% <97.29%> (-0.02%) ⬇️
#single    42.27% <27.02%> (-0.01%) ⬇️

Impacted Files                      Coverage Δ
pandas/core/internals/managers.py   95.74% <100%> (ø) ⬆️
pandas/core/reshape/reshape.py      99.54% <100%> (-0.01%) ⬇️
pandas/core/internals/blocks.py     93.67% <95%> (-0.36%) ⬇️
pandas/core/arrays/base.py          97.35% <0%> (-0.67%) ⬇️
pandas/core/arrays/datetimelike.py  95.83% <0%> (-0.27%) ⬇️
pandas/core/arrays/categorical.py   95.09% <0%> (-0.13%) ⬇️
pandas/core/arrays/sparse.py        91.71% <0%> (-0.13%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jreback
Contributor

left a comment

lgtm, modulo a subsection in the docs

@jreback

jreback approved these changes Nov 5, 2018

Contributor

left a comment

happy to merge, just some questions about the decimal nan checking

@TomAugspurger

Contributor Author

commented Nov 6, 2018

K, I'll fix the pd.isna(Decimal('NaN')) and revert the other changes.

@TomAugspurger

Contributor Author

commented Nov 6, 2018

Hmm, fixing isna for decimal seems to be expensive. np.isnan raises on decimal input, so we would need to search any object array for decimal elements, and call our isna on it. I don't think that's a good idea.

@jreback

Contributor

commented Nov 6, 2018

Hmm, fixing isna for decimal seems to be expensive. np.isnan raises on decimal input, so we would need to search any object array for decimal elements, and call our isna on it. I don't think that's a good idea.

this would only be for object input, and can you call math.isnan (maybe) instead?

@TomAugspurger

Contributor Author

commented Nov 6, 2018

Ahh I thought we would have to pay the extra cost on string dtypes too, but it seems like those are handled before we get to a generic object dtype. This should be doable.

@TomAugspurger

Contributor Author

commented Nov 6, 2018

Well, I'm going back to -1 on supporting decimal here, unless we can find a better way than a basic isinstance.

diff --git a/doc/source/whatsnew/v0.24.0.txt b/doc/source/whatsnew/v0.24.0.txt
index f449ca532..c8c5db611 100644
--- a/doc/source/whatsnew/v0.24.0.txt
+++ b/doc/source/whatsnew/v0.24.0.txt
@@ -1227,6 +1227,7 @@ Missing
 - Bug in :func:`Series.hasnans` that could be incorrectly cached and return incorrect answers if null elements are introduced after an initial call (:issue:`19700`)
 - :func:`Series.isin` now treats all NaN-floats as equal also for `np.object`-dtype. This behavior is consistent with the behavior for float64 (:issue:`22119`)
 - :func:`unique` no longer mangles NaN-floats and the ``NaT``-object for `np.object`-dtype, i.e. ``NaT`` is no longer coerced to a NaN-value and is treated as a different entity. (:issue:`22295`)
+- :meth:`isna` now considers ``decimal.Decimal('NaN')`` a missing value (:issue:`23284`).
 
 
 MultiIndex
diff --git a/pandas/_libs/missing.pyx b/pandas/_libs/missing.pyx
index b87913592..4fa96f652 100644
--- a/pandas/_libs/missing.pyx
+++ b/pandas/_libs/missing.pyx
@@ -1,6 +1,7 @@
 # -*- coding: utf-8 -*-
 
 import cython
+import decimal
 from cython import Py_ssize_t
 
 import numpy as np
@@ -33,6 +34,8 @@ cdef inline bint _check_all_nulls(object val):
         res = get_datetime64_value(val) == NPY_NAT
     elif util.is_timedelta64_object(val):
         res = get_timedelta64_value(val) == NPY_NAT
+    elif isinstance(val, decimal.Decimal):
+        return val.is_nan()
     else:
         res = 0
     return res
@@ -71,6 +74,8 @@ cpdef bint checknull(object val):
         return get_timedelta64_value(val) == NPY_NAT
     elif util.is_array(val):
         return False
+    elif isinstance(val, decimal.Decimal):
+        return val.is_nan()
     else:
         return val is None or util.is_nan(val)
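In pure Python, the check this diff adds behaves roughly like the following (a sketch equivalent, not the actual Cython code):

```python
import decimal
import math

def checknull(val):
    # Python sketch of the extended checknull: Decimal('NaN') counts as null
    if val is None:
        return True
    if isinstance(val, decimal.Decimal):
        return val.is_nan()
    if isinstance(val, float):
        return math.isnan(val)
    return False
```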
 

some timings

kind           master   PR       ratio
scalar         821 ns   920 ns   1.12 (different result)
object array   1.0 ms   2.4 ms   2.4
decimal array  1.0 ms   3.6 ms   3.6 (different result)

the object array is an object-dtype series with 20,000 elements. The decimal array is an object-dtype series with 20,000 decimal elements. I don't really care about the last one being 3.6x slower, since we're getting the correct result. I'm more concerned about the others.

@jreback

Contributor

commented Nov 6, 2018

this moves calls to python land. Try

elif hasattr(val, 'is_nan'):
    return val.is_nan()
@TomAugspurger

Contributor Author

commented Nov 6, 2018

According to https://cython.readthedocs.io/en/latest/src/userguide/language_basics.html#built-in-functions both hasattr and isinstance are optimized (but require some interaction with python land).

kind           master   PR       ratio
scalar         821 ns   926 ns   1.12 (different result)
object array   1.0 ms   6.6 ms   6.6
decimal array  1.0 ms   2.0 ms   2.0 (different result)

so pretty similar. I don't really know why the object array would be 6x slower now though.

@jreback

Contributor

commented Nov 6, 2018

hmm i guess just the additional check is causing this.

but a more general question: should we even be checking for this in an object-dtype ndarray at all? e.g. we don't do this for a random foo object. It must be a decimal array (in which case you can just ask it .is_nan())?

@jorisvandenbossche

Member

commented Nov 6, 2018

Shall we leave possible changes/support for decimal in the internals for another issue or PR?
Because in some sense it is exactly the idea of an ExtensionArray that it can override the behaviour of pd.isna to do this correctly no?

@jreback

Contributor

commented Nov 6, 2018

Shall we leave possible changes/support for decimal in the internals for another issue or PR?
Because in some sense it is exactly the idea of an ExtensionArray that it can override the behaviour of pd.isna to do this correctly no?

totally fine. @TomAugspurger can you xfail these tests rather than change them though. and create an issue to update.

@jorisvandenbossche

Member

commented Nov 6, 2018

can you xfail these tests rather than change them though. and create an issue to update.

Isn't the current code fine? It's contained in test_decimal.py, I don't think it is a problem that there is some decimal specific code in that file?

@TomAugspurger

Contributor Author

commented Nov 6, 2018

#23530 for isna(decimal).

Fixed the merge conflict.

@TomAugspurger

Contributor Author

commented Nov 6, 2018

All green.

@jorisvandenbossche jorisvandenbossche merged commit 28a42da into pandas-dev:master Nov 7, 2018

3 checks passed
ci/circleci: py36_locale — Your tests passed on CircleCI!
continuous-integration/travis-ci/pr — The Travis CI build passed
pandas-dev.pandas — Build #20181106.67 has test failures
@jorisvandenbossche

Member

commented Nov 7, 2018

@TomAugspurger Thanks!

thoo added a commit to thoo/pandas that referenced this pull request Nov 10, 2018

Merge remote-tracking branch 'upstream/master' into io_csv_docstring_fixed

* upstream/master: (47 commits)
  CLN: remove values attribute from datetimelike EAs (pandas-dev#23603)
  DOC/CI: Add linting to rst files, and fix issues (pandas-dev#23381)
  PERF: Speeds up creation of Period, PeriodArray, with Offset freq (pandas-dev#23589)
  PERF: define is_all_dates to shortcut inadvertent copy when slicing an IntervalIndex (pandas-dev#23591)
  TST: Tests and Helpers for Datetime/Period Arrays (pandas-dev#23502)
  Update description of Index._values/values/ndarray_values (pandas-dev#23507)
  Fixes to make validate_docstrings.py not generate warnings or unwanted output (pandas-dev#23552)
  DOC: Added note about groupby excluding Decimal columns by default (pandas-dev#18953)
  ENH: Support writing timestamps with timezones with to_sql (pandas-dev#22654)
  CI: Auto-cancel redundant builds (pandas-dev#23523)
  Preserve EA dtype in DataFrame.stack (pandas-dev#23285)
  TST: Fix dtype mismatch on 32bit in IntervalTree get_indexer test (pandas-dev#23468)
  BUG: raise if invalid freq is passed (pandas-dev#23546)
  remove uses of (ts)?lib.(NaT|iNaT|Timestamp) (pandas-dev#23562)
  BUG: Fix error message for invalid HTML flavor (pandas-dev#23550)
  ENH: Support EAs in Series.unstack (pandas-dev#23284)
  DOC: Updating DataFrame.join docstring (pandas-dev#23471)
  TST: coverage for skipped tests in io/formats/test_to_html.py (pandas-dev#22888)
  BUG: Return KeyError for invalid string key (pandas-dev#23540)
  BUG: DatetimeIndex slicing with boolean Index raises TypeError (pandas-dev#22852)
  ...

