Preserve EA dtype in DataFrame.stack #23285

TomAugspurger · 2018-10-22T20:38:21Z

There were two bugs in master (not present in 0.23.4), probably introduced by the SparseArray PR:

- We need to unbox the EA values from the Series before passing them to EA._concat_same_type.
- We need to follow up with a take to get the correct order.
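
The two fixes above can be sketched outside pandas internals roughly as follows. This is only an illustration, not the code in this PR: the example frame, the variable names, and the choice of the nullable Int64 dtype are all assumptions made for the sketch.

```python
import numpy as np
import pandas as pd

# Illustrative frame where every column has the same extension dtype.
df = pd.DataFrame(
    {
        "A": pd.array([1, 2], dtype="Int64"),
        "B": pd.array([3, 4], dtype="Int64"),
    }
)

# Step 1: unbox the ExtensionArray from each Series before concatenating.
arrays = [df[name].array for name in df.columns]
flat = type(arrays[0])._concat_same_type(arrays)  # column-major: A0, A1, B0, B1

# Step 2: follow up with a take so the values come out in stacked
# (row-major) order: A0, B0, A1, B1.
n_rows, n_cols = df.shape
order = np.arange(n_rows * n_cols).reshape(n_cols, n_rows).T.ravel()
stacked_values = flat.take(order)

print(stacked_values.dtype)
print(list(stacked_values))
```

Under these assumptions, `stacked_values` should hold the same values, in the same order and dtype, as the result of stacking `df`.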

TomAugspurger · 2018-10-22T20:39:13Z

Just a WIP for now. I feel like we're lacking on tests here. Will add one for stacking a single level from a dataframe with a MultiIndex in the columns.

pandas/core/reshape/reshape.py

pep8speaks · 2018-10-22T21:20:55Z

Hello @TomAugspurger! Thanks for updating the PR.

There are no PEP8 issues in the file pandas/core/internals/blocks.py !
There are no PEP8 issues in the file pandas/core/reshape/reshape.py !
There are no PEP8 issues in the file pandas/tests/extension/base/reshaping.py !
There are no PEP8 issues in the file pandas/tests/extension/json/test_json.py !
There are no PEP8 issues in the file pandas/tests/frame/test_reshape.py !
There are no PEP8 issues in the file pandas/tests/sparse/frame/test_frame.py !

Comment last updated on October 24, 2018 at 21:13 Hours UTC

TomAugspurger · 2018-10-23T16:31:32Z

"Fixed" the sparse values. We were failing to handle DataFrame[SparseArray].astype(object) correctly. On master, Series[sparse].astype(object) / Frame[sparse].astype(object) is sparse, but I think we want to change that (see #23125).

I'll probably do that today, and then pick up the reshaping PRs afterwards.

jorisvandenbossche

Is this still WIP?
It looks good to me.

TomAugspurger · 2018-10-23T22:05:38Z

I think we need tests where the columns are a MultiIndex.
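
A minimal sketch of the kind of case meant here, stacking a single level from a frame whose columns are a MultiIndex. The frame, the labels, and the Int64 dtype are illustrative assumptions, not the test that was added in this PR.

```python
import pandas as pd

# Two-level columns; every column holds the same extension dtype.
columns = pd.MultiIndex.from_tuples(
    [("A", "x"), ("A", "y"), ("B", "x"), ("B", "y")]
)
df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], columns=columns).astype("Int64")

# Stack only the outer column level; "x" and "y" stay as columns.
stacked = df.stack(level=0)

# The point of the check: the extension dtype should survive the reshape.
print(stacked.dtypes)
```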


codecov · 2018-10-23T22:59:49Z

Codecov Report

Merging #23285 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #23285      +/-   ##
==========================================
+ Coverage   92.24%   92.25%   +<.01%     
==========================================
  Files         161      161              
  Lines       51224    51237      +13     
==========================================
+ Hits        47254    47269      +15     
+ Misses       3970     3968       -2

| Flag | Coverage Δ | |
|---|---|---|
| #multiple | `90.63% <100%> (ø)` | ⬆️ |
| #single | `42.27% <5.88%> (-0.01%)` | ⬇️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandas/core/internals/blocks.py | `93.67% <ø> (ø)` | ⬆️ |
| pandas/core/reshape/reshape.py | `99.56% <100%> (+0.01%)` | ⬆️ |
| pandas/core/arrays/categorical.py | `95.34% <0%> (+0.25%)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8212001...f6aeafa.

jreback · 2018-10-24T12:35:48Z

pandas/core/internals/blocks.py

-            if dtype == np.object_:
+            # sparse is "special" and preserves sparsity.
+            # We're changing this in GH-23125
+            if dtype == np.object_ and is_sparse(values):


use is_object_dtype
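
For context, a small sketch of why the helper tends to be preferred over comparing against `np.object_` directly. The variable names are illustrative; `is_object_dtype` is imported here from the public `pandas.api.types` module.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_object_dtype

# A real numpy dtype compares equal to np.object_ ...
dtype_obj = np.dtype("O")
equality_on_dtype = dtype_obj == np.object_

# ... but a string alias does not, even though it names the same dtype.
alias = "object"
equality_on_alias = alias == np.object_

# The helper accepts dtypes, string aliases, and array-likes alike.
helper_on_dtype = is_object_dtype(dtype_obj)
helper_on_alias = is_object_dtype(alias)
helper_on_array = is_object_dtype(np.array(["a", 1], dtype=object))
helper_on_ea = is_object_dtype(pd.Int64Dtype())

print(equality_on_dtype, equality_on_alias)
print(helper_on_dtype, helper_on_alias, helper_on_array, helper_on_ea)
```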

jreback · 2018-10-26T00:40:24Z

pandas/core/internals/blocks.py

+            if is_sparse(self.values):
+                # Series[Sparse].astype(object) is sparse.
+                klass = ExtensionBlock
+            elif is_object_dtype(dtype):
                klass = ObjectBlock
            elif is_extension_array_dtype(dtype):


so maybe we should just move the is_extension_array_dtype check up to first, and add an is_extension_dtype(self.values) test as well (it should encompass your is_sparse check) and is more general

I'll make that change and run the test suite.

I was kinda worried about "false positives" here, but I suppose it's exactly what we want if an extension array claims it's object dtype.

As posted in the unstack PR, we need to special-case Sparse here, since it's the only (internal) extension type that has special .astype(object) behavior.
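
The ordering of checks discussed in this thread might look roughly like the sketch below. `pick_block_kind` is a hypothetical helper written for illustration only: it returns plain strings rather than the real block classes, and it is not the code in pandas/core/internals/blocks.py. The sparse branch mirrors the behavior described in this thread, where astype(object) on sparse data stayed sparse at the time.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_extension_array_dtype, is_object_dtype


def pick_block_kind(values, dtype):
    """Hypothetical helper: name the kind of block to use after astype."""
    # Special case: sparse values keep their sparse container even when
    # the requested dtype is object (behavior described in this thread).
    if isinstance(getattr(values, "dtype", None), pd.SparseDtype):
        return "extension"
    # The general extension check comes before the object check.
    if is_extension_array_dtype(dtype):
        return "extension"
    if is_object_dtype(dtype):
        return "object"
    return "other"


sparse_values = pd.arrays.SparseArray([0, 1])
dense_values = np.array([1, 2])

print(pick_block_kind(sparse_values, np.dtype(object)))
print(pick_block_kind(dense_values, np.dtype(object)))
print(pick_block_kind(dense_values, pd.CategoricalDtype()))
```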

pandas/core/reshape/reshape.py

jreback · 2018-11-01T01:21:32Z

can you rebase

TomAugspurger · 2018-11-05T16:13:22Z

doc/source/whatsnew/v0.24.0.txt

@@ -849,7 +849,7 @@ update the ``ExtensionDtype._metadata`` tuple to match the signature of your
 - Updated the ``.type`` attribute for ``PeriodDtype``, ``DatetimeTZDtype``, and ``IntervalDtype`` to be instances of the dtype (``Period``, ``Timestamp``, and ``Interval`` respectively) (:issue:`22938`)
 - :func:`ExtensionArray.isna` is allowed to return an ``ExtensionArray`` (:issue:`22325`).
 - Support for reduction operations such as ``sum``, ``mean`` via opt-in base class method override (:issue:`22762`)
- :meth:`Series.unstack` no longer converts extension arrays to object-dtype ndarrays. The output ``DataFrame`` will now have the same dtype as the input. This changes behavior for Categorical and Sparse data (:issue:`23077`).


This was accidentally added in the PeriodArray PR. Will be implemented for good in #23284

TomAugspurger · 2018-11-05T16:14:46Z

pandas/tests/extension/base/reshaping.py

+        expected = df.astype(object).stack()
+        # we need a second astype(object), in case the constructor inferred
+        # object -> specialized, as is done for period.
+        expected = expected.astype(object)


This is kinda strange. For DataFrame[ndarray[object]].stack() of all periods, we actually infer period-dtype. Do we want that, or should we explicitly pass dtype=object when creating the new series / frame to ensure that we don't infer the "correct" dtype?

In [1]: import pandas as pd

In [2]: a = pd.core.arrays.period_array(['2000', '2001'], freq='D')

In [3]: pd.DataFrame({"A": a, "B": a}).astype(object).dtypes
Out[3]:
A    object
B    object
dtype: object

In [4]: pd.DataFrame({"A": a, "B": a}).astype(object).stack().dtype
Out[4]: period[D]

(that's on master)
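
The inference in question can be reproduced with the public Series constructor alone. A minimal sketch, assuming a pandas version that has the period extension dtype (0.24 or later); the variable names are illustrative.

```python
import pandas as pd

periods = [pd.Period("2000", freq="D"), pd.Period("2001", freq="D")]

# Without an explicit dtype, the constructor infers the period dtype
# from a list of Period objects.
inferred = pd.Series(periods)

# Passing dtype=object explicitly keeps the values boxed as objects.
explicit = pd.Series(periods, dtype=object)

print(inferred.dtype)
print(explicit.dtype)
```

Passing `dtype=object` when building the result is the alternative raised in the question above; the test instead applies a second `astype(object)` after stacking.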

TomAugspurger · 2018-11-07T00:36:03Z

All green.

TomAugspurger · 2018-11-07T15:45:30Z

Merged master to fix the merge conflict with the unstack PR. Will ping on green.

jreback · 2018-11-08T12:45:30Z

lgtm.

…fixed

* upstream/master: (47 commits)
  CLN: remove values attribute from datetimelike EAs (pandas-dev#23603)
  DOC/CI: Add linting to rst files, and fix issues (pandas-dev#23381)
  PERF: Speeds up creation of Period, PeriodArray, with Offset freq (pandas-dev#23589)
  PERF: define is_all_dates to shortcut inadvertent copy when slicing an IntervalIndex (pandas-dev#23591)
  TST: Tests and Helpers for Datetime/Period Arrays (pandas-dev#23502)
  Update description of Index._values/values/ndarray_values (pandas-dev#23507)
  Fixes to make validate_docstrings.py not generate warnings or unwanted output (pandas-dev#23552)
  DOC: Added note about groupby excluding Decimal columns by default (pandas-dev#18953)
  ENH: Support writing timestamps with timezones with to_sql (pandas-dev#22654)
  CI: Auto-cancel redundant builds (pandas-dev#23523)
  Preserve EA dtype in DataFrame.stack (pandas-dev#23285)
  TST: Fix dtype mismatch on 32bit in IntervalTree get_indexer test (pandas-dev#23468)
  BUG: raise if invalid freq is passed (pandas-dev#23546)
  remove uses of (ts)?lib.(NaT|iNaT|Timestamp) (pandas-dev#23562)
  BUG: Fix error message for invalid HTML flavor (pandas-dev#23550)
  ENH: Support EAs in Series.unstack (pandas-dev#23284)
  DOC: Updating DataFrame.join docstring (pandas-dev#23471)
  TST: coverage for skipped tests in io/formats/test_to_html.py (pandas-dev#22888)
  BUG: Return KeyError for invalid string key (pandas-dev#23540)
  BUG: DatetimeIndex slicing with boolean Index raises TypeError (pandas-dev#22852)
  ...

Preserve EA dtype in DataFrame.stack

381b073

TomAugspurger changed the title ~~Preserve EA dtype in DataFrame.stack~~ [WIP]Preserve EA dtype in DataFrame.stack Oct 22, 2018

jreback reviewed Oct 22, 2018

jreback added the Reshaping and ExtensionArray labels Oct 22, 2018

sparse

428f230

jorisvandenbossche reviewed Oct 23, 2018

TomAugspurger added 2 commits October 24, 2018 06:34

multi test

0d39be0

Merge remote-tracking branch 'upstream/master' into ea-stack

fc37932

jreback requested changes Oct 24, 2018

TomAugspurger added 3 commits October 24, 2018 14:00

Merge remote-tracking branch 'upstream/master' into ea-stack

7bb5a5e

multiple columns

7e9224a

remove pdb

d6661cb

TomAugspurger changed the title ~~[WIP]Preserve EA dtype in DataFrame.stack~~ Preserve EA dtype in DataFrame.stack Oct 24, 2018

Merge remote-tracking branch 'upstream/master' into ea-stack

3d41f5b

jreback added this to the 0.24.0 milestone Oct 26, 2018

jreback requested changes Oct 26, 2018

TomAugspurger added 3 commits November 5, 2018 09:48

Merge remote-tracking branch 'upstream/master' into ea-stack

9f91df0

really object

144d117

remove loc

98f75c9

TomAugspurger commented Nov 5, 2018

TomAugspurger added 2 commits November 5, 2018 20:38

Merge remote-tracking branch 'upstream/master' into ea-stack

88f7f3e

Fixed merge conflict

2b858b8

Merge remote-tracking branch 'upstream/master' into ea-stack

88f08c7

TomAugspurger added 2 commits November 8, 2018 06:25

Merge remote-tracking branch 'upstream/master' into ea-stack

d305c86

lint

f6aeafa

jreback approved these changes Nov 8, 2018

jorisvandenbossche merged commit 8ae1afe into pandas-dev:master Nov 8, 2018

JustinZhengBC pushed a commit to JustinZhengBC/pandas that referenced this pull request Nov 14, 2018

Preserve EA dtype in DataFrame.stack (pandas-dev#23285)

af33308

tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018

Preserve EA dtype in DataFrame.stack (pandas-dev#23285)

ecd96e9

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

Preserve EA dtype in DataFrame.stack (pandas-dev#23285)

00e0525

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

Preserve EA dtype in DataFrame.stack (pandas-dev#23285)

1d4db94
