Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserve EA dtype in DataFrame.stack #23285

Merged
merged 16 commits into from
Nov 8, 2018

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Oct 22, 2018

closes #23077

There were two bugs in master (not present in 0.23.4), probably from the SparseArray PR

  1. We need to unbox the EA values from Series before passing to EA._concat_same_type
  2. We need to followup with a take to get the correct order.

@TomAugspurger TomAugspurger changed the title Preserve EA dtype in DataFrame.stack [WIP]Preserve EA dtype in DataFrame.stack Oct 22, 2018
@TomAugspurger
Copy link
Contributor Author

Just a WIP for now. I feel like we're lacking on tests here. Will add one for stacking a single level from a dataframe with a MultiIndex in the columns.

@jreback jreback added Reshaping Concat, Merge/Join, Stack/Unstack, Explode ExtensionArray Extending pandas with custom dtypes or arrays. labels Oct 22, 2018
@pep8speaks
Copy link

pep8speaks commented Oct 22, 2018

Hello @TomAugspurger! Thanks for updating the PR.

Comment last updated on October 24, 2018 at 21:13 Hours UTC

@TomAugspurger
Copy link
Contributor Author

"Fixed" the sparse values. We were failing to handle DataFrame[SparseArray].astype(object) correctly. On master, Series[sparse].astype(object) / Frame[sparse].astype(object) is sparse, but I think we want to change that. #23125

I'll probably do that today, and then pick up the reshaping PRs afterwards.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still WIP ?
It looks good to me.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Oct 23, 2018 via email

@codecov
Copy link

codecov bot commented Oct 23, 2018

Codecov Report

Merging #23285 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #23285      +/-   ##
==========================================
+ Coverage   92.24%   92.25%   +<.01%     
==========================================
  Files         161      161              
  Lines       51224    51237      +13     
==========================================
+ Hits        47254    47269      +15     
+ Misses       3970     3968       -2
Flag Coverage Δ
#multiple 90.63% <100%> (ø) ⬆️
#single 42.27% <5.88%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/internals/blocks.py 93.67% <ø> (ø) ⬆️
pandas/core/reshape/reshape.py 99.56% <100%> (+0.01%) ⬆️
pandas/core/arrays/categorical.py 95.34% <0%> (+0.25%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8212001...f6aeafa. Read the comment docs.

if dtype == np.object_:
# sparse is "special" and preserves sparsity.
# We're changing this in GH-23125
if dtype == np.object_ and is_sparse(values):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use is_object_dtype

@TomAugspurger TomAugspurger changed the title [WIP]Preserve EA dtype in DataFrame.stack Preserve EA dtype in DataFrame.stack Oct 24, 2018
@jreback jreback added this to the 0.24.0 milestone Oct 26, 2018
if is_sparse(self.values):
# Series[Sparse].astype(object) is sparse.
klass = ExtensionBlock
elif is_object_dtype(dtype):
klass = ObjectBlock
elif is_extension_array_dtype(dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so maybe should just move the is_extension_array_dtype up to first, and add a is_extension_dtype(self.values) test as well (should encompas your is_sparse check) and is more general

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make that change and run the test suite.

I was kinda worried about "false positives" here, but I suppose it's exactly what we want if an extension array claims it's object dtype.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As posted in the unstack PR, we need to special case Space here, since it's the only (internal) extension type that has special .astype(object) behavior.

pandas/core/reshape/reshape.py Outdated Show resolved Hide resolved
pandas/core/reshape/reshape.py Show resolved Hide resolved
@jreback
Copy link
Contributor

jreback commented Nov 1, 2018

can you rebase

@@ -849,7 +849,7 @@ update the ``ExtensionDtype._metadata`` tuple to match the signature of your
- Updated the ``.type`` attribute for ``PeriodDtype``, ``DatetimeTZDtype``, and ``IntervalDtype`` to be instances of the dtype (``Period``, ``Timestamp``, and ``Interval`` respectively) (:issue:`22938`)
- :func:`ExtensionArray.isna` is allowed to return an ``ExtensionArray`` (:issue:`22325`).
- Support for reduction operations such as ``sum``, ``mean`` via opt-in base class method override (:issue:`22762`)
- :meth:`Series.unstack` no longer converts extension arrays to object-dtype ndarrays. The output ``DataFrame`` will now have the same dtype as the input. This changes behavior for Categorical and Sparse data (:issue:`23077`).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was accidentally added in the PeriodArray PR. Will be implemented for good in #23284

expected = df.astype(object).stack()
# we need a second astype(object), in case the constructor inferred
# object -> specialized, as is done for period.
expected = expected.astype(object)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kinda strange. For DataFrame[ndarray[object]].stack() of all periods, we actually infer period-dtype. Do we want that, or should we explicitly pass dtype=object when creating the new series / frame to ensure that we don't infer the "correct" dtype?

In [1]: import pandas as pd

In [2]: a = pd.core.arrays.period_array(['2000', '2001'], freq='D')

In [3]: pd.DataFrame({"A": a, "B": a}).astype(object).dtypes
Out[3]:
A    object
B    object
dtype: object

In [4]: pd.DataFrame({"A": a, "B": a}).astype(object).stack().dtype
Out[4]: period[D]

(that's on master)

@TomAugspurger
Copy link
Contributor Author

All green.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Nov 7, 2018

Merged master to fix the merge conflict with the unstack PR. Will ping on green.

@jreback
Copy link
Contributor

jreback commented Nov 8, 2018

lgtm.

@jorisvandenbossche jorisvandenbossche merged commit 8ae1afe into pandas-dev:master Nov 8, 2018
thoo added a commit to thoo/pandas that referenced this pull request Nov 10, 2018
…fixed

* upstream/master: (47 commits)
  CLN: remove values attribute from datetimelike EAs (pandas-dev#23603)
  DOC/CI: Add linting to rst files, and fix issues (pandas-dev#23381)
  PERF: Speeds up creation of Period, PeriodArray, with Offset freq (pandas-dev#23589)
  PERF: define is_all_dates to shortcut inadvertent copy when slicing an IntervalIndex (pandas-dev#23591)
  TST: Tests and Helpers for Datetime/Period Arrays (pandas-dev#23502)
  Update description of Index._values/values/ndarray_values (pandas-dev#23507)
  Fixes to make validate_docstrings.py not generate warnings or unwanted output (pandas-dev#23552)
  DOC: Added note about groupby excluding Decimal columns by default (pandas-dev#18953)
  ENH: Support writing timestamps with timezones with to_sql (pandas-dev#22654)
  CI: Auto-cancel redundant builds (pandas-dev#23523)
  Preserve EA dtype in DataFrame.stack (pandas-dev#23285)
  TST: Fix dtype mismatch on 32bit in IntervalTree get_indexer test (pandas-dev#23468)
  BUG: raise if invalid freq is passed (pandas-dev#23546)
  remove uses of (ts)?lib.(NaT|iNaT|Timestamp) (pandas-dev#23562)
  BUG: Fix error message for invalid HTML flavor (pandas-dev#23550)
  ENH: Support EAs in Series.unstack (pandas-dev#23284)
  DOC: Updating DataFrame.join docstring (pandas-dev#23471)
  TST: coverage for skipped tests in io/formats/test_to_html.py (pandas-dev#22888)
  BUG: Return KeyError for invalid string key (pandas-dev#23540)
  BUG: DatetimeIndex slicing with boolean Index raises TypeError (pandas-dev#22852)
  ...
JustinZhengBC pushed a commit to JustinZhengBC/pandas that referenced this pull request Nov 14, 2018
tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle ExtensionArrays in Series.unstack / DataFrame.stack
4 participants