Handle ExtensionArrays in Series.unstack / DataFrame.stack #23077

TomAugspurger · 2018-10-10T20:56:47Z

Here's a test for Series.unstack

diff --git a/pandas/tests/extension/base/reshaping.py b/pandas/tests/extension/base/reshaping.py
index 0340289e0..a040fba63 100644
--- a/pandas/tests/extension/base/reshaping.py
+++ b/pandas/tests/extension/base/reshaping.py
@@ -171,3 +171,16 @@ class BaseReshapingTests(BaseExtensionTests):
                  [data[0], data[0], data[1], data[2], na_value],
                  dtype=data.dtype)})
         self.assert_frame_equal(res, exp[['ext', 'int1', 'key', 'int2']])
+
+    def test_unstack(self, data):
+        data = data[:4]
+        ser = pd.Series(
+            data,
+            index=pd.MultiIndex.from_product([["A", "B"],
+                                              ["a", "b"]]),
+        )
+        result = ser.unstack()
+        expected = pd.DataFrame({"a": data.take([0, 2]),
+                                 "b": data.take([1, 3])},
+                                index=['A', 'B'])
+        self.assert_frame_equal(result, expected)

We don't do well right now: only Categorical passes.

========================================================================= FAILURES =========================================================================
/Users/taugspurger/sandbox/pandas/pandas/core/reshape/reshape.py:112: IndexError: tuple index out of range
/Users/taugspurger/sandbox/pandas/pandas/core/reshape/reshape.py:112: IndexError: tuple index out of range
/Users/taugspurger/sandbox/pandas/pandas/core/reshape/reshape.py:112: IndexError: tuple index out of range
/Users/taugspurger/sandbox/pandas/pandas/core/reshape/reshape.py:112: IndexError: tuple index out of range
/Users/taugspurger/sandbox/pandas/pandas/core/reshape/reshape.py:112: IndexError: tuple index out of range
/Users/taugspurger/sandbox/pandas/pandas/core/reshape/reshape.py:112: IndexError: tuple index out of range
/Users/taugspurger/sandbox/pandas/pandas/core/reshape/reshape.py:112: IndexError: tuple index out of range
/Users/taugspurger/sandbox/pandas/pandas/core/reshape/reshape.py:112: IndexError: tuple index out of range
/Users/taugspurger/sandbox/pandas/pandas/core/reshape/reshape.py:204: AttributeError: 'IntervalArray' object has no attribute 'reshape'
/Users/taugspurger/sandbox/pandas/pandas/tests/extension/decimal/array.py:55: TypeError: All values must be of type <class 'decimal.Decimal'>
/Users/taugspurger/sandbox/pandas/pandas/tests/extension/json/array.py:88: TypeError: list indices must be integers or slices, not NoneType

There's no test for DataFrame.stack yet. In https://github.com/pandas-dev/pandas/pull/22862/files there's a WIP for stack that's based on ExtensionArray._concat_same_type instead of .reshape.


TomAugspurger · 2018-10-18T02:16:52Z

I'm working on this now.

TomAugspurger · 2018-10-21T12:23:17Z

I have a WIP for Series[extension_array].unstack at master...TomAugspurger:ea-unstack

We can write unstack as a composition of

loc (take)
reindex
concat

It seems to work... but it's quite slow. If you're reshaping to (n_rows, n_columns), you end up doing n_columns locs and n_columns reindexes.
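For illustration, the loc / reindex / concat composition could look roughly like the sketch below. This is not the code on the ea-unstack branch; `unstack_via_concat` is a hypothetical helper, it assumes a two-level MultiIndex, and it uses `Series.xs` as the "loc" step.

```python
import pandas as pd


def unstack_via_concat(ser):
    """Unstack the inner level of a 2-level MultiIndex Series, column by column.

    Hypothetical helper for illustration only; assumes exactly two index levels.
    """
    row_labels = ser.index.get_level_values(0).unique()
    col_labels = ser.index.get_level_values(1).unique()

    columns = []
    for label in col_labels:
        # "loc" step: select the rows whose inner level equals `label`
        col = ser.xs(label, level=1)
        # "reindex" step: align to the full set of row labels (missing -> NA)
        col = col.reindex(row_labels)
        columns.append(col.rename(label))

    # "concat" step: glue the per-label columns together side by side
    return pd.concat(columns, axis=1)


ser = pd.Series(
    pd.Categorical(["a", "b", "c", "d"]),
    index=pd.MultiIndex.from_product([["A", "B"], ["a", "b"]]),
)
result = unstack_via_concat(ser)
```

The loop makes the cost visible: one selection and one reindex per output column, which is where the slowdown comes from.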

We should be able to improve the (maybe common?) case of a "uniform" index (like what you get from from_product), without too much additional code. That's just a bunch of uniform takes with no reindexes.
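A minimal sketch of that "uniform" fast path, assuming the index is exactly the product of its two levels (as from_product gives you). `unstack_uniform` is a hypothetical name, not something in pandas; each output column is a single strided take on the underlying array.

```python
import numpy as np
import pandas as pd


def unstack_uniform(ser):
    """Unstack a Series whose 2-level index is a full product of its levels.

    Hypothetical helper for illustration; raises if the index is not uniform.
    """
    rows = ser.index.get_level_values(0).unique()
    cols = ser.index.get_level_values(1).unique()
    if not ser.index.equals(pd.MultiIndex.from_product([rows, cols])):
        raise ValueError("index is not a uniform product of its levels")

    arr = ser.array  # the ExtensionArray (or NumPy-backed array) behind the Series
    n_cols = len(cols)
    # Column j holds positions j, j + n_cols, j + 2 * n_cols, ...
    data = {
        label: arr.take(np.arange(j, len(arr), n_cols))
        for j, label in enumerate(cols)
    }
    return pd.DataFrame(data, index=rows)


ser = pd.Series(
    pd.Categorical(["a", "b", "c", "d"]),
    index=pd.MultiIndex.from_product([["A", "B"], ["a", "b"]]),
)
result = unstack_uniform(ser)
```

For the four-element example this reduces to `take([0, 2])` and `take([1, 3])`, the same positions the test in the first comment expects.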

jorisvandenbossche · 2018-10-22T11:20:42Z

@TomAugspurger is it slower compared to the current implementation for non-consolidatable blocks? Or how does it currently work for them?

TomAugspurger · 2018-10-22T19:52:51Z

DataFrame.stack() has a regression from 0.23.4:

In [3]: df = pd.DataFrame({"A": pd.Categorical(['a', 'b']), "B": pd.Categorical(['a', 'b'])})

In [4]: df.stack()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-c3a581ae119f> in <module>
----> 1 df.stack()

~/sandbox/pandas/pandas/core/frame.py in stack(self, level, dropna)
   5753             return stack_multiple(self, level, dropna=dropna)
   5754         else:
-> 5755             return stack(self, level, dropna=dropna)
   5756
   5757     def unstack(self, level=-1, fill_value=None):

~/sandbox/pandas/pandas/core/reshape/reshape.py in stack(frame, level, dropna)
    471             arr = dtype.construct_array_type()
    472             new_values = arr._concat_same_type([
--> 473                 col for _, col in frame.iteritems()
    474             ])
    475         else:

~/sandbox/pandas/pandas/core/arrays/categorical.py in _concat_same_type(self, to_concat)
   2286         from pandas.core.dtypes.concat import _concat_categorical
   2287
-> 2288         return _concat_categorical(to_concat)
   2289
   2290     def _formatting_values(self):

~/sandbox/pandas/pandas/core/dtypes/concat.py in _concat_categorical(to_concat, axis)
    239         # when all categories are identical
    240         first = to_concat[0]
--> 241         if all(first.is_dtype_equal(other) for other in to_concat[1:]):
    242             return union_categoricals(categoricals)
    243

~/sandbox/pandas/pandas/core/dtypes/concat.py in <genexpr>(.0)
    239         # when all categories are identical
    240         first = to_concat[0]
--> 241         if all(first.is_dtype_equal(other) for other in to_concat[1:]):
    242             return union_categoricals(categoricals)
    243

~/sandbox/pandas/pandas/core/generic.py in __getattr__(self, name)
   4636             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   4637                 return self[name]
-> 4638             return object.__getattribute__(self, name)
   4639
   4640     def __setattr__(self, name, value):

AttributeError: 'Series' object has no attribute 'is_dtype_equal'

Previously that returned an object-dtype Series. Ideally, it would return a categorical Series.
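The traceback suggests the Series objects themselves were handed to `_concat_same_type`, rather than their arrays. A rough sketch of the intended behaviour, assuming every column has the same categorical dtype, is below. It is not the eventual fix: it concatenates the underlying arrays and then reorders from column-major to row-major with one take. `_concat_same_type` is part of the ExtensionArray interface but is semi-private, so treat its use here as illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": pd.Categorical(["a", "b"]),
                   "B": pd.Categorical(["a", "b"])})

# Pass the arrays (Categorical objects), not the Series wrapping them
arrays = [df[name].array for name in df.columns]
arr_type = df["A"].dtype.construct_array_type()
flat = arr_type._concat_same_type(arrays)  # column-major: A0, A1, B0, B1

# Reorder to the row-major layout stack() produces: A0, B0, A1, B1
n_rows, n_cols = df.shape
order = np.arange(n_rows * n_cols).reshape(n_cols, n_rows).T.ravel()
values = flat.take(order)

stacked = pd.Series(
    values,
    index=pd.MultiIndex.from_product([df.index, df.columns]),
)
```

Here `stacked` keeps the category dtype instead of falling back to object.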

TomAugspurger · 2018-10-22T19:54:03Z

@jorisvandenbossche I have a new implementation that lowers the overhead for EAs. Will post the timings in the PR (once I've written them).

It'll be basically identical to the previous implementation, except:

ExtensionArrays are no longer cast to ndarrays (good)
An extra ExtensionArray.take per column + one extra concat (bad)

TomAugspurger added the ExtensionArray (Extending pandas with custom dtypes or arrays) label Oct 10, 2018

TomAugspurger added this to the 0.24.0 milestone Oct 10, 2018

This was referenced Oct 10, 2018

REF: Make PeriodArray an ExtensionArray #22862

Merged

Datetimelike Array Refactor #23185

Closed

TomAugspurger added this to Orthogonal Blockers in DatetimeArray Refactor Oct 18, 2018

This was referenced Oct 22, 2018

ENH: Support EAs in Series.unstack #23284

Merged

Preserve EA dtype in DataFrame.stack #23285

Merged

jorisvandenbossche closed this as completed in #23284 Nov 7, 2018

TomAugspurger moved this from Orthogonal Blockers to Done in DatetimeArray Refactor Dec 1, 2018
