Handle ExtensionArrays in Series.unstack / DataFrame.stack #23077

Closed
TomAugspurger opened this issue Oct 10, 2018 · 5 comments

@TomAugspurger
Contributor

commented Oct 10, 2018

Here's a test for Series.unstack:

diff --git a/pandas/tests/extension/base/reshaping.py b/pandas/tests/extension/base/reshaping.py
index 0340289e0..a040fba63 100644
--- a/pandas/tests/extension/base/reshaping.py
+++ b/pandas/tests/extension/base/reshaping.py
@@ -171,3 +171,16 @@ class BaseReshapingTests(BaseExtensionTests):
                  [data[0], data[0], data[1], data[2], na_value],
                  dtype=data.dtype)})
         self.assert_frame_equal(res, exp[['ext', 'int1', 'key', 'int2']])
+
+    def test_unstack(self, data):
+        data = data[:4]
+        ser = pd.Series(
+            data,
+            index=pd.MultiIndex.from_product([["A", "B"],
+                                              ["a", "b"]]),
+        )
+        result = ser.unstack()
+        expected = pd.DataFrame({"a": data.take([0, 2]),
+                                 "b": data.take([1, 3])},
+                                index=['A', 'B'])
+        self.assert_frame_equal(result, expected)

We don't do well right now: only Categorical passes.

========================================================================= FAILURES =========================================================================
/Users/taugspurger/sandbox/pandas/pandas/core/reshape/reshape.py:112: IndexError: tuple index out of range
/Users/taugspurger/sandbox/pandas/pandas/core/reshape/reshape.py:112: IndexError: tuple index out of range
/Users/taugspurger/sandbox/pandas/pandas/core/reshape/reshape.py:112: IndexError: tuple index out of range
/Users/taugspurger/sandbox/pandas/pandas/core/reshape/reshape.py:112: IndexError: tuple index out of range
/Users/taugspurger/sandbox/pandas/pandas/core/reshape/reshape.py:112: IndexError: tuple index out of range
/Users/taugspurger/sandbox/pandas/pandas/core/reshape/reshape.py:112: IndexError: tuple index out of range
/Users/taugspurger/sandbox/pandas/pandas/core/reshape/reshape.py:112: IndexError: tuple index out of range
/Users/taugspurger/sandbox/pandas/pandas/core/reshape/reshape.py:112: IndexError: tuple index out of range
/Users/taugspurger/sandbox/pandas/pandas/core/reshape/reshape.py:204: AttributeError: 'IntervalArray' object has no attribute 'reshape'
/Users/taugspurger/sandbox/pandas/pandas/tests/extension/decimal/array.py:55: TypeError: All values must be of type <class 'decimal.Decimal'>
/Users/taugspurger/sandbox/pandas/pandas/tests/extension/json/array.py:88: TypeError: list indices must be integers or slices, not NoneType

There's no test for DataFrame.stack yet. In https://github.com/pandas-dev/pandas/pull/22862/files there's a WIP for stack that's built on ExtensionArray._concat_same_type instead of .reshape.

@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Oct 10, 2018

@TomAugspurger TomAugspurger added this to Orthogonal Blockers in DatetimeArray Refactor Oct 18, 2018

@TomAugspurger

Contributor Author

commented Oct 18, 2018

I'm working on this now.

@TomAugspurger

Contributor Author

commented Oct 21, 2018

I have a WIP for Series[extension_array].unstack at master...TomAugspurger:ea-unstack

We can write unstack as a composition of

  • loc (take)
  • reindex
  • concat

It seems to work, but it's quite slow. If you're reshaping to (n_rows, n_columns), you end up with n_columns calls to loc and n_columns calls to reindex.
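As a rough illustration of that composition (not the code on the WIP branch), here is a minimal sketch that unstacks the innermost index level with one selection and one reindex per output column, followed by a single concat. The helper name `unstack_by_composition` is hypothetical, and the sketch assumes each (row, column) label pair appears at most once. Unlike `Series.unstack`, it keeps labels in order of appearance rather than sorting them.

```python
import pandas as pd


def unstack_by_composition(ser):
    """Hypothetical helper: unstack the innermost level column by column."""
    inner = ser.index.get_level_values(-1)
    row_index = ser.index.droplevel(-1).unique()

    columns = []
    for label in inner.unique():
        # loc / take: keep only the rows whose innermost label matches
        col = ser[inner == label]
        col.index = col.index.droplevel(-1)
        # reindex: align to the full set of outer labels (missing -> NA)
        col = col.reindex(row_index)
        columns.append(col.rename(label))

    # concat: place the per-label columns side by side
    return pd.concat(columns, axis=1)


ser = pd.Series(
    pd.Categorical(["w", "x", "y", "z"]),
    index=pd.MultiIndex.from_product([["A", "B"], ["a", "b"]]),
)
result = unstack_by_composition(ser)
```

The loop body runs once per output column, which is where the n_columns selections and n_columns reindexes come from.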

We should be able to improve the (maybe common?) case of a "uniform" index (like what you get from MultiIndex.from_product) without too much additional code. That case needs only a set of evenly strided takes, with no reindexes.
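A minimal sketch of that fast path, under the assumption that the index really is a full product laid out in row-major order (the sketch does not verify this). The helper name `unstack_uniform` is hypothetical. Each output column is then a single strided `take` on the underlying array, so no reindexing is needed.

```python
import numpy as np
import pandas as pd


def unstack_uniform(ser):
    """Hypothetical fast path; assumes a from_product-style index."""
    inner = ser.index.get_level_values(-1)
    n_cols = inner.nunique()
    values = ser.array  # the ExtensionArray, never cast to an ndarray

    # Position i * n_cols + j holds (row i, column j), so column j is
    # every n_cols-th element starting at offset j.
    data = {
        label: values.take(np.arange(j, len(values), n_cols))
        for j, label in enumerate(inner[:n_cols])
    }
    row_index = ser.index.droplevel(-1)[::n_cols]
    return pd.DataFrame(data, index=row_index)


ser = pd.Series(
    pd.Categorical(["w", "x", "y", "z"]),
    index=pd.MultiIndex.from_product([["A", "B"], ["a", "b"]]),
)
result = unstack_uniform(ser)
```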

@jorisvandenbossche

Member

commented Oct 22, 2018

@TomAugspurger is it slower compared to the current implementation for non-consolidatable blocks? Or how does it currently work for them?

@TomAugspurger

Contributor Author

commented Oct 22, 2018

DataFrame.stack() has regressed since 0.23.4:

In [3]: df = pd.DataFrame({"A": pd.Categorical(['a', 'b']), "B": pd.Categorical(['a', 'b'])})

In [4]: df.stack()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-c3a581ae119f> in <module>
----> 1 df.stack()

~/sandbox/pandas/pandas/core/frame.py in stack(self, level, dropna)
   5753             return stack_multiple(self, level, dropna=dropna)
   5754         else:
-> 5755             return stack(self, level, dropna=dropna)
   5756
   5757     def unstack(self, level=-1, fill_value=None):

~/sandbox/pandas/pandas/core/reshape/reshape.py in stack(frame, level, dropna)
    471             arr = dtype.construct_array_type()
    472             new_values = arr._concat_same_type([
--> 473                 col for _, col in frame.iteritems()
    474             ])
    475         else:

~/sandbox/pandas/pandas/core/arrays/categorical.py in _concat_same_type(self, to_concat)
   2286         from pandas.core.dtypes.concat import _concat_categorical
   2287
-> 2288         return _concat_categorical(to_concat)
   2289
   2290     def _formatting_values(self):

~/sandbox/pandas/pandas/core/dtypes/concat.py in _concat_categorical(to_concat, axis)
    239         # when all categories are identical
    240         first = to_concat[0]
--> 241         if all(first.is_dtype_equal(other) for other in to_concat[1:]):
    242             return union_categoricals(categoricals)
    243

~/sandbox/pandas/pandas/core/dtypes/concat.py in <genexpr>(.0)
    239         # when all categories are identical
    240         first = to_concat[0]
--> 241         if all(first.is_dtype_equal(other) for other in to_concat[1:]):
    242             return union_categoricals(categoricals)
    243

~/sandbox/pandas/pandas/core/generic.py in __getattr__(self, name)
   4636             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   4637                 return self[name]
-> 4638             return object.__getattribute__(self, name)
   4639
   4640     def __setattr__(self, name, value):

AttributeError: 'Series' object has no attribute 'is_dtype_equal'

Previously, that returned an object-dtype Series. Ideally, it would return a categorical one.
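The traceback shows Series objects (the columns yielded by `frame.iteritems()`) reaching `_concat_same_type`, which expects arrays. As a minimal sketch of a stack that stays categorical, and not necessarily the fix that was merged, one can pass each column's underlying array, concatenate, and then reorder with a single `take`. The helper name `stack_same_dtype` is hypothetical, and the sketch assumes every column shares one extension dtype and that column names are unique.

```python
import numpy as np
import pandas as pd


def stack_same_dtype(frame):
    """Hypothetical helper: stack columns that share one extension dtype."""
    n_rows, n_cols = frame.shape
    arr_type = frame.dtypes.iloc[0].construct_array_type()

    # Pass the underlying arrays, not the Series themselves.
    arrays = [frame[name].array for name in frame.columns]
    flat = arr_type._concat_same_type(arrays)

    # `flat` is column-major (all of column 0, then column 1, ...);
    # stack wants row-major order, so interleave with one take.
    order = np.arange(n_rows * n_cols).reshape(n_cols, n_rows).T.ravel()
    index = pd.MultiIndex.from_product([frame.index, frame.columns])
    return pd.Series(flat.take(order), index=index)


df = pd.DataFrame({"A": pd.Categorical(["a", "b"]),
                   "B": pd.Categorical(["a", "b"])})
stacked = stack_same_dtype(df)
```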

@TomAugspurger

Contributor Author

commented Oct 22, 2018

@jorisvandenbossche I have a new implementation that lowers the overhead for EAs. Will post the timings in the PR (once I've written them).

It'll be basically the same as before, except:

  1. ExtensionArrays are no longer cast to ndarrays (good)
  2. An extra ExtensionArray.take per column + one extra concat (bad)