BUG: Preserve categorical dtypes in MultiIndex levels (#13743) #13854

Closed
Contributor

pijucha commented Jul 31, 2016 edited

  • closes #13743
  • tests added / passed
  • passes git diff upstream/master | flake8 --diff
  • whatsnew entry

This commit modifies MultiIndex.from_arrays and MultiIndex.from_product.

Example:

cat = pd.Categorical(['a', 'b'], categories=list("bac"), ordered=True)
mi = pd.MultiIndex.from_arrays([cat, cat])

mi.levels[0]
Out[55]: CategoricalIndex(['b', 'a', 'c'], categories=['b', 'a', 'c'], ordered=True, dtype='category')

mi.get_level_values(0)
Out[56]: CategoricalIndex(['a', 'b'], categories=['b', 'a', 'c'], ordered=True, dtype='category')

Previously, the results were:

mi.levels[0]
Out[345]: Index(['b', 'a', 'c'], dtype='object')

mi.get_level_values(0)
Out[346]: Index(['a', 'b'], dtype='object')

This modification makes groupby, pivot, and set_index preserve categorical types in indexes.
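
For reference, a minimal sketch of the same behavior through from_product and set_index. It assumes a pandas version that includes this fix; the frame and its column names are illustrative only and do not come from the PR.

```python
import pandas as pd

cat = pd.Categorical(["a", "b"], categories=list("bac"), ordered=True)

# from_product keeps the categorical dtype of a level, like from_arrays.
mi = pd.MultiIndex.from_product([[0, 1], cat])
level = mi.levels[1]
values = mi.get_level_values(1)

# set_index builds its MultiIndex from the selected columns, so the dtype
# should survive a set_index / reset_index round trip.
df = pd.DataFrame({"A": [1, 2], "C": cat, "B": [10, 16]})
round_trip = df.set_index(["A", "C"]).reset_index()
```

Here `level` and `values` are expected to be CategoricalIndex objects, and `round_trip["C"]` should have the same dtype as `df["C"]`.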

codecov-io commented Jul 31, 2016 edited

Current coverage is 85.27% (diff: 100%)

Merging #13854 into master will decrease coverage by <.01%

@@             master     #13854   diff @@
==========================================
  Files           139        139          
  Lines         50555      50561     +6   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43111      43116     +5   
- Misses         7444       7445     +1   
  Partials          0          0          

Powered by Codecov. Last update ccec504...99e4a52

@jreback jreback commented on an outdated diff Aug 1, 2016

doc/source/whatsnew/v0.19.0.txt
@@ -855,3 +855,4 @@ Bug Fixes
- Bug in ``.to_excel()`` when DataFrame contains a MultiIndex which contains a label with a NaN value (:issue:`13511`)
- Bug in ``pd.read_csv`` in Python 2.x with non-UTF8 encoded, multi-character separated data (:issue:`3404`)
+- Bug in ``MultiIndex.from_array`` and ``.from_product`` doesn't preserve categorical dtypes in ``MultiIndex`` levels and, consequently, in results of ``groupby`` and ``set_index`` (:issue:`13743`)
@jreback

jreback Aug 1, 2016

Contributor

just say MultiIndex constructor

@jreback jreback and 1 other commented on an outdated diff Aug 1, 2016

pandas/indexes/multi.py
@@ -867,9 +865,7 @@ def from_arrays(cls, arrays, sortorder=None, names=None):
if len(arrays[i]) != len(arrays[i - 1]):
raise ValueError('all arrays must be same length')
- cats = [Categorical.from_array(arr, ordered=True) for arr in arrays]
- levels = [c.categories for c in cats]
- labels = [c.codes for c in cats]
+ labels, levels = zip(*[_factorize(arr) for arr in arrays])
@jreback

jreback Aug 1, 2016

Contributor

this same procedure is used elsewhere in the codebase (e.g. in pytables), IIRC. pls do a search. Maybe wrap this in a nicer function (e.g. arrays -> labels, levels), e.g. what you are doing with _factorize, but even a higher level.

@pijucha

pijucha Aug 2, 2016 edited

Contributor

A quick search gives more results: there are also stack, concat and panel_index, and probably some more. I'll try to gather them all in a separate comment.
Maybe this _factorize and/or a higher-level _factorize_from_iterables should go in another location, e.g. core/common.py?

Another idea:
Implementation-wise, at first I was thinking about introducing a boolean parameter to Categorical, say preserve_categorical_dtype or the like. When invoked

c = pd.Categorical(cat, preserve_categorical_dtype=True)

with a categorical array/index/series cat, the c.categories - inherited from cat - would be wrapped in a CategoricalIndex. (This way all changes would be contained in the Categorical constructor and no external function _factorize would be needed.) Would it be an acceptable solution?

@jreback

jreback Aug 2, 2016

Contributor

see my note below.

@jreback jreback commented on an outdated diff Aug 1, 2016

pandas/indexes/multi.py
from pandas.tools.util import cartesian_product
- categoricals = [Categorical.from_array(it, ordered=True)
- for it in iterables]
- labels = cartesian_product([c.codes for c in categoricals])
+ labels, levels = zip(*[_factorize(it) for it in iterables])
@jreback

jreback Aug 1, 2016

Contributor

e.g. doing the same thing here, nicer not to have to zip(*[_factorize(it) for it in iterables]), rather
_factorize_from_iterables (that calls _factorize internally)
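
A rough sketch of the two-level helper idea discussed here, written only against the public pandas API. The function names mirror the ones proposed in this thread, but the bodies are an assumption for illustration and are not the code from this PR; `pd.MultiIndex(levels=..., codes=...)` assumes a pandas version where the keyword is `codes`.

```python
import pandas as pd


def factorize_from_iterable(values):
    """Return (codes, categories) for one list-like (illustrative helper)."""
    is_cat = isinstance(getattr(values, "dtype", None), pd.CategoricalDtype)
    # pd.Categorical keeps the categories and order of categorical input
    # and infers the categories for anything else.
    cat = pd.Categorical(values)
    if is_cat:
        # Wrap the categories so the categorical dtype is carried along.
        categories = pd.CategoricalIndex(
            cat.categories, categories=cat.categories, ordered=cat.ordered
        )
    else:
        categories = cat.categories
    return cat.codes, categories


def factorize_from_iterables(iterables):
    """Apply factorize_from_iterable to each item: arrays -> labels, levels."""
    if len(iterables) == 0:
        return [], []
    codes, categories = zip(*(factorize_from_iterable(it) for it in iterables))
    return list(codes), list(categories)


cat = pd.Categorical(["a", "b"], categories=list("bac"), ordered=True)
codes, levels = factorize_from_iterables([cat, ["x", "y"]])
mi = pd.MultiIndex(levels=levels, codes=codes)
```

With a helper like this, callers such as from_arrays and from_product would not need to repeat the zip(*[...]) idiom themselves.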

@jreback jreback commented on an outdated diff Aug 1, 2016

pandas/indexes/multi.py
+ and `MultiIndex.from_product`.*
+
+ Parameters
+ ----------
+ values : list-like
+
+ Returns
+ -------
+ codes : np.array
+ categories : Index
+ If `values` has a categorical dtype, then `categories` is
+ a CategoricalIndex keeping the categories and order of `values`.
+ """
+ from pandas.core.categorical import Categorical
+ from pandas.indexes.category import CategoricalIndex
+ from pandas.types.common import is_categorical
@jreback

jreback Aug 1, 2016

Contributor

the .types can be top-level imports

@jreback jreback commented on an outdated diff Aug 1, 2016

pandas/tests/test_categorical.py
@@ -1607,6 +1607,62 @@ def test_map(self):
result = c.map(lambda x: 1)
tm.assert_numpy_array_equal(result, np.array([1] * 5, dtype=np.int64))
+ def test_groupby_preserve_dtype(self):
+ df = DataFrame({'A': [1, 2, 1, 1, 2],
@jreback

jreback Aug 1, 2016

Contributor

add the issue as a comment

@sinhrks sinhrks and 1 other commented on an outdated diff Aug 2, 2016

pandas/tests/indexes/test_multi.py
@@ -632,6 +632,24 @@ def test_from_arrays_index_series_period(self):
tm.assert_index_equal(result, result2)
+ def test_from_array_index_series_categorical(self):
@sinhrks

sinhrks Aug 2, 2016

Member

Can you add tests for set_levels? It looks like it works, though:

mi = pd.MultiIndex.from_arrays([[1, 2, 3], [4, 5, 6]])
mi2 = mi.set_levels(pd.CategoricalIndex([10, 11, 12]), 0)
mi2.get_level_values(0)
# CategoricalIndex([10, 11, 12], categories=[10, 11, 12], ordered=False, dtype='category')
@pijucha

pijucha Aug 2, 2016

Contributor

I can add a test. But set_levels is fine. The issue is actually with the Categorical constructor - this is where the information about a categorical dtype is lost.

@sinhrks

sinhrks Aug 2, 2016

Member

Yes, I just want to guarantee that the related APIs are not broken accidentally.
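
A possible shape for such a regression test, based on the snippet above. The test name is hypothetical, and `level` is passed by keyword on the assumption that newer pandas versions require it.

```python
import pandas as pd


def test_set_levels_categorical():
    # Replacing a level with a CategoricalIndex should keep the dtype
    # when the level values are read back.
    mi = pd.MultiIndex.from_arrays([[1, 2, 3], [4, 5, 6]])
    mi2 = mi.set_levels(pd.CategoricalIndex([10, 11, 12]), level=0)
    result = mi2.get_level_values(0)
    assert isinstance(result, pd.CategoricalIndex)
    assert list(result) == [10, 11, 12]
    return result
```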

Contributor

pijucha commented Aug 2, 2016

Other places where a categorical dtype is lost in similar circumstances.

cidx = pd.CategoricalIndex(['y', 'x'], categories=list("xyz"), ordered=True)
cidx_nonunique = pd.CategoricalIndex(['y', 'x', 'y'], categories=list("xyz"), ordered=True)
  1. concat

    df = pd.DataFrame([[10, 11, 12]])
    
    pd.concat([df, df], keys=cidx).index.levels[0]
    Out[32]: Index(['y', 'x'], dtype='object')
  2. stack with a non-unique index/multi-index:

    df = pd.DataFrame([[10, 11, 12]], columns=cidx_nonunique)
    
    df.stack().index.levels[1]
    Out[35]: Index(['x', 'y', 'z'], dtype='object')
  3. get_dummies

    pd.get_dummies(cidx).columns
    Out[36]: Index(['x', 'y', 'z'], dtype='object')
  4. make_axis_dummies with transform

    df = pd.DataFrame([[10, 11]], columns=cidx)
    ldf = pd.Panel({'A': df, 'B': df}).to_frame()
    
    pd.core.reshape.make_axis_dummies(ldf, transform=lambda x: x).columns
    Out[53]: Index(['x', 'y', 'z'], dtype='object')
  5. panel_index

    pi = pd.core.panel.panel_index([0, 1, 2], cidx)
    pi.levels[1]
    Out[57]: Index(['x', 'y'], dtype='object', name='panel')
  6. pytables: LegacyTable.read()
    No quick example yet.

Contributor

jreback commented Aug 2, 2016

@pijucha 5 & 6 you can ignore

Contributor

jreback commented Aug 2, 2016

@pijucha this should exist as a private function, separate from the Categorical constructor; pandas.core.categorical is probably not a bad location for it. Maybe _create_categoricals_from_arrays (or similar)? I.e. it's a 'categorical' function, but it returns labels/levels (and not exactly a Categorical).

We can certainly merge this fix and then do a follow-up with a more far-reaching name change. Let me know.

@jreback jreback and 1 other commented on an outdated diff Aug 2, 2016

pandas/indexes/multi.py
+ Returns
+ -------
+ codes : np.array
+ categories : Index
+ If `values` has a categorical dtype, then `categories` is
+ a CategoricalIndex keeping the categories and order of `values`.
+ """
+ from pandas.core.categorical import Categorical
+ from pandas.indexes.category import CategoricalIndex
+ from pandas.types.common import is_categorical
+ from pandas.types.generic import ABCCategoricalIndex, ABCSeries
+
+ if is_categorical(values):
+ if isinstance(values, (ABCCategoricalIndex, ABCSeries)):
+ values = values._values
+ categories = CategoricalIndex(values.categories,
@jreback

jreback Aug 2, 2016

Contributor

why is this a CI (CategoricalIndex) and not just a Categorical?

@pijucha

pijucha Aug 2, 2016

Contributor

For consistency. The else part returns cat.categories, which is an Index.

Contributor

pijucha commented Aug 2, 2016

@jreback OK, Thanks.

Contributor

pijucha commented Aug 17, 2016

Sorry for the slight delay. I fixed stack, get_dummies, make_axis_dummies (2-4 in the list above) and opened a separate issue for concat (and for two other issues I came across).
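
As a quick check of the get_dummies part, a minimal sketch that assumes a pandas version containing this fix and reuses the cidx from the earlier list:

```python
import pandas as pd

cidx = pd.CategoricalIndex(["y", "x"], categories=list("xyz"), ordered=True)

# One dummy column per category, including the unobserved "z".
dummies = pd.get_dummies(cidx)
columns = dummies.columns
```

After the fix, `columns` is expected to be a CategoricalIndex over 'x', 'y', 'z' rather than a plain object Index.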

@jreback jreback commented on an outdated diff Aug 17, 2016

pandas/tests/test_categorical.py
@@ -1607,6 +1607,113 @@ def test_map(self):
result = c.map(lambda x: 1)
tm.assert_numpy_array_equal(result, np.array([1] * 5, dtype=np.int64))
+ def test_groupby_preserve_dtype(self):
@jreback

jreback Aug 17, 2016 edited

Contributor

move to test_groupby (for the groupby tests)

@jreback jreback and 1 other commented on an outdated diff Aug 17, 2016

pandas/tests/test_categorical.py
+ def test_set_index_preserve_dtype(self):
+ # GH13743, GH13854
+ df = DataFrame({'A': [1, 2, 1, 1, 2],
+ 'B': [10, 16, 22, 28, 34],
+ 'C1': pd.Categorical(list("abaab"),
+ categories=list("bac"),
+ ordered=False),
+ 'C2': pd.Categorical(list("abaab"),
+ categories=list("bac"),
+ ordered=True)})
+ for cols in ['C1', 'C2', ['A', 'C1'], ['A', 'C2'], ['C1', 'C2']]:
+ result = df.set_index(cols).reset_index()
+ result = result.reindex(columns=df.columns)
+ tm.assert_frame_equal(result, df)
+
+ def test_stack_preserve_dtype(self):
@jreback

jreback Aug 17, 2016

Contributor

move to test_reshape

@pijucha

pijucha Aug 17, 2016

Contributor

@jreback
Just to clarify (test files are a bit confusing):

  • test_stack_preserve_dtype goes to tests/frame/test_reshape.py (other stack tests there)
  • test_get_dummies_preserve_dtype to tests/test_reshape.py (class TestGetDummies there)

Not sure about:

  • test_set_index_preserve_dtype: probably to tests/frame/test_alter_axes.py (most of test_set_index* there) but some tests are also in test_multilevel.py
  • test_make_axis_dummies_preserve_dtype: can also go to tests/test_reshape.py (if so then probably in a separate class) but all make_axis_dummies tests are in test_panel.py (I don't use a Panel here because of the issue with Panel.to_frame())

@jreback jreback commented on an outdated diff Aug 17, 2016

pandas/tests/test_categorical.py
+ # GH13854
+ for ordered in [False, True]:
+ for labels in [list("yxz"), list("yxy")]:
+ cidx = pd.CategoricalIndex(labels, categories=list("xyz"),
+ ordered=ordered)
+ df = DataFrame([[10, 11, 12]], columns=cidx)
+ result = df.stack()
+
+ # `MultiIndex.from_product` preserves categorical dtype -
+ # it's tested elsewhere.
+ midx = pd.MultiIndex.from_product([df.index, cidx])
+ expected = Series([10, 11, 12], index=midx)
+
+ tm.assert_series_equal(result, expected)
+
+ def test_get_dummies_preserve_dtype(self):
@jreback

jreback Aug 17, 2016

Contributor

same here

@jreback jreback commented on an outdated diff Aug 17, 2016

doc/source/whatsnew/v0.19.0.txt
@@ -996,3 +996,5 @@ Bug Fixes
- Bug in ``Index`` raises ``KeyError`` displaying incorrect column when column is not in the df and columns contains duplicate values (:issue:`13822`)
- Bug in ``Period`` and ``PeriodIndex`` creating wrong dates when frequency has combined offset aliases (:issue:`13874`)
- Bug in ``.to_string()`` when called with an integer ``line_width`` and ``index=False`` raises an UnboundLocalError exception because ``idx`` referenced before assignment.
+- Bug in ``MultiIndex`` constructor doesn't preserve categorical dtypes in ``MultiIndex`` levels and, consequently, in results of ``groupby`` and ``set_index`` (:issue:`13743`)
@jreback

jreback Aug 17, 2016

Contributor

move to a separate sub-section with a small example showing previous / new

Contributor

jreback commented Aug 17, 2016

small changes. looks pretty good.

jorisvandenbossche added this to the 0.19.0 milestone Aug 21, 2016

Contributor

pijucha commented Aug 22, 2016

I moved some tests to files I thought were more appropriate (though, I'm not 100% sure).

Contributor

pijucha commented Aug 23, 2016

update

@jreback jreback and 1 other commented on an outdated diff Aug 25, 2016

doc/source/whatsnew/v0.19.0.txt
+ In [4]: midx.levels[0]
+ Out[4]: Index(['b', 'a', 'c'], dtype='object')
+
+ In [5]: midx.get_level_values(0)
+ Out[5]: Index(['a', 'b'], dtype='object')
+
+New Behavior:
+
+.. ipython:: python
+
+ midx.levels[0]
+ midx.get_level_values(0)
+
+An analogous change has been made to ``MultiIndex.from_product``.
+As a consequence, ``groupby`` and ``set_index`` also preserve categorical dtypes in indexes (:issue:`13743`)
+
@jreback

jreback Aug 25, 2016

Contributor

can you add a mini-example here for this as well

@pijucha

pijucha Aug 25, 2016

Contributor

For groupby and set_index? Sure.

@jreback

jreback Aug 26, 2016

Contributor

yep
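
A mini-example along these lines for groupby might look like the sketch below. The frame is modeled on the test data in this PR; `observed=False` is passed explicitly on the assumption of a recent pandas where groupby accepts that keyword (it did not exist when this PR was written).

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 1, 1, 2],
    "B": [10, 16, 22, 28, 34],
    "C": pd.Categorical(list("abaab"), categories=list("bac"), ordered=True),
})

# Grouping by a categorical column: the result index keeps the dtype.
single = df.groupby("C", observed=False)["B"].sum()

# Grouping by several keys: the categorical level of the MultiIndex keeps it too.
multi = df.groupby(["A", "C"], observed=False)["B"].sum()
```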

@jreback jreback commented on the diff Aug 25, 2016

pandas/indexes/multi.py
@@ -864,9 +862,9 @@ def from_arrays(cls, arrays, sortorder=None, names=None):
if len(arrays[i]) != len(arrays[i - 1]):
raise ValueError('all arrays must be same length')
- cats = [Categorical.from_array(arr, ordered=True) for arr in arrays]
@jreback

jreback Aug 25, 2016

Contributor

do we use Categorical.from_array any longer? (in the codebase)

@pijucha

pijucha Aug 25, 2016

Contributor

There are just a few places: internals of concat and unstack, LegacyTable in pytables, and panel_index. I can replace them all with _factorize - it's pretty straightforward. It wouldn't automatically fix concat and unstack (that's why I opened separate issues for them) but wouldn't hurt either.

If we replace them, what should be done to the definition of Categorical.from_array? Remove completely or rather add a deprecation warning?

@jreback

jreback Aug 26, 2016

Contributor

ideally we should replace these and deprecate the constructor. but can do that later / another PR if desired.

@pijucha

pijucha Aug 26, 2016

Contributor

OK. I'll try to do it today if it goes smoothly. Otherwise, I'll do a follow up as soon as this PR is merged in.

Just a question:
Categorical.from_array should emit a FutureWarning with a comment like this: "Categorical.from_array is deprecated, use Categorical instead"?

@jreback

jreback Aug 26, 2016

Contributor

yes
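
For illustration, the usual pattern for deprecating an alternate constructor with a FutureWarning. `Cat` is a toy stand-in class invented for this sketch, not the pandas Categorical, and the message wording follows the one proposed above.

```python
import warnings


class Cat:
    """Toy stand-in for a class with a deprecated alternate constructor."""

    def __init__(self, values):
        self.values = list(values)

    @classmethod
    def from_array(cls, values):
        # stacklevel=2 points the warning at the caller, not at this method.
        warnings.warn(
            "Cat.from_array is deprecated, use Cat instead",
            FutureWarning,
            stacklevel=2,
        )
        return cls(values)


# Record the warning to confirm it is emitted while the call still works.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    c = Cat.from_array(["a", "b"])
```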

@jreback jreback and 1 other commented on an outdated diff Aug 26, 2016

pandas/core/categorical.py
@@ -1956,3 +1957,46 @@ def _convert_to_list_like(list_like):
else:
# is this reached?
return [list_like]
+
+
+def _factorize(values):
@jreback

jreback Aug 26, 2016 edited

Contributor

per my comment below, I think we should name this:

_factorize_from_iterable
_factorize_from_iterables

I think maybe more clear.

@pijucha

pijucha Aug 26, 2016

Contributor

ok

Contributor

jreback commented Aug 26, 2016

@pijucha test split looks good.

@sinhrks any comments?

Contributor

pijucha commented Aug 27, 2016

Deprecated Categorical.from_array.

Contributor

pijucha commented Aug 31, 2016

update

jsexauer referenced this pull request Aug 31, 2016

Open

DEPR: deprecations from prior versions #6581

Contributor

jreback commented Aug 31, 2016

jreback referenced this pull request Aug 31, 2016

Closed

RLS: 0.19.0 #13991

@jorisvandenbossche jorisvandenbossche modified the milestone: 0.19.0, 0.19.0rc Sep 1, 2016

Contributor

jreback commented Sep 2, 2016

@pijucha can you rebase

@pijucha pijucha BUG/DEPR: Categorical: keep dtype in MultiIndex (#13743), deprecate .from_array

Now, categorical dtype is preserved also in `groupby`, `set_index`, `stack`,
`get_dummies`, and `make_axis_dummies`.
99e4a52
Contributor

pijucha commented Sep 2, 2016

@jreback Done (rebase + small update to tests/test_reshape.py).

One build on travis has probably stalled - should I resubmit?

Contributor

jreback commented Sep 2, 2016

@pijucha i restarted the build. ping when green.

jreback closed this in d26363b Sep 2, 2016

Contributor

jreback commented Sep 2, 2016

thanks @pijucha really nice PR. touches lots of parts!

pijucha deleted the pijucha:catdtype branch Sep 4, 2016

@jorisvandenbossche jorisvandenbossche modified the milestone: 0.19.0rc, 0.19.0 Sep 7, 2016
