BUG: Preserve categorical dtypes in MultiIndex levels (#13743) #13854

Closed
wants to merge 1 commit into from

5 participants
@pijucha
Contributor

commented Jul 31, 2016

  • closes #13743
  • tests added / passed
  • passes git diff upstream/master | flake8 --diff
  • whatsnew entry

This commit modifies MultiIndex.from_arrays and MultiIndex.from_product so that categorical inputs keep their categorical dtype in the resulting levels.

Example:

cat = pd.Categorical(['a', 'b'], categories=list("bac"), ordered=True)
mi = pd.MultiIndex.from_arrays([cat, cat])

mi.levels[0]
Out[55]: CategoricalIndex(['b', 'a', 'c'], categories=['b', 'a', 'c'], ordered=True, dtype='category')

mi.get_level_values(0)
Out[56]: CategoricalIndex(['a', 'b'], categories=['b', 'a', 'c'], ordered=True, dtype='category')

Previously, the results were:

mi.levels[0]
Out[345]: Index(['b', 'a', 'c'], dtype='object')

mi.get_level_values(0)
Out[346]: Index(['a', 'b'], dtype='object')

This modification makes groupby, pivot, and set_index preserve categorical types in indexes.
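For anyone who wants to check the behaviour locally, here is a minimal sketch of the example above as a script. It assumes a pandas version that includes this fix; the variable names (`mi_arrays`, `mi_product`, and so on) are only illustrative.

```python
import pandas as pd

# Ordered categorical whose category order (b < a < c) differs from sort order,
# and which has an unused category 'c'.
cat = pd.Categorical(["a", "b"], categories=list("bac"), ordered=True)

mi_arrays = pd.MultiIndex.from_arrays([cat, cat])
mi_product = pd.MultiIndex.from_product([cat, cat])

# With the fix, the levels keep the categorical dtype (all categories, in
# category order), and the level values keep it too.
level = mi_arrays.levels[0]
values = mi_arrays.get_level_values(0)
product_level = mi_product.levels[0]

print(type(level).__name__, list(level))
print(type(values).__name__, list(values))
print(type(product_level).__name__, len(mi_product))
```

Before the change, the same calls would have reported plain object-dtype `Index` levels, as shown in the "Previously" output above.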

@codecov-io

commented Jul 31, 2016

Current coverage is 85.27% (diff: 100%)

Merging #13854 into master will decrease coverage by <.01%

@@             master     #13854   diff @@
==========================================
  Files           139        139          
  Lines         50555      50561     +6   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43111      43116     +5   
- Misses         7444       7445     +1   
  Partials          0          0          

Powered by Codecov. Last update ccec504...99e4a52

@jreback

doc/source/whatsnew/v0.19.0.txt Outdated
@@ -855,3 +855,4 @@ Bug Fixes

- Bug in ``.to_excel()`` when DataFrame contains a MultiIndex which contains a label with a NaN value (:issue:`13511`)
- Bug in ``pd.read_csv`` in Python 2.x with non-UTF8 encoded, multi-character separated data (:issue:`3404`)
- Bug in ``MultiIndex.from_array`` and ``.from_product`` doesn't preserve categorical dtypes in ``MultiIndex`` levels and, consequently, in results of ``groupby`` and ``set_index`` (:issue:`13743`)

@jreback

jreback Aug 1, 2016

Contributor

just say MultiIndex constructor

@jreback

pandas/indexes/multi.py Outdated
cats = [Categorical.from_array(arr, ordered=True) for arr in arrays]
levels = [c.categories for c in cats]
labels = [c.codes for c in cats]
labels, levels = zip(*[_factorize(arr) for arr in arrays])

@jreback

jreback Aug 1, 2016

Contributor

this same procedure is used elsewhere in the codebase (e.g. in pytables), IIRC. pls do a search. Maybe wrap this in a nicer function (e.g. arrays -> labels, levels), e.g. what you are doing with _factorize, but even a higher level.

@pijucha

pijucha Aug 2, 2016

Author Contributor

A quick search gives more results - there're also stack, concat and panel_index and probably some more. I'll try to gather them all in a separate comment.
Maybe this _factorize or/and a higher level _factorize_from_iterables should rather go to another location - core/common.py?

Another idea:
Implementation-wise, at first I was thinking about introducing a boolean parameter to Categorical, say preserve_categorical_dtype or the like. When invoked

c = pd.Categorical(cat, preserve_categorical_dtype=True)

with a categorical array/index/series cat, the c.categories - inherited from cat - would be wrapped in a CategoricalIndex. (This way all changes would be contained in the Categorical constructor and no external function _factorize would be needed.) Would it be an acceptable solution?

@jreback

jreback Aug 2, 2016

Contributor

see my note below.

@jreback

pandas/indexes/multi.py Outdated
categoricals = [Categorical.from_array(it, ordered=True)
for it in iterables]
labels = cartesian_product([c.codes for c in categoricals])
labels, levels = zip(*[_factorize(it) for it in iterables])

@jreback

jreback Aug 1, 2016

Contributor

Same thing here: it would be nicer not to have to write zip(*[_factorize(it) for it in iterables]), and instead call
_factorize_from_iterables (which calls _factorize internally).
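To make the "arrays -> labels, levels" shape concrete, here is a pure-Python sketch of the two-level helper idea. It is only a stand-in: the function names are hypothetical, it sorts unique values to build the levels, and it does not do the categorical-dtype handling that the real helper in this PR is about.

```python
from typing import Hashable, Iterable, List, Sequence, Tuple


def factorize_from_iterable(
    values: Iterable[Hashable],
) -> Tuple[List[int], List[Hashable]]:
    """Return (labels, levels): integer codes plus the sorted unique values."""
    values = list(values)
    levels = sorted(set(values))
    position = {value: code for code, value in enumerate(levels)}
    labels = [position[value] for value in values]
    return labels, levels


def factorize_from_iterables(
    iterables: Sequence[Iterable[Hashable]],
) -> Tuple[List[List[int]], List[List[Hashable]]]:
    """Factorize each input and regroup the results as (all labels, all levels)."""
    if not iterables:
        # zip(*[]) yields nothing to unpack, so handle the empty case explicitly.
        return [], []
    labels, levels = zip(*(factorize_from_iterable(it) for it in iterables))
    return list(labels), list(levels)


labels, levels = factorize_from_iterables([["b", "a", "b"], [2, 1, 2]])
print(labels)
print(levels)
```

The point of the wrapper is that callers such as from_arrays and from_product get both lists in one call instead of repeating the zip(*...) idiom at every call site.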

@jreback

pandas/indexes/multi.py Outdated
"""
from pandas.core.categorical import Categorical
from pandas.indexes.category import CategoricalIndex
from pandas.types.common import is_categorical

@jreback

jreback Aug 1, 2016

Contributor

the .types can be top-level imports

@jreback

pandas/tests/test_categorical.py Outdated
@@ -1607,6 +1607,62 @@ def test_map(self):
result = c.map(lambda x: 1)
tm.assert_numpy_array_equal(result, np.array([1] * 5, dtype=np.int64))

def test_groupby_preserve_dtype(self):
df = DataFrame({'A': [1, 2, 1, 1, 2],

@jreback

jreback Aug 1, 2016

Contributor

add the issue as a comment

@sinhrks

pandas/tests/indexes/test_multi.py Outdated
@@ -632,6 +632,24 @@ def test_from_arrays_index_series_period(self):

tm.assert_index_equal(result, result2)

def test_from_array_index_series_categorical(self):

@sinhrks

sinhrks Aug 2, 2016

Member

Can you add tests for set_levels? It looks like it already works, though:

mi = pd.MultiIndex.from_arrays([[1, 2, 3], [4, 5, 6]])
mi2 = mi.set_levels(pd.CategoricalIndex([10, 11, 12]), 0)
mi2.get_level_values(0)
# CategoricalIndex([10, 11, 12], categories=[10, 11, 12], ordered=False, dtype='category')

@pijucha

pijucha Aug 2, 2016

Author Contributor

I can add a test. But set_levels is fine. The issue is actually with the Categorical constructor - this is where the information about a categorical dtype is lost.

@sinhrks

sinhrks Aug 2, 2016

Member

Yes, I just want to guarantee that the related APIs don't get broken accidentally.
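A regression check along the lines requested could look roughly like the sketch below. The function name is hypothetical and this is not the test that was added to the suite; it only restates the snippet above as assertions, passing `level` by keyword.

```python
import pandas as pd


def check_set_levels_keeps_categorical():
    """Guard against set_levels dropping a categorical level's dtype."""
    mi = pd.MultiIndex.from_arrays([[1, 2, 3], [4, 5, 6]])
    mi2 = mi.set_levels(pd.CategoricalIndex([10, 11, 12]), level=0)

    result = mi2.get_level_values(0)
    assert isinstance(result, pd.CategoricalIndex)
    assert list(result) == [10, 11, 12]
    return result


result = check_set_levels_keeps_categorical()
print(result)
```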

@pijucha

Contributor Author

commented Aug 2, 2016

Other places where a categorical dtype is lost in similar circumstances.

cidx = pd.CategoricalIndex(['y', 'x'], categories=list("xyz"), ordered=True)
cidx_nonunique = pd.CategoricalIndex(['y', 'x', 'y'], categories=list("xyz"), ordered=True)
  1. concat

    df = pd.DataFrame([[10, 11, 12]])
    
    pd.concat([df, df], keys=cidx).index.levels[0]
    Out[32]: Index(['y', 'x'], dtype='object')
  2. stack with a non-unique index/multi-index:

    df = pd.DataFrame([[10, 11, 12]], columns=cidx_nonunique)
    
    df.stack().index.levels[1]
    Out[35]: Index(['x', 'y', 'z'], dtype='object')
  3. get_dummies

    pd.get_dummies(cidx).columns
    Out[36]: Index(['x', 'y', 'z'], dtype='object')
  4. make_axis_dummies with transform

    df = pd.DataFrame([[10, 11]], columns=cidx)
    ldf = pd.Panel({'A': df, 'B': df}).to_frame()
    
    pd.core.reshape.make_axis_dummies(ldf, transform=lambda x: x).columns
    Out[53]: Index(['x', 'y', 'z'], dtype='object')
  5. panel_index

    pi = pd.core.panel.panel_index([0, 1, 2], cidx)
    pi.levels[1]
    Out[57]: Index(['x', 'y'], dtype='object', name='panel')
  6. pytables: LegacyTable.read()
    No quick example yet.

@jreback

Contributor

commented Aug 2, 2016

@pijucha 5 & 6 you can ignore

@jreback

Contributor

commented Aug 2, 2016

@pijucha this should exist as a private function, separate from the Categorical constructor; pandas.core.categorical is probably not a bad location for it. Maybe _create_categoricals_from_arrays (or similar)? E.g. it's a 'categorical' function, but it returns labels/levels (and not exactly a cat).

can certainly merge this fix and then do a followup with a more reaching name change. lmk.

@jreback

pandas/indexes/multi.py Outdated
if is_categorical(values):
if isinstance(values, (ABCCategoricalIndex, ABCSeries)):
values = values._values
categories = CategoricalIndex(values.categories,

@jreback

jreback Aug 2, 2016

Contributor

why is this a CI and not just a Categorical?

@pijucha

pijucha Aug 2, 2016

Author Contributor

For consistency. The else part returns cat.categories, which is an Index.

@pijucha

Contributor Author

commented Aug 2, 2016

@jreback OK, Thanks.

@pijucha

Contributor Author

commented Aug 17, 2016

Sorry for a bit of delay. I fixed stack, get_dummies, make_axis_dummies (2-4 in the list above) and opened a separate issue for concat (and for two other issues I came across).

@jreback

pandas/tests/test_categorical.py Outdated
@@ -1607,6 +1607,113 @@ def test_map(self):
result = c.map(lambda x: 1)
tm.assert_numpy_array_equal(result, np.array([1] * 5, dtype=np.int64))

def test_groupby_preserve_dtype(self):

@jreback

jreback Aug 17, 2016

Contributor

move to test_groupby (for the groupby tests)

@jreback

pandas/tests/test_categorical.py Outdated
result = result.reindex(columns=df.columns)
tm.assert_frame_equal(result, df)

def test_stack_preserve_dtype(self):

@jreback

jreback Aug 17, 2016

Contributor

move to test_reshape

@pijucha

pijucha Aug 17, 2016

Author Contributor

@jreback
Just to clarify (test files are a bit confusing):

  • test_stack_preserve_dtype goes to tests/frame/test_reshape.py (other stack tests there)
  • test_get_dummies_preserve_dtype to tests/test_reshape.py (class TestGetDummies there)

Not sure about:

  • test_set_index_preserve_dtype: probably to tests/frame/test_alter_axes.py (most of test_set_index* there) but some tests are also in test_multilevel.py
  • test_make_axis_dummies_preserve_dtype: can also go to tests/test_reshape.py (if so then probably in a separate class) but all make_axis_dummies tests are in test_panel.py (I don't use a Panel here because of the issue with Panel.to_frame())
@jreback

pandas/tests/test_categorical.py Outdated

tm.assert_series_equal(result, expected)

def test_get_dummies_preserve_dtype(self):

@jreback

jreback Aug 17, 2016

Contributor

same here

@jreback

doc/source/whatsnew/v0.19.0.txt Outdated
@@ -996,3 +996,5 @@ Bug Fixes
- Bug in ``Index`` raises ``KeyError`` displaying incorrect column when column is not in the df and columns contains duplicate values (:issue:`13822`)
- Bug in ``Period`` and ``PeriodIndex`` creating wrong dates when frequency has combined offset aliases (:issue:`13874`)
- Bug in ``.to_string()`` when called with an integer ``line_width`` and ``index=False`` raises an UnboundLocalError exception because ``idx`` referenced before assignment.
- Bug in ``MultiIndex`` constructor doesn't preserve categorical dtypes in ``MultiIndex`` levels and, consequently, in results of ``groupby`` and ``set_index`` (:issue:`13743`)

@jreback

jreback Aug 17, 2016

Contributor

move to a separate sub-section with a small example showing previous / new

@jreback

Contributor

commented Aug 17, 2016

small changes. looks pretty good.

@jorisvandenbossche jorisvandenbossche added this to the 0.19.0 milestone Aug 21, 2016

@pijucha pijucha force-pushed the pijucha:catdtype branch Aug 21, 2016

@pijucha

Contributor Author

commented Aug 22, 2016

I moved some tests to files I thought were more appropriate (though, I'm not 100% sure).

@@ -864,9 +862,9 @@ def from_arrays(cls, arrays, sortorder=None, names=None):
if len(arrays[i]) != len(arrays[i - 1]):
raise ValueError('all arrays must be same length')

cats = [Categorical.from_array(arr, ordered=True) for arr in arrays]

@jreback

jreback Aug 25, 2016

Contributor

do we use Categorical.from_array any longer? (in the codebase)

@pijucha

pijucha Aug 25, 2016

Author Contributor

There are just a few places: internals of concat and unstack, LegacyTable in pytables, and panel_index. I can replace them all with _factorize - it's pretty straightforward. It wouldn't automatically fix concat and unstack (that's why I opened separate issues for them) but wouldn't hurt either.

If we replace them, what should be done to the definition of Categorical.from_array? Remove completely or rather add a deprecation warning?

@jreback

jreback Aug 26, 2016

Contributor

ideally we should replace these and deprecate the constructor. but can do that later / another PR if desired.

@pijucha

pijucha Aug 26, 2016

Author Contributor

OK. I'll try to do it today if it goes smoothly. Otherwise, I'll do a follow up as soon as this PR is merged in.

Just a question:
Should Categorical.from_array emit a FutureWarning with a message like this: "Categorical.from_array is deprecated, use Categorical instead"?

@jreback

jreback Aug 26, 2016

Contributor

yes
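For reference, the deprecation pattern agreed on here can be sketched as follows. `DemoCategorical` is a hypothetical stand-in class, not pandas' `Categorical`; only the warning category and message text come from the discussion above.

```python
import warnings


class DemoCategorical:
    """Hypothetical stand-in used only to illustrate the deprecation pattern."""

    def __init__(self, values):
        self.values = list(values)

    @classmethod
    def from_array(cls, values):
        # Deprecated alternate constructor: warn, then delegate to __init__.
        # stacklevel=2 points the warning at the caller, not at this method.
        warnings.warn(
            "Categorical.from_array is deprecated, use Categorical instead",
            FutureWarning,
            stacklevel=2,
        )
        return cls(values)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    c = DemoCategorical.from_array(["a", "b"])

print(caught[0].category.__name__, caught[0].message)
```

FutureWarning is shown to end users by default, which is why it suits a deprecation in a public constructor better than DeprecationWarning.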

@jreback

pandas/core/categorical.py Outdated
@@ -1956,3 +1957,46 @@ def _convert_to_list_like(list_like):
else:
# is this reached?
return [list_like]


def _factorize(values):

@jreback

jreback Aug 26, 2016

Contributor

per my comment below, I think we should name this:

_factorize_from_iterable
_factorize_from_iterables

I think maybe more clear.

@pijucha

pijucha Aug 26, 2016

Author Contributor

ok

@jreback

Contributor

commented Aug 26, 2016

@pijucha test split looks good.

@sinhrks any comments?

@pijucha pijucha force-pushed the pijucha:catdtype branch Aug 27, 2016

@pijucha

Contributor Author

commented Aug 27, 2016

Deprecated Categorical.from_array.

@pijucha pijucha force-pushed the pijucha:catdtype branch Aug 31, 2016

@pijucha

Contributor Author

commented Aug 31, 2016

update

@jsexauer jsexauer referenced this pull request Aug 31, 2016

Open

DEPR: deprecations from prior versions #6581

0 of 98 tasks complete
@jreback

Contributor

commented Aug 31, 2016

@jreback jreback referenced this pull request Aug 31, 2016

Closed

RLS: 0.19.0 #13991

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.19.0, 0.19.0rc Sep 1, 2016

@jreback

Contributor

commented Sep 2, 2016

@pijucha can you rebase

BUG/DEPR: Categorical: keep dtype in MultiIndex (#13743), deprecate .from_array

Now, categorical dtype is preserved also in `groupby`, `set_index`, `stack`,
`get_dummies`, and `make_axis_dummies`.
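
As an illustration of the downstream effect described in the commit message, here is a small sketch for `set_index` only. It assumes a pandas version that includes this change, and the frame and column names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, 1],
        "B": pd.Categorical(["x", "y", "x"], categories=list("xyz"), ordered=True),
        "C": [10, 20, 30],
    }
)

# set_index builds a MultiIndex from the two columns, so the categorical
# column should show up as a categorical level, unused category included.
indexed = df.set_index(["A", "B"])
level_b = indexed.index.levels[1]

print(type(level_b).__name__, list(level_b))
```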

@pijucha pijucha force-pushed the pijucha:catdtype branch to 99e4a52 Sep 2, 2016

@pijucha

Contributor Author

commented Sep 2, 2016

@jreback Done (rebase + small update to tests/test_reshape.py).

One build on travis has probably stalled - should I resubmit?

@jreback

Contributor

commented Sep 2, 2016

@pijucha i restarted the build. ping when green.

@jreback jreback closed this in d26363b Sep 2, 2016

@jreback

Contributor

commented Sep 2, 2016

thanks @pijucha really nice PR. touches lots of parts!

@pijucha pijucha deleted the pijucha:catdtype branch Sep 4, 2016

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.19.0rc, 0.19.0 Sep 7, 2016

gfyoung added a commit to forking-repos/pandas that referenced this pull request Dec 5, 2017

@jreback jreback referenced this pull request Dec 5, 2017

Open

DEPR: deprecations log for removed issues #13777

116 of 116 tasks complete

jreback added a commit that referenced this pull request Dec 5, 2017

CLN: Remove Categorical.from_array (#18642)
Deprecated in 0.19.0

xref gh-13854.