BUG: Preserve categorical dtypes in MultiIndex levels (#13743) #13854
codecov-io
commented
Jul 31, 2016
Current coverage is 85.27% (diff: 100%)
@@             master   #13854   diff @@
==========================================
  Files           139      139
  Lines         50555    50561     +6
  Methods           0        0
  Messages          0        0
  Branches          0        0
==========================================
+ Hits          43111    43116     +5
- Misses         7444     7445     +1
  Partials          0        0
|
jreback
added Bug Groupby MultiIndex Categorical
labels
Aug 1, 2016
jreback
commented on an outdated diff
Aug 1, 2016
| @@ -855,3 +855,4 @@ Bug Fixes | ||
| - Bug in ``.to_excel()`` when DataFrame contains a MultiIndex which contains a label with a NaN value (:issue:`13511`) | ||
| - Bug in ``pd.read_csv`` in Python 2.x with non-UTF8 encoded, multi-character separated data (:issue:`3404`) | ||
| +- Bug in ``MultiIndex.from_array`` and ``.from_product`` doesn't preserve categorical dtypes in ``MultiIndex`` levels and, consequently, in results of ``groupby`` and ``set_index`` (:issue:`13743`) |
|
|
jreback
and 1 other
commented on an outdated diff
Aug 1, 2016
| @@ -867,9 +865,7 @@ def from_arrays(cls, arrays, sortorder=None, names=None): | ||
| if len(arrays[i]) != len(arrays[i - 1]): | ||
| raise ValueError('all arrays must be same length') | ||
| - cats = [Categorical.from_array(arr, ordered=True) for arr in arrays] | ||
| - levels = [c.categories for c in cats] | ||
| - labels = [c.codes for c in cats] | ||
| + labels, levels = zip(*[_factorize(arr) for arr in arrays]) |
jreback
Contributor
|
jreback
commented on an outdated diff
Aug 1, 2016
| from pandas.tools.util import cartesian_product | ||
| - categoricals = [Categorical.from_array(it, ordered=True) | ||
| - for it in iterables] | ||
| - labels = cartesian_product([c.codes for c in categoricals]) | ||
| + labels, levels = zip(*[_factorize(it) for it in iterables]) |
jreback
Contributor
|
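For reference, a minimal sketch of the behavior this hunk aims for, using only the public `MultiIndex.from_product` API. The sample inputs (`cidx` and the integer list) are assumptions chosen for illustration, and the sketch assumes a pandas version that includes this fix.

```python
import pandas as pd

# Illustrative inputs (assumed, not taken from the PR's test suite).
cidx = pd.CategoricalIndex(["y", "x"], categories=list("xyz"), ordered=True)
midx = pd.MultiIndex.from_product([cidx, [1, 2]])

# The first level should keep the categorical dtype, including the
# unused category "z" and the ordered flag.
print(midx.levels[0])
print(midx.get_level_values(0))
```

The codes of each input are combined as a Cartesian product, while the level itself carries the original categories unchanged.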
jreback
commented on an outdated diff
Aug 1, 2016
| + and `MultiIndex.from_product`.* | ||
| + | ||
| + Parameters | ||
| + ---------- | ||
| + values : list-like | ||
| + | ||
| + Returns | ||
| + ------- | ||
| + codes : np.array | ||
| + categories : Index | ||
| + If `values` has a categorical dtype, then `categories` is | ||
| + a CategoricalIndex keeping the categories and order of `values`. | ||
| + """ | ||
| + from pandas.core.categorical import Categorical | ||
| + from pandas.indexes.category import CategoricalIndex | ||
| + from pandas.types.common import is_categorical |
|
|
jreback
commented on an outdated diff
Aug 1, 2016
sinhrks
and 1 other
commented on an outdated diff
Aug 2, 2016
| @@ -632,6 +632,24 @@ def test_from_arrays_index_series_period(self): | ||
| tm.assert_index_equal(result, result2) | ||
| + def test_from_array_index_series_categorical(self): |
sinhrks
Member
|
|
Other places where a cidx = pd.CategoricalIndex(['y', 'x'], categories=list("xyz"), ordered=True)
cidx_nonunique = pd.CategoricalIndex(['y', 'x', 'y'], categories=list("xyz"), ordered=True)
|
|
@pijucha 5 & 6 you can ignore |
|
@pijucha this should exist as a separate function from the can certainly merge this fix and then do a followup with a more reaching name change. lmk. |
jreback
and 1 other
commented on an outdated diff
Aug 2, 2016
| + Returns | ||
| + ------- | ||
| + codes : np.array | ||
| + categories : Index | ||
| + If `values` has a categorical dtype, then `categories` is | ||
| + a CategoricalIndex keeping the categories and order of `values`. | ||
| + """ | ||
| + from pandas.core.categorical import Categorical | ||
| + from pandas.indexes.category import CategoricalIndex | ||
| + from pandas.types.common import is_categorical | ||
| + from pandas.types.generic import ABCCategoricalIndex, ABCSeries | ||
| + | ||
| + if is_categorical(values): | ||
| + if isinstance(values, (ABCCategoricalIndex, ABCSeries)): | ||
| + values = values._values | ||
| + categories = CategoricalIndex(values.categories, |
pijucha
Contributor
|
|
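The docstring in the hunk above describes a helper that returns codes plus categories and keeps a `CategoricalIndex` for categorical input. A rough sketch of that contract, written against public pandas API only, might look like the following. The name `factorize_keep_categories` is hypothetical and this is not the PR's actual `_factorize` implementation.

```python
import pandas as pd


def factorize_keep_categories(values):
    """Hypothetical stand-in for the PR's private ``_factorize`` helper.

    Returns (codes, categories). For categorical input, ``categories`` is a
    CategoricalIndex that keeps the categories and order of ``values``.
    """
    if isinstance(getattr(values, "dtype", None), pd.CategoricalDtype):
        # Works for Categorical, CategoricalIndex and categorical Series.
        cat = pd.Categorical(values)
        categories = pd.CategoricalIndex(cat.categories, dtype=cat.dtype)
        return cat.codes, categories
    # Non-categorical input: let Categorical infer (sorted) categories.
    cat = pd.Categorical(values)
    return cat.codes, cat.categories


codes, categories = factorize_keep_categories(
    pd.Categorical(list("abaab"), categories=list("bac"))
)
print(list(codes), list(categories))
```

Returning the pair in a fixed order is what lets the callers unpack many inputs at once with `zip(*[...])`, as the `from_arrays` and `from_product` hunks do.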
@jreback OK, Thanks. |
This was referenced Aug 16, 2016
|
Sorry for a bit of delay. I fixed |
jreback
commented on an outdated diff
Aug 17, 2016
jreback
and 1 other
commented on an outdated diff
Aug 17, 2016
| + def test_set_index_preserve_dtype(self): | ||
| + # GH13743, GH13854 | ||
| + df = DataFrame({'A': [1, 2, 1, 1, 2], | ||
| + 'B': [10, 16, 22, 28, 34], | ||
| + 'C1': pd.Categorical(list("abaab"), | ||
| + categories=list("bac"), | ||
| + ordered=False), | ||
| + 'C2': pd.Categorical(list("abaab"), | ||
| + categories=list("bac"), | ||
| + ordered=True)}) | ||
| + for cols in ['C1', 'C2', ['A', 'C1'], ['A', 'C2'], ['C1', 'C2']]: | ||
| + result = df.set_index(cols).reset_index() | ||
| + result = result.reindex(columns=df.columns) | ||
| + tm.assert_frame_equal(result, df) | ||
| + | ||
| + def test_stack_preserve_dtype(self): |
pijucha
Contributor
|
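A trimmed-down, standalone variant of the round trip that `test_set_index_preserve_dtype` checks, shown here as a sketch. It uses one categorical column instead of two and plain assertions in place of `tm.assert_frame_equal`, and it assumes a pandas version that includes this fix.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 1, 1, 2],
    "B": [10, 16, 22, 28, 34],
    "C1": pd.Categorical(list("abaab"), categories=list("bac"), ordered=False),
})

# Move two columns into a MultiIndex and back again.
result = df.set_index(["A", "C1"]).reset_index()
result = result.reindex(columns=df.columns)

# The categorical dtype (categories and their order) should survive.
print(result["C1"].dtype == df["C1"].dtype)
```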
jreback
commented on an outdated diff
Aug 17, 2016
| + # GH13854 | ||
| + for ordered in [False, True]: | ||
| + for labels in [list("yxz"), list("yxy")]: | ||
| + cidx = pd.CategoricalIndex(labels, categories=list("xyz"), | ||
| + ordered=ordered) | ||
| + df = DataFrame([[10, 11, 12]], columns=cidx) | ||
| + result = df.stack() | ||
| + | ||
| + # `MultiIndex.from_product` preserves categorical dtype - | ||
| + # it's tested elsewhere. | ||
| + midx = pd.MultiIndex.from_product([df.index, cidx]) | ||
| + expected = Series([10, 11, 12], index=midx) | ||
| + | ||
| + tm.assert_series_equal(result, expected) | ||
| + | ||
| + def test_get_dummies_preserve_dtype(self): |
|
|
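The stack test above can be reduced to a single case for a quick check. This is only a sketch: it fixes `ordered=True` and one set of labels instead of looping, and it assumes a pandas version that includes this fix.

```python
import pandas as pd

cidx = pd.CategoricalIndex(list("yxz"), categories=list("xyz"), ordered=True)
df = pd.DataFrame([[10, 11, 12]], columns=cidx)

# Stacking moves the categorical column labels into the inner index level.
result = df.stack()

# The inner level should keep the dtype of the original column index.
print(result.index.get_level_values(1).dtype == cidx.dtype)
```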
jreback
commented on an outdated diff
Aug 17, 2016
| @@ -996,3 +996,5 @@ Bug Fixes | ||
| - Bug in ``Index`` raises ``KeyError`` displaying incorrect column when column is not in the df and columns contains duplicate values (:issue:`13822`) | ||
| - Bug in ``Period`` and ``PeriodIndex`` creating wrong dates when frequency has combined offset aliases (:issue:`13874`) | ||
| - Bug in ``.to_string()`` when called with an integer ``line_width`` and ``index=False`` raises an UnboundLocalError exception because ``idx`` referenced before assignment. | ||
| +- Bug in ``MultiIndex`` constructor doesn't preserve categorical dtypes in ``MultiIndex`` levels and, consequently, in results of ``groupby`` and ``set_index`` (:issue:`13743`) |
jreback
Contributor
|
|
small changes. looks pretty good. |
jreback
referenced
this pull request
Aug 21, 2016
Closed
BUG: Categoricals shouldn't allow non-strings when object dtype is passed (#13919) #14047
jorisvandenbossche
added this to the
0.19.0
milestone
Aug 21, 2016
|
I moved some tests to files I thought were more appropriate (though, I'm not 100% sure). |
|
update |
jreback
and 1 other
commented on an outdated diff
Aug 25, 2016
| + In [4]: midx.levels[0] | ||
| + Out[4]: Index(['b', 'a', 'c'], dtype='object') | ||
| + | ||
| + In [5]: midx.get_level_values(0) | ||
| + Out[5]: Index(['a', 'b'], dtype='object') | ||
| + | ||
| +New Behavior: | ||
| + | ||
| +.. ipython:: python | ||
| + | ||
| + midx.levels[0] | ||
| + midx.get_level_values(0) | ||
| + | ||
| +An analogous change has been made to ``MultiIndex.from_product``. | ||
| +As a consequence, ``groupby`` and ``set_index`` also preserve categorical dtypes in indexes (:issue:`13743`) | ||
| + |
|
|
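The inputs for this whatsnew example are not visible in the hunk, so the following is a sketch with assumed inputs (`cat` and `lvl1`) chosen to be consistent with the outputs shown above. It assumes a pandas version that includes this fix.

```python
import pandas as pd

# Assumed inputs: a categorical with an unused category, plus a plain list.
cat = pd.Categorical(["a", "b"], categories=list("bac"))
lvl1 = ["foo", "bar"]
midx = pd.MultiIndex.from_arrays([cat, lvl1])

# New behavior: both results are CategoricalIndex objects that keep the
# original categories instead of plain object-dtype Index objects.
print(midx.levels[0])
print(midx.get_level_values(0))
```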
jreback
commented on the diff
Aug 25, 2016
| @@ -864,9 +862,9 @@ def from_arrays(cls, arrays, sortorder=None, names=None): | ||
| if len(arrays[i]) != len(arrays[i - 1]): | ||
| raise ValueError('all arrays must be same length') | ||
| - cats = [Categorical.from_array(arr, ordered=True) for arr in arrays] |
pijucha
Contributor
|
jreback
and 1 other
commented on an outdated diff
Aug 26, 2016
| @@ -1956,3 +1957,46 @@ def _convert_to_list_like(list_like): | ||
| else: | ||
| # is this reached? | ||
| return [list_like] | ||
| + | ||
| + | ||
| +def _factorize(values): |
jreback
Contributor
|
|
Deprecated |
|
update |
jsexauer
referenced
this pull request
Aug 31, 2016
Open
DEPR: deprecations from prior versions #6581
|
lgtm. @sinhrks @jorisvandenbossche |
jorisvandenbossche
modified the milestone: 0.19.0, 0.19.0rc
Sep 1, 2016
|
@pijucha can you rebase |
|
@jreback Done (rebase + small update to tests/test_reshape.py). One build on travis has probably stalled - should I resubmit? |
|
@pijucha i restarted the build. ping when green. |
jreback
closed this
in d26363b
Sep 2, 2016
|
thanks @pijucha really nice PR. touches lots of parts! |
pijucha commented Jul 31, 2016 (edited)
git diff upstream/master | flake8 --diff
This commit modifies MultiIndex.from_arrays and MultiIndex.from_product. Example:
Previously, the results were:
This modification makes groupby, pivot, and set_index preserve categorical types in indexes.
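As an illustration of the `groupby` consequence, here is a small sketch with made-up data (the frame below is an assumption, not the example from the PR description). It passes `observed=True` explicitly so the result does not depend on that parameter's default, which assumes a pandas version that supports the argument.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 1, 1, 2],
    "B": [10, 16, 22, 28, 34],
    "C1": pd.Categorical(list("abaab"), categories=list("bac")),
})

# Grouping by two keys yields a MultiIndex; the "C1" level should stay
# categorical rather than falling back to object dtype.
result = df.groupby(["A", "C1"], observed=True)["B"].sum()
print(result.index.get_level_values("C1").dtype)
```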