BUG: DataFrame.sort_index broken if not both lexsorted and monotonic in levels #15694

Closed
wants to merge 11 commits

Contributor

jreback commented Mar 15, 2017 edited

closes #15622
closes #15687
closes #14015
closes #13431

nice bump on Series.sort_index for monotonic

    before     after       ratio
  [37e5f78b] [a6f352c0]
-    1.86ms   100.07μs      0.05  timeseries.TimeSeries.time_sort_index_monotonic

jreback added this to the 0.20.0 milestone Mar 15, 2017

Contributor

jreback commented Mar 15, 2017

so the basic problem was that we were not sorting if a MultiIndex was lexsorted. But a lexsorted index does NOT imply that the levels are monotonic (intra-level). Depending on the construction method they might or might not be.

So what this does is force a reconstruction (of the MI), which is not actually expensive to do, to ensure that it is ordered correctly when sorting (which we do in a myriad of places).
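The distinction can be sketched with a pure-Python model of a two-level index. This is an illustration only, not the pandas implementation: `levels` and `labels` mirror the MultiIndex constructor arguments, and the helper names (`is_lexsorted`, `levels_monotonic`, `to_tuples`) are hypothetical.

```python
# Pure-Python model (an assumption for illustration, not pandas internals):
# `levels` holds the distinct values of each level, `labels` holds integer
# positions into those levels, one list per level.
levels = [["a", "b"], ["bb", "aa"]]
labels = [[0, 0, 1, 1], [0, 1, 0, 1]]


def is_lexsorted(labels):
    """True if the rows, viewed as tuples of integer labels, never decrease."""
    rows = list(zip(*labels))
    return all(a <= b for a, b in zip(rows, rows[1:]))


def levels_monotonic(levels):
    """True if the values inside every level are themselves in sorted order."""
    return all(list(lev) == sorted(lev) for lev in levels)


def to_tuples(levels, labels):
    """Materialize the index values that the integer labels point at."""
    return [tuple(lev[i] for lev, i in zip(levels, row)) for row in zip(*labels)]


lexsorted = is_lexsorted(labels)        # the integer labels are ordered
monotonic = levels_monotonic(levels)    # but ['bb', 'aa'] is not sorted
values = to_tuples(levels, labels)
values_sorted = values == sorted(values)
```

Here the labels are lexsorted while the second level is not monotonic, so the materialized values are not in sorted order even though a check on the labels alone would report "sorted".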

xref #13431, for which I added a test (xfailing). This is a tiny bit more complicated and I think I may have to modify the internals a bit.

codecov bot commented Mar 16, 2017 edited

Codecov Report

Merging #15694 into master will increase coverage by 0.02%.
The diff coverage is 98.24%.

@@            Coverage Diff             @@
##           master   #15694      +/-   ##
==========================================
+ Coverage   90.97%   90.99%   +0.02%     
==========================================
  Files         145      145              
  Lines       49474    49519      +45     
==========================================
+ Hits        45007    45060      +53     
+ Misses       4467     4459       -8
Flag         Coverage Δ
#multiple    88.75% <98.24%> (+0.02%) ⬆️
#single      40.61% <17.54%> (-0.07%) ⬇️

Impacted Files             Coverage Δ
pandas/core/sorting.py     97.81% <100%> (+0.03%) ⬆️
pandas/core/frame.py       97.57% <100%> (ø) ⬆️
pandas/indexes/multi.py    96.7% <100%> (+0.1%) ⬆️
pandas/core/reshape.py     99.27% <100%> (-0.01%) ⬇️
pandas/core/groupby.py     95.54% <100%> (+0.51%) ⬆️
pandas/core/series.py      94.89% <85.71%> (-0.08%) ⬇️
pandas/indexes/base.py     96.09% <0%> (-0.06%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 0cfc08c...bd17d2b.

jreback changed the title from BUG: construct MultiIndex identically from levels/labels when concatting to BUG: DataFrame.sort_index broken if not both lexsorted and monotonic in levels Mar 16, 2017

Contributor

jreback commented Mar 16, 2017

@chris-b1 if you'd have a look.

@@ -1807,6 +1807,13 @@ def get_group_levels(self):
'ohlc': lambda *args: ['open', 'high', 'low', 'close']
}
+ def _is_builtin_func(self, arg):
@jreback

jreback Mar 16, 2017

Contributor

ignore this, was actually an unrelated bug as this wasn't defined on BaseGrouper

Contributor

chris-b1 commented Mar 16, 2017

@jreback - I only skimmed the implementation, seems reasonable at first glance.

I do think this needs a bigger note in the docs, and maybe should even warn if the reconstruction re-sorts the levels, as this is an API change? I'm in favor of the behavior in this PR, but there could be existing code that takes advantage of the custom ordering possible with a MI, e.g.

In [21]: df = pd.DataFrame({'value': [1, 2, 3, 4]}, index=pd.MultiIndex(
    ...:     levels=[['a', 'b'], ['bb', 'aa']],
    ...:     labels=[[0, 0, 1, 1], [0, 1, 0, 1]]))

In [22]: df
Out[22]: 
      value
a bb      1
  aa      2
b bb      3
  aa      4

In [23]: df.sort_index()
Out[23]: 
      value
a bb      1
  aa      2
b bb      3
  aa      4
Contributor

jreback commented Mar 16, 2017

@chris-b1 FYI couple of recent pushes as I had some bug fixes.

This only reconstructs to actually calculate the indexer. It should not be an API change, except that some sorting before just didn't work.
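A rough sketch of what "reconstructs only to calculate the indexer" could look like, using plain lists rather than pandas objects. The function name `sort_indexer` is hypothetical and the approach (ranking each level's values, then sorting rows by the ranked labels) is an assumption about the idea, not a copy of the PR's code.

```python
def sort_indexer(levels, labels):
    """Return row positions that sort the index by its actual values.

    The stored levels and labels are left untouched; the labels are only
    remapped temporarily to the rank each level value has in sorted order.
    """
    remapped = []
    for lev, lab in zip(levels, labels):
        # positions of the level values in sorted order
        order = sorted(range(len(lev)), key=lambda i: lev[i])
        # old position -> rank in the sorted level
        rank = {old: new for new, old in enumerate(order)}
        remapped.append([rank[i] for i in lab])
    keys = list(zip(*remapped))
    # sorted() is stable, so equal keys keep their original relative order
    return sorted(range(len(keys)), key=lambda row: keys[row])


levels = [["a", "b"], ["bb", "aa"]]
labels = [[0, 0, 1, 1], [0, 1, 0, 1]]
values = [1, 2, 3, 4]

indexer = sort_indexer(levels, labels)
sorted_values = [values[i] for i in indexer]
# the original levels are kept; only the labels are taken in the new order
sorted_labels = [[lab[i] for i in indexer] for lab in labels]
```

In this model the levels stay as `[['a', 'b'], ['bb', 'aa']]` and only the row order changes, which is why the result can be visually sorted without the stored levels being re-sorted.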

Contributor

jreback commented Mar 16, 2017

@chris-b1 your example may be from an older version

In [1]: df = pd.DataFrame({'value': [1, 2, 3, 4]}, index=pd.MultiIndex(
   ...:     levels=[['a', 'b'], ['bb', 'aa']],
   ...:     labels=[[0, 0, 1, 1], [0, 1, 0, 1]]))
   ...: 

In [2]: df
Out[2]: 
      value
a bb      1
  aa      2
b bb      3
  aa      4

In [3]: df.sort_index()
Out[3]: 
      value
a aa      2
  bb      1
b aa      4
  bb      3

In [4]: df.index
Out[4]: 
MultiIndex(levels=[['a', 'b'], ['bb', 'aa']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

In [5]: df.sort_index().index
Out[5]: 
MultiIndex(levels=[['a', 'b'], ['bb', 'aa']],
           labels=[[0, 0, 1, 1], [1, 0, 1, 0]])
Contributor

chris-b1 commented Mar 16, 2017

Maybe I'm misunderstanding, but won't [23] above now be this?

In [23]: df.sort_index()
Out[23]: 
      value
a aa      1
  bb      2
b aa      3
  bb      4
Contributor

jreback commented Mar 16, 2017

see [3] in my example (your index is right, but the values are not). It gets sorted.

Contributor

chris-b1 commented Mar 16, 2017

Sorry, I mistyped the values. Pulled it down. This is the change in behavior. Although [25] (below) looks like a bug, my point was that someone could have been relying on this if they had specified the levels.

master / 0.19.2

In [25]: df.sort_index()
Out[25]: 
      value
a bb      1
  aa      2
b bb      3
  aa      4

PR

In [3]: df.sort_index()
Out[3]: 
      value
a aa      2
  bb      1
b aa      4
  bb      3
Contributor

jreback commented Mar 16, 2017

@chris-b1 right, that's the bug: they thought it sorted, but it actually didn't. ok, will add this as a small sub-section to show it.

Contributor

jreback commented Mar 16, 2017 edited

so just because it was cool :>

I added support (internally) for removing unused level values, ala #2770 here: 50ac461

This is still not user exposed, though it would be pretty trivial to make a .remove_unused_levels() function (which could just call this).

Further, I think we could actually call this (it's pretty cheap as long as you don't actually have unused levels, with a tiny modification) from a higher level (e.g. in DataFrame / Groupby) and such.

This is for another issue though.
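The idea of dropping unused level values can be sketched on plain lists. This is a minimal model under stated assumptions: the helper name `drop_unused_level_values` is hypothetical, missing values (a label of -1) are not handled, and it is not the code from the referenced commit.

```python
def drop_unused_level_values(levels, labels):
    """Keep only the level values that some label actually points at.

    Labels are renumbered so that every row still refers to the same value.
    """
    new_levels, new_labels = [], []
    for lev, lab in zip(levels, labels):
        used = sorted(set(lab))                      # positions that occur
        remap = {old: new for new, old in enumerate(used)}
        new_levels.append([lev[i] for i in used])
        new_labels.append([remap[i] for i in lab])
    return new_levels, new_labels


# 'B' is present in the first level but no row refers to it,
# e.g. after selecting a subset of rows from a larger frame.
levels = [["A", "B", "C"], ["x", "y"]]
labels = [[0, 0, 2, 2], [0, 1, 0, 1]]

new_levels, new_labels = drop_unused_level_values(levels, labels)
```

When every level value is in use, `used` covers all positions and the remap is the identity, which fits the remark that the operation is cheap in the common case.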

cc @shoyer @wesm

Contributor

chris-b1 commented Mar 16, 2017

@chris-b1 right that's the bug,

Not to belabor the point, but what I was saying is that someone may have wanted that ordering; it was well-defined behavior, if surprising. It seems to have been removed in the current docs, but there used to be a line specifically explaining that lexsorting the index does not always mean lexsorting the level values. (To be clear, I am completely for changing this.)

There is an important new method sort_index to sort an axis within a MultiIndex so that its labels are grouped and sorted by the original ordering of the associated factor at that level. Note that this does not necessarily mean the labels will be sorted lexicographically!

http://pandas.pydata.org/pandas-docs/version/0.18.1/advanced.html#the-need-for-sortedness-with-multiindex

Contributor

jreback commented Mar 16, 2017

split off the unused to #15700

@chris-b1 added some docs here (just whatsnew): 1a9be09

Contributor

chris-b1 commented Mar 16, 2017

Thanks. I actually somewhat misunderstood current behavior - thought [73] below would also give the same ['bb', 'aa'] order - but I guess not. Knowing that, I do agree this is more of a bug fix than api change, but the doc example is good to show anyways!

0.19.2

df = pd.DataFrame({'value': [1, 2, 3, 4]}, index=pd.MultiIndex(
    levels=[['a', 'b'], ['bb', 'aa']],
    labels=[[0, 0, 1, 1], [1, 0, 1, 0]]))

df
Out[72]: 
      value
a aa      1
  bb      2
b aa      3
  bb      4

df.sort_index()
Out[73]: 
      value
a aa      1
  bb      2
b aa      3
  bb      4
pandas/indexes/multi.py
+ new_levels.append(lev)
+ new_labels.append(lab)
+
+ return MultiIndex(new_levels, new_labels,
@shoyer

shoyer Mar 16, 2017

Member

These variables are currently only defined if sort is True.

@jreback

jreback Mar 16, 2017

Contributor

I revised in the other PR, but will amend

Contributor

jreback commented Mar 22, 2017

any more comments on this?

@jorisvandenbossche

It will be nice to see this gotcha gone!

Added some comments:

  • Can you see whether the prose docs need an update as well?
  • If you want to have the behaviour of before (in certain cases), so to sort according to the order of the levels, what would be the best way to achieve this? It would maybe be good to add this to the whatsnew for people who actually wanted that behaviour

Still have to go through the tests

doc/source/whatsnew/v0.20.0.txt
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In certain cases, calling ``.sort_index()`` on a MultiIndexed DataFrame would return the *same* DataFrame without seeming to sort.
+This would happen with a ``lexsorted``, but non-montonic levels. (:issue:`15622`, :issue:`15687`, :issue:`14015`, :issue:`13431`)
@jorisvandenbossche

jorisvandenbossche Mar 22, 2017

Owner

"non-montonic" -> non-monotonic

+New Behavior:
+
+.. ipython:: python
+
@jorisvandenbossche

jorisvandenbossche Mar 22, 2017

Owner

some leftovers here

@@ -3321,8 +3325,7 @@ def sort_index(self, axis=0, level=None, ascending=True, inplace=False,
axis = self._get_axis_number(axis)
labels = self._get_axis(axis)
- # sort by the index
- if level is not None:
@jorisvandenbossche

jorisvandenbossche Mar 22, 2017

Owner

level can be 0 ?

@jreback

jreback Mar 22, 2017

Contributor

yes, in fact this was what prevented .sort_index() and .sort_index(level=0) from being the same.

pandas/core/sorting.py
@@ -93,6 +93,11 @@ def maybe_lift(lab, size): # pormote nan values
return loop(list(labels), list(shape))
+def get_compressed_ids(labels, sizes):
+ ids = get_group_index(labels, sizes, sort=True, xnull=False)
@jorisvandenbossche

jorisvandenbossche Mar 22, 2017

Owner

Can you add a docstring here?

pandas/indexes/multi.py
- names=names)
+ def _reconstruct(self, sort=False):
+ """
+ reconstruct the MultiIndex
@jorisvandenbossche

jorisvandenbossche Mar 22, 2017

Owner

Can you specify what reconstruct means ? (rearranging the labels and levels to have the levels sorted?)

Owner

jorisvandenbossche commented Mar 22, 2017 edited

If you want to have the behaviour of before (in certain cases), so to sort according to order of the levels,

My wording here is not really correct, as it is only the previous behaviour in a very specific case I think? (when they were already lexsorted?) But still, the question is probably valid I think.
Given that MultiIndex slicing needs lexsorted (or is sorted enough?) indexes, and a plain sort_index does not always give you that, it would be good to know how to obtain it.

Contributor

jreback commented Mar 22, 2017

so I pushed a small doc update (with corrections as indicated above).

I also incorporated my (currently internal) method of removing unused levels (this is also included in ._reconstruct).

I am thinking about a public .remove_unused_levels() method. What do you think? In fact we have a section of the docs where we do:

   To reconstruct the multiindex with only the used levels

   .. ipython:: python

      pd.MultiIndex.from_tuples(df[['foo','qux']].columns.values)

(Note that this is quite inefficient, but does work).

As to your other question: do we provide a way to get the existing (buggy) behavior? Sure, you just don't sort! This is completely a bug fix.

Contributor

jreback commented Mar 22, 2017

Given that MultiIndex slicing needs lexsorted (or is sorted enough?) indexes, and a plain sort_index does not always give you that, it would be good to know to obtain it.

not sure what you mean here. the point of this PR is to fix this problem. In some cases previously, .sort_index would just refuse to work.

Given that MultiIndex slicing needs lexsorted (or is sorted enough?) indexes, and a plain sort_index does not always give you that, it would be good to know to obtain it.

not sure what you mean here. the point of this PR is to fix this problem. In some cases previously, .sort_index would just refuse to work.

As I said, my initial wording was wrong, as I shouldn't have said it was to have the previous behaviour.
But imagine the following: you have a DataFrame with a MultiIndex, and you want to do multi-index slicing. You know that, to be able to do that, you have to sort your index:

In [14]: idx = pd.MultiIndex([['A', 'B', 'C'], ['c', 'b', 'a']], [[0,1,2,0,1,2], [0,2,1,1,0,2]])

In [15]: df = pd.DataFrame({'col': range(len(idx))}, index=idx)

In [16]: df
Out[16]: 
     col
A c    0
B a    1
C b    2
A b    3
B c    4
C a    5

In [17]: df = df.sort_index()

In [18]: df
Out[18]: 
     col
A b    3
  c    0
B a    1
  c    4
C a    5
  b    2

In [19]: df.index
Out[19]: 
MultiIndex(levels=[['A', 'B', 'C'], ['c', 'b', 'a']],
           labels=[[0, 0, 1, 1, 2, 2], [1, 0, 2, 0, 2, 1]])

In [20]: df.index.is_lexsorted()
Out[20]: False

In [21]: IDX = pd.IndexSlice

In [22]: df.loc[IDX['B':'C', 'a':'c'], :]
...
UnsortedIndexError: 'MultiIndex Slicing requires the index to be fully lexsorted tuple len (2), lexsort depth (1)'

So you get an "Uh, I just sorted my index but I still get an UnsortedIndexError".

In principle, the example above has nothing to do with this PR (the behaviour is exactly the same on master and with this PR).
But that is what I meant with "plain sort_index does not always give you that": sort_index does not guarantee that your index is lexsorted, and that you can use multi-index slicing. And so my question is: what would be the best way to end up with a lexsorted index?

The above is true for both this PR and master, and it is not necessarily something that this PR should solve (although we could also opt for sort_index sorting the levels as well, although this is a bigger change).
But you can also construct an example MultiIndex that is lexsorted (our terminology), but not lexicographically sorted. Previously sorting the frame would have kept the order and preserved its lexsortedness, with this PR it will no longer be lexsorted, possibly breaking multi-indexing.
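The "lexsort depth (1)" in the error from [22] can be reproduced with a small pure-Python model of the labels shown in [19]. The function `lexsort_depth` below is a hypothetical helper written for illustration; it is an assumption about what the depth means (the number of leading levels whose integer labels are jointly non-decreasing), not the pandas implementation.

```python
def lexsort_depth(labels):
    """Number of leading levels whose integer labels are jointly sorted."""
    for depth in range(len(labels), 0, -1):
        rows = list(zip(*labels[:depth]))
        if all(a <= b for a, b in zip(rows, rows[1:])):
            return depth
    return 0


# labels of df.sort_index().index from [19]; levels are
# [['A', 'B', 'C'], ['c', 'b', 'a']], so the second level is not monotonic
labels_after_sort = [[0, 0, 1, 1, 2, 2], [1, 0, 2, 0, 2, 1]]

depth = lexsort_depth(labels_after_sort)
fully_lexsorted = depth == len(labels_after_sort)
```

The rows are in sorted order by value, but because the second level is stored as `['c', 'b', 'a']` the integer labels within each group run backwards, so only the first level counts as sorted.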

Contributor

jreback commented Mar 23, 2017

@jorisvandenbossche ahh, but I actually can do this.

In [7]: df2 = df.sort_index()

In [8]: df2.index = df2.index._reconstruct(sort=True)

In [9]: df2
Out[9]: 
     col
A b    3
  c    0
B a    1
  c    4
C a    5
  b    2

In [10]: df2.index.is_lexsorted()
Out[10]: True

In [11]: df2.loc[IDX['B':'C', 'a':'c'], :]
Out[11]: 
     col
B a    1
  c    4
C a    5
  b    2

Contributor

jreback commented Mar 23, 2017 edited

In [14]: df.sort_index().index
Out[14]: 
MultiIndex(levels=[['A', 'B', 'C'], ['c', 'b', 'a']],
           labels=[[0, 0, 1, 1, 2, 2], [1, 0, 2, 0, 2, 1]])

In [15]: df2.index
Out[15]: 
MultiIndex(levels=[['A', 'B', 'C'], ['a', 'b', 'c']],
           labels=[[0, 0, 1, 1, 2, 2], [1, 2, 0, 2, 0, 1]])

I actually did try that (IOW, I would actually modify the returned index itself, rather than just the indexers).

It's actually not that big of a deal to do this and it's much cheaper than the actual sort itself, so I don't think there is a penalty for doing this.

The reason we don't always do this is that we are eagerly evaluated. You could do all kinds of operations that mess up lexsortedness (e.g. multiple appends, masking, whatever), and it's not clear when to reconstruct / sort.

But I think .sort_index is safe as the user asked to sort, so no harm no foul.
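The difference between [14] and [15] above can be modeled on plain lists: sort the values of each level and renumber the labels to match, so that the rows keep pointing at the same values. The helper name `sort_levels_and_remap` is hypothetical and this is only a sketch of the idea, not the body of `_reconstruct`.

```python
def sort_levels_and_remap(levels, labels):
    """Sort each level's values and renumber the labels accordingly."""
    new_levels, new_labels = [], []
    for lev, lab in zip(levels, labels):
        # positions of the level values in sorted order
        order = sorted(range(len(lev)), key=lambda i: lev[i])
        # old position -> new position in the sorted level
        rank = {old: new for new, old in enumerate(order)}
        new_levels.append([lev[i] for i in order])
        new_labels.append([rank[i] for i in lab])
    return new_levels, new_labels


# index of df.sort_index() as shown in [14]
levels = [["A", "B", "C"], ["c", "b", "a"]]
labels = [[0, 0, 1, 1, 2, 2], [1, 0, 2, 0, 2, 1]]

new_levels, new_labels = sort_levels_and_remap(levels, labels)
rows = list(zip(*new_labels))
now_lexsorted = all(a <= b for a, b in zip(rows, rows[1:]))
```

The cost is proportional to the level sizes plus one pass over the labels, with no sort over the rows, which is consistent with the comment that this is much cheaper than the sort itself.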

Contributor

jreback commented Mar 23, 2017 edited

I pushed the additional tests.

diff --git a/pandas/core/frame.py b/pandas/core/frame.py
index c998705..1d10798 100644
--- a/pandas/core/frame.py
+++ b/pandas/core/frame.py
@@ -3359,6 +3359,9 @@ it is assumed to be aliases for the column names.')
                                    axis=baxis,
                                    convert=False, verify=False)
 
+        # reconstruct axis if needed
+        new_data.axes[baxis] = new_data.axes[baxis]._reconstruct(sort=True)
+
         if inplace:
             return self._update_inplace(new_data)
         else:
diff --git a/pandas/indexes/base.py b/pandas/indexes/base.py
index 54f73a2..722bfd2 100644
--- a/pandas/indexes/base.py
+++ b/pandas/indexes/base.py
@@ -444,6 +444,10 @@ class Index(IndexOpsMixin, StringAccessorMixin, PandasObject):
 
         return self
 
+    def _reconstruct(self, sort=False, remove_unused=False):
+        """ compat with MultiIndex """
+        return self
+
     def _update_inplace(self, result, **kwargs):
         # guard when called from IndexOpsMixin
         raise TypeError("Index can't be updated inplace")

is the fix which is almost trivial, BUT

it makes things lexsorted for sure. BUT then I have to go fix some tests for stack and such. So I will make an issue, but it needs attacking later.

Contributor

jreback commented Apr 4, 2017

revised to replace internal _reconstruct with .sort_monotonic() and .remove_unused_levels() (now public). I think this is cleaner; revised docs a bit as well.

@chris-b1 @jorisvandenbossche

pandas/indexes/multi.py
- return MultiIndex(levels=levels, labels=labels, sortorder=sortorder,
- names=names)
+ def sort_levels_monotonic(self):
@jorisvandenbossche

jorisvandenbossche Apr 5, 2017

Owner

If this is internal, let's then call it _sort_levels_monotonic ?

@jreback

jreback Apr 7, 2017

Contributor

done

pandas/indexes/multi.py
+
+ def remove_unused_levels(self):
+ """
+ .. versionadded:: 0.20.0
@jorisvandenbossche

jorisvandenbossche Apr 5, 2017

Owner

Can you put this after the explanation? (the first sentence is what appears in api summary tables)

@jreback

jreback Apr 7, 2017

Contributor

done

pandas/indexes/multi.py
+ """
+ .. versionadded:: 0.20.0
+
+ create a new MultiIndex from the current that removesing
@jorisvandenbossche

jorisvandenbossche Apr 5, 2017

Owner

removesing -> removes

@jreback

jreback Apr 7, 2017

Contributor

done

Contributor

jreback commented Apr 7, 2017

fixed up.

@chris-b1 @jorisvandenbossche

will merge tomorrow

Good to merge.

I am thinking we might need to fix #15797 at the same time with this change (I don't mean necessarily in this PR, but the same release).
For example, in the case of issue #15622 (which is said to be closed by this PR), you would end up with a now visually sorted (that was the bug report, so that is good), but no longer lexsorted frame. So that could lead to errors when indexing.

Contributor

jreback commented Apr 7, 2017

yep will address #15797 next week.

jreback closed this in f478e4f Apr 7, 2017

@linebp linebp added a commit to linebp/pandas that referenced this pull request Apr 17, 2017

@jreback @linebp jreback + linebp BUG: DataFrame.sort_index broken if not both lexsorted and monotonic in levels


closes #15622
closes #15687
closes #14015
closes #13431

Author: Jeff Reback <jeff@reback.net>

Closes #15694 from jreback/sort3 and squashes the following commits:

bd17d2b [Jeff Reback] rename sort_index_montonic -> _sort_index_monotonic
31097fc [Jeff Reback] add doc-strings, rename sort_monotonic -> sort_levels_monotonic
48249ab [Jeff Reback] add doc example
527c3a6 [Jeff Reback] simpler algo for remove_used_levels
520c9c1 [Jeff Reback] versionadded tags
f2ddc9c [Jeff Reback] replace _reconstruct with: sort_monotonic, and remove_unused_levels (public)
3c4ca22 [Jeff Reback] add degenerate test case
269cb3b [Jeff Reback] small doc updates
b234bdb [Jeff Reback] support for removing unused levels (internally)
7be8941 [Jeff Reback] incorrectly raising KeyError rather than UnsortedIndexError, caught by doc-example
47c67d6 [Jeff Reback] BUG: construct MultiIndex identically from levels/labels when concatting
fb4446c

@jreback jreback added a commit to jreback/pandas that referenced this pull request Apr 22, 2017

@jreback jreback BUG: fix degenerate MultiIndex sorting
xref #15694
closes #15797
5b382a4

@jreback jreback added a commit to jreback/pandas that referenced this pull request Apr 22, 2017

@jreback jreback BUG: fix degenerate MultiIndex sorting
xref #15694
closes #15797
80516ff

@jreback jreback added a commit that referenced this pull request Apr 22, 2017

@jreback jreback BUG: fix degenerate MultiIndex sorting (#16092)
xref #15694
closes #15797
c847884

@linebp linebp added a commit to linebp/pandas that referenced this pull request May 2, 2017

@jreback @linebp jreback + linebp BUG: fix degenerate MultiIndex sorting (#16092)
xref #15694
closes #15797
048e6fe

@pcluo pcluo added a commit to pcluo/pandas that referenced this pull request May 22, 2017

@jreback @pcluo jreback + pcluo BUG: fix degenerate MultiIndex sorting (#16092)
xref #15694
closes #15797
ba6de64