BUG: DataFrame.sort_index broken if not both lexsorted and monotonic in levels #15694
Conversation
jreback
added 2/3 Compat Algos API Design Bug MultiIndex Reshaping
labels
Mar 15, 2017
jreback
added this to the
0.20.0
milestone
Mar 15, 2017
|
The basic problem was that we were not sorting if a MultiIndex was already lexsorted. But a lexsorted index does NOT imply that the levels are monotonic (intra-level); depending on the construction method they might or might not be. So this PR forces a reconstruction of the MI, which is not actually expensive, to ensure that it is ordered correctly when sorting (which we do in a myriad of places). xref #13431, for which I added a test (xfailing). That one is a tiny bit more complicated and I think I may have to modify the internals a bit. |
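To make the distinction concrete, here is a minimal sketch. It assumes a recent pandas, where the `labels` argument is spelled `codes`, and the variable names are illustrative only. Both indexes hold the same tuples, but only one of them has monotonic levels:

```python
import pandas as pd

# Built from arrays: pandas factorizes each array, so the levels come out sorted.
built = pd.MultiIndex.from_arrays(
    [["a", "a", "b", "b"], ["bb", "aa", "bb", "aa"]]
)

# Built explicitly: the codes are lexsorted, but level 1 is stored as ['bb', 'aa'].
explicit = pd.MultiIndex(
    levels=[["a", "b"], ["bb", "aa"]],
    codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
)

same_tuples = bool(explicit.equals(built))
built_monotonic = bool(built.levels[1].is_monotonic_increasing)
explicit_monotonic = bool(explicit.levels[1].is_monotonic_increasing)
print(same_tuples, built_monotonic, explicit_monotonic)
```

So whether the levels are monotonic depends on how the index was constructed, not on which tuples it contains.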
jreback
removed 2/3 Compat Algos API Design
labels
Mar 15, 2017
codecov
bot
commented
Mar 16, 2017
•
Codecov Report
@@            Coverage Diff             @@
##           master   #15694      +/-   ##
==========================================
+ Coverage   90.97%   90.99%   +0.02%
==========================================
  Files         145      145
  Lines       49474    49519      +45
==========================================
+ Hits        45007    45060      +53
+ Misses       4467     4459       -8
Continue to review full report at Codecov.
|
jreback
changed the title from
BUG: construct MultiIndex identically from levels/labels when concatting to BUG: DataFrame.sort_index broken if not both lexsorted and monotonic in levels
Mar 16, 2017
|
@chris-b1 if you'd have a look. |
| @@ -1807,6 +1807,13 @@ def get_group_levels(self): | ||
| 'ohlc': lambda *args: ['open', 'high', 'low', 'close'] | ||
| } | ||
| + def _is_builtin_func(self, arg): |
jreback
Mar 16, 2017
Contributor
ignore this, was actually an unrelated bug as this wasn't defined on BaseGrouper
|
@jreback - I only skimmed the implementation; it seems reasonable at first glance. I do think this needs a bigger note in the docs, and maybe it should even warn if the reconstruction re-sorts the levels, as this is an API change? I'm in favor of the behavior in this PR, but there could be existing code that takes advantage of the custom ordering possible with a MI, e.g.
In [21]: df = pd.DataFrame({'value': [1, 2, 3, 4]}, index=pd.MultiIndex(
...: levels=[['a', 'b'], ['bb', 'aa']],
...: labels=[[0, 0, 1, 1], [0, 1, 0, 1]]))
In [22]: df
Out[22]:
value
a bb 1
aa 2
b bb 3
aa 4
In [23]: df.sort_index()
Out[23]:
value
a bb 1
aa 2
b bb 3
aa 4 |
|
@chris-b1 FYI, a couple of recent pushes, as I had some bug fixes. This only reconstructs in order to calculate the indexer. It should not be an API change, except that some sorting that previously just didn't work now does. |
|
@chris-b1 your example maybe with an older version
|
|
Maybe I'm misunderstanding, but won't
|
|
see [3] in my example (your index is right, but the values are not). It gets sorted. |
|
Sorry, I mistyped the values. Pulled it down. This is the change in behavior - although
master / 0.19.2
In [25]: df.sort_index()
Out[25]:
value
a bb 1
aa 2
b bb 3
aa 4
PR
In [3]: df.sort_index()
value
a aa 2
bb 1
b aa 4
bb 3 |
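For reference, the PR-side result can be checked with a short script. This is a sketch against a recent pandas (an assumption: `codes=` replaces the `labels=` keyword used in the example above), not the test added in this PR:

```python
import pandas as pd

# Same frame as in the example above: level 1 is stored as ['bb', 'aa'].
mi = pd.MultiIndex(
    levels=[["a", "b"], ["bb", "aa"]],
    codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
)
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=mi)

result = df.sort_index()
sorted_keys = list(result.index)          # index tuples after sorting
sorted_values = result["value"].tolist()  # values follow their keys
print(sorted_keys)
print(sorted_values)
```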
|
@chris-b1 right, that's the bug: they thought it was sorted, but it actually wasn't. OK, will add this as a small sub-section to show it. |
|
so just because it was cool :> I added support (internally) for removing unused level values, ala #2770 here: 50ac461 This is still not user exposed. Though pretty trivially to make a Further I think we could actually call this (its pretty cheap as long as you don't actually have unused levels, with a tiny modification) from a higher level (e.g. in DataFrame / Groupby) and such. This is for another issue though. |
Not to belabor the point, but what I was saying is that someone may have wanted that ordering; it was well-defined behavior, if surprising. It seems to have been removed in the current docs, but there used to be a line specifically explaining that lexsorting the index does not always mean lexsorting the level values. (To be clear, I am completely for changing this.)
|
jreback
referenced
this pull request
Mar 16, 2017
Closed
ENH: support for removing unused levels of a MultiIndex (interally) #15700
|
Thanks. I actually somewhat misunderstood current behavior - thought 0.19.2
|
| + new_levels.append(lev) | ||
| + new_labels.append(lab) | ||
| + | ||
| + return MultiIndex(new_levels, new_labels, |
|
Any more comments on this? |
jorisvandenbossche
reviewed
Mar 22, 2017
It will be nice to see this gotcha gone!
Added some comments:
- Can you check whether the prose docs need an update as well?
- If you want the previous behaviour (in certain cases), i.e. to sort according to the order of the levels, what would be the best way to achieve this? It would maybe be good to add this to the whatsnew for people who actually wanted that behaviour.
Still have to go through the tests
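One possible way to get a level-order sort, offered as a sketch rather than an endorsed recipe, is to order the rows by the integer codes instead of by the level values. This assumes a recent pandas where the codes are exposed as `MultiIndex.codes`; `sort_by_level_order` is a hypothetical helper, not a pandas function.

```python
import numpy as np
import pandas as pd


def sort_by_level_order(frame):
    """Hypothetical helper: order rows by the index codes, i.e. by each
    level's stored order rather than by the level values."""
    codes = frame.index.codes
    # np.lexsort treats the LAST key as the primary one, so reverse the codes.
    order = np.lexsort(list(reversed(codes)))
    return frame.iloc[order]


mi = pd.MultiIndex(
    levels=[["a", "b"], ["bb", "aa"]],
    codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
)
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=mi)

shuffled = df.iloc[[3, 0, 2, 1]]            # scramble the rows; levels are kept
restored = sort_by_level_order(shuffled)    # 'bb' before 'aa', as stored
print(restored)
```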
| +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
| + | ||
| +In certain cases, calling ``.sort_index()`` on a MultiIndexed DataFrame would return the *same* DataFrame without seeming to sort. | ||
| +This would happen with a ``lexsorted``, but non-montonic levels. (:issue:`15622`, :issue:`15687`, :issue:`14015`, :issue:`13431`) |
| +New Behavior: | ||
| + | ||
| +.. ipython:: python | ||
| + |
| @@ -3321,8 +3325,7 @@ def sort_index(self, axis=0, level=None, ascending=True, inplace=False, | ||
| axis = self._get_axis_number(axis) | ||
| labels = self._get_axis(axis) | ||
| - # sort by the index | ||
| - if level is not None: |
jreback
Mar 22, 2017
Contributor
yes in fact this was what prevented
.sort_index() and .sort_index(level=0) from being the same.
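A quick way to check that equivalence, sketched against a recent pandas (an assumption: the `codes=` keyword, which older releases spelled `labels=`):

```python
import pandas as pd

mi = pd.MultiIndex(
    levels=[["a", "b"], ["bb", "aa"]],
    codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
)
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=mi)

plain = df.sort_index()            # sort on the whole index
by_level = df.sort_index(level=0)  # sort on level 0, then the remaining levels
identical = bool(plain.equals(by_level))
print(identical)
```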
| @@ -93,6 +93,11 @@ def maybe_lift(lab, size): # pormote nan values | ||
| return loop(list(labels), list(shape)) | ||
| +def get_compressed_ids(labels, sizes): | ||
| + ids = get_group_index(labels, sizes, sort=True, xnull=False) |
| - names=names) | ||
| + def _reconstruct(self, sort=False): | ||
| + """ | ||
| + reconstruct the MultiIndex |
jorisvandenbossche
Mar 22, 2017
Owner
Can you specify what reconstruct means ? (rearranging the labels and levels to have the levels sorted?)
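As a rough illustration of what such a reconstruction could look like (this is not the code from this PR): sort each level's values and remap the integer codes so that the index still holds the same tuples. The helper name `sort_levels_monotonic_sketch` is hypothetical, and the sketch assumes a recent pandas where the labels are called `codes`.

```python
import numpy as np
import pandas as pd


def sort_levels_monotonic_sketch(mi):
    """Illustrative only: return an equal MultiIndex whose levels are sorted."""
    new_levels, new_codes = [], []
    for lev, codes in zip(mi.levels, mi.codes):
        if not lev.is_monotonic_increasing:
            order = lev.argsort()  # positions that sort the level values
            lev = lev.take(order)
            # Inverse permutation maps each old code to its new position.
            inverse = np.empty(len(order), dtype=np.intp)
            inverse[order] = np.arange(len(order))
            # Leave -1 (missing value) codes untouched.
            codes = np.where(codes == -1, -1, inverse[codes])
        new_levels.append(lev)
        new_codes.append(codes)
    return pd.MultiIndex(levels=new_levels, codes=new_codes, names=mi.names)


mi = pd.MultiIndex(
    levels=[["a", "b"], ["bb", "aa"]],
    codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
)
rebuilt = sort_levels_monotonic_sketch(mi)
print(list(rebuilt.levels[1]), rebuilt.codes[1].tolist())
```

The tuples are unchanged; only the internal representation (levels plus codes) differs, so a lexsort on the new codes agrees with a sort on the values.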
My wording here is not really correct, as it is only the previous behaviour in a very specific case I think? (when they were already lexsorted?) But still, the question is probably valid I think. |
|
so I pushed a small doc update (with corrections as indicated above). I also incorporated my (currently internal) method of removing unused levels (this is also included in I am thinking about a public
(Note that this is quite inefficient, but does work). As to your other question: do we provide a way to get the existing (buggy) behavior? Sure, you just don't sort! This is completely a bug fix. |
not sure what you mean here. the point of this PR is to fix this problem. In some cases previously, |
As I said, my initial wording was wrong, as I shouldn't have said it was to have the previous behaviour.
So you get a "Uh, I just sorted my index but I still get an UnsortedIndexError". In principle this has nothing to do with this PR, as the behaviour in the example above is exactly the same on master and with this PR, and it is not necessarily something that this PR should solve (although we could also opt for |
|
@jorisvandenbossche ahh, but I actually can do this.
|
I actually did try that (IOW, I would actually modify the returned index itself, rather than just the indexers). It's actually not that big of a deal to do this, and it's much cheaper than the actual sort itself, so I don't think there is a penalty for doing this. The reason we don't always do this is that we are eagerly evaluated. You could do all kinds of operations that mess up lexsortedness (e.g. multiple appends, masking, whatever), and it's not clear when to reconstruct / sort. But I think |
|
I pushed the additional tests.
is the fix which is almost trivial, BUT it makes things |
This was referenced Mar 23, 2017
|
revised to replace internal |
| - return MultiIndex(levels=levels, labels=labels, sortorder=sortorder, | ||
| - names=names) | ||
| + def sort_levels_monotonic(self): |
jorisvandenbossche
Apr 5, 2017
Owner
If this is internal, let's then call it _sort_levels_monotonic ?
| + | ||
| + def remove_unused_levels(self): | ||
| + """ | ||
| + .. versionadded:: 0.20.0 |
jorisvandenbossche
Apr 5, 2017
Owner
Can you put this after the explanation? (the first sentence is what appears in api summary tables)
| + """ | ||
| + .. versionadded:: 0.20.0 | ||
| + | ||
| + create a new MultiIndex from the current that removesing |
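A short usage sketch of `MultiIndex.remove_unused_levels` as it exists in released pandas (the index contents here are made up for illustration):

```python
import pandas as pd

full = pd.MultiIndex.from_product([["a", "b"], ["x", "y"]])

# Slicing keeps the original levels, even the ones no row refers to any more.
subset = full[:2]                      # only ('a', 'x') and ('a', 'y') remain
before = list(subset.levels[0])        # still lists 'b'

trimmed = subset.remove_unused_levels()
after = list(trimmed.levels[0])        # 'b' is gone
print(before, after)
```

The trimmed index compares equal to the original subset; only the stored levels shrink.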
jreback
added some commits
Mar 15, 2017
|
fixed up. will merge tomorrow |
|
Good to merge. I am thinking we might need to fix #15797 at the same time with this change (I don't mean necessarily in this PR, but the same release). |
|
yep will address #15797 next week. |
jreback
closed this
in f478e4f
Apr 7, 2017
linebp
added a commit
to linebp/pandas
that referenced
this pull request
Apr 17, 2017
|
|
jreback + linebp |
fb4446c
|
jreback
referenced
this pull request
Apr 20, 2017
Closed
Time-based .rolling() fails with .groupby() #13966
jreback
added a commit
to jreback/pandas
that referenced
this pull request
Apr 22, 2017
|
|
jreback |
5b382a4
|
jreback
referenced
this pull request
Apr 22, 2017
Merged
BUG: fix degenerate MultiIndex sorting #16092
jreback
added a commit
to jreback/pandas
that referenced
this pull request
Apr 22, 2017
|
|
jreback |
80516ff
|
jreback
added a commit
that referenced
this pull request
Apr 22, 2017
|
|
jreback |
c847884
|
linebp
added a commit
to linebp/pandas
that referenced
this pull request
May 2, 2017
|
|
jreback + linebp |
048e6fe
|
pcluo
added a commit
to pcluo/pandas
that referenced
this pull request
May 22, 2017
|
|
jreback + pcluo |
ba6de64
|
jreback commented Mar 15, 2017
•
edited
closes #15622
closes #15687
closes #14015
closes #13431
nice bump on Series.sort_index for monotonic