Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix reindexing with multi-indexed DataFrames #30766

Merged
merged 49 commits into from Apr 8, 2020

Conversation

ChrisRobo
Copy link
Contributor

@ChrisRobo ChrisRobo commented Jan 6, 2020

Addresses an issue which appears to have existed since 0.23.0 where bugs in the get_indexer() method for the MultiIndex class cause incorrect reindexing behavior for multi-indexed DataFrames.

motivating, example, from (the issue)

>>> 
>>> df = pd.DataFrame({
...     'a': [0, 0, 0, 0],
...     'b': [0, 2, 3, 4],
...     'c': ['A', 'B', 'C', 'D']
... }).set_index(['a', 'b'])
>>> 
>>> df
     c
a b   
0 0  A
  2  B
  3  C
  4  D
>>> df.index
MultiIndex([(0, 0),
            (0, 2),
            (0, 3),
            (0, 4)],
           names=['a', 'b'])
>>> mi_2 = pd.MultiIndex.from_product([[0], [-1, 0, 1, 3, 4, 5]])
>>> mi_2
MultiIndex([(0, -1),
            (0,  0),
            (0,  1),
            (0,  3),
            (0,  4),
            (0,  5)],
           )

as expected, without a method value:

>>> df.reindex(mi_2)
        c
0 -1  NaN
   0    A
   1  NaN
   3    C
   4    D
   5  NaN

using method="backfill", it is:

>>> 
>>> df.reindex(mi_2, method="backfill")
      c
0 -1  A
   0  A
   1  D
   3  A
   4  A
   5  C
>>>

but should (IMHO) be:

>>> df.reindex(mi_2, method="backfill")
        c
0 -1    A
   0    A
   1    B
   3    C
   4    D
   5  NaN
>>>

similarly, using method="pad", it is:

>>> df.reindex(mi_2, method="pad")
        c
0 -1  NaN
   0  NaN
   1    D
   3  NaN
   4    A
   5    C

but should (IMHO) be:

>>> df.reindex(mi_2, method="pad")
        c
0 -1  NaN
   0    A
   1    A
   3    C
   4    D
   5    D

@pep8speaks
Copy link

pep8speaks commented Jan 6, 2020

Hello @ChrisRobo! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-04-08 05:17:02 UTC

@ChrisRobo
Copy link
Contributor Author

Hello @ChrisRobo! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 1591:68: E231 missing whitespace after ','
Line 1597:75: E231 missing whitespace after ','
Line 1600:61: E231 missing whitespace after ','
Line 1603:61: E231 missing whitespace after ','

Line 279:64: E231 missing whitespace after ','
Line 283:53: E231 missing whitespace after ','
Line 287:53: E231 missing whitespace after ','

Comment last updated at 2020-01-06 23:44:11 UTC

the formatting decisions on these lines are auto-generated output of black pandas. I defer to the reviewers here re: the preferred style decisions

@@ -183,6 +183,155 @@ def test_get_indexer():
idx1.get_indexer(idx2)


def test_get_indexer_and_fill():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in all of these test cases, they illustrate the behavior up until 0.23.0. The assumption of this PR is that, in the absence of documentation about these changes to MultiIndex behavior in 0.23.0 (and that this PR does not break any existing tests), that those changes are bugs

pandas/_libs/index.pyx Outdated Show resolved Hide resolved
pandas/_libs/index.pyx Outdated Show resolved Hide resolved
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don’t think the reindeximg result is valid here

pls show your original example at the top of the PR
include the case of reindex without pad/backfill

show the same example with an Index (factorize the tuples);

convince us why these should be different as that is what you are proposing (i think)

@ChrisRobo
Copy link
Contributor Author

i don’t think the reindeximg result is valid here

pls show your original example at the top of the PR
include the case of reindex without pad/backfill

show the same example with an Index (factorize the tuples);

convince us why these should be different as that is what you are proposing (i think)

thanks for the quick feedback here!

I included the original example from the issue (with both what I saw and what I expected to see) and also added the case of no filling (not included in the issue because I don't view that as a bug).

Apologies for the ignorance (this is my first PR here), but what specifically are you looking for with tuple factorization? I'm having trouble constructing an index of tuples which is not itself a MultiIndex.

My argument for why this is a bug is that backfilling/padding should respect python tuple ordering properties (which it appears to have done until version 0.23.0), e.g. (0, 0) < (0, 1) < (0, 2), and so, using the case of the example

  • when backfilling, the row with index value (0, 1) should have B (not D) as its c value, since this should be backfilled to the row with index value (0, 2) in df (whose c value is B)
  • when padding, the row with index value (0, 1) should have A (not D) as its c value, since this should be padded to the row with index value (0, 0) in df (whose c value is A)
  • moreover, in both cases, the rows with index values (0, 3) and (0, 4) should have c values C and D respectively (they are A and A with backfilling and NaN and A with padding), since (0, 3) and (0, 4) are both in df.index to begin with

Happy to add more examples or any kind of additional explanation or validation that you'd like to help make this clearer. Thanks again for the responsiveness

@ChrisRobo
Copy link
Contributor Author

ChrisRobo commented Jan 7, 2020

here's a more concise version of the example from the GH issue which still illustrates the problem, I think. I believe that since (0, 0) < (0, 1) < (0, 2) under python tuple ordering rules, that reindexing with (0, 1) with backfill should give you the value of (0, 2) and reindexing with pad should give you the value of (0, 0)

setup:

>>> df = pd.DataFrame({'a': [0, 0], 'b': [0, 2], 'c': ['A', 'B']}).set_index(['a', 'b'])
>>> df
     c
a b   
0 0  A
  2  B
>>> 
>>> mi_2 = pd.MultiIndex.from_product([[0], [1]])
>>> mi_2
MultiIndex([(0, 1)],
           )

in both cases, without a method value:

>>> df.reindex(mi_2)
       c
0 1  NaN

filling with backfill, as is (which I argue is a bug):

>>> df.reindex(mi_2, method='bfill')
     c
0 1  A

I believe it should be:

>>> df.reindex(mi_2, method='bfill')
     c
0 1  B

similarly, as is, with pad (I also believe this is a bug):

>>> df.reindex(mi_2, method='pad')
       c
0 1  NaN

whereas I believe it should be

>>> df.reindex(mi_2, method='pad')
     c
0 1  A

hopefully this is easier to read

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex labels Jan 9, 2020
@jreback
Copy link
Contributor

jreback commented Jan 9, 2020

@ChrisRobo can you update your example to show a few values, instead of a single one. some should have values even in the reindex and some should not so as to see the effect of the reindex.

@ChrisRobo
Copy link
Contributor Author

@ChrisRobo can you update your example to show a few values, instead of a single one. some should have values even in the reindex and some should not so as to see the effect of the reindex.

sure (making a new comment in the interest of preserving the history of this thread)

here's an example which I think is a bit more comprehensive.

setup:

>>> df = pd.DataFrame({
...     'a': [1, 1, 1, 4, 4, 4],
...     'b': [1, 5, 9, 1, 5, 9],
...     'c': ['A', 'B', 'C', 'D', 'E', 'F'],
... })
>>> df = df.set_index(['a', 'b'])
>>> df
     c
a b   
1 1  A
  5  B
  9  C
4 1  D
  5  E
  9  F
>>> 
>>> mi_2 = pd.MultiIndex.from_tuples([
...     (1, 0),
...     (1, 5),
...     (1, 7),
...     (3, 2),
...     (4, 9),
...     (4, 11),
...     (5, 2),
... ])
>>> mi_2
MultiIndex([(1,  0),
            (1,  5),
            (1,  7),
            (3,  2),
            (4,  9),
            (4, 11),
            (5,  2)],
           )
>>> 

current behavior w/o a filling method (no proposed change):

>>> df.reindex(mi_2)
        c
1 0   NaN
  5     B
  7   NaN
3 2   NaN
4 9     F
  11  NaN
5 2   NaN

current behavior w/backfill and pad:

>>> df.reindex(mi_2, method='backfill')
      c
1 0   A
  5   F
  7   A
3 2   A
4 9   A
  11  D
5 2   B
>>> 
>>> 
>>> df.reindex(mi_2, method='pad')
        c
1 0   NaN
  5     F
  7   NaN
3 2   NaN
4 9   NaN
  11    C
5 2     B

whereas I think this is correct:

>>> df.reindex(mi_2, method='backfill')
        c
1 0     A
  5     B
  7     C
3 2     D
4 9     F
  11  NaN
5 2   NaN
>>> 
>>> df.reindex(mi_2, method='pad')
        c
1 0   NaN
  5     B
  7     B
3 2     C
4 9     F
  11    F

this covers the case of values which are too large/too small (for backfilling/padding, respectively), cases which exist in the original and new index (i.e. (1, 5) and (4, 11)), values in the new index to which nothing is filled (i.e. (4, 5, E)), and backfilling/padding cases (i.e. those which are NaN without a method value but are not afterward).

Let me know if there's a case which this is missing but which you think it should cover!

@ChrisRobo ChrisRobo force-pushed the crobo/fix-multi-index-reindex branch from ca6fbef to 5df2152 Compare January 30, 2020 19:06
@ChrisRobo
Copy link
Contributor Author

@jreback @jbrockmendel when you get a chance, would one or both of you mind taking another look at this?

I'm happy to add more examples or test coverage if you'd like

@jreback
Copy link
Contributor

jreback commented Feb 2, 2020

will look. thought this was going to conflict with a recently merged refactor of MI indexing. pls merge master anyhow.

@ChrisRobo ChrisRobo force-pushed the crobo/fix-multi-index-reindex branch from ca13ff3 to f5521e8 Compare February 3, 2020 04:29
@ChrisRobo
Copy link
Contributor Author

will look. thought this was going to conflict with a recently merged refactor of MI indexing. pls merge master anyhow.

thanks! :). just rebased on top of master

pandas/_libs/index.pyx Outdated Show resolved Hide resolved
pandas/_libs/index.pyx Outdated Show resolved Hide resolved
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a very complicated implementation which is adding quite a bit of code. It seems to me that you can simply re-order the codes to avoid the carrying operation entirely, no? this is similar to recoding the values due to a new factorization.

@@ -172,6 +172,7 @@ MultiIndex
index=[["a", "a", "b", "b"], [1, 2, 1, 2]])
# Rows are now ordered as the requested keys
df.loc[(['b', 'a'], [2, 1]), :]
- Bug in :meth:`MultiIndex.get_indexer` incorrectly handling use of pad and backfill options (:issue:`29896`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you make a separate sub section showing the before and the new results; yes its a bug fix but the effect is non-obvious. you can use the OP example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, let me know if the formatting is OK by you!

pandas/tests/indexes/multi/test_indexing.py Outdated Show resolved Hide resolved
pandas/tests/frame/indexing/test_indexing.py Outdated Show resolved Hide resolved
@ChrisRobo
Copy link
Contributor Author

@jreback @jbrockmendel thanks for the suggestions, sorry for the late reply here.

I basically re-wrote this using what (I think is) a much simpler implementation, i.e. using something like an updated factorization of the MultiIndexes' tuples and passing that to the relevant filling method. I also added types wherever I could in cython code and made additions to the tests as requested.

@@ -2351,7 +2351,9 @@ def get_indexer(self, target, method=None, limit=None, tolerance=None):
raise NotImplementedError(
"tolerance not implemented yet for MultiIndex"
)
indexer = self._engine.get_indexer(target, method, limit)
indexer = self._engine.__class__.get_indexer_and_fill(
self.values, target, method=method, limit=limit
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it would be preferable (per @jbrockmendel's point about keeping Index objects out of the cython code) to just pass the MultiIndex object's tuples directly to the cython code

# 1: 2.0
# 2: 5.0
# 3: 5.8
df = pd.DataFrame({
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as before but added 3 0-level values for df's (multi)index

@ChrisRobo ChrisRobo force-pushed the crobo/fix-multi-index-reindex branch from 69dd92b to 7836cc9 Compare April 8, 2020 05:16
@ChrisRobo
Copy link
Contributor Author

ChrisRobo commented Apr 8, 2020

looks fine, can you merge master and just confirm the performance of what you have here.

sorry this took a long time to review, we have many PRs. thanks

rebased on top of master, perf (in terms of times) for relevant ASV benchmarks seems the same as before

against master:

       before           after         ratio
     [e0dd81a5]       [7836cc90]
     <crobo/fix-multi-index-reindex~49>       <crobo/fix-multi-index-reindex>
+         120±7ms          484±9ms     4.05  multiindex_object.Integer.time_get_indexer_and_backfill
+         120±6ms          474±9ms     3.96  multiindex_object.Integer.time_get_indexer_and_pad
-        711±10μs          610±5μs     0.86  multiindex_object.Values.time_datetime_level_values_sliced

against 0.22.0 (last version where the correct functionality was in place):

       before           after         ratio
     [a00154dc]       [7836cc90]
     <v0.22.0^0>       <list>    
...
-        556±10ms          485±9ms     0.87  multiindex_object.Integer.time_get_indexer_and_backfill
-        626±10ms          482±5ms     0.77  multiindex_object.Integer.time_get_indexer_and_pad

and no problem re: reviews, thanks for taking the time to look at this!

@jreback jreback merged commit 7222318 into pandas-dev:master Apr 8, 2020
@jreback
Copy link
Contributor

jreback commented Apr 8, 2020

thanks @ChrisRobo great effort! and thanks for sticking with it. we are getting lots of PRs now-a-days and hard to give good review on them all.

@ChrisRobo
Copy link
Contributor Author

my pleasure, and thanks @jreback and @jbrockmendel for the time reviewing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants