Skip to content

Conversation

jeffzi
Copy link
Contributor

@jeffzi jeffzi commented Jan 25, 2020

The intersection of 2 MultiIndex with sort=False does not preserve the order, whereas Index. intersection() does. This behavior does not seem to be intentional since the tests for difference and symmetric_difference are testing for order preservation.

For example:

import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
idx = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

left = idx[2::-1]
print(left)
#> MultiIndex([('baz', 'one'),
#>             ('bar', 'two'),
#>             ('bar', 'one')],
#>            names=['first', 'second'])
right = idx[:5]
print(right)
#> MultiIndex([('bar', 'one'),
#>             ('bar', 'two'),
#>             ('baz', 'one'),
#>             ('baz', 'two'),
#>             ('foo', 'one')],
#>            names=['first', 'second'])

# expected same order as left
intersect = left.intersection(right, sort=False)
print(intersect)
#> MultiIndex([('bar', 'two'),
#>             ('baz', 'one'),
#>             ('bar', 'one')],
#>            names=['first', 'second'])

Created on 2020-01-25 by the reprexpy package

I verified that my PR does not decrease performances.

I also modified the test for union. The implementation was preserving the order with sort=False but the tests were not verifying it.

@charlesdong1991
Copy link
Member

Very nice! @jeffzi

could you pls also create an issue and reference this PR to the issue?

@jeffzi jeffzi force-pushed the fix-multiindex-intersection-sort branch from 7a066e6 to 8320d41 Compare January 26, 2020 10:08
@jeffzi
Copy link
Contributor Author

jeffzi commented Jan 26, 2020

I benchmarked the intersection of 2 arrays of tuples: master Vs this PR Vs suggestion by @jreback to take inspiration of difference.

I also looked at the implementation of Index.intersection and found that performances can be increased in the case of 2 monotonic MultiIndex.

@jreback If that's ok with you, I can commit the "mixed" version below.

iimport numpy as np
import pandas as pd
from pandas.testing import assert_index_equal

def master(self, other):
    return set(self._ndarray_values) & set(other._ndarray_values)

def pr(self, other):
    lvals = self._ndarray_values
    rvals = other._ndarray_values
    
    other_uniq = set(rvals)
    seen = set()
    return [x for x in lvals if x in other_uniq and not (x in seen or seen.add(x))]

def indexer(self, other):
    lvals = self._ndarray_values

    if self.is_monotonic and other.is_monotonic:
        rvals = other._ndarray_values
        return self._inner_indexer(lvals, rvals)[0]

    other_uniq = other.unique()
    indexer = other_uniq.get_indexer(lvals)
    indexer = indexer.take((indexer != -1).nonzero()[0])

    return other_uniq.take(indexer)._ndarray_values

def mixed(self, other):
    lvals = self._ndarray_values
    rvals = other._ndarray_values

    if self.is_monotonic and other.is_monotonic:
        return self._inner_indexer(lvals, rvals)[0]

    other_uniq = set(rvals)
    seen = set()
    return [x for x in lvals if x in other_uniq and not (x in seen or seen.add(x))]

size = 100000 
left = pd.MultiIndex.from_arrays([np.arange(1, size), np.arange(1, size)])
right = pd.MultiIndex.from_arrays([np.arange(1, size//10)[::-1], np.arange(1, size//10)[::-1]])
right_monotonic = pd.MultiIndex.from_arrays([np.arange(1, size//10), np.arange(1, size//10)])

for r in [right, right_monotonic]:
    print(f"\nBoth monotonic: {r.is_monotonic}\n---------------")
    expected = pd.MultiIndex.from_tuples(pr(left, r))
    actual_indexer = pd.MultiIndex.from_tuples(indexer(left, r))
    assert_index_equal(expected, actual_indexer)
    actual_mixed = pd.MultiIndex.from_tuples(mixed(left, r))
    assert_index_equal(expected, actual_mixed)

    print("master implementation:")
    %timeit -n 10 master(left, r)
    print("pr implementation:")
    %timeit -n 10 pr(left, r)
    print("Suggested indexer implementation:")
    %timeit -n 10 indexer(left, r)
    print("Mixed implementation:")
    %timeit -n 10 mixed(left, r)
Both monotonic: False
---------------
master implementation:
18.2 ms ± 5.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
pr implementation:
10.6 ms ± 625 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Suggested indexer implementation:
170 ms ± 4.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Mixed implementation:
10.2 ms ± 562 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Both monotonic: True
---------------
master implementation:
15.1 ms ± 984 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
pr implementation:
10.4 ms ± 730 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Suggested indexer implementation:
2.56 ms ± 292 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Mixed implementation:
2.63 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
'platform': 'Darwin',
'platform-release': '19.2.0',
'architecture': 'x86_64',
'ram': '16 GB'

@jeffzi jeffzi changed the title Bug: MultiIndex intersection with sort=False does not preserve order BUG: MultiIndex intersection with sort=False does not preserve order Jan 26, 2020
@jeffzi jeffzi force-pushed the fix-multiindex-intersection-sort branch from 83518d1 to 5273652 Compare January 27, 2020 18:30
@jeffzi jeffzi requested a review from jreback January 28, 2020 09:08
@jeffzi jeffzi force-pushed the fix-multiindex-intersection-sort branch 3 times, most recently from 632f23e to 6b8f2d7 Compare February 4, 2020 09:20
@jreback jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Feb 9, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. can you add a whatsnew note in bugfixes for MI. ping on green.

@jreback
Copy link
Contributor

jreback commented Feb 9, 2020

pls merge master as well

@jreback jreback requested a review from jbrockmendel February 9, 2020 21:54
@jeffzi jeffzi force-pushed the fix-multiindex-intersection-sort branch from 74f5fde to 7fb060d Compare February 10, 2020 11:32
@jeffzi
Copy link
Contributor Author

jeffzi commented Feb 10, 2020

@jreback All green :)

@jreback jreback added this to the 1.1 milestone Feb 10, 2020
@jreback
Copy link
Contributor

jreback commented Feb 10, 2020

lgtm. @jbrockmendel

@jbrockmendel
Copy link
Member

Couple of comments/questions, no deal-breakers. looks nice

@WillAyd WillAyd merged commit c2f3ce3 into pandas-dev:master Feb 12, 2020
@WillAyd
Copy link
Member

WillAyd commented Feb 12, 2020

Thanks @jeffzi great first PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: MultiIndex intersection with sort=False does not preserve order
5 participants