[PERF] Get rid of MultiIndex conversion in IntervalIndex.is_unique #26391

Merged
merged 7 commits into pandas-dev:master on May 28, 2019

Conversation

makbigc
Contributor

@makbigc makbigc commented May 14, 2019

This PR follows #25159 (in which I broke something while merging).

All benchmarks:

       before           after         ratio
     [304d8d4b]       [4ec1fe97]
     <master>         <intv-is-unique>
        228±0.2ns        230±0.2ns     1.01  index_object.IntervalIndexMethod.time_is_unique(1000)
-         450±7ns          277±2ns     0.62  index_object.IntervalIndexMethod.time_is_unique(100000)

Unlike the previous attempt, this approach doesn't degrade performance.
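
For context, a quick illustration (not part of the PR) of what the property reports:

import pandas as pd

ii = pd.interval_range(0, 5)                             # (0, 1], (1, 2], ..., (4, 5]
print(ii.is_unique)                                      # True: every interval is distinct

dup = ii.append(pd.IntervalIndex.from_tuples([(0, 1)]))  # re-add (0, 1]
print(dup.is_unique)                                     # False: (0, 1] now appears twice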

@codecov

codecov bot commented May 14, 2019

Codecov Report

Merging #26391 into master will decrease coverage by <.01%.
The diff coverage is 93.33%.

@@            Coverage Diff             @@
##           master   #26391      +/-   ##
==========================================
- Coverage   91.69%   91.68%   -0.01%     
==========================================
  Files         174      174              
  Lines       50749    50763      +14     
==========================================
+ Hits        46534    46543       +9     
- Misses       4215     4220       +5
Flag Coverage Δ
#multiple 90.19% <93.33%> (ø) ⬆️
#single 41.15% <0%> (-0.17%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/interval.py 95.17% <93.33%> (-0.06%) ⬇️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py 97.02% <0%> (-0.12%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 612c244...d11acd6.

@codecov

codecov bot commented May 14, 2019

Codecov Report

Merging #26391 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26391      +/-   ##
==========================================
- Coverage   91.69%   91.69%   -0.01%     
==========================================
  Files         174      174              
  Lines       50749    50754       +5     
==========================================
+ Hits        46534    46537       +3     
- Misses       4215     4217       +2
Flag Coverage Δ
#multiple 90.2% <100%> (ø) ⬆️
#single 41.15% <0%> (-0.17%) ⬇️
Impacted Files Coverage Δ
pandas/core/indexes/interval.py 95.33% <100%> (+0.1%) ⬆️
pandas/io/gbq.py 78.94% <0%> (-10.53%) ⬇️
pandas/compat/__init__.py 93.54% <0%> (-0.4%) ⬇️
pandas/plotting/_style.py 76.92% <0%> (-0.26%) ⬇️
pandas/plotting/_misc.py 38.23% <0%> (-0.23%) ⬇️
pandas/core/frame.py 97.02% <0%> (-0.12%) ⬇️
pandas/util/testing.py 90.6% <0%> (-0.11%) ⬇️
pandas/plotting/_converter.py 63.66% <0%> (-0.06%) ⬇️
pandas/core/groupby/generic.py 88.96% <0%> (-0.05%) ⬇️
pandas/core/dtypes/dtypes.py 96.65% <0%> (-0.02%) ⬇️
... and 8 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 612c244...d3af9c9.

@jreback jreback added the Performance (Memory or execution speed performance) and Interval (Interval data type) labels May 14, 2019
@jreback
Contributor

jreback commented May 14, 2019

lgtm. @jschendel

@jreback jreback added this to the 0.25.0 milestone May 14, 2019
Member

@jschendel jschendel left a comment

Thanks, the approach looks good overall; I just have some comments on the implementation.

return False
return True

if len(self) - len(self.dropna()) > 1:
Member

I think self.isna().sum() > 1 is a little more idiomatic and performant.

Doing single runs to avoid caching:

In [2]: ii = pd.interval_range(0, 10**5)

In [3]: ii_nan = ii.insert(1, np.nan).insert(12345, np.nan)

In [4]: %timeit -r 1 -n 1 ii.isna().sum() > 1
435 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [5]: %timeit -r 1 -n 1 ii_nan.isna().sum() > 1
444 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [6]: ii = pd.interval_range(0, 10**5)

In [7]: ii_nan = ii.insert(1, np.nan).insert(12345, np.nan)

In [8]: %timeit -r 1 -n 1 len(ii) - len(ii.dropna()) > 1
677 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In [9]: %timeit -r 1 -n 1 len(ii_nan) - len(ii_nan.dropna()) > 1
2.18 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

@@ -461,7 +461,28 @@ def is_unique(self):
"""
Return True if the IntervalIndex contains unique elements, else False
"""
return self._multiindex.is_unique
left = self.values.left
right = self.values.right
Member

self.values.left should be equivalent to self.left, so I think we can get by without needing to define these, and just refer to them as self.left/self.right where needed
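
For instance, a quick check (illustrative, not from the review; assumes the 0.24+ API where .values returns an IntervalArray):

In [1]: import numpy as np; import pandas as pd

In [2]: ii = pd.interval_range(0, 5)

In [3]: np.array_equal(ii.left, ii.values.left) and np.array_equal(ii.right, ii.values.right)
Out[3]: True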

left = self.values.left
right = self.values.right

def _is_unique(left, right):
Member

If my previous comment is correct, I don't think we need this to be a function anymore since it's only called once, so you can just put the function's logic at the end of the method.

Can you also test out the following variant of _is_unique:

from collections import defaultdict

import numpy as np


def _is_unique2(left, right):
    seen_pairs = defaultdict(bool)
    check_idx = np.where(left.duplicated(keep=False))[0]
    for idx in check_idx:
        pair = (left[idx], right[idx])
        if seen_pairs[pair]:
            return False
        seen_pairs[pair] = True

    return True

I did a sample run of this, and it appears to be a bit more efficient:

In [3]: np.random.seed(123)
   ...: left = pd.Index(np.random.randint(5, size=10**5))
   ...: right = pd.Index(np.random.randint(10**5/4, size=10**5))

In [4]: %timeit _is_unique(left, right)
3.84 ms ± 34.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit _is_unique2(left, right)
1.13 ms ± 26.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I haven't fully tested this in all scenarios though.

Contributor Author

HEAD adopts _is_unique2 and HEAD~3 adopts _is_unique. The performance is slightly worse, but the code is more self-explanatory.

       before           after         ratio
     [4ec1fe97]       [202b2cfa]
     <intv-is-unique~3>       <intv-is-unique>
        230±0.2ns          217±2ns     0.94  index_object.IntervalIndexMethod.time_is_unique(1000)
+         277±2ns          316±5ns     1.14  index_object.IntervalIndexMethod.time_is_unique(100000)

Contributor

Are these asv benchmarks really short? Maybe add a longer one and see how this scales.

Member

Yeah, I was wondering this too; is_unique is cached, so I wonder if the asv is just timing the cache lookup? Does anything special need to be done to handle things that are cached?
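
A sketch of how a benchmark could sidestep the cache (illustrative only; the class and method names below are hypothetical, not the actual pandas asv suite):

import numpy as np
import pandas as pd

class IntervalIndexIsUniqueUncached:
    # asv-style benchmark: rebuilding the index inside the timed call means the
    # cache_readonly cache is empty, so is_unique actually runs (at the cost of
    # also timing index construction).
    params = [10**3, 10**5]
    param_names = ['N']

    def setup(self, N):
        left = np.arange(N)
        self.left = np.append(left, 0)        # inject one duplicate interval
        self.right = np.append(left + 1, 1)

    def time_is_unique_uncached(self, N):
        pd.IntervalIndex.from_arrays(self.left, self.right).is_unique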

Contributor Author

HEAD~6 adopts _is_unique while HEAD adopts _is_unique2.

       before           after         ratio
     [4ec1fe97]       [d3af9c91]
     <intv-is-unique~6>       <intv-is-unique>
          223±1ns          207±1ns     0.93  index_object.IntervalIndexMethod.time_is_unique(1000)
+         270±5ns          302±4ns     1.12  index_object.IntervalIndexMethod.time_is_unique(100000)
       1.50±0.01s          1.84±0s    ~1.22  index_object.IntervalIndexMethod.time_is_unique(10000000)

if left.is_unique or right.is_unique:
return True

seen_pairs = defaultdict(bool)
Contributor

can you just use a set?

Member

Ah, yes, that's a lot cleaner. I was originally treating the endpoints separately and needed the dict structure, but it should be unnecessary when dealing with tuples.

return True

seen_pairs = set()
check_idx = np.where(left.duplicated(keep=False))[0]
Contributor

actually, can't you just do this:

pairs = [(left[idx], right[idx]) for idx in check_idx]
return len(set(pairs)) == len(pairs)

Contributor Author

The present approach may be better, since False is returned as soon as a duplicate is found. Comparing the lengths requires running over all potential duplicates.
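
Pulling the diff fragments above together, a sketch of the early-return variant being discussed (a sketch only, not the merged implementation verbatim; pairs_unique is a hypothetical standalone name):

import numpy as np
import pandas as pd

def pairs_unique(left, right):
    # left/right are pd.Index objects holding the interval endpoints
    seen_pairs = set()
    # only positions whose left endpoint repeats can form duplicate intervals
    check_idx = np.where(left.duplicated(keep=False))[0]
    for idx in check_idx:
        pair = (left[idx], right[idx])
        if pair in seen_pairs:
            # bail out on the first duplicate instead of scanning everything
            return False
        seen_pairs.add(pair)
    return True

# e.g. pairs_unique(pd.Index([0, 0, 1]), pd.Index([1, 1, 2])) -> False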

@makbigc
Contributor Author

makbigc commented May 23, 2019

@jreback Please let me know if there is anything else you'd like me to implement.

@jreback
Contributor

jreback commented May 26, 2019

lgtm. @jschendel merge if you are ok with this.

@jschendel jschendel merged commit 998a0de into pandas-dev:master May 28, 2019
@jschendel
Member

thanks @makbigc
