PERF: Avoid MultiIndex conversion for IntervalIndex methods #24813

Closed
5 of 9 tasks
jschendel opened this issue Jan 17, 2019 · 7 comments
Labels
Interval (Interval data type), Master Tracker (High level tracker for similar issues), Performance (Memory or execution speed performance), setops (union, intersection, difference, symmetric_difference)

Comments

@jschendel
Member

jschendel commented Jan 17, 2019

There are a few IntervalIndex methods that convert to a MultiIndex as an intermediate step, and then use the associated MultiIndex method to compute the result. This likely introduces overhead that could be avoided via a more direct IntervalIndex implementation.
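
As a rough illustration of the round trip described above, the sketch below computes an intersection by going through a MultiIndex of (left, right) pairs and compares it with calling `IntervalIndex.intersection` directly. The helper `intersection_via_multiindex` is hypothetical and written only for this example; it is not the pandas internal implementation, and it assumes both indexes share the same `closed` side.

```python
import pandas as pd

# Two small IntervalIndex objects, both closed on the right (the default).
left = pd.IntervalIndex.from_breaks([0, 1, 2, 3])   # (0, 1], (1, 2], (2, 3]
right = pd.IntervalIndex.from_breaks([1, 2, 3, 4])  # (1, 2], (2, 3], (3, 4]


def intersection_via_multiindex(a, b):
    """Hypothetical helper: intersect two IntervalIndex objects by
    converting each to a MultiIndex of (left, right) endpoints first."""
    mi_a = pd.MultiIndex.from_arrays([a.left, a.right])
    mi_b = pd.MultiIndex.from_arrays([b.left, b.right])
    common = mi_a.intersection(mi_b)
    # Convert the surviving endpoint pairs back into intervals.
    return pd.IntervalIndex.from_arrays(
        common.get_level_values(0),
        common.get_level_values(1),
        closed=a.closed,
    )


roundtrip = intersection_via_multiindex(left, right)
direct = left.intersection(right)

# Both routes should contain the same intervals; order is not compared here.
print(set(roundtrip) == set(direct))
```

The two conversions (to a MultiIndex and back) are the extra work this issue proposes to avoid.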

Methods that currently require a MultiIndex conversion:

@jschendel added the Performance, Interval, and Master Tracker labels on Jan 17, 2019
@jschendel added this to the Contributions Welcome milestone on Jan 17, 2019
@stevenbw

@jschendel I will look into this.

@vfilimonov
Contributor

Hello

Do I understand correctly that the slow DataFrame.mul, DataFrame.add, DataFrame.div, and DataFrame.sub all belong here (similar to #30267), or is that a separate issue?

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(500, 1000))
xx = pd.Series(100, index=df.columns)

%timeit df.mul(xx, axis=1)  # 328 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.multiply(df, xx)  # 935 µs ± 61.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
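
For anyone reproducing this comparison outside IPython, here is a minimal sketch using the standard-library `timeit` module instead of the `%timeit` magic. It also checks that `df.mul(xx, axis=1)` matches plain NumPy broadcasting on the underlying arrays. The repeat count of 5 is an arbitrary choice for this example, and absolute timings will depend on the pandas version and machine.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(500, 1000))
xx = pd.Series(100, index=df.columns)

# Same computation two ways: the pandas method, and raw ndarray broadcasting
# (a length-1000 vector multiplied across the 1000 columns of each row).
via_mul = df.mul(xx, axis=1)
via_numpy = df.to_numpy() * xx.to_numpy()
same = np.allclose(via_mul.to_numpy(), via_numpy)

# Total seconds for 5 calls of each spelling used in the comment above.
t_mul = timeit.timeit(lambda: df.mul(xx, axis=1), number=5)
t_np = timeit.timeit(lambda: np.multiply(df, xx), number=5)

print(f"results match: {same}")
print(f"df.mul(xx, axis=1): {t_mul / 5 * 1e3:.3f} ms per call")
print(f"np.multiply(df, xx): {t_np / 5 * 1e3:.3f} ms per call")
```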

@simonjayhawkins
Member

> Do I understand correctly that the slow DataFrame.mul, DataFrame.add, DataFrame.div, and DataFrame.sub all belong here (similar to #30267), or is that a separate issue?

I'm not sure why this issue was mentioned in #30267.

timings for #30267 are now comparable using master

%timeit x1 = df * 50
# reported: 258 ms ± 14.6 ms per loop
# master:   2.78 ms ± 91.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit x2 = df * df
# reported: 1.57 ms ± 9.16 µs per loop
# master:   2.67 ms ± 55.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit x3 = np.multiply(df, 50)
# reported: 878 µs ± 71.7 µs per loop
# master:   3.28 ms ± 39.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

also getting comparable timings with df.mul using master

%timeit df.mul(xx, axis=1)
# reported: 328 ms ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# master:   2.93 ms ± 48.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit np.multiply(df, xx)
# reported: 935 µs ± 61.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# master:   3.49 ms ± 88.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@vfilimonov
Contributor

Great, thank you @simonjayhawkins. I look forward to 1.1!
(And it looks like pandas is now faster than NumPy?!)

P.S. Just to make sure: on 1.0.5, full DataFrame multiplication is also 2-3 times slower than NumPy:

df = pd.DataFrame(np.random.randn(500, 1000))
%timeit df * df  # 1.8 ms ± 149 µs per loop
%timeit df.mul(df)  # 1.74 ms ± 134 µs per loop
%timeit np.multiply(df, df)  # 742 µs ± 49.2 µs per loop

Is it now comparable in master as well?
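
The three spellings timed above are expected to compute the same elementwise product, so any timing difference comes from dispatch overhead rather than from different results. A small sketch that checks this, assuming a fixed seed chosen only to make the example reproducible:

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

# Seeded generator so the check is repeatable (the seed value is arbitrary).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((500, 1000)))

a = df * df              # operator form
b = df.mul(df)           # method form
c = np.multiply(df, df)  # NumPy ufunc applied to DataFrames

# All three should yield identical DataFrames; these raise on any mismatch.
assert_frame_equal(a, b)
assert_frame_equal(a, c)
print("all three spellings agree")
```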

@simonjayhawkins
Member

yep, getting comparable numbers for those too.

%timeit df * df
# reported (1.0.5): 1.8 ms ± 149 µs per loop
# master:           2.82 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.mul(df)
# reported (1.0.5): 1.74 ms ± 134 µs per loop
# master:           2.99 ms ± 219 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit np.multiply(df, df)
# reported (1.0.5): 742 µs ± 49.2 µs per loop
# master:           3.11 ms ± 42.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

@vfilimonov
Contributor

Wonderful! Thanks a lot!

@jbrockmendel added the setops label on Jun 17, 2021
@mroeschke removed this from the Contributions Welcome milestone on Oct 13, 2022
@mroeschke
Member

I don't see the remaining ops in the checklist dispatching to MultiIndex, so I think we can close this one.


6 participants