PERF: .median(axis=1) perf issues #16468

Closed
jreback opened this Issue May 23, 2017 · 5 comments

@jreback
Contributor

jreback commented May 23, 2017

In [2]: df = pd.DataFrame(np.random.randn(10000, 2), columns=list('AB'))

In [3]: result1 = df.median(1)

In [4]: result2 = pd.Series(np.nanmedian(df.values, axis=1), index=df.index)

In [5]: result1.equals(result2)
Out[5]: True

In [6]: %timeit result1 = df.median(1)
241 µs ± 4.71 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [7]: %timeit df.median(1)
250 µs ± 5.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [8]: %timeit pd.Series(np.nanmedian(df.values, axis=1), index=df.index)
1.77 ms ± 32.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [9]: pd.set_option('use_bottleneck', False)

In [10]: result3 = df.median(1)

In [11]: result1.equals(result3)
Out[11]: True

In [12]: %timeit df.median(1)
317 ms ± 9.66 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So, if bottleneck is installed, df.median(1) is blazingly fast. However, if it's NOT installed (or not used), we fall back to np.apply_along_axis(our_median_impl). Our median impl is pretty fast itself, but it only handles 1-D input, so this ends up as a Python-level loop over the rows.
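To illustrate the cost described above, here is a minimal sketch (not pandas' actual code). `median_1d` is a hypothetical stand-in for a median implementation that only accepts 1-D input; the point is that np.apply_along_axis calls it once per row from Python, while np.nanmedian does the same work in a single vectorized call.

```python
import numpy as np


def median_1d(values):
    """Hypothetical 1-D, NaN-skipping median (stand-in for a 1-D-only impl)."""
    values = values[~np.isnan(values)]
    if values.size == 0:
        return np.nan
    return np.median(values)


rng = np.random.default_rng(0)
arr = rng.standard_normal((10000, 2))
arr[0, 0] = np.nan  # include one missing value

# Calls median_1d once per row: 10000 Python-level function calls.
looped = np.apply_along_axis(median_1d, 1, arr)

# Same reduction in a single vectorized call.
vectorized = np.nanmedian(arr, axis=1)

# Both approaches agree on every row.
assert np.allclose(looped, vectorized, equal_nan=True)
```

Timing the two calls with %timeit should show a gap of the same kind as in the session above, though the exact numbers depend on the machine and NumPy version.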

To fix this, we can use np.nanmedian when it's available (it's in numpy >= 1.9; we currently support >= 1.7).

@jreback jreback added this to the Next Major Release milestone May 23, 2017

@rohanp

Contributor

rohanp commented May 25, 2017

I get similar results

>>> df = pd.DataFrame(np.random.randn(10000, 2), columns=list('AB'))

>>> pd.set_option('use_bottleneck', False)
>>> %timeit df.median(1) 
327 ms ± 4.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit pd.Series(np.nanmedian(df.values, axis=1), index=df.index)
1.83 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> pd.set_option('use_bottleneck', True)
>>> %timeit df.median(1)
239 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Why isn't bottleneck a dependency of Pandas? I didn't even know I did not have it installed until now. Even when I set pd.set_option('use_bottleneck', True), Pandas did not give me any warning that I did not have it installed.
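Since setting the option gives no feedback, one way to check for the package directly is to ask the import system. This is a small sketch using only the standard library; `has_module` is a hypothetical helper name, not a pandas function.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


print("bottleneck installed:", has_module("bottleneck"))
```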

@jreback

Contributor

jreback commented May 25, 2017

see here: http://pandas.pydata.org/pandas-docs/stable/install.html#recommended-dependencies

these could be deps, but pip used to have trouble with these things and they didn't work on all platforms.

Also see #9422, about behavior that bottleneck changed in 1.0 (breaking the previous, IMHO correct, API).

@jreback

Contributor

jreback commented May 25, 2017

In any event, this is easily fixed by using np.nanmedian, as I said (which, again, only became available in the last 1-2 years).

@rohanp

Contributor

rohanp commented May 25, 2017

okay, working on the fix

@rohanp

Contributor

rohanp commented May 25, 2017

done: #16509

@jreback jreback modified the milestones: Next Major Release, 0.23.0 Jan 21, 2018
