PERF: optimized median func when bottleneck not present #16509

Merged
merged 18 commits into pandas-dev:master on Jan 22, 2018

Conversation

6 participants
@rohanp
Contributor

rohanp commented May 25, 2017

closes #16468

Uses np.nanmedian to compute the median instead of the internal algos.median when bottleneck is not present or is turned off. This has order-of-magnitude performance benefits:

df = pd.DataFrame(np.random.randn(10000, 2), columns=list('AB'))

Before

>>> pd.set_option('use_bottleneck', False)
>>> %timeit df.median(1)
318 ms ± 8.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit pd.Series(np.nanmedian(df.values, axis=1), index=df.index)
1.83 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> pd.set_option('use_bottleneck', True)
>>> %timeit df.median(1)
239 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

After

>>> pd.set_option('use_bottleneck', False)
>>> %timeit df.median(1)
1.89 ms ± 53.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> pd.set_option('use_bottleneck', True)
>>> %timeit df.median(1)
227 µs ± 11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
values, mask, dtype, dtype_max = _get_values(values, skipna)
def get_median(x):


@jreback

jreback May 25, 2017

Contributor

this will break on numpy < 1.9; you need to do this conditionally.

@jreback


Contributor

jreback commented May 25, 2017

needs an asv (run with and w/o bottleneck)

@rohanp


Contributor

rohanp commented May 25, 2017

Do I need to run all the tests, or can I get away with a subset?

@rohanp


Contributor

rohanp commented May 25, 2017

Also do I have to manually turn bottleneck on and off for the tests?

@jreback


Contributor

jreback commented May 25, 2017

Also do I have to manually turn bottleneck on and off for the tests?

have 2 tests, 1 with it forced on, 1 with forced off

Do i need to run all the tests or can i get away with a subset?

just the median tests :>
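The "two tests, one with bottleneck forced on, one forced off" suggestion could look something like the sketch below. The helper name is hypothetical; note that newer pandas spells the option 'compute.use_bottleneck', while at the time of this PR it was 'use_bottleneck'.

```python
import numpy as np
import pandas as pd

def run_median_check(use_bottleneck):
    # Hypothetical test helper: force the option one way, then verify
    # that DataFrame.median matches np.nanmedian row-wise.
    pd.set_option('compute.use_bottleneck', use_bottleneck)
    try:
        df = pd.DataFrame(np.random.randn(100, 4))
        result = df.median(axis=1)
        expected = pd.Series(np.nanmedian(df.values, axis=1), index=df.index)
        pd.testing.assert_series_equal(result, expected)
    finally:
        pd.set_option('compute.use_bottleneck', True)

run_median_check(False)
run_median_check(True)
```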

@jreback


Contributor

jreback commented May 25, 2017

actually I don't think bottleneck is in our asv config. (separate issue is it should be)

@rohanp


Contributor

rohanp commented May 25, 2017

just the median tests :>

does that just entail test_nanops.py?

@TomAugspurger


Contributor

TomAugspurger commented May 26, 2017

@rohanp asv continuous -f 1.1 upstream/master HEAD -b median should work. See http://pandas.pydata.org/pandas-docs/stable/contributing.html#running-the-performance-test-suite

@rohanp


Contributor

rohanp commented May 27, 2017

I don't understand why the tests are failing. How do I see the test cases that failed?

@jreback


Contributor

jreback commented May 27, 2017

click on the red x

@rohanp


Contributor

rohanp commented May 29, 2017

I'm not quite sure what you mean. I figured out how to see the details of the tests that failed; for example on circleci I can see

self = <pandas.tests.test_nanops.TestnanopsDataFrame object at 0x7f67d2d65c88>

    def test_nanmedian(self):
        with warnings.catch_warnings(record=True):
            self.check_funs(nanops.nanmedian, np.median, allow_complex=False,
                            allow_str=False, allow_date=False,
>                           allow_tdelta=True, allow_obj='convert')

pandas/tests/test_nanops.py:374: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/tests/test_nanops.py:245: in check_funs
    **kwargs)
pandas/tests/test_nanops.py:234: in check_fun
    targarnanval, **kwargs)
pandas/tests/test_nanops.py:193: in check_fun_data
    check_dtype=check_dtype)
pandas/tests/test_nanops.py:146: in check_results
    tm.assert_almost_equal(targ, res, check_dtype=check_dtype)
pandas/util/testing.py:234: in assert_almost_equal
    **kwargs)
pandas/_libs/testing.pyx:59: in pandas._libs.testing.assert_almost_equal (pandas/_libs/testing.c:4156)
    cpdef assert_almost_equal(a, b,
pandas/_libs/testing.pyx:173: in pandas._libs.testing.assert_almost_equal (pandas/_libs/testing.c:3274)
    raise_assert_detail(obj, msg, lobj, robj)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

obj = 'numpy array', message = 'numpy array values are different (20.0 %)'
left = '[[nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan]]'
right = '[[0.0511871111488, -0.458724351947, -0.0208572588354, 0.00471620824214, -0.487450340243], [0.254176310551, -0.0309721..., -0.2722898607, 0.174928469595], [-0.431481894512, 0.245270844279, 0.140055994353, -0.0727601005076, 0.238682362073]]'

But what I do not understand is how to find the input that caused the test to fail. I checked the test_nanops.py file and while it has the tests it does not seem to have the test cases.

@TomAugspurger


Contributor

TomAugspurger commented May 30, 2017

@rohanp these tests have several layers of indirection (to allow more code reuse). I would try pytest pandas/tests/test_nanops.py -k test_nanmedian --pdb to drop into a debugger on failure so you can inspect what's going on. But the basic idea is test_nanmedian -> check_funs -> check_fun -> check_fun_data -> check_results, which is failing for one of the input arrays.

The actual arrays will be something like self.arr_float or arr_float_nan and are accessed with getattr.

@jreback


Contributor

jreback commented Jun 10, 2017

can you rebase and update?

@jreback


Contributor

jreback commented Jul 7, 2017

can you rebase and update

@jreback


Contributor

jreback commented Jul 19, 2017

can you rebase / update according to comments

@pep8speaks


pep8speaks commented Aug 8, 2017

Hello @rohanp! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 21, 2018 at 21:03 Hours UTC
@rohanp


Contributor

rohanp commented Aug 8, 2017

Hi, sorry for the delay. I think I figured out the bug. Thanks @TomAugspurger for the debugging command. How do I rebase and update?

@jorisvandenbossche


Member

jorisvandenbossche commented Aug 8, 2017

@rohanp you can do the following (assuming that pandas-dev/pandas remote repo is called 'upstream')

git fetch upstream
git checkout rohanp
git merge upstream/master # after this you can fix merge conflicts, and do 'git commit' to continue
git push origin rohanp
@@ -36,6 +37,7 @@ def set_use_bottleneck(v=True):
_USE_BOTTLENECK = v


@jorisvandenbossche

jorisvandenbossche Aug 8, 2017

Member

you can remove this blank line? (in general, please try to avoid such changes on lines that you are not changing for this PR)

@@ -326,14 +329,19 @@ def nanmean(values, axis=None, skipna=True):
@bottleneck_switch()
def nanmedian(values, axis=None, skipna=True):
if not skipna and isnull(np.sum(values)):
return np.median(values, axis=axis)


@jorisvandenbossche

jorisvandenbossche Aug 8, 2017

Member

Is this special case needed? np.nanmedian is almost as fast as np.median, even when there are no NaNs.
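(The claim that np.nanmedian is nearly as fast on NaN-free data can be checked quickly; a minimal sketch, with the timing numbers left machine-dependent:)

```python
import numpy as np
from timeit import timeit

# On NaN-free data the two functions agree; np.nanmedian only adds the
# overhead of scanning for NaNs before computing the median.
arr = np.random.randn(100_000)
assert np.isclose(np.median(arr), np.nanmedian(arr))

t_median = timeit(lambda: np.median(arr), number=20)
t_nanmedian = timeit(lambda: np.nanmedian(arr), number=20)
print(f"np.median: {t_median:.4f}s  np.nanmedian: {t_nanmedian:.4f}s")
```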


@jreback

jreback Aug 8, 2017

Contributor

this is not needed


@jorisvandenbossche

jorisvandenbossche Aug 9, 2017

Member

@rohanp Can you deal with this comment? (remove or give some explanation on why you think it is needed)

BTW, isnull was recently renamed to isna in this file; that is the reason all tests are failing now.

@jreback

need a whatsnew note in performance section.

@@ -326,14 +329,19 @@ def nanmean(values, axis=None, skipna=True):
@bottleneck_switch()
def nanmedian(values, axis=None, skipna=True):
if not skipna and isnull(np.sum(values)):
return np.median(values, axis=axis)
values, mask, dtype, dtype_max = _get_values(values, skipna)


@jreback

jreback Aug 8, 2017

Contributor

use _np_version_under1p9, which is already defined


@jreback

jreback Aug 8, 2017

Contributor

then define

if _np_version_under1p9:
    agg1d = get_median
    aggaxis = lambda values, axis: np.apply_along_axis(get_median, axis, values)
else:
    agg1d = np.nanmedian
    aggaxis = lambda values, axis: np.nanmedian(values, axis=axis)

then the code is almost the same as before, just replace with agg1d and aggaxis
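A runnable version of that dispatch might look like the following. The `hasattr` check stands in for pandas' numpy-version flag, and `get_median` is a hypothetical scalar fallback for numpy versions without np.nanmedian:

```python
import numpy as np

_has_nanmedian = hasattr(np, 'nanmedian')  # np.nanmedian exists from numpy 1.9 on

def get_median(x):
    # scalar fallback: drop NaNs by hand, return NaN if nothing is left
    mask = ~np.isnan(x)
    return np.median(x[mask]) if mask.any() else np.nan

if _has_nanmedian:
    agg1d = np.nanmedian
    aggaxis = lambda values, axis: np.nanmedian(values, axis=axis)
else:
    agg1d = get_median
    aggaxis = lambda values, axis: np.apply_along_axis(get_median, axis, values)
```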

@@ -326,14 +329,19 @@ def nanmean(values, axis=None, skipna=True):
@bottleneck_switch()
def nanmedian(values, axis=None, skipna=True):
if not skipna and isnull(np.sum(values)):
return np.median(values, axis=axis)


@jreback

jreback Aug 8, 2017

Contributor

this is not needed

@jorisvandenbossche jorisvandenbossche added this to the 0.21.0 milestone Aug 9, 2017

@jorisvandenbossche

update looks good, please have a look at the remaining comment

Further:

  • can you add a whatsnew note? (in performance improvements section)
@@ -326,14 +329,19 @@ def nanmean(values, axis=None, skipna=True):
@bottleneck_switch()
def nanmedian(values, axis=None, skipna=True):
if not skipna and isnull(np.sum(values)):
return np.median(values, axis=axis)


@jorisvandenbossche

jorisvandenbossche Aug 9, 2017

Member

@rohanp Can you deal with this comment? (remove or give some explanation on why you think it is needed)

BTW, isnull was recently renamed to isna in this file; that is the reason all tests are failing now.

@@ -189,6 +189,10 @@ def check_fun_data(self, testfunc, targfunc, testarval, targarval,
targ = targfunc(targartempval, axis=axis, **kwargs)
res = testfunc(testarval, axis=axis, skipna=skipna,
**kwargs)
#if not np.allclose(targ, res):
# import pdb; pdb.set_trace()


@jorisvandenbossche

jorisvandenbossche Aug 9, 2017

Member

leftover to remove

@rohanp


Contributor

rohanp commented Aug 9, 2017

Hi @jorisvandenbossche, my changes are causing the tests to fail. I'm still looking into why. I'll remove that once I get to the bottom of what the problem is.

@jorisvandenbossche


Member

jorisvandenbossche commented Aug 10, 2017

my changes are causing the tests to fail. I'm still looking into why

see my comment above: isnull was recently renamed to isna in this file, which is the reason all tests are failing now (but we also asked you to remove the part that uses isnull, or explain why it is needed)

@jreback

couple of things to remove. pls add a whatsnew note (in perf). ping on green.

@@ -342,6 +343,9 @@ def nanmean(values, axis=None, skipna=True):
@bottleneck_switch()
def nanmedian(values, axis=None, skipna=True):
#if False and not skipna and isna(np.sum(values)):


@jreback

jreback Aug 12, 2017

Contributor

remove

@rohanp


Contributor

rohanp commented Aug 12, 2017

Hi, I had put that in because the tests were failing and I thought that would fix it (but apparently not). The tests are still failing without it.

@rohanp


Contributor

rohanp commented Aug 19, 2017

@jorisvandenbossche thank you for that catch! I've now double checked that nothing in the diff has changed except for the contents of nanmedian, and the tests are passing on my local computer, but some of the same errors are popping up on the CI tests.

pandas/util/testing.py:1105: AssertionError
___________________________ TestPanel4d.test_median ____________________________

self = <pandas.tests.test_panel4d.TestPanel4d object at 0x7f3358785e50>

def test_median(self):
    def wrapper(x):
        if isna(x).any():
            return np.nan
        return np.median(x)
  self._check_stat_op('median', wrapper)

pandas/tests/test_panel4d.py:54:

@@ -344,12 +344,22 @@ def nanmedian(values, axis=None, skipna=True):
values, mask, dtype, dtype_max = _get_values(values, skipna)
if not skipna:


@jreback

jreback Aug 19, 2017

Contributor

why are you adding this?


@rohanp

rohanp Aug 19, 2017

Contributor

The tests compare nanmedian to numpy's median function. If this special case were not there, this function would return a non-null result for input with null values and skipna turned off. Here is a debug log demonstrating this: https://gist.github.com/rohanp/1082ad5a5e199f6b1cb8ade965c4519f


@jreback

jreback Aug 19, 2017

Contributor

so fix the test to do the correct comparison


@rohanp

rohanp Aug 19, 2017

Contributor

Isn't the desired behavior for nanmedian to be like median when skipna is turned off though? If not, then what is the skipna flag supposed to do?


@jreback

jreback Aug 20, 2017

Contributor

yes, so use the skipna flag as a switch when generating which function to use.
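That suggestion, using skipna to select the aggregation function rather than special-casing NaN inputs up front, could be sketched like this (a simplified illustration, not the final pandas code):

```python
import numpy as np

def nanmedian_switch(values, axis=None, skipna=True):
    # Pick the aggregation function based on skipna:
    # np.nanmedian ignores NaNs; np.median propagates them.
    func = np.nanmedian if skipna else np.median
    return func(values, axis=axis)
```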

@rohanp


Contributor

rohanp commented Aug 22, 2017

It looks like the tests are passing when I run pytest pandas locally as well as on AppVeyor, but failing on Travis CI and CircleCI.

Could someone please explain how I can replicate the CircleCI environment on my local computer so that I can debug? I tried looking in the ci folder, but I am not sure what I should be looking for (a conda environment.yml?)

@gfyoung


Member

gfyoung commented Aug 22, 2017

The Travis failures are style-checking errors, which you can see here:

https://travis-ci.org/pandas-dev/pandas/jobs/267057585#L1937

Address those lint issues, and you should get a green-light from Travis.

The Circle failures are coming from Python 2.7. Are you using Python 3.x to test your changes? If so, then you should create a Python 2.x environment (either with virtualenv or conda) in which you should install your changes and test.

@jorisvandenbossche


Member

jorisvandenbossche commented Aug 23, 2017

@rohanp The skipna_switch feels very complex to me. Is it not possible to achieve the same with a switch inside the nanmedian function?

You still have failing median tests on Circle CI. Are you able to reproduce that locally?

@jreback


Contributor

jreback commented Aug 23, 2017

@rohanp yeah I would go back to the previous iteration. This needs to use the version detection as well (you had previously)

@jorisvandenbossche


Member

jorisvandenbossche commented Aug 23, 2017

This needs to use the version detection as well (you had previously)

why? That was for numpy < 1.9, which we just dropped support for?

@rohanp


Contributor

rohanp commented Aug 23, 2017

Okay, I will undo skipna_switch.

I am not able to reproduce the CircleCI failures locally, even after creating a Python 2.7 environment (all the tests pass locally). Here is the conda environment I am using:

apipkg                    1.4                       <pip>
dateutil                  2.4.1                    py27_0  
execnet                   1.4.1                     <pip>
joblib                    0.11                      <pip>
lxml                      3.8.0                     <pip>
mkl                       2017.0.3                      0  
numpy                     1.13.1                   py27_0  
openssl                   1.0.2k                        2  
pip                       9.0.1                    py27_1  
py                        1.4.34                    <pip>
pytest                    3.2.1                     <pip>
pytest-forked             0.2                       <pip>
pytest-xdist              1.20.0                    <pip>
python                    2.7.13                        0  
pytz                      2017.2                   py27_0  
readline                  6.2                           2  
setuptools                27.2.0                   py27_0  
six                       1.10.0                   py27_0  
sqlite                    3.13.0                        0  
tk                        8.5.18                        0  
wheel                     0.29.0                   py27_0  
zlib                      1.2.8                         3 
@jreback


Contributor

jreback commented Sep 22, 2017

can you rebase / update

@jreback


Contributor

jreback commented Oct 28, 2017

pls rebase

@jreback


Contributor

jreback commented Nov 12, 2017

can you rebase

@jreback


Contributor

jreback commented Dec 11, 2017

master

In [1]:  pd.set_option('use_bottleneck', False)

In [2]: np.random.seed(1234); df = pd.DataFrame(np.random.randn(100000, 4))

In [3]: %timeit df.median(1)
2.86 s ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

PR

In [1]:  pd.set_option('use_bottleneck', False)

In [2]: np.random.seed(1234); df = pd.DataFrame(np.random.randn(100000, 4))

In [3]: %timeit df.median(1)
21.7 ms ± 470 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

@jreback jreback added this to the 0.23.0 milestone Jan 21, 2018

@jreback jreback merged commit 3702492 into pandas-dev:master Jan 22, 2018

3 checks passed

ci/circleci: Your tests passed on CircleCI!
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed