
PERF: optimized median func when bottleneck not present #16509

Merged
merged 18 commits into pandas-dev:master from rohanp:rohanp on Jan 22, 2018

Conversation

@rohanp
Contributor

@rohanp rohanp commented May 25, 2017

closes #16468

Uses np.nanmedian to compute the median instead of the internal algos.median when bottleneck is not present or is turned off. This yields order-of-magnitude performance benefits:

df = pd.DataFrame(np.random.randn(10000, 2), columns=list('AB'))

Before

>>> pd.set_option('use_bottleneck', False)
>>> %timeit df.median(1)
318 ms ± 8.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit pd.Series(np.nanmedian(df.values, axis=1), index=df.index)
1.83 ms ± 123 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> pd.set_option('use_bottleneck', True)
>>> %timeit df.median(1)
239 µs ± 2.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

After

>>> pd.set_option('use_bottleneck', False)
>>> %timeit df.median(1)
1.89 ms ± 53.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> pd.set_option('use_bottleneck', True)
>>> %timeit df.median(1)
227 µs ± 11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
values, mask, dtype, dtype_max = _get_values(values, skipna)

def get_median(x):


@jreback

jreback May 25, 2017
Contributor

and this will break in numpy < 1.9, you need to do this conditionally.

@jreback
Contributor

@jreback jreback commented May 25, 2017

needs an asv (run with and w/o bottleneck)

@rohanp
Contributor Author

@rohanp rohanp commented May 25, 2017

Do I need to run all the tests or can I get away with a subset?

@rohanp
Contributor Author

@rohanp rohanp commented May 25, 2017

Also do I have to manually turn bottleneck on and off for the tests?

@jreback
Contributor

@jreback jreback commented May 25, 2017

Also do I have to manually turn bottleneck on and off for the tests?

have 2 tests: 1 with it forced on, 1 with it forced off

Do I need to run all the tests or can I get away with a subset?

just the median tests :>

@jreback
Contributor

@jreback jreback commented May 25, 2017

actually I don't think bottleneck is in our asv config. (separate issue is it should be)

@rohanp
Contributor Author

@rohanp rohanp commented May 25, 2017

just the median tests :>

does that just entail test_nanops.py?

@TomAugspurger
Contributor

@TomAugspurger TomAugspurger commented May 26, 2017

@rohanp asv continuous -f 1.1 upstream/master HEAD -b median should work. See http://pandas.pydata.org/pandas-docs/stable/contributing.html#running-the-performance-test-suite

@rohanp
Contributor Author

@rohanp rohanp commented May 27, 2017

I don't understand why the tests are failing. How do I see the test cases that failed?

@jreback
Contributor

@jreback jreback commented May 27, 2017

click on the red x

@rohanp
Contributor Author

@rohanp rohanp commented May 29, 2017

I'm not quite sure what you mean. I figured out how to see the details of the tests that failed; for example, on CircleCI I can see

self = <pandas.tests.test_nanops.TestnanopsDataFrame object at 0x7f67d2d65c88>

    def test_nanmedian(self):
        with warnings.catch_warnings(record=True):
            self.check_funs(nanops.nanmedian, np.median, allow_complex=False,
                            allow_str=False, allow_date=False,
>                           allow_tdelta=True, allow_obj='convert')

pandas/tests/test_nanops.py:374: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
pandas/tests/test_nanops.py:245: in check_funs
    **kwargs)
pandas/tests/test_nanops.py:234: in check_fun
    targarnanval, **kwargs)
pandas/tests/test_nanops.py:193: in check_fun_data
    check_dtype=check_dtype)
pandas/tests/test_nanops.py:146: in check_results
    tm.assert_almost_equal(targ, res, check_dtype=check_dtype)
pandas/util/testing.py:234: in assert_almost_equal
    **kwargs)
pandas/_libs/testing.pyx:59: in pandas._libs.testing.assert_almost_equal (pandas/_libs/testing.c:4156)
    cpdef assert_almost_equal(a, b,
pandas/_libs/testing.pyx:173: in pandas._libs.testing.assert_almost_equal (pandas/_libs/testing.c:3274)
    raise_assert_detail(obj, msg, lobj, robj)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

obj = 'numpy array', message = 'numpy array values are different (20.0 %)'
left = '[[nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan], [nan, nan, nan, nan, nan]]'
right = '[[0.0511871111488, -0.458724351947, -0.0208572588354, 0.00471620824214, -0.487450340243], [0.254176310551, -0.0309721..., -0.2722898607, 0.174928469595], [-0.431481894512, 0.245270844279, 0.140055994353, -0.0727601005076, 0.238682362073]]'

But what I do not understand is how to find the input that caused the test to fail. I checked the test_nanops.py file, and while it has the tests, it does not seem to have the test cases.

@TomAugspurger
Contributor

@TomAugspurger TomAugspurger commented May 30, 2017

@rohanp these tests have several layers of indirection (to allow more code reuse). I would try pytest pandas/tests/test_nanops.py -k test_nanmedian --pdb to drop into a debugger on failure so you can inspect what's going on. But the basic idea is test_nanmedian -> check_funs -> check_fun -> check_fun_data -> check_results, which is failing for one of the input arrays.

The actual arrays will be something like self.arr_float or arr_float_nan and are accessed with getattr.

@jreback
Contributor

@jreback jreback commented Jun 10, 2017

can you rebase and update?

@jreback
Contributor

@jreback jreback commented Jul 7, 2017

can you rebase and update

@jreback
Contributor

@jreback jreback commented Jul 19, 2017

can you rebase / update according to comments

@pep8speaks

@pep8speaks pep8speaks commented Aug 8, 2017

Hello @rohanp! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 21, 2018 at 21:03 UTC
@rohanp
Contributor Author

@rohanp rohanp commented Aug 8, 2017

Hi, sorry for the delay. I think I figured out the bug. Thanks @TomAugspurger for the debugging command. How do I rebase and update?

@jorisvandenbossche
Member

@jorisvandenbossche jorisvandenbossche commented Aug 8, 2017

@rohanp you can do the following (assuming that pandas-dev/pandas remote repo is called 'upstream')

git fetch upstream
git checkout rohanp
git merge upstream/master # after this you can fix merge conflicts, and do 'git commit' to continue
git push origin rohanp
@@ -36,6 +37,7 @@ def set_use_bottleneck(v=True):
_USE_BOTTLENECK = v




@jorisvandenbossche

jorisvandenbossche Aug 8, 2017
Member

you can remove this blank line? (in general, please try to avoid such changes on lines that you are not changing for this PR)

@@ -326,14 +329,19 @@ def nanmean(values, axis=None, skipna=True):
@bottleneck_switch()
def nanmedian(values, axis=None, skipna=True):

if not skipna and isnull(np.sum(values)):
return np.median(values, axis=axis)


@jorisvandenbossche

jorisvandenbossche Aug 8, 2017
Member

Is this special case needed? np.nanmedian is almost as fast as np.median, even when there are no NaNs.


@jreback

jreback Aug 8, 2017
Contributor

this is not needed


@jorisvandenbossche

jorisvandenbossche Aug 9, 2017
Member

@rohanp Can you deal with this comment? (remove or give some explanation on why you think it is needed)

BTW, isnull was recently renamed to isna in this file; that is the reason all tests are failing now.

Contributor

@jreback jreback left a comment

need a whatsnew note in performance section.

@@ -326,14 +329,19 @@ def nanmean(values, axis=None, skipna=True):
@bottleneck_switch()
def nanmedian(values, axis=None, skipna=True):

if not skipna and isnull(np.sum(values)):
return np.median(values, axis=axis)

values, mask, dtype, dtype_max = _get_values(values, skipna)


@jreback

jreback Aug 8, 2017
Contributor

use _np_version_under1p9, which is already defined


@jreback

jreback Aug 8, 2017
Contributor

then define

if _np_version_under1p9:
    agg1d = get_median
    aggaxis = lambda values, axis: np.apply_along_axis(get_median, axis, values)
else:
    agg1d = np.nanmedian
    aggaxis = lambda values, axis: np.nanmedian(values, axis)

then the code is almost the same as before, just replace with agg1d and aggaxis
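A self-contained sketch of that suggestion might look like the following; the names `get_median` and `nanmedian_sketch`, and the inline version check, are illustrative stand-ins here, not the actual pandas internals:

```python
import numpy as np

# Sketch only: select the aggregation functions once, based on the numpy
# version, then use them uniformly (assumption: np.nanmedian exists on
# numpy >= 1.9, so older versions need a manual fallback).
_np_version_under1p9 = np.lib.NumpyVersion(np.__version__) < "1.9.0"

def get_median(x):
    # Manual NaN-skipping median, as a fallback for old numpy.
    mask = ~np.isnan(x)
    if not mask.any():
        return np.nan
    return np.median(x[mask])

if _np_version_under1p9:
    agg1d = get_median
    aggaxis = lambda values, axis: np.apply_along_axis(get_median, axis, values)
else:
    agg1d = np.nanmedian
    aggaxis = lambda values, axis: np.nanmedian(values, axis)

def nanmedian_sketch(values, axis=None):
    # Dispatch to the pre-selected aggregation functions.
    if axis is None or values.ndim == 1:
        return agg1d(values)
    return aggaxis(values, axis)
```

This keeps the per-call body identical on both numpy versions, which is the point of jreback's suggestion: only the one-time selection of `agg1d`/`aggaxis` is conditional.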

@@ -326,14 +329,19 @@ def nanmean(values, axis=None, skipna=True):
@bottleneck_switch()
def nanmedian(values, axis=None, skipna=True):

if not skipna and isnull(np.sum(values)):
return np.median(values, axis=axis)


@jreback

jreback Aug 8, 2017
Contributor

this is not needed

@jorisvandenbossche jorisvandenbossche added this to the 0.21.0 milestone Aug 9, 2017
Member

@jorisvandenbossche jorisvandenbossche left a comment

update looks good, please have a look at the remaining comment

Further:

  • can you add a whatsnew note? (in performance improvements section)
@@ -326,14 +329,19 @@ def nanmean(values, axis=None, skipna=True):
@bottleneck_switch()
def nanmedian(values, axis=None, skipna=True):

if not skipna and isnull(np.sum(values)):
return np.median(values, axis=axis)


@jorisvandenbossche

jorisvandenbossche Aug 9, 2017
Member

@rohanp Can you deal with this comment? (remove or give some explanation on why you think it is needed)

BTW, isnull was recently renamed to isna in this file; that is the reason all tests are failing now.

@@ -189,6 +189,10 @@ def check_fun_data(self, testfunc, targfunc, testarval, targarval,
targ = targfunc(targartempval, axis=axis, **kwargs)
res = testfunc(testarval, axis=axis, skipna=skipna,
**kwargs)

#if not np.allclose(targ, res):
# import pdb; pdb.set_trace()


@jorisvandenbossche

jorisvandenbossche Aug 9, 2017
Member

leftover to remove

@rohanp
Contributor Author

@rohanp rohanp commented Aug 9, 2017

Hi @jorisvandenbossche, my changes are causing the tests to fail. I'm still looking into why. I'll remove that once I get to the bottom of what the problem is.

@jorisvandenbossche
Member

@jorisvandenbossche jorisvandenbossche commented Aug 10, 2017

my changes are causing the tests to fail. I'm still looking into why

see my comment above: isnull was recently renamed to isna in this file; that is the reason all tests are failing now (but we also asked to remove the part that uses isnull, or explain why it is needed)

Contributor

@jreback jreback left a comment

couple of things to remove. pls add a whatsnew note (in perf). ping on green.

@@ -342,6 +343,9 @@ def nanmean(values, axis=None, skipna=True):
@bottleneck_switch()
def nanmedian(values, axis=None, skipna=True):

#if False and not skipna and isna(np.sum(values)):


@jreback

jreback Aug 12, 2017
Contributor

remove

@rohanp
Contributor Author

@rohanp rohanp commented Aug 12, 2017

Hi, I had put that in because the tests were failing and I thought that would fix it (but apparently not). The tests are still failing without it.

@@ -344,12 +344,22 @@ def nanmedian(values, axis=None, skipna=True):

values, mask, dtype, dtype_max = _get_values(values, skipna)

if not skipna:


@jreback

jreback Aug 19, 2017
Contributor

why are you adding this?


@rohanp

rohanp Aug 19, 2017
Author Contributor

The tests compare nanmedian to numpy's median function. If this special case were not there, this function would return a non-null result for input with null values and skipna turned off. Here is a debug log demonstrating this: https://gist.github.com/rohanp/1082ad5a5e199f6b1cb8ade965c4519f


@jreback

jreback Aug 19, 2017
Contributor

so fix the test to do the correct comparison


@rohanp

rohanp Aug 19, 2017
Author Contributor

Isn't the desired behavior for nanmedian to be like median when skipna is turned off though? If not, then what is the skipna flag supposed to do?


@jreback

jreback Aug 20, 2017
Contributor

yes, so use the skipna flag as a switch when choosing which function to use.
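That switch could look something like this minimal sketch (the function name `median_sketch` is hypothetical, not the actual pandas code):

```python
import numpy as np

# Minimal sketch of the skipna switch suggested above: pick the reducer
# from the flag up front instead of special-casing inside the body.
def median_sketch(values, axis=None, skipna=True):
    # skipna=True ignores NaNs; skipna=False propagates them,
    # which is np.median's behavior on NaN-containing input.
    func = np.nanmedian if skipna else np.median
    return func(values, axis=axis)
```

With `skipna=False` and a NaN present, `np.median` propagates the NaN, which matches the behavior rohanp describes for the special case above.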

@rohanp
Contributor Author

@rohanp rohanp commented Aug 22, 2017

It looks like the tests are passing when I run pytest pandas locally, as well as on AppVeyor, but failing on Travis CI and CircleCI.

Could someone please explain how I can replicate the environment for CircleCI on my local computer so that I can debug? I tried looking in the ci folder but am not sure what I should be looking for (a conda environment.yml?)

@gfyoung
Member

@gfyoung gfyoung commented Aug 22, 2017

The Travis failures are style-checking errors, which you can see here:

https://travis-ci.org/pandas-dev/pandas/jobs/267057585#L1937

Address those lint issues, and you should get a green-light from Travis.

The Circle failures are coming from Python 2.7. Are you using Python 3.x to test your changes? If so, then you should create a Python 2.x environment (either with virtualenv or conda) in which you should install your changes and test.

@rohanp rohanp force-pushed the rohanp:rohanp branch from 1c7fced to 23bbe73 Aug 23, 2017
@jorisvandenbossche
Member

@jorisvandenbossche jorisvandenbossche commented Aug 23, 2017

@rohanp The skipna_switch feels very complex to me. Is it not possible to achieve the same with a switch inside the nanmedian function?

You still have failing median tests on Circle CI. Are you able to reproduce that locally?

@jreback
Contributor

@jreback jreback commented Aug 23, 2017

@rohanp yeah I would go back to the previous iteration. This needs to use the version detection as well (you had previously)

@jorisvandenbossche
Member

@jorisvandenbossche jorisvandenbossche commented Aug 23, 2017

This needs to use the version detection as well (you had previously)

why? it was for numpy < 1.9, support for which we just dropped?

@rohanp
Contributor Author

@rohanp rohanp commented Aug 23, 2017

Okay, I will undo skipna_switch.

I am not able to reproduce the CircleCI failures locally, even after creating a Python 2.7 environment (all the tests pass locally). Here is the conda environment I am using:

apipkg                    1.4                       <pip>
dateutil                  2.4.1                    py27_0  
execnet                   1.4.1                     <pip>
joblib                    0.11                      <pip>
lxml                      3.8.0                     <pip>
mkl                       2017.0.3                      0  
numpy                     1.13.1                   py27_0  
openssl                   1.0.2k                        2  
pip                       9.0.1                    py27_1  
py                        1.4.34                    <pip>
pytest                    3.2.1                     <pip>
pytest-forked             0.2                       <pip>
pytest-xdist              1.20.0                    <pip>
python                    2.7.13                        0  
pytz                      2017.2                   py27_0  
readline                  6.2                           2  
setuptools                27.2.0                   py27_0  
six                       1.10.0                   py27_0  
sqlite                    3.13.0                        0  
tk                        8.5.18                        0  
wheel                     0.29.0                   py27_0  
zlib                      1.2.8                         3 
@jreback
Contributor

@jreback jreback commented Sep 22, 2017

can you rebase / update

@jreback
Contributor

@jreback jreback commented Oct 28, 2017

pls rebase

@jreback
Contributor

@jreback jreback commented Nov 12, 2017

can you rebase

@jreback
Contributor

@jreback jreback commented Dec 11, 2017

master

In [1]:  pd.set_option('use_bottleneck', False)

In [2]: np.random.seed(1234); df = pd.DataFrame(np.random.randn(100000, 4))

In [3]: %timeit df.median(1)
2.86 s ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

PR

In [1]:  pd.set_option('use_bottleneck', False)

In [2]: np.random.seed(1234); df = pd.DataFrame(np.random.randn(100000, 4))

In [3]: %timeit df.median(1)
21.7 ms ± 470 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
@jreback jreback added this to the 0.23.0 milestone Jan 21, 2018
@jreback jreback merged commit 3702492 into pandas-dev:master Jan 22, 2018
3 checks passed
ci/circleci Your tests passed on CircleCI!
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed