New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add interpolation options to rolling quantile #20497

Merged
merged 16 commits into from Apr 24, 2018

Conversation

Projects
None yet
4 participants
@kornilova-l
Contributor

kornilova-l commented Mar 27, 2018

It version 0.21.0 rolling quantile started to use linear interpolation, it broke backward compatibility.

Regular (not rolling) quantile supports these interpolation options: linear, lower, higher, nearest and midpoint.
This commit adds the same options to moving quantile.

Performance issues of this commit (note: I re-run benchmarks, see message below)
This code has 15% worse performance on benchmarks with small values of window (window=10). This is because loop inside roll_quantile now contains switch.
I tried to replace switch with callback but it led to even worse performance. Even if I move some of the code to new function (without any change in logic) it still makes performance much worse.
How bad is it? Could you please give me an advice on how to arrange the code such that it has the same performance?

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@kornilova-l kornilova-l changed the title from Add interpolation options to moving quantile to Add interpolation options to rolling quantile Mar 27, 2018

@jreback

This comment has been minimized.

Contributor

jreback commented Mar 27, 2018

what issue are you addressing here? if its about performance, pls show representative benchmarks (or run the asv)

it version 0.21.0 rolling quantile started to use linear interpolation, it broke backward compatibility.

if you read this: https://github.com/pandas-dev/pandas/pull/16247/files not sure what to say here, it had different defaults.

@jreback

This comment has been minimized.

Contributor

jreback commented Mar 27, 2018

re-reading your changes I see that you are trying to fix perf. can you run the asv's here and show the results. (you may need to add one if the case you are addressing is not covered).

@kornilova-l

This comment has been minimized.

Contributor

kornilova-l commented Mar 27, 2018

@jreback, hi,

There is no issue related to this pull request. Do I need to create an issue first?

I am aware about the BUG that you referenced. Company at which I work uses rolling quantile without interpolation, when we updated pandas version, we discovered that behavior of rolling quantile has changed and there is no way to get non-interpolated results.

I re-ran asv benchmarks for rolling quantile (default version uses linear interpolation) and it did not show significant changes, sorry for confusing you.

BENCHMARKS NOT SIGNIFICANTLY CHANGED

I added new benchmarks for different interpolation options. I cannot compare them with previous results because interpolation options were not supported (supporting them is the purpose of this commit). Here are new benchmarks:

[ 20.00%] ··· Running rolling.Quantile.time_quantile             1.37±0.2ms;...
[ 40.00%] ··· Running rolling.Quantile.time_quantile_higher      1.38±0.2ms;...
[ 60.00%] ··· Running rolling.Quantile.time_quantile_lower       1.37±0.2ms;...
[ 80.00%] ··· Running rolling.Quantile.time_quantile_midpoint    1.38±0.2ms;...
[100.00%] ··· Running rolling.Quantile.time_quantile_nearest     1.37±0.2ms;...

I am not sure if I should add this new benchmarks to the commit. Should all the code be covered by benchmarks?

@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented Mar 27, 2018

@codecov

This comment has been minimized.

codecov bot commented Mar 28, 2018

Codecov Report

Merging #20497 into master will increase coverage by 0.06%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20497      +/-   ##
==========================================
+ Coverage   91.77%   91.83%   +0.06%     
==========================================
  Files         153      153              
  Lines       49263    49320      +57     
==========================================
+ Hits        45213    45295      +82     
+ Misses       4050     4025      -25
Flag Coverage Δ
#multiple 90.23% <100%> (+0.06%) ⬆️
#single 41.87% <60%> (+0.01%) ⬆️
Impacted Files Coverage Δ
pandas/core/series.py 93.9% <ø> (ø) ⬆️
pandas/core/frame.py 97.17% <ø> (ø) ⬆️
pandas/core/window.py 96.3% <100%> (ø) ⬆️
pandas/core/dtypes/concat.py 99.16% <0%> (-0.02%) ⬇️
pandas/core/strings.py 98.32% <0%> (-0.02%) ⬇️
pandas/api/extensions/__init__.py 100% <0%> (ø) ⬆️
pandas/core/arrays/base.py 84.14% <0%> (ø) ⬆️
pandas/core/indexes/base.py 96.63% <0%> (ø) ⬆️
pandas/core/indexes/datetimelike.py 96.72% <0%> (+0.01%) ⬆️
pandas/core/common.py 92.04% <0%> (+0.02%) ⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0ae19a1...9cc7a71. Read the comment docs.

@jreback

This comment has been minimized.

Contributor

jreback commented Mar 28, 2018

@kornilova-l

There is no issue related to this pull request. Do I need to create an issue first?

Generally create an issue first. Since you already created a PR no need then.

Ok so perf is basically unchanged on small series. If you want to add an asv for that is ok.

@jreback jreback removed the Performance label Mar 28, 2018

@jreback

looking good. comments and please add a line in the Other Enhancements section.

@@ -1357,25 +1357,53 @@ cdef _roll_min_max(ndarray[numeric] input, int64_t win, int64_t minp,
return output
def _get_interpolation_id(str interpolation):

This comment has been minimized.

@jreback

jreback Mar 28, 2018

Contributor

look in algos.pxd for how we handle cdef enum TiebreakEnumType:

e.g you create an enum and a dict to map to that enum. This becomes a bit simpler then.

idx_with_fraction = quantile * <double> (nobs - 1)
idx = int(idx_with_fraction)
if interpolation_id == 0: # linear

This comment has been minimized.

@jreback

jreback Mar 28, 2018

Contributor

so you would use the enums here

if quantile <= 0.0 or quantile >= 1.0:
raise ValueError("quantile value {0} not in [0, 1]".format(quantile))
# interpolation_id is needed to avoid string comparisons inside the loop

This comment has been minimized.

@jreback

jreback Mar 28, 2018

Contributor

don't need the comments here

* lower: `i`.
* higher: `j`.
* nearest: `i` or `j` whichever is nearest.
* midpoint: (`i` + `j`) / 2.""")

This comment has been minimized.

@jreback

jreback Mar 28, 2018

Contributor

can you add a Returns, Examples and a See Also section (look at the Series.quantile doc-string for inspiration). Also add a reference from the Series.quantile doc-string to here

@@ -1135,7 +1135,22 @@ def test_rolling_quantile_series(self):
s = Series(arr)
q1 = s.quantile(0.1)
q2 = s.rolling(100).quantile(0.1).iloc[-1]
tm.assert_almost_equal(q1, q2)

This comment has been minimized.

@jreback

jreback Mar 28, 2018

Contributor

can you move to a separate test and parameterize on the interpolation (inlcude linear as well)

This comment has been minimized.

@jreback

jreback Mar 28, 2018

Contributor

these are all coming to the same result value, is that right?

@jreback jreback added the Enhancement label Mar 28, 2018

@jreback

can you add a whatsnew note (other enhancements section)

2 2.0
3 3.0
dtype: float64
>>> s.rolling(2).quantile(.4, interpolation='midpoint')

This comment has been minimized.

@jreback

jreback Mar 30, 2018

Contributor

can you add a blank line here

@jreback

This comment has been minimized.

Contributor

jreback commented Mar 30, 2018

linting issue:

pandas/core/window.py:1289:1: W293 blank line contains whitespace

you can run
LINT=TRUE ci/lint.sh to see

@jreback

This comment has been minimized.

Contributor

jreback commented Mar 30, 2018

@TomAugspurger comments?

@jreback

This comment has been minimized.

Contributor

jreback commented Mar 30, 2018

@WillAyd if you'd have a look

@WillAyd

Couple comments on my end - is it also possible to use nogil? May help with some of the performance regression?

if quantile <= 0.0 or quantile >= 1.0:
raise ValueError("quantile value {0} not in [0, 1]".format(quantile))
try:
interpolation_type = interpolation_types[interpolation]

This comment has been minimized.

@WillAyd

WillAyd Mar 30, 2018

Member

Wouldn't this raise a KeyError not a ValueError?

output[i] = ((vlow + (vhigh - vlow) *
(quantile * (nobs - 1) - idx)))
idx_with_fraction = quantile * <double> (nobs - 1)
idx = int(idx_with_fraction)

This comment has been minimized.

@WillAyd

WillAyd Mar 30, 2018

Member

Given everything else here is done with C syntax, can we use that casting convention here?

@@ -1138,6 +1138,20 @@ def test_rolling_quantile_series(self):
tm.assert_almost_equal(q1, q2)
@pytest.mark.parametrize('quantile', [0.0, 0.1, 0.45, 0.5, 1])

This comment has been minimized.

@WillAyd

WillAyd Mar 30, 2018

Member

Can we add examples with NA data?

@kornilova-l

This comment has been minimized.

Contributor

kornilova-l commented Apr 1, 2018

@WillAyd I tried to replace range with prange but it does not compile because calling skiplist.get requires gil: Calling gil-requiring function not allowed without gil did I miss something?

@jreback

This comment has been minimized.

Contributor

jreback commented Apr 1, 2018

ok we don't / can't use prange anywhere in the codebase. only you can use nogil

@kornilova-l

This comment has been minimized.

Contributor

kornilova-l commented Apr 1, 2018

@jreback but still using nogil is not possible because there are gil-requiring function calls

@WillAyd

This comment has been minimized.

Member

WillAyd commented Apr 1, 2018

The only gil-requiring function call at the time of that comment was int which is why I asked if it was possible to use the C-style cast instead. I think if you did those in combination it would work. I see that you since added the round built-in so to get nogil to work you'd have to use a C-style version of that as well.

I did something similar in https://github.com/pandas-dev/pandas/pull/20405/files which you may be able to reference

# Tests that rolling window's quantile behavior is analogous to
# Series' quantile for each interpolation option
size = 100
s = Series(np.random.rand(size))

This comment has been minimized.

@WillAyd

WillAyd Apr 1, 2018

Member

I'd advise against using random data in a test case as it could cause intermittent failures and make bugs harder to find. To inject missing data, could you not just parametrize the function with a second array of say [0., np.nan, 0.2, np.nan, 0.4] to match what you had earlier?

if quantile <= 0.0 or quantile >= 1.0:
raise ValueError("quantile value {0} not in [0, 1]".format(quantile))
try:
interpolation_type = interpolation_types[interpolation]
except KeyError:

This comment has been minimized.

@WillAyd

WillAyd Apr 1, 2018

Member

Is there a test case to cover that this raises the expected error message when passing an invalid argument? If not can you add?

This comment has been minimized.

@WillAyd

WillAyd Apr 16, 2018

Member

Minor nit but can you place the name of the passed interpolation in single quotes? Helps distinguish it from the rest of the text in the error message (will need to update test as well)

q2 = s.rolling(size, min_periods=1).quantile(
quantile, interpolation).iloc[-1]
tm.assert_almost_equal(q1, q2)

This comment has been minimized.

@WillAyd

WillAyd Apr 1, 2018

Member

Is almost equal required here or is it possible to use input data so that float precision would not be an issue? I'm trying to be extra cautious that almost_equaldoesn't allow bugs to silently pass through some interpolation options

@kornilova-l

This comment has been minimized.

Contributor

kornilova-l commented Apr 13, 2018

About nogil: I changed int() to <int> and also removed round() function, but after adding nogil compiler still complains about skiplist methods (skiplist.insert, skiplist.remove, skiplist.get).

One of compiler messages:

Error compiling Cython file:
------------------------------------------------------------
...
                        elif idx_with_fraction - idx < 0.5:
                            output[i] = skiplist.get(idx)
                        else:
                            output[i] = skiplist.get(idx + 1)
                    elif interpolation_type == MIDPOINT:
                        vlow = skiplist.get(idx)
                                          ^
------------------------------------------------------------

pandas/_libs/window.pyx:1474:43: Calling gil-requiring function not allowed without gil
else:
output[i] = skiplist.get(idx + 1)
elif interpolation_type == MIDPOINT:
vlow = skiplist.get(idx)

This comment has been minimized.

@jreback

jreback Apr 13, 2018

Contributor

to make this work w/o the gil, you need to call skiplist_get(skiplist, idx) (there are examples in the file). skiplist.get invokes a python function call and is not allowed

@kornilova-l

This comment has been minimized.

Contributor

kornilova-l commented Apr 13, 2018

Finally code works with nogil

There is another problem:
Some tests in python 2.7 environment fail. All failed tests use midpoint interpolation. I am almost sure that it is related to numpy/numpy#7163
Tests compare results of quantile and rolling_quantile. quantile uses np.percentile which used to have a bug in 'midpoint' interpolation.

What should I do with failed tests? Is it okay to test midpoint separately and not compare it to quantile?

@jreback

This comment has been minimized.

Contributor

jreback commented Apr 13, 2018

just skip if < numpy 1.12 on those tests

@@ -6,6 +6,7 @@
from datetime import datetime, timedelta
from numpy.random import randn
import numpy as np
from scipy._lib._version import NumpyVersion

This comment has been minimized.

@jreback

jreback Apr 16, 2018

Contributor

use from pandas import _np_version_under1p12

q2 = s.rolling(len(data), min_periods=1).quantile(
quantile, interpolation).iloc[-1]
if np.isnan(q1):

This comment has been minimized.

@WillAyd

WillAyd Apr 16, 2018

Member

Is tm.assert_series_equal not an option here?

This comment has been minimized.

@kornilova-l

kornilova-l Apr 17, 2018

Contributor

@WillAyd, it is not series type. I edited test data a bit and was able to get rid of round

[np.nan, np.nan, np.nan, np.nan],
[np.nan, 0.1, np.nan, 0.3, 0.4, 0.5],
[0.5], [np.nan, 0.7, 0.5]])
def test_rolling_quantile_interpolation_options(self, quantile,

This comment has been minimized.

@WillAyd

WillAyd Apr 16, 2018

Member

I think this and the above test cover practically the same use case. If that's the case I'd get rid of the test above

if quantile <= 0.0 or quantile >= 1.0:
raise ValueError("quantile value {0} not in [0, 1]".format(quantile))
try:
interpolation_type = interpolation_types[interpolation]
except KeyError:

This comment has been minimized.

@WillAyd

WillAyd Apr 16, 2018

Member

Minor nit but can you place the name of the passed interpolation in single quotes? Helps distinguish it from the rest of the text in the error message (will need to update test as well)

@WillAyd

Couple minor edits otherwise lgtm. Can you post updated ASVs?

* lower: `i`.
* higher: `j`.
* nearest: `i` or `j` whichever is nearest. Implementation uses
round() built-in.

This comment has been minimized.

@WillAyd

WillAyd Apr 18, 2018

Member

Can remove this comment about the round() built-in

-------
Series or DataFrame
Returned object type is determined by the caller of the %(name)s
calculation

This comment has been minimized.

@WillAyd

WillAyd Apr 18, 2018

Member

Very minor but description should end with a period

See Also
--------
pandas.Series.quantile

This comment has been minimized.

@WillAyd

WillAyd Apr 18, 2018

Member

Just add some quick descriptions after these (separated by colon). Can reference docstring guide:

https://python-sprints.github.io/pandas/guide/pandas_docstring.html#section-5-see-also

s = Series(data)
q1 = s.quantile(quantile, interpolation)
q2 = s.rolling(len(data), min_periods=1).quantile(

This comment has been minimized.

@WillAyd

WillAyd Apr 18, 2018

Member

I suppose this is just expanding instead of rolling so should use the former to stay idiomatic

arr = np.random.random(N).astype(dtype)
self.roll = getattr(pd, constructor)(arr).rolling(window)
def time_quantile(self, constructor, window, dtype, percentile):
self.roll.quantile(percentile)
def time_quantile_nearest(self, constructor, window, dtype, percentile):

This comment has been minimized.

@WillAyd

WillAyd Apr 19, 2018

Member

Ha when I asked for updated ASVs I was just asking for a refresh of what you posted earlier in the thread but I suppose this is better. Is it possible to just add interpolation as an argument to setup? Would be more concise.

Here's an example from GroupBy that you could use for reference, though the existing Quantile class could clue you in on how to do this as well:

params = [['int', 'float', 'object', 'datetime'],

This comment has been minimized.

@WillAyd

WillAyd Apr 19, 2018

Member

Please also post results to a comment when done. I'm assuming that these will fail on master but would still be good to know how they baseline against the current implementation that you posted earlier in thread

@jreback

can you rebase and address small comments by @WillAyd, otherwise lgtm.

@@ -402,6 +402,7 @@ Other Enhancements
- :meth:`DataFrame.to_sql` now performs a multivalue insert if the underlying connection supports itk rather than inserting row by row.
``SQLAlchemy`` dialects supporting multivalue inserts include: ``mysql``, ``postgresql``, ``sqlite`` and any dialect with ``supports_multivalues_insert``. (:issue:`14315`, :issue:`8953`)
- :func:`read_html` now accepts a ``displayed_only`` keyword argument to controls whether or not hidden elements are parsed (``True`` by default) (:issue:`20027`)
- :meth:`Rolling.quantile` and :meth:`Expanding.quantile` now accept ``interpolation`` keyword

This comment has been minimized.

@jreback

jreback Apr 21, 2018

Contributor

@jorisvandenbossche do these refs work?

This comment has been minimized.

@kornilova-l

kornilova-l Apr 23, 2018

Contributor

@jreback, how should they work? I built documentation and all references in the whatsnew.html are rendered as plain bold text.

@@ -402,6 +402,7 @@ Other Enhancements
- :meth:`DataFrame.to_sql` now performs a multivalue insert if the underlying connection supports itk rather than inserting row by row.
``SQLAlchemy`` dialects supporting multivalue inserts include: ``mysql``, ``postgresql``, ``sqlite`` and any dialect with ``supports_multivalues_insert``. (:issue:`14315`, :issue:`8953`)
- :func:`read_html` now accepts a ``displayed_only`` keyword argument to controls whether or not hidden elements are parsed (``True`` by default) (:issue:`20027`)
- :meth:`Rolling.quantile` and :meth:`Expanding.quantile` now accept ``interpolation`` keyword

This comment has been minimized.

@jreback

jreback Apr 21, 2018

Contributor

can you add (this) issue number here

@kornilova-l

This comment has been minimized.

Contributor

kornilova-l commented Apr 23, 2018

In ASV benchmarks I added interpolation as a parameter to setup. Performance is the same compared to results that I posted earlier

[100.00%] ··· Running rolling.Quantile.time_quantile                 1.38±0.2ms;...

I will rebase branch

@kornilova-l

This comment has been minimized.

Contributor

kornilova-l commented Apr 23, 2018

Now there are lots of commits from master :( hope I did rebasing correctly

@WillAyd

This comment has been minimized.

Member

WillAyd commented Apr 23, 2018

Shouldn't be including the commits of other contributors - how did you rebase this? Need to figure out how to undo that and only replay your commits on top of master

@kornilova-l

This comment has been minimized.

Contributor

kornilova-l commented Apr 24, 2018

Finally I did rebasing correctly (I hope)
Should I squash these 13 commits into one?

@jreback jreback added this to the 0.23.0 milestone Apr 24, 2018

doc

@jreback jreback merged commit ce8f6e8 into pandas-dev:master Apr 24, 2018

0 of 3 checks passed

ci/circleci Your tests are queued behind your running builds
Details
continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
@jreback

This comment has been minimized.

Contributor

jreback commented Apr 24, 2018

thanks @kornilova-l nice patch!

generally you don't need to squash, you can do it if you want to make things more readable / easier if you want. but not necessary to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment