
use z-scores in rolling skew/kurt calculations #8270

Closed
wants to merge 1 commit into from

Conversation

behzadnouri
Contributor

adds some stability to the rolling skew and kurt calculations by using z-scores instead of the original values; related: #6929

on master:

In [3]: np.random.seed(2718281)
In [4]: xs = np.random.rand(10)

In [5]: pd.stats.moments.rolling_skew(xs, 4) - pd.stats.moments.rolling_skew(xs + 5000, 4)
Out[5]: 
array([             nan,              nan,              nan,
        -1.64816426e-03,   1.10392004e-03,   1.09544324e-03,
         7.99253200e-03,   8.66733795e-05,  -4.48450842e-03,
        -6.03475754e-03])

In [6]: pd.stats.moments.rolling_kurt(xs, 4) - pd.stats.moments.rolling_kurt(xs + 100, 4)
Out[6]: 
array([             nan,              nan,              nan,
        -1.39880824e-04,  -2.91491317e-04,  -2.35463343e-04,
        -7.36046530e-04,  -3.99141811e-04,  -8.61177376e-05,
         9.26058376e-06])

on branch:

In [7]: pd.stats.moments.rolling_skew(xs, 4) - pd.stats.moments.rolling_skew(xs + 5000, 4)
Out[7]: 
array([             nan,              nan,              nan,
        -1.03450581e-12,   7.36521955e-13,   2.42605935e-12,
        -2.43799425e-12,  -1.85962357e-12,  -1.06048503e-12,
         1.66638925e-12])

In [8]: pd.stats.moments.rolling_kurt(xs, 4) - pd.stats.moments.rolling_kurt(xs + 100, 4)
Out[8]: 
array([             nan,              nan,              nan,
        -1.24344979e-13,  -7.99360578e-14,   6.75015599e-14,
        -9.05941988e-14,  -5.15143483e-14,  -1.11910481e-13,
        -5.15143483e-14])

This does not handle all situations that cause instability, but I think it is generally more stable to work with z-scores; in particular, affine transformations of the input generate the same results.
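For reference, a minimal sketch of the z-score idea in plain NumPy (not the actual Cython change in this PR): standardize the finite values once, then run the rolling computation on the z-scores. Since skew and kurtosis are invariant under affine transformations y = a*x + b with a > 0, this does not change the answer, only the conditioning.

import numpy as np

def zscore_preprocess(x):
    # standardize using only finite values; NaN/inf entries are left untouched
    # (assumes a non-constant input, i.e. std > 0)
    x = np.asarray(x, dtype=float)
    mask = np.isfinite(x)
    out = x.copy()
    out[mask] = (x[mask] - x[mask].mean()) / x[mask].std()
    return out

With this preprocessing, an input shifted by a large constant (as in the examples above) produces essentially the same z-scores, so the rolling skew/kurt of xs and xs + 5000 agree up to floating-point noise.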

Also, input is a Python 3 built-in. I suggest renaming all instances of input in the algos.pyx file to inp or something else.

@seth-p
Contributor

seth-p commented Sep 15, 2014

Note also #8086.

@behzadnouri
Contributor Author

@seth-p FYI, a constant series has zero standard deviation, so the skew should be nan, not zero. So that is a bug in Series.skew, not rolling/expanding_skew, and this PR is independent of #8086.

@seth-p
Contributor

seth-p commented Sep 15, 2014

Yep, separate issue. Just hoping someone (:-)) will want to unify Series.skew and rolling/expanding_skew so that they rely on a single code base and are consistent. And similarly with the kurt() functions.

@jreback
Contributor

jreback commented Sep 16, 2014

@behzadnouri can you create a separate issue about Series.skew returning nan (not 0) for a constant series?

@jreback added the Numeric Operations label Sep 16, 2014
@jreback
Contributor

jreback commented Sep 16, 2014

@behzadnouri tests!

@jreback
Contributor

jreback commented Sep 16, 2014

does this close #6929, or is there a further extension needed and this is a partial case?

@jreback
Contributor

jreback commented Sep 16, 2014

cc @jaimefrio

@jreback
Contributor

jreback commented Sep 16, 2014

@behzadnouri I realized #8086 is the issue @seth-p is pointing at (unifying skew)

@behzadnouri
Contributor Author

@jreback tests added

@jreback added the Algos label Sep 26, 2014
@jreback added this to the 0.15.1 milestone Sep 26, 2014
@jreback
Contributor

jreback commented Sep 26, 2014

well, I think a more complete solution to #6929 is in order, no?

@seth-p
cc @jaimefrio

@jaimefrio
Contributor

You can do much, much better... I have a PR halfway through here, which uses the Welford-like algorithm described here. The case where an observation is removed and another added is not worked out there (or anywhere else, as far as I can tell), and you have to figure it out on your own.
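For reference (a sketch of mine, not code from the linked PR), here is the add/remove bookkeeping for the second moment only, assuming no NaNs in the window; the analogous third- and fourth-moment updates are the part that has to be worked out by hand:

# Welford-style running mean and M2 (sum of squared deviations); a rolling
# window adds the entering observation and removes the leaving one.
def add_obs(n, mean, M2, x):
    n += 1
    delta = x - mean
    mean += delta / n
    M2 += delta * (x - mean)
    return n, mean, M2

def remove_obs(n, mean, M2, x):
    n -= 1
    if n == 0:
        return 0, 0.0, 0.0
    delta = x - mean            # deviation from the old mean
    mean -= delta / n           # downdate the mean
    M2 -= delta * (x - mean)    # downdate M2 using old and new means
    return n, mean, M2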

@behzadnouri
Contributor Author

@jaimefrio I tried to take that route before making my code changes, and that would definitely be the right way to go if the code were computing expanding-window moments, not rolling-window skew/kurt that also has to handle nan values gracefully.

take kurtosis for example: you need to maintain the effective number of observations n (that is, the count of non-null values within the window) and 4 central moments. that is 5 updating equations, which is fine. but there are 4 cases for each equation, depending on whether the new value and the dropped value are null or not. that is 5 * 4 = 20 equations, with all their corner cases and chances of making mistakes. (think about which n you should use to update the moments if the dropped value is null versus the new value is null: the new n or the old n)
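as a rough illustration (a hypothetical helper, not code from this patch), here is only the simplest of those five updating equations, the effective count; each of the four moment updates needs the same four-way case split, which is where the corner cases pile up:

import numpy as np

def update_count(nobs, x_new, x_old):
    # the entering and leaving values may each be null, giving 4 cases
    if not np.isnan(x_new):
        nobs += 1
    if not np.isnan(x_old):
        nobs -= 1
    return nobs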

and even doing all that, at the end of the day, it is not like:

You can do much, much better...

no! definitely not. for all real-world series that I have seen, and for any time series which does not change scale/location by multiple orders of magnitude, z-scores will do the job with minimal code complexity and much better code maintainability.

@jaimefrio
Contributor

I don't know... A mock-up of your solution runs on my system a little over 2x slower than the current code. And while it improves stability for what arguably are the most common use cases, it is still going to fail for a number of not-so-pathological ones, e.g. x = np.random.rand(1000); x[500:] += 10000; rolling_skew(x, 100, 0) is all wrong. That could be OK, were it not that there is an algorithm that provides superior stability and better performance (roll_var was slowed down about 15%, if memory serves me well).
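A runnable version of that example, for anyone who wants to reproduce it (the seed and the scipy reference value are added here for illustration and are not part of the original setup):

import numpy as np
import pandas as pd
from scipy.stats import skew

np.random.seed(0)
x = np.random.rand(1000)
x[500:] += 10000                     # shift the second half by a huge constant
rolled = pd.stats.moments.rolling_skew(x, 100)
print(rolled[-1])                    # rolling result for the last window
print(skew(x[-100:], bias=False))    # direct reference computation on that window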

I am not much of a Pandas user or contributor, so I may be getting the philosophy all wrong. But on a project that uses home-brewed implementations of skip lists (for roll_median) and of circular deques (for roll_min and roll_max), and that advertises itself as "providing high-performance, easy-to-use data structures", choosing an objectively worse algorithm because it makes the code less complex and more maintainable seems like a pretty weak argument. Especially since you have the complexity packaged away inside a function that can very easily be checked, with 10 or 20 lines of test code, to ensure that all code paths are working fine.

Just to make it explicitly clear, I am -1 on this approach. But if it is deemed the right one, I would suggest that you put your code in the Python wrapper, as I really don't think you are taking any advantage of Cython.

@jreback
Contributor

jreback commented Sep 26, 2014

@jaimefrio the implementation you pointed to above seems pretty straightforward, and you have nicely refactored out the counting issue. It seems pretty reasonable and is likely to have similar perf to the current routines; pls continue the work and submit a PR when ready.

@behzadnouri
Contributor Author

@jaimefrio what you are calling "not-so-pathological" has a jump of more than 34K times the trailing standard deviation:

In [28]: np.diff(x)[499] / x[:500].std()
Out[28]: 34371.41

what actual real-world time series makes such a jump, and would anyone still ask "let's look at the skew numbers"?

choosing an objectively worse algorithm because it makes code less complex and more maintainable seems like a pretty weak argument

you are calling it "objectively worse" based on a contrived example, while at the same time admitting

improves stability for what arguably are the most common use cases,

and looking at this, I honestly do not know what to say about what you call a "pretty weak argument".

@jaimefrio
Contributor

@behzadnouri I was calling it objectively worse because I timed it and it ran 2x slower than the current code. Which is kind of surprisingly fast, since you are iterating at least eight times over the input array before spitting out an answer.

@jaimefrio
Contributor

It turns out that it isn't that hard to come up with very simple examples that fail big time:

import pandas as pd
from pandas.algos import roll_kurt
import numpy as np
from scipy.stats import kurtosis

# Sanity check
x = np.random.rand(int(1e6))
y = x - x.mean(); y /= y.std()
>>> roll_kurt(x, 1000, 0)[-1]
-1.1175114235008137
>>> roll_kurt(y, 1000, 0)[-1]
-1.1175114234982957
>>> kurtosis(y[-1000:], bias=False)
-1.1175114234980807

# Meltdown
x = np.linspace(0, 1000, int(1e6))
y = x - x.mean(); y /= y.std()
>>> roll_kurt(x, 1000, 0)[-1]
10.397302582213468
>>> roll_kurt(y, 1000, 0)[-1]
-0.46924667029766476
>>> kurtosis(x[-1000:], bias=False)
-1.2000000000000006

@behzadnouri
Contributor Author

@jaimefrio the numbers you are quoting do not come out of this code (especially that 10.39..., which is extremely far off from the output of this patch). you may want to check your mock-up code.

That you got even this simpler algorithm wrong proves my point about why code simplicity and maintainability matter.

also, you are looking at a series with such characteristics:

In [48]: (x[-1000:].std()) / x.std()
Out[48]: 0.0010

In [49]: (x[-1000:].mean() - x[:1000].mean()) / x.std()
Out[49]: 3.4606

In [50]: (x[-1000:].mean() - x[:1000].mean()) / x[-1000:].std() # same as `/ x[:1000].std()`
Out[50]: 3460.6

Do you honestly believe that a series which changes location by around 3.5 times its standard deviation (3460 times the rolling-window standard deviation) is a use case for kurtosis as a measure of tail heaviness? does anyone out there care whether the tails are heavy when the location can already move by orders of magnitude?

I am happy that you mentioned earlier that this patch

improves stability for what arguably are the most common use cases,

and that your mock-up code passes the sanity check for the sane example. That said, I do not feel this discussion is an effective use of our time. If you think I am wrong, please spend the time implementing what you think is right, rather than trying to prove me wrong.

Please do not continue this argument; instead, focus on a correct implementation of what you think is a better path.

@jaimefrio
Contributor

Trust me, @behzadnouri, I have better things to do than proving you wrong. This is supposed to be a collaborative effort to produce great software, not a contest to see who has the bestest PR. Let's try to get back to that...

I have done some more tests over the weekend, and it is terribly easy to find simple examples with terrible behavior. Perhaps the most illustrative is a long enough linear ramp, with a sufficiently small window. The kurtosis should be -1.2 for any window size (the values in any window over a linear ramp are evenly spaced, i.e. effectively uniform, and the excess kurtosis of a uniform distribution is -6/5), but here's what you get:

x = np.linspace(0, 1, 10000)
>>> roll_kurt(x, 10, 0)[-1]  # current master, completely broken
39.756569994447986
>>> roll_kurt((x - x.mean()) / x.std(), 10, 0)[-1]  # this PR, better but 30% off
-1.5704869226377387
>>> roll_kurt(x - x.mean(), 10, 0)[-1]  # similar idea to this PR, faster, less robust
-0.15264283115696387
>>> roll_kurt2(x, 10, 0)[-1]  # a stabler algorithm, not bad, but 40% off
-1.6595757481342461
>>> roll_kurt2(x - x.mean(), 10, 0)[-1]  # stabler algo + centering, getting close
-1.4211005796851859
>>> roll_kurt2((x - x.mean()) / x.std(), 10, 0)[-1]  # stabler algo + this PR, almost (4% off)
-1.2454174095606518

There is also the matter of performance... Some timings to put things in context, with the same x as above:

%timeit roll_kurt(x, 10, 0)  # current master is terribly slow
1000 loops, best of 3: 890 µs per loop

%timeit roll_kurt2(x, 10, 0)  # better stability and 2.5x faster
1000 loops, best of 3: 375 µs per loop

%%timeit  # this PR preprocessing mock-up
... mask = np.isfinite(x)
... y = x.copy()
... y[mask] -= np.mean(x[mask])
... y[mask] /= np.std(x[mask])
1000 loops, best of 3: 333 µs per loop

%%timeit  # similar to this PR, 25% faster
... mask = np.isfinite(x)
... y = x.copy()
... y -= np.mean(x[mask])
... y /= np.std(x[mask])
1000 loops, best of 3: 244 µs per loop

%%timeit  # centering w/o normalizing is 2.5x faster
... mask = np.isfinite(x)
... y = x.copy()
... y -= np.mean(x[mask])
10000 loops, best of 3: 103 µs per loop

%timeit y = x - x[mask].mean()  # building the mask is not cheap
10000 loops, best of 3: 64.4 µs per loop

I think there is plenty of food for thought here. My views on this:

  • The current algorithms in the core roll_skew and roll_kurt are very, very bad, and need to be replaced. I will do that shortly.
  • Getting rolling skewness and kurtosis right is hard! Using z-scores, or at least centering, may be a good idea, but you have to be careful with performance.

So based on all of the above, my original -1 is now a +0.25. To make it a full +1, I would suggest:

  • Move the code to the Python wrapper. When the Python functions rolling_skew or rolling_kurt are called, before calling the Cython functions roll_skew or roll_kurt, a call is made to _rolling_moment, which calls _process_data_structure, where a mask of inf values in the arrays is already being calculated, see here. If you can mix both maskings into a single one, you can minimize the effect on performance, and get the extra robustness almost for free.
  • I would skip the division by the standard deviation. It is taking 50% of the time (more if the maskings are joined), and for most cases it seems to have only a marginal effect on robustness. (A rough sketch of the centering-only idea follows this list.)
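A rough sketch of what that centering-only preprocessing could look like at the Python-wrapper level (the helper name and placement here are illustrative, not the actual pandas internals):

import numpy as np

def _center_finite(values):
    # hypothetical helper: subtract the mean of the finite values only,
    # leaving NaN/inf entries in place; the mask could be merged with the
    # inf mask already computed in _process_data_structure.
    mask = np.isfinite(values)
    out = values.astype(float, copy=True)
    out[mask] -= out[mask].mean()
    return out, mask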

@jreback
Contributor

jreback commented Sep 29, 2014

@jaimefrio sounds like a nice comprehensive approach. Pls submit a PR when ready.
