-
-
Notifications
You must be signed in to change notification settings - Fork 19.2k
BUG: fix polluted window in skewness computation #62863
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
BUG: fix polluted window in skewness computation #62863
Conversation
This change uses online update for the mean instead of computing the centralized values. Additionally, it checks for possible catastrophic cancellation by big changes in 3rd central moment.
|
Reproduction of #47461 import pandas as pd
data = [
-2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, 2, -2, -2, -2, -2, -2, -2,
-2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, -2, 7729964027,
7729965641, 7729965103, 7729966179, 7729968330, 7729969406, 7729971020,
7729969406, 7729969944, 7729969406, 7729969406, 7729969944, 7729971558,
7729972633, 7729973171, 7729973171, 7729973171, 7729969406,
]
df = pd.DataFrame(data, columns=["data"])
print(df.data.rolling(10).skew()[-20:])
# 34 NaN
# 35 NaN
# 36 3.162278e+00
# 37 1.778781e+00
# 38 1.035098e+00
# 39 4.841229e-01
# 40 2.426355e-13
# 41 -4.841229e-01
# 42 -1.035098e+00
# 43 -1.778781e+00
# 44 -3.162278e+00
# 45 -3.928221e-01
# 46 -6.794734e-01
# 47 -1.272001e+00
# 48 -1.034738e+00
# 49 8.839369e-01
# 50 9.294772e-01
# 51 4.561840e-01
# 52 1.688976e-01
# 53 1.688976e-01
# Name: data, dtype: float64
print(df.data.tail(10).skew())
# 0.16889762763904356 |
|
The window from the example #62863 (comment) was recomputed 3 times.
|
| cdef: | ||
| float64_t result, dnobs | ||
| float64_t A, B, C, R | ||
| float64_t moments_ratio, correction |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks you for using descriptive names for these variables!
|
I feel like this PR is changing a lot of the current behavior in the skew computation. I will minimize the changes to focus on numerical stability. |
| min_val = np.nanmin(values) | ||
| values_copy = np.copy(values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: Can we type values in the roll_skew signature as const float64_t[:] now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I ran the window tests with UBSAN and it didn't report a problem.
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.This PR changes the algorithm used to compute the skewness. Previously, the algorithm used computed the summation of$X^{3}$ , $X^{2}$ , and $X$ , and used kahan summation for stability. The problem is that $EX^3$ - $(EX)^3$ can suffer heavilly from catastrophic cancellation.
Now, it uses a Welford-like algorithm to compute up to the 3rd central moment, and continuously check for stability when updating the 3rd central moment, by checking for severe loss of significant digits when updating its value. If it detects a stability problem, it recomputes the window.
Benchmark
asv continuous -f 1.1 -E virtualenv:3.13 upstream/main HEAD -b rolling.MethodsNote
The performance increase is mainly due to the fact that it didn't have to recompute the window. This change can have a worse case scenario of$O(n * \text{window\_size})$