# rolling skew and rolling kurt has wrong answer under all eq values #18044

Closed
opened this Issue Oct 31, 2017 · 6 comments

Projects
None yet
5 participants
Contributor

### zartbot commented Oct 31, 2017 • edited

#### Code Sample, a copy-pastable example if possible

```pd.Series([1.1]*15).rolling(10).skew()
pd.Series([1.1]*15).rolling(10).kurt()

```

#### Problem description

rolling skew() wrong result:

`pd.Series([1.1]*15).rolling(10).skew()`

0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 -1.056765e+08
10 -1.056765e+08
11 -1.056765e+08
12 -1.056765e+08
13 -1.056765e+08
14 -1.056765e+08
dtype: float64

rolling kurt() wrong result:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 1.201335e+16
10 1.201335e+16
11 1.201335e+16
12 1.201335e+16
13 1.201335e+16
14 1.201335e+16
dtype: float64

But try pd.Series([1.1]*15).tail(10).skew() and pd.Series([1.1]*15).tail(10).kurt() could get the right result.
and also a reference workaround by using apply to call scipy:

```pd.Series([1.1]*15).rolling(10).apply(lambda x: scipy.stats.skew(x))
pd.Series([1.1]*15).rolling(10).apply(lambda x: scipy.stats.kurt(x))```

seems it has bug in rolling_skew/rolling_kurt

## [paste the output of `pd.show_versions()` here below this line] INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-93-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: None
pip: 9.0.1
setuptools: 36.6.0
Cython: 0.24
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.5.2
feather: None
matplotlib: 2.0.0
openpyxl: 2.2.0-b1
xlrd: 0.9.4
xlwt: None
xlsxwriter: 0.7.3
lxml: 3.5.0
bs4: None
html5lib: 1.0b10
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None

Member

### gfyoung commented Oct 31, 2017 • edited

 @ogotaiking : Thanks for reporting this! For reference, do you mind posting the expected results too?
Contributor

### selik commented Oct 31, 2017 • edited

 @gfyoung The skew of a uniform distribution is zero. The kurtosis of the uniform distribution could be -6/5, depending on how you define it. While @ogotaiking says the kurtosis is calculated correctly in the non-rolling case, it's returning zero for me, so that might not be what's desired.
Contributor

### zartbot commented Nov 1, 2017

 The issue would be variance numerically unstable in "calc_skew" , the answer will divide by B , but the code only check B<=0, during rolling, add_skew/remove_skew may cause B in range(0,1e-9), so after that uniform distribution will get such large number in skew/kurt. #6817 from jaimefrio/rolling_var-welford resolve the roll_var stability , but seems it will also import to skew/kurt. ```# Rolling skewness cdef inline double calc_skew(int64_t minp, int64_t nobs, double x, double xx, double xxx) nogil: cdef double result, dnobs cdef double A, B, C, R if nobs >= minp: dnobs = nobs A = x / dnobs B = xx / dnobs - A * A C = xxx / dnobs - A * A * A - 3 * A * B if B <= 0 or nobs < 3: <-- B's precision has some issue cause uniform distribution not be ZERO result = NaN else: R = sqrt(B) result = ((sqrt(dnobs * (dnobs - 1.)) * C) / ((dnobs - 2) * R * R * R)) <-- here will divde 1e-N else: result = NaN return result```
Contributor

### zartbot commented Nov 1, 2017 • edited

 in pandas/_libs/window.pyx RCA(Root Cause Analysis): for uniform distribution, all values are eq. function roll_skew will continues call add_skew / remove_skew. ```xx += val * val xx -= prev* prev``` but during rolling, it will cause numerically unstable in xx, previously suggest is a bad fix
Member

### gfyoung commented Nov 1, 2017

 @ogotaiking : Cool! Feel free to propose changes in a PR and add a test or two to confirm. We can look at it in more detail there.
Contributor

### zartbot commented Nov 1, 2017 • edited

 @gfyoung just did more test, the previous fix does not work in some particular numbers, and check core/nanops.py, during calculate m2 and m3 , it has the following function to validate the float issue. ```_zero_out_fperr(arg): if isinstance(arg, np.ndarray): with np.errstate(invalid='ignore'): return np.where(np.abs(arg) < 1e-14, 0, arg) else: return arg.dtype.type(0) if np.abs(arg) < 1e-14 else arg``` but in window.pyx it's a cython code , i am not familiar with it to modify or do such validation. I just did some simple change ```diff --git a/pandas/_libs/window.pyx b/pandas/_libs/window.pyx index b6bd6f9..64a1a40 100644 --- a/pandas/_libs/window.pyx +++ b/pandas/_libs/window.pyx @@ -788,7 +788,7 @@ cdef inline double calc_skew(int64_t minp, int64_t nobs, double x, double xx, A = x / dnobs B = xx / dnobs - A * A C = xxx / dnobs - A * A * A - 3 * A * B - if B <= 0 or nobs < 3: + if B <= 1e-14 or nobs < 3: result = NaN else: R = sqrt(B) @@ -915,7 +915,7 @@ cdef inline double calc_kurt(int64_t minp, int64_t nobs, double x, double xx, R = R * A D = xxxx / dnobs - R - 6 * B * A * A - 4 * C * A - if B == 0 or nobs < 4: + if B <= 1e-14 or nobs < 4: result = NaN else: K = (dnobs * dnobs - 1.) * D / (B * B) - 3 * ((dnobs - 1.) ** 2)``` and I did more test on random numbers with uniform distribution ```c=0.0 for i in range(0,10000): a = pd.Series([np.random.rand(1)]*30) b=max(a.rolling(10).skew().fillna(0)) if b > c : c = b``` the result looks good , after try 10000 random numbers , c still eq zero. and also check random distribution skew: ```tc=0.0 for i in range(0,10000): test=pd.Series(np.random.rand(500)) aa = test.rolling(10).skew().fillna(0) bb = test.rolling(10).apply(lambda x: pd.Series(x).skew()).fillna(0) cc = max(aa - bb) if cc > tc: tc = cc tc``` the max diff between rolling_skew and nanops.nanskew is : 1.0452083643031074e-11

Merged

Closed