New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rolling skew and rolling kurt has wrong answer under all eq values #18044

Closed
zartbot opened this Issue Oct 31, 2017 · 6 comments

Comments

Projects
None yet
5 participants
@zartbot
Contributor

zartbot commented Oct 31, 2017

Code Sample, a copy-pastable example if possible

pd.Series([1.1]*15).rolling(10).skew()
pd.Series([1.1]*15).rolling(10).kurt()

Problem description

rolling skew() wrong result:

pd.Series([1.1]*15).rolling(10).skew()

0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 -1.056765e+08
10 -1.056765e+08
11 -1.056765e+08
12 -1.056765e+08
13 -1.056765e+08
14 -1.056765e+08
dtype: float64

rolling kurt() wrong result:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 1.201335e+16
10 1.201335e+16
11 1.201335e+16
12 1.201335e+16
13 1.201335e+16
14 1.201335e+16
dtype: float64

But try pd.Series([1.1]*15).tail(10).skew() and pd.Series([1.1]*15).tail(10).kurt() could get the right result.
and also a reference workaround by using apply to call scipy:

pd.Series([1.1]*15).rolling(10).apply(lambda x: scipy.stats.skew(x))
pd.Series([1.1]*15).rolling(10).apply(lambda x: scipy.stats.kurt(x))

seems it has bug in rolling_skew/rolling_kurt

Expected Output

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]
INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-93-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: None
pip: 9.0.1
setuptools: 36.6.0
Cython: 0.24
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.0
tables: 3.2.2
numexpr: 2.5.2
feather: None
matplotlib: 2.0.0
openpyxl: 2.2.0-b1
xlrd: 0.9.4
xlwt: None
xlsxwriter: 0.7.3
lxml: 3.5.0
bs4: None
html5lib: 1.0b10
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Oct 31, 2017

Member

@ogotaiking : Thanks for reporting this! For reference, do you mind posting the expected results too?

Member

gfyoung commented Oct 31, 2017

@ogotaiking : Thanks for reporting this! For reference, do you mind posting the expected results too?

@selik

This comment has been minimized.

Show comment
Hide comment
@selik

selik Oct 31, 2017

Contributor

@gfyoung The skew of a uniform distribution is zero. The kurtosis of the uniform distribution could be -6/5, depending on how you define it. While @ogotaiking says the kurtosis is calculated correctly in the non-rolling case, it's returning zero for me, so that might not be what's desired.

Contributor

selik commented Oct 31, 2017

@gfyoung The skew of a uniform distribution is zero. The kurtosis of the uniform distribution could be -6/5, depending on how you define it. While @ogotaiking says the kurtosis is calculated correctly in the non-rolling case, it's returning zero for me, so that might not be what's desired.

@zartbot

This comment has been minimized.

Show comment
Hide comment
@zartbot

zartbot Nov 1, 2017

Contributor

The issue would be variance numerically unstable in "calc_skew" , the answer will divide by B , but the code only check B<=0, during rolling, add_skew/remove_skew may cause B in range(0,1e-9), so after that uniform distribution will get such large number in skew/kurt.

#6817 from jaimefrio/rolling_var-welford resolve the roll_var stability , but seems it will also import to skew/kurt.

# Rolling skewness

cdef inline double calc_skew(int64_t minp, int64_t nobs, double x, double xx,
                             double xxx) nogil:
    cdef double result, dnobs
    cdef double A, B, C, R

    if nobs >= minp:
        dnobs = <double>nobs
        A = x / dnobs
        B = xx / dnobs - A * A
        C = xxx / dnobs - A * A * A - 3 * A * B
        if B <= 0 or nobs < 3: <-- B's precision has some issue cause uniform distribution not be ZERO
            result = NaN
        else:
            R = sqrt(B)
            result = ((sqrt(dnobs * (dnobs - 1.)) * C) /
                      ((dnobs - 2) * R * R * R))  <-- here will divde 1e-N
    else:
        result = NaN

    return result
Contributor

zartbot commented Nov 1, 2017

The issue would be variance numerically unstable in "calc_skew" , the answer will divide by B , but the code only check B<=0, during rolling, add_skew/remove_skew may cause B in range(0,1e-9), so after that uniform distribution will get such large number in skew/kurt.

#6817 from jaimefrio/rolling_var-welford resolve the roll_var stability , but seems it will also import to skew/kurt.

# Rolling skewness

cdef inline double calc_skew(int64_t minp, int64_t nobs, double x, double xx,
                             double xxx) nogil:
    cdef double result, dnobs
    cdef double A, B, C, R

    if nobs >= minp:
        dnobs = <double>nobs
        A = x / dnobs
        B = xx / dnobs - A * A
        C = xxx / dnobs - A * A * A - 3 * A * B
        if B <= 0 or nobs < 3: <-- B's precision has some issue cause uniform distribution not be ZERO
            result = NaN
        else:
            R = sqrt(B)
            result = ((sqrt(dnobs * (dnobs - 1.)) * C) /
                      ((dnobs - 2) * R * R * R))  <-- here will divde 1e-N
    else:
        result = NaN

    return result
@zartbot

This comment has been minimized.

Show comment
Hide comment
@zartbot

zartbot Nov 1, 2017

Contributor

in pandas/_libs/window.pyx
RCA(Root Cause Analysis):
for uniform distribution, all values are eq.
function roll_skew will continues call add_skew / remove_skew.

xx  += val * val
xx -= prev* prev

but during rolling, it will cause numerically unstable in xx,
previously suggest is a bad fix

Contributor

zartbot commented Nov 1, 2017

in pandas/_libs/window.pyx
RCA(Root Cause Analysis):
for uniform distribution, all values are eq.
function roll_skew will continues call add_skew / remove_skew.

xx  += val * val
xx -= prev* prev

but during rolling, it will cause numerically unstable in xx,
previously suggest is a bad fix

@gfyoung

This comment has been minimized.

Show comment
Hide comment
@gfyoung

gfyoung Nov 1, 2017

Member

@ogotaiking : Cool! Feel free to propose changes in a PR and add a test or two to confirm. We can look at it in more detail there.

Member

gfyoung commented Nov 1, 2017

@ogotaiking : Cool! Feel free to propose changes in a PR and add a test or two to confirm. We can look at it in more detail there.

@zartbot

This comment has been minimized.

Show comment
Hide comment
@zartbot

zartbot Nov 1, 2017

Contributor

@gfyoung

just did more test, the previous fix does not work in some particular numbers, and check core/nanops.py,

during calculate m2 and m3 , it has the following function to validate the float issue.

_zero_out_fperr(arg):
    if isinstance(arg, np.ndarray):
        with np.errstate(invalid='ignore'):
            return np.where(np.abs(arg) < 1e-14, 0, arg)
    else:
        return arg.dtype.type(0) if np.abs(arg) < 1e-14 else arg

but in window.pyx it's a cython code , i am not familiar with it to modify or do such validation.

I just did some simple change

diff --git a/pandas/_libs/window.pyx b/pandas/_libs/window.pyx
index b6bd6f9..64a1a40 100644
--- a/pandas/_libs/window.pyx
+++ b/pandas/_libs/window.pyx
@@ -788,7 +788,7 @@ cdef inline double calc_skew(int64_t minp, int64_t nobs, double x, double xx,
         A = x / dnobs
         B = xx / dnobs - A * A
         C = xxx / dnobs - A * A * A - 3 * A * B
-        if B <= 0 or nobs < 3:
+        if B <= 1e-14 or nobs < 3:
             result = NaN
         else:
             R = sqrt(B)
@@ -915,7 +915,7 @@ cdef inline double calc_kurt(int64_t minp, int64_t nobs, double x, double xx,
         R = R * A
         D = xxxx / dnobs - R - 6 * B * A * A - 4 * C * A
 
-        if B == 0 or nobs < 4:
+        if B <= 1e-14 or nobs < 4:
             result = NaN
         else:
             K = (dnobs * dnobs - 1.) * D / (B * B) - 3 * ((dnobs - 1.) ** 2)

and I did more test on random numbers with uniform distribution

c=0.0
for i in range(0,10000):
    a = pd.Series([np.random.rand(1)]*30)
    b=max(a.rolling(10).skew().fillna(0))
    if b > c :
        c = b

the result looks good , after try 10000 random numbers , c still eq zero.
and also check random distribution skew:

tc=0.0
for i in range(0,10000):
    test=pd.Series(np.random.rand(500))
    aa = test.rolling(10).skew().fillna(0)
    bb = test.rolling(10).apply(lambda x: pd.Series(x).skew()).fillna(0)
    cc = max(aa - bb)
    if cc > tc:
        tc = cc
tc

the max diff between rolling_skew and nanops.nanskew is : 1.0452083643031074e-11

Contributor

zartbot commented Nov 1, 2017

@gfyoung

just did more test, the previous fix does not work in some particular numbers, and check core/nanops.py,

during calculate m2 and m3 , it has the following function to validate the float issue.

_zero_out_fperr(arg):
    if isinstance(arg, np.ndarray):
        with np.errstate(invalid='ignore'):
            return np.where(np.abs(arg) < 1e-14, 0, arg)
    else:
        return arg.dtype.type(0) if np.abs(arg) < 1e-14 else arg

but in window.pyx it's a cython code , i am not familiar with it to modify or do such validation.

I just did some simple change

diff --git a/pandas/_libs/window.pyx b/pandas/_libs/window.pyx
index b6bd6f9..64a1a40 100644
--- a/pandas/_libs/window.pyx
+++ b/pandas/_libs/window.pyx
@@ -788,7 +788,7 @@ cdef inline double calc_skew(int64_t minp, int64_t nobs, double x, double xx,
         A = x / dnobs
         B = xx / dnobs - A * A
         C = xxx / dnobs - A * A * A - 3 * A * B
-        if B <= 0 or nobs < 3:
+        if B <= 1e-14 or nobs < 3:
             result = NaN
         else:
             R = sqrt(B)
@@ -915,7 +915,7 @@ cdef inline double calc_kurt(int64_t minp, int64_t nobs, double x, double xx,
         R = R * A
         D = xxxx / dnobs - R - 6 * B * A * A - 4 * C * A
 
-        if B == 0 or nobs < 4:
+        if B <= 1e-14 or nobs < 4:
             result = NaN
         else:
             K = (dnobs * dnobs - 1.) * D / (B * B) - 3 * ((dnobs - 1.) ** 2)

and I did more test on random numbers with uniform distribution

c=0.0
for i in range(0,10000):
    a = pd.Series([np.random.rand(1)]*30)
    b=max(a.rolling(10).skew().fillna(0))
    if b > c :
        c = b

the result looks good , after try 10000 random numbers , c still eq zero.
and also check random distribution skew:

tc=0.0
for i in range(0,10000):
    test=pd.Series(np.random.rand(500))
    aa = test.rolling(10).skew().fillna(0)
    bb = test.rolling(10).apply(lambda x: pd.Series(x).skew()).fillna(0)
    cc = max(aa - bb)
    if cc > tc:
        tc = cc
tc

the max diff between rolling_skew and nanops.nanskew is : 1.0452083643031074e-11

@zartbot zartbot referenced this issue Nov 1, 2017

Merged

BUG:fix rolling skew kurt floating issue #18065

4 of 4 tasks complete

@TomAugspurger TomAugspurger added this to the Next Major Release milestone Nov 1, 2017

@jreback jreback modified the milestones: Next Major Release, 0.21.1 Nov 4, 2017

@jreback jreback added the Bug label Nov 4, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment