Rolling standard deviation fails when used with win_type #26597

Open
Connossor opened this issue May 31, 2019 · 7 comments

Comments

@Connossor
commented May 31, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.DataFrame({'a': range(6)})
df['a'].rolling(3, win_type='blackman').agg(['mean', 'std'])

Problem description

Calculating rolling aggregations with a window function (win_type) sometimes raises an error that is quite hard to understand. I think it might be that the std() aggregation is not compatible with certain window types, but it's not clear from the error messages or the documentation what is going wrong.

Here is the stack trace that I see when I run the code sample above:


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
 in 
      3 df = pd.DataFrame({'a': range(6)})
      4 
----> 5 print(df['a'].rolling(3, win_type='blackman').agg(['mean', 'std']))

~\AppData\Local\Continuum\anaconda3\envs\petropy\lib\site-packages\pandas\core\window.py in aggregate(self, arg, *args, **kwargs)
741 @Appender(_shared_docs['aggregate'])
742 def aggregate(self, arg, *args, **kwargs):
--> 743 result, how = self._aggregate(arg, *args, **kwargs)
744 if result is None:
745

~\AppData\Local\Continuum\anaconda3\envs\petropy\lib\site-packages\pandas\core\base.py in _aggregate(self, arg, *args, **kwargs)
557 return self._aggregate_multiple_funcs(arg,
558 _level=_level,
--> 559 _axis=_axis), None
560 else:
561 result = None

~\AppData\Local\Continuum\anaconda3\envs\petropy\lib\site-packages\pandas\core\base.py in _aggregate_multiple_funcs(self, arg, _level, _axis)
587 try:
588 colg = self._gotitem(obj.name, ndim=1, subset=obj)
--> 589 results.append(colg.aggregate(a))
590
591 # make sure we find a good name

~\AppData\Local\Continuum\anaconda3\envs\petropy\lib\site-packages\pandas\core\window.py in aggregate(self, arg, *args, **kwargs)
741 @Appender(_shared_docs['aggregate'])
742 def aggregate(self, arg, *args, **kwargs):
--> 743 result, how = self._aggregate(arg, *args, **kwargs)
744 if result is None:
745

~\AppData\Local\Continuum\anaconda3\envs\petropy\lib\site-packages\pandas\core\base.py in _aggregate(self, arg, *args, **kwargs)
354 if isinstance(arg, compat.string_types):
355 return self._try_aggregate_string_function(arg, *args,
--> 356 **kwargs), None
357
358 if isinstance(arg, dict):

~\AppData\Local\Continuum\anaconda3\envs\petropy\lib\site-packages\pandas\core\base.py in _try_aggregate_string_function(self, arg, *args, **kwargs)
321 f = getattr(np, arg, None)
322 if f is not None:
--> 323 return f(self, *args, **kwargs)
324
325 raise ValueError("{arg} is an unknown string function".format(arg=arg))

~\AppData\Local\Continuum\anaconda3\envs\petropy\lib\site-packages\numpy\core\fromnumeric.py in std(a, axis, dtype, out, ddof, keepdims)
3240
3241 return _methods._std(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
-> 3242 **kwargs)
3243
3244

~\AppData\Local\Continuum\anaconda3\envs\petropy\lib\site-packages\numpy\core\_methods.py in _std(a, axis, dtype, out, ddof, keepdims)
138 def _std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=False):
139 ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
--> 140 keepdims=keepdims)
141
142 if isinstance(ret, mu.ndarray):

~\AppData\Local\Continuum\anaconda3\envs\petropy\lib\site-packages\numpy\core\_methods.py in _var(a, axis, dtype, out, ddof, keepdims)
110 arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
111 else:
--> 112 arrmean = arrmean.dtype.type(arrmean / rcount)
113
114 # Compute sum of squared deviations from mean

~\AppData\Local\Continuum\anaconda3\envs\petropy\lib\site-packages\pandas\core\window.py in __getattr__(self, attr)
148
149 raise AttributeError("%r object has no attribute %r" %
--> 150 (type(self).__name__, attr))
151
152 def _dir_additions(self):

AttributeError: 'Window' object has no attribute 'dtype'

Expected Output

When I run the code above without the win_type parameter, everything works fine:

import pandas as pd
df = pd.DataFrame({'a': range(6)})
df['a'].rolling(3, win_type=None).agg(['mean', 'std'])

Result:


   mean  std
0   NaN  NaN
1   NaN  NaN
2   1.0  1.0
3   2.0  1.0
4   3.0  1.0
5   4.0  1.0

Thanks in advance for any tips!

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.2
pytest: 4.3.1
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 7.4.0
sphinx: 1.8.5
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.1
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.5
lxml.etree: 4.3.2
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@TomAugspurger

Contributor

commented May 31, 2019

think it might be that the std() aggregation is not compatible with certain window types

That seems to be the case.

In [23]: df.rolling(3, win_type='blackman').agg('std')
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-23-0df3ddf74018> in <module>
----> 1 df.rolling(3, win_type='blackman').agg('std')

~/sandbox/pandas/pandas/core/window.py in aggregate(self, arg, *args, **kwargs)
    748     @Appender(_shared_docs['aggregate'])
    749     def aggregate(self, arg, *args, **kwargs):
--> 750         result, how = self._aggregate(arg, *args, **kwargs)
    751         if result is None:
    752

~/sandbox/pandas/pandas/core/base.py in _aggregate(self, arg, *args, **kwargs)
    328         if isinstance(arg, str):
    329             return self._try_aggregate_string_function(arg, *args,
--> 330                                                        **kwargs), None
    331
    332         if isinstance(arg, dict):

~/sandbox/pandas/pandas/core/base.py in _try_aggregate_string_function(self, arg, *args, **kwargs)
    295         f = getattr(np, arg, None)
    296         if f is not None:
--> 297             return f(self, *args, **kwargs)
    298
    299         raise ValueError("{arg} is an unknown string function".format(arg=arg))

<__array_function__ internals> in std(*args, **kwargs)

~/sandbox/numpy/numpy/core/fromnumeric.py in std(a, axis, dtype, out, ddof, keepdims)
   3354
   3355     return _methods._std(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
-> 3356                          **kwargs)
   3357
   3358

~/sandbox/numpy/numpy/core/_methods.py in _std(a, axis, dtype, out, ddof, keepdims)
    214 def _std(a, axis=None, dtype=None, out=None, ddof=0, keepdims=False):
    215     ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
--> 216                keepdims=keepdims)
    217
    218     if isinstance(ret, mu.ndarray):

~/sandbox/numpy/numpy/core/_methods.py in _var(a, axis, dtype, out, ddof, keepdims)
    185                 arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
    186     else:
--> 187         arrmean = arrmean.dtype.type(arrmean / rcount)
    188
    189     # Compute sum of squared deviations from mean

~/sandbox/pandas/pandas/core/window.py in __getattr__(self, attr)
    146
    147         raise AttributeError("%r object has no attribute %r" %
--> 148                              (type(self).__name__, attr))
    149
    150     def _dir_additions(self):

AttributeError: 'Window' object has no attribute 'dtype'

It'd be good to verify whether that's intentional or just an implementation detail. If it's intentional, then we should be able to verify which aggfuncs are compatible with which window types, and raise an informative error message before attempting the aggregation.

@Connossor are you interested in doing that investigation?

@Connossor

Author

commented Jun 1, 2019

Hi @TomAugspurger,

Thanks for the response, and yep agree with your plan.

I did my best to look through the codebase, and it's not clear to me which aggfuncs are supposed to be compatible with which window types. From the code below, it seems that when a string aggregation such as "std" is not found on the object itself but exists as a function in numpy, the numpy function gets used:

pandas/pandas/core/base.py

Lines 299 to 323 in cb00deb

def _try_aggregate_string_function(self, arg, *args, **kwargs):
    """
    if arg is a string, then try to operate on it:
    - try to find a function (or attribute) on ourselves
    - try to find a numpy function
    - raise
    """
    assert isinstance(arg, compat.string_types)
    f = getattr(self, arg, None)
    if f is not None:
        if callable(f):
            return f(*args, **kwargs)
        # people may try to aggregate on a non-callable attribute
        # but don't let them think they can pass args to it
        assert len(args) == 0
        assert len([kwarg for kwarg in kwargs
                    if kwarg not in ['axis', '_level']]) == 0
        return f
    f = getattr(np, arg, None)
    if f is not None:
        return f(self, *args, **kwargs)

From some research, I think std() ought to be usable with any window type. A formula for a weighted standard deviation is here:
https://www.itl.nist.gov/div898/software/dataplot/refman2/ch2/weightsd.pdf
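For reference, here is a minimal pure-Python sketch of a weighted standard deviation, following the NIST formula as I read it: the weighted sum of squared deviations from the weighted mean, divided by (N' - 1) / N' times the sum of the weights, where N' is the number of non-zero weights. The function name `weighted_std` is made up for illustration, the weights are assumed to be non-negative, and this is not necessarily what pandas does or should do internally.

```python
import math


def weighted_std(values, weights):
    """Weighted sample standard deviation (illustrative sketch, not pandas code)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")

    total_weight = sum(weights)
    n_nonzero = sum(1 for w in weights if w != 0)  # N' in the formula
    if n_nonzero < 2 or total_weight == 0:
        return float("nan")  # not enough weighted observations

    # Weighted mean of the observations.
    mean = sum(w * x for w, x in zip(weights, values)) / total_weight

    # Weighted sum of squared deviations from the weighted mean.
    squared_dev = sum(w * (x - mean) ** 2 for w, x in zip(weights, values))

    # Denominator: (N' - 1) / N' * sum of weights.
    denominator = (n_nonzero - 1) / n_nonzero * total_weight
    return math.sqrt(squared_dev / denominator)
```

With equal weights this reduces to the ordinary sample standard deviation, so `weighted_std([1, 2, 3], [1, 1, 1])` agrees with the unweighted rolling result for a window containing 1, 2, 3.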

@Connossor

Author

commented Jun 1, 2019

This issue may be related to issue #26462, where the mean() aggregation seems to behave differently for a rolling window versus a groupby-rolling window.

@Connossor

Author

commented Jun 2, 2019

@TomAugspurger I've investigated as you suggested, here is what I think is happening:

The function below seems to govern which class is actually used: we get a pandas.core.window.Window object if the win_type parameter is set, otherwise a pandas.core.window.Rolling object, which seems to be effectively a Window with uniform weights.

pandas/pandas/core/window.py

Lines 2626 to 2633 in addc5fc

def rolling(obj, win_type=None, **kwds):
    if not isinstance(obj, (ABCSeries, ABCDataFrame)):
        raise TypeError('invalid type: %s' % type(obj))
    if win_type is not None:
        return Window(obj, win_type=win_type, **kwds)
    return Rolling(obj, **kwds)

Then, when the "std" aggregation is called, the _try_aggregate_string_function() method linked above checks whether the object has a .std() method and, if none is present, falls back to a numpy function.

The Window class has no .std() method. It falls back to the numpy implementation of std() which fails, as per the example above.

On the other hand, the Rolling class has a std() method which works just fine.
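To make the lookup order concrete, here is a toy sketch of the same pattern. It does not use pandas or numpy: `ToyRolling`, `ToyWindow` and `aggregate_by_name` are hypothetical names, and the standard-library `statistics` module stands in for numpy as the fallback namespace (so the aggregation is called "stdev" here rather than "std").

```python
import statistics


class ToyRolling:
    """Stand-in for a class that defines its own aggregation method."""

    def __init__(self, values):
        self.values = list(values)

    def stdev(self):
        return statistics.stdev(self.values)


class ToyWindow:
    """Stand-in for a class that lacks the aggregation method."""

    def __init__(self, values):
        self.values = list(values)


def aggregate_by_name(obj, name, fallback_namespace=statistics):
    # 1. Prefer a callable attribute defined on the object itself.
    f = getattr(obj, name, None)
    if callable(f):
        return f()
    # 2. Otherwise fall back to a module-level function and hand it the
    #    object itself, which the function may not know how to handle.
    f = getattr(fallback_namespace, name, None)
    if f is not None:
        return f(obj)
    raise ValueError("{} is an unknown string function".format(name))


print(aggregate_by_name(ToyRolling([1, 2, 3]), "stdev"))

try:
    aggregate_by_name(ToyWindow([1, 2, 3]), "stdev")
except TypeError as exc:
    # The fallback function fails deep inside its own code, so the error
    # says nothing about the missing method on ToyWindow.
    print("fallback failed:", exc)
```

The second call fails inside the fallback function rather than at the dispatch step, which is the same shape of problem as the AttributeError about 'dtype' above.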

On a related note: the pandas.core.window.RollingGroupby class seems to inherit the mean() method from the Rolling class, and hence completely ignores the win_type parameter.

Overall, it looks like we need two fixes:

  1. Raise an informative error if the "std" aggregation is used on a Window object, or alternatively implement a weighted standard deviation as described above.
  2. Ensure consistent behaviour with groupby. Rolling is to Window as RollingGroupby is to WindowGroupby, so perhaps we need a WindowGroupby class?
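As a rough illustration of the first option, a check along these lines could run before any aggregation is attempted. This is only a sketch: `check_aggregations` and `ToyWindow` are hypothetical names, not existing pandas functions or classes, and where such a check would live in pandas is an open question.

```python
class ToyWindow:
    """Hypothetical stand-in for an object that implements mean() but not std()."""

    def mean(self):
        return 0.0


def check_aggregations(obj, func_names):
    """Raise a descriptive error for aggregations the object does not implement."""
    unsupported = [
        name for name in func_names
        if not callable(getattr(obj, name, None))
    ]
    if unsupported:
        raise NotImplementedError(
            "{} does not support the aggregation(s): {}".format(
                type(obj).__name__, ", ".join(unsupported)
            )
        )


check_aggregations(ToyWindow(), ["mean"])  # passes silently

try:
    check_aggregations(ToyWindow(), ["mean", "std"])
except NotImplementedError as exc:
    print(exc)
```

The point of checking up front is that the message names both the class and the unsupported aggregation, instead of surfacing an unrelated attribute error from the fallback.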

What do you think?

@TomAugspurger

Contributor

commented Jun 3, 2019

Your number 1. sounds reasonable.

Not sure about 2. I'm not that familiar with this code.

@topper-123 topper-123 added Numeric Window and removed Numeric labels Jun 4, 2019

@jreback jreback added this to the 0.25.0 milestone Jul 9, 2019

@ihsansecer

Contributor

commented Jul 15, 2019

I want to implement a weighted variance function (using the algorithm proposed here), but Window uses a single function with an argument avg (mean if True, otherwise sum) to calculate both mean and sum. I think that function should be split into two first. What do you think? @jreback @WillAyd

@jreback

Contributor

commented Jul 15, 2019

sure this could be changed

@jreback jreback modified the milestones: 0.25.0, 1.0 Jul 17, 2019

@ihsansecer ihsansecer referenced a pull request that will close this issue Jul 31, 2019

Open

ENH: Implement weighted rolling var and std #27682

5 of 5 tasks complete