Pandas df.rolling.mean() abnormal behavior for a Series having larger numbers (in the scale of billions) #28305

@bobnvk99

Code Sample, a copy-pastable example if possible

import pandas as pd

# Sample residual percentages (the third value is the abnormal one)
residuals_pct = [0.044516001, 0.031137117, 1.06758E+64, 0.003522454, 0.065171486, 0.06033751, 0.01325514, -0.005620799, -0.006225719, 0.045713825, 0.039280786, 0.000531307]

# Create a DataFrame from the residual percentages
d = pd.DataFrame(residuals_pct, columns=['residual_pct'])
# Shift the percentages down by one row
d['residual_pct_shift'] = d.residual_pct.shift(1)
# Adjusted residual pct: 3-row rolling mean of the shifted values, from row 3 onward
d['adj_residual_pct'] = d['residual_pct_shift'][3:].rolling(window=3).mean()

Problem description

I am trying to implement the adjusted residual percent as the rolling average of the previous three residual percents.
Due to a data issue, the model forecast the target variable on the scale of billions, which in turn affected the residual percentage calculation.
That is not the problem I am asking about, though. The issue is that when I compute the rolling average with .rolling(window=3).mean(), the averages for the rows after the abnormal residual percent are also wrong, even for windows that no longer contain it.
If I take those same numbers and average them the normal way, or use MS Excel, I get the correct result.
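
To make the comparison concrete, here is a minimal pure-Python sketch of the two ways of averaging. The function names are hypothetical and nothing here is taken from the pandas source. It assumes, as one possible explanation that I have not confirmed, that a rolling mean could be maintained as a running sum that adds the entering value and subtracts the leaving one. Under that assumption, the small values are absorbed into the huge one by floating-point rounding and are lost when the huge value is subtracted back out.

```python
# Same sample values as in the code sample above.
residuals_pct = [0.044516001, 0.031137117, 1.06758E+64, 0.003522454,
                 0.065171486, 0.06033751, 0.01325514, -0.005620799,
                 -0.006225719, 0.045713825, 0.039280786, 0.000531307]


def rolling_mean_recompute(values, window):
    """Average every window from scratch (the 'normal way of averaging')."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough values yet
        else:
            chunk = values[i + 1 - window:i + 1]
            out.append(sum(chunk) / window)
    return out


def rolling_mean_running_sum(values, window):
    """Hypothetical running-sum variant: add the new value, drop the oldest."""
    out = []
    total = 0.0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            total -= values[i - window]  # value that just left the window
        out.append(total / window if i + 1 >= window else None)
    return out


recomputed = rolling_mean_recompute(residuals_pct, 3)
running = rolling_mean_running_sum(residuals_pct, 3)

# In float64 the small value is absorbed completely by the huge one.
print(1.06758E+64 + 0.003522454 == 1.06758E+64)  # True

# Index 5 is the first window that no longer contains the abnormal value.
print(recomputed[5])  # roughly 0.043
print(running[5])     # 0.0, the small values were lost in the running sum
```

In this sketch the running-sum version stays wrong for every later row as well, because the lost precision is never recovered. That matches the pattern I am seeing, but I do not know whether this is actually what pandas does.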

Please see the image below:
[Image: pandas_rolling_mean_issue]

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.1
numpy : 1.16.4
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.2
setuptools : 41.0.1
Cython : 0.29.13
pytest : 5.0.1
hypothesis : None
sphinx : 2.1.2
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.7
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.1.8

Is there anything that I am missing or unaware of? I checked the pandas source code and found nothing. Irrespective of the scale of the numbers, Excel and the normal way of averaging work fine.

Please advise.
I greatly appreciate your help.

Thank you.

Labels: Needs Info (Clarification about behavior needed to assess issue)
