autocorr() and autocorrelation_plot() output mismatch #24608
Comments
Here, mu is the mean of the entire series.
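For reference, the two estimators being compared in this thread can be sketched as below. The notation ($x_t$, $n$, lag $h$, $c_0$, $\bar{x}_1$, $\bar{x}_2$) is introduced here for illustration and is not quoted from the pandas source. The first form uses the mean and variance of the entire series; the second is the ordinary Pearson correlation of the two overlapping segments, each with its own mean and standard deviation.

```latex
% Whole-series estimator: \mu and c_0 are computed once from all n points
r(h) = \frac{1}{n\,c_0} \sum_{t=1}^{n-h} (x_t - \mu)(x_{t+h} - \mu),
\qquad c_0 = \frac{1}{n} \sum_{t=1}^{n} (x_t - \mu)^2,
\qquad \mu = \frac{1}{n} \sum_{t=1}^{n} x_t

% Per-segment estimator: Pearson correlation of x_1..x_{n-h} with x_{h+1}..x_n
\rho(h) = \frac{\sum_{t=1}^{n-h} (x_t - \bar{x}_1)(x_{t+h} - \bar{x}_2)}
               {\sqrt{\sum_{t=1}^{n-h} (x_t - \bar{x}_1)^2}\,
                \sqrt{\sum_{t=1}^{n-h} (x_{t+h} - \bar{x}_2)^2}},
\qquad \bar{x}_1 = \frac{1}{n-h} \sum_{t=1}^{n-h} x_t,
\qquad \bar{x}_2 = \frac{1}{n-h} \sum_{t=1}^{n-h} x_{t+h}
```

The two agree at $h = 0$ and generally diverge as $h$ grows, because the segment means and variances drift away from the whole-series values.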
That's not very common though. All the formulas I've seen before use the mean of the entire series. But of course this is just me. Different people from different backgrounds might be using different versions.
If you look at
@ayhanfuat Thanks for pointing that out. Just to clarify a few things: Assuming we have an array [1,2,3,4,5], when lag = 1:
1. we are trying to calculate the correlation of: [1,2,3,4] Correct?
[1,2,3,4,5] Correct?
@guocheng Those two are actually the same thing since pandas ignores NaNs in calculations:
and
yield the same result (1). I tried to illustrate the difference between In your example, [1, 2, 3, 4, 5] is the entire series. For the first lag you have two series:
you see a single
@ayhanfuat yes, I realised that using the sample mean vs Series mean made the difference. So the question is: Should
Aside from the issue above, thanks for pointing out the following:
pd.Series([1, 2, 3, 4]).corr(pd.Series([2, 3, 4, 5]))
pd.Series([1, 2, 3, 4, 5]).corr(pd.Series([np.nan, 1, 2, 3, 4])) #this one looks weird
The "shifted" array looks weird to me and it is unclear what is actually happening underneath. For instance:
Input: np.corrcoef(pd.Series([1,2,3,4,5]), pd.Series([np.nan,1,2,3,4]))
output: array([[ 1., nan],
[nan, nan]])
Input: np.corrcoef(pd.Series([1,2,3,4]), pd.Series([2,3,4,5]))
output: array([[1., 1.],
[1., 1.]])
The part that I am not clear about is how Pandas automagically figures out that
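A minimal sketch of the NaN handling discussed above, for anyone who wants to reproduce it. It assumes that `Series.corr` aligns the two series on their index and drops every pair in which either value is missing before computing the coefficient, whereas `np.corrcoef` has no missing-value handling. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], dtype=float)
shifted = s.shift(1)  # [NaN, 1, 2, 3, 4], same index as s

# Series.corr skips the pair that contains NaN and correlates the other four
r_pandas = s.corr(shifted)

# The same thing done by hand: drop incomplete pairs, then correlate
pairs = pd.concat([s, shifted], axis=1).dropna()
r_manual = np.corrcoef(pairs[0], pairs[1])[0, 1]

# np.corrcoef does not drop anything, so the NaN propagates into the result
r_numpy = np.corrcoef(s, shifted)[0, 1]

# autocorr(lag=1) correlates the series with its own shifted copy
r_autocorr = s.autocorr(lag=1)

print(r_pandas, r_manual, r_numpy, r_autocorr)
```

So the "shifted" series with a leading NaN and the manually truncated pair of series give the same coefficient in pandas, while NumPy returns NaN for the shifted version.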
Hi guys. I published this Jupyter notebook in which I suggest improvements to
Try this code block:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dr = pd.date_range(start='1984-01-01', end='1984-12-31')
df = pd.DataFrame(np.arange(len(dr)), index=dr, columns=["Values"])
autocorrelation_plot_forked(df)
plt.show()
The output of the above does not seem right to me. The autocorrelation should be 1 for all lags I think.
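The linear-ramp case can be checked numerically without plotting. The sketch below is not the notebook's `autocorrelation_plot_forked`; `acf_whole_series` is a hypothetical helper written here that uses the mean and variance of the entire series, which is how the problem description in this issue characterises `autocorrelation_plot`. It is compared against `Series.autocorr`, which correlates the two overlapping segments.

```python
import numpy as np
import pandas as pd


def acf_whole_series(x, lag):
    """Lag-`lag` autocorrelation using the mean and variance of the whole series.

    Hypothetical helper for illustration; not part of pandas.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    mu = x.mean()
    c0 = np.sum((x - mu) ** 2) / n
    return np.sum((x[: n - lag] - mu) * (x[lag:] - mu)) / n / c0


# Same shape of data as the snippet above: one increasing value per day of 1984
dr = pd.date_range(start="1984-01-01", end="1984-12-31")
values = pd.Series(np.arange(len(dr), dtype=float), index=dr)

for lag in (1, 30, 180):
    per_segment = values.autocorr(lag=lag)   # Pearson correlation of the two segments
    whole = acf_whole_series(values, lag)    # whole-series mean and variance
    print(lag, per_segment, whole)
```

On a perfectly linear series the per-segment value stays at 1 (up to floating-point error) for every lag, because each segment is a linear function of the other. The whole-series estimator falls below 1 and decreases with the lag, so whether "1 for all lags" is the expected answer depends on which definition is intended.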
The autocorr() method produces a different result:
Problem description
Pandas's autocorr() method uses the mean and std of each Series (modified original series and lagged series) for calculation. The autocorrelation_plot method uses the mean and std of the unmodified original series for calculation.
Expected Output
autocorr() and autocorrelation_plot() should output the same result.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 35.0.2
Cython: 0.28.5
numpy: 1.15.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: 2.5.7
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.7.0