autocorr() and autocorrelation_plot() output mismatch #24608

Open
guocheng opened this issue Jan 4, 2019 · 8 comments

guocheng commented Jan 4, 2019

import numpy as np
import pandas as pd
from pandas.plotting import autocorrelation_plot
import matplotlib.pyplot as plt

dr = pd.date_range(start='1984-01-01', end='1984-12-31')

df = pd.DataFrame(np.arange(len(dr)), index=dr, columns=["Values"])
autocorrelation_plot(df)
plt.show()

image

autocorr() produces a different result:

for i in range(0,366):
    print(df['Values'].autocorr(lag=i)) # output is 1 for all lags

Problem description

pandas' autocorr() method computes a separate mean and standard deviation for each of the two Series it correlates (the original series and its lagged copy). autocorrelation_plot() instead uses the mean and variance of the unmodified original series for every lag.
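
To put numbers on the mismatch without reading them off a plot, here is a minimal sketch. The helper `single_mean_acf` is a hypothetical name, and it assumes autocorrelation_plot normalises by one global mean and one global variance, as in the `r(h)` snippet quoted further down in this thread.

```python
import numpy as np
import pandas as pd

# Same ramp as in the report: 0, 1, ..., 365 over the leap year 1984.
dr = pd.date_range(start="1984-01-01", end="1984-12-31")
values = pd.Series(np.arange(len(dr), dtype=float), index=dr)


def single_mean_acf(series, lag):
    """Hypothetical helper: autocorrelation at `lag` using ONE global mean and variance."""
    data = series.to_numpy()
    n = len(data)
    mean = data.mean()
    c0 = ((data - mean) ** 2).sum() / n  # variance of the whole series
    return ((data[: n - lag] - mean) * (data[lag:] - mean)).sum() / n / c0


# Compare the two estimates at a few lags.
for lag in (1, 100, 300):
    print(lag, values.autocorr(lag=lag), single_mean_acf(values, lag))
```

For a pure linear trend, autocorr() stays at (approximately) 1 because any shifted copy of a straight line is perfectly correlated with the original, while the single-mean estimate decays and turns negative at large lags.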

Expected Output

autocorr() and autocorrelation_plot() should output the same result.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 35.0.2
Cython: 0.28.5
numpy: 1.15.1
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: 2.5.7
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.7.0


ayhanfuat commented Jan 4, 2019

autocorrelation_plot uses this formula (except that it uses N instead of N-k):
autocorrelation formula screenshot from wikipedia
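
The screenshot did not survive in this copy. Assuming it shows the usual estimator from the Wikipedia article on autocorrelation, the two variants being compared can be written as below. The symbols (n, k, mu, sigma^2) follow that article; the second line is a reading of the `r(h)` function quoted later in this thread, not an official pandas formula.

```latex
% Wikipedia-style estimator, normalised by n - k:
\hat{R}(k) = \frac{1}{(n-k)\,\sigma^{2}} \sum_{t=1}^{n-k} (X_t - \mu)(X_{t+k} - \mu)

% Same sum, normalised by n (as autocorrelation_plot appears to do),
% with \sigma^{2} = \frac{1}{n} \sum_{t=1}^{n} (X_t - \mu)^{2}:
r(k) = \frac{1}{n\,\sigma^{2}} \sum_{t=1}^{n-k} (X_t - \mu)(X_{t+k} - \mu)
```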

Here, mu is the mean of the entire series. autocorr, on the other hand, shifts the series and takes its correlation with the non-shifted one, which in practice means calculating separate means and variances for the lagged and non-lagged versions. Wikipedia says that's also valid (link):

Other possibilities derive from treating the two portions of data separately and calculating separate sample means and/or sample variances for use in defining the estimate.
The advantage of estimates of the last type is that the set of estimated autocorrelations, as a function of k, then form a function which is a valid autocorrelation in the sense that it is possible to define a theoretical process having exactly that autocorrelation. Other estimates can suffer from the problem that, if they are used to calculate the variance of a linear combination of the X's, the variance calculated may turn out to be negative.

That's not very common though. All the formulas I've seen before use the mean of the entire series. But of course this is just me. Different people from different backgrounds might be using different versions.


guocheng commented Jan 4, 2019

If you look at autocorrelation_plot source code, you can see that it uses this equation:

image

@ayhanfuat

@guocheng It's the same equation except for the normalization factor (n vs n-k).

Shifting occurs here.


guocheng commented Jan 4, 2019

@ayhanfuat Thanks for pointing that out.

Just to clarify a few things:

Assuming we have an array [1,2,3,4,5], when lag = 1:

1. We are trying to calculate the correlation of:

[1,2,3,4]
[2,3,4,5]

Correct?

2. If self.corr(self.shift(lag)) is executed, then the correlation of the following is calculated:

[1,2,3,4,5]
[NaN,1,2,3,4]

Correct?

@ayhanfuat

@guocheng Those two are actually the same thing, since pandas ignores NaNs in calculations:

pd.Series([1, 2, 3, 4]).corr(pd.Series([2, 3, 4, 5]))

and

pd.Series([1, 2, 3, 4, 5]).corr(pd.Series([np.nan, 1, 2, 3, 4]))

yield the same result (1).

I tried to illustrate the difference between autocorr and autocorrelation_plot in these equations:

autocorrelation equations

In your example, [1, 2, 3, 4, 5] is the entire series. For the first lag you have two series: left = [1, 2, 3, 4] and right = [2, 3, 4, 5]. Now, if you calculate the correlation between left and right you first compute their corresponding means and variances (mu_left and mu_right, and sigma_left and sigma_right). autocorrelation_plot does not compute these separately. It computes a single mean (mean of [1, 2, 3, 4, 5]) and a single variance (again, computed on [1, 2, 3, 4, 5]). In its source code

def r(h):
    return ((data[:n - h] - mean) *
            (data[h:] - mean)).sum() / float(n) / c0

you see a single mean. This mean is computed over the entire data series. If you compute two different means (mean of data[:n - h] and mean of data[h:]) you would be doing the same thing autocorr is doing. Of course that would require you to modify c0 for variances as well.
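
The point about one mean versus two can be checked on the [1, 2, 3, 4, 5] example. This is only a sketch: `r_single_mean` restates the quoted `r(h)` with `c0` assumed to be the variance of the whole series, and `r_split_means` is a hypothetical variant that uses separate means and variances for the two slices.

```python
import numpy as np
import pandas as pd

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n = len(data)


def r_single_mean(h):
    """One mean and one variance, both computed over the entire series."""
    mean = data.mean()
    c0 = ((data - mean) ** 2).sum() / n
    return ((data[: n - h] - mean) * (data[h:] - mean)).sum() / n / c0


def r_split_means(h):
    """Separate means and variances for the left and right slices (plain Pearson r)."""
    left, right = data[: n - h], data[h:]
    dl, dr = left - left.mean(), right - right.mean()
    return (dl * dr).sum() / np.sqrt((dl ** 2).sum() * (dr ** 2).sum())


print(r_single_mean(1))                 # 0.4
print(r_split_means(1))                 # 1.0, up to float rounding
print(pd.Series(data).autocorr(lag=1))  # matches r_split_means(1)
```

The split-means version reproduces autocorr(), while the single-mean version gives 0.4 at lag 1 for the same data.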


guocheng commented Jan 5, 2019

@ayhanfuat Yes, I realised that using the sample mean vs. the Series mean made the difference. So the question is: should autocorr() and autocorrelation_plot() behave the same? If so, which one is correct? (They currently output vastly different results for this particular example.)

Aside from the issue above, thanks for pointing out the following:

pd.Series([1, 2, 3, 4]).corr(pd.Series([2, 3, 4, 5]))
pd.Series([1, 2, 3, 4, 5]).corr(pd.Series([np.nan, 1, 2, 3, 4])) #this one looks weird

The "shifted" array looks weird to me and it is unclear what is actually happening underneath.

For instance:

Input: np.corrcoef(pd.Series([1,2,3,4,5]), pd.Series([np.nan,1,2,3,4]))

output: array([[ 1., nan],
             [nan, nan]])

Input: np.corrcoef(pd.Series([1,2,3,4]), pd.Series([2,3,4,5]))

output: array([[1., 1.],
             [1., 1.]])

What I am not clear on is how pandas automagically figures out that pd.Series([np.nan, 1, 2, 3, 4]) should effectively be treated like pd.Series([2, 3, 4, 5]).
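
A likely explanation, sketched below: Series.corr aligns the two Series on their index and drops every pair in which either value is NaN before computing Pearson's r, whereas np.corrcoef does no such filtering and so propagates the NaN. The manual masking here only illustrates that behaviour; it is not the pandas implementation itself.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3, 4, 5], dtype=float)
b = a.shift(1)  # [NaN, 1, 2, 3, 4]

# Keep only the positions where both values are present.
valid = a.notna() & b.notna()
print(a[valid].tolist())  # [2.0, 3.0, 4.0, 5.0]
print(b[valid].tolist())  # [1.0, 2.0, 3.0, 4.0]

# After dropping the incomplete pair, NumPy and pandas agree.
manual = np.corrcoef(a[valid], b[valid])[0, 1]
print(manual, a.corr(b))
```

So the shifted Series is not reinterpreted as [2, 3, 4, 5]. The first row is discarded, and the remaining pairs (2, 1), (3, 2), (4, 3), (5, 4) are the same pairs as in the [1, 2, 3, 4] vs [2, 3, 4, 5] comparison.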

guocheng changed the title from "autocorr() might be incorrect for calculating autocorrelation" to "autocorr() and autocorrelation_plot() output mismatch" on Jan 6, 2019
mroeschke added the Visualization plotting label on Jan 13, 2019
@shadiakiki1986
Contributor

Hi guys. I published this Jupyter notebook, in which I suggest improvements to autocorrelation_plot along the lines of what is suggested in this issue. It would be great to hear your feedback before going ahead with a PR.


guocheng commented May 7, 2019

Try this code block:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dr = pd.date_range(start='1984-01-01', end='1984-12-31')

df = pd.DataFrame(np.arange(len(dr)), index=dr, columns=["Values"])
autocorrelation_plot_forked(df)
plt.show()

The output of the code above does not seem right to me. I think the autocorrelation should be 1 for all lags.

mroeschke added the Docs label on Jun 25, 2021