# Autocorrelation

The second [youtuber](https://www.youtube.com/@egorhowell/videos) did a very good job of explaining stability. What does he say about [autocorrelation](https://www.youtube.com/watch?v=GcJ__g_cimA)?

We're using the same air passengers data set.

In [1]:
import pandas as pd

df = pd.read_csv('files/AirPassengers.csv', index_col=0)
df.index = pd.to_datetime(df.index)

The video uses a different library, but pandas can also draw an autocorrelation plot.

In [None]:
pd.plotting.autocorrelation_plot(df['#Passengers'])

But what are we looking at? Well, let's zoom in to the first 24 'lags'.

In [None]:
pd.plotting.autocorrelation_plot(df['#Passengers']).set_xlim([0, 24])

The "lag" is the number of data points you go back to look for a correlation. This lets us pick out some key points (compare with the actual data below):

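* Upper left corner, lag 1: every data point is compared with the one right before it. Consecutive months are very similar, so the correlation is close to 1. (Only lag 0, a data point compared with itself, gives a perfect positive correlation of exactly 1.)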
* Lag 6: We're going back 6 months. In this dataset that is a bad idea, as you are comparing summer months (spikes) to winter months (valleys). Those two do not line up, so the correlation is lower here than at lag 12.
* Lag 12: Go back 1 year. This means we are correlating spikes with spikes and valleys with valleys. The correlation is therefore high.
* Lag 24: Go back 2 years. Same as lag 12, but slightly lower, as the data points are now twice as far apart.
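To make the idea of a lag concrete, here is a minimal sketch on synthetic data, not the air passengers file. The series `toy` and the helper `lag_corr` are made up for illustration: a purely seasonal monthly series with a 12-month cycle. The air passengers data also has a trend, which this toy series deliberately leaves out. The sketch assumes `Series.autocorr(lag=k)` is the Pearson correlation between the series and a copy of itself shifted by `k` rows, as the pandas documentation describes it.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a monthly series: 10 years with a yearly cycle.
months = pd.date_range("2000-01-01", periods=120, freq="MS")
t = np.arange(120)
toy = pd.Series(10 * np.sin(2 * np.pi * t / 12), index=months)


def lag_corr(series, lag):
    """Correlate each value with the value `lag` rows earlier."""
    shifted = series.shift(lag)  # row i now holds the value from `lag` months before
    return series.corr(shifted)  # pairs with a missing value are ignored


for lag in (1, 6, 12):
    by_hand = lag_corr(toy, lag)
    built_in = toy.autocorr(lag=lag)
    print(f"Lag {lag}: by hand {by_hand:.3f}, autocorr {built_in:.3f}")
```

Going back 6 months pairs every peak with a valley, going back 12 months pairs peaks with peaks, which is the same pattern as described in the list above.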

In [None]:
df.plot()

There is another [function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.autocorr.html) in pandas to calculate autocorrelation.

In [None]:
for i in range(1, 25):
    corr = df["#Passengers"].autocorr(lag=i)
    print(f"Lag {i}: {corr:.3f}")

We could graph this, but we'd be back at the plot we had before. Nice peaks at 12 and 24 though.
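
One caveat: the printed numbers do not have to match the plot exactly. As far as we can tell from the pandas source, `autocorrelation_plot` uses one overall mean and variance for every lag, while `Series.autocorr` computes a Pearson correlation on just the overlapping pairs. The sketch below compares the two on synthetic data. The helper `acf_global` and the series `toy` are made up for illustration, and the formula is our assumption of what the plot does, not a copy of the pandas code.

```python
import numpy as np
import pandas as pd


def acf_global(series, lag):
    """Autocorrelation using one overall mean and variance (assumed plot-style)."""
    x = series.to_numpy(dtype=float)
    n = len(x)
    d = x - x.mean()  # deviations from the mean of the whole series
    return float((d[: n - lag] * d[lag:]).sum() / (d ** 2).sum())


# Hypothetical monthly series: trend + yearly cycle + noise, 12 years long.
rng = np.random.default_rng(0)
t = np.arange(144)
toy = pd.Series(100 + 2 * t + 20 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 5, 144))

for lag in (1, 12, 24):
    print(
        f"Lag {lag}: plot-style {acf_global(toy, lag):.3f}, "
        f"autocorr {toy.autocorr(lag=lag):.3f}"
    )
```

Both versions should show the same peaks at 12 and 24. The plot-style values shrink faster as the lag grows, because fewer pairs are summed while the divisor stays the same.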