Stationarity - the distribution of the series (and hence its mean and variance) remains constant through time.
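As a rough illustration on synthetic data (not the airline series), white noise is a standard example of a stationary series, while its running sum, a random walk, is not, because the walk's variance grows with time. The names `noise`, `walk` and `recovered`, and the seed, are made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# White noise: every point is drawn from the same distribution -> stationary.
noise = rng.normal(loc=0.0, scale=1.0, size=1000)

# Random walk: running sum of the noise -> its variance grows with time.
walk = np.cumsum(noise)

# Differencing the walk gives back the stationary noise (minus the first point).
recovered = np.diff(walk)
print(np.allclose(recovered, noise[1:]))
```

This is the same idea the differencing steps later in the notebook rely on.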

Thanks to Egor Howell! https://www.youtube.com/playlist?list=PLKmQjl_R9bYd32uHImJxQSFZU5LPuXfQe

In [7]:
import numpy as np  # used later for np.log
import pandas as pd
import plotly.express as px
import datetime
import statsmodels

In [8]:
# Raw AirPassengers data from selva86's datasets repo on GitHub. Thanks selva86!
data = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/AirPassengers.csv")
data = data.rename(columns={"value": "passengers"})
# Year-month label (e.g. "1949-01") used as the x-axis in the plots below
data["month"] = pd.to_datetime(data["date"]).dt.to_period("M").astype(str)

In [9]:
data.head()

Unnamed: 0,date,passengers,month
0,1949-01-01,112,1949-01
1,1949-02-01,118,1949-02
2,1949-03-01,132,1949-03
3,1949-04-01,129,1949-04
4,1949-05-01,121,1949-05


In [10]:
def plotting(title, data, x, y, x_label, y_label):
    """General function to plot the passenger data."""
    fig = px.line(data, x=x, y=y, labels={x: x_label, y: y_label})

    fig.update_layout(template="simple_white", font=dict(size=18),
                      title_text=title, width=650,
                      title_x=0.5, height=400)

    fig.show()

In [11]:
# Plot the airline passenger data
plotting(title='Airline Passengers', 
         data=data, 
         x='month',
         y='passengers', 
         x_label='date', 
         y_label='passengers')

Visually the series is clearly non-stationary: it trends upward and its seasonal swings grow over time. To check this statistically, let's run an ADF test.

#### Augmented Dickey-Fuller (ADF) test
H0: the series has a unit root, i.e. it is non-stationary. H1: the series is stationary.
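For reference, the ADF test is usually described through a regression of the following form, shown here with a constant and no trend term. The symbols are notation chosen for this sketch, not names from the notebook or from `statsmodels`.

$$
\Delta y_t = \alpha + \gamma\, y_{t-1} + \sum_{i=1}^{p} \delta_i\, \Delta y_{t-i} + \varepsilon_t
$$

H0 corresponds to $\gamma = 0$ (a unit root) and the alternative to $\gamma < 0$. The more negative the test statistic, the stronger the evidence against H0, which is why the critical values printed below are all negative.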




In [12]:
from statsmodels.tsa.stattools import adfuller

def test_adf(sequence):
    res = adfuller(sequence)
    print('Statistic: ', res[0])
    print('p-value: ', res[1])
    print('critical values:')
    for threshold, statistic in res[4].items():
        print('\t%s: %.2f' % (threshold, statistic))


In [14]:
test_adf(data["passengers"])

Statistic:  0.8153688792060498
p-value:  0.991880243437641
critical values:
	1%: -3.48
	5%: -2.88
	10%: -2.58


The p-value is 0.99, so we fail to reject H0: the test gives no evidence of stationarity, as expected.
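The reading of the ADF output can be written down as a small rule. The two helpers below, `adf_decision` and `rejects_at`, are hypothetical names introduced for this sketch, and the 0.05 default is just the usual convention.

```python
def adf_decision(p_value, alpha=0.05):
    """Read an ADF p-value against a significance level (hypothetical helper)."""
    if p_value < alpha:
        return "reject H0: evidence of stationarity"
    return "fail to reject H0: no evidence of stationarity"


def rejects_at(statistic, critical_values):
    """For each level, H0 is rejected only if the statistic is below the critical value."""
    return {level: statistic < value for level, value in critical_values.items()}


# Numbers copied from the ADF output above for the raw passenger series
print(adf_decision(0.991880243437641))
print(rejects_at(0.8153688792060498, {"1%": -3.48, "5%": -2.88, "10%": -2.58}))
```

Both views agree here: the statistic is above every critical value and the p-value is far above 0.05.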

#### To make it stationary

##### Try monthly differencing

$d(t) = \text{passengers}(t) - \text{passengers}(t-1)$
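To make the formula concrete, here is what `diff(1)` does to the first five passenger counts shown in `data.head()` above. The standalone `passengers` series is built by hand for this sketch only.

```python
import pandas as pd

# First five passenger counts from data.head() above
passengers = pd.Series([112, 118, 132, 129, 121])

# diff(1) subtracts the previous value; the first entry has no predecessor
first_diff = passengers.diff(1)
print(first_diff.tolist())  # [nan, 6.0, 14.0, -3.0, -8.0]
```

The leading `NaN` is why the differenced column is passed through `.dropna()` before the ADF test.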

In [15]:
data["pass_diff_1m"] = data["passengers"].diff(1)

In [17]:
plotting(title='Airline Passengers(first diff)', 
         data=data, 
         x='month',
         y='pass_diff_1m', 
         x_label='date', 
         y_label='pass_diff_1m')

Visually this does not look stationary: the variance increases over time. Let us verify with the ADF test as well.

In [23]:
test_adf(data["pass_diff_1m"].dropna())

Statistic:  -2.8292668241700047
p-value:  0.05421329028382478
critical values:
	1%: -3.48
	5%: -2.88
	10%: -2.58


The p-value (0.054) is just above 0.05, so we fail to reject the null hypothesis at the 5% level; the series still counts as non-stationary, as expected.

To stabilize the variance, we can try a logarithm transform.
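One way to see why the log helps, assuming the seasonal swings scale with the level of the series (a multiplicative pattern): taking logs turns the product into a sum, so the swings have the same size at every level. The `trend` and `seasonal` arrays below are invented for this sketch and are not fitted to the airline data.

```python
import numpy as np

t = np.arange(48)
trend = 100.0 * 1.02 ** t                          # level grows about 2% per step
seasonal = 1.0 + 0.2 * np.sin(2 * np.pi * t / 12)  # +/-20% swing, period 12
series = trend * seasonal                          # swings grow with the level

log_series = np.log(series)

# After the log, the seasonal part is additive and no longer depends on the level.
log_seasonal = log_series - np.log(trend)
print(np.allclose(log_seasonal, np.log(seasonal)))
```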

In [24]:
data["pass_log"] = np.log(data["passengers"])

In [26]:
plotting(title='Airline Passengers(Log passengers)', 
         data=data, 
         x='month',
         y='pass_log', 
         x_label='date', 
         y_label='pass_log')

The variance seems stabilized. Next, we can try differencing the log passengers.
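Differencing a log series has a convenient reading: it is the log of the month-over-month ratio, which for small changes is close to the relative change. In the notation below, $y_t$ stands for passengers(t).

$$
\log y_t - \log y_{t-1} = \log\frac{y_t}{y_{t-1}} \approx \frac{y_t - y_{t-1}}{y_{t-1}}
$$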

In [28]:
data["pass_log_diff"] = data["pass_log"].diff(1)

In [29]:
plotting(title='Airline Passengers(Log Diff passengers)', 
         data=data, 
         x='month',
         y='pass_log_diff', 
         x_label='date', 
         y_label='pass_log_diff')

The mean and variance look constant over time, but the series is still cyclic (seasonal).

In [30]:
test_adf(data["pass_log_diff"].dropna())

Statistic:  -2.717130598388114
p-value:  0.07112054815086184
critical values:
	1%: -3.48
	5%: -2.88
	10%: -2.58


This still looks non-stationary to me: the p-value is 0.07, above the 0.05 threshold (95% confidence) I would require.

Let us try a 6-month difference.



In [31]:
data["pass_log_diff_6"] = data["pass_log"].diff(6)

In [32]:
test_adf(data["pass_log_diff_6"].dropna())

Statistic:  -3.2655285264838154
p-value:  0.016491446253817217
critical values:
	1%: -3.48
	5%: -2.88
	10%: -2.58


In [33]:
plotting(title='Airline Passengers(Log Diff passengers - 6m)', 
         data=data, 
         x='month',
         y='pass_log_diff_6', 
         x_label='date', 
         y_label='pass_log_diff_6')

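The ADF test now rejects H0 at the 5% level (p-value = 0.016), but the plot still looks cyclic. Let us try a 12-month difference.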
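The airline data are monthly, so the seasonal cycle presumably repeats every 12 observations rather than every 6. Here is a minimal sketch on synthetic data (the `pattern` and the slope are invented) of why a lag equal to the seasonal period removes a repeating pattern while a shorter lag does not:

```python
import numpy as np
import pandas as pd

period = 12
pattern = np.sin(2 * np.pi * np.arange(period) / period)  # one seasonal cycle
seasonal = np.tile(pattern, 4)                            # repeated for 4 "years"
trend = 0.5 * np.arange(seasonal.size)                    # linear trend, slope 0.5
series = pd.Series(trend + seasonal)

# Lag equal to the period: the seasonal terms cancel, leaving period * slope.
diff_12 = series.diff(12).dropna()
# Lag of half the period: the seasonal terms do not cancel.
diff_6 = series.diff(6).dropna()

print(diff_12.round(6).unique())
print(diff_6.round(6).nunique() > 1)
```

Real data are noisier than this, so a seasonal difference reduces the cycle rather than eliminating it exactly.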

In [35]:
data["pass_log_diff_12"] = data["pass_log"].diff(12)

In [36]:
plotting(title='Airline Passengers(Log Diff passengers - 12m)', 
         data=data, 
         x='month',
         y='pass_log_diff_12', 
         x_label='date', 
         y_label='pass_log_diff_12')

In [37]:
test_adf(data["pass_log_diff_12"].dropna())

Statistic:  -2.7095768189885687
p-value:  0.07239567181769489
critical values:
	1%: -3.49
	5%: -2.89
	10%: -2.58
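With a p-value of about 0.07, the 12-month difference alone still does not reject H0 at the 5% level. One commonly used next step, which this notebook does not run, is to apply a first difference on top of the seasonal difference. The sketch below uses a hypothetical helper, `seasonal_then_first_diff`, and checks it on synthetic data only; no ADF result for the airline series is claimed here.

```python
import numpy as np
import pandas as pd


def seasonal_then_first_diff(series, season=12):
    """Seasonal difference followed by a first difference (hypothetical helper)."""
    return series.diff(season).diff(1).dropna()


# Synthetic check: quadratic trend plus a repeating 12-step pattern.
t = np.arange(60)
synthetic = pd.Series(0.1 * t**2 + np.tile(np.arange(12.0), 5))

result = seasonal_then_first_diff(synthetic)
print(len(result), result.round(6).unique())
```

In the notebook this could be tried as `test_adf(seasonal_then_first_diff(data["pass_log"]))`, after which the p-value would need to be read the same way as above.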
