Therefore, the null hypothesis of the ADF test is H0: γ = 0 against the alternative hypothesis H1: γ < 0. In other words, the null hypothesis is the presence of a unit root (non-stationarity), whereas the alternative hypothesis suggests that the data are stationary.

The number of lags m to be included in the regression is usually set to three, under the assumption that differencing beyond the third order is rarely needed to make a time series stationary.
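To make the tested coefficient concrete, the following is a minimal sketch of the regression behind the test, assuming a constant, no trend term, and m = 3 lagged differences. The helper name `adf_gamma` and the simulated series are illustrative assumptions, not part of the original notebook. The sketch only estimates the coefficient on the lagged level (the quantity tested in H0) by least squares; it does not compute the test statistic or p-value, which require the Dickey-Fuller distribution.

```python
import numpy as np

def adf_gamma(x, m=3):
    """OLS estimate of the coefficient on x_{t-1} in an ADF-style regression
    with a constant and m lagged differences (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)                          # dx[k] = x[k+1] - x[k]
    y = dx[m:]                               # response: current differences
    cols = [np.ones(len(y)), x[m:-1]]        # constant and lagged level x_{t-1}
    for i in range(1, m + 1):
        cols.append(dx[m - i:len(dx) - i])   # i-th lagged difference
    X = np.column_stack(cols)
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    return coef[1]                           # coefficient on x_{t-1}

# Simulated data (assumption): a random walk and a stationary AR(1) process
rng = np.random.default_rng(0)
noise = rng.standard_normal(500)
random_walk = np.cumsum(noise)               # unit root: true coefficient is 0
ar1 = np.zeros(500)
for t in range(1, 500):
    ar1[t] = 0.5 * ar1[t - 1] + noise[t]     # stationary: true coefficient is 0.5 - 1 = -0.5

gamma_rw = adf_gamma(random_walk)
gamma_ar1 = adf_gamma(ar1)
print('Estimated coefficient (random walk):', gamma_rw)
print('Estimated coefficient (AR(1), phi=0.5):', gamma_ar1)
```

The estimate for the random walk should sit close to zero, while the estimate for the stationary series should be clearly negative, which is the contrast the two hypotheses describe.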

In [1]:
from __future__ import print_function
import os
import pandas as pd
from statsmodels.tsa import stattools
%matplotlib notebook
from matplotlib import pyplot as plt

In [2]:
#Set current directory and work relative to it
os.chdir('E:/gitlab/project_on_python/deep time series forcasting/Practical-Time-Series-Analysis-master')

In [3]:
#Read the data into a pandas.DataFrame
air_miles = pd.read_csv('datasets/us-airlines-monthly-aircraft-miles-flown.csv')
air_miles.index = air_miles.Month

In [4]:
#Let's find out the shape of the DataFrame
print('Shape of the DataFrame:', air_miles.shape)

Shape of the DataFrame: (97, 2)


In [5]:
#Let's see first 10 rows of it
air_miles.head(10)

Month (index),Month,U.S. airlines: monthly aircraft miles flown (Millions) 1963 -1970
1963-01,1963-01,6827.0
1963-02,1963-02,6178.0
1963-03,1963-03,7084.0
1963-04,1963-04,8162.0
1963-05,1963-05,8462.0
1963-06,1963-06,9644.0
1963-07,1963-07,10466.0
1963-08,1963-08,10748.0
1963-09,1963-09,9963.0
1963-10,1963-10,8194.0


In [6]:
#Let's rename the 2nd column
air_miles.rename(columns={'U.S. airlines: monthly aircraft miles flown (Millions) 1963 -1970':
                          'Air miles flown'},
                 inplace=True)

In [7]:
#Check for missing values and remove the row
missing = pd.isnull(air_miles['Air miles flown'])
print('Number of missing values found:', missing.sum())
air_miles = air_miles.loc[~missing, :]

Number of missing values found: 1


The time series plotted below has an uptrend as well as seasonality and is therefore non-stationary, which the ADF test will verify.

In [8]:
#Plot the time series of air miles flown
fig = plt.figure(figsize=(5.5, 5.5))
ax = fig.add_subplot(1,1,1)
air_miles['Air miles flown'].plot(ax=ax)
ax.set_title('Monthly air miles flown during 1963 - 1970')
# plt.savefig('plots/ch2/B07887_02_13.png', format='png', dpi=300)
plt.show()

[Figure: Monthly air miles flown during 1963 - 1970]

In [9]:
#Run the ADF test and let the function choose the number of lags
adf_result = stattools.adfuller(air_miles['Air miles flown'], autolag='AIC')

In [10]:
print('p-val of the ADF test in air miles flown:', adf_result[1])

p-val of the ADF test in air miles flown: 0.9945022811234028


The argument autolag='AIC' instructs the function to choose a suitable number of lags for the test by minimizing the Akaike Information Criterion (AIC). Alternatively, the test can be run with a fixed number of lags by passing that number in the keyword argument maxlag and setting autolag=None; when autolag is set, maxlag only acts as the upper bound of the search. We prefer using the AIC over fixing a lag to avoid trial and error in finding the best lag for running the test.

In [11]:
adf_result

(1.0229489778119756,
 0.9945022811234028,
 11,
 84,
 {'1%': -3.510711795769895,
  '5%': -2.8966159448223734,
  '10%': -2.5854823866213152},
 1356.2366247658094)

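The returned tuple contains, in order: the test statistic, the p-value, usedlag (the number of lags actually used for running the test), the number of observations used in the regression, the critical values of the test statistic at the 1%, 5%, and 10% significance levels, and the best value of the information criterion found during lag selection.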
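Positional access such as adf_result[1] is easy to misread. One possible way to make the fields explicit is to wrap the tuple in a namedtuple. The name `ADFResult` and its field names are hypothetical labels chosen for this sketch and are not part of statsmodels; the values are copied from the output of cell In [11].

```python
from collections import namedtuple

# Hypothetical container; the field names are illustrative labels only
ADFResult = namedtuple('ADFResult',
                       ['statistic', 'pvalue', 'usedlag', 'nobs',
                        'critical_values', 'icbest'])

# Values copied from the adf_result output printed in the notebook
adf_named = ADFResult(1.0229489778119756,
                      0.9945022811234028,
                      11,
                      84,
                      {'1%': -3.510711795769895,
                       '5%': -2.8966159448223734,
                       '10%': -2.5854823866213152},
                      1356.2366247658094)

print('Test statistic:', adf_named.statistic)
print('Lags used:', adf_named.usedlag)
print('Observations used:', adf_named.nobs)
```

In the notebook itself, the same container could be built directly with `ADFResult(*adf_result)`, since autolag='AIC' returns all six elements.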

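The p-value of about 0.99 is far above 0.05, and the test statistic (about 1.02) is greater than the critical values at all three levels, so the null hypothesis of a unit root cannot be rejected. The series is therefore treated as non-stationary.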
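The decision can be written down as a small check. The helper `adf_decision` below is a hypothetical function for illustration, assuming the usual convention of rejecting the null hypothesis when the p-value falls below a chosen significance level (0.05 here); the numbers are the ones printed by the notebook.

```python
def adf_decision(statistic, pvalue, critical_values, alpha=0.05):
    """Return (reject, below): whether the unit-root null is rejected at level
    alpha, and whether the statistic lies below each critical value.
    Hypothetical helper, not part of statsmodels."""
    below = {level: statistic < value
             for level, value in critical_values.items()}
    reject = pvalue < alpha
    return reject, below

# Values copied from the ADF output for the air miles series
reject, below = adf_decision(1.0229489778119756,
                             0.9945022811234028,
                             {'1%': -3.510711795769895,
                              '5%': -2.8966159448223734,
                              '10%': -2.5854823866213152})

print('Reject the unit-root null at the 5% level:', reject)
print('Statistic below critical value:', below)
```

Both checks point the same way for this series: the null hypothesis is not rejected at any of the three levels.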