# The Dickey-Fuller Test

Previously, you learned how to detect the presence of trend and/or seasonality using charts and decomposition.  If either is present in the data, then the time series is probably not stationary.

Let's revisit the time series dataset containing the number of airline passengers traveling each month.  Remember to prepare the time series data for analysis by
- converting the *Month* column to data type datetime
- setting the *Month* column as the index of the time series

In [25]:
import pandas as pd

# Load the monthly airline passengers dataset
passengers = pd.read_csv('https://mathatwork.org/DATA/airpassengers.csv')

# Convert Month to datetime, then use it as the index
passengers.Month = pd.to_datetime(passengers.Month)
passengers.index = passengers.Month
passengers = passengers.drop('Month', axis=1)
print(passengers.head())

            Passengers
Month                 
1949-01-01         112
1949-02-01         118
1949-03-01         132
1949-04-01         129
1949-05-01         121


Nice!  Recall that decomposition revealed a clear presence of both trend and seasonality in the data, which provided evidence that the data is probably non-stationary.
<br><br>
Let's apply Dickey-Fuller to test this hypothesis.

In [31]:
from statsmodels.tsa.stattools import adfuller

dftest = adfuller(passengers.Passengers)
pvalue = pd.Series(dftest[1], index=['p-value'])

print(pvalue)

p-value    0.99188
dtype: float64


In Python, **adfuller** from the *statsmodels* library runs the augmented Dickey-Fuller test.  In the code above, adfuller was imported and then run on the *Passengers* column of the *passengers* DataFrame.  Results from the adfuller test were stored in a tuple we called *dftest*.  
<br>
adfuller returns several other statistics, but we only stored the *p-value* (the second element, *dftest[1]*) in a new Series called *pvalue*.  Look [here](https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.adfuller.html) for a list of returned statistics of adfuller.  We then printed *pvalue* to the screen.

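The null hypothesis of the Dickey-Fuller test is that the time series is non-stationary (it has a unit root).  The p-value here is 0.99.  This is not the probability that the time series is non-stationary; it is the probability of seeing a test statistic at least this extreme if the null hypothesis were true.  A p-value this high means the data give essentially no evidence against non-stationarity.  If you decided, for example, on a confidence level of 95%, then alpha would be 0.05.  In this case, since the p-value = 0.99 is NOT LESS THAN alpha = 0.05, you fail to reject the null hypothesis and continue to treat the time series as non-stationary.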
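The comparison of the p-value with alpha can be written as a small helper.  This is a minimal sketch: the function name *interpret_adf* is hypothetical and not part of statsmodels, and the input is the p-value printed above.

```python
def interpret_adf(p_value, alpha):
    """Compare a Dickey-Fuller p-value with a chosen alpha."""
    if p_value < alpha:
        # Enough evidence against the test's null hypothesis of non-stationarity
        return 'p-value < alpha: treat the series as stationary'
    # Not enough evidence against the null hypothesis
    return 'p-value >= alpha: treat the series as non-stationary'

# p-value of the original passengers series at a 95% confidence level
print(interpret_adf(0.99188, alpha=0.05))
```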

# Differencing For Estimating Trend and Seasonality

When applying differencing for estimation, you should begin with **first order differencing**, that is, differencing with lag=1.  This means you take your original data series and subtract a new series created by shifting the original series' data elements over by 1 index.
<br><br>
For example, consider the following series:

In [12]:
my_series = pd.DataFrame([10, 20, 15, 30, 45])

Apply pandas' **.shift( )** method and pass in *periods=1* to specify a lag equal to 1.

In [13]:
my_series_lag1 = my_series.shift(periods=1)
print(my_series_lag1)

      0
0   NaN
1  10.0
2  20.0
3  15.0
4  30.0


**.shift(periods=1)** shifted *my_series* data items over (or down) by 1 index.  Notice that consequently there is no data at ROW INDEX 0.  Subtract the lagged series from the original series to get first order differencing.

In [14]:
first_diff = my_series - my_series_lag1
print(first_diff)

      0
0   NaN
1  10.0
2  -5.0
3  15.0
4  15.0


Notice that there is still no data at ROW INDEX 0.  The resulting DataFrame is the element by element difference between *my_series* and *my_series_lag1*.  
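pandas also provides a **.diff( )** method that does the shift and the subtraction in one step.  The short check below, using the same example values, compares the two approaches; the variable names are illustrative.

```python
import pandas as pd

my_series = pd.DataFrame([10, 20, 15, 30, 45])

# Manual first order differencing: original minus the series lagged by 1
manual_diff = my_series - my_series.shift(periods=1)

# Built-in equivalent
builtin_diff = my_series.diff(periods=1)

# DataFrame.equals treats NaNs in the same location as equal
print(manual_diff.equals(builtin_diff))
```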
<br>
Similarly, to get what this lesson calls **second order differencing** (strictly, a lag-2 difference), you take your original data series and subtract a new series created by shifting the original series' data elements over by 2 indices. 

In [15]:
my_series_lag2 = my_series.shift(periods=2)
print(my_series_lag2)

      0
0   NaN
1   NaN
2  10.0
3  20.0
4  15.0


**.shift(periods=2)** shifted *my_series* data items over (or down) by 2 indices.  Notice that consequently there is no data at ROW INDICES 0 and 1.  Subtract the lagged series from the original series to get second order differencing.

In [16]:
second_diff = my_series - my_series_lag2
print(second_diff)

      0
0   NaN
1   NaN
2   5.0
3  10.0
4  30.0


Voilà!  Follow the same logic for higher order differencing.  
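For comparison, the sketch below computes the lag-2 difference used above alongside the result of applying first order differencing twice, which is what many references mean by second order differencing.  The two are not the same calculation.  Variable names are illustrative.

```python
import pandas as pd

my_series = pd.DataFrame([10, 20, 15, 30, 45])

# Lag-2 difference: y[t] - y[t-2], the same result as using shift(periods=2)
lag2_diff = my_series.diff(periods=2)

# First difference applied twice: (y[t] - y[t-1]) - (y[t-1] - y[t-2])
twice_diff = my_series.diff().diff()

print(lag2_diff[0].tolist())   # [nan, nan, 5.0, 10.0, 30.0]
print(twice_diff[0].tolist())  # [nan, nan, -15.0, 20.0, 0.0]
```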
<br>
Now let's apply first order differencing to the airline passengers data.

In [27]:
passengers_lag1 = passengers.shift(periods=1)
passengers_first_diff = passengers - passengers_lag1
print(passengers_first_diff.head())

            Passengers
Month                 
1949-01-01         NaN
1949-02-01         6.0
1949-03-01        14.0
1949-04-01        -3.0
1949-05-01        -8.0


We expected there to be no data at ROW INDEX 0.  However, this missing value will cause problems in our analysis, so let's use pandas' **.dropna( )** to drop the NaN.  Pass in *inplace=True* to perform the drop on the *passengers_first_diff* DataFrame itself and not a copy of it.

In [29]:
passengers_first_diff.dropna(inplace=True)
print(passengers_first_diff.head())

            Passengers
Month                 
1949-02-01         6.0
1949-03-01        14.0
1949-04-01        -3.0
1949-05-01        -8.0
1949-06-01        14.0
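As an alternative to *inplace=True*, **.dropna( )** can return a new DataFrame that you assign to a name.  The sketch below rebuilds the first five passenger counts printed earlier by hand (so it runs without a download) and chains pandas' built-in **.diff( )** method, which here matches subtracting the shifted series, with **.dropna( )**.  The variable names are illustrative.

```python
import pandas as pd

# First five monthly values from the passengers data printed earlier
months = pd.date_range('1949-01-01', periods=5, freq='MS')
sample = pd.DataFrame({'Passengers': [112, 118, 132, 129, 121]}, index=months)

# diff() leaves NaN in the first row; dropna() returns a copy without it
sample_first_diff = sample.diff().dropna()

print(sample_first_diff.Passengers.tolist())  # [6.0, 14.0, -3.0, -8.0]
print(len(sample))                            # 5, the original is unchanged
```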


Great!  After applying first order differencing, we hope the resulting time series is now stationary.  Let's use Dickey-Fuller on the differenced time series to check stationarity.

In [32]:
dftest2 = adfuller(passengers_first_diff.Passengers)
pvalue2 = pd.Series(dftest2[1], index=['p-value'])

print(pvalue2)

p-value    0.054213
dtype: float64


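The p-value here is 0.054.  As before, this is not the probability that the time series is non-stationary; it measures how surprising the test result would be if the series were non-stationary.  If you decided, for example, on a confidence level of 90%, then alpha would be 0.10.  In this case, since the p-value = 0.054 is LESS THAN alpha = 0.10, you can reject the null hypothesis of non-stationarity and treat the differenced time series as stationary at the 10% significance level.  Note that the result is borderline: at alpha = 0.05 the p-value would NOT be less than alpha, and you would fail to reject the null hypothesis.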

This is great because the differenced time series can now be modeled with methods that assume stationarity, such as regression models or ARMA-type models.  (An ARIMA model can also perform the differencing step itself.) 

### Exercise 

Recall the time series dataset containing the monthly number of shampoo sales over a 3-year period for a UK-based online store.

In [33]:
sales = pd.read_csv('https://mathatwork.org/DATA/sales-shampoo.csv')
print(sales.head())

     Month  Sales
0  2015-01  266.0
1  2015-02  145.9
2  2015-03  183.1
3  2015-04  119.3
4  2015-05  180.3


**1)** Prepare the time series data for analysis by
- converting the *Month* column to data type datetime
- setting the *Month* column as the index of the time series

In [34]:
sales.Month = pd.to_datetime(sales.Month)
sales.index = sales.Month
sales = sales.drop('Month', axis=1)
print(sales.head())

            Sales
Month            
2015-01-01  266.0
2015-02-01  145.9
2015-03-01  183.1
2015-04-01  119.3
2015-05-01  180.3


**2)** Suppose decomposition previously revealed a clear presence of both trend and seasonality in the data, providing evidence that the data is probably non-stationary.  Apply Dickey-Fuller to test this hypothesis.  In the cell below your analysis, explain your interpretation of the *p-value* at an 85% confidence level.

In [35]:
dftest3 = adfuller(sales.Sales)
pvalue3 = pd.Series(dftest3[1], index=['p-value'])

print(pvalue3)

p-value    1.0
dtype: float64


**3)** Apply first order differencing to the sales data.

In [37]:
sales_lag1 = sales.shift(periods=1)
sales_first_diff = sales - sales_lag1
sales_first_diff.dropna(inplace=True)

print(sales_first_diff.head())

            Sales
Month            
2015-02-01 -120.1
2015-03-01   37.2
2015-04-01  -63.8
2015-05-01   61.0
2015-06-01  -11.8


**4)** Apply the Dickey-Fuller test to recheck stationarity for the differenced sales data.  In the cell below your analysis, explain your interpretation of the p-value at a 95% confidence level.

In [38]:
dftest4 = adfuller(sales_first_diff.Sales)
pvalue4 = pd.Series(dftest4[1], index=['p-value'])

print(pvalue4)

p-value    1.799857e-10
dtype: float64
