# <font color='#eb3483'> Predicting Time Series </font>

We have seen ways to decompose and model time series. The next step, of course, is to **forecast the future**.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
sns.set(rc={'figure.figsize':(16,5)})

import warnings
warnings.filterwarnings("ignore")

In [None]:
from random import gauss
from random import seed
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.seasonal import seasonal_decompose

In [None]:
from utils import load_airlines_series
from utils import load_electricity_consumption_series
from utils import load_shampoo_series

## <font color='#eb3483'>White Noise vs. Signal  </font>

We are going to load one dataset and generate another:

In [None]:
mysterious_data_1 = load_electricity_consumption_series()['consumption']
mysterious_data_2 = pd.Series([gauss(0.0, 1.0) for i in range(1000)])
mysterious_data_2.name = 'White noise'
mysterious_data_2.index = pd.date_range(pd.Timestamp(year=1971, month=12, day=25, hour=12), periods=1000, freq='3D')

Timeseries are sneaky things. Sometimes noise looks like data, and data looks like noise. 

A frequent problem with timeseries is answering the open, yet crucial question: _"is there any kind of signal here?"_ 

In [None]:
mysterious_data_1.plot();

In [None]:
mysterious_data_2.plot();

This already gives us a visual clue that we seem to have more structure in the first one.

We can continue to search for structure by decomposing the signal, as you've learned in the previous notebook:

In [None]:
decomposition = seasonal_decompose(mysterious_data_1, model='additive')
decomposition.plot();

Let's see the 2nd dataset

In [None]:
decomposition = seasonal_decompose(mysterious_data_2, model='additive')
decomposition.plot();

Well... `mysterious_data_2` seems more mysterious than `mysterious_data_1`, but this is all very subjective and annoying. Maybe there is a more objective way to look for structure?
<hr>

## <font color='#eb3483'> Similarity with the past </font>

We usually create models that forecast based on previous observations of the series itself (endogenous observations). Such models are called **autoregressive**: the current prediction is a function of the values in previous periods.

For that kind of modelling, we need to choose how many steps in the past (i.e. lags) our model will consider to produce a prediction. In order to choose how many lags we will use, we have several tools at our disposal. In this section, we will present two of them: lags scatter plot and autocorrelation, both of them visual approaches.
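
To make "the current value depends on the previous periods" concrete, here is a minimal sketch that simulates an autoregressive process with a single lag. The coefficient `phi`, the series length and the seed are made-up values chosen for illustration; they are not taken from any of the datasets in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

phi = 0.8      # hypothetical AR(1) coefficient, chosen for illustration
n_obs = 2000

# y[t] = phi * y[t-1] + noise[t]: each value depends on the previous one
noise = rng.normal(0.0, 1.0, n_obs)
values = np.zeros(n_obs)
for t in range(1, n_obs):
    values[t] = phi * values[t - 1] + noise[t]

ar1_series = pd.Series(values, name='simulated AR(1)')

# The correlation with the series lagged by one step should land close to phi
lag_1_corr = ar1_series.corr(ar1_series.shift(1))
print(round(lag_1_corr, 2))
```

With real data we do not know how many lags matter, which is exactly what the tools below help us decide.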

### <font color='#eb3483'> 1. Lags Scatterplot </font>
A very simple way to visually assess the temporal relationships within a time series is a lag scatter plot: draw a scatter plot of $y(t)$ against $y(t-\text{lag})$.

In [None]:
from pandas.plotting import lag_plot

In [None]:
mysterious_data_1.head()

In [None]:
lag_plot(mysterious_data_1);

In [None]:
lag_plot(mysterious_data_1, lag=6);

In [None]:
lag_plot(mysterious_data_1, lag=12);

We can see there is some correlation for a lag of 12 months, which makes sense, since annual cycles are the most common ones.

### <font color='#eb3483'> 2. ACF </font>

ACF stands for Auto Correlation Function.


The idea, in plain English, is very simple: "how correlated is each datapoint, to the datapoints lagged x periods?"

###  <font color='#eb3483'>  2.1 ACF by hand </font>

In [None]:
mysterious_data_1.head()

If we lag the dataset by 1 period:

In [None]:
mystery_lag_1 = mysterious_data_1.shift(1)
mystery_lag_1.head()

So... the data that was at 1971-01-01 is now at 1971-02-01, and so on and so forth. Fancy.

How correlated are `mysterious_data_1` and `mystery_lag_1`?

In [None]:
mysterious_data_1.corr(mysterious_data_1.shift(1))

Ok... how about if we lag it two times? 

In [None]:
mysterious_data_1.corr(mysterious_data_1.shift(2))

Negatively correlated. Let's get a bunch of these, for different values: 

In [None]:
corrs = {}
for lag in range(40):
    corrs[lag] = mysterious_data_1.corr(mysterious_data_1.shift(lag))
    
pd.Series(corrs).head()

In [None]:
pd.Series(corrs).plot(kind='bar')
sns.mpl.pyplot.xlabel('Lag')
sns.mpl.pyplot.ylabel('Correlation between original and lagged series');

Wooow! We can see the structure! Every 12 months (year!) we get really high correlation, and maybe even some yearly seasons here. Cool bananas.
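
The loop above computes a plain Pearson correlation on the overlapping part of the two series. The textbook sample autocorrelation is slightly different: it centers every lag with the mean of the whole series and divides by the variance of the whole series. Below is a minimal sketch of that formula on a made-up series (a 12-step cycle plus noise), next to the shift-and-correlate approach. The function name `sample_acf` and the toy data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def sample_acf(series, max_lag):
    """Textbook sample autocorrelation for lags 0..max_lag."""
    values = np.asarray(series, dtype=float)
    centered = values - values.mean()
    denominator = np.sum(centered ** 2)
    return np.array([
        np.sum(centered[lag:] * centered[:len(centered) - lag]) / denominator
        for lag in range(max_lag + 1)
    ])

# Made-up monthly-style series: a 12-step cycle plus noise
rng = np.random.default_rng(1)
t = np.arange(240)
toy_series = pd.Series(np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.3, len(t)))

textbook = sample_acf(toy_series, max_lag=24)
by_shift = np.array([toy_series.corr(toy_series.shift(lag)) for lag in range(25)])

# Lags 0, 6 and 12: both approaches should tell the same story
print(np.round(textbook[[0, 6, 12]], 2))
print(np.round(by_shift[[0, 6, 12]], 2))
```

The two sets of numbers are close but not identical, and the gap grows with the lag because the textbook version always divides by the full-series variance.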

### <font color='#eb3483'> 2.2 ACF with statsmodels </font>

ACF is so useful that `statsmodels` actually comes with functions to calculate and draw it. It also gives you something super useful: pre-calculated confidence intervals, to get an idea of how significant the auto-correlation is:

In [None]:
from statsmodels.tsa.stattools import acf
from statsmodels.graphics.tsaplots import plot_acf

In [None]:
acf(mysterious_data_1)

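Which is what we did manually (the numbers differ slightly, because `acf` normalizes every lag by the mean and variance of the whole series). We can also get a fancy plot: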

In [None]:
plot_acf(mysterious_data_1, alpha=.05)
sns.mpl.pyplot.xlabel('lag')
sns.mpl.pyplot.ylabel('Autocorrelation');

The shaded area in the chart above is the confidence bound. By passing the parameter alpha=0.05 we told the plot to give us the 95% confidence interval.

(Remember! If you're 95% confident, then you're going to be wrong once every 20 times, so take the confidence interval with that pinch of salt.)

By looking at this plot we can tell that there is some clear seasonal behavior, and that it is quite significant around the 12 mark.

**What happens if there is no structure?**

So, we said this helped find structure. What happens if we apply this to the sneaky-looking `mysterious_data_2`?

In [None]:
plot_acf(mysterious_data_2, alpha=.05)
sns.mpl.pyplot.xlabel('lag')
sns.mpl.pyplot.ylabel('Autocorrelation');

That is how you know that noise is noise :)

### <font color='#eb3483'> 3. PACF </font>

[PACF](https://stats.stackexchange.com/questions/129052/acf-and-pacf-formula/129374#129374) stands for Partial Auto Correlation Function.

To illustrate the point of this, I'm going to plot the old ACF for `mysterious_data_1`: 

In [None]:
plot_acf(mysterious_data_1, alpha=.05)
sns.mpl.pyplot.xlabel('lag')
sns.mpl.pyplot.ylabel('Autocorrelation');

Isn't there something annoying about this plot? Lag 12 has a high auto-correlation. But so does lag 24. And lag 36.  So we're kind of "recycling" auto-correlation from the previous year. 

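Walk into a room and tell your boss the following, and you might rightfully be laughed at:
> _"I've found something! The pattern seems to happen every week, and every two weeks, and every 3 weeks, and every 4 weeks!"_

PACF answers a narrower question: how correlated is each datapoint with the one `x` periods back, once the influence of all the shorter lags has been removed?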



In [None]:
from statsmodels.tsa.stattools import pacf
from statsmodels.graphics.tsaplots import plot_pacf

In [None]:
plot_pacf(mysterious_data_1, alpha=.05, method='ywmle')
sns.mpl.pyplot.xlabel('lag')
sns.mpl.pyplot.ylabel('Partial autocorrelation');

Ah, much better. Easier to isolate that 12. And, by the way, why the hell are we looking at 100 months...? 

In [None]:
plot_pacf(mysterious_data_1, alpha=.05, lags=40, method='ywmle')
sns.mpl.pyplot.xlabel('lag')
sns.mpl.pyplot.ylabel('Partial autocorrelation');
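
For the curious, one common way to define the partial autocorrelation at lag `k` is as the coefficient on $y(t-k)$ when you regress $y(t)$ on all of $y(t-1), \dots, y(t-k)$. The sketch below implements that definition with ordinary least squares on a simulated AR(1) series. The helper name `pacf_by_regression`, the coefficient `phi` and the seed are assumptions for illustration, and the plots above use a Yule-Walker based method, so their numbers would differ slightly from a regression-based estimate.

```python
import numpy as np

def pacf_by_regression(values, max_lag):
    """PACF at lag k = coefficient on y[t-k] when regressing y[t] on y[t-1..t-k]."""
    values = np.asarray(values, dtype=float)
    values = values - values.mean()
    result = [1.0]  # lag 0 is 1 by convention
    for k in range(1, max_lag + 1):
        target = values[k:]
        # Column j-1 holds the series lagged by j steps, for j = 1..k
        lagged = np.column_stack(
            [values[k - j:len(values) - j] for j in range(1, k + 1)]
        )
        coefficients, *_ = np.linalg.lstsq(lagged, target, rcond=None)
        result.append(coefficients[-1])  # keep only the coefficient on y[t-k]
    return np.array(result)

# Simulated AR(1): only lag 1 has a direct effect
rng = np.random.default_rng(3)
phi = 0.7  # hypothetical coefficient
noise = rng.normal(0, 1, 2000)
ar1 = np.zeros(2000)
for t in range(1, 2000):
    ar1[t] = phi * ar1[t - 1] + noise[t]

partial = pacf_by_regression(ar1, max_lag=5)
print(np.round(partial, 2))
```

The ACF of this series would decay slowly over many lags, while the PACF should be close to `phi` at lag 1 and close to zero afterwards: the "recycled" correlation is gone.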

<hr>
<br>

##  <font color='#eb3483'> In search of stationarity </font>

What is stationarity? 

> **_"A stationary time series is one whose statistical properties such as mean, variance, autocorrelation, etc. are all constant over time."_**

Stationarity is a holy grail in timeseries, especially if you are using the type of models we will show you next. If a process is stationary, then you can make cool predictions. We can (and will) transform our timeseries until they are stationary processes.
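
A quick, informal way to probe the "constant over time" part of that definition is to watch a rolling statistic. The sketch below compares how much the rolling mean wanders for a made-up stationary series (white noise) and a made-up trending one. The window size, the slope and the helper name `rolling_mean_range` are arbitrary choices for illustration, and this is an eyeball check, not a statistical test.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n_obs = 400

# White noise: mean and variance do not change over time
stationary_toy = pd.Series(rng.normal(0, 1, n_obs))
# Same kind of noise on top of a linear trend: the mean keeps moving
trending_toy = pd.Series(0.05 * np.arange(n_obs) + rng.normal(0, 1, n_obs))

def rolling_mean_range(series, window=50):
    """How far the rolling mean wanders: max minus min over all windows."""
    rolling_mean = series.rolling(window).mean().dropna()
    return rolling_mean.max() - rolling_mean.min()

print(round(rolling_mean_range(stationary_toy), 2))
print(round(rolling_mean_range(trending_toy), 2))
```

The same idea works with `.rolling(window).std()` to check whether the variance is drifting.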

First, let's look at a clearly non-stationary timeseries: a dataset containing tractor sales by month.

In [None]:
sales = pd.read_csv('data/Tractor-Sales.csv')['Number of Tractor Sold']
sales.head()

In [None]:
sales.plot();

**Is the mean constant over time?**
Nope.

**Is the variance constant over time?**
Oh hell no.

**Is the auto-correlation constant over time?**
Probably not...

Let's beat this timeseries into submission until it becomes stationary! 

###  <font color='#eb3483'> Removing trend </font>

The first thing we can clearly see in this timeseries is that it has a trend. A trivial way to remove the trend is to take the lag 1, and subtract it. In other words, instead of using the series, we will use the difference between consecutive observations.

Difference, you say? difference... Ah! [Diff!](http://pandas.pydata.org/pandas-docs/version/0.17/generated/pandas.Series.diff.html)

In [None]:
sales_diff = sales.diff(periods=1)

In [None]:
sales.head(10)

In [None]:
sales_diff.head(10)

In [None]:
sales_diff.plot();

It looks more stable, but it's not stationary yet, because the variance is growing with time.

We can take care of that by applying a log transform to the data. 

(**Note**: *we will do the log transform first, and then the diff. The reason is that the diff produces zeros and negative values, for which the log is undefined.*

*So please notice: this is being performed on the original dataset, not on `sales_diff`.*)

In [None]:
sales_logged = sales.map(np.log)

What does this look like? 

In [None]:
sales.plot(label='Original', legend=True, ls=':')
sales_logged.plot(figsize=(16, 5), secondary_y=True)
sns.mpl.pyplot.title('Logged sales (original sales dotted)');

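Notice that the variance in the original timeseries kept growing, but our logged timeseries has roughly constant variance! This works because the swings grow in proportion to the level of the series, and the log turns that multiplicative growth into an additive one.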
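
Here is a minimal sketch of why the log helps, on a made-up series with no randomness at all: exponential growth multiplied by a 12-step seasonal factor. We compare the spread of the differenced series in the second half against the first half, with and without the log. The growth rate, the seasonal amplitude and the helper name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

t = np.arange(120)
# Made-up series: exponential growth times a 12-step seasonal factor
growing_toy = pd.Series(np.exp(0.02 * t) * (1 + 0.2 * np.sin(2 * np.pi * t / 12)))

def late_to_early_std_ratio(series):
    """Std of the differenced series in its second half divided by its first half."""
    diffed = series.diff().dropna()
    half = len(diffed) // 2
    return diffed.iloc[half:].std() / diffed.iloc[:half].std()

raw_ratio = late_to_early_std_ratio(growing_toy)
logged_ratio = late_to_early_std_ratio(np.log(growing_toy))

print(round(raw_ratio, 2))     # well above 1: the swings keep growing
print(round(logged_ratio, 2))  # close to 1: the swings stay the same size
```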

Now we can diff it:

In [None]:
sales_logged_diff = sales_logged.diff()

In [None]:
sales_logged_diff.plot();

Hmm... mean looks constant over time, variance seems constant over time... looks like we managed! 

This was clearly an easy timeseries to make stationary. Most timeseries require a lot more work. There are approaches such as removing a moving average that are more powerful, and tend to work quite well. 

However, remember this: whatever transformation you do to your timeseries in your attempt to make it stationary should be one you can reverse later. If your boss asks you 
> _"How many tractors are we going to sell in 3 weeks?"_


Answering this won't get you far: 
> _"On that week, I predict a logged diff of -0.23."_

So whenever you are transforming your timeseries, you need to keep these transformations reversible, and the more complex the transformation, the more complex it is to invert.
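
For the log-then-diff transform used above, the inverse is a cumulative sum followed by an exponential. The sketch below shows the round trip on a handful of made-up sales figures (they are not the tractor data), and how a forecast expressed as a logged diff could be turned back into units sold. The one thing to remember is that the diff throws away the starting level, so you have to keep the first logged value around.

```python
import numpy as np
import pandas as pd

# Made-up sales figures, purely for illustration
toy_sales = pd.Series([141.0, 157.0, 185.0, 199.0, 203.0, 189.0, 207.0])

# Forward transform: log, then first difference (the first element becomes NaN)
toy_logged = np.log(toy_sales)
toy_logged_diff = toy_logged.diff()

# Inverse transform: the cumulative sum undoes the diff only up to a constant,
# so the first logged value is added back before taking the exponential
restored_logged = toy_logged_diff.fillna(0).cumsum() + toy_logged.iloc[0]
restored_sales = np.exp(restored_logged)

print(np.allclose(restored_sales, toy_sales))

# Turning a one-step-ahead forecast of the logged diff back into sales
predicted_logged_diff = -0.23  # the number from the quote above
next_sales = toy_sales.iloc[-1] * np.exp(predicted_logged_diff)
print(round(next_sales, 1))
```

Each extra diff you apply adds one more starting value to remember and one more cumulative sum to undo.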

###  <font color='#eb3483'> Evaluating stationarity </font>

Is it stationary? It... "looks stationary". 

A more robust way to assess stationarity is the **[Augmented Dickey-Fuller test](http://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.adfuller.html#statsmodels.tsa.stattools.adfuller)**, which statsmodels exposes as `adfuller`:

In [None]:
from statsmodels.tsa.stattools import adfuller

The Dickey-Fuller test is a statistical test where the null hypothesis is that the [Unit Root](https://en.wikipedia.org/wiki/Unit_root) is present. The details of this are interesting yet out of scope, but suffice it to say for our purposes that if the unit root isn't present, the timeseries can be assumed to be stationary. 

So, for the time being, _"Unit Root is bad"_, and _"no Unit Root is good"_. If the p-value is above the significance level we choose (0.05 is the usual choice), then we cannot reject that there is a unit root.

So... we want a low p-value. Got that? Great.

In [None]:
adfuller(sales_logged_diff)

It seems our series has missing data. That is because when you diff, the first n elements (with n being the lag) are null.

In [None]:
sales_logged_diff.head(3)

We compute it again without the NaNs:

In [None]:
adfstat, pvalue, usedlag, nobs, critvalues, icbest = adfuller(sales_logged_diff.dropna())

Ok, I know, so many returns. It's worth taking a look at [the documentation](http://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.adfuller.html#statsmodels.tsa.stattools.adfuller), but for our purposes, we're going to use it in a ridiculously simple way: 

In [None]:
print('Statistic: %0.02f' % adfstat)
print('pvalue:    %0.03f' % pvalue)

The p-value is low, but not as low as we'd like it to be. We can't reject the unit root at the 5% significance level (95% confidence), which is where we like to have these things. So we cannot yet treat the timeseries as stationary.

How about we diff one more time?

In [None]:
sales_logged_and_diffed_twice = sales_logged_diff.diff().dropna() # taking a second diff 
adfstat, pvalue, usedlag, nobs, critvalues, icbest = adfuller(sales_logged_and_diffed_twice)
print('Statistic: %0.02f' % adfstat)
print('pvalue:    %0.03f' % pvalue)

In [None]:
sales_logged_and_diffed_twice.plot();