<a id="0"></a> <br>
## Kernel Headlines
1. [Introduction and RoadMap](#0)
    1. [What makes Time Series Special](#1)
	2. [Be Careful about Our Approach](#2)
	3. [Imports and Loading Data](#3)
	4. [Simple Functionalities of Time Series](#4)
	
2. [Stationary in Time Series](#5)
    1. [What Does Stationary Mean in Time Series](#6)
	2. [Stationary Parameters](#7)
	3. [Dickey Fuller Test](#8)
	4. [Make a Time Series Stationary](#9)
	5. [Estimating and Eliminating Trend](#10)
	6. [Moving Average](#11)
	7. [Weighted Moving Average](#12)
	8. [Eliminating Trend and Seasonality](#13)
	    1. [Differencing](#14)
		2. [Decomposition](#15)
    9. [Forecasting TimeSeries](#16)
       1. [Introduction To ARIMA](#17)
	   2. [Autocorrelation Function](#18)
	   3. [Partial Autocorrelation Function(PACF)](#19)
	   4. [AR model](#20)
	   5. [MA model](#21)
	   6. [Combined Model](#22)


3. [References](#100)

<a id="0"></a> <br>
#  1-Introduction and RoadMap

Our journey goes through the following steps:
 * What makes Time Series Special?
 * Loading and Handling Time Series in Pandas
 * How to Check Stationarity of a Time Series?
 * How to make a Time Series Stationary?
 * Forecasting a Time Series

I have used [this tutorial](https://www.analyticsvidhya.com/blog/2016/02/time-series-forecasting-codes-python/) as the main idea for this kernel. You can find more details there if you are motivated.

<a id="1"></a> <br>
**A. WHAT MAKES TIME SERIES SPECIAL**

As the name suggests, a time series (TS) is a collection of data points collected at constant time intervals. These are analyzed to determine the long-term trend so as to forecast the future or perform some other form of analysis. But what makes a TS different from, say, a regular regression problem? There are two things:

1. It is time dependent, so the basic assumption of a linear regression model that the observations are independent doesn't hold.
2. Along with an increasing or decreasing trend, most TS have some form of seasonality, i.e. variations specific to a particular time frame. For example, if you look at the sales of a woolen jacket over time, you will invariably find higher sales in winter.

<a id="2"></a> <br>
**B. BE CAREFUL ABOUT OUR APPROACH**

Please note that the aim of this kernel is to familiarize you with the various techniques used for TS in general. Our main focus is analyzing the problem with time series approaches, so the results may be a little ambiguous. Don't hesitate to share your ideas in the comments. Any hint will be appreciated.

<a id="3"></a> <br>
**C. IMPORTS AND LOADING DATA**

In [None]:
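import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams
from datetime import datetime

# Render plots inside the notebook
%matplotlib inline
# Default figure size for all plots
rcParams['figure.figsize'] = 15, 6

# Kaggle-only module for the Two Sigma news competition
from kaggle.competitions import twosigmanews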

In [None]:
env = twosigmanews.make_env()

In [None]:
market_data = env.get_training_data()[0]

<a id="4"></a> <br>
**D. SIMPLE FUNCTIONALITIES OF TIME SERIES**

In [None]:
fig,axes = plt.subplots(1,1,figsize=(15,10))
axes.set_title("Time Distro")
axes.set_ylabel("# of records")
axes.set_xlabel("date")
# Number of records per calendar day, in chronological order
daily_counts = market_data.time.dt.date.value_counts().sort_index()
axes.plot(daily_counts.index, daily_counts.values)

In [None]:
market_data.head()

In [None]:
time_series_df = market_data[["time"]].groupby(by=["time"]).size()

In [None]:
time_series_df.index

As you can see, the index type is 'Index'. Let's convert it to a DatetimeIndex so that we can use the time series facilities of pandas.

In [None]:
time_series_df.index = pd.to_datetime(time_series_df.index)

In [None]:
time_series_df.index

Now its type is 'DatetimeIndex'. Let's move on ;-)

You can now easily access values by date.

In [None]:
time_series_df["2007-02-01"]

It is also possible to do a range query.

In [None]:
time_series_df["2007-02-01" : "2007-02-10"]

Or query a whole month with a partial date string.

In [None]:
time_series_df["2007-02"]
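The three kinds of lookups above can be tried without the competition data. Here is a minimal sketch on a made-up daily series (the `demo` series and its values are assumptions for illustration, not the kernel's data). It uses `.loc`, which behaves like the bracket lookups above for string dates:

```python
import numpy as np
import pandas as pd

# Hypothetical daily series standing in for time_series_df
idx = pd.date_range("2007-02-01", "2007-03-31", freq="D")
demo = pd.Series(np.arange(len(idx)), index=idx)

one_day = demo.loc["2007-02-01"]                    # a single date
date_range = demo.loc["2007-02-01":"2007-02-10"]    # both end points are included
whole_month = demo.loc["2007-02"]                   # partial string selects all of February

print(one_day, len(date_range), len(whole_month))
```

Note that, unlike ordinary Python slices, the string range includes its right end point.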

<a id="5"></a> <br>
#  2-Stationary in Time Series


<a id="6"></a> <br>
**A. WHAT DOES STATIONARY MEAN IN TIME SERIES**

A TS is said to be stationary if its statistical properties, such as mean and variance, remain constant over time. **If the data is stationary, we can expect the future behavior of users to be similar to the past records.** Stationary data is generally easier to analyze than non-stationary data.

Go back to the time plot from the previous section. Is the data stationary?

<a id="7"></a> <br>
**B. STATIONARY PARAMETERS**

Stationarity is defined using very strict criteria. However, for practical purposes we can assume a series is stationary if it has constant statistical properties over time, i.e. the following:

 1. A constant mean
 2. A constant variance
 3. An autocovariance that does not depend on time


**As mentioned before, the main purpose of this kernel is learning, so we will use a part of the data that is well suited to time series approaches. The 2007 to 2009 range shows periodic behavior that can be analyzed with these approaches.**

In [None]:
fig,axes = plt.subplots(1,1,figsize=(15,10))
axes.set_title("Time Distro")
axes.set_ylabel("# of records")
axes.set_xlabel("date")
axes.plot(time_series_df["2007" : "2008-9"])

Ignore the missing values at the end of 2015. An overall increasing trend can be observed.
However, it might not always be possible to make such visual inferences (we'll see such cases later). So, more formally, we can check stationarity using the following:

**Plotting rolling statistics:** We can plot the moving average or moving variance and see if it varies with time. By moving average/variance I mean that at any instant 't', we take the average/variance over a trailing window. The original tutorial uses the last year, i.e. the last 12 months; the code in this kernel uses a window of 120 observations. But again, this is more of a visual technique.
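To get a feeling for what rolling statistics reveal, here is a small sketch on synthetic data. The two series, the seed and the trend slope are assumptions made up for illustration. The rolling mean of pure noise stays in a narrow band, while adding a trend makes it drift:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)          # fixed seed so the sketch is reproducible
n = 500
noise = pd.Series(rng.normal(size=n))   # roughly stationary series
trending = noise + 0.05 * np.arange(n)  # same noise plus a linear trend

window = 120  # assumed window length, same value as used in this kernel
noise_mean = noise.rolling(window=window).mean().dropna()
trend_mean = trending.rolling(window=window).mean().dropna()

# Spread of the rolling mean: small for the stationary series, large for the trending one
noise_spread = noise_mean.max() - noise_mean.min()
trend_spread = trend_mean.max() - trend_mean.min()
print(noise_spread, trend_spread)
```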

<a id="8"></a> <br>
**C. DICKEY-FULLER TEST**

This is one of the statistical tests for checking stationarity. Here the null hypothesis is that the TS is non-stationary. The test results consist of a test statistic and critical values for different confidence levels. If the test statistic is less than the critical value, we can reject the null hypothesis and say that the series is stationary.
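The decision rule itself is just a signed comparison. As a rough sketch, with a hypothetical helper `rejected_levels` and approximate large-sample critical values that I assume here only for illustration (the real ones are returned by the test for your data):

```python
def rejected_levels(test_statistic, critical_values):
    """Return the levels at which the non-stationarity null hypothesis is rejected."""
    # Signed comparison: the statistic must be more negative than the critical value
    return [level for level, crit in critical_values.items() if test_statistic < crit]

# Approximate critical values, assumed for illustration only
critical_values = {"1%": -3.43, "5%": -2.86, "10%": -2.57}

print(rejected_levels(-3.10, critical_values))  # hypothetical test statistic
print(rejected_levels(-1.20, critical_values))  # hypothetical test statistic
```

An empty list means the series cannot be called stationary at any of the listed levels.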

In [None]:
from statsmodels.tsa.stattools import adfuller

def test_stationarity(timeseries, window=120):
    """Plot rolling statistics and print the Dickey-Fuller test results."""
    # Rolling statistics over the trailing `window` observations
    # (pd.rolling_mean / pd.rolling_std were removed from pandas; use .rolling() instead)
    rolmean = timeseries.rolling(window=window).mean()
    rolstd = timeseries.rolling(window=window).std()

    # Plot the original series with its rolling mean and standard deviation
    plt.figure(figsize=(20,10))
    plt.plot(timeseries, color='blue', label='Original')
    plt.plot(rolmean, color='red', label='Rolling Mean')
    plt.plot(rolstd, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)

    # Perform the Dickey-Fuller test; autolag='AIC' selects the lag length by AIC
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC')
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print(dfoutput)

As you can see in the code above, we plot the original data together with its rolling mean and standard deviation in one figure.
Now, let's check whether our data is stationary.

In [None]:
time_series_df = time_series_df["2007" : "2008-9"]
test_stationarity(time_series_df)

Though the variation in standard deviation is small, the mean is clearly increasing with time, so this is not a stationary series. Also, the test statistic is well above the critical values. Note that the signed values should be compared, not the absolute values.

Next, we’ll discuss the techniques that can be used to take this TS towards stationarity.

<a id="9"></a> <br>
**D. MAKE A TIME SERIES STATIONARY**

Though the stationarity assumption is made in many TS models, almost no practical time series is stationary. So statisticians have figured out ways to make series stationary, which we'll discuss now. Actually, it's almost impossible to make a series perfectly stationary, but we try to get as close as possible.

Let's understand what makes a TS non-stationary. There are two major reasons behind the non-stationarity of a TS:
1. Trend – a mean that varies over time. For example, in this case we saw that, on average, the number of records was growing over time.
2. Seasonality – variations at specific time frames. For example, people might have a tendency to buy cars in a particular month because of pay increments or festivals.

The underlying principle is to model or estimate the trend and seasonality in the series and remove those from the series to get a stationary series. Then statistical forecasting techniques can be implemented on this series. The final step would be to convert the forecasted values into the original scale by applying trend and seasonality constraints back.

Note: I’ll be discussing a number of methods. Some might work well in this case and others might not. But the idea is to get a hang of all the methods and not focus on just the problem at hand.

Let’s start by working on the trend part.

<a id="10"></a> <br>
**E. ESTIMATING AND ELIMINATING TREND**

One of the first tricks to reduce trend is transformation. For example, in this case we can clearly see that there is a significant positive trend. So we can apply a transformation that penalizes higher values more than smaller values, such as taking a log, square root, or cube root. Let's take a log transform here for simplicity:

(ignore the interval where there is no data)

In [None]:
fig,axes = plt.subplots(1,1,figsize=(15,10))
axes.set_title("Time Distro")
axes.set_ylabel("LOG(# of records)")
axes.set_xlabel("date")
axes.plot(time_series_df)
ts_log = np.log(time_series_df)
axes.plot(ts_log)
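To see why the log helps, here is a small sketch on synthetic data. The starting value and the 5% growth rate are made up for illustration. On the raw scale the steps keep getting bigger, on the log scale they are all the same size:

```python
import numpy as np

t = np.arange(50)
y = 100.0 * 1.05 ** t           # synthetic series growing 5% per step

raw_steps = np.diff(y)          # step sizes keep growing on the raw scale
log_steps = np.diff(np.log(y))  # step sizes are constant on the log scale

print(raw_steps[0], raw_steps[-1])
print(log_steps[0], log_steps[-1])
```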

In this simpler case, it is easy to see a forward trend in the data. But it's not very intuitive in the presence of noise. So we can use some techniques to estimate or model this trend and then remove it from the series. There are many ways of doing this; some of the most commonly used are:

1. Aggregation – taking average for a time period like monthly/weekly averages
2. Smoothing – taking rolling averages
3. Polynomial Fitting – fit a regression model
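Polynomial fitting is not used in this kernel, so here is only a rough sketch of the idea on synthetic data. The series, the seed and the choice of `degree = 1` are assumptions for illustration. We fit a polynomial in time and subtract it:

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible
t = np.arange(200, dtype=float)
# Synthetic series: linear trend plus noise
series = 2.0 + 0.03 * t + rng.normal(scale=0.5, size=t.size)

degree = 1                                  # assumed polynomial degree
coeffs = np.polyfit(t, series, deg=degree)  # highest power first: [slope, intercept]
trend = np.polyval(coeffs, t)
detrended = series - trend                  # what is left after removing the fitted trend

print(coeffs)
```

A higher degree follows the data more closely, but it can also start to absorb variations that are not really trend.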

I will discuss smoothing here, and you should try the other techniques as well, which might work out for other problems. Smoothing refers to taking rolling estimates, i.e. considering the past few instances. There are various ways of doing this, but I will discuss two of them here.

<a id="11"></a> <br>
**F. Moving average**

In this approach, we take the average of 'k' consecutive values depending on the frequency of the time series. The original tutorial averages over the past year, i.e. the last 12 values; here we use a window of the last 6 values. Pandas has specific functions defined for determining rolling statistics.


In [None]:
fig,axes = plt.subplots(1,1,figsize=(15,10))
axes.set_title("Time Distro")
axes.set_ylabel("LOG(# of records)")
axes.set_xlabel("date")

moving_avg = ts_log.rolling(6).mean()
axes.plot(ts_log)
axes.plot(moving_avg, color='red')

The red line shows the rolling mean. Let's subtract this from the original series. Note that since we are taking the average of the last 6 values, the rolling mean is not defined for the first 5 values. This can be observed as:

In [None]:
ts_log_moving_avg_diff = ts_log - moving_avg
ts_log_moving_avg_diff.head(6)

Notice that the first 5 values are NaN. Let's drop these NaN values and check the plots to test stationarity.

In [None]:
ts_log_moving_avg_diff.dropna(inplace=True)
test_stationarity(ts_log_moving_avg_diff)

This looks like a much better series. The rolling values appear to be varying slightly but there is no specific trend. Also, the test statistic is smaller than the 1% critical values so we can say with 99% confidence that this is a stationary series.

<a id="12"></a> <br>
**G. WEIGHTED MOVING AVERAGE**

However, a drawback of this particular approach is that the time period has to be strictly defined. In this case we can take averages over 6 periods, but in complex situations like forecasting a stock price, it's difficult to come up with a number. So we take a 'weighted moving average' where more recent values are given a higher weight. There are many techniques for assigning weights. A popular one is the exponentially weighted moving average, where weights are assigned to all previous values with a decay factor. This can be implemented in Pandas as:

In [None]:
# Exponentially weighted moving average of the log series
# (pd.ewma was removed from pandas; use .ewm(...).mean() instead)
expwighted_avg = ts_log.ewm(halflife=6).mean()

fig,axes = plt.subplots(1,1,figsize=(15,10))
axes.set_title("Exponential Moving Average")
axes.set_ylabel("LOG(# of records)")
axes.set_xlabel("date")
axes.plot(ts_log)
axes.plot(expwighted_avg, color='red')

Note that here the parameter ‘halflife’ is used to define the amount of exponential decay. This is just an assumption here and would depend largely on the business domain. Other parameters like span and center of mass can also be used to define decay which are discussed in the link shared above. Now, let’s remove this from series and check stationarity:

In [None]:
ts_log_ewma_diff = ts_log - expwighted_avg
test_stationarity(ts_log_ewma_diff)

This TS has even smaller variations in the magnitude of its mean and standard deviation. Also, the test statistic is smaller than the 1% critical value, which is better than the previous case. Note that in this case there are no missing values, since all values from the start are given weights. So it works even without previous values.

The simple moving average from the previous section gives a test statistic of -3.80, while the exponential moving average gives -3.94, which is slightly better.
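If you wonder what `halflife` actually does to the weights, here is a sketch that recomputes the exponentially weighted mean by hand and compares it with pandas. The helper `ewma_manual` and the `demo` values are hypothetical, and I assume the pandas default `adjust=True`, where the weights are normalised:

```python
import numpy as np
import pandas as pd

def ewma_manual(values, halflife):
    """Exponentially weighted mean with normalised weights (pandas default adjust=True)."""
    alpha = 1.0 - np.exp(-np.log(2.0) / halflife)  # decay implied by the half-life
    out = np.empty(len(values))
    for t in range(len(values)):
        # Weight (1 - alpha)**i for the value i steps in the past
        weights = (1.0 - alpha) ** np.arange(t + 1)
        past = values[t::-1]                       # values[t], values[t-1], ..., values[0]
        out[t] = np.sum(weights * past) / np.sum(weights)
    return out

demo = pd.Series([1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0])  # hypothetical values
manual = ewma_manual(demo.to_numpy(), halflife=6)
builtin = demo.ewm(halflife=6).mean().to_numpy()
print(np.allclose(manual, builtin))
```

With a half-life of 6, a value that is 6 steps old gets half the weight of the most recent one.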

<a id="13"></a> <br>
**H. Eliminating Trend and Seasonality**

The simple trend reduction techniques discussed before don't work in all cases, particularly the ones with high seasonality. Let's discuss two ways of removing trend and seasonality:

1. Differencing – taking the difference with a particular time lag
2. Decomposition – modeling both trend and seasonality and removing them from the model.


<a id="14"></a> <br>
1. Differencing

One of the most common methods of dealing with both trend and seasonality is differencing. In this technique, we take the difference of the observation at a particular instant with that at the previous instant. This mostly works well in improving stationarity. First order differencing can be done in Pandas as:


In [None]:
fig, axes = plt.subplots(1,1,figsize=(20,10))
ts_log_diff = ts_log - ts_log.shift()
plt.plot(ts_log_diff)

This appears to have reduced the trend considerably. Let's verify this with the `test_stationarity` function defined earlier.

In [None]:
ts_log_diff.dropna(inplace=True)
test_stationarity(ts_log_diff)

Comparing this result with the previous ones, the moving average method performed better than first-order differencing. This comparison also suggests that the seasonality assumption does not hold for a shift of one period.

We can see that the mean and standard deviation vary only slightly with time. Also, the Dickey-Fuller test statistic is less than the 5% critical value, so the TS is stationary with 95% confidence. We can also take second or third order differences, which might give even better results in certain applications.
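For reference, here is a small sketch of the different kinds of differencing on a toy series (the `demo` values are made up). It also shows that `Series.diff()` is a shortcut for the `shift()` form used above:

```python
import pandas as pd

demo = pd.Series([3.0, 5.0, 8.0, 12.0, 17.0, 23.0])  # hypothetical values

first_diff = (demo - demo.shift()).dropna()  # the form used in this kernel
same_diff = demo.diff().dropna()             # pandas shortcut for the same thing
second_diff = demo.diff().diff().dropna()    # second order differencing
lag3_diff = demo.diff(periods=3).dropna()    # difference with the value 3 steps back

print(first_diff.tolist(), second_diff.tolist(), lag3_diff.tolist())
```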

Let's repeat the same procedure with a shift of 10 periods.

In [None]:
test_stationarity((ts_log - ts_log.shift(10)).dropna())

As you can see, with 10-period differencing the data is stationary with more than 99% confidence.

Now, let's move on to decomposition.

<a id="15"></a> <br>
2. Decomposition

In this approach, both trend and seasonality are modeled separately and the remaining part of the series is returned.

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose

# Seasonal period of 10 observations
# (this argument is called `period` in recent statsmodels; older releases named it `freq`)
decomposition = seasonal_decompose(ts_log, period=10)

trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

# One figure with four stacked panels
plt.figure(figsize=(15,20))
plt.subplot(411)
plt.plot(ts_log, label='Original')
plt.legend(loc='best')
plt.subplot(412)
plt.plot(trend, label='Trend')
plt.legend(loc='best')
plt.subplot(413)
plt.plot(seasonal,label='Seasonality')
plt.legend(loc='best')
plt.subplot(414)
plt.plot(residual, label='Residuals')
plt.legend(loc='best')
plt.tight_layout()

Here we can see that the trend, seasonality are separated out from data and we can model the residuals. Lets check stationarity of residuals:

In [None]:
ts_log_decompose = residual
ts_log_decompose.dropna(inplace=True)
test_stationarity(ts_log_decompose)

The Dickey-Fuller test statistic is significantly lower than the 1% critical value, so this TS is very close to stationary. You can try advanced decomposition techniques as well, which can generate better results. Also, note that converting the residuals back into original values for future data is not very intuitive in this case.
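To make the idea behind an additive decomposition less of a black box, here is a simplified sketch on synthetic data. The period of 7, the seasonal `pattern` and the trend slope are all assumptions for illustration. It is not what `seasonal_decompose` does internally in every detail (the library handles more cases, for example even periods), only the same three steps in their simplest form:

```python
import numpy as np
import pandas as pd

period = 7                                                  # assumed (odd) seasonal period
pattern = np.array([3.0, 1.0, -2.0, 0.0, -1.0, 2.0, -3.0])  # hypothetical, sums to zero
t = np.arange(10 * period)
series = pd.Series(0.1 * t + np.tile(pattern, 10))          # linear trend + repeating pattern

# 1) Trend: centred moving average over one full period
trend = series.rolling(window=period, center=True).mean()
# 2) Seasonality: average detrended value at each position in the cycle
detrended = series - trend
seasonal_means = detrended.groupby(t % period).mean()
seasonal_means -= seasonal_means.mean()                     # force the pattern to average to zero
seasonal = pd.Series(seasonal_means.to_numpy()[t % period], index=series.index)
# 3) Residual: what is left after removing both
residual = (series - trend - seasonal).dropna()

print(residual.abs().max())
```

Because the synthetic series has no noise, the residual is essentially zero here. On real data the residual is the part we test for stationarity.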

<a id="16"></a> <br>
**I. Forecasting TimeSeries**

We saw different techniques, and all of them worked reasonably well for making the TS stationary. Let's build a model on the TS after differencing, as it is a very popular technique. Also, it's relatively easy to add noise and seasonality back into the predicted residuals in this case. Having performed the trend and seasonality estimation techniques, there can be two situations:

1. A strictly stationary series with no dependence among the values. This is the easy case wherein we can model the residuals as white noise. But this is very rare.
2. A series with significant dependence among values. In this case we need to use some statistical models like ARIMA to forecast the data.

Let me give you a brief introduction to ARIMA. I won’t go into the technical details but you should understand these concepts in detail if you wish to apply them more effectively. ARIMA stands for Auto-Regressive Integrated Moving Averages. The ARIMA forecasting for a stationary time series is nothing but a linear (like a linear regression) equation. The predictors depend on the parameters (p,d,q) of the ARIMA model:

<a id="17"></a> <br>
1. INTRODUCTION TO ARIMA

Number of AR (Auto-Regressive) terms (p): AR terms are just lags of the dependent variable. For instance, if p is 5, the predictors for x(t) will be x(t-1), ..., x(t-5).

Number of MA (Moving Average) terms (q): MA terms are lagged forecast errors in the prediction equation. For instance, if q is 5, the predictors for x(t) will be e(t-1), ..., e(t-5), where e(i) is the difference between the moving average at the ith instant and the actual value.

Number of Differences (d): This is the number of nonseasonal differences, i.e. in this case we took the first order difference. So we can either pass that variable and put d=0, or pass the original variable and put d=1. Both will generate the same results.
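To make "AR terms are just lags of the dependent variable" concrete, here is a sketch that estimates an AR(p) model with plain least squares. The helper `fit_ar`, the simulated series and the coefficients in `true_phi` are assumptions for illustration. This is not how the ARIMA class fits its parameters, only the same prediction equation written out:

```python
import numpy as np

def fit_ar(x, p):
    """Least-squares estimate of c and phi_1..phi_p in x(t) = c + sum_i phi_i * x(t-i) + e(t)."""
    # Column i holds the series lagged by i+1 steps, aligned with the targets x[p:]
    lags = np.column_stack([x[p - i - 1 : len(x) - i - 1] for i in range(p)])
    design = np.column_stack([np.ones(len(x) - p), lags])
    coeffs, *_ = np.linalg.lstsq(design, x[p:], rcond=None)
    return coeffs[0], coeffs[1:]

rng = np.random.default_rng(42)   # fixed seed so the sketch is reproducible
true_phi = np.array([0.5, -0.3])  # hypothetical AR(2) coefficients
x = np.zeros(2000)
for t in range(2, len(x)):
    x[t] = true_phi[0] * x[t - 1] + true_phi[1] * x[t - 2] + rng.normal()

intercept, phi = fit_ar(x, p=2)
print(intercept, phi)
```

The estimated coefficients should land close to the ones used in the simulation, but not exactly on them because of the noise.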

An important concern here is how to determine the values of 'p' and 'q'. We use two plots to determine these numbers. Let's discuss them first.

<a id="18"></a> <br>
2. AUTOCORRELATION FUNCTION(ACF)

It is a measure of the correlation between the TS and a lagged version of itself. For instance, at lag 5, the ACF would compare the series at time instants 't1' ... 't2' with the series at instants 't1-5' ... 't2-5' (t1-5 and t2 being the end points).
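Written out by hand, the sample autocorrelation looks roughly like this. The helper `acf_manual` and the toy series are hypothetical, and I assume the common estimator that divides by the total sum of squares at every lag:

```python
import numpy as np

def acf_manual(x, nlags):
    """Sample autocorrelation at lags 0..nlags."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    # Lag k pairs each value with the value k steps earlier
    return np.array([np.sum(centered[k:] * centered[: len(x) - k]) / denom
                     for k in range(nlags + 1)])

alternating = np.array([1.0, -1.0] * 50)  # hypothetical series that flips sign every step
r = acf_manual(alternating, nlags=2)
print(r)
```

For a series that flips sign every step, lag 1 is strongly negative and lag 2 is strongly positive, which is what we would expect.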

<a id="19"></a> <br>
3. PARTIAL AUTOCORRELATION FUNCTION (PACF)

This measures the correlation between the TS and a lagged version of itself, but after eliminating the variations already explained by the intervening comparisons. For example, at lag 5 it will check the correlation but remove the effects already explained by lags 1 to 4.


In [None]:
#ACF and PACF plots:
from statsmodels.tsa.stattools import acf, pacf
lag_acf = acf(ts_log_diff, nlags=20)
lag_pacf = pacf(ts_log_diff, nlags=20, method='ols')

# Approximate 95% confidence bound, drawn as dashed lines on both plots
conf_bound = 1.96/np.sqrt(len(ts_log_diff))

#Plot ACF:
plt.subplot(121)
plt.plot(lag_acf)
plt.axhline(y=0,linestyle='--',color='gray')
plt.axhline(y=-conf_bound,linestyle='--',color='gray')
plt.axhline(y=conf_bound,linestyle='--',color='gray')
plt.title('Autocorrelation Function')

#Plot PACF:
plt.subplot(122)
plt.plot(lag_pacf)
plt.axhline(y=0,linestyle='--',color='gray')
plt.axhline(y=-conf_bound,linestyle='--',color='gray')
plt.axhline(y=conf_bound,linestyle='--',color='gray')
plt.title('Partial Autocorrelation Function')
plt.tight_layout()

In this plot, the two dotted lines on either side of 0 are the confidence intervals. These can be used to determine the 'p' and 'q' values as:

1. p – The lag value where the PACF chart crosses the upper confidence interval for the first time. If you look closely, in this case the crossing is at about 1.5.
2. q – The lag value where the ACF chart crosses the upper confidence interval for the first time. If you look closely, in this case the crossing is at about 1.25.

Since p and q must be integers, the models below use p=2 and q=2.
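Reading the crossing off a chart can be replaced by a small helper. This is only a sketch: `first_lag_below`, the number of observations and the ACF values are all hypothetical, and picking p and q this way is a rule of thumb rather than a guarantee:

```python
import numpy as np

def first_lag_below(values, bound):
    """Index of the first lag whose value falls below `bound`, or None if it never does."""
    below = np.flatnonzero(np.asarray(values) < bound)
    return int(below[0]) if below.size else None

n_obs = 400                                # hypothetical number of observations
bound = 1.96 / np.sqrt(n_obs)              # same band as the dashed lines in the plots
demo_acf = [1.0, 0.35, 0.04, -0.02, 0.01]  # hypothetical ACF values for lags 0..4

print(bound, first_lag_below(demo_acf, bound))
```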

Now, lets make 3 different ARIMA models considering individual as well as combined effects. I will also print the RSS for each. Please note that here RSS is for the values of residuals and not actual series.

We need to load the ARIMA model first

<a id="20"></a> <br>
4. AR MODEL

The p, d, q values can be specified using the order argument of ARIMA, which takes a tuple (p, d, q). Let's model the three cases:

In [None]:
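# Note: statsmodels.tsa.arima_model was removed in statsmodels 0.13,
# so this cell needs an older statsmodels release to run as written.
from statsmodels.tsa.arima_model import ARIMA

# AR model: p=2, d=1, q=0
model = ARIMA(ts_log, order=(2, 1, 0))
results_AR = model.fit(disp=-1)
fig,axes = plt.subplots(1,1,figsize=(20,10))
plt.plot(ts_log_diff)
plt.plot(results_AR.fittedvalues, color='red')
# RSS between the fitted values and the differenced log series
plt.title('RSS: %.4f'% sum((results_AR.fittedvalues-ts_log_diff)**2))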

<a id="21"></a> <br>
5. MA MODEL


In [None]:
model = ARIMA(ts_log, order=(0, 1, 2))  
results_MA = model.fit(disp=-1)  
fig,axes = plt.subplots(1,1,figsize=(20,10))
plt.plot(ts_log_diff)
plt.plot(results_MA.fittedvalues, color='red')
plt.title('RSS: %.4f'% sum((results_MA.fittedvalues-ts_log_diff)**2))

<a id="22"></a> <br>
6. COMBINED MODEL

In [None]:
model = ARIMA(ts_log, order=(2, 1, 2))  
results_ARIMA = model.fit(disp=-1)  
fig,axes = plt.subplots(1,1,figsize=(20,10))
plt.plot(ts_log_diff)
plt.plot(results_ARIMA.fittedvalues, color='red')
plt.title('RSS: %.4f'% sum((results_ARIMA.fittedvalues-ts_log_diff)**2))

Here we can see that the AR and MA models have almost the same RSS but combined is significantly better. Now, we are left with 1 last step, i.e. taking these values back to the original scale.
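Until that part is written, here is a rough sketch of the idea on a toy series. The `original` values are made up, and `log_diff` only stands in for a model's fitted values; I assume those are on the differenced log scale, as in the plots above. We undo the differencing with a cumulative sum and then undo the log with an exponential:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2007-01-01", periods=6, freq="D")
original = pd.Series([100.0, 110.0, 105.0, 120.0, 130.0, 125.0], index=idx)  # hypothetical counts

log_series = np.log(original)
log_diff = log_series.diff().dropna()  # stands in for a model's fitted differences

# Undo the differencing: start from the first log value and accumulate the differences
restored_log = log_diff.cumsum().reindex(idx, fill_value=0.0) + log_series.iloc[0]
# Undo the log transform
restored = np.exp(restored_log)

print(np.allclose(restored, original))
```

With real fitted values the restored series will of course not match the original exactly; the gap between them is the model error on the original scale.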


**In progress ...**

**Be in touch to get last commits ...**

**I'll try to complete it as soon as possible**


<a id="100"></a> <br>
#  3-References

[1. A comprehensive beginner’s guide to create a Time Series Forecast (with Codes in Python).](https://www.analyticsvidhya.com/blog/2016/02/time-series-forecasting-codes-python/)

[2. My Kernel on Kaggle Feature Engineering in TwoSigma Competition.](https://www.kaggle.com/smasar/eda-preprocessing-processing-evaluation)

3. Personal Experiences in Similar Projects.

