# Basics of Time Series Analysis

```{admonition} Attribution
This notebook is based on the kernel [TS-0: the basics
](https://www.kaggle.com/konradb/ts-0-the-basics) by [GM Konrad Banachewicz](https://www.kaggle.com/konradb).  
```

This notebook summarizes some **elementary methods** for time series analysis. Sometimes you don't have the time, hardware, or data to go for a transformer, and vintage methods can be your friend.

We start by importing the necessary libraries. Most of them are familiar to anybody working in data science, with the exception of `statsmodels` ([link](https://www.statsmodels.org/stable/index.html)). This package is the product of impressive work by [Seabold and Perktold](http://conference.scipy.org/proceedings/scipy2010/pdfs/seabold.pdf), two people who set out to bring statistical functionality in Python into the 21st century. If you are likely to use statistics in your work and you are a Pythonista, familiarizing yourself with this library is a very good idea.

In this module we merely scratch the surface of `statsmodels` functionality, with seasonal decomposition as our primary tool.


In [7]:
import pandas as pd
import numpy as np
from random import gauss
from random import random
import warnings
import itertools
import pathlib

import statsmodels.formula.api as smf
import statsmodels.api as sm
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
# AR was removed in statsmodels 0.13; AutoReg is its replacement
from statsmodels.tsa.ar_model import AutoReg
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller

import seaborn as sns
from pandas.plotting import autocorrelation_plot
import matplotlib.pyplot as plt
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg', 'pdf')

Define a config class.

In [27]:
# config
class CFG:
    data_path = pathlib.Path().absolute().parents[2] / "data"
    img_dim1 = 8
    img_dim2 = 5
    
# adjust the parameters for displayed figures    
plt.rcParams.update({'figure.figsize': (CFG.img_dim1, CFG.img_dim2)})

## Groundwork

A time series is any sequence you record over time, and applications are everywhere. More formally, time series data is a sequence of data points (or observations) recorded at successive points in time. The intervals between observations are frequently, but not always, regular (hourly, daily, weekly, monthly, quarterly etc.):

$$\{X_t\} \quad t = 1, 2, \ldots, T$$
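To make the distinction between regular and irregular sampling concrete, here is a minimal sketch in `pandas`. The dates, the random values, and the variable names are illustrative assumptions, not data from this notebook.

```python
import numpy as np
import pandas as pd

# Regularly spaced series: one observation per day for two weeks (arbitrary dates).
rng = np.random.default_rng(0)
regular_index = pd.date_range("2021-01-01", periods=14, freq="D")
regular = pd.Series(rng.normal(size=14), index=regular_index)

# Irregularly spaced series: drop three interior observations.
irregular = regular.drop(regular.index[[3, 4, 9]])

# pandas can no longer infer a single frequency from the irregular index.
inferred = pd.infer_freq(irregular.index)

# Putting the series back on a daily grid exposes the gaps as NaN.
regridded = irregular.asfreq("D")
n_missing = int(regridded.isna().sum())
```

Many time series tools, including the decomposition used later in this notebook, expect a regular grid, so irregular data usually has to be resampled or interpolated first.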

Strictly speaking, a time series is a (discrete) realization of a (continuous) [stochastic process](https://en.wikipedia.org/wiki/Stochastic_process) that generates the data. The underlying reason why we can infer from the former about the latter is the [Kolmogorov extension theorem](https://en.wikipedia.org/wiki/Kolmogorov_extension_theorem). A proper mathematical treatment of this theory is well beyond the scope of this notebook, so a mathematically inclined reader is advised to look up those terms and then follow the references.

Phenomena measured over time are everywhere, so a natural question is: what can we do with time series? Some of the more popular applications are:

* **Interpretation.** We want to be able to make sense of diverse phenomena and capture the nature of the underlying dynamics. 

* **Modelling.** Understanding inherent aspects of the time series data so that we can create meaningful and accurate forecasts. 

* **Forecasting.** We want to know something about the future. 

* **Filtering or smoothing.** We want to get a better understanding of the process based on a partially or fully observed sample. 

* **Simulation.** In certain applications calculating e.g. high quantiles of a distribution is only possible with simulation, because there is not enough historical data.

## Patterns

The first thing we can do to identify patterns in a time series is to separate it into components with easily understandable characteristics:

$$X_t = T_t + S_t + C_t + I_t$$

where $T_t$ is the **trend**, the general direction of the series over a long period of time, i.e. its long-term progression (secular variation). $S_t$ is the **seasonal component**, with a fixed and known period. It appears when a distinct pattern repeats at regular intervals due to seasonal factors: annual, monthly or weekly. Obvious examples include daily power consumption patterns or annual sales of seasonal goods. $C_t$ is the **cyclical component**, a repetitive pattern that does not occur at fixed intervals, usually observed in an economic context such as business cycles. $I_t$ is the **irregular component** (residuals), the fluctuations that remain after removing the trend and the seasonal and cyclical variations.
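The additive structure can be illustrated by building a synthetic series from known components. This is a minimal sketch under stated assumptions: the linear trend, the sinusoidal seasonal pattern with period 12, the noise scale, and the dates are all made up for illustration, and the cyclical component $C_t$ is set to zero for simplicity.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)          # fixed seed so the sketch is reproducible
n, period = 120, 12                      # assumed: 10 years of monthly observations
t = np.arange(n)
index = pd.date_range("2010-01-01", periods=n, freq="MS")  # arbitrary start date

trend = 0.5 * t                                   # T_t: steady long-term growth
seasonal = 10 * np.sin(2 * np.pi * t / period)    # S_t: repeats every 12 steps
irregular = rng.normal(scale=2.0, size=n)         # I_t: random fluctuations

# X_t = T_t + S_t + I_t (C_t is zero in this sketch)
x = pd.Series(trend + seasonal + irregular, index=index, name="x")

# Removing the known trend and seasonal parts leaves the irregular part.
residual = x.to_numpy() - trend - seasonal
print(np.allclose(residual, irregular))
```

With real data the components are of course unknown and have to be estimated, which is what a decomposition method does.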



We may have different combinations of trends and seasonality. Depending on their nature, a time series can be modeled as additive or multiplicative: each observation is expressed as either a sum or a product of the components. A multiplicative model is the usual choice when the size of the seasonal swings grows with the level of the series. An alternative to a multiplicative decomposition is to first transform the data until the variation in the series appears stable over time, and then use an additive decomposition. With a log transform the two views coincide at the level of the model, because the logarithm turns a product of components into a sum: $\log X_t = \log T_t + \log S_t + \log C_t + \log I_t$.
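The link between the multiplicative model and the log transform can be checked numerically. The sketch below uses made-up components (a linear trend, a seasonal factor oscillating around 1, and positive noise around 1) and leaves out the cyclical component. All names and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(48)                                        # assumed: 4 years of monthly data

trend = 100 + 2.0 * t                                    # strictly positive level
seasonal = 1 + 0.2 * np.sin(2 * np.pi * t / 12)          # factor oscillating around 1
irregular = np.exp(rng.normal(scale=0.05, size=t.size))  # positive noise around 1

# Multiplicative series: the seasonal swing scales with the level of the trend.
x = trend * seasonal * irregular

# After a log transform the same series is a sum of log components.
log_sum = np.log(trend) + np.log(seasonal) + np.log(irregular)
print(np.allclose(np.log(x), log_sum))
```

Note that this only works when every component is strictly positive, since the logarithm is undefined otherwise.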

A popular implementation of this decomposition is available in the `statsmodels` package:

In [29]:
help(seasonal_decompose)

Help on function seasonal_decompose in module statsmodels.tsa.seasonal:

seasonal_decompose(x, model='additive', filt=None, period=None, two_sided=True, extrapolate_trend=0)
    Seasonal decomposition using moving averages.
    
    Parameters
    ----------
    x : array_like
        Time series. If 2d, individual series are in columns. x must contain 2
        complete cycles.
    model : {"additive", "multiplicative"}, optional
        Type of seasonal component. Abbreviations are accepted.
    filt : array_like, optional
        The filter coefficients for filtering out the seasonal component.
        The concrete moving average method used in filtering is determined by
        two_sided.
    period : int, optional
        Period of the series. Must be used if x is not a pandas object or if
        the index of x does not have  a frequency. Overrides default
        periodicity of x if x is a pandas object with a timeseries index.
    two_sided : bool, optional
        The moving a