# Time series analysis 

A time series is a sequence of data points indexed in time order. The pandas objects DataFrame and Series are well suited to representing time series because values can be given a time-based index, which enables concise and fast indexing and aggregation.

In this notebook we will demonstrate:
* basic pandas functions for working with time series
* visualizations of time series


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
from pandas.plotting import lag_plot
from pandas.plotting import autocorrelation_plot

from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.tsa.seasonal import seasonal_decompose

### Loading data

The cell below loads the data from a .csv file and converts the `date` column to a datetime type.

In [None]:
df = pd.read_csv('../data/single_item_data.csv')
df['date'] = pd.to_datetime(df['date'])
df.head()

In [None]:
df_agg = df.groupby('date')['quantity'].sum()
df_agg.head()

### Time series in Pandas

A time series in pandas is a Series or a DataFrame (which can hold many time series) with a time-based index.
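
As a minimal sketch of that idea, the cell below builds a Series and a DataFrame on a shared time-based index. The dates and values are made up for illustration and are not taken from the notebook's data set.

```python
import numpy as np
import pandas as pd

# Hypothetical data: ten consecutive daily observations.
dates = pd.date_range("2013-01-01", periods=10, freq="D")
ts = pd.Series(np.arange(10), index=dates, name="quantity")

# A DataFrame on the same index holds several time series side by side.
frame = pd.DataFrame({"quantity": ts, "doubled": ts * 2})

print(type(ts.index).__name__)   # DatetimeIndex
print(frame.loc["2013-01-03"])   # the row for a single day
```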

### Basic Time Series Plotting

Looking at a graph of time series will help us understand it better. 

In [None]:
def ts_plot(ts):
    f = plt.figure(figsize=(12,5))
    ax = f.add_subplot(1,1,1)
    if isinstance(ts, list):
        for ti, t in enumerate(ts):
            ax.plot(t, label="{}".format(ti+1))
        ax.legend()
    else:
        ax.plot(ts)
    ax.grid(alpha=0.3)
    ax.set_ylabel('quantity')  


In [None]:
ts_plot([df_agg])

### Date and Time in Pandas

The Python standard library includes data types for date and time data, as well as calendar-related functionality. The **datetime.datetime** type, or simply datetime, is widely used. See [Basic date and time types](https://docs.python.org/2/library/datetime.html) for details.

In [None]:
from datetime import datetime
from datetime import timedelta

The **now()** function returns the current local date and time.

In [None]:
datetime.now()

**datetime.timedelta** represents the temporal difference between two datetime objects.

In [None]:
t1 = datetime.now()

In [None]:
t2 = datetime.now()

In [None]:
t2 - t1

In [None]:
t1 - timedelta(days=1000)

<h4>Converting between string and datetime</h4>

datetime objects can be formatted as strings using str or the strftime method, passing a format specification.

In [None]:
t1.strftime('%d.%m.%y')

In [None]:
(t1 - timedelta(days=1000)).strftime('%d.%m.%y')

See the table below for a list of popular format codes. These same format codes can be used to convert strings to dates.

| type | description
|------|---------------------------------------------------------
| %Y   | 4-digit year
| %y   | 2-digit year
| %m   | 2-digit month [01, 12]
| %d   | 2-digit day [01, 31]
| %H   | Hour (24-hour clock) [00, 23]
| %I   | Hour (12-hour clock) [01, 12]
| %M   | 2-digit minute [00, 59]
| %S   | Second [00, 61] (seconds 60, 61 account for leap seconds)
| %w   | Weekday as integer [0 (Sunday), 6]
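
The reverse direction, from string to datetime, can be sketched with `datetime.strptime`, which takes the string and a format built from the codes above. The date string here is an arbitrary example.

```python
from datetime import datetime

# Parse a string using the same codes that strftime uses for formatting.
parsed = datetime.strptime("04.05.13", "%d.%m.%y")
print(parsed)  # 2013-05-04 00:00:00

# Round trip: formatting the parsed value gives back the original string.
round_trip = parsed.strftime("%d.%m.%y")
print(round_trip)  # 04.05.13
```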

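A time-based index supports partial-string indexing: passing a year, a month, or a date string to `loc` selects every observation that falls in that period.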

In [None]:
df_agg.loc['2013'].head()

In [None]:
df_agg.loc['2013-05-04':'2013-05-12']

Use `timedelta` to compute dates relative to a reference point:

In [None]:
tstart = df_agg.index.min()
t = tstart + timedelta(days=100)
df_agg.loc[t:t+timedelta(days=5)]

### Resampling data

The **resample** method provides frequency conversion and resampling of regular time-series data.

The following <i>offset aliases</i> are given to useful common time series frequencies:

<table align=left>
<tr><td><b>name</b></td><td><b>description</b>                       </td>
<tr><td>B       </td><td>business day frequency                       </td>
<tr><td>C       </td><td>custom business day frequency (experimental) </td>
<tr><td>D       </td><td>calendar day frequency                       </td>
<tr><td>W       </td><td>weekly frequency                             </td>
<tr><td>M       </td><td>month end frequency                          </td>
<tr><td>BM      </td><td>business month end frequency                 </td>
<tr><td>CBM     </td><td>custom business month end frequency          </td>
<tr><td>MS      </td><td>month start frequency                        </td>
<tr><td>BMS     </td><td>business month start frequency               </td>
<tr><td>CBMS    </td><td>custom business month start frequency        </td>
<tr><td>Q       </td><td>quarter end frequency                        </td>
<tr><td>BQ      </td><td>business quarter end frequency               </td>
<tr><td>QS      </td><td>quarter start frequency                      </td>
<tr><td>BQS     </td><td>business quarter start frequency             </td>
<tr><td>A       </td><td>year end frequency                           </td>
<tr><td>BA      </td><td>business year end frequency                  </td>
<tr><td>AS      </td><td>year start frequency                         </td>
<tr><td>BAS     </td><td>business year start frequency                </td>
<tr><td>BH      </td><td>business hour frequency                      </td>
<tr><td>H       </td><td>hourly frequency                             </td>
<tr><td>T       </td><td>minutely frequency                           </td>
<tr><td>S       </td><td>secondly frequency                           </td>
<tr><td>L       </td><td>milliseconds                                 </td>
<tr><td>U       </td><td>microseconds                                 </td>
<tr><td>N       </td><td>nanoseconds                                  </td>
</table>

In [None]:
df_agg.resample('M').mean().head()

In [None]:
df_agg.resample('M').sum().head()
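
To make the effect of resampling easy to check by hand, here is a small sketch on a synthetic series (not the notebook's data) that holds the value 1 on every day of two months. It uses the month-start alias `MS`, partly because the spelling of the month-end alias appears to differ between pandas releases (`M` in older ones, `ME` in newer ones).

```python
import numpy as np
import pandas as pd

# Hypothetical series: the value 1 on every day of January and February 2013.
idx = pd.date_range("2013-01-01", "2013-02-28", freq="D")
daily = pd.Series(np.ones(len(idx)), index=idx)

# "MS" labels each monthly bin with the first day of its month.
monthly_sum = daily.resample("MS").sum()
monthly_mean = daily.resample("MS").mean()

print(monthly_sum)   # 31.0 for January, 28.0 for February
print(monthly_mean)  # 1.0 for both months
```

The sum counts the days in each month, while the mean stays at 1, which shows that the aggregation function, not `resample` itself, decides what each bin reports.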

### Rolling calculations

Pandas provides a number of window functions for computing common rolling statistics. Among these are count, sum, mean, median, correlation, variance, covariance, standard deviation, skewness, and kurtosis.

Generally these methods all have the same interface. They all accept the following arguments:

- **window**: size of moving window
- **min_periods**: threshold of non-null data points to require
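
The interaction between the two arguments can be sketched on a tiny hand-made series. The values are arbitrary and chosen so the results are easy to verify.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# By default a window of 3 requires 3 observations, so the first two results are NaN.
strict = s.rolling(window=3).mean()

# min_periods=1 emits a value as soon as one observation is available.
relaxed = s.rolling(window=3, min_periods=1).mean()

print(strict.tolist())   # [nan, nan, 2.0, 3.0, 4.0]
print(relaxed.tolist())  # [1.0, 1.5, 2.0, 3.0, 4.0]
```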

In [None]:
ts_plot([df_agg.rolling(window=30).mean(), df_agg.resample('M').mean()])

### Visualizations of Time Series

### Histogram and Density Plots

Another important visualization is of the distribution of observations themselves. This means a plot of the values without the temporal ordering. Some linear time series forecasting methods assume a well-behaved distribution of observations (i.e. a bell curve or normal distribution). This can be explicitly checked using tools like statistical hypothesis tests. But plots can provide a useful first check of the distribution of observations both on raw observations and after any type of data transform has been performed.
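
As a rough numeric complement to the plots, and not a formal hypothesis test, sample skewness and excess kurtosis can be compared against the values expected for a normal distribution (both close to 0). The sketch below uses two synthetic samples; on the notebook's data the same two method calls would be applied to `df_agg`.

```python
import numpy as np
import pandas as pd

# Hypothetical samples: one drawn from a normal, one from an exponential distribution.
rng = np.random.default_rng(0)
normal_like = pd.Series(rng.normal(size=5000))
skewed = pd.Series(rng.exponential(size=5000))

# Skewness and excess kurtosis near 0 are consistent with a bell-shaped distribution.
for name, sample in [("normal", normal_like), ("exponential", skewed)]:
    print(name, round(sample.skew(), 2), round(sample.kurt(), 2))
```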

In [None]:
df_agg.hist();

In [None]:
df_agg.plot(kind='kde');

### Box and Whisker Plots

Histograms and density plots provide insight into the distribution of all observations, but we
may be interested in the distribution of values by time interval. Another type of plot that is
useful to summarize the distribution of observations is the box and whisker plot. This plot
draws a box around the 25th and 75th percentiles of the data that captures the middle 50% of
observations. A line is drawn at the 50th percentile (the median) and whiskers are drawn above
and below the box to summarize the general extents of the observations. Dots are drawn for
outliers outside the whiskers or extents of the data.
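
One way to look at the distribution by time interval is to group the series by calendar year and compute the quartiles that a box plot per year would draw. This is a sketch on synthetic data with a deliberately higher level in the second year; the commented-out line shows how the matching plot could be drawn, assuming matplotlib is installed.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series spanning two years, shifted upward in the second year.
idx = pd.date_range("2013-01-01", "2014-12-31", freq="D")
rng = np.random.default_rng(0)
values = rng.normal(loc=10, scale=2, size=len(idx))
values[idx.year == 2014] += 5
ts = pd.Series(values, index=idx, name="quantity")

# Quartiles per calendar year: the box edges and median line of a per-year box plot.
quartiles = ts.groupby(ts.index.year).quantile([0.25, 0.5, 0.75]).unstack()
print(quartiles)

# To draw one box per year:
# ts.to_frame().assign(year=ts.index.year).boxplot(column="quantity", by="year")
```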

In [None]:
df_agg.to_frame().boxplot()


### Lag Scatter Plots

Time series modeling assumes a relationship between an observation and the observations that precede it.
Previous observations in a time series are called lags, with the observation at the previous time
step called lag=1, the observation two time steps ago called lag=2, and so on. A useful type of plot
for exploring the relationship between each observation and a lag of that observation is the
scatter plot. Pandas has a built-in function for exactly this, called the lag plot. With the default lag of 1, it plots the observation at time t on the x-axis and the observation at time t+1 on the y-axis, so each point pairs a value with the one that follows it.

* If the points cluster along a diagonal line from the bottom-left to the top-right of the plot,
it suggests a positive correlation relationship.
* If the points cluster along a diagonal line from the top-left to the bottom-right, it suggests
a negative correlation relationship.
* Either relationship is good as they can be modeled.

Points packed tightly around the diagonal line suggest a stronger relationship, and more spread
from the line suggests a weaker one. A ball in the middle or a spread across the whole plot
suggests a weak relationship or none at all.
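
What the lag plot shows visually can also be summarized as a single number. The sketch below pairs each observation with its predecessor using `shift` and computes their correlation, first for a synthetic random walk (strongly related to its own past) and then for white noise (unrelated). Both series are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical series: a random walk, where each value stays close to the previous one.
rng = np.random.default_rng(0)
walk = pd.Series(rng.normal(size=500).cumsum())

# Pair each observation with the one before it; shift(1) moves values down one step.
pairs = pd.DataFrame({"previous": walk.shift(1), "current": walk}).dropna()

# A correlation close to 1 means the points hug the rising diagonal.
walk_corr = pairs["previous"].corr(pairs["current"])
print(walk_corr)

# White noise for comparison: the same calculation gives a value near 0.
noise = pd.Series(rng.normal(size=500))
noise_corr = noise.corr(noise.shift(1))
print(noise_corr)
```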

In [None]:
# create a lag scatter plot
pd.plotting.lag_plot(df_agg);

### Autocorrelation Plots

We can quantify the strength and type of relationship between observations and their lags. In statistics, this is called correlation, and when calculated against lag values in time series, it is called autocorrelation (self-correlation). A correlation value calculated between two groups of numbers, such as observations and their lag=1 values, results in a number between -1 and 1. The sign of this number indicates a negative or positive correlation respectively. A value close to zero suggests a weak correlation, whereas a value closer to -1 or 1 indicates a strong correlation.

Correlation values, called correlation coefficients, can be calculated between the series and each of its lags. Once calculated, a plot can be created to show how this relationship changes with the lag. This type of plot is called an **autocorrelation plot**.

The Statsmodels library provides its own versions of the autocorrelation plot, with lags on the horizontal axis and correlations on the vertical axis:
- **Autocorrelation Function (ACF)**: measures the correlation between the time series and a lagged version of itself.
- **Partial Autocorrelation Function (PACF)**: measures the correlation between the time series and a lagged version of itself (like the ACF), but after removing the variation already explained by the shorter, intervening lags.
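
To show what the values behind an ACF plot look like, the sketch below computes lag-by-lag correlations with `Series.autocorr` on a synthetic series that repeats every 7 steps. This is a pandas-only illustration; the estimator used by Statsmodels may differ slightly in how it normalizes, so the numbers need not match `plot_acf` exactly.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a period of 7 steps plus a little noise.
rng = np.random.default_rng(0)
t = np.arange(280)
s = pd.Series(np.sin(2 * np.pi * t / 7) + rng.normal(scale=0.1, size=t.size))

# Series.autocorr(lag) is the Pearson correlation between s and s.shift(lag).
acf_values = pd.Series({lag: s.autocorr(lag=lag) for lag in range(1, 15)})
print(acf_values.round(2))
```

The correlations peak at lags 7 and 14 and turn negative around half a period, which is the pattern a seasonal series produces in an autocorrelation plot.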

In [None]:
# Autocorrelation plot from pandas
ax = autocorrelation_plot(df_agg)  # pass the aggregated daily series
ax.set_xlim([0, 50])               # show only the first 50 lags to keep the plot readable

In [None]:
# Autocorrelation Function (ACF)
plot_acf(df_agg, lags=30);

- Autocorrelation represents the degree of similarity between a given time series and a lagged version of itself over successive time intervals.
- Autocorrelation measures the relationship between a variable's current value and its past values.
- An autocorrelation of +1 represents a perfect positive correlation, while an autocorrelation of -1 represents a perfect negative correlation.

In [None]:
# Partial Autocorrelation Function (PACF)
plot_pacf(df_agg, lags=30);