# Time Series Analysis Workshop Session 1: Fundamentals

###### Adam Hussain - Head of Talent Development, Imperial College Data Science Society

In this workshop series we will explore some useful techniques for analysing time series data, utilising Python packages such as `pandas`, `statsmodels`, and `numpy`.

Time series analysis extends far beyond this short workshop series, and some techniques rely on advanced mathematical concepts. In this series, complex mathematics is omitted so that the material is accessible to students on any degree course.

## Topics Covered:

- Brief recap on `pandas`
- Plotting time series
- Rolling values
- Resampling
- Decomposition

## Importing `pandas`, Reading in Data

`pandas` is an extremely popular package that is well-suited to working with tabular data and has many useful built-in features, particularly for time series data.

View the user guide [here](https://pandas.pydata.org/docs/user_guide/index.html)

**Useful:** for new users, see [10 minutes to pandas](https://pandas.pydata.org/docs/user_guide/10min.html#min)

In [8]:
import numpy as np
import pandas as pd

In [10]:
# RUN THIS IF YOU HAVE DOWNLOADED THE FILE

df = pd.read_csv('./data')
display(df)


### Rename the columns

In [None]:
df.columns = ['date', 'amount']


In [None]:
display(df)


### Convert the date column to datetime format
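This is necessary because `pandas` initially reads the date column as plain strings, which it cannot treat as points in time. Once the values are converted to datetime objects, they can be compared, sorted, and grouped as dates.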

Compare the DataFrame before and after altering the date column. You can read up on datetime [here](https://docs.python.org/3/library/datetime.html).

In [None]:
df['date'] = pd.to_datetime(df['date'])
display(df)
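
As a minimal sketch of what the conversion buys us, the example below uses a small hypothetical DataFrame (the values are made up and are not taken from the workshop dataset). After `pd.to_datetime`, the column exposes calendar fields through the `.dt` accessor and supports date arithmetic.

```python
import pandas as pd

# Hypothetical sample values, for illustration only
raw = pd.DataFrame({'date': ['1985-01-01', '1985-02-01', '1985-03-01'],
                    'amount': [1.0, 2.0, 3.0]})

converted = raw.copy()
converted['date'] = pd.to_datetime(converted['date'])

# Datetime values expose calendar fields via the .dt accessor
years = converted['date'].dt.year.tolist()
months = converted['date'].dt.month.tolist()

# Subtracting two datetimes gives a Timedelta (here, the length of January)
gap = converted['date'].iloc[1] - converted['date'].iloc[0]

print(years, months, gap)
```

None of these operations are available while the column still holds strings.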

### Set the date as the index

This is convenient for some uses later on. The default integer index on the left of the DataFrame carries no information about the data, so we do not need it.

Observe the difference before and after we do this.

In [None]:
df = df.set_index('date')
display(df)
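
One convenience of a datetime index is that rows can be selected with date strings through `.loc`. The sketch below assumes a small hypothetical monthly DataFrame named `demo` (not the workshop data) to show the idea.

```python
import pandas as pd

# Hypothetical monthly data starting in November 1994, for illustration only
idx = pd.date_range('1994-11-01', periods=6, freq='MS')
demo = pd.DataFrame({'amount': [10, 11, 12, 13, 14, 15]}, index=idx)
demo.index.name = 'date'

in_1995 = demo.loc['1995']                  # every row dated in 1995
jan_to_feb = demo.loc['1995-01':'1995-02']  # date slices include both ends

print(in_1995)
print(jan_to_feb)
```

With the default integer index, the same selections would need an explicit comparison against the date column.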

## Plotting our time series

### Plotting over the whole range

In [None]:
import matplotlib.pyplot as plt

date = df.index
amount = df['amount']

plt.figure(figsize=(12,8))
plt.plot(date, amount)

plt.title('Monthly Production of Electricity in the US')
plt.ylabel('Amount produced')
plt.show()

### Plotting over a few years

In [None]:
# create a list of boolean values (True or False), one per date in the index:
# True where the date falls in 1995, 1996 or 1997, False otherwise

ind = [i.year in [1995, 1996, 1997] for i in df.index]

In [None]:
date_sample = date[ind]
amount_sample = amount[ind]
plt.figure(figsize=(12,7))
plt.plot(date_sample, amount_sample)
plt.title('Monthly Production of Electricity in the US (1995 to 1997 Inclusive)')
plt.ylabel('Amount produced')
plt.xticks(rotation=-45);
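
For those who prefer to avoid the list comprehension, a vectorised alternative is sketched below. It assumes a hypothetical monthly Series named `demo` and checks that `index.year.isin(...)` produces the same mask as the loop above.

```python
import pandas as pd

# Hypothetical monthly series covering 1994 to 1998, for illustration only
idx = pd.date_range('1994-01-01', periods=60, freq='MS')
demo = pd.Series(range(60), index=idx)

# Mask built with a list comprehension, as in the workshop code
mask_list = [d.year in [1995, 1996, 1997] for d in demo.index]

# Equivalent mask built directly from the index
mask_vec = demo.index.year.isin([1995, 1996, 1997])

sample = demo[mask_vec]
print(len(sample), sample.index.min(), sample.index.max())
```

Both masks select the same 36 months; the vectorised form is usually shorter and faster on large datasets.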

# Rolling Values

## Rolling Average

Taking a rolling average is a good way to observe the general trend of a time series.
In our case we use a window size of 12 (12 months in a year), removing the seasonal fluctuation and hence smoothing the data.

Picture sliding a window along the data, and taking the mean of all the values in the window.

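With a window of 12, the first 11 entries do not have a full window behind them, so by default they are NaN. This is not a problem for plotting, since matplotlib simply skips them. Alternatively, the `min_periods` argument sets the minimum number of observations needed to produce a value; with `min_periods=1`, as used below, the early entries are averaged over however many values are available.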

The choice of size of window is problem-specific.

Read more about rolling values [here (documentation)](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html)

See more examples using the `rolling()` method [here](https://sparkbyexamples.com/pandas/pandas-rolling-mean-average-sum/)

In [None]:
rolling_ave = amount.rolling(12, min_periods=1).mean()

In [None]:
plt.figure(figsize=(12,7))
plt.plot(date, amount, label='Raw data')
plt.plot(date, rolling_ave, label='12 month rolling average')
plt.legend()
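
To see the effect of `min_periods` on a small scale, here is a sketch with a window of 3 on a hypothetical five-value Series (chosen only to keep the arithmetic easy to check by hand).

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 6.0, 8.0, 10.0])

# Default: no value until the window holds 3 observations
strict = s.rolling(3).mean()

# min_periods=1: average whatever is available in the window so far
lenient = s.rolling(3, min_periods=1).mean()

print(strict.tolist())   # [nan, nan, 4.0, 6.0, 8.0]
print(lenient.tolist())  # [2.0, 3.0, 4.0, 6.0, 8.0]
```

The two results agree once the window is full; they differ only at the start of the series.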

## Rolling Standard Deviation

We can see whether the production of electricity becomes more or less variable over time. This can be useful if you want to identify periods where a variable is more stable or more volatile.

In [None]:
plt.figure(figsize=(12,6))

# plotting the standard deviation over rolling 12 month periods

plt.subplot(121)
rolling_std = amount.rolling(12).std()
plt.plot(date, rolling_std, label='Standard Deviation')
plt.legend()

# plotting the normalised standard deviation over rolling 12 month periods

plt.subplot(122)
rolling_std_norm = rolling_std / rolling_ave
plt.plot(date, rolling_std_norm, label='Normalised Standard Deviation')
plt.legend()
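
The reason for normalising can be sketched as follows. Dividing the standard deviation by the mean (a ratio often called the coefficient of variation) removes the effect of the overall level of the series. The example assumes two hypothetical Series, `base` and `scaled`, and a helper `rolling_cv` introduced here for illustration; unlike the workshop code, it uses the same window settings for both the mean and the standard deviation.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'scaled' has the same shape as 'base' but is 10x larger
base = pd.Series([10.0, 12.0, 9.0, 11.0, 10.0, 13.0, 8.0, 12.0])
scaled = base * 10

def rolling_cv(series, window):
    """Rolling standard deviation divided by rolling mean."""
    r = series.rolling(window)
    return r.std() / r.mean()

# The raw standard deviation grows with the level of the series...
std_ratio = (scaled.rolling(4).std() / base.rolling(4).std()).dropna()

# ...but the normalised version is unchanged
cv_diff = (rolling_cv(scaled, 4) - rolling_cv(base, 4)).dropna()

print(std_ratio.round(6).tolist())
print(cv_diff.abs().max())
```

This is why the normalised plot is the fairer comparison when the series has a strong upward trend.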

## Resampling

Resampling is useful if we have data with a given sampling rate, but we are interested in a different timescale.
Below, we resample our data quarterly, and then yearly. Again, this can get rid of small fluctuations which you are not interested in.

For example, if you are analysing some financial data, quarterly resampling may come in handy.

### Quarterly

In [None]:
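# 'Q' groups the data by calendar quarter (labelled by quarter end);
# pandas 2.2 and later prefer the alias 'QE'
df_q = df.resample('Q').sum()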

quarters = df_q.index
amount_q = df_q['amount']

df_q

In [None]:
plt.figure(figsize=(12,7))
plt.plot(quarters, amount_q)
plt.title('Quarterly Production of Electricity in the US')
plt.ylabel('Amount produced')
plt.show()

### Notice the drop - why?

We are resampling a dataset of **monthly** data into 3-month periods, but our dataset has 397 entries.

Dividing 397 by 3 gives a remainder of 1 (397 = 3 × 132 + 1), so, provided the series starts at the beginning of a quarter, the final "quarterly" amount only contains 1 month's worth.

In the plot below, we have omitted this artefact, by taking every entry except the final one.

In [None]:
print(397%3)

In [None]:
plt.figure(figsize=(12,7))
plt.plot(quarters[:-1], amount_q[:-1])
plt.title('Quarterly Production of Electricity in the US')
plt.ylabel('Amount produced')
plt.show()
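
Rather than relying on the remainder, incomplete periods can also be detected by counting how many months fall into each one. The sketch below assumes a hypothetical seven-month Series named `demo`, and it uses the `'QS'` alias (quarters labelled by their start date) instead of the `'Q'` alias used in the workshop code.

```python
import pandas as pd

# Hypothetical monthly series: two full quarters plus one extra month
idx = pd.date_range('2000-01-01', periods=7, freq='MS')
demo = pd.Series([1.0] * 7, index=idx)

totals = demo.resample('QS').sum()    # quarterly totals
counts = demo.resample('QS').count()  # number of months in each quarter

# Keep only the quarters that contain all 3 months
complete = totals[counts == 3]

print(counts.tolist())   # [3, 3, 1]
print(complete.tolist()) # [3.0, 3.0]
```

The same idea applies to yearly resampling by requiring a count of 12, and it also catches an incomplete period at the start of the series.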

### Annually

In [None]:
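# 'Y' groups the data by calendar year (labelled by year end);
# pandas 2.2 and later prefer the alias 'YE'
df_y = df.resample('Y').sum()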

years = df_y.index
amount_y = df_y['amount']

# .head() prints the first 5 values of the DataFrame
df_y.head()

In [None]:
plt.figure(figsize=(12,7))
plt.plot(years, amount_y)
plt.title('Yearly Production of Electricity in the US')
plt.ylabel('Amount produced')
plt.show()

### Notice the drop (again) - why?

We are resampling a dataset of **monthly** data into 12-month periods, but our dataset has 397 entries.

Dividing 397 by 12 gives a remainder of 1 (397 = 12 × 33 + 1), so, provided the series starts at the beginning of a year, the final "yearly" amount only contains 1 month's worth.

In the plot below, we have omitted this artefact, by taking every entry except the final one.

In [None]:
print(397%12)

In [None]:
plt.figure(figsize=(12,7))
plt.plot(years[:-1], amount_y[:-1])
plt.title('Yearly Production of Electricity in the US')
plt.ylabel('Amount produced')
plt.show()

## Seasonal Decomposition

This is a technique which considers a time series to have 3 components:

- Trend
- Seasonality
- Residuals

Here we consider 2 types of decomposition: **additive** and **multiplicative**.

Additive decomposition treats the time series as the **sum** of the above 3 components.

Multiplicative decomposition treats the series as the **product** of the above 3 components.
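
In symbols, the two models can be written as below. The notation is introduced here for illustration and does not appear in the workshop code: $y_t$ is the observed value at time $t$, $T_t$ the trend, $S_t$ the seasonality and $R_t$ the residual.

$$
\text{Additive: } \quad y_t = T_t + S_t + R_t
$$

$$
\text{Multiplicative: } \quad y_t = T_t \times S_t \times R_t
$$

As a rule of thumb, the additive form suits series whose seasonal swings stay roughly constant in size, while the multiplicative form suits series whose swings grow with the level of the series.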

### We will use the `seasonal_decompose` function from `statsmodels.tsa.seasonal`
View the documentation [here](https://www.statsmodels.org/dev/generated/statsmodels.tsa.seasonal.seasonal_decompose.html).

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose

## Additive Decomposition

In [None]:
decomp_a = seasonal_decompose(df, model='additive')

`decomp_a` is a `DecomposeResult` object with attributes `trend`, `seasonal` and `resid`, each of which is a `Series` object

In [None]:
trend_a = decomp_a.trend
seasonal_a = decomp_a.seasonal
resid_a = decomp_a.resid

### Plotting the Results of Additive Decomposition

Below we plot the raw data, along with the 3 components of the seasonal decomposition.

In [None]:
plt.figure(figsize=(12,9))
plt.suptitle('Additive Decomposition')

plt.subplot(411)
plt.plot(date, amount, label='raw data')
plt.legend(loc='upper left')

plt.subplot(412)
plt.plot(date, trend_a, label='trend')
plt.legend(loc='upper left')

plt.subplot(413)
plt.plot(date, seasonal_a, label='seasonality')
plt.legend(loc='upper left')

plt.subplot(414)
plt.plot(date, resid_a, label='residuals')
plt.legend(loc='upper left')
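
To build intuition for what a decomposition like this does, here is a simplified sketch of the classical moving-average approach written with `pandas` only. It is an illustration under stated assumptions (a synthetic series, a period of 12, and the hypothetical names `y`, `pattern`, `trend`, `seasonal` and `resid`), not a description of how `seasonal_decompose` is implemented internally.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: a linear trend plus a fixed 12-month pattern
idx = pd.date_range('2000-01-01', periods=48, freq='MS')
pattern = np.array([3, 2, 0, -2, -3, -1, 1, 3, 2, 0, -2, -3], dtype=float)
y = pd.Series(0.5 * np.arange(48) + np.tile(pattern, 4), index=idx)

# Trend: centred 12-month average (12 is even, so average two adjacent windows)
trend = y.rolling(12).mean().rolling(2).mean().shift(-6)

# Seasonality: mean detrended value per calendar month, centred on zero
detrended = y - trend
monthly = detrended.groupby(detrended.index.month).mean()
monthly = monthly - monthly.mean()
seasonal = pd.Series(monthly.loc[y.index.month].to_numpy(), index=y.index)

# Residuals: whatever the trend and seasonality leave unexplained
resid = y - trend - seasonal

print(trend.dropna().head())
print(seasonal.head(12))
print(resid.abs().max())
```

Because the synthetic series is exactly trend plus pattern, the residuals here are essentially zero. The trend is undefined for the first and last 6 months, since a centred window cannot be filled there.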

## Multiplicative Decomposition

The cell below creates `decomp_m`, a `DecomposeResult` object with the same attributes as `decomp_a` (`trend`, `seasonal` and `resid`), each of which is a `Series` object.

In [None]:
decomp_m = seasonal_decompose(df, model='multiplicative')

In [None]:
trend_m = decomp_m.trend
seasonal_m = decomp_m.seasonal
resid_m = decomp_m.resid

### Plotting the Results of Multiplicative Decomposition

Below we plot the raw data, along with the 3 components of the seasonal decomposition.

In [None]:
plt.figure(figsize=(12,9))
plt.suptitle('Multiplicative Decomposition')

plt.subplot(411)
plt.plot(date, amount, label='raw data')
plt.legend(loc='upper left')

plt.subplot(412)
plt.plot(date, trend_m, label='trend')
plt.legend(loc='upper left')

plt.subplot(413)
plt.plot(date, seasonal_m, label='seasonality')
plt.legend(loc='upper left')

plt.subplot(414)
plt.plot(date, resid_m, label='residuals')
plt.legend(loc='upper left')