# Time Series

The problems in this notebook correspond to the concepts covered in `Lectures/Supervised Learning/Time Series`.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

##### 1. Happy birthday!

Write a `list` comprehension or `for` loop that builds a list containing the `datetime` of each of your birthdays, from the day you were born through today. Sorry if you were born on a leap day!

In [None]:
## code here




In [None]:
## code here




##### 2. Load data

- Load `bike_google_trends.csv` from the `data` folder and set aside the last 12 observations as a test set.

- Load `goog_trend_pumpkin.csv` from the `data` folder and set aside the last 12 observations as a test set.

In [None]:
bike = pd.read_csv("../../data/bike_google_trends.csv", parse_dates = ['Month'])

In [None]:
bike_test = bike.tail(12).copy()
bike_train = bike.drop(bike_test.index).copy()

In [None]:
pumpkin = pd.read_csv("../../data/goog_trend_pumpkin.csv", parse_dates = ['Month'])

In [None]:
pump_test = pumpkin.tail(12).copy()
pump_train = pumpkin.drop(pump_test.index).copy()

##### 3. Seasonal differencing

In lecture we talked about first differencing a non-stationary time series that exhibits a trend in order to create a seemingly stationary time series.

This can also be done for seasonal data. Suppose we suspect that a time series, $\left\lbrace y_t \right\rbrace$, exhibits seasonality where a season lasts $m$ time steps. Then the first seasonally differenced time series is:

$$
\nabla y_t = y_t - y_{t-m}.
$$
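
As a quick illustration of the formula, here is a minimal sketch on a made-up series (the series, the pattern values and the choice $m = 4$ are hypothetical and are not taken from the bike data). With `pandas`, `Series.diff(m)` computes the same differences.

```python
import numpy as np

# Hypothetical toy series: a linear trend plus a pattern that repeats every m = 4 steps
m = 4
t = np.arange(12)
pattern = np.array([10.0, 0.0, -10.0, 0.0])
y = 0.5 * t + pattern[t % m]

# First seasonal difference y_t - y_{t-m}, which is only defined for t >= m
seasonal_diff = y[m:] - y[:-m]

print(seasonal_diff)
```

In this toy case the repeating pattern cancels exactly, so only the change in the trend over one season remains.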

Plot the autocorrelation of the `bike_train` data set, then perform first seasonal differencing on these data and plot the autocorrelation of the seasonally differenced series.

Does the differenced series appear less likely to violate stationarity?

In [None]:
## code here




In [None]:
## code here




In [None]:
## code here




##### 4. Plotting pumpkins

Plot the training set from `goog_trend_pumpkin.csv`. In particular, plot `pumpkin_trend` over time.

In [None]:
## code here



##### 5. Baselines for seasonality AND trend

In `Lectures/Supervised Learning/Time Series Forecasting/4. Baseline Forecasts` we demonstrated six unique forecasts, none of which account for data with seasonality and trend. Here we will demonstrate two more baselines that do account for those.

##### a. Seasonal average with trend

The first just adds a trend component to the seasonal average baseline forecast:

$$
f(t) = \left\lbrace \begin{array}{l c c}\frac{1}{\left\lfloor n/m \right\rfloor + 1} \sum_{i=0}^{\left\lfloor n/m \right\rfloor} y_{t\%m + i*m} + \beta (t-\frac{n}{2}), & \text{for} & t > n  \\
                                        y_t & \text{for} & t\leq n
                                        \end{array}\right.,
$$

where you can estimate $\beta$ with the average value of the first seasonal differences discussed in 3. above.
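
The formula above can be sketched on made-up data as follows. This is only one possible reading, under some loudly stated assumptions: time is indexed from 0 so the training set is $y_0, \dots, y_{n-1}$, the seasonal average uses every training value in the same position of the season, and $\beta$ is taken to be a per-time-step slope, so the mean seasonal difference is divided by $m$. The function name and the toy series are hypothetical.

```python
import numpy as np

def seasonal_average_with_trend(y, m, horizon):
    """Forecast `horizon` steps past the end of y, using a 0-based time index."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Assumption of this sketch: beta is a per-step slope, so the mean
    # seasonal difference (a change over m steps) is divided by m.
    beta = np.mean(y[m:] - y[:-m]) / m
    forecasts = []
    for t in range(n, n + horizon):
        # Average every training value that shares t's position in the season
        seasonal_mean = y[t % m::m].mean()
        forecasts.append(seasonal_mean + beta * (t - n / 2))
    return np.array(forecasts)

# Hypothetical toy data: linear trend plus a pattern of length m = 3
t_train = np.arange(12)
y_toy = 0.5 * t_train + np.array([1.0, 5.0, 3.0])[t_train % 3]

avg_trend_forecast = seasonal_average_with_trend(y_toy, m=3, horizon=3)
print(avg_trend_forecast)
```

The $\beta(t - n/2)$ term shifts each seasonal average, which sits roughly at the middle of the training window, forward to time $t$.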

##### b. Seasonal naive with trend

The second adds a trend component to the seasonal naive forecast:

$$
f(t) = \left\lbrace \begin{array}{l c c}y_\tau + \beta(t-n), & \text{for} & t > n  \\
                                        y_t & \text{for} & t\leq n
                                        \end{array}\right.,
$$

where 

$$
\tau = t - \left(\left\lfloor \frac{t - n}{m} \right\rfloor + 1\right) m, \text{ with } \lfloor \bullet \rfloor \text{ denoting the floor function.}
$$
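
A minimal sketch of this second baseline on made-up data is below. It assumes a 0-based time index (training set $y_0, \dots, y_{n-1}$, forecasts starting at $t = n$), under which $\tau$ is the most recent training time in the same position of the season as $t$. As a further assumption of the sketch, $\beta$ is a per-time-step slope obtained by dividing the mean seasonal difference by $m$. The function name and toy series are hypothetical.

```python
import numpy as np

def seasonal_naive_with_trend(y, m, horizon):
    """Forecast `horizon` steps past the end of y, using a 0-based time index."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Assumption of this sketch: beta is a per-step slope
    beta = np.mean(y[m:] - y[:-m]) / m
    forecasts = []
    for t in range(n, n + horizon):
        # tau = t - (floor((t - n) / m) + 1) * m, as in the formula above
        tau = t - ((t - n) // m + 1) * m
        forecasts.append(y[tau] + beta * (t - n))
    return np.array(forecasts)

# Hypothetical toy data: linear trend plus a pattern of length m = 3
t_train = np.arange(12)
y_toy = 0.5 * t_train + np.array([1.0, 5.0, 3.0])[t_train % 3]

naive_trend_forecast = seasonal_naive_with_trend(y_toy, m=3, horizon=5)
print(naive_trend_forecast)
```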


Plot both forecasts along with the training and test data for the `goog_trend_pumpkin.csv` `pumpkin_trend` column.

In [None]:
## code here




In [None]:
## code here





In [None]:
## code here




In [None]:
## code here




##### 6. Periodograms

While we can sometimes tell the length of a full cycle for periodic data through visual inspection of the time series or its autocorrelation plot, that is not always possible.

Another tool we can use to identify the number of time steps in a single cycle is the <i>periodogram</i>. Here we mention some of the theory and show how to make and interpret a periodogram using Python.

A periodogram first fits the following sum of trigonometric functions:

$$
a_0 + \sum_{p = 1}^{n/2 - 1} \left( a_p \cos\left(2\pi \frac{pt}{n} \right) + b_p \sin\left( 2\pi \frac{pt}{n} \right) \right) + a_{n/2} \cos\left(\pi t \right)
$$

using fast Fourier transforms and then plots $R_p^2 = a_p^2 + b_p^2$ against the frequency for each value of $p$.  

If we recall from trigonometry, for:

$$
A \cos\left(2\pi \omega t \right),
$$

$A$ gives the amplitude and $\omega$ gives the frequency. So a larger value of $R_p^2$ indicates a larger amplitude for the trigonometric terms at frequency $\omega = p/n$, meaning those terms contribute more to the sum. Since the period of a trigonometric function is $1/\omega$, the frequency with the largest $R_p^2$ suggests what the period of the time series may be.
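
To make the link between the coefficients and the fast Fourier transform concrete, here is a minimal sketch using `numpy.fft` on a made-up signal with a known period of 12 time steps. The signal and the scaling used to recover $a_p$ and $b_p$ are choices made for this sketch. `scipy`'s `periodogram` scales its output differently, so its values will not equal $R_p^2$ as computed here, but the peak should fall at the same frequency.

```python
import numpy as np

# Hypothetical signal: n = 120 time steps, one full cycle every 12 steps
n = 120
t = np.arange(n)
y = 3.0 * np.cos(2 * np.pi * t / 12) + 4.0 * np.sin(2 * np.pi * t / 12)

# Real FFT: coef[p] = sum_t y_t * exp(-2*pi*i*p*t/n)
coef = np.fft.rfft(y)
freqs = np.fft.rfftfreq(n)      # frequencies p / n, in cycles per time step

# For 0 < p < n/2 the cosine and sine coefficients are recovered as below
# (the p = 0 and p = n/2 terms need a factor of 1/n instead of 2/n)
a = 2 * coef.real / n
b = -2 * coef.imag / n
R2 = a**2 + b**2

# Skip the zero frequency, then convert the peak frequency into a period
p_star = np.argmax(R2[1:]) + 1
period = 1 / freqs[p_star]

print(p_star, period)
```

Here the peak sits at $p = 10$, that is $\omega = 10/120$, which corresponds to the period of 12 steps that was built into the signal.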

You can make a periodogram with `scipy`, <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.periodogram.html">https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.periodogram.html</a>.

I will demonstrate with the `bike_train` data set and you should try to make a periodogram for the training data of the `pumpkin` data set.

What seems to be the period of these data according to the periodogram?

In [None]:
from scipy.signal import periodogram

In [None]:
## call periodogram and pass in the time series
## the first array returned holds the frequencies
## the second holds the corresponding values of R_p^2
periodogram(bike_train.bike_interest)

In [None]:
plt.figure(figsize=(18,6))

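## compute the periodogram once, then unpack the frequencies and R_p^2 values
freqs, power = periodogram(bike_train.bike_interest)
plt.scatter(freqs, power)

## raw string so that "\o" is not treated as an escape sequence
plt.xlabel(r"$\omega$", fontsize=18)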
plt.xticks(fontsize=14)

plt.ylabel("$R_p^2$", fontsize=18)
plt.yticks(fontsize=14)

plt.title("Bike Google Trend Periodogram", fontsize=20)

plt.show()

In [None]:
## You should round to the nearest time step here
freqs, power = periodogram(bike_train.bike_interest)
1/freqs[np.argmax(power)]

In [None]:
## code here




In [None]:
## code here



##### References for 6.

To read more about this so-called <i>spectral analysis</i>, check out:

<a href="https://mybiostats.files.wordpress.com/2015/03/time-series-analysis-and-its-applications-with-examples-in-r.pdf">Time Series Analysis & its Applications</a>, by Robert H. Shumway and David S. Stoffer.

##### 7. SARIMA

We can fit a SARIMA model using the `SARIMAX` model object from `statsmodels`. Below I demonstrate how with the `bike` data set, and you will do so using the `pumpkin` data set.

We also demonstrate the use of `auto_arima` from `pmdarima`, which does a hyperparameter search by minimizing AIC.

In [None]:
from statsmodels.tsa.api import SARIMAX
from pmdarima import auto_arima

In [None]:
auto_arima(bike_train.bike_interest.values, trace=True, seasonal=True, m=12)

In [None]:
sarima = SARIMAX(bike_train.bike_interest.values,
                    order = (1,0,2),
                    seasonal_order = (1,0,1,12)).fit()

In [None]:
plt.figure(figsize=(16,6))

plt.plot(bike_train.Month,
            bike_train.bike_interest,
            'b',
            label="Training Set")

plt.plot(bike_train.Month[12:],
            sarima.fittedvalues[12:],
            c='green',
            label="Fit on Training")

plt.plot(bike_test.Month,
            bike_test.bike_interest,
            'r',
            label="Test Set")

plt.plot(bike_test.Month,
            sarima.forecast(len(bike_test)),
            'r.-',
            label="Forecast")

plt.legend(fontsize=14, loc=2)


plt.xlabel("Date", fontsize=16)
plt.ylabel("Bike Google Trend Interest", fontsize=16)

plt.show()

In [None]:
## code here



In [None]:
## code here




--------------------------

This notebook was written for the Erd&#337;s Institute C&#337;de Data Science Boot Camp by Matthew Osborne, Ph.D., 2023.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erd&#337;s Institute as subject to the license (see License.md).