# Simple forecasting techniques

## The importance of a baseline model

In this notebook we will explore some simple forecasting techniques.  Selecting one of these simple techniques should be one of your early decisions in a time series forecasting project. Although each represents a simple approach to forecasting, they come from a family of techniques used for setting a statistical baseline. Before you move on to complex methods, make sure you use a benchmark or baseline.  Any complex model must be better than the baseline to be considered for forecasting.  This is an often-missed step in forecasting, where there is a temptation to jump straight to complex methods.

The methods we will explore are:

* Naive Forecast 1
* Seasonal Naive

We will also briefly introduce **Prediction Intervals** and measuring **forecast error**.

# Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.style as style
style.use('ggplot')

import sys

import warnings
warnings.filterwarnings('ignore')

# forecast-tools

In [None]:
# install forecast-tools if running in Google Colab
if 'google.colab' in sys.modules:
    !pip install forecast-tools

# Helper functions

In [None]:
def preds_as_series(data, preds):
    '''
    Helper function for plotting predictions.
    Converts a numpy array of predictions to a 
    pandas.DataFrame with a DatetimeIndex that starts
    one period after the end of the training data.
    
    Parameters
    -----
    data - pandas.DataFrame, training data with a DatetimeIndex
           that has its frequency set, e.g. 'MS' or 'D'
    preds - numpy.array, vector of predictions
    '''
    start = pd.date_range(start=data.index.max(), periods=2, freq=data.index.freq).max()
    idx = pd.date_range(start=start, periods=len(preds), freq=data.index.freq)
    return pd.DataFrame(preds, index=idx)

# The ED arrivals dataset

The dataset we will use represents daily adult (age > 18) arrivals to an Emergency Department.  The simulated observations are based on attendances at a real emergency department between Jan 2017 and Dec 2017.

In [None]:
# the url link means you are downloading this data directly from the HSMA github repo.
url = 'https://raw.githubusercontent.com/hsma4/module_9_a/main/data/' \
        + 'ed_attends.csv'
ed_daily = pd.read_csv(url, parse_dates=True, index_col='date')
ed_daily.index.freq = 'D'

In [None]:
ed_daily.index.min()

In [None]:
ed_daily.index.max()

In [None]:
ed_daily.shape

### Visualise the time series

In [None]:
_ = ed_daily.plot(figsize=(12,4))

# Naive Forecast 1

A popular baseline forecast is Naive Forecast 1.  This simply takes the last value in the time series and extrapolates it forward over the forecast horizon, i.e.

**Naive Forecast = Last value in the time series**

In mathematical notation:

$\hat{y}_{T+h|T} = y_T$
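
To make the rule concrete, here is a minimal NumPy sketch of the same idea. The function name `naive1_forecast` and the toy numbers are made up for illustration; this is not the `forecast-tools` implementation.

```python
import numpy as np

def naive1_forecast(y, horizon):
    """Repeat the last observed value `horizon` times (illustrative helper)."""
    y = np.asarray(y, dtype=float)
    return np.full(horizon, y[-1])

# toy series (made-up numbers, not the ED data)
y_toy = [120, 135, 128, 140]
naive1_preds = naive1_forecast(y_toy, horizon=3)
print(naive1_preds)  # [140. 140. 140.]
```

Every step of the horizon gets the same value, which is why the forecast plots as a flat line.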

PenCHORD has implemented some simple classes for baseline forecasts in a package called `forecast-tools`.  

For a Naive 1 forecast the class to use is

```python
forecast_tools.baseline.Naive1
```

There are three steps to use it:

* Create an instance of the class
* Call the `fit` method and pass in the historical data
* Call the `predict` method and pass in a chosen forecast horizon e.g. 28 days. 

In [None]:
from forecast_tools.baseline import Naive1

In [None]:
nf1 = Naive1()
nf1.fit(ed_daily)
nf1_preds = nf1.predict(horizon=28)

In [None]:
nf1_preds

Let's visualise the forecast relative to the training data.

* Do you think this is a good baseline?

In [None]:
ax = ed_daily.plot(figsize=(12,4))
nf1.fittedvalues.plot(ax=ax, linestyle='-.')
preds_as_series(ed_daily, nf1_preds).plot(ax=ax)
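# three lines are plotted: training data, in-sample fitted values, forecast
_ = ax.legend(['train', 'Fitted Model', 'Naive1 Forecast'])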

## Seasonal Naive

Seasonal Naive extends Naive1 in an attempt to incorporate the seasonality in the data.  Instead of carrying the final value in the time series forward, it carries forward the value from the same point in the previous seasonal period.  As we are working with daily data and a weekly seasonal period (`period=7`), this means that a forecast for a Monday will use the previous Monday's observation, a forecast for a Tuesday will use the previous Tuesday's observation, and so on.
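
As a rough illustration of the mechanics, the sketch below implements a seasonal naive forecast by hand with NumPy, assuming daily data and a seasonal period of 7. The function name `snaive_forecast` and the toy data are invented for this example and are not part of `forecast-tools`.

```python
import numpy as np

def snaive_forecast(y, period, horizon):
    """Repeat the final `period` observations until `horizon` values exist."""
    y = np.asarray(y, dtype=float)
    last_season = y[-period:]                    # most recent full season
    reps = int(np.ceil(horizon / period))        # seasons needed to cover horizon
    return np.tile(last_season, reps)[:horizon]  # trim to the exact horizon

# two toy "weeks" of daily counts (made-up numbers)
y_toy = [10, 12, 11, 13, 15, 20, 18,
         11, 13, 12, 14, 16, 21, 19]
snaive_preds = snaive_forecast(y_toy, period=7, horizon=10)
print(snaive_preds)  # [11. 13. 12. 14. 16. 21. 19. 11. 13. 12.]
```

Only the most recent season is used; the first toy week has no influence on the forecast.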

In [None]:
from forecast_tools.baseline import SNaive

In [None]:
snf = SNaive(period=7)
snf.fit(ed_daily)
snf_preds = snf.predict(horizon=28)

In [None]:
snf_preds

In [None]:
ax = ed_daily.plot(figsize=(12,4))
snf.fittedvalues.plot(ax=ax, linestyle='-.')
preds_as_series(ed_daily, snf_preds).plot(ax=ax)
_ = ax.legend(['train','Fitted Model', 'SNaive Forecast'])

## Prediction Intervals

To return a prediction interval from a baseline forecast object use:

```python
y_preds, y_intervals = model.predict(horizon, return_predict_int=True)
```

By default this returns 80% and 90% PIs.  

To return only the 80% intervals use:

```python
y_preds, y_intervals = model.predict(horizon, 
                                     return_predict_int=True, 
                                     alpha=[0.2])
```

To return the 80%, 90% and 95% intervals use:


```python
y_preds, y_intervals = model.predict(horizon, 
                                     return_predict_int=True, 
                                     alpha=[0.2,0.1,0.05])
```
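
Each `alpha` corresponds to a `100 * (1 - alpha)`% interval, so `alpha=0.2` gives an 80% interval. The sketch below shows one common textbook way to build such an interval for a naive forecast: a normal approximation in which the forecast standard deviation grows with the square root of the horizon. This is an assumption made for illustration and is not necessarily how `forecast-tools` computes its intervals; `naive1_intervals` is a hypothetical helper.

```python
import numpy as np
from statistics import NormalDist

def naive1_intervals(y, horizon, alpha=0.2):
    """Point forecasts and a 100*(1-alpha)% interval for a naive forecast.

    Assumes roughly normal, uncorrelated one-step errors (textbook approximation).
    """
    y = np.asarray(y, dtype=float)
    resid = np.diff(y)                        # one-step in-sample errors
    sigma = np.sqrt(np.mean(resid ** 2))      # root mean squared residual
    z = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided normal quantile
    h = np.arange(1, horizon + 1)
    point = np.full(horizon, y[-1])
    half_width = z * sigma * np.sqrt(h)       # uncertainty grows with horizon
    return point, np.column_stack([point - half_width, point + half_width])

# toy series (made-up numbers, not the ED data)
point, pis = naive1_intervals([120, 135, 128, 140, 131], horizon=3, alpha=0.2)
print(pis)  # column 0 = lower bound, column 1 = upper bound
```

A smaller `alpha` yields a wider interval, and under this approximation the interval widens the further ahead you forecast.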

In [None]:
snf = SNaive(period=7)
snf.fit(ed_daily)
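# alpha=[0.2, 0.1] requests the 80% and 90% prediction intervals,
# matching the labels used by the plotting helper below
y_preds, y_intervals = snf.predict(horizon=6, return_predict_int=True, 
                                    alpha=[0.2, 0.1])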
y_intervals[1]

In [None]:
def plot_prediction_intervals(train, preds, intervals, test=None):
    '''
    Helper function to plot training data, point predictions
    and 2 sets of prediction intervals.
    
    Assumes exactly 2 sets of PIs are provided (80% first, then 90%).
    '''
    ax = train.plot(figsize=(12,4))

    mean = preds_as_series(train, preds)
    intervals_80 = preds_as_series(train, intervals[0])
    intervals_90 = preds_as_series(train, intervals[1])

    mean.plot(ax=ax, label='point forecast')

    ax.fill_between(intervals_80.index, mean[0], intervals_80[1], 
                    alpha=0.2,
                    label='80% PI', color='yellow');

    ax.fill_between(intervals_80.index,mean[0], intervals_80[0], 
                    alpha=0.2,
                    label='80% PI', color='yellow');

    ax.fill_between(intervals_80.index,intervals_80[1], intervals_90[1], 
                    alpha=0.2,
                    label='90% PI', color='purple');

    ax.fill_between(intervals_80.index,intervals_80[0], intervals_90[0], 
                    alpha=0.2,
                    label='90% PI', color='purple');
    
    if test is None:
        ax.legend(['train', 'point forecast', '80%PI', '_ignore','_ignore', 
                   '90%PI'], loc=2)
    else:
        test.plot(ax=ax, color='black')
        ax.legend(['train', 'point forecast', 'Test', '80%PI', '_ignore',
                   '_ignore', '90%PI'], loc=2)
    
    

In [None]:
plot_prediction_intervals(ed_daily[-60:], y_preds, y_intervals)

# Measuring Point Forecast Error

## A basic train test split

Let's hold back 28 days of data and calculate the forecast error of Seasonal Naive.

`forecast-tools` helps you calculate forecast error with the `forecast_tools.metrics.forecast_errors` function.  The function calculates a range of metrics.
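
To show what metrics of this kind measure, here is a small hand-rolled sketch of three widely used point forecast error measures (MAE, RMSE and MAPE). The function name `point_forecast_errors` and its dictionary keys are illustrative assumptions; they are not claimed to match what `forecast_errors` returns.

```python
import numpy as np

def point_forecast_errors(y_true, y_pred):
    """Return MAE, RMSE and MAPE for a set of point forecasts (illustrative)."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    errors = y_true - y_pred
    return {
        'mae': np.mean(np.abs(errors)),                  # mean absolute error
        'rmse': np.sqrt(np.mean(errors ** 2)),           # root mean squared error
        'mape': np.mean(np.abs(errors / y_true)) * 100,  # mean absolute % error
    }

# toy actuals and forecasts (made-up numbers)
toy_errors = point_forecast_errors([100, 110, 90, 120], [105, 100, 95, 110])
print(toy_errors['mae'])  # 7.5
```

MAE and RMSE are in the units of the data (arrivals per day here), while MAPE is a percentage and is undefined when an actual value is zero.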

In [None]:
train_length = ed_daily.shape[0] - 28
train, test = ed_daily.iloc[:train_length], ed_daily.iloc[train_length:]

In [None]:
train.shape

In [None]:
test.shape

### IMPORTANT - DO NOT LOOK AT THE TEST SET!

We need to **hold back** a proportion of our data.  This is so we can simulate real forecasting conditions and check a model's accuracy on **unseen** data.  We don't want to know what the test set looks like, as that would introduce bias into the forecasting process and lead us to overfit our model to the data we hold.

**Remember - there is no such thing as real time data from the future!**

In [None]:
snf = SNaive(period=7)
preds = snf.fit_predict(train, horizon=28)
preds

In [None]:
from forecast_tools.metrics import forecast_errors

forecast_errors(test, preds)

In [None]:
nf1 = Naive1()
preds = nf1.fit_predict(train, horizon=28)
preds

In [None]:
forecast_errors(test, preds)