# Week 4 Extension

We are going to go through the following reading list.

- [Nested CV for model selection](https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9)
- [Kaggle Kernel on Kalman Filter](https://www.kaggle.com/jigneshjokhakar1/fast-processing-kalman-filter-vs-arima-model)
- [Comments on refitting and rolling estimation with incremental data](https://towardsdatascience.com/3-facts-about-time-series-forecasting-that-surprise-experienced-machine-learning-practitioners-69c18ee89387)
- [A KF EWMA bagging model on weather forecast improvement](https://www.researchgate.net/publication/233997830_Combination_of_Kalman_filter_and_an_empirical_method_for_the_correction_of_near-surface_temperature_forecasts_Application_over_Greece)
- [KF Application for Lyft ride forecasting](https://eng.lyft.com/how-to-deal-with-the-seasonality-of-a-market-584cc94d6b75)
- [Modeling seasonality with Fourier transforms](https://towardsdatascience.com/analyzing-seasonality-with-fourier-transforms-using-python-scipy-bb46945a23d3)
- [COVID forecast model](https://hdsr.mitpress.mit.edu/pub/ozgjx0yn/release/2?readingCollection=0181d53b)

# Time Series Modeling Practices

## Cross Validation Methods

### Rolling Forecasting Origin

Cross-validation [only works when you can write the model as an autoregression](https://robjhyndman.com/hyndsight/tscv/), but that should cover all the models we are concerned with. The CV process is also known as __evaluation on a rolling forecasting origin__. Forecast accuracy is computed by averaging over the test sets. Specifically, if we are interested in models that produce good $k$-step-ahead forecasts, we can apply the routine sketched in the following pseudo-code:

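```python
import numpy as np
import pandas as pd

def rolling_origin_cv(data, model_list, s_start, s_end, k):
    """k-step-ahead RMSE per model, averaged over origins; s_start must exceed n_lags."""
    origins = range(s_start, s_end + 1 - k)
    sq_err = pd.DataFrame(np.nan, index=origins, columns=range(len(model_list)))
    for j, model in enumerate(model_list):  # models expose fit() and forecast(horizon=...)
        for s in origins:
            train, test = data[:s], data[s + k - 1]  # first s points; k-th point after them
            model.fit(train)
            sq_err.loc[s, j] = (model.forecast(horizon=k)[-1] - test) ** 2
    return np.sqrt(sq_err.mean(axis=0))  # one RMSE per model
```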

Graphically, for $k$ = 4,

<img src="https://robjhyndman.com/files/cv4-1.png" width="70%">


In general, rolling CV does not have to be LOOCV:

<img src="https://i.stack.imgur.com/fXZ6k.png">

#### Blocking CV

Note, however, that rolling CV has a potential [information leakage](https://medium.com/@soumyachess1496/cross-validation-in-time-series-566ae4981ce4) problem. If the same $\theta$ is used to fit both $X_{2k|k}$ and $X_{3k|2k}$, the information in $\{X_{k+1},\ldots,X_{2k}\}$, which belongs to the latter information set, feeds back into the former forecast. An improvement is to use blocking CV, as follows.

<img src="https://miro.medium.com/max/544/1*QJaeOqGfe_vKbpmT882APA.png">
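The two schemes differ only in how the index ranges are cut. The following is a minimal sketch on integer positions; the helper names `rolling_splits` and `blocked_splits` and their parameters are illustrative assumptions, not functions from the linked posts.

```python
def rolling_splits(n_obs, initial_train, horizon, step=1):
    """Expanding-window splits: train on [0, s), test on [s, s + horizon)."""
    splits = []
    for s in range(initial_train, n_obs - horizon + 1, step):
        splits.append((range(0, s), range(s, s + horizon)))
    return splits


def blocked_splits(n_obs, n_blocks, train_frac=0.8):
    """Blocked splits: each block has its own train part followed by its own test part."""
    block_size = n_obs // n_blocks
    splits = []
    for b in range(n_blocks):
        start = b * block_size
        cut = start + int(block_size * train_frac)
        splits.append((range(start, cut), range(cut, start + block_size)))
    return splits


rolling = rolling_splits(n_obs=12, initial_train=6, horizon=2, step=2)
blocked = blocked_splits(n_obs=12, n_blocks=3, train_frac=0.75)
for train_idx, test_idx in blocked:
    print(list(train_idx), list(test_idx))
```

In the rolling scheme every training set contains the previous ones, whereas the blocked training sets are disjoint, so no observation is shared between folds.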

### Nested CV - Day Forward-Chaining for model classes

We can also keep the conventional train-validation-test split, as per [this post](https://miro.medium.com/max/700/1*obKmc_bTKbUFgcgryhaAnA.png). Note that it is essential to withhold all data about events that occur chronologically after the events used for fitting (as in the figure below), in order to simulate the “real world forecasting environment, in which we stand in the present and forecast the future” (Tashman 2000).

<img src="https://miro.medium.com/max/700/1*obKmc_bTKbUFgcgryhaAnA.png" width="50%">

For non-time-series models, we rotate the validation fold through the training folds, which yields an unbiased estimate of the model error (averaged across rotations). With time series data we cannot do that, because training on data that comes chronologically after the validation data causes data leakage. Instead, the following __day forward-chaining__ routine is recommended:

<img src="https://miro.medium.com/max/700/1*2-zaRQ-dsv8KWxOlzc8VaA.png" width="70%">

In pseudo-code,

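```python
import numpy as np

def rmse(forecast, actual):
    return float(np.sqrt(np.mean((np.asarray(forecast) - np.asarray(actual)) ** 2)))

def model_selection(train, validation, model_list):
    """Fit every candidate on train and return the one with the lowest validation RMSE."""
    cv_error = []
    for model in model_list:  # models expose fit() and forecast(horizon=...)
        model.fit(train)
        cv_error.append(rmse(model.forecast(horizon=len(validation)), validation))
    return model_list[int(np.argmin(cv_error))]

def day_forward_chaining(data, model_list, K):
    """Nested CV over K folds: folds 1..k train, fold k+1 validates, fold k+2 tests."""
    fold_size = len(data) // K
    test_error = []
    for k in range(1, K - 1):  # stop early enough that the test fold stays inside the data
        train, validation = data[:k * fold_size], data[k * fold_size:(k + 1) * fold_size]
        test = data[(k + 1) * fold_size:(k + 2) * fold_size]
        selected_model = model_selection(train, validation, model_list)
        selected_model.fit(data[:(k + 1) * fold_size])  # refit on train + validation
        test_error.append(rmse(selected_model.forecast(horizon=len(test)), test))
    return float(np.mean(test_error))
```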

Note that the day forward-chaining routine outputs the error for the whole set of candidate models (i.e. `model_list`) rather than for a single model, because in each fold the best model within the set is chosen. The final, unbiased error averaged over the test sets therefore describes the behavior of models within the same _class_. An advantage of tuning parameters independently in each fold is that information leakage is avoided: the $\theta$ used to forecast $X_{3k|2k}$ is not the one used to forecast $X_{2k|k}$, so the earlier forecast cannot "remember" later data.

## Refitting models before making each forecast

Excerpt from [this post](https://towardsdatascience.com/3-facts-about-time-series-forecasting-that-surprise-experienced-machine-learning-practitioners-69c18ee89387). 

For most ML models, you train a model, test it, retrain it if necessary until you’ve gotten satisfactory results, and then evaluate it on a hold out data set. After you’re satisfied with the model’s performance, you then deploy it to production. Once in production, you score new data as it comes in. Eventually after a few months, you might want to update your model if a significant amount of new training data comes in. Model training is a one time activity, or done at most at periodic intervals to maintain the model’s performance to take into account new information.

For time series models, this is not the case. Instead we have to retrain our model every time we want to generate a new forecast. Below is an example of the consequence of not updating the model parameters frequently enough:

<img src="https://miro.medium.com/max/2400/1*Oxnt2SVtRk6tSwqQiZWTTA.jpeg" width="50%">

To get an intuitive understanding of why, first consider a classic ML task: Classifying cat images. The visual properties of cats are stable over time (unless we start looking at evolutionary time scales), so when we train a neural network to recognize pictures of cats, an implicit assumption is that the features that define cats are going to remain the same for the foreseeable future. We don't expect cats to look different next week, or next year, or even ten years from now. Given enough data, the model we trained this week is good enough for the foreseeable future as well.

In statistical parlance, we say that the distribution of cat picture features is a stationary distribution, meaning that its properties such as its mean and standard deviation remain the same over time. Now recall that a common pitfall in ML projects happens when the distribution of the development data set and the distribution of the production data set are not the same, causing the model to fail in production. Well for time series, it is almost always the case that the development data set and the production data set are not from the same distribution, because real world business time series (such as the Australian beer sales) data are not stationary, and the statistical properties of your distribution will keep shifting as new actuals come in.

The only way around this is to retrain your model every time you get new data. Note that this is not the same as continuous learning, where an already trained model is updated as new data comes in. You are actually retraining a new model from scratch every time you want to generate a new forecast (although it would be an interesting research topic to see if continuous learning can be applied to time series forecasting).
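A toy comparison may make the point concrete. The sketch below is an illustration under assumed data (a deterministic series whose slope changes at $t = 60$) and an assumed straight-line model fitted with `np.polyfit`; it is not the example from the post. It contrasts a model fitted once with a model refitted from scratch before every one-step forecast.

```python
import numpy as np

# Deterministic toy series whose slope changes from 1 to 3 at t = 60.
t = np.arange(100)
y = np.where(t < 60, t, 60 + 3 * (t - 60)).astype(float)

first_forecast = 50  # observations before this index form the initial training set


def fit_trend(times, values):
    """Least-squares straight line; returns (slope, intercept)."""
    slope, intercept = np.polyfit(times, values, deg=1)
    return slope, intercept


# Fit once and keep the parameters fixed for every later forecast.
static_slope, static_intercept = fit_trend(t[:first_forecast], y[:first_forecast])

static_err, refit_err = [], []
for s in range(first_forecast, len(y)):
    static_err.append(abs(static_slope * s + static_intercept - y[s]))
    # Refit from scratch on everything observed so far, then forecast one step.
    slope, intercept = fit_trend(t[:s], y[:s])
    refit_err.append(abs(slope * s + intercept - y[s]))

static_mae, refit_mae = np.mean(static_err), np.mean(refit_err)
print(f"static MAE = {static_mae:.2f}, refit MAE = {refit_mae:.2f}")
```

The refitted model still lags behind the new regime because it uses all past data, but its error is smaller than that of the model whose parameters were frozen before the change.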



# Example: Forecasting COVID with Growth Curves

This example presents a state space model motivated by a dynamic growth model. It is based on, and contains excerpts from, [the Harvard Data Science Review article](https://hdsr.mitpress.mit.edu/pub/ozgjx0yn/release/2?readingCollection=0181d53b).

Keywords: [generalized logistic](https://en.wikipedia.org/wiki/Generalised_logistic_function), [Gompertz curve](https://en.wikipedia.org/wiki/Gompertz_function), Kalman filter, negative binomial distribution, score-driven models, stochastic trend.

## Context

The authors built a model to forecast COVID, taking into account dynamics in the growth process (up to a class of functional forms) of a pandemic.

## Motivation - Pandemic Dynamics

The case presented in HDSR differs from the model-free time series methods covered previously. Here the authors present a structural equation approach for dynamic pandemic growth to forecast series in Germany and the UK. Unlike most standard time series models, the new models are able to make good forecasts even before new cases and/or deaths reach their peak. Furthermore they are able to track the epidemic as human behavior changes, and can be used to evaluate the effects of changes in policy.

The authors invoke the class of generalized logistic (GL) curves, which are well suited to processes such as the demand for new products or the growth of mammals subject to resource constraints. Two familiar members of the class are the logistic curve and the Gompertz curve, and the authors work mainly with the latter:

\begin{align*}
\mu(t) &= \frac{\bar{\mu}}{1+\gamma_0 e^{-\gamma t}} & & \text{logistic} \\
\mu(t) &= \bar{\mu} \exp\big( -\gamma_0 e^{-\gamma t} \big) & & \text{Gompertz}
\end{align*}

Here $\bar{\mu}$ is the saturation level, while $\gamma_0$ and $\gamma$ account for the initial conditions and the growth rate, respectively.

The traditional approach would be to fit the deterministic trend $\mu(t)$ to the observed cumulative series $Y_t$ (possibly with a first-order autoregressive disturbance term). One drawback of fitting a deterministic trend is its lack of flexibility, as is the case for most economic and social time series. Instead, this model works with the __change or the growth rate__, with the specification of the model informed by the assumption that the <u>total follows a growth curve</u>. The saturation level can be continually updated as new observations become available. When logarithms are taken, the basic models derived from the GL class can be estimated by least squares regression.

The authors also note that when numbers are small, as is the case with deaths at the beginning or end of an epidemic, there is a strong argument for adopting a negative binomial distribution. Models formulated under an assumption of Gaussianity need to be modified accordingly, and the article describes a score-driven approach for implementing this.

## The model

### GL Growth Model

Suppose that the level $\mu(t)$ satisfies the growth process and differential equation defined by the generalized logistic curve:

\begin{align*}
\mu(t) &= \frac{\bar{\mu}}{\Big( 1+\frac{\gamma_0}{\kappa} e^{-\gamma t} \Big)^\kappa}\\
\frac{d\mu(t)}{dt} &= \gamma \kappa \bigg[ 1-\Big( \frac{\mu(t)}{\bar{\mu}} \Big)^{\frac{1}{\kappa}} \bigg] \mu(t)
\end{align*}

We want to find the growth rate $g(t)$ such that $\frac{d\mu(t)}{dt} = g(t) \mu(t)$; it can be read off the differential equation above as the factor multiplying $\mu(t)$. Note that $g$ also has an alternative form: taking logarithms gives $\log g(t) = \log \frac{d\mu(t)}{dt} - \log \mu(t)$, so that

\begin{align*}
\frac{d\mu(t)}{dt} &= \frac{\bar{\mu}\gamma\gamma_0 e^{-\gamma t}}{\Big( 1+\frac{\gamma_0}{\kappa} e^{-\gamma t} \Big)^{\kappa + 1}} = \big( \bar{\mu}^{-\frac{1}{\kappa}} \gamma \gamma_0 e^{-\gamma t} \big) \mu(t)^{\frac{\kappa+1}{\kappa}} \\
\log \frac{d\mu(t)}{dt} &= \frac{\kappa+1}{\kappa} \log \mu(t) + \log\big( \bar{\mu}^{-\frac{1}{\kappa}} \gamma \gamma_0 \big) - \gamma t\\
\log g(t) &= \frac{1}{\kappa} \log \mu(t) + \log\big( \bar{\mu}^{-\frac{1}{\kappa}} \gamma \gamma_0 \big) - \gamma t
\end{align*}



One can verify that the point of inflection occurs where $g(t) = \frac{\gamma\kappa}{\kappa+1}$, that is, at $\mu^*=\bar{\mu}\big(\frac{\kappa}{\kappa+1} \big)^\kappa$ and $t^* = \frac{1}{\gamma}\log \gamma_0$. The Gompertz curve is the limiting case of GL as $\kappa \rightarrow \infty$: then $\frac{\kappa}{\kappa+1} \rightarrow 1$, so $g(t^*) \rightarrow \gamma$ and $\mu^* \rightarrow \bar{\mu}/e$.
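The inflection point can be checked numerically. The sketch below uses arbitrary illustrative parameter values (assumptions, not estimates from the article) and compares the closed forms $t^* = \frac{1}{\gamma}\log\gamma_0$, $\mu^* = \bar{\mu}\big(\frac{\kappa}{\kappa+1}\big)^\kappa$ and $g(t^*) = \frac{\gamma\kappa}{\kappa+1}$ with the location of the steepest increase of the curve on a fine grid.

```python
import numpy as np

# Illustrative parameter values (assumptions, not estimates from the article).
mu_bar, gamma0, gamma, kappa = 1000.0, 50.0, 0.1, 2.0


def gl_level(t):
    """Generalized logistic level mu(t)."""
    return mu_bar / (1.0 + (gamma0 / kappa) * np.exp(-gamma * t)) ** kappa


def gl_growth_rate(t):
    """g(t) = gamma * kappa * [1 - (mu(t) / mu_bar)^(1/kappa)]."""
    return gamma * kappa * (1.0 - (gl_level(t) / mu_bar) ** (1.0 / kappa))


t_grid = np.linspace(0.0, 150.0, 150001)
increments = np.diff(gl_level(t_grid))  # proportional to d mu / dt on the grid
# Midpoint of the grid interval with the largest increment.
t_star_numeric = t_grid[np.argmax(increments)] + 0.5 * (t_grid[1] - t_grid[0])

t_star = np.log(gamma0) / gamma
mu_star = mu_bar * (kappa / (kappa + 1.0)) ** kappa
g_star = gamma * kappa / (kappa + 1.0)

print(t_star_numeric, t_star)
print(gl_level(t_star), mu_star)
print(gl_growth_rate(t_star), g_star)
```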

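In the context of COVID, we replace $\mu(t)$ with $Y_t$, the cumulative number of cases at $t$. Define $y_t = \Delta Y_t = Y_t-Y_{t-1}$, so that in discrete time $g(t) = \frac{y_t}{Y_{t-1}}$. Writing $\delta = \log\big( \bar{\mu}^{-\frac{1}{\kappa}} \gamma \gamma_0 \big)$ and $\rho = \frac{\kappa+1}{\kappa}$, we obtain

\begin{align*}
\log y_t &= \rho \log Y_{t-1} + \delta - \gamma t\\
\log g(t) &= (\rho-1) \log Y_{t-1} + \delta - \gamma t
\end{align*}

For the Gompertz curve ($\kappa \rightarrow \infty$) we have $\rho = 1$ and $\delta = \log(\gamma\gamma_0)$, so the growth rate satisfies $\log g(t) = \delta - \gamma t$.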
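For the Gompertz case ($\rho = 1$) the relation reduces to a regression of $\log g(t)$ on a constant and a time trend. The sketch below simulates a noise-free toy epidemic from assumed parameter values and recovers $\delta$ and $\gamma$ with `np.linalg.lstsq`; the numbers are illustrative and not taken from the article.

```python
import numpy as np

# Assumed "true" parameters, used only to simulate a toy epidemic.
delta_true, gamma_true, n_days = -1.0, 0.05, 60

Y = [100.0]  # cumulative cases, starting from Y_0
for day in range(1, n_days + 1):
    growth = np.exp(delta_true - gamma_true * day)  # g_t = y_t / Y_{t-1}
    Y.append(Y[-1] * (1.0 + growth))
Y = np.array(Y)

y = np.diff(Y)               # new cases y_t for t = 1..n_days
log_g = np.log(y / Y[:-1])   # ln g_t = ln y_t - ln Y_{t-1}
t = np.arange(1, n_days + 1)

# Regress ln g_t on a constant and -t, so the second coefficient is gamma.
X = np.column_stack([np.ones(n_days), -t])
(delta_hat, gamma_hat), *_ = np.linalg.lstsq(X, log_g, rcond=None)
print(delta_hat, gamma_hat)
```

With real data the regression would carry a disturbance term, and for the general GL curve $\log Y_{t-1}$ would enter as an additional regressor with coefficient $\rho - 1$.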

### Dynamic Gompertz Model

We incorporate stochastic level and slope components into the Gompertz model:

\begin{align*}
\ln g(t) &= \delta_t + \varepsilon_t & & \varepsilon_t \sim N(0,\sigma_\varepsilon^2)\\
\delta_t &= \delta_{t-1} - \gamma_{t-1} + \eta_t & & \eta_t \sim N(0,\sigma_\eta^2)\\
\gamma_t &= \gamma_{t-1} + \zeta_t & & \zeta_t \sim N(0,\sigma_\zeta^2) \\
\end{align*}

This can be expressed as a state space model with state vector $[\delta_t, \gamma_t]^T$. Given the predicted state $\delta_{t|t-1}$, the one-step-ahead forecast of new cases is:

\begin{align*}
\ln y_{t|t-1} &= \ln Y_{t-1} +\delta_{t|t-1}
\end{align*}

The variance parameters can be estimated by maximum likelihood, with the likelihood evaluated by the Kalman filter.

<img src="https://resize.pubpub.org/fit-in/800x0/lhmr7l8s/11597935983521.jpg" width="75%">
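The filtering recursion for this state space form can be sketched in a few lines of NumPy. Everything numeric below is an assumption made for illustration (simulated data, known variances, $\sigma_\eta^2 = 0$, a large initial covariance); the article estimates the variances by maximum likelihood, which would amount to maximising the `loglik` value computed here over the variance parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate ln g_t with a constant slope (assumed values, for illustration only).
n_days, delta0, gamma_true, sigma_eps = 100, -1.0, 0.05, 0.05
t = np.arange(1, n_days + 1)
log_g = delta0 - gamma_true * t + rng.normal(0.0, sigma_eps, n_days)

F = np.array([[1.0, -1.0],   # delta_t = delta_{t-1} - gamma_{t-1} + eta_t
              [0.0, 1.0]])   # gamma_t = gamma_{t-1} + zeta_t
H = np.array([[1.0, 0.0]])   # ln g_t = delta_t + eps_t
Q = np.diag([0.0, 1e-8])     # Var(eta_t) = 0 and a small Var(zeta_t), both assumed
R = np.array([[sigma_eps ** 2]])

x = np.zeros(2)              # rough initial state
P = np.eye(2) * 1e3          # large variance: little trust in the initial state
loglik = 0.0
filtered = []
for obs in log_g:
    # Prediction step.
    x = F @ x
    P = F @ P @ F.T + Q
    # Update step.
    innovation = obs - (H @ x)[0]
    S = (H @ P @ H.T + R)[0, 0]
    K = (P @ H.T)[:, 0] / S
    x = x + K * innovation
    P = P - np.outer(K, H @ P)
    # Gaussian log-likelihood contribution of this innovation.
    loglik += -0.5 * (np.log(2 * np.pi * S) + innovation ** 2 / S)
    filtered.append(x.copy())

filtered = np.array(filtered)
gamma_filtered = filtered[-1, 1]
print(gamma_filtered, loglik)
```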

### Negative Binomial Adjustment for small numbers

When $y_t$ is small, it may be better to specify its distribution, conditional on past values, as discrete. The usual choice is the negative binomial, which, when parameterized in terms of a time-varying mean, $\xi_{t|t-1}$, and a fixed positive shape parameter, $\nu$, has probability mass function (PMF):

\begin{align*}
p(y_t|\Psi_{t-1}) = \frac{\Gamma(\nu+y_t)}{y_t!\Gamma(\nu)} \xi^{y_t}_{t|t-1} (\nu+\xi_{t|t-1})^{-y_t} (1+\frac{\xi_{t|t-1}}{\nu})^{-\nu}
\end{align*}

We model $\xi_{t|t-1}$ as:

\begin{align*}
\ln \xi_{t|t-1} = \ln Y_{t-1} + \delta_{t|t-1}
\end{align*}

The parameters are then estimated by maximum likelihood. Other components, such as day-of-the-week and seasonal effects, are also added to the model.
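The PMF is easiest to evaluate on the log scale. The sketch below is a plain-Python illustration with hypothetical function names and made-up inputs; it checks that the parameterization has mean $\xi_{t|t-1}$ and that the probabilities sum to one.

```python
import math


def neg_binomial_logpmf(y, mean, shape):
    """Log PMF of the negative binomial with mean `mean` and shape `shape` (nu)."""
    return (
        math.lgamma(shape + y) - math.lgamma(y + 1) - math.lgamma(shape)
        + y * (math.log(mean) - math.log(shape + mean))
        - shape * math.log1p(mean / shape)
    )


def conditional_mean(prev_cumulative, delta_pred):
    """xi_{t|t-1} = Y_{t-1} * exp(delta_{t|t-1})."""
    return prev_cumulative * math.exp(delta_pred)


# Made-up inputs: 150 cumulative cases and a predicted growth rate of 2%.
xi = conditional_mean(prev_cumulative=150.0, delta_pred=math.log(0.02))
nu = 2.0
pmf = [math.exp(neg_binomial_logpmf(y, xi, nu)) for y in range(2000)]
total_prob = sum(pmf)
pmf_mean = sum(y * p for y, p in enumerate(pmf))
print(total_prob, pmf_mean)
```

Summing `neg_binomial_logpmf` over the observed $y_t$ gives the log-likelihood that the estimation step would maximise.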

<img src="https://resize.pubpub.org/fit-in/800x0/7yrt32dp/01597936316664.jpg" width="50%">

# Example: Using Kalman Filter to Improve Weather Predictions

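This example demonstrates a bagging model for weather forecasting that combines Kalman filters with an exponential smoothing model. It is based on, and contains excerpts from, [Anadranistakis et al. (2002)](https://www.researchgate.net/publication/233997830_Combination_of_Kalman_filter_and_an_empirical_method_for_the_correction_of_near-surface_temperature_forecasts_Application_over_Greece).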

## Context

Given numerical temperature forecasts for Greece, the authors want to build a model that improves on them.

## Kalman Filters

We denote the numerical weather forecast and the observed value of a weather variable at time $t$ by $T_{ft}$ and $T_{ot}$, respectively. We assume that the actual weather is a linear function of the forecast, so $T_{ot} = a + d T_{ft} + v_t$. Writing $b = d-1$, the forecast error becomes $T_{ot} - T_{ft} = a + b T_{ft} + v_t$. For a realistic correction procedure, the parameters $a$ and $b$ are allowed to evolve over time: $a_t = a_{t-1} + w_{1t}$ and $b_t = b_{t-1} + w_{2t}$.

To represent this as a state space model, let the state vector be $x_t = [a_t, b_t]^T$:

\begin{align*}
T_{ot} - T_{ft} &= [1, T_{ft}]x_t + v_t & & \text{measurement} \\
x_t &= x_{t-1} + w_t & & \text{system} \\
v_t &\sim N(0, \Sigma^v_t) \\
w_t &\sim N(0, \Sigma^w_t)
\end{align*}

The authors propose a second Kalman filter model, in which $T_{ot}-T_{ft}$ is assumed to be independent of the forecast temperature $T_{ft}$ but slowly varying with time. This is expressed by letting $b_t = 0$, i.e. the state vector reduces to just $x_t = a_t$. The covariance matrices $\Sigma^v_t$, $\Sigma^w_t$ are allowed to vary with time so that the filter can adjust rapidly when there are external changes. To account for the changes of $\Sigma^v_t$, $\Sigma^w_t$ over time, the estimation is based on a rolling window of 25 days. This period is long enough to give a good estimate but short enough that the estimates adapt dynamically to any changes in $\Sigma^v_t$, $\Sigma^w_t$.

Instead of estimating the initial state of the Kalman filters, the authors initialize the state estimate at $x_{0|0} = 0$. Recall that with such a choice we need to assign a sufficiently large initial covariance, so that the filter does not place undue trust in the arbitrary starting value. The authors assign $P_{0|0} = 4$ to represent the large uncertainty of the initial estimate and the increased weight given to the observations.
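The first filter can be sketched as follows. The data are synthetic and the noise covariances are fixed, both of which are simplifying assumptions; the paper re-estimates $\Sigma^v_t$ and $\Sigma^w_t$ on a rolling window. Only the initialization $x_{0|0} = 0$ and $P_{0|0} = 4$ follows the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic forecasts and observations (assumed values, not the paper's data).
n_days = 200
T_f = 10.0 + 5.0 * np.sin(2 * np.pi * np.arange(n_days) / 50.0)
a_true, b_true, obs_sd = 2.0, -0.1, 0.5
T_o = T_f + a_true + b_true * T_f + rng.normal(0.0, obs_sd, n_days)

x = np.zeros(2)          # x_{0|0} = 0
P = 4.0 * np.eye(2)      # P_{0|0} = 4 on the diagonal
Q = 1e-4 * np.eye(2)     # system noise covariance (assumed constant)
R = obs_sd ** 2          # measurement noise variance (assumed known)

corrected = np.empty(n_days)
for t in range(n_days):
    H = np.array([1.0, T_f[t]])       # measurement row [1, T_ft]
    P = P + Q                         # random-walk prediction keeps x unchanged
    corrected[t] = T_f[t] + H @ x     # bias-corrected forecast, made before seeing T_o[t]
    innovation = (T_o[t] - T_f[t]) - H @ x
    S = H @ P @ H + R
    K = P @ H / S
    x = x + K * innovation
    P = P - np.outer(K, H @ P)

raw_rmse = np.sqrt(np.mean((T_f - T_o) ** 2))
corrected_rmse = np.sqrt(np.mean((corrected - T_o) ** 2))
print(raw_rmse, corrected_rmse)
```

Setting `H = np.array([1.0, 0.0])` in every step would give the second, one-dimensional filter in which only $a_t$ is tracked.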

## Exponentially Weighted Moving Average

The authors also propose an EWMA model, in which the next period's forecast is a weighted average of the current observation and the current forecast:

\begin{align*}
T_{ft} &= cT_{o,t-1} + (1-c) T_{f,t-1} \\
       &= cT_{o,t-1} + c(1-c) T_{o,t-2} + \cdots + c(1-c)^{t-2}T_{o1} = \sum_{i=1}^{t-1} c(1-c)^{i-1} T_{o,t-i}
\end{align*}

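The weight $c$ is chosen to be 0.6. The expansion drops the initial-forecast term $(1-c)^{t-1}T_{f1}$, which decays geometrically with $t$.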
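The equivalence of the recursive and the expanded form is easy to confirm numerically. The observations below are made up, and the initial forecast is set to zero so that the initial-forecast term vanishes.

```python
import numpy as np

c = 0.6
T_o = np.array([12.0, 14.0, 13.0, 15.0, 16.0])  # observations T_o1..T_o5 (made up)

# Recursive form, starting from an initial forecast T_f1 = 0.
T_f = np.zeros(len(T_o) + 1)
for t in range(1, len(T_f)):
    T_f[t] = c * T_o[t - 1] + (1 - c) * T_f[t - 1]

# Expanded form for the last forecast: sum_i c (1 - c)^(i - 1) T_{o, t - i}.
weights = c * (1 - c) ** np.arange(len(T_o))
expanded = np.sum(weights * T_o[::-1])
print(T_f[-1], expanded)
```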

## Bagging

Let $S^j$ be the forecast error over the n previous periods for estimator $j$. We form the bagged final weather prediction as:

\begin{align*}
W^j &= \frac{\frac{1}{S^j}}{\sum_k \frac{1}{S^k}}\\
\hat{T}_{ft} &= \sum_{j=1}^3 W^j T_{ft}^j
\end{align*}

Note that the weights are inversely proportional to the forecast errors, so the more accurate models are assigned higher weights. The bagged final model is reported to improve the forecast accuracy from an error of around 6 degrees Celsius to 1.4.
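A small numerical illustration of the weighting formula, with made-up errors and forecasts for three estimators (the function name is hypothetical):

```python
import numpy as np


def inverse_error_weights(errors):
    """W^j = (1 / S^j) / sum_k (1 / S^k)."""
    inv = 1.0 / np.asarray(errors, dtype=float)
    return inv / inv.sum()


recent_errors = [1.0, 2.0, 4.0]            # S^j for three estimators (made up)
forecasts = np.array([20.0, 21.0, 24.0])   # T_ft^j from the same estimators (made up)

weights = inverse_error_weights(recent_errors)
combined = float(weights @ forecasts)
print(weights, combined)
```

The estimator with the smallest recent error receives weight $4/7$, and the combined forecast is pulled toward its value.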

# Example: Time Series forecasting of Lyft Rides

This example showcases various techniques for building models on real-life business data. It is based on, and contains excerpts from, [this Lyft engineering team blog](https://eng.lyft.com/how-to-deal-with-the-seasonality-of-a-market-584cc94d6b75).

## Context

The Lyft engineering team devised a time series model to determine incentive policies aiming to clear the market.

## Using Kalman Filter to estimate weekly seasonality

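Recall that the Holt-Winters exponential smoothing forecast function (with multiplicative seasonality) can be expressed as:

\begin{align*}
\hat{Y}_{t+h|t} &= (l_t + hb_t) s_{t+h-m(k+1)}
\end{align*}

where $l_t$ is the level, $b_t$ the trend, $s_t$ the seasonal component, $m$ the seasonal period, and $k$ the integer part of $(h-1)/m$.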

Here we assume the number of rides $Y_t$ arises purely from trend and seasonality. We also take logarithms so that the seasonality becomes additively separable:

\begin{align*}
\log Y_{t+1|t} = \log b_{w(t+1)} + \log s_{t-6}
\end{align*}

Seasonal effects are weekly periodic, hence the prediction at $t+1$ uses $s_{t-6}$. Days in the same week share the same trend, hence the prediction uses $b_{w(t+1)}$, where $w(t+1)$ denotes the week containing day $t+1$. A primitive approach would be to take a moving average for $b_t$ (e.g. over the previous 7 days) and a day-of-week average for the seasonality, which of course would not be optimal because the seasonality itself changes over time.

The authors opted to apply a Kalman filter to the state space representation in order to estimate the weekly seasonality. Here's a synopsis of their KF approach in relation to the textbook GPS example:

<img src="https://miro.medium.com/max/1000/1*9gLLooXf6pR3P2hlgwAvoQ.png" width="65%">

The initial states (for example, the observations during the first week) were guessed instead of estimated. The following detrended and deseasonalized rides are obtained:

<img src="https://miro.medium.com/max/1000/1*1KnuGAAQyIVt2qO7R0XysA.png">

## Ridge regression to estimate yearly seasonalities

Note that the smoothed series output by the Kalman filter with weekly trend and day-of-week effects still exhibits some irregular structures, indicating that there is still variation unexplained by the trend decomposition. The authors attributed this variation to punctual events and yearly seasonality (winter vs summer, classes starting, etc). To account for the yearly seasonality, the residuals from the first step, $Z(t)$, were regressed on event dummies and a Fourier decomposition that describes the yearly seasonality, with $N$ the length of the yearly period. An L2-regularization (ridge regression) is applied to avoid overfitting.

\begin{align*}
Z(t) &= \beta_{christmas} I\{christmas_t=1\} + \cdots + \beta_{easter} I\{easter_t=1\} + \sum_{i=1}^{10} \bigg\{ \phi_i \cos\big( \frac{2\pi i}{N}t \big) + \theta_i \sin\big( \frac{2\pi i}{N}t \big)\bigg\} \\
\beta^* &= \arg \min_{\beta} \Vert W(Z-X\beta) \Vert^2 + \Vert \Lambda \beta \Vert^2
\end{align*}

Here $W$ is a diagonal matrix; setting a diagonal entry to 0 removes the corresponding observation from the regression. This is very useful when we want to ignore a period in time, for example during a hurricane. The regressor matrix $X$ can be modified to account for the decreasing impact of seasonality with time (e.g. decreasing the magnitudes for later observations), due to the growth of the market.
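The weighted ridge problem has a closed-form solution, $\beta^* = (X^T W^2 X + \Lambda^T \Lambda)^{-1} X^T W^2 Z$ for diagonal $W$ and $\Lambda$. The sketch below is a minimal NumPy illustration on synthetic residuals; the helper names, the three harmonics, the penalty values and the omission of event dummies are all assumptions made to keep the example short. It also shows that a shock inside a zero-weight window leaves the estimate unchanged.

```python
import numpy as np


def fourier_features(t, period, n_harmonics):
    """Columns cos(2*pi*i*t/period), sin(2*pi*i*t/period) for i = 1..n_harmonics."""
    cols = []
    for i in range(1, n_harmonics + 1):
        angle = 2.0 * np.pi * i * t / period
        cols += [np.cos(angle), np.sin(angle)]
    return np.column_stack(cols)


def weighted_ridge(X, z, w, lam):
    """Minimise ||W (z - X beta)||^2 + ||Lambda beta||^2 with diagonal W and Lambda."""
    W2 = w ** 2
    A = X.T @ (W2[:, None] * X) + np.diag(lam ** 2)
    return np.linalg.solve(A, X.T @ (W2 * z))


t = np.arange(730.0)  # two years of daily data
X = fourier_features(t, period=365.0, n_harmonics=3)
z = 0.3 * np.cos(2 * np.pi * t / 365.0) - 0.1 * np.sin(4 * np.pi * t / 365.0)

w = np.ones_like(t)
w[100:110] = 0.0      # ignore a ten-day disruption
lam = np.full(X.shape[1], 0.01)

z_shocked = z.copy()
z_shocked[100:110] -= 5.0  # shock that falls entirely inside the ignored window
beta_clean = weighted_ridge(X, z, w, lam)
beta_shocked = weighted_ridge(X, z_shocked, w, lam)
print(beta_clean.round(3))
```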

<img src="https://miro.medium.com/max/1000/1*AnbknD7mj3Iama9LzTEAnQ.png">

We can apply a Kalman Filter on the deseasonalized series to obtain a smooth trend.

<img src="https://miro.medium.com/max/1000/1*25fjFFWB__EbRrGDpboOlg.png" width="75%">