# DSO 522: Applied Time Series Analysis for Forecasting

## Week 6: Time series regression models

### Fall 2024

#### Instructor: Dr. Matteo Sesia


<p align="center">
  <img src="img/marshall.png" alt="Marshall School of Business" width="600"/>
</p>

<link rel="stylesheet" type="text/css" href="custom.css">

# Interactive slides

These lecture slides are made using an interactive [Jupyter](https://jupyter.org/) notebook, powered by the [RISE](https://rise.readthedocs.io/en/latest/) extension.

In the lectures, we will run `R` code in Jupyter, using the `fpp3` package (which you should have already installed).

In [1]:
suppressMessages(library(fpp3))

library(repr)
options(repr.matrix.max.rows=4)
options(repr.plot.width = 8, repr.plot.height = 4, repr.plot.res = 250)

## The linear model for time series forecasting

The basic concept is that we forecast the time series of interest $y$ assuming that it has a linear relationship with other time series $x$.

For example, we might wish to forecast monthly sales $y$ using total advertising spend $x$ as a predictor.

Or we might forecast daily electricity demand $y$ using temperature $x_1$ and the day of week $x_2$ as predictors.

## Simple linear regression


In the simplest case, the regression model allows for a linear relationship between the forecast variable $y$ and a single predictor variable $x$: 

$$y_t = \beta_0 + \beta_1 x_t + \varepsilon_t.$$


<p align="center">
  <img src="img/SLRpop1-1.png" alt="Simple linear model" width="600"/>
</p>

## Example: US consumption expenditure

Time series of quarterly percentage changes (growth rates) of real personal consumption expenditure, $y$, and real personal disposable income, $x$, for the US from 1970 Q1 to 2019 Q2.

In [2]:
us_change

## Example: US consumption expenditure


In [3]:
us_change |>
  pivot_longer(c(Consumption, Income), names_to="Series") |>
  autoplot(value) +
  labs(y = "% change")

## Scatter plot of consumption changes against income changes

In [4]:
us_change |>
  ggplot(aes(x = Income, y = Consumption)) +
  labs(y = "Consumption (quarterly % change)",
       x = "Income (quarterly % change)") +
  geom_point() #+
  #geom_smooth(method = "lm", se = FALSE)

## Linear model to predict consumption changes given income changes

$$\hat{y}_t=0.54 + 0.27x_t.$$
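
As a quick worked reading of this fitted line (using the rounded coefficients shown above): a quarter with no income growth has predicted consumption growth of 0.54%, and each additional percentage point of income growth adds 0.27 percentage points to the prediction. For instance, with $x_t = 1$:

$$\hat{y}_t = 0.54 + 0.27 \times 1 = 0.81.$$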

In [5]:
us_change |>
  model(TSLM(Consumption ~ Income)) |>
  report()

## Multiple linear regression

When there are two or more predictor variables, the model is called a multiple regression model. The general form of a multiple regression model is 
\begin{equation}
  y_t = \beta_{0} + \beta_{1} x_{1,t} + \beta_{2} x_{2,t} + \cdots + \beta_{k} x_{k,t} + \varepsilon_t.
\end{equation}

The coefficients $\beta_1, \ldots, \beta_k$ measure the effect of each predictor after taking into account the effects of all the other predictors in the model. Thus, the coefficients measure the marginal effects of the predictor variables.

## Example: US consumption expenditure

Additional predictors may be useful for forecasting US consumption expenditure.

In [6]:
us_change |>
  select(-Consumption, -Income) |>
  pivot_longer(-Quarter) |>
  ggplot(aes(Quarter, value, colour = name)) +
  geom_line() +
  facet_grid(name ~ ., scales = "free_y") +
  guides(colour = "none") +
  labs(y="% change")

## Pairwise scatter plots

In [7]:
us_change |>
  GGally::ggpairs(columns = 2:6)

## Assumptions

\begin{equation}
  y_t = \beta_{0} + \beta_{1} x_{1,t} + \beta_{2} x_{2,t} + \cdots + \beta_{k} x_{k,t} + \varepsilon_t.
\end{equation}

When we use a linear regression model, we are implicitly making some assumptions.
- The linear model accurately describes the true relationship between the variables.

Additionally, we make the following assumptions about the errors ($\varepsilon_1, \ldots, \varepsilon_T$):
- they have mean zero;
- they are not autocorrelated;
- they are unrelated to the predictor variables.

It is also useful to have the errors being normally distributed with a constant variance in order to easily produce prediction intervals.

## Least squares estimation

In practice, of course, we have a collection of observations but we do not know the values of the coefficients $\beta_0, \beta_1, \ldots, \beta_k$. These need to be estimated from the data.

Least squares principle: choose the values of $\beta_0, \beta_1, \ldots, \beta_k$ that minimise 
$$\sum_{t=1}^T \varepsilon_t^2 = \sum_{t=1}^T (y_t -
  \beta_{0} - \beta_{1} x_{1,t} - \beta_{2} x_{2,t} - \cdots - \beta_{k} x_{k,t})^2.$$
  
  

When we refer to the estimated coefficients, we will use the notation $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$. 
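
To make the least squares principle concrete, here is a minimal sketch on simulated data (the data and variable names are hypothetical, not from the lecture). It solves the standard normal equations $(X'X)\hat\beta = X'y$ directly and compares the result with `lm()`, which minimises the same sum of squared errors.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.3)

# Design matrix with a column of ones for the intercept
X <- cbind(1, x1, x2)

# Least squares solution of the normal equations (X'X) beta = X'y
beta_hat <- drop(solve(crossprod(X), crossprod(X, y)))

# lm() minimises the same sum of squared errors
fit <- lm(y ~ x1 + x2)
print(cbind(normal_equations = beta_hat, lm = coef(fit)))
```

The two columns should agree up to floating-point rounding.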

## Example: US consumption expenditure

A multiple linear regression model for US consumption is 
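$$y_t=\beta_0 + \beta_1 x_{1,t}+ \beta_2 x_{2,t}+ \beta_3 x_{3,t}+ \beta_4 x_{4,t}+\varepsilon_t,$$
where $y_t$ is `Consumption` and $x_{1,t},\ldots,x_{4,t}$ are the `Income`, `Production`, `Unemployment` and `Savings` series in `us_change`.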

In [8]:
us_change |>
  model(tslm = TSLM(Consumption ~ Income + Production + Unemployment + Savings)) |>
  report()

## Fitted values

Predictions of $y$ can be obtained by using the estimated coefficients in the regression equation and setting the error term to zero. In general we write, 
\begin{equation}
  \hat{y}_t = \hat\beta_{0} + \hat\beta_{1} x_{1,t} + \hat\beta_{2} x_{2,t} + \cdots + \hat\beta_{k} x_{k,t}.
\end{equation}

In [9]:
us_change |>
  model(tslm = TSLM(Consumption ~ Income + Production + Unemployment + Savings)) |>
  augment()

In [10]:
us_change |>
  model(tslm = TSLM(Consumption ~ Income + Production + Unemployment + Savings)) |>
  augment() |>
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Consumption, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  labs(y = NULL, title = "Percent change in US consumption expenditure") +
  scale_colour_manual(values=c(Data="black",Fitted="#D55E00")) +
  guides(colour = guide_legend(title = NULL))

In [11]:
us_change |>
  model(tslm = TSLM(Consumption ~ Income + Production + Unemployment + Savings)) |>
  augment() |>
  ggplot(aes(x = Consumption, y = .fitted)) +
  geom_point() +
  labs(
    y = "Fitted (predicted values)",
    x = "Data (actual values)",
    title = "Percent change in US consumption expenditure"
  ) +
  geom_abline(intercept = 0, slope = 1, color="blue")

## Goodness-of-fit

A common way to summarise how well a linear regression model fits the data is via the coefficient of determination, or $R^2$. 
$$R^2 = \frac{\sum(\hat{y}_{t} - \bar{y})^2}{\sum(y_{t}-\bar{y})^2},$$
where the summations are over all observations. 

In simple linear regression, the value of $R^2$ is also equal to the square of the correlation between $y$ and $x$ (provided an intercept has been included).



If the predictions are close to the actual values, we would expect $R^2$ to be close to 1.

On the other hand, if the predictions are unrelated to the actual values, then $R^2=0$. 

In all cases, $R^2$ lies between 0 and 1.

The $R^2$ value is used frequently, though often incorrectly, in forecasting. The value of $R^2$ will never decrease when adding an extra predictor to the model and this can lead to over-fitting. 
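
The claims above can be checked numerically. The sketch below uses simulated data (hypothetical, not the US consumption series): it computes $R^2$ from the definition, compares it with the squared correlation, and then adds a predictor that is pure noise to show that $R^2$ does not go down.

```r
set.seed(2)
n <- 80
x <- rnorm(n)
y <- 0.5 + 0.8 * x + rnorm(n)
noise <- rnorm(n)  # unrelated to y by construction

fit1 <- lm(y ~ x)
fit2 <- lm(y ~ x + noise)

# R^2 from the definition: explained variation over total variation
r2_manual <- sum((fitted(fit1) - mean(y))^2) / sum((y - mean(y))^2)

r2_1 <- summary(fit1)$r.squared
r2_2 <- summary(fit2)$r.squared

print(c(manual = r2_manual, lm = r2_1, cor_squared = cor(y, x)^2))
print(c(one_predictor = r2_1, with_noise = r2_2))
```

The second line of output is the reason $R^2$ on its own is a poor guide for choosing predictors: even a useless variable cannot lower it.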

## Example: US consumption expenditure

In [12]:
us_change |>
  model(tslm = TSLM(Consumption ~ Income + Production + Unemployment + Savings)) |>
  augment() |>
  ggplot(aes(x = Consumption, y = .fitted)) +
  geom_point() +
  labs(
    y = "Fitted (predicted values)",
    x = "Data (actual values)",
    title = "Percent change in US consumption expenditure"
  ) +
  geom_abline(intercept = 0, slope = 1, color="blue")

In [13]:
us_change |>
  model(tslm = TSLM(Consumption ~ Income + Production + Unemployment + Savings)) |>
  report()

## Standard error of the regression

Another measure of how well the model has fitted the data is the standard deviation of the residuals, which is often known as the “residual standard error”.

\begin{equation}
  \hat{\sigma}_e=\sqrt{\frac{1}{T-k-1}\sum_{t=1}^{T}{e_t^2}},
\end{equation}

where $k$ is the number of predictors in the model.

## Evaluating the regression model

The differences between the observed $y$ values and the corresponding fitted $\hat{y}$ values are the training-set errors, or “residuals”, defined as
\begin{align*}
  e_t &= y_t - \hat{y}_t \\
      &= y_t - \hat\beta_{0} - \hat\beta_{1} x_{1,t} - \hat\beta_{2} x_{2,t} - \cdots - \hat\beta_{k} x_{k,t}
\end{align*}

The residuals have some useful properties including the following two: 
$$
\sum_{t=1}^{T}{e_t}=0 \quad\text{and}\quad \sum_{t=1}^{T}{x_{k,t}e_t}=0\qquad\text{for all $k$}.
$$

As a result of these properties, it is clear that the average of the residuals is zero, and that the correlation between the residuals and the observations for the predictor variable is also zero. 
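
A minimal sketch on simulated data (hypothetical, for illustration only) that verifies the two residual properties and also reproduces the residual standard error formula $\hat{\sigma}_e$ by hand:

```r
set.seed(3)
n  <- 60
k  <- 2  # number of predictors
x1 <- rnorm(n)
x2 <- runif(n)
y  <- 2 + x1 - 3 * x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
e <- residuals(fit)

# Both kinds of sums are zero up to floating-point rounding
print(c(sum_e = sum(e), sum_x1_e = sum(x1 * e), sum_x2_e = sum(x2 * e)))

# Residual standard error: by hand and as reported by R
sigma_manual <- sqrt(sum(e^2) / (n - k - 1))
print(c(manual = sigma_manual, lm = sigma(fit)))
```

These properties hold by construction for any least squares fit with an intercept, so they say nothing about whether the model is adequate. That is why we look at residual plots next.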

## ACF plot of residuals

When fitting a regression model to time series data, it is common to find autocorrelation in the residuals. In this case, the estimated model violates the assumption of no autocorrelation in the errors, and our forecasts may be inefficient — there is some information left over which should be accounted for in the model in order to obtain better forecasts.

In [14]:
fit_consMR <- us_change |>
  model(tslm = TSLM(Consumption ~ Income + Production + Unemployment + Savings))

fit_consMR |> gg_tsresiduals()
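
The ACF plot can be complemented with a formal check. A possible sketch is a Ljung-Box test on the innovation residuals, using the `ljung_box` feature that is loaded with `fpp3`. The choice `lag = 10` is an assumption made here for illustration, not something fixed by the model.

```r
library(fpp3)

fit <- us_change |>
  model(tslm = TSLM(Consumption ~ Income + Production + Unemployment + Savings))

# Ljung-Box test on the innovation residuals (lag = 10 is an assumed choice)
lb <- augment(fit) |>
  features(.innov, ljung_box, lag = 10)
lb
```

A small p-value (`lb_pvalue`) suggests that some autocorrelation remains in the residuals.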

## Residual plots against predictors

We would expect the residuals to be randomly scattered without showing any systematic patterns. 

A simple and quick way to check this is to examine scatterplots of the residuals against each of the predictor variables. 

In [15]:
us_change |>
  left_join(augment(fit_consMR)) |>
  pivot_longer(Income:Unemployment,
               names_to = "regressor", values_to = "x") |>
  ggplot(aes(x = x, y = .resid)) +
  geom_point() +
  facet_wrap(. ~ regressor, scales = "free_x") +
  labs(y = "Residuals", x = "")

## Residual plots against fitted values

A plot of the residuals against the fitted values should also show no pattern. 

In [16]:
fit_consMR |>
  augment() |>
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() + labs(x = "Fitted", y = "Residuals")

## Outliers and influential observations

Observations that take extreme values compared to the majority of the data are called *outliers*.

Observations that have a large influence on the estimated coefficients of a regression model are called *influential observations*. 

Usually, influential observations are also outliers that are extreme in the $x$ direction.

One source of outliers is incorrect data entry.

Outliers also occur when some observations are simply different. 

## Example: US consumption expenditure

<p align="center">
  <img src="img/outlier-1.png" alt="Outliers" width="1200"/>
</p>

## Spurious regression

More often than not, time series data are “non-stationary”; that is, the values of the time series do not fluctuate around a constant mean or with a constant variance.

Different time series may appear to be related simply because they both trend upwards in the same manner.
However, they may not be related to one another at all!

<p align="center">
  <img src="img/spurious-1.png" alt="Spurious" width="800"/>
</p>

Cases of spurious regression might appear to give reasonable short-term forecasts, but they will generally not continue to work into the future.
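
A small simulation can illustrate how easily this happens. The sketch below assumes the non-stationary series are independent random walks (a classic choice made here for illustration; the figure above may use different data). Each pair of series is unrelated by construction, yet a regression of one on the other is flagged as "significant" far more often than the nominal 5%.

```r
set.seed(4)
n_obs  <- 100   # length of each series
n_sims <- 200   # number of simulated pairs

# For each pair of independent random walks, regress one on the other
# and record the p-value of the slope's t-test
p_values <- replicate(n_sims, {
  x <- cumsum(rnorm(n_obs))
  y <- cumsum(rnorm(n_obs))
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
})

# Share of pairs declared "significant" at the 5% level
mean(p_values < 0.05)
```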

## Some useful predictors

There are several useful predictors that occur frequently when using regression for time series data.

## Trend

It is common for time series data to be trending. A linear trend can be modelled by simply using $x_{1,t}=t$ as a predictor.

$$y_{t}= \beta_0+\beta_1t+\varepsilon_t.$$

A trend variable can be specified in the `TSLM()` function using the `trend()` special. 

## Dummy variables

So far, we have assumed that each predictor takes numerical values. But what about when a predictor is a categorical variable taking only two values (e.g., “yes” and “no”)? Such a variable might arise, for example, when forecasting daily sales and you want to take account of whether the day is a public holiday or not. So the predictor takes value “yes” on a public holiday, and “no” otherwise.

This situation can be handled within the multiple regression framework by creating a *dummy variable* that takes value 1 for “yes” and 0 for “no”.

A dummy variable can also be used to account for an outlier in the data. Rather than omitting the outlier, the dummy variable removes its effect: it takes value 1 for that observation and 0 everywhere else.

## Seasonal dummy variables

Suppose that we are forecasting daily data and we want to account for the day of the week as a predictor. Then the following dummy variables can be created.

<p align="center">
  <img src="img/seasonal_dummies.png" alt="Seasonal dummies" width="800"/>
</p>

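Notice that only six dummy variables are needed to code seven categories. The seventh category is captured by the intercept: it corresponds to all six dummy variables being set to zero.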

The interpretation of each of the coefficients associated with the dummy variables is that it is a measure of the effect of that category relative to the omitted category.
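
As a sketch of how this coding looks in practice, base R's `model.matrix()` builds the dummy variables from a factor. The day labels below are hypothetical, and the baseline (here Monday) is simply the first factor level, which may differ from the omitted day in the table above.

```r
days <- factor(
  rep(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"), times = 2),
  levels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
)

# model.matrix() drops the first level (here Monday) as the baseline,
# so we get an intercept column plus six dummy columns
D <- model.matrix(~ days)
colnames(D)
dim(D)
```

In `TSLM()`, the `season()` special creates these dummy variables automatically, so this step is rarely needed by hand.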

## Example: Australian quarterly beer production

In [17]:
recent_production <- aus_production |>
  filter(year(Quarter) >= 1992)
recent_production |>
  autoplot(Beer) +
  labs(y = "Megalitres",
       title = "Australian quarterly beer production")

We want to forecast the value of future beer production. We can model these data using a regression model with a linear trend and quarterly dummy variables,
$$y_{t} = \beta_{0} + \beta_{1} t + \beta_{2}d_{2,t} + \beta_3 d_{3,t} + \beta_4 d_{4,t} + \varepsilon_{t},$$
where $d_{i,t} = 1$ if $t$ is in quarter $i$ and 0 otherwise. The first quarter is the omitted (baseline) category.

In [18]:
fit_beer <- recent_production |>
  model(TSLM(Beer ~ trend() + season()))
report(fit_beer)

In [19]:
augment(fit_beer) |>
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Beer, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  scale_colour_manual(
    values = c(Data = "black", Fitted = "#D55E00")
  ) +
  labs(y = "Megalitres",
       title = "Australian quarterly beer production") +
  guides(colour = guide_legend(title = "Series"))

In [20]:
augment(fit_beer) |>
  ggplot(aes(x = Beer, y = .fitted,
             colour = factor(quarter(Quarter)))) +
  geom_point() +
  labs(y = "Fitted", x = "Actual values",
       title = "Australian quarterly beer production") +
  geom_abline(intercept = 0, slope = 1) +
  guides(colour = guide_legend(title = "Quarter"))

## What if we do not use seasonal dummy variables?

In [21]:
recent_production |>
  model(TSLM(Beer ~ trend())) |>
  augment() |>
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Beer, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  scale_colour_manual(
    values = c(Data = "black", Fitted = "#D55E00")
  ) +
  labs(y = "Megalitres",
       title = "Australian quarterly beer production") +
  guides(colour = guide_legend(title = "Series"))

In [22]:
recent_production |>
  model(TSLM(Beer ~ trend())) |>
  augment() |>
  ggplot(aes(x = Beer, y = .fitted,
             colour = factor(quarter(Quarter)))) +
  geom_point() +
  labs(y = "Fitted", x = "Actual values",
       title = "Australian quarterly beer production") +
  geom_abline(intercept = 0, slope = 1) +
  guides(colour = guide_legend(title = "Quarter"))

## Distributed lags

It is often useful to include advertising expenditure as a predictor. However, since the effect of advertising can last beyond the actual campaign, we need to include lagged values of advertising expenditure. Thus, the following predictors may be used. 

\begin{align*}
  x_{1} &= \text{advertising for previous month;} \\
  x_{2} &= \text{advertising for two months previously;} \\
        & \vdots \\
  x_{m} &= \text{advertising for $m$ months previously.}
\end{align*}
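
A minimal sketch of how such lagged predictors can be built, using `dplyr::lag()` on a small made-up data frame (the numbers and column names are hypothetical):

```r
library(dplyr)

# Hypothetical monthly data, for illustration only
ads <- data.frame(
  month       = 1:8,
  advertising = c(10, 12, 9, 15, 11, 14, 13, 16),
  sales       = c(50, 55, 53, 60, 58, 63, 62, 68)
)

ads_lagged <- ads |>
  mutate(
    x1 = dplyr::lag(advertising, 1),  # advertising for previous month
    x2 = dplyr::lag(advertising, 2)   # advertising for two months previously
  )
ads_lagged

# Rows with missing lags are dropped by lm() by default
fit <- lm(sales ~ x1 + x2, data = ads_lagged)
coef(fit)
```

Each extra lag costs one observation at the start of the sample, which is worth keeping in mind when $m$ is large relative to the length of the series.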



## "Irregular" holidays

Easter differs from most holidays because it is not held on the same date each year, and its effect can last for several days. In this case, a dummy variable can be used with value one where the holiday falls in the particular time period and zero otherwise.

## Selecting predictors

When there are many possible predictors, we need some strategy for selecting the best predictors to use in a regression model.

How to choose? Look at measures of predictive accuracy.

In [23]:
## Example: US consumption expenditure

fit_consMR <- us_change |>
  model(model.1 = TSLM(Consumption ~ Income),
        model.2 = TSLM(Consumption ~ Income + Savings),
        model.3 = TSLM(Consumption ~ Income + Production + Unemployment + Savings))

fit_consMR |>
    glance() |>
    select(.model, adj_r_squared, CV, AIC, AICc, BIC)

For the CV, AIC, AICc and BIC measures, we want to find the model with the lowest value.
For adjusted $R^2$, we seek the model with the highest value.

## Adjusted $R^2$

$$\bar{R}^2 = 1-(1-R^2)\frac{T-1}{T-k-1},$$

where $T$ is the number of observations and $k$ is the number of predictors. 

This is an improvement on $R^2$, as it does not necessarily increase with each added predictor. 

Maximising $\bar{R}^2$ works quite well as a method of selecting predictors, although it does tend to err on the side of selecting too many predictors.

## Cross-validation

Leave-one-out cross-validation uses the following steps:

1. Remove observation $t$ from the data set, and fit the model using the remaining data. Then compute the error ($e^*_t = y_t - \hat{y}_t$) for the omitted observation.
2. Repeat step 1 for $t=1,\ldots,T$.
3. Compute the MSE from $e^*_1, \ldots, e^*_T$. We shall call this the CV.
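
The three steps can be sketched directly in base R on simulated data (hypothetical, for illustration only). For linear regression there is also a well-known shortcut that avoids refitting the model $T$ times: the leave-one-out error equals $e_t/(1-h_t)$, where $h_t$ is the leverage of observation $t$. The sketch computes the CV both ways.

```r
set.seed(5)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
dat <- data.frame(y = 1 + 0.7 * x1 + rnorm(n), x1 = x1, x2 = x2)

# Steps 1 and 2: leave each observation out in turn and predict it
cv_errors <- sapply(seq_len(n), function(t) {
  fit_t <- lm(y ~ x1 + x2, data = dat[-t, ])
  dat$y[t] - predict(fit_t, newdata = dat[t, ])
})

# Step 3: the CV statistic is the mean of the squared errors
cv_loop <- mean(cv_errors^2)

# Shortcut for linear regression: e_t / (1 - h_t), with h_t the leverage
fit_all <- lm(y ~ x1 + x2, data = dat)
cv_shortcut <- mean((residuals(fit_all) / (1 - hatvalues(fit_all)))^2)

c(loop = cv_loop, shortcut = cv_shortcut)
```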

## Example: US consumption


<p align="center">
  <img src="img/table_cv.png" alt="Table CV" width="800"/>
</p>

## Stepwise regression

If there are a large number of predictors, it is not possible to fit all possible models.
For example, 40 predictors lead to $2^{40} > 1$ trillion possible models! 

Consequently, a strategy is required to limit the number of models to be explored.

An approach that works quite well is backwards stepwise regression:

- Start with the model containing all potential predictors.
- Remove one predictor at a time. Keep the model if it improves the measure of predictive accuracy.
- Iterate until no further improvement.

If the number of potential predictors is too large, backwards stepwise regression will not work, and forward stepwise regression can be used instead.
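
As a rough sketch of the backwards procedure, base R's `step()` function removes one predictor at a time. Note the assumptions: `step()` uses the AIC as its criterion (rather than CV or another measure from the previous slides), it works on `lm()` fits rather than `TSLM()` models, and the data below are simulated.

```r
set.seed(6)
n <- 120
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
# Only x1 and x2 actually drive y; x3 and x4 are pure noise
dat$y <- 1 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

# Start with the model containing all potential predictors
full <- lm(y ~ x1 + x2 + x3 + x4, data = dat)

# Backward elimination: drop one predictor at a time while the AIC improves
selected <- step(full, direction = "backward", trace = 0)
formula(selected)
```

Stepwise search is not guaranteed to find the best possible model, but it usually ends up with a good one.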

## Forecasting with regression

Recall that predictions of $y$ can be obtained using:
$$\hat{y}_t = \hat\beta_{0} + \hat\beta_{1} x_{1,t} + \hat\beta_{2} x_{2,t} + \cdots + \hat\beta_{k} x_{k,t}.$$

What we are interested in here, however, is forecasting **future values** of $y$.

## Ex-ante versus ex-post forecasts

**Ex-ante forecasts**
Made using only the information that is available in advance. 

These are genuine forecasts, made in advance using whatever information is available at the time. 

Therefore in order to generate ex-ante forecasts, the model requires forecasts of the predictors. 
  
**Ex-post forecasts**
Made using later information on the predictors. These are not genuine forecasts, but are useful for studying the behaviour of forecasting models.
 
The model from which ex-post forecasts are produced should not be estimated using data from the forecast period. That is, ex-post forecasts can assume knowledge of the predictor variables (the $x$ variables), but should not assume knowledge of the data that are to be forecast (the $y$ variable).
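
When the only predictors are deterministic, such as a trend and seasonal dummy variables, their future values are known in advance, so ex-ante forecasts need no forecasts of the predictors. A sketch with the beer production model from earlier (the 8-quarter horizon and plot title are choices made here for illustration):

```r
library(fpp3)

recent_production <- aus_production |>
  filter(year(Quarter) >= 1992)

fit_beer <- recent_production |>
  model(TSLM(Beer ~ trend() + season()))

# trend() and season() are known in advance, so no predictor forecasts are needed
fc_beer <- fit_beer |> forecast(h = 8)

fc_beer |>
  autoplot(recent_production) +
  labs(y = "Megalitres",
       title = "Forecasts of beer production using regression")
```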

## Scenario based forecasting

In this setting, the forecaster assumes possible scenarios for the predictor variables that are of interest. For example, a US policy maker may be interested in comparing the predicted change in consumption when there is constant growth of 1% and 0.5%, respectively, for income and savings with no change in the unemployment rate, versus a respective decline of 1% and 0.5%, for each of the four quarters following the end of the sample. 

In [24]:
fit_consBest <- us_change |>
  model(
    lm = TSLM(Consumption ~ Income + Savings + Unemployment)
  )

fit_consBest |> report()

In [25]:
future_scenarios <- scenarios(
  Increase = new_data(us_change, 4) |>
    mutate(Income=1, Savings=0.5, Unemployment=0),
  Decrease = new_data(us_change, 4) |>
    mutate(Income=-1, Savings=-0.5, Unemployment=0),
  names_to = "Scenario")

future_scenarios

In [26]:
fc <- fit_consBest |> forecast(new_data = future_scenarios)

us_change |>
  autoplot(Consumption) +
  autolayer(fc) +
  labs(title = "US consumption", y = "% change")

## Building a predictive regression model

Ex-ante forecasts require future values of the predictors, which are usually unknown. An alternative formulation is to use lagged values of the predictors. Assuming that we are interested in generating an $h$-step-ahead forecast, we write
$$y_{t+h}=\beta_0+\beta_1x_{1,t}+\dots+\beta_kx_{k,t}+\varepsilon_{t+h}$$
for $h=1,2, \ldots$. The predictors are the values of the $x$ variables observed $h$ time periods before $y$ is observed.

Including lagged values of the predictors not only makes the model operational for easily generating forecasts, but also makes it intuitively appealing (no simultaneous effects).

In [27]:
fitc <- us_change |>
  model(
    lm = TSLM(Consumption ~ Income + Savings + Unemployment),
    lm_lag_1 = TSLM(Consumption ~ lag(Income) + lag(Savings) + lag(Unemployment)),
    lm_lag_2 = TSLM(Consumption ~ lag(Income)),
    lm_lag_3 = TSLM(Consumption ~ lag(Consumption) + lag(Income))
  )

fitc |> glance() |> select(.model, adj_r_squared, CV)

## Guided Workbook (part 1)

## Nonlinear regression

Although the linear relationship assumed so far in this chapter is often adequate, there are many cases in which a nonlinear functional form is more suitable. To keep things simple in this section we assume that we only have one predictor $x$.

The simplest way of modelling a nonlinear relationship is to transform the forecast variable $y$ and/or the predictor variable $x$ before estimating a regression model.

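For example, we could fit
$$y=f(x) +\varepsilon,$$
where $f$ is a nonlinear function.

Alternatively, a log-log functional form is specified as
$$\log y=\beta_0+\beta_1 \log x +\varepsilon.$$
In this model, the slope $\beta_1$ can be interpreted as an elasticity: the average percentage change in $y$ resulting from a 1% increase in $x$.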
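
The log-log form can be tried on simulated data. In the sketch below (hypothetical data, generated with a true slope of 1.5 on the log scale), an ordinary linear regression on the transformed variables recovers that slope.

```r
set.seed(7)
n <- 200
x <- runif(n, min = 1, max = 10)
# Hypothetical data: y grows like x^1.5, with multiplicative noise
y <- 2 * x^1.5 * exp(rnorm(n, sd = 0.1))

# The model is linear after taking logs of both variables
fit_loglog <- lm(log(y) ~ log(x))
coef(fit_loglog)
```

Both variables must be strictly positive for the logarithms to be defined.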

## Example: Boston marathon winning times

In [28]:
boston_men <- boston_marathon |>
  filter(Year >= 1924) |>
  filter(Event == "Men's open division") |>
  mutate(Minutes = as.numeric(Time)/60)

boston_men |>
    autoplot(Minutes)

## Piecewise linear transformations

Introduce points where the slope of $f$ can change. These points are called knots. This can be achieved by letting $x_1=x$ and introducing variable $x_2$ such that 

\begin{align*}
  x_{2} = (x-c)_+ &= \left\{
             \begin{array}{ll}
               0 & \text{if } x < c\\
               x-c &  \text{if } x \ge c.
             \end{array}\right.
\end{align*}

The notation $(x-c)_+$ means the value $x-c$ if it is positive and 0 otherwise.

Piecewise linear relationships constructed in this way are a special case of regression splines. In general, a linear regression spline is obtained using 
$$x_{1}= x, \quad x_{2} = (x-c_{1})_+, \quad\dots,\quad x_{k} = (x-c_{k-1})_+,$$
where $c_1, \ldots, c_{k-1}$ are the knots.

Selecting the number of knots $(k-1)$ and where they should be positioned can be difficult and somewhat arbitrary. 
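
A minimal sketch of a single-knot construction in base R, where `pmax(x - knot, 0)` computes $(x-c)_+$. The knot position and the simulated data are hypothetical.

```r
set.seed(8)
x <- seq(0, 10, length.out = 101)
knot <- 4  # hypothetical knot position c

# x2 = (x - c)_+ : zero before the knot, x - c afterwards
x2 <- pmax(x - knot, 0)

# Simulated response whose slope changes from 1 to -0.5 at the knot
y <- 2 + 1 * x - 1.5 * x2 + rnorm(length(x), sd = 0.2)

fit_pw <- lm(y ~ x + x2)
# The slope is beta_1 before the knot and beta_1 + beta_2 after it
coef(fit_pw)
```

Because $x_2$ is zero up to the knot and grows linearly afterwards, the fitted line is continuous and only its slope changes at $c$.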

## Forecasting with a nonlinear trend

Polynomial transformations:

$$x_{1,t} =t,\quad x_{2,t}=t^2,\quad \dots.$$

However, it is not recommended that quadratic or higher order trends be used in forecasting. When they are extrapolated, the resulting forecasts are often unrealistic.

A better approach is to use the piecewise specification introduced above and fit a piecewise linear trend which bends at some point in time. We can think of this as a nonlinear trend constructed of linear pieces. If the trend bends at time $\tau$, then it can be specified by simply setting $x=t$ and $c=\tau$ above, so that we include the predictors
\begin{align*}
  x_{1,t} & = t \\
  x_{2,t} &= (t-\tau)_+ = \left\{
             \begin{array}{ll}
               0 & \text{if } t < \tau\\
               t-\tau &  \text{if } t \ge \tau
             \end{array}\right.
\end{align*}

## Example: Boston marathon winning times

In [29]:
boston_men <- boston_marathon |>
  filter(Year >= 1924) |>
  filter(Event == "Men's open division") |>
  mutate(Minutes = as.numeric(Time)/60)

boston_men |>
    autoplot(Minutes)

In [30]:
fit_trends <- boston_men |>
  model(
    linear = TSLM(Minutes ~ trend()),
    exponential = TSLM(log(Minutes) ~ trend()),
    piecewise = TSLM(Minutes ~ trend(knots = c(1950, 1980)))
  )

fc_trends <- fit_trends |> forecast(h = 10)

In [31]:
boston_men |>
  autoplot(Minutes) +
  geom_line(data = fitted(fit_trends),
            aes(y = .fitted, colour = .model)) +
  autolayer(fc_trends, alpha = 0.5, level = 95) +
  labs(y = "Minutes",
       title = "Boston marathon winning times")

## Correlation is not causation

It is important not to confuse correlation with causation, or causation with forecasting. A variable $x$ may be useful for forecasting a variable $y$, but that does not mean $x$ is causing $y$. 

It is possible that $x$ is causing $y$, but it may be that $y$ is causing $x$, or that the relationship between them is more complicated than simple causality.

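For example, it is possible to model the number of drownings at a beach resort each month with the number of ice-creams sold in the same period. The model can give reasonable forecasts, not because ice-creams cause drownings, but because both are driven by a third variable: on hot days people buy more ice-creams and are also more likely to go swimming.

Similarly, it is possible to forecast if it will rain in the afternoon by observing the number of cyclists on the road in the morning. Here the forecast works not because cyclists prevent rain, but because people are more likely to cycle when the published weather forecast is for a dry day.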



## Spurious correlations

<p align="center">
  <img src="img/5825_google-searches-for-taylor-swift_correlates-with_fossil-fuel-use-in-british-virgin-islands.svg" alt="Taylor Swift" width="800"/>
</p>

Credit: http://www.tylervigen.com/spurious-correlations

## Forecasting with correlated predictors

When two or more predictors are highly correlated it is always challenging to accurately separate their individual effects.

Suppose we are forecasting monthly sales of a company for 2012, using data from 2000–2011. In January 2008, a new competitor came into the market and started taking some market share. At the same time, the economy began to decline. In your forecasting model, you include both competitor activity (measured using advertising time on a local television station) and the health of the economy (measured using GDP). It will not be possible to separate the effects of these two predictors because they are highly correlated.

## Multicollinearity and forecasting

A closely related issue is multicollinearity, which occurs when similar information is provided by two or more of the predictor variables in a multiple regression.

It can occur when two predictors are highly correlated with each other (that is, they have a correlation coefficient close to +1 or -1). In this case, knowing the value of one of the variables tells you a lot about the value of the other variable.

When multicollinearity is present, the uncertainty associated with individual regression coefficients will be large. This is because they are difficult to estimate. Consequently, statistical tests (e.g., t-tests) on regression coefficients are unreliable. (In forecasting we are rarely interested in such tests.) Also, it will not be possible to make accurate statements about the contribution of each separate predictor to the forecast.

It is always a little dangerous when future values of the predictors lie much outside the historical range, but it is especially problematic when multicollinearity is present.
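
The effect of multicollinearity on coefficient uncertainty can be seen in a small simulation (hypothetical data, for illustration only). The same model is fitted twice: once with two unrelated predictors, and once with a second predictor that is almost a copy of the first.

```r
set.seed(9)
n  <- 200
x1 <- rnorm(n)
x2_indep <- rnorm(n)                   # unrelated to x1
x2_coll  <- x1 + rnorm(n, sd = 0.05)   # almost a copy of x1

y_indep <- 1 + x1 + x2_indep + rnorm(n)
y_coll  <- 1 + x1 + x2_coll  + rnorm(n)

# Standard error of the coefficient on x1 in each setting
se_x1 <- function(fit) summary(fit)$coefficients["x1", "Std. Error"]

se_indep <- se_x1(lm(y_indep ~ x1 + x2_indep))
se_coll  <- se_x1(lm(y_coll ~ x1 + x2_coll))

c(correlation = cor(x1, x2_coll),
  se_independent = se_indep,
  se_collinear = se_coll)
```

The standard error on `x1` is much larger in the collinear case, even though both data sets have the same sample size and noise level.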

## Guided Workbook (part 2)

## Next Time

- Exponential smoothing
- Practice midterm (1.5 hours)