#### Time-Series Regression

$$y_t
= \beta_0 + \beta_1y_{t-1} + \beta_2y_{t-2} + \dots + \beta_py_{t-p}
+ \delta_{11}X_{1t-1} + \delta_{12}X_{1t-2} + \dots + \delta_{1q_1}X_{1t-q_1}
+ \dots + \delta_{k1}X_{kt-1} + \delta_{k2}X_{kt-2} + \dots + \delta_{kq_k}X_{kt-q_k} + u_t,$$

where
 - TS1: $E(u_t| y_{t-1}, y_{t-2}, \dots, X_{1t-1}, X_{1t-2}, \dots, X_{kt-1}, X_{kt-2}, \dots) = 0$
 - TS2: (a) The random variables $(y_t, X_{1t}, \dots, X_{kt})$ have a stationary distribution (strong stationarity), and (b) $(y_t, X_{1t}, \dots, X_{kt})$ and $(y_{t-j}, X_{1t-j}, \dots, X_{kt-j})$ become independent as $j$ gets large (weak independence).
 - TS3: large outliers are unlikely: all $X$ and $y$ have nonzero, finite fourth moments.
 - TS4: There is no perfect multicollinearity.
 
In particular, TS1 is a natural extension of A1 above, now that the past values of $y$ as well as $X$ are regressors. **Violation of TS1 will result in the OLS estimate being biased, just as with usual linear regression**; see 'accessing the validity of regression bias below'. 

TS2 (a) essentially says that the distribution does not change over time (strong stationarity), while (b) relaxes independence: it only requires approximate independence after enough lags, so that the LLN and CLT still hold in large samples. If TS2 (a) is violated, the forecast can be biased or inefficient (i.e. some other forecast based on the same data might have a lower mean squared forecast error), or the conventional OLS standard errors can be wrong, leading to misleading p-values and statistical inference. **What exactly happens, and its remedy, depends on the source of the non-stationarity: two common cases are when $X$ is not a stationary process (unit root) or when $u$ is heteroskedastic or autocorrelated**; see below.

**Hypothesis testing can be done in the same way as ordinary OLS**. 
- The intuition is simple: whether it is ordinary OLS or time-series regression, **all probability distributions under consideration are conditional on $X$, so the distribution of $X$ is not important as long as it stays roughly the same (stationary)**. As such, normality and homoskedasticity of $u_t$ might be helpful for inference in small samples, though we shall talk about heteroskedasticity- and autocorrelation-consistent standard errors below. 
- One such test that is famous in time-series regression is the **F test of whether the loadings on $X_j$ and its lags are all zero**. More precisely, the null hypothesis implies that $X_j$ has no predictive content for $y_t$ beyond that contained in the other regressors. This is called the **Granger Causality** test. Despite the word 'causality' in its name, it is more of a predictability test, and it is entirely possible for $X$ and $y$ to Granger-cause one another.
- When $X$ are excess returns serving as factors, and $y$ are excess returns of test assets, the time-series regression above can be used to test asset pricing models, though typically $y$ and $X$ are of the same period, i.e. there is no regressing on the past. The asset pricing implication is that the intercept term, usually called $\alpha$ in that context, is zero, which holds only when the factors are traded excess returns, and $\alpha=0$ is the hypothesis being tested. The famous **GRS test**, which utilises small-sample properties of the OLS estimates, links the hypothesis test of $\alpha=0$ to testing the MV efficiency of the test portfolio; see Section 12.1 in [< Asset Pricing >](https://www.evernote.com/shard/s191/nl/21353936/b33b7ea8-e993-74b0-8d10-3c5e6795a578?title=John%20H.%20Cochrane%20-%20Asset%20Pricing) for more details. Note that the test of $\alpha=0$ is not a readily available conclusion of the OLS theory below though, since we are pooling regressions on multiple assets.
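
As a rough illustration of the Granger causality F test described above, here is a minimal NumPy sketch for a single regressor. The helper names `adl_design` and `granger_f`, the simulated data, and the use of the homoskedasticity-only F statistic are all assumptions made for this example; in practice one would likely use a robust (HAC) version of the test.

```python
import numpy as np

def adl_design(y, x, p, q):
    """Build the ADL(p, q) target, the restricted block (constant + lags of y),
    and the block of lags of x, all aligned on the same sample."""
    s, T = max(p, q), len(y)
    y_lags = [y[s - k:T - k] for k in range(1, p + 1)]
    x_lags = [x[s - k:T - k] for k in range(1, q + 1)]
    restricted = np.column_stack([np.ones(T - s)] + y_lags)
    return y[s:], restricted, np.column_stack(x_lags)

def granger_f(y, x, p, q):
    """Homoskedasticity-only F statistic for H0: all q lags of x have zero coefficients."""
    target, Z_r, X_l = adl_design(y, x, p, q)
    Z_u = np.column_stack([Z_r, X_l])

    def ssr(Z):
        beta, *_ = np.linalg.lstsq(Z, target, rcond=None)
        resid = target - Z @ beta
        return resid @ resid

    ssr_r, ssr_u = ssr(Z_r), ssr(Z_u)
    dof = len(target) - Z_u.shape[1]
    return ((ssr_r - ssr_u) / q) / (ssr_u / dof)

# Simulated example (hypothetical data): x helps predict y, but not the reverse.
rng = np.random.default_rng(0)
T = 500
x = rng.standard_normal(T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.3 * y[t - 1] + 0.8 * x[t - 1] + rng.standard_normal()

f_x_to_y = granger_f(y, x, p=2, q=2)  # expected to be large
f_y_to_x = granger_f(x, y, p=2, q=2)  # expected to be small
print(f_x_to_y, f_y_to_x)
```

The statistic would be compared against an $F_{q, T-s-1-p-q}$ critical value, where $s=\max(p,q)$ observations are lost to lagging.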

The lag order can be determined using the usual information criteria; see [evaluation metrics and information criterions](../meta_learning/evaluation_metrics_and_information_criterions.ipynb). When there are multiple regressors though, the computation cost can be daunting since there are many combinations of lag orders that need to be tested and compared.

When there is only one regressor $X$, the model is called the **Autoregressive Distributed Lag Model (ADL)**, denoted as $ADL(p, q)$, where $p$ is the number of autoregressive lags of $y$ and $q$ is the number of lags of the sole regressor $X$. When there are no autoregressive terms, it is simply called a **Distributed Lag (DL)** model. 
> In the DL model, the exogeneity condition of TS1 is sometimes relaxed to $E(u_t| X_t, X_{t-1}, X_{t-2}, \dots, X_{t-q}) = 0$, i.e. there is no conditioning on past $y$ since they are not among the regressors. Sometimes a stronger assumption is used for efficiency, which we call *strict exogeneity*: $E(u_t| \dots, X_{t+1}, X_t, X_{t-1}, X_{t-2}, \dots, X_{t-q})=0$. Strict exogeneity implies the exogeneity above, but not vice versa. Indeed, if changes in $y_t$ impact future values of $X_t$, then we do not have strict exogeneity, but we can still have exogeneity.

> Even with strict exogeneity, $u_t$ can be serially correlated. One can reframe a DL model with $AR(p)$ errors $u_t$ into an $ADL$ model. If one sticks to the DL model with serially correlated $u_t$ (such as the case of Fama-French regression), then the standard errors and statistical inference need to be adjusted using the **Heteroskedasticity- and Autocorrelation-Consistent (HAC) estimator**, which we will discuss next. Since in practice the model we build is often DL rather than ADL, so that only the relaxed version of TS1 is assumed, HAC is necessary, and hypothesis testing differs from classical linear regression.

#### HAC: Newey-West Variance Estimator

As mentioned above, in DL models, $u_t$ can be autocorrelated. One source of such correlation is that the omitted variables included in $u_t$ can themselves be serially correlated. Also as alluded to before, **autocorrelation in $u_t$ neither affects the consistency of OLS nor induces bias. If, however, the errors are autocorrelated, then the standard errors of the OLS estimators need to be adjusted**. In this sense, autocorrelation is analogous to heteroskedasticity. It can also be thought of as somewhat relaxing the i.i.d. assumption on $(x_n, y_n)$.

The HAC estimator of the variance of $\hat{\beta}_j$ is 

$$\tilde\sigma^2_{\hat{\beta_j}}=\hat\sigma^2_{\hat{\beta_j}}\hat{f}_T,$$

where $\hat\sigma^2_{\hat{\beta_j}}$ is the estimator of the variance of the OLS estimator $\hat{\beta}_j$ in the absence of serial correlation, and $\hat{f}_T$ is given by **(this formula may need work - find its reference!)**

$$\hat{f}_T=1+2\sum_{m=1}^{M-1}\left(\frac{M-m}{M}\right)\tilde{\rho}_m,$$

where $\tilde\rho_m = \sum_{t=m+1}^T\hat{v}_{j,t}\hat{v}_{j, t-m}/\sum_{t=1}^T\hat{v}^2_{j,t}$ and $\hat{v}_{j,t}=(X_{j, t}-\bar{X}_j)\hat{u}_t$. The (hyper-)parameter $M$ here is called the truncation parameter. One heuristic is 

$$M=0.75T^{1/3}.$$

The above is also called the **Newey-West** variance estimator, after Newey and West, who show that, when used along with a heuristic like the one above, it is a consistent estimator of the variance of $\hat{\beta}_j$ under general assumptions. There are weighting schemes other than $(M-m)/M$, which lead to other HAC estimators.
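
To make the adjustment factor concrete, the following is a small pure-Python sketch of $\hat{f}_T$ for a single regressor. It assumes the weights $(M-m)/M$ applied to lags $m=1,\dots,M-1$, autocorrelations of $\hat{v}_t=(X_t-\bar{X})\hat{u}_t$ normalised by the full-sample sum of squares, and the truncation heuristic rounded to an integer; the function name `newey_west_factor` and the toy inputs are made up for illustration.

```python
def newey_west_factor(x, resid, M=None):
    """Sketch of the Newey-West adjustment factor f_T for one regressor.

    x     : regressor values X_t
    resid : OLS residuals u_hat_t
    M     : truncation parameter; defaults to the 0.75 * T^(1/3) heuristic
            (rounded to an integer, at least 1: an assumption of this sketch)
    """
    T = len(x)
    if M is None:
        M = max(1, round(0.75 * T ** (1 / 3)))
    x_bar = sum(x) / T
    v = [(xi - x_bar) * ui for xi, ui in zip(x, resid)]
    denom = sum(vi * vi for vi in v)
    f = 1.0
    for m in range(1, M):
        # m-th sample autocorrelation of v_t
        rho_m = sum(v[t] * v[t - m] for t in range(m, T)) / denom
        f += 2.0 * (M - m) / M * rho_m
    return f

# Toy inputs: alternating residuals give negative first-order autocorrelation in v_t.
factor = newey_west_factor([1.0, 2.0, 3.0, 4.0], [1.0, -1.0, 1.0, -1.0], M=2)
print(factor)  # 0.75
```

The HAC variance is then the conventional variance estimate multiplied by this factor; with $M=1$ the factor is exactly 1 and no adjustment is made.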

**Problems caused by Unit Roots**

When unit roots are present, the assumption TS2 is violated. Indeed, if a regressor has a stochastic trend (that is, has a unit root), then the OLS estimator of its coefficient and its OLS t-statistic can have nonstandard (that is, nonnormal) distributions, even in large samples. 

- **If we run a time-series regression on an AR(1) process with a unit root, the estimated autoregressive coefficient is biased towards 0**. 
    - More specifically, the OLS estimate of the autoregressive coefficient is **consistent, but its distribution is biased below 1**: approximately $E(\hat{\beta}_1)=1-5.3/T$, which shrinks only slowly as $T$ grows.
- When a regressor has a unit root, **its usual OLS t-statistic can have a nonnormal distribution under the null hypothesis, even in large samples**. 
    - This nonnormal distribution means that conventional confidence intervals are not valid. 
    - Note that this is different from the issue addressed by the autocorrelations in the error term, by the HAC estimator above.
- The danger of **spurious regression**, whereby two unrelated time series appear to be related. 
    - In this case, the coefficient from regressing one time series on the other can be significant depending on the sample, probably with a high $R^2$ as well, suggesting a relationship where there is none. In other words, it is **dangerous to discover relationships like this just by linear regression**. 
    - On the other hand, somewhat reversing the chain of logic, if there are reasons to believe the two processes are **cointegrated**, whereby the underlying trends of the two processes are the same, linear regression can be the means to some tests; see ADF test and Johansen test in [time-series-models](https://github.com/netantman/other-quant-methods/blob/master/time-series-models.ipynb).
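
Both problems can be seen in a small Monte Carlo experiment. The sketch below is only illustrative: the sample size, number of simulations, seed, and the helper `ols_slope_and_t` are arbitrary choices for this example, and the t-statistic is the conventional homoskedasticity-only one.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_sims = 100, 2000

def ols_slope_and_t(x, y):
    """Slope and conventional t-statistic from regressing y on (1, x)."""
    Z = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sigma2 = resid @ resid / (len(y) - 2)
    cov = sigma2 * np.linalg.inv(Z.T @ Z)
    return beta[1], beta[1] / np.sqrt(cov[1, 1])

ar_coefs, rejections = [], []
for _ in range(n_sims):
    # (1) AR(1) regression on a random walk: the true coefficient is 1.
    walk = np.cumsum(rng.standard_normal(T + 1))
    slope, _t = ols_slope_and_t(walk[:-1], walk[1:])
    ar_coefs.append(slope)
    # (2) Regress the walk on an independent random walk: the true slope is 0.
    other = np.cumsum(rng.standard_normal(T + 1))
    _slope, t_stat = ols_slope_and_t(other, walk)
    rejections.append(abs(t_stat) > 1.96)

mean_ar = float(np.mean(ar_coefs))          # noticeably below 1
rejection_rate = float(np.mean(rejections))  # far above the nominal 5%
print(mean_ar, 1 - 5.3 / T, rejection_rate)
```

The average estimated autoregressive coefficient can be compared with the $1-5.3/T$ approximation, and the rejection rate in the second regression shows how often two independent random walks look 'significantly' related.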

#### Locally Weighted Linear Regression

See [locally_weighted_linear_regression](locally_weighted_linear_regression.ipynb).

#### Principal Component Regression (PCR)

Simply replace $X$ with its principal components $UD$ or unit principal components $U$, where $X=UDV^T$ is the singular value decomposition of $X$. 

One can view this in connection to the ridge regression: 
- Ridge regression **shrinks principal components with small eigenvalues**, as a **soft exclusion**; 
- PCR **discards the $p-M$ smallest eigenvalue components**.

Since PCR uses PCA, it is not scale invariant, and thus pre-scaling of $X$ is required.
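
A minimal NumPy sketch of PCR via the singular value decomposition $X=UDV^T$ of the standardized regressors is given below. The function name `pcr_fit` and the simulated data are assumptions for illustration; the returned coefficients are on the standardized scale of $X$.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Regress y on the first n_components principal components of standardized X.

    Returns the intercept and the implied coefficients on the standardized columns.
    """
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)      # pre-scaling, since PCA is not scale invariant
    U, d, Vt = np.linalg.svd(Xs, full_matrices=False)
    U_m, d_m, V_m = U[:, :n_components], d[:n_components], Vt[:n_components].T
    # Components z_m = d_m * u_m are orthogonal, so each coefficient is <z_m, y> / <z_m, z_m>.
    theta = (U_m.T @ (y - y.mean())) / d_m
    beta = V_m @ theta                             # map back to coefficients on Xs
    return y.mean(), beta

# Hypothetical data with columns on very different scales.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4)) * np.array([1.0, 5.0, 0.2, 2.0])
y = X @ np.array([1.0, 0.5, -2.0, 0.0]) + rng.standard_normal(100)

intercept, beta_full = pcr_fit(X, y, n_components=4)  # all components: same as OLS
_, beta_two = pcr_fit(X, y, n_components=2)           # hard truncation to 2 components
print(beta_full, beta_two)
```

Keeping all $p$ components reproduces the least squares fit on the standardized regressors, which is a convenient sanity check.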

#### Partial Least Squares (PLS)

1. Standardize each $x_j$ to have mean 0 and variance 1. Set $\hat{y}^{(0)}=\bar{y}1$, and $x_j^{(0)}=x_j$, $j=1, \dots, p$.

2. For $m=1, 2, \dots, p$,

(a) $z_m=\sum_{j=1}^p\phi_{mj}x_j^{(m-1)}$, where $\phi_{mj}=<x_j^{(m-1)}, y>$.

(b) $\theta_m=\frac{<z_m, y>}{<z_m, z_m>}$.

(c) $\hat{y}^{(m)}=\hat{y}^{(m-1)}+\theta_mz_m$.

(d) Orthogonalize each $x_j^{(m-1)}$ with respect to $z_m$: $x_j^{(m)}=x_j^{(m-1)}-[<z_m, x_j^{(m-1)}>/<z_m, z_m>]z_m$, $j=1,2, \dots, p$.

For any $m<p$, $\hat{y}^{(m)}$ is a reduced PLS regression predicted value, and when $m=p$, it reduces back to the usual least squares regression. Since the $z_m$ are linear combinations of the $x_j$, $\hat{y}^{(m)}=X\hat{\beta}^{(m), PLS}$.
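
The steps above can be sketched in a few lines of NumPy. The function name `pls_fitted` and the simulated data are assumptions for illustration, and the sketch returns only in-sample fitted values rather than the coefficient vector.

```python
import numpy as np

def pls_fitted(X, y, n_directions):
    """In-sample PLS fitted values after n_directions steps of the algorithm."""
    Xm = (X - X.mean(axis=0)) / X.std(axis=0)      # step 1: standardize each x_j
    y_hat = np.full(len(y), y.mean())              # y_hat^(0) = y_bar * 1
    for _ in range(n_directions):
        phi = Xm.T @ y                             # (a) phi_mj = <x_j^(m-1), y>
        z = Xm @ phi                               #     z_m = sum_j phi_mj x_j^(m-1)
        theta = (z @ y) / (z @ z)                  # (b) regress y on z_m
        y_hat = y_hat + theta * z                  # (c) update the fit
        Xm = Xm - np.outer(z, z @ Xm) / (z @ z)    # (d) orthogonalize each x_j w.r.t. z_m
    return y_hat

# Hypothetical data for illustration.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
X[:, 1] += 0.5 * X[:, 0]                           # make the columns correlated
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.standard_normal(200)

fit_one = pls_fitted(X, y, 1)
fit_full = pls_fitted(X, y, 3)                     # m = p: should match least squares
print(np.sum((y - fit_one) ** 2), np.sum((y - fit_full) ** 2))
```

With $m=p$ the fitted values should coincide with those of ordinary least squares with an intercept, which gives a simple check of the implementation.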

The loadings on $x_j$ are chosen according to their inner product with $y$ ('supervised' by $y$), in contrast to PCR above. The intuition is to keep peeling off residuals of $x$ and piling them onto $y$, maximizing the covariance. Indeed, the $m$-th PLS direction $\phi_m$ solves:

$$\max_{\alpha} \mathrm{Corr}^2(y, X\alpha)\mathrm{Var}(X\alpha)$$

subject to

$$||\alpha||=1, \alpha^{T}S\phi_l=0, \;\;l=1, \dots, m-1.$$

It can be shown that partial least squares also tends to shrink the low-variance directions, but can actually inflate some of the higher variance directions. This can make PLS a little unstable, and cause it to have slightly higher prediction error compared to ridge regression. Put in another way, PLS downweights noisy features, but does not throw them away; as a result a large number of noisy features can contaminate the predictions.

To minimize prediction error, ridge regression is generally preferable to variable subset selection, PCR and PLS, but the improvement over the last two is only slight.

#### Designing for $X$ and $y$ - portfolio sort analysis

In a series of papers, Fama and French perform regression analysis not on the returns of individual stocks, but on returns of sorted portfolios. The stocks are sorted according to their attributes, such as market cap (usually its log), book-to-price ratio (also its log), and beta (not done in the Fama-French paper, but found to be important in later studies). $y$ is the excess return (raw return minus the risk-free rate) of these portfolios, while each $X$ is taken to be the return of a high-quantile portfolio less that of a low-quantile portfolio, e.g. high-value-minus-low-value. 

The benefits of such sorting mechanics, which are also Fama-French's main contribution (not the regression analysis per se), are several.
- The sorting itself can be viewed as a *non-parametric approach* to discerning relations between attributes and asset returns, free of the linear assumptions in regression that might distort the results.
- In testing whether some factors explain expected returns or are priced, we need low volatility of regression residuals, and preferably low correlations among the residuals, so that we have lower standard errors on betas, or factor loadings. By sorting on characteristics that purport to be informative about alpha, or the cross-sectional difference of expected returns, variance should also be lower-bounded (or you will have an infinite Sharpe ratio). Indeed, impressively high $R^2$ values (in the $90+\%$ range) were achieved in Fama-French's quantile portfolios, which means that not only do they find dispersion in expected returns, but they also find that these stocks move together, i.e. there exists some correlation structure. Indeed, in his [presidential address](https://www.evernote.com/shard/s191/nl/21353936/454d91df-398b-4080-b7ec-aa7b6adbca4d?title=Presidential%20Address:%20Discount%20Rates), John Cochrane even describes covariance as 'Fama-French central result'.

One drawback though is sorting on more than two attributes can be challenging, and the impact of one attribute controlling others is not as transparent as linear regression. As such, Cochrane famously says that 'we may have to run regressions in some sort'.

[< Empirical Asset Pricing >](https://www.evernote.com/shard/s191/nl/21353936/fc98fb57-d94e-48fc-b709-610337ef92b0?title=G.BALI%EF%BC%8DEMPIRICAL_ASSET_PRICING_2016.pdf) has a chapter that delves into the technical details of sorting, such as what happens when a stock's attribute lies on the borderline of two consecutive quantiles, whether multiple sorts should be done independently or conditionally, etc.
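
A single-period, single-attribute sort can be sketched as follows. This is a simplified illustration: the function name `sort_portfolios` is made up, portfolios are equal-weighted, groups are formed by splitting the ranked stocks into equal-sized buckets, and ties are broken by original order, all of which are assumptions that a careful implementation would revisit.

```python
import numpy as np

def sort_portfolios(characteristic, returns, n_groups=5):
    """Equal-weighted return of each quantile portfolio for one period.

    Stocks are ranked by the characteristic and split into n_groups buckets
    of (nearly) equal size, from lowest to highest characteristic.
    """
    order = np.argsort(characteristic, kind="stable")
    groups = np.array_split(order, n_groups)
    return np.array([returns[g].mean() for g in groups])

# Toy cross-section of 10 stocks where returns increase with the attribute.
characteristic = np.arange(10.0)
returns = 0.01 * np.arange(10.0)

portfolio_returns = sort_portfolios(characteristic, returns, n_groups=5)
high_minus_low = portfolio_returns[-1] - portfolio_returns[0]
print(portfolio_returns, high_minus_low)
```

Repeating this every period yields the time series of portfolio excess returns used as $y$ and the high-minus-low series used as $X$.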

#### Fama-MacBeth Regression

The data for this exercise are the time series of individual returns, $y_{nt}, n=1, \dots, N, t=1, \dots, T$ (think stock returns), and the corresponding attributes/characteristics $c_{ntk}, k=1, \dots, K$, each of which is considered a noisy signal of some risk factor (think log market cap, book-to-market ratios, etc). 

There seem to be two different yet related descriptions of the Fama-MacBeth (FM) procedure in the textbooks.

- **[< Asset Pricing >](https://www.evernote.com/shard/s191/nl/21353936/b33b7ea8-e993-74b0-8d10-3c5e6795a578?title=John%20H.%20Cochrane%20-%20Asset%20Pricing), Chapter 12 (also the classical one)**
   - For any given security $n$, at each time period $t$, perform a rolling time-series regression of $y_{n(t-s)},\dots,y_{nt}$ on the attributes $c_{n(t-s)k},\dots,c_{ntk}$, $k=1,\dots,K$. The outcome is the time-varying beta estimates $\hat{\beta}_{nk}^{(t)}$. Instead of a rolling time-series regression, you can also just use the full sample and obtain a beta estimate that is constant through time.
   - Given the estimates of betas corresponding to $t$, perform a cross-sectional regression: 
    $$y_{nt}=\beta_{n1}^{(t)}\lambda_{t1}+\dots +\beta_{nK}^{(t)}\lambda_{tK}+\alpha_{nt}, n=1,\dots, N.$$
    Note that now the $\lambda$ are a function of $t$ and common across all securities. They are considered the 'price of risk' at time $t$, and testing whether $\lambda_k=0$ indicates whether the $k$-th factor is priced. Meanwhile, asset pricing models prescribe $\alpha=0$ (no pricing error). FM estimates $\lambda_k$ and $\alpha_n$ as the time-series averages:
    $$\hat\lambda_k=\frac{1}{T}\sum_{t=1}^T\hat{\lambda}_{tk}, k=1,\dots, K\;\;\;\text{and}\;\;\;\hat\alpha_n=\frac{1}{T}\sum_{t=1}^T\hat{\alpha}_{nt},$$
     and their standard errors are estimated from the time-series variation (with or without allowing for serial correlation). With these we can carry out hypothesis testing on $\lambda$ and $\alpha$.

- **[< Empirical Asset Pricing >](https://www.evernote.com/shard/s191/nl/21353936/fc98fb57-d94e-48fc-b709-610337ef92b0?title=G.BALI%EF%BC%8DEMPIRICAL_ASSET_PRICING_2016.pdf), Chapter 6**
  - For each given time $t$, regress $y_{nt}$ across $n$ (cross-sectionally) on the attributes (or you can combine several past periods at time $t$ and essentially do a rolling time-series regression):
 $$y_{nt}=\beta_{t0}+\beta_{t1}c_{nt1}+\dots +\beta_{tK}c_{ntK}+\epsilon_{nt}, n=1,\dots, N.$$
 Notice how the set of $\beta$'s depends only on $t$. From the above regression, we obtain the OLS estimates $\hat{\beta}_{tk}, k=1, \dots, K$, again for the given $t$. That is, we have a time series of estimated betas. 
  
  - There is no second regression. Rather, the time-series statistics of beta above are summarized and hypothesis testing is performed on whether $\beta=0$. Due to possible serial correlation of $\beta$ across time, correction due to HAC is required.
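
The second (characteristics-based) description is straightforward to sketch in NumPy. The function name `fama_macbeth`, the array layout, and the simulated data are assumptions for illustration; in particular, the t-statistics below use plain time-series standard errors and ignore serial correlation, so they would need a HAC adjustment when the period-by-period estimates are autocorrelated.

```python
import numpy as np

def fama_macbeth(returns, chars):
    """Period-by-period cross-sectional OLS, then time-series averages.

    returns : array of shape (T, N)
    chars   : array of shape (T, N, K)
    Returns the mean coefficients (intercept first) and their t-statistics.
    """
    T, N, K = chars.shape
    coefs = np.empty((T, K + 1))
    for t in range(T):
        Z = np.column_stack([np.ones(N), chars[t]])
        coefs[t] = np.linalg.lstsq(Z, returns[t], rcond=None)[0]
    mean_coef = coefs.mean(axis=0)
    # Plain standard error of the mean: assumes no serial correlation across t.
    std_err = coefs.std(axis=0, ddof=1) / np.sqrt(T)
    return mean_coef, mean_coef / std_err

# Simulated panel: only the first characteristic is related to returns.
rng = np.random.default_rng(0)
T, N, K = 120, 200, 2
chars = rng.standard_normal((T, N, K))
returns = 0.5 * chars[:, :, 0] + rng.standard_normal((T, N))

mean_coef, t_stats = fama_macbeth(returns, chars)
print(mean_coef, t_stats)
```

Because each period is fitted separately, the same loop would also accommodate a cross-section whose size changes over time, provided the inputs are stored per period rather than in a rectangular array.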

The differences between the above two procedures are apparent.
- The classical procedure is a bit more complicated to implement and think about, since it involves two regressions. But it directly tests the implications of asset pricing models and directly relates linear factors to the stochastic discount factor (SDF); see discussions in [< Asset Pricing >](https://www.evernote.com/shard/s191/nl/21353936/b33b7ea8-e993-74b0-8d10-3c5e6795a578?title=John%20H.%20Cochrane%20-%20Asset%20Pricing). This is what was done in famous papers such as Black, Jensen and Scholes's test of the CAPM.
- The latter procedure is easier to think about and implement. However, it does not test pricing errors (alpha) and thus has no direct relation to the SDF; rather, it tests more generally whether characteristics/attributes affect cross-sectional returns.

There are many benefits of using (classical) FM, and it is used beyond asset pricing, in areas such as corporate finance.
- In testing the linear specification of the SDF, the factors need not be traded excess returns in FM, whereas that is a restriction on time-series tests; details on the derivation of the testable implication can be found in [< Asset Pricing >](https://www.evernote.com/shard/s191/nl/21353936/b33b7ea8-e993-74b0-8d10-3c5e6795a578?title=John%20H.%20Cochrane%20-%20Asset%20Pricing).
- It allows for a time-varying beta - something that the time-series test cannot accommodate.
- Its hypothesis testing of $\alpha$ and $\lambda$ has a strong economic interpretation.
- It also allows for a changing number of assets.

The FM procedure (the classical one at least) is also closer to the early tests of CAPM: think Jensen, Black and Scholes, in that expected returns are plotted against the security market line prescribed by beta, or expected returns should be linear in beta. In this regard, one can see FM's natural connection to the two-pass cross-sectional regression below.

As mentioned above, it is easy to incorporate serial correlation into the standard errors of $\alpha$ and $\lambda$, although in asset pricing applications assuming away serial correlation may not be a big problem. But FM does not allow for a rigorous correction for the fact that $\beta$ is estimated, unlike Shanken's correction in two-pass cross-sectional regression (see below). The ad hoc practice is to compute and report Shanken's correction alongside the FM results.

#### Two-pass Cross-sectional Regression
Similar to the Fama-MacBeth procedure above, the two-pass regression is proposed to test asset pricing models. The difference between two-pass and FM is that, in the second step, rather than running regressions for different $t$, the two-pass procedure does a single cross-sectional regression, regressing $\frac{1}{T}\sum_{t=1}^Ty_{nt}$ onto the same time-series average of $\beta$; see Section 12.2 of [< Asset Pricing >](https://www.evernote.com/shard/s191/nl/21353936/b33b7ea8-e993-74b0-8d10-3c5e6795a578?title=John%20H.%20Cochrane%20-%20Asset%20Pricing). This is closest in concept to the theory, since the asset pricing relation is between expected return and beta.

The two-pass procedure postdates FM, and is more rigorous in estimating standard errors, in that it accounts for the fact that $\beta$ is estimated, a correction due to Shanken. It is said this correction term can be large if the factors are not traded assets, but macro factors; see Lochstoer's lecture notes, Topic 1, Page 26.
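
The two passes can be sketched on simulated data with one factor as follows. Everything here is an illustrative assumption: the single traded factor with mean 0.5, the full-sample (constant) betas, and the inclusion of an intercept in the second pass. The sketch reports point estimates only and does not implement Shanken's correction to the standard errors.

```python
import numpy as np

rng = np.random.default_rng(1)
T, N = 600, 25
true_betas = np.linspace(0.5, 1.5, N)
factor = 0.5 + rng.standard_normal(T)                      # hypothetical factor, mean 0.5
excess_returns = np.outer(factor, true_betas) + rng.standard_normal((T, N))

# Pass 1: full-sample time-series regression of each asset on the factor.
Z = np.column_stack([np.ones(T), factor])
ts_coef = np.linalg.lstsq(Z, excess_returns, rcond=None)[0]   # shape (2, N)
beta_hat = ts_coef[1]

# Pass 2: one cross-sectional regression of average returns on estimated betas.
mean_returns = excess_returns.mean(axis=0)
W = np.column_stack([np.ones(N), beta_hat])
intercept, price_of_risk = np.linalg.lstsq(W, mean_returns, rcond=None)[0]

print(intercept, price_of_risk, factor.mean())
```

In this simulation the estimated price of risk should land close to the sample mean of the factor, and the second-pass intercept close to zero, as the model has no pricing errors by construction.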

The drawback of the two-pass procedure above (**TODO**)