# Linear Regression

## Model Specification

$$y = X\beta + \epsilon$$ 

Here $y\in N\times1$ is the vector of the dependent variable, $X\in N\times (p+1)$ is the design matrix, usually including the intercept - thus there are $p$ regressors, and $\beta\in (p+1)\times 1$ collects the intercept and slope coefficients. $E(\epsilon)=0$. 

### Key assumptions and the OLS estimator
The usual Ordinary Least Squares *assumptions* are

- A1: **Conditional Mean Independence or Exogeneity** $E(\epsilon_n|x_n)=0$: the error term, conditional on the corresponding regressors (not on the regressors $x_m$ of other observations, where $m\neq n$), has a mean of zero. This also implies that $x_n$ is uncorrelated with $\epsilon_n$.
- A2: **i.i.d.** $(x_n, y_n), n=1,\dots, N$ are i.i.d. **Note**: this can be relaxed to allow for autocorrelation in time-series regression.
- A3: **No outliers**: $X$ and $y$ have nonzero finite fourth moments, i.e. large outliers are unlikely. This serves as a reminder that the OLS estimators can be very sensitive to outliers, which is one motivation for winsorization - the flooring and capping of $X$ and $y$ values at chosen low and high quantiles.
    - To diagnose outliers, one can refer to **Cook's distance**.
- A4: **Full rank of $X$** - **no perfect multicollinearity**, otherwise $(X^{T}X)^{-1}$ does not exist. The definition and implications of imperfect collinearity will be discussed below.
- A5: **Homoskedastic Errors**: $var(\epsilon_n|x_n)=\sigma^2$. **Note**: later sections will talk about how to relax this and allow for heteroskedasticity.
- A6: **Conditional Normality**: $\epsilon_n|x_n \sim N(0, \sigma^2)$ - this alone implies A1 and A5, although normality by itself does not require homoskedasticity (the variance could depend on $x_n$).
    - To ascertain A6, we can plot the residuals against normal in a **QQ plot**.
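
As a rough illustration of the outlier diagnostic mentioned under A3, the sketch below computes Cook's distance by hand with NumPy, using the textbook formula $D_n = \frac{e_n^2}{(p+1)\hat{\sigma}^2}\frac{h_{nn}}{(1-h_{nn})^2}$, where $h_{nn}$ is the $n$-th diagonal entry of $X(X^{T}X)^{-1}X^{T}$. The data are synthetic and the contaminated first observation is a hypothetical choice made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # intercept + p regressors
X[0, 1] = 8.0                                   # hypothetical high-leverage point
beta_true = np.array([1.0, 2.0, -1.0])          # assumed coefficients for the simulation
y = X @ beta_true + rng.normal(scale=0.5, size=N)
y[0] -= 15.0                                    # contaminate the same observation

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# Leverages are the diagonal of the projection matrix X (X^T X)^{-1} X^T.
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)

s2 = resid @ resid / (N - p - 1)                # unbiased estimate of sigma^2
cooks_d = resid**2 / ((p + 1) * s2) * h / (1.0 - h) ** 2

worst = int(np.argmax(cooks_d))
print("most influential observation:", worst, "Cook's distance:", cooks_d[worst])
```

A common rule of thumb is to inspect observations whose Cook's distance is far larger than the rest; the threshold is a judgment call rather than a hard rule.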

The **Ordinary Least Squares estimator** for $\beta$ is $\hat{\beta}=(X^{T}X)^{-1}(X^{T}y)$. 

- If $X, y$ are viewed as realizations of random vectors and the columns are demeaned, then $X^{T}X$ is the sample covariance matrix of $X$, and $X^{T}y$ the sample covariance of $X$ and $y$, both up to a constant that cancels in the OLS formula above (without demeaning they are the corresponding sample second moments).
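
A minimal NumPy sketch of the estimator, on synthetic data with assumed coefficients: it solves the normal equations, compares against `np.linalg.lstsq`, and checks the sample-moment view for the slope coefficients (demeaned regressors, intercept handled separately).

```python
import numpy as np

rng = np.random.default_rng(42)
N, p = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])  # design matrix with intercept
beta_true = np.array([0.5, 1.0, -2.0, 3.0])                 # assumed for the simulation
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Normal equations: solve (X^T X) beta = X^T y instead of forming the inverse explicitly.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Sample-moment view: slopes = (sample cov of X)^{-1} (sample cov of X and y).
# The 1/N constants cancel, as noted in the text.
Xc = X[:, 1:] - X[:, 1:].mean(axis=0)
yc = y - y.mean()
slopes_moment = np.linalg.solve(Xc.T @ Xc / N, Xc.T @ yc / N)

print(beta_normal)
print(slopes_moment)
```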

With the OLS estimates, the **predicted values of $y$** are given by 

$$\hat{y}=X\hat{\beta}=X(X^{T}X)^{-1}X^{T}y,$$

where the matrix $H:=X(X^{T}X)^{-1}X^{T}$ is sometimes called the **hat matrix**, since it puts the hat on $y$. It is also a **projection matrix**, since $\hat{y}$ is the orthogonal projection of $y$ onto the column space of $X$ - one consequence is that $y-\hat{y}$ is orthogonal to $\hat{y}$. The estimate for $\sigma^2$ is 

$$\hat{\sigma^2}=\frac{1}{N-p-1}\sum_{n=1}^N(y_n-\hat{y_n})^2=\frac{1}{N-p-1}y^{\top}(I-H)y=\frac{1}{N-p-1}SSR$$

See the description of SSR below. Under A6, because $H$ is a projection matrix (so the residuals $(I-H)y$ are orthogonal to $\hat{y}$, and under normality independent of it), $\hat{\sigma^2}$ is independent of $\hat{\beta}$.
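
The projection properties of $H$ can be checked numerically. The sketch below uses synthetic data (the coefficients are assumptions made for the example) and verifies idempotence, symmetry, $\mathrm{tr}(H)=p+1$, orthogonality of residuals and fitted values, and the two equivalent expressions for $\hat{\sigma^2}$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 30, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=N)

H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix X (X^T X)^{-1} X^T
y_hat = H @ y
resid = y - y_hat

is_idempotent = np.allclose(H @ H, H)              # projecting twice changes nothing
is_symmetric = np.allclose(H, H.T)                 # orthogonal projections are symmetric
trace_ok = np.isclose(np.trace(H), p + 1)          # trace equals the rank of X
orthogonal = np.isclose(resid @ y_hat, 0.0, atol=1e-8)

# Two equivalent ways to compute the variance estimate.
sigma2_hat = resid @ resid / (N - p - 1)
sigma2_quadform = y @ (np.eye(N) - H) @ y / (N - p - 1)

print(is_idempotent, is_symmetric, trace_ok, orthogonal)
print(sigma2_hat, sigma2_quadform)
```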

#### Intuitions about OLS estimator

The OLS $\hat\beta$ solves for the projection of $y$ onto the column space of $X$, such that the sum of squared residuals is minimized.
- In this way, the residuals are the projection of $y$ onto the **orthogonal complement of the column space of $X$**.
- It is possible that $y$ is almost orthogonal to each of $x_1$ and $x_2$ individually, while $x_1$ and $x_2$ together span $y$.
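
One way to construct the last case, as a sketch: take a common component $z$ orthogonal to $y$ and set $x_1 = z + \delta y$, $x_2 = z - \delta y$ for a small $\delta$. Each regressor is then nearly orthogonal to $y$, yet $y = (x_1 - x_2)/(2\delta)$ exactly. The construction and the value of `delta` are illustrative assumptions, and the regression is run without an intercept for simplicity.

```python
import numpy as np

rng = np.random.default_rng(7)
N = 1000
z = rng.normal(size=N)
y = rng.normal(size=N)
y -= (y @ z) / (z @ z) * z        # make y exactly orthogonal to z

delta = 0.01                      # hypothetical small loading on y
x1 = z + delta * y
x2 = z - delta * y                # hence y = (x1 - x2) / (2 * delta)

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

cos1, cos2 = cosine(y, x1), cosine(y, x2)

X = np.column_stack([x1, x2])     # no intercept in this toy regression
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

print("cosines with y:", cos1, cos2)          # both close to zero
print("coefficients:", beta_hat)              # close to [1/(2*delta), -1/(2*delta)]
print("residual norm:", np.linalg.norm(resid))
```

The large, offsetting coefficients are also a preview of what near-collinear regressors do to OLS estimates.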

### [Hypothesis Testing](linear-regression-hypothesis-testing)

- Z-score and t-stat
    - How do they change if data is duplicated, missing or otherwise changed?

- F test
    - What does it test?
    - What assumption of A1-A6 does it rely on?
    - What is its relationship with the z-score above?

### [Regressors subset selection](linear-regression-feature-selection.ipynb)

- What are the motivations to do feature selection?

- Best-Subset Selection
    - cross-validation
    - AIC/BIC
 
- Forward-stepwise Selection
    - There is a clever updating algorithm that can exploit the QR decomposition.
    - Can it be done when $p\gg N$?

- Backward-stepwise Selection
    - Can it be done when $p\gg N$?
 
- Forward-Stagewise Selection

### [Shrinkage Methods](linear-regression-shrinkage.ipynb)

- Are shrinkage methods scale invariant?
- Do we penalize the intercept in shrinkage? 
- What are the exact mathematical formulations of L1 and L2 regularization, and their equivalent constrained-optimization forms?
- What are ridge and lasso in a Bayesian perspective?
- How do ridge and lasso behave along the principal components of $X$?
- What is the motivation of elastic net? What is its mathematical formulation?

### [Variants and Generalizations ](linear-regression-variants-generalizations.ipynb)

- Time-series regression
- HAC
- LWLR
- PCR
- PLS

Below are the usual regressions run in the empirical asset pricing literature
- Portfolio sort analysis
- Fama-MacBeth
- Two-Pass Cross Sectional Regression

### [Theoretical Properties](linear-regression-theoretical.ipynb)
- Gauss Markov Theorem
- Large- and Small-sample properties of the OLS estimators

### Advantages

- Linear regression is very simple, and as such not likely to have a large variance or overfit the data.
- The relation and impact of the regressors on the dependent variable is very clear: just the components of $\beta$.

### Disadvantages

- Linear regression, assuming linear relations, can have a large bias.

### Relation to Other Models

- When one replaces the squared loss with the cross-entropy (log) loss applied to a sigmoid of $X\beta$, it becomes [logistic regression](logistic_regression.ipynb).
- High-order terms of the regressors and/or cross-moments can be easily incorporated as new regressors; this takes us into the land of non-linear regression, and is related to [support vector machines](SVM.ipynb).

## Empirical Performance

### Advantages 

- Linear regressions are **very quick to implement and train** compared to other machine learning models, which makes them an **ideal base model**.
- For prediction purposes, **linear regression models can sometimes outperform fancier nonlinear models, especially in situations with a small number of training cases, a low signal-to-noise ratio or sparse data**. This is related to the low-variance, high-bias nature of linear regressions. Note that this superior performance in forecasting need not have a causal interpretation: if you see pedestrians carrying umbrellas, you might forecast rain, even though carrying an umbrella does not cause rain.

### Disadvantages

- For linear regressions, **additional data preprocessing may be required**. 
    - For example, it is known that outliers can greatly skew the estimates, so outliers need to be either winsorized (capped or floored at the corresponding quantiles) or discarded outright - alas, in many applications this may not be desired, since outliers are also what actually happens in real life. Another way to tame outliers is via transforms, such as taking logs.
- In the case where we have many regressors, one should also **beware of perfect multicollinearity** (in which the assumptions about OLS are violated) and imperfect multicollinearity (see discussions below).
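
The winsorization step described above can be sketched with NumPy as follows. The helper name `winsorize` and the 1%/99% cutoffs are assumptions for this example; the appropriate quantiles depend on the application.

```python
import numpy as np

def winsorize(values, lower_pct=1.0, upper_pct=99.0):
    """Floor and cap `values` at the given lower and upper percentiles."""
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)

rng = np.random.default_rng(3)
x = rng.normal(size=500)
x[:3] = [25.0, -30.0, 40.0]       # three artificial outliers

x_w = winsorize(x)

print("raw range:       ", x.min(), x.max())
print("winsorized range:", x_w.min(), x_w.max())
print("values changed:  ", int(np.sum(x_w != x)))
```

Unlike discarding observations, this keeps the sample size unchanged; only the most extreme values are pulled in to the cutoffs.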

## Implementation Details and Practical Tricks

The `scikit-learn` package does not seem to offer as much functionality for linear regression as `statsmodels`. Thus in the following we mainly discuss the usage of `statsmodels`.

In [1]:
import numpy as np
import statsmodels.api as sm

data = sm.datasets.stackloss.load()
y = data.endog
X = data.exog
# Y = [1, 3, 4, 5, 2, 3, 4]
# X = range(1,8)
X = sm.add_constant(X) # handy for adding constants for regressors
model = sm.OLS(y, X) # note that df and ssr calculation will be distorted by weights if we are using WLS
results = model.fit() 
results.summary() # display all sorts of statistics and metrics: see the section on Results Interpretation, Metrics and Visualization below.



| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.914 |
| Model: | OLS | Adj. R-squared: | 0.898 |
| Method: | Least Squares | F-statistic: | 59.9 |
| Date: | Thu, 20 Jan 2022 | Prob (F-statistic): | 3.02e-09 |
| Time: | 22:19:37 | Log-Likelihood: | -52.288 |
| No. Observations: | 21 | AIC: | 112.6 |
| Df Residuals: | 17 | BIC: | 116.8 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -39.9197 | 11.896 | -3.356 | 0.004 | -65.018 | -14.821 |
| x1 | 0.7156 | 0.135 | 5.307 | 0.000 | 0.431 | 1.000 |
| x2 | 1.2953 | 0.368 | 3.520 | 0.003 | 0.519 | 2.072 |
| x3 | -0.1521 | 0.156 | -0.973 | 0.344 | -0.482 | 0.178 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.713 | Durbin-Watson: | 1.485 |
| Prob(Omnibus): | 0.7 | Jarque-Bera (JB): | 0.14 |
| Skew: | -0.193 | Prob(JB): | 0.932 |
| Kurtosis: | 3.107 | Cond. No. | 1810.0 |


In [2]:
import statsmodels
statsmodels.__version__

'0.10.0'

In [10]:
# dir(results)

In [4]:
results.ssr

178.82996159835858

In [5]:
results.mse_resid * results.df_resid

178.82996159835858

In [6]:
results.resid

array([ 3.23463723, -1.91748529,  4.555533  ,  5.69777417, -1.71165358,
       -3.0069397 , -2.38949071, -1.38949071, -3.1443789 ,  1.26719408,
        2.63629676,  2.77946036, -1.42856088, -0.05049929,  2.36141836,
        0.9050508 , -1.51995059, -0.45509295, -0.59825656,  1.41214728,
       -7.23771286])

In [7]:
results.resid.dot(results.resid)

178.82996159835858

In [8]:
results.params

array([-39.91967442,   0.7156402 ,   1.29528612,  -0.15212252])

To do a regularized fit, call `fit_regularized` instead: it fits an elastic net, of which both lasso and ridge are special cases.

In [9]:
model.fit_regularized(method='elastic_net', alpha=0.0, L1_wt=1.0, start_params=None, profile_scale=False, refit=False)

<statsmodels.base.elastic_net.RegularizedResultsWrapper at 0x7f7764fe6400>

### Some commonly used inputs
- `method`: `'elastic_net'` and `'sqrt_lasso'` are currently implemented.
- `alpha`: The penalty weight. If a scalar, the same penalty weight applies to all variables in the model. If a vector, it must have the same length as params, and contains a penalty weight for each coefficient.
- `L1_wt`: The fraction of the penalty given to the L1 penalty term. Must be between 0 and 1 (inclusive). If 0, the fit is a ridge fit, if 1 it is a lasso fit.

There are other inputs that allow for parallelization.

### [Further Practical Concerns](linear-regression-practical-numerical.ipynb)

#### Hypothesis tests on the residuals of linear regression

- Jarque-Bera test
- Durbin-Watson test/Ljung-Box test
- Omnibus test

#### QR Decomposition: what is the formula of the OLS under QR decomposition of $X$?
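
As a brief sketch of the answer: with the reduced QR decomposition $X = QR$ ($Q$ has orthonormal columns, $R$ is upper triangular), the normal equations reduce to $R\hat{\beta} = Q^{T}y$, and the hat matrix becomes $H = QQ^{T}$. The example below checks this on synthetic data with assumed coefficients.

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, -1.0, 2.0, 0.5]) + rng.normal(size=N)

# Reduced QR: Q is N x (p+1) with orthonormal columns, R is (p+1) x (p+1) upper triangular.
Q, R = np.linalg.qr(X)

# Solve R beta = Q^T y. A dedicated triangular solver (back-substitution) could be
# used here; np.linalg.solve is kept to stay within NumPy.
beta_qr = np.linalg.solve(R, Q.T @ y)

# Reference: normal equations.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Fitted values via H = Q Q^T, without forming the N x N matrix.
y_hat_qr = Q @ (Q.T @ y)

print(beta_qr)
print(beta_normal)
```

The QR route avoids forming $X^{T}X$, whose condition number is the square of that of $X$, so it is generally the more numerically stable choice.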

#### Is linear regression still possible when $N$ is too large to fit in the memory?
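
One possible approach, sketched under the assumption that $p$ is small enough for a $(p+1)\times(p+1)$ matrix to fit in memory: since $X^{T}X=\sum_c X_c^{T}X_c$ and $X^{T}y=\sum_c X_c^{T}y_c$ over row chunks $c$, the two moment matrices can be accumulated one chunk at a time. The helpers `iter_chunks` and `ols_from_chunks` are hypothetical names; in practice the chunks would be read from disk rather than sliced from an in-memory array.

```python
import numpy as np

def iter_chunks(X, y, chunk_size):
    """Yield consecutive row blocks; stands in for reading chunks from disk."""
    for start in range(0, len(y), chunk_size):
        yield X[start:start + chunk_size], y[start:start + chunk_size]

def ols_from_chunks(chunks, k):
    """Accumulate X^T X and X^T y chunk by chunk, then solve the normal equations."""
    xtx = np.zeros((k, k))
    xty = np.zeros(k)
    for X_chunk, y_chunk in chunks:
        xtx += X_chunk.T @ X_chunk
        xty += X_chunk.T @ y_chunk
    return np.linalg.solve(xtx, xty)

rng = np.random.default_rng(9)
N, p = 10_000, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ rng.normal(size=p + 1) + rng.normal(size=N)

beta_chunked = ols_from_chunks(iter_chunks(X, y, chunk_size=1_000), k=p + 1)
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_chunked)
print(beta_full)
```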

#### Imperfect Collinearity: what are the diagnostics and remedies?

## Use Cases

- It is almost universal, even when linearity is not considered to be an appropriate assumption: it is used as a base case or benchmark.
- Even if the regressors do not describe *causal relations* well, the model can still be used for forecasting if $\beta$ is stable and $R^2$ is high.

## Results Interpretation, Metrics and Visualization

### TSS, ESS, SSR and Standard Errors

The quantities below lay the foundation for measuring the goodness-of-fit of regression models.

- **Total Sum of Squares**, or TSS, is the variation in the data of $y$: $TSS = \sum_{n=1}^N(y_n-\bar{y})^2$
- **Explained Sum of Squares**, or ESS, is the variation explained by the estimated value of the linear regression: $ESS = \sum_{n=1}^N(\hat{y_n}-\bar{y})^2$
- **Sum of Squared Residuals**, or SSR, is the portion that is unexplained by the regression: $SSR = \sum_{n=1}^N(y_n-\hat{y_n})^2$
- **Standard Error of the Regression**, or SER, estimates the standard deviation of the errors, under the assumption that they are homoskedastic: $SER = \sqrt{\frac{1}{N-p-1}SSR}$. Its square is the unbiased variance estimate $\hat{\sigma^2}$.
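
The four quantities can be computed directly from their definitions. The sketch below uses synthetic data (the coefficients and the noise scale of 2 are assumptions) and also checks the decomposition $TSS = ESS + SSR$, which holds when the regression includes an intercept.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 500, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.array([1.0, 3.0, -2.0]) + rng.normal(scale=2.0, size=N)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
y_bar = y.mean()

tss = np.sum((y - y_bar) ** 2)        # total variation in y
ess = np.sum((y_hat - y_bar) ** 2)    # variation explained by the fit
ssr = np.sum((y - y_hat) ** 2)        # unexplained variation
ser = np.sqrt(ssr / (N - p - 1))      # standard error of the regression

print("TSS:", tss, "ESS + SSR:", ess + ssr)
print("SER:", ser)                    # should be near the simulated noise scale
```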

### $R^2$ and adjusted $R^2$

$R^2$ is the portion of variation in $y$ that is explained by the regression, so sometimes it is also called the **coefficient of determination**: 

$$R^2=\frac{ESS}{TSS}=1-\frac{SSR}{TSS}$$

The adjusted $R^2$, or $\bar{R^2}$, is a modified version of the $R^2$ that does not necessarily increase when a new regressor is added:

$$\bar{R^2}=1-\frac{N-1}{N-p-1}\frac{SSR}{TSS}$$

- Thus **$\bar{R^2}$ is always less than $R^2$** (whenever $p\geq 1$ and $SSR>0$). 
- Also, adding a regressor (weakly) decreases the ratio $\frac{SSR}{TSS}$ yet increases the factor $\frac{N-1}{N-p-1}$, so it does not necessarily increase $\bar{R^2}$. 
- Finally, $\bar{R^2}$ can be negative, when $SSR$ is not sufficiently smaller than $TSS$ to offset the factor $\frac{N-1}{N-p-1}$, i.e. when $\frac{SSR}{TSS}>\frac{N-p-1}{N-1}$.
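
The following sketch illustrates these properties by fitting a small model and then adding five pure-noise regressors. The helper `r2_and_adjusted` and the simulated data are assumptions for this example; whether $\bar{R^2}$ rises or falls after adding noise depends on the particular draw.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Return (R^2, adjusted R^2); X must already contain the intercept column."""
    N, k = X.shape                      # k = p + 1
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    ssr = np.sum((y - X @ beta_hat) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ssr / tss
    adj_r2 = 1.0 - (N - 1) / (N - k) * ssr / tss
    return r2, adj_r2

rng = np.random.default_rng(4)
N = 40
x = rng.normal(size=N)
y = 1.0 + 0.5 * x + rng.normal(size=N)

X_small = np.column_stack([np.ones(N), x])
X_big = np.column_stack([X_small, rng.normal(size=(N, 5))])  # add five noise regressors

r2_small, adj_small = r2_and_adjusted(X_small, y)
r2_big, adj_big = r2_and_adjusted(X_big, y)

print("small model: R2 =", r2_small, "adjusted R2 =", adj_small)
print("big model:   R2 =", r2_big, "adjusted R2 =", adj_big)
```

$R^2$ never falls when regressors are added, while the adjusted version pays a penalty for each extra regressor.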

$R^2$ and adjusted $R^2$ are the standard goodness-of-fit measures for linear regression models: adjusted $R^2$ can be used to compare different models. But there are things that $R^2$ will not tell you:

- An increase in the $R^2$ and adjusted $R^2$ does **not necessarily mean that an added variable is statistically significant**. 
    - On the one hand, $R^2$ never decreases when adding extra regressors. 
    - On the other hand, whether a new variable is statistically significant or not is the job of its Z-score, not $R^2$. In fact, one can have a high
- A high $R^2$ and adjusted $R^2$ does **not mean that the regressors are a true cause of the dependent variable**, i.e. $R^2$ has nothing to do with causality.
- A high $R^2$ and adjusted $R^2$ does **not mean that there is no omitted variable**, and thus a high $R^2$ does not indicate that the regression is free of omitted variable bias.
- A high $R^2$ and adjusted $R^2$ does not necessarily mean that you have the most appropriate set of regressors, nor does a low $R^2$ and adjusted $R^2$ necessarily mean that you have an inappropriate set of regressors - **it can be just that you have very low signal-to-noise ratio**. 

### [Assessing the validity of linear regression](linear-regression-validity.ipynb)

- Omitted variables: the correlation between $X$ matters
- Functional form misspecification
- Error in variables: matters how the measurement error correlates with $X$
- Sample mis-selection and missing data: different types
- Simultaneous causality.

**Incorrect calculation of the standard errors** also poses a threat to internal validity. Homoskedasticity-only standard errors are invalid if heteroskedasticity is present. If the variables are not independent across observations, as can arise in panel and time series data, then a further adjustment to the standard error formula is needed to obtain valid standard errors. See the discussion of HAC estimators above.
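
To illustrate the first point, the sketch below compares homoskedasticity-only standard errors with White's heteroskedasticity-robust (HC0) standard errors, $(X^{T}X)^{-1}X^{T}\mathrm{diag}(e_n^2)X(X^{T}X)^{-1}$, on synthetic data whose error variance grows with the regressor. The data-generating process is an assumption for this example, and HC0 only addresses heteroskedasticity; it is not a HAC estimator and does not correct for dependence across observations.

```python
import numpy as np

rng = np.random.default_rng(11)
N = 2000
x = rng.uniform(0.0, 3.0, size=N)
X = np.column_stack([np.ones(N), x])
eps = rng.normal(size=N) * (0.2 + x**2)     # error scale grows with x: heteroskedastic
y = 1.0 + 2.0 * x + eps

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Homoskedasticity-only covariance: sigma^2 (X^T X)^{-1}.
s2 = resid @ resid / (N - X.shape[1])
se_classical = np.sqrt(np.diag(s2 * XtX_inv))

# HC0 sandwich: (X^T X)^{-1} [X^T diag(e^2) X] (X^T X)^{-1}.
meat = X.T @ (resid[:, None] ** 2 * X)
se_hc0 = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("classical SE:", se_classical)
print("HC0 SE:      ", se_hc0)
```

In this setup the classical formula understates the uncertainty in the slope, so t-statistics based on it would be too large.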

## References 

- [ESL](https://www.evernote.com/shard/s191/nl/21353936/c2a0e9ac-da49-4fee-8701-3cd70fc42134?title=The%20Elements%20of%20Statistical%20Learning_print12.pdf), Chapters 2, 18.6
- [< Introduction to Econometrics, 3e >](https://www.evernote.com/shard/s191/nl/21353936/23a3b1a5-8f90-47a5-b796-d29931ba8db3?title=Introduction_to_Econometrics%EF%BC%8CUpdate%EF%BC%8C3e.pdf), Chapters 4-7, 9, 12, 14-16
- [< Empirical Asset Pricing >](https://www.evernote.com/shard/s191/nl/21353936/fc98fb57-d94e-48fc-b709-610337ef92b0?title=G.BALI%EF%BC%8DEMPIRICAL_ASSET_PRICING_2016.pdf) 2016, Part I.
- Wikipedia about collinearity
- MLEDU, Lectures 6, 7.
- [< Asset Pricing >](https://www.evernote.com/shard/s191/nl/21353936/b33b7ea8-e993-74b0-8d10-3c5e6795a578?title=Asset%20Pricing), Revised Edition. John Cochrane, 2005.
- [Lars Lochstoer's lecture notes](https://www.evernote.com/shard/s191/nl/21353936/f0377478-137e-8dd9-5b05-29cf8da88dc4?title=Lochstoer's%20Asset%20Pricing%20II%20-%20Course%20Materials), Topic 1.

### Further Reading

## Misc.