**Residual testing** to gauge whether linear regression is appropriate: [notebook](linear-regression-hypothesis-testing.ipynb).

### QR Decomposition - when the number of columns of $X$, $p$, is large

When the column dimension of $X$, i.e. $p$, is extremely large, software usually uses a QR decomposition to compute the OLS estimator rather than inverting $X^{T}X$ directly.

$$X = QR$$

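where $Q\in \mathbb{R}^{N\times(p+1)}$ has orthonormal columns, $Q^{T}Q=I$, and $R\in \mathbb{R}^{(p+1)\times(p+1)}$ is an upper triangular matrix. This decomposition arises from applying the Gram-Schmidt procedure to the columns of $X$. Substituting $X=QR$ into the normal equations $X^{T}X\hat{\beta}=X^{T}y$ gives $R^{T}R\hat{\beta}=R^{T}Q^{T}y$; when $X$ has full column rank, $R$ is invertible, so the OLS estimator and the fitted values can be expressed as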

$$\hat{\beta}=R^{-1}Q^{T}y$$
$$\hat{y}=X\hat{\beta}=QQ^{T}y$$
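
To make the two formulas above concrete, here is a minimal NumPy sketch on simulated data. The sample size, coefficients, and noise level are made up for illustration. It solves the triangular system $R\hat{\beta}=Q^{T}y$ rather than forming $R^{-1}$ explicitly, and compares the result with `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 3  # illustrative sizes, not from the text

# Design matrix with an intercept column, so X is N x (p + 1)
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])  # made-up coefficients
y = X @ beta_true + 0.1 * rng.normal(size=N)

# Reduced QR: Q is N x (p + 1) with orthonormal columns, R is upper triangular
Q, R = np.linalg.qr(X, mode="reduced")

# beta_hat = R^{-1} Q^T y, computed by solving R beta = Q^T y
beta_qr = np.linalg.solve(R, Q.T @ y)

# y_hat = Q Q^T y
y_hat = Q @ (Q.T @ y)

# Reference solution from NumPy's least-squares routine
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta_qr, beta_lstsq))
```

In practice a dedicated triangular solver (back-substitution) would exploit the structure of $R$; `np.linalg.solve` is used here only to keep the sketch dependency-free.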

### Too much data - when $N$ is large

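There may be situations where $N$ is so large that $X$ cannot fit into memory, let alone be used to compute $X'X$ directly. In that case, we can compute $X'X$ as a sum of outer products over the observations:
\begin{align}
X'X = \sum_{n=1}^N x_n \cdot x_n',
\end{align}
where $x_n\in \mathbb{R}^{(p + 1) \times 1}$ is the $n$-th row of $X$ written as a column vector. That is, each $x_n \cdot x_n'$ is a $(p + 1) \times (p + 1)$ matrix, so the sum can be accumulated one observation (or one batch of observations) at a time without ever holding all of $X$ in memory. Assuming $p$ is not too large, we should be able to invert the resulting $(p + 1) \times (p + 1)$ matrix.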
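
The accumulation can be sketched as follows. This is an illustration under assumptions, not a production pipeline: `iter_chunks` is a hypothetical stand-in for reading batches of rows from disk, and the data are simulated in memory only so that the result can be checked. The sketch also accumulates $X'y=\sum_n x_n y_n$ in the same pass, which the text above does not mention but which is needed to obtain $\hat{\beta}$.

```python
import numpy as np

def iter_chunks(X, y, chunk_size):
    """Hypothetical stand-in for streaming batches of rows from disk."""
    for start in range(0, X.shape[0], chunk_size):
        yield X[start:start + chunk_size], y[start:start + chunk_size]

rng = np.random.default_rng(0)
N, p = 10_000, 4  # illustrative sizes
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
y = X @ np.arange(1.0, p + 2) + rng.normal(size=N)

# Running sums: both are small, (p + 1) x (p + 1) and (p + 1,)
XtX = np.zeros((p + 1, p + 1))
Xty = np.zeros(p + 1)
for X_chunk, y_chunk in iter_chunks(X, y, chunk_size=1_000):
    # X_chunk.T @ X_chunk equals the sum of x_n x_n' over the rows in the chunk
    XtX += X_chunk.T @ X_chunk
    Xty += X_chunk.T @ y_chunk

# Solve the normal equations instead of explicitly inverting XtX
beta_hat = np.linalg.solve(XtX, Xty)
```

Only the two running sums are kept between batches, so memory use depends on $p$ and the chunk size, not on $N$.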

### Imperfect Multicollinearity

Like perfect multicollinearity, imperfect multicollinearity is a feature of the entire set of regressors. Unlike the perfect version, imperfect multicollinearity does not prevent $(X'X)$ from being invertible, but depending on the precision of the numerical inversion, it can still cause problems in practice. In theory, **imperfect multicollinearity causes the standard errors of the OLS estimates to swell, so the OLS estimates are imprecise for the given sample size. It therefore also calls into question hypothesis tests on the affected regressors**.

While it is somewhat easier to diagnose perfect multicollinearity (e.g. did you fall into the dummy-variable trap, whereby you created a dummy variable for every category of a categorical regressor while also including an intercept?), it can be hard to discern the reasons for imperfect multicollinearity. The following are usually viewed as symptoms:
- **Large changes in the estimated regression coefficients** when a predictor variable (column of $X$) is added or deleted, or when the data $y$ is perturbed.
- **Insignificant regression coefficients for the affected variables** in the multiple regression, but a **rejection of the joint hypothesis that those coefficients are all zero** (using the F-test mentioned above).
- If a **multivariable regression finds an insignificant coefficient** of a particular explanator, yet a **simple linear regression of the explained variable on this explanatory variable shows its coefficient to be significantly different from zero**, this situation indicates multicollinearity in the multivariable regression.
- The estimate of **one regression slope is very large and positive, while another is very large and negative**.
- **Correlation matrix** of the regressors can also be informative.
- The **Variance Inflation Factor (VIF)** for a given regressor $j$: $VIF_j = \frac{1}{1-R_j^2}$, where $R_j^2$ is the $R^2$ when regressing the $j$-th regressor on all other regressors. As a rule of thumb, a **VIF above 5 (or, more leniently, 10) indicates multicollinearity**.
- **Condition Number of $X$**. The condition number is the **maximum singular value of $X$ over the minimum singular value of $X$**, or equivalently the square root of the ratio of the largest to the smallest eigenvalue of $X'X$ (note that if a singular value is zero, $X$ is not full-rank). If the condition number is **larger than 30**, $X$ is considered ill-conditioned, and we have a multicollinearity problem.
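
The last two diagnostics are straightforward to compute by hand. The sketch below uses simulated regressors in which the second is nearly a copy of the first; the helper names `r_squared` and `vif` are illustrative, not from any library. Each auxiliary regression includes an intercept, and the condition number is taken as the ratio of the largest to the smallest singular value.

```python
import numpy as np

def r_squared(y, X):
    """R^2 from regressing y on X (X should include an intercept column)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - (resid @ resid) / tss

def vif(Z):
    """VIF for each column of Z (regressors only, no intercept column)."""
    n, k = Z.shape
    out = np.empty(k)
    for j in range(k):
        # Regress column j on an intercept plus all other columns
        others = np.column_stack([np.ones(n), np.delete(Z, j, axis=1)])
        out[j] = 1.0 / (1.0 - r_squared(Z[:, j], others))
    return out

rng = np.random.default_rng(0)
n = 500
z1 = rng.normal(size=n)
z2 = z1 + 0.05 * rng.normal(size=n)  # nearly collinear with z1
z3 = rng.normal(size=n)              # unrelated regressor
Z = np.column_stack([z1, z2, z3])

vifs = vif(Z)

# Condition number: largest singular value over smallest singular value
singular_values = np.linalg.svd(Z, compute_uv=False)
cond_number = singular_values.max() / singular_values.min()
print(vifs, cond_number)
```

In this setup the VIFs for `z1` and `z2` should be far above the usual thresholds, while the VIF for `z3` should stay close to 1.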


Depending on the application, imperfect multicollinearity may or may not need to be remedied; in particular, it can often be left alone when hypothesis testing on $\beta$ is not required. If it does need to be dealt with, the following can be helpful.
- Standardize your independent variables. This may help avoid falsely flagging a condition number above 30 that is due only to differences in scale.
- Drop one or more regressors that are highly correlated with the regressors you keep.
- Resort to ridge or lasso regression, or to PLR or PLS.
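
As a rough illustration of the first and last remedies, the sketch below standardizes two nearly collinear simulated regressors and fits ridge regression using the standard closed form $(Z'Z+\lambda I)^{-1}Z'y$. That formula is not derived in these notes, and the penalty `lam=10.0` is an arbitrary value chosen for illustration; in practice it would be tuned, e.g. by cross-validation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
z1 = rng.normal(size=n)
z2 = z1 + 0.05 * rng.normal(size=n)  # nearly collinear with z1
Z = np.column_stack([z1, z2])
y = 1.0 + z1 + z2 + rng.normal(size=n)

# Standardize the regressors and center y, so no intercept is fit or penalized
Zs = (Z - Z.mean(axis=0)) / Z.std(axis=0)
yc = y - y.mean()

def ridge(Zs, yc, lam):
    """Closed-form ridge estimate; lam = 0 reduces to OLS."""
    k = Zs.shape[1]
    return np.linalg.solve(Zs.T @ Zs + lam * np.eye(k), Zs.T @ yc)

beta_ols = ridge(Zs, yc, lam=0.0)
beta_ridge = ridge(Zs, yc, lam=10.0)  # illustrative penalty, not tuned

# The penalty improves the conditioning of the matrix being inverted
cond_ols = np.linalg.cond(Zs.T @ Zs)
cond_ridge = np.linalg.cond(Zs.T @ Zs + 10.0 * np.eye(2))
print(beta_ols, beta_ridge, cond_ols, cond_ridge)
```

The ridge coefficients are biased toward zero, which is the price paid for the reduced variance; standard OLS hypothesis tests do not carry over to them directly.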