# Introduction to Regression

In a regression problem, we fit a continuous function $f$ to estimate the true value $t$ for a given datapoint $\underline{x}$, using information from our observed data $\mathcal{S}_m=(\underline{\underline{X}}, \underline{y})\in\mathbb{R}^{m\times n}\times \mathbb{R}^m$, where $m$ is the number of observed points and $n$ is the number of predictor variables.

# Linear Regression

🟢 **Assumption:** Linear Regression assumes that the true data can be represented by a linear equation with some noise $\underline{\varepsilon}$. This is formally written as:

$$
\underline{y} = \underline{\underline{X}}\cdot\underline{w} + \underline{\varepsilon}
$$

$\underline{\underline{X}}$ is known as the **design matrix**, and $\underline{w}\in{\mathbb{R}^n}$ is a set of unknown weights that characterise the true distribution. The true target variable that we don't observe is given by $t = \underline{x}^\top \cdot \underline{w}$. In statistical texts, the weights are called parameters and are instead represented with $\underline{\beta}$.

🟢 **Assumption:** For core linear regression problems, we assume that the errors are **i.i.d.** with probability distribution $\varepsilon \sim \mathcal{N}(0, \sigma^2)$. This implies that the observed target variables are normally distributed: $\underline{y}\sim \mathcal{N}(\underline{\underline{X}}\cdot \underline{w}, \sigma^2\underline{\underline{\delta}})$, where $\underline{\underline{\delta}}$ denotes the $m\times m$ identity matrix [[P1]](#P1)
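To make the two assumptions concrete, the following is a minimal sketch that simulates data from the model $\underline{y} = \underline{\underline{X}}\cdot\underline{w} + \underline{\varepsilon}$. The function name `simulate_linear_data`, the choice of a Gaussian design matrix, and the values of `m`, `n` and `sigma` are illustrative assumptions, not part of the notes.

```python
import numpy as np

def simulate_linear_data(m, n, sigma, seed=0):
    """Draw (X, y, w) from y = X w + eps with eps ~ N(0, sigma^2) i.i.d.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(m, n))            # design matrix, shape (m, n)
    w = rng.normal(size=n)                 # "true" weights, shape (n,)
    eps = rng.normal(0.0, sigma, size=m)   # i.i.d. Gaussian noise, shape (m,)
    y = X @ w + eps                        # observed targets, shape (m,)
    return X, y, w

X, y, w = simulate_linear_data(m=200, n=3, sigma=0.5)
print(X.shape, y.shape, w.shape)
```

In practice $\underline{w}$ and $\underline{\varepsilon}$ are not observed; only $\underline{\underline{X}}$ and $\underline{y}$ are available.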

## Inference

To determine the optimal weights, we typically minimise the squared L2 norm of the errors. The total error over the observed data is the Sum of Squares (SS):

$$
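S(\underline{w}) = \underline{\varepsilon}^\top\cdot \underline{\varepsilon} = (\underline{y} - \underline{\underline{X}}\cdot\underline{w})^\top\cdot(\underline{y} - \underline{\underline{X}}\cdot\underline{w})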
$$
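Since the model gives $\underline{\varepsilon} = \underline{y} - \underline{\underline{X}}\cdot\underline{w}$, the Sum of Squares can be evaluated for any candidate weights. The sketch below assumes NumPy; the function name `sum_of_squares` and the toy data are made up for illustration.

```python
import numpy as np

def sum_of_squares(w, X, y):
    """Return S(w) = (y - X w)^T (y - X w) for candidate weights w."""
    residual = y - X @ w
    return float(residual @ residual)

# Toy data lying exactly on the line y = 1 + 2 x (first column is an intercept)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

print(sum_of_squares(np.array([1.0, 2.0]), X, y))  # 0.0, the weights fit exactly
print(sum_of_squares(np.array([0.0, 2.0]), X, y))  # 3.0, each residual equals 1
```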

The optimal weights $\underline{\hat{w}}$ that minimise the above expression satisfy the **normal equations** [[P2]](#P2):

$$
\underline{\underline{X}}^\top \underline{\underline{X}} \cdot \underline{\hat{w}} = \underline{\underline{X}}^\top \underline{y}
$$

🟢 **Assumptions:** Suppose the following hold:
- The number of observed points is greater than the number of predictor variables, $m>n$
- $\underline{\underline{X}}$ is a full-rank matrix [[N1]](#N1)

Then $\underline{\underline{X}}^\top \underline{\underline{X}}$ is invertible [[N2]](#N2).

We write [[N3]](#N3):

$$
\underline{\hat{w}} = (\underline{\underline{X}}^\top \underline{\underline{X}})^{-1}\underline{\underline{X}}^\top \underline{y}
$$
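As a minimal sketch of how $\underline{\hat{w}}$ might be computed in practice, the code below solves the normal equations as a linear system with `np.linalg.solve` rather than forming the inverse explicitly. The helper name `ols_weights`, the simulated data and the chosen `w_true` are assumptions for illustration.

```python
import numpy as np

def ols_weights(X, y):
    """Solve X^T X w = X^T y for w without forming an explicit inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Simulated data with m = 50 observations and n = 3 predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + rng.normal(0.0, 0.1, size=50)

w_hat = ols_weights(X, y)

# At the optimum the residuals are orthogonal to every column of X
print(np.allclose(X.T @ (y - X @ w_hat), 0.0, atol=1e-8))
# The result should agree with NumPy's own least-squares routine
print(np.allclose(w_hat, np.linalg.lstsq(X, y, rcond=None)[0]))
```

The orthogonality check is just the normal equations rearranged as $\underline{\underline{X}}^\top(\underline{y} - \underline{\underline{X}}\cdot\underline{\hat{w}}) = \underline{0}$.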



## Ridge Regression

# Other types of Regression

## Regression and the two-sample test

## Weighted Regression

## Logistic Regression

## Probabilistic View of Linear Regression

# References

## Proofs
<a id="P1" href="../Appendix/A.Proofs/Regression.html#TargetVarDist">[P1]</a>
Derivation of the distribution of $\underline{y}$

<a id="P2" href="../Appendix/A.Proofs/Regression.html#OLS">[P2]</a>
Ordinary Least Squares solution to Linear Regression

## Notes

<a id="N1">[N1]</a>
A full-rank matrix $\underline{\underline{X}}\in\mathbb{R}^{m\times n}$ with $m>n$ is one where all $n$ columns are linearly independent. Thus if your design matrix has perfect multicollinearity, you cannot arrive at a unique solution to OLS. Even without perfect multicollinearity, terms that correlate highly with one another can still affect your OLS solutions, because the system becomes ill-conditioned and numerical errors are amplified when solving for $\underline{w}$.
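The difference between perfect and near multicollinearity can be sketched numerically with the rank and condition number of the design matrix. The three matrices below are made-up examples, and the noise scale `1e-6` is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
ones = np.ones(100)
x1 = rng.normal(size=100)

# Perfect multicollinearity: the third column is exactly twice the second
X_perfect = np.column_stack([ones, x1, 2.0 * x1])
# Near multicollinearity: the third column is the second plus tiny noise
X_near = np.column_stack([ones, x1, x1 + 1e-6 * rng.normal(size=100)])
# Unrelated columns for comparison
X_ok = np.column_stack([ones, x1, rng.normal(size=100)])

# Rank drops below n = 3 only in the perfectly collinear case
print(np.linalg.matrix_rank(X_perfect), np.linalg.matrix_rank(X_near))
# A large condition number signals that errors are amplified when solving
print(np.linalg.cond(X_ok), np.linalg.cond(X_near))
```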

<a id="N2">[N2]</a>
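A square matrix is invertible if and only if all of its eigenvalues are non-zero. Because $\underline{\underline{X}}^\top \underline{\underline{X}}$ is symmetric and positive semi-definite, its eigenvalues are non-negative, so it is invertible exactly when all of them are strictly positive. This is a useful way to check for invertibility.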
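For the symmetric matrix $\underline{\underline{X}}^\top \underline{\underline{X}}$, whose eigenvalues are never negative, this check can be sketched as follows. The helper name `gram_is_invertible` and the relative tolerance `tol` are assumptions; a tolerance is needed because floating-point eigenvalues are rarely exactly zero.

```python
import numpy as np

def gram_is_invertible(X, tol=1e-10):
    """Check that X^T X has strictly positive eigenvalues, up to a tolerance."""
    # eigvalsh is for symmetric matrices and returns eigenvalues in ascending order
    eigvals = np.linalg.eigvalsh(X.T @ X)
    return bool(eigvals[0] > tol * eigvals[-1])

X_full = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])       # independent columns
X_deficient = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])  # second column = 2 * first

print(gram_is_invertible(X_full))
print(gram_is_invertible(X_deficient))
```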

<a id="N3">[N3]</a>
Typically we don't calculate the optimal weights using the explicit inverse because it is computationally expensive and less numerically stable. Instead we solve the linear system of equations given by the normal equations.

## Sources