# Regression Overview
***

## Simple Linear Regression

Suppose that we have data $\mathcal{D}:=\{(x_i,y_i):i \in \{1,\dots,n\}\}$, where $\{(x_i,y_i)\}_{i=1}^n$ are observed values of the random variables $\{(X_i,Y_i)\}_{i=1}^n$, which satisfy the relationship
\begin{equation}
    Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i.
\end{equation}
Here, $\{\varepsilon_i\}_{i=1}^n$ are uncorrelated mean zero random variables with common variance $\sigma^2$, which we call *random errors*. 
***
**Definition:** For random variables $Y$ and $X$ we define the *population regression function* (or merely the regression) of $Y$ with respect to $X$ as the function $f:\mathbb{R} \to \mathbb{R}$ such that 
\begin{equation*}
    f(x) = \mathbb{E}\left[ Y | X = x \right].
\end{equation*}
***
In general, the word *regression* is used in statistics to signify a relationship between variables. From equation $(1)$, we can see that the regression function of any $Y_i$ with respect to $X_i$ is of the form 
\begin{equation*}
    \mathbb{E}\left[ Y_i | X_i = x_i \right]  = \beta_0 + \beta_1 x_i,
\end{equation*}
which is a linear function of $x_i$. Thus, equation (1) defines a *linear regression*. One main purpose of regression is to predict $Y_i$ from observed instances of $X_i$, and so it is common to refer to $Y_i$ as the *response variable* and $X_i$ as the *predictor variable*. The quantities $\beta_0$ and $\beta_1$ are called the *intercept* and *slope*, respectively, and are assumed to be fixed and unknown. Together they are known as the model *coefficients* or *parameters*. It is these unknown parameters that we wish to estimate, so that we can describe the relationship between the $Y_i$ and $X_i$.
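
To make the regression function concrete, the following sketch simulates the model in equation (1) and compares the empirical mean of the response at a few fixed predictor values with $\beta_0 + \beta_1 x$. The parameter values, the predictor values, and the choice of Gaussian errors are assumptions made purely for illustration; the model itself only requires uncorrelated, mean zero errors with common variance.

```python
import random

# Hypothetical parameter values, chosen only for illustration.
beta0, beta1, sigma = 1.0, 2.0, 0.5

# Fixed seed so that the simulation is reproducible.
rng = random.Random(0)


def simulate_response(x, n_draws):
    """Draw n_draws values of Y = beta0 + beta1 * x + error.

    Gaussian errors are an assumption of this sketch, not of the model.
    """
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for _ in range(n_draws)]


# Empirical estimate of E[Y | X = x] at a few predictor values.
empirical = {}
for x in [0.0, 1.0, 2.0]:
    draws = simulate_response(x, 20000)
    empirical[x] = sum(draws) / len(draws)
    print(f"x = {x}: empirical mean = {empirical[x]:.3f}, "
          f"beta0 + beta1 * x = {beta0 + beta1 * x:.3f}")
```

With a large number of draws, the empirical means should lie close to the straight line $\beta_0 + \beta_1 x$, which is what the regression function describes.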

***
**Definition:** Let $\mathcal{D}:=\{(x_i,y_i):i \in \{1,\dots,n\}\}$.
-  We define the *sample means* as 
\begin{equation*}
    \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i \quad \text{ and } \quad  \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i.
\end{equation*}
- We define the *sums of squares* as 
\begin{equation*}
    S_{xx} = \sum_{i=1}^n (x_i-\bar{x})^2 \quad \text{ and } \quad  S_{yy} = \sum_{i=1}^n (y_i-\bar{y})^2.
\end{equation*}
- We define the *sum of cross-products* as 
\begin{equation*}
    S_{xy} = \sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y}).
\end{equation*}
***
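
As a small illustration of these definitions, the sketch below computes the sample means, the sums of squares, and the sum of cross-products in plain Python. The function name `summary_statistics` and the five data points are hypothetical and serve only as an example.

```python
def summary_statistics(x, y):
    """Return (x_bar, y_bar, S_xx, S_yy, S_xy) for paired observations."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return x_bar, y_bar, s_xx, s_yy, s_xy


# Made-up data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

x_bar, y_bar, s_xx, s_yy, s_xy = summary_statistics(x, y)
print(f"x_bar = {x_bar}, y_bar = {y_bar:.2f}")
print(f"S_xx = {s_xx:.3f}, S_yy = {s_yy:.3f}, S_xy = {s_xy:.3f}")
```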

### Ordinary Least Squares (OLS)

Our first estimation of $\beta_0$ and $\beta_1$ will involve drawing a straight line through the data $\mathcal{D}$ that 'comes as close as possible' to all the points. In particular, let $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ be a straight line that predicts the value of $Y_i$ based on the observed value $x_i$ of $X_i$. Then $e_i = y_i - \hat{y}_i$ represents the $i$-th *residual*, and we aim to find estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimise the *residual sum of squares* 
\begin{equation}
    \text{RSS}(a,b) := \sum_{i=1}^n e_i^2 = \sum_{i=1}^n ( y_i - (a + b x_i) )^2,
\end{equation}
i.e.
\begin{equation*}
    (\hat{\beta}_0,\hat{\beta}_1) =  \text{argmin}_{(a,b)} \sum_{i=1}^n ( y_i - (a + b x_i) )^2.
\end{equation*}
If we fix $b$, then the value of $a$ that minimises $(2)$ is 
\begin{equation*}
    a = \frac{1}{n}\sum_{i=1}^n (y_i - b x_i) = \bar{y} - b \bar{x}.
\end{equation*}
Substituting this value of $a$ into $(2)$ gives us 
\begin{equation*}
    \text{RSS}(\bar{y} - b \bar{x},b) = \sum_{i=1}^n \left( (y_i - \bar{y}) - b (x_i - \bar{x}) \right)^2 = S_{yy} - 2b S_{xy} + b^2 S_{xx}.
\end{equation*}
Thus, 
\begin{equation*}
    \frac{\text{d}}{\text{d}b}  \text{RSS}(\bar{y} - b \bar{x},b) = 0 \Leftrightarrow b = \frac{S_{xy}}{S_{xx}},
\end{equation*}
whence $b = \frac{S_{xy}}{S_{xx}}$ is the global minimiser, since $\frac{\text{d}^2}{\text{d}b^2}  \text{RSS}(\bar{y} - b \bar{x},b) = 2 S_{xx} > 0$ provided that the $x_i$ are not all equal. We may conclude that 
\begin{equation*}
    \hat{\beta}_0 = \bar{y} - \frac{S_{xy}}{S_{xx}} \bar{x}   \quad \text{and} \quad\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}.
\end{equation*}
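
The closed-form estimates above can be sketched in a few lines of plain Python. The function names `ols_fit` and `rss` and the data are hypothetical, chosen for illustration; the final loop is only a sanity check that small perturbations of the estimates do not reduce the residual sum of squares.

```python
def ols_fit(x, y):
    """Return (beta0_hat, beta1_hat) using the formulas S_xy / S_xx and y_bar - slope * x_bar."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    if s_xx == 0:
        raise ValueError("all x values are equal, so the slope is not identifiable")
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def rss(a, b, x, y):
    """Residual sum of squares for the candidate line a + b * x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

beta0_hat, beta1_hat = ols_fit(x, y)
best_rss = rss(beta0_hat, beta1_hat, x, y)
print(f"intercept = {beta0_hat:.4f}, slope = {beta1_hat:.4f}, RSS = {best_rss:.4f}")

# Sanity check: moving away from the estimates should not lower the RSS.
perturbed = [
    rss(beta0_hat + da, beta1_hat + db, x, y)
    for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]
]
print(all(value >= best_rss for value in perturbed))
```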

### Best Linear Unbiased Estimators (BLUE)

Note that the least squares estimation of the model parameters involved no statistical inference as such; we merely fitted a line to the data according to some criterion. We will now show that the $(\hat{\beta}_0,\hat{\beta}_1)$ derived by OLS are also optimal in a statistical sense. 

In particular, we will derive unbiased linear estimators for $\beta_0$ and $\beta_1$ that have the smallest possible variance. Such an estimator will be called a *best* estimator. An estimator is linear if it is of the form
\begin{equation}
    \sum_{i=1}^n d_i Y_i
\end{equation}
where $\{d_i\}_{i=1}^n$ are fixed constants. If (3) is an unbiased estimator of $\beta_1$, then 
\begin{equation*}
    \beta_1 = \mathbb{E}\left[\sum_{i=1}^n d_i Y_i\right] =  \beta_0  \sum_{i=1}^n d_i  + \beta_1 \sum_{i=1}^n x_i d_i,
\end{equation*}
Since this must hold for all values of $\beta_0$ and $\beta_1$, we require 
\begin{equation}
     \sum_{i=1}^n d_i = 0 \quad \text{ and } \quad  \sum_{i=1}^n x_i d_i = 1.
\end{equation}
Since the $Y_i$ are uncorrelated with common variance $\sigma^2$, note that 
\begin{equation}
     \text{Var}\left[ \sum_{i=1}^n d_i Y_i \right] = \sigma^2 \sum_{i=1}^n d_i^2.
\end{equation}
Thus, to find the BLUE for $\beta_1$ we must find $\{d_i\}_{i=1}^n$ that satisfy (4) and minimise (5). One could use Lagrange multipliers to find the coefficients, but we will rely on the following lemma.
***
**Lemma:** Let $\left(v_1, \ldots, v_k\right)$ be constants and let $\left(c_1, \ldots, c_k\right)$ be positive constants. Then, for $\mathcal{A}=\left\{\mathbf{a}=\left(a_1, \ldots, a_k\right): \sum a_i=0\right\}$,
$$
\max _{\mathbf{a} \in \mathcal{A}}\left\{\frac{\left(\sum_{i=1}^k a_i v_i\right)^2}{\sum_{i=1}^k a_i^2 / c_i}\right\}=\sum_{i=1}^k c_i\left(v_i-\bar{v}_c\right)^2
$$
where $\bar{v}_c=\frac{\sum c_i v_i}{ \sum c_i }$. The maximum is attained at any $\mathbf{a}$ of the form $a_i=K c_i\left(v_i-\bar{v}_c\right)$, where $K$ is a nonzero constant.
***
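
The lemma can be checked numerically. The sketch below is not a proof; it evaluates the ratio at the claimed maximiser and at randomly drawn vectors with zero sum, and compares both with the right-hand side. The values of `v`, `c`, and `K`, and all function names, are hypothetical choices for illustration.

```python
import random


def ratio(a, v, c):
    """The quantity being maximised: (sum a_i v_i)^2 / sum (a_i^2 / c_i)."""
    numerator = sum(ai * vi for ai, vi in zip(a, v)) ** 2
    denominator = sum(ai ** 2 / ci for ai, ci in zip(a, c))
    return numerator / denominator


def weighted_mean(v, c):
    """The weighted mean v_bar_c = sum(c_i v_i) / sum(c_i)."""
    return sum(ci * vi for ci, vi in zip(c, v)) / sum(c)


def lemma_bound(v, c):
    """Right-hand side of the lemma: sum c_i (v_i - v_bar_c)^2."""
    v_bar_c = weighted_mean(v, c)
    return sum(ci * (vi - v_bar_c) ** 2 for ci, vi in zip(c, v))


def maximiser(v, c, K=1.0):
    """The claimed maximiser a_i = K c_i (v_i - v_bar_c)."""
    v_bar_c = weighted_mean(v, c)
    return [K * ci * (vi - v_bar_c) for ci, vi in zip(c, v)]


# Made-up constants for illustration only; the c_i must be positive.
v = [1.0, 2.0, 4.0, 7.0]
c = [0.5, 1.0, 2.0, 1.5]

bound = lemma_bound(v, c)
a_star = maximiser(v, c, K=3.0)
attained = ratio(a_star, v, c)

# Random vectors, centred so that their entries sum to zero.
rng = random.Random(0)
max_random = 0.0
for _ in range(1000):
    a = [rng.uniform(-1.0, 1.0) for _ in v]
    mean_a = sum(a) / len(a)
    a = [ai - mean_a for ai in a]
    max_random = max(max_random, ratio(a, v, c))

print(f"bound = {bound:.6f}, attained = {attained:.6f}, "
      f"largest random ratio = {max_random:.6f}")
```

In this check the ratio at `a_star` should agree with the bound up to floating-point error, while no random zero-sum vector should exceed it.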