In [3]:
import numpy as np

### 2.3.1 Linear Models and Least Squares

#### Theory

Given a vector of inputs $X^T = (X_1, X_2, \dots, X_p)$, we predict the output $Y$ via the model

$$\hat Y = \hat {\beta_0} + \sum^p_{j=1}X_j \hat{\beta_j}$$

where $\hat{\beta_j}$ is the coefficient of $X_j$ and $\hat{\beta_0}$ is the intercept (bias), i.e. the predicted value of $Y$ when every independent variable in the column vector $X$ is set to zero. If we include the constant variable $1$ in $X$ and include the intercept $\hat{\beta_0}$ in the coefficient vector $\hat{\beta}$, then the model can be written as an inner product

$$\hat Y = X^T \hat{\beta}$$

Viewed as a function over the $p$-dimensional input space, $f(X) = \hat Y = X^T \hat{\beta}$ is a linear function, and its gradient $f'(X) = \hat{\beta}$ is a vector in input space that points in the steepest uphill direction.

To learn the coefficients in $\beta$ we can choose the values that minimize the residual sum of squares

$$RSS(\beta) = \sum^N_{i=1}(y_i - x_i^T\beta)^2 = (y-X\beta)^T(y-X\beta)$$

where $X$ is an $N \times p$ matrix with each row an input vector, and $y$ is an
N-vector of the outputs in the training set. Differentiating w.r.t. $\beta$ we get
the normal equations

$$X^T(y-X\beta)=0$$
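
The differentiation step can be spelled out by expanding the quadratic form and applying standard matrix-calculus identities:

$$
\begin{align}
RSS(\beta) &= y^Ty - 2\beta^TX^Ty + \beta^TX^TX\beta \\
\frac{\partial RSS}{\partial \beta} &= -2X^Ty + 2X^TX\beta = -2X^T(y-X\beta)
\end{align}
$$

Setting this gradient to zero and dividing by $-2$ gives the normal equations.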

If $X^TX$ is invertible (there exists a $p \times p$ matrix $A$ such that $AX^TX = X^TXA=I$), then the unique solution is given by

$$\hat{\beta} = (X^TX)^{-1}X^Ty$$

The fitted value $\hat{y_i}$ at the $i$th input $x_i$ is then


$$\hat{y_i} = x_i^T\hat{\beta}$$
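
As a minimal sketch of the formulas above, the following NumPy snippet fits a linear model on synthetic data. The data, the sample size, and the names `X_raw` and `true_beta` are illustrative assumptions, not part of the original notes. It solves the normal equations with `np.linalg.solve` instead of forming $(X^TX)^{-1}$ explicitly, which is usually the more numerically stable choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set: N = 50 observations of 2 raw inputs.
N = 50
X_raw = rng.normal(size=(N, 2))
true_beta = np.array([1.0, 2.0, -3.0])  # intercept first (illustrative values)

# Prepend the constant 1 so the intercept is absorbed into beta.
X = np.column_stack([np.ones(N), X_raw])
y = X @ true_beta + rng.normal(scale=0.1, size=N)

# Solve the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fitted values and residual sum of squares.
y_hat = X @ beta_hat
rss = (y - y_hat) @ (y - y_hat)

print(beta_hat)
print(rss)
```

At the solution, `X.T @ (y - y_hat)` should be zero up to floating-point error, which is exactly the normal-equations condition.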

## Simple linear regression

Simple linear regression is a straightforward approach for predicting a quantitative response $Y$ on the basis of a single predictor variable $X$. It assumes that there is approximately a linear relationship between $X$ and $Y$:

$$Y \approx \beta_0 + \beta_1 X$$

$\beta_0$ and $\beta_1$ are two unknown constants that represent the intercept and slope terms in the linear model respectively. The intercept is the expected value of $Y$ when $X=0$. The slope is the average increase in $Y$ associated with a one-unit increase in $X$. They are known as the model coefficients or parameters. Through training data we can obtain estimates of the coefficients that can be used to make predictions of $Y$ on the basis of $X=x$

$$\hat{y} =  \hat{\beta_0} + \hat{\beta_1} x$$

### Estimating the coefficients

Let $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ represent $n$ observation pairs, each of which consists of a measurement of $X$ and a measurement of $Y$. We want to find an intercept $\hat{\beta_0}$ and a slope $\hat{\beta_1}$ such that the resulting line is as close as possible to the $n$ observation points. The most common approach for obtaining the optimal coefficients is to minimize the least squares criterion.

Let $\hat{y_i}= \hat{\beta_0} + \hat{\beta_1} x_i$ be the prediction for $Y$ based on the $i$th value of $X$. Then $e_i = y_i - \hat{y_i}$ represents the $i$th residual, i.e. the difference between the $i$th observed response value and the $i$th response value predicted by our linear model. We define the residual sum of squares (RSS) as

$$
\begin{align}
RSS &= e_1^2 + e_2^2 + \dots + e_n^2 \\
&= (y_1 - \hat{y_1})^2 + (y_2 - \hat{y_2})^2 + \dots + (y_n - \hat{y_n})^2\\
&= (y_1 - \hat{\beta_0} - \hat{\beta_1} x_1)^2 + (y_2 - \hat{\beta_0} - \hat{\beta_1} x_2)^2 + \dots + (y_n - \hat{\beta_0} - \hat{\beta_1} x_n)^2
\end{align}
$$

The least squares approach selects $\hat{\beta_0}$ and $\hat{\beta_1}$ to minimize RSS. These can be shown to be

$$\hat{\beta_1} = \dfrac{\sum^n_{i=1}(x_i-\bar{x})(y_i-\bar{y})}{\sum^n_{i=1}(x_i-\bar{x})^2}$$

$$\hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}$$

where $\bar{y} \equiv \frac{1}{n} \sum^n_{i=1}y_i$ and $\bar{x} \equiv \frac{1}{n} \sum^n_{i=1} x_i$ are the sample means.
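
The closed-form estimates translate almost line by line into NumPy. The sketch below uses a small made-up data set (the values of `x` and `y` are assumptions chosen for illustration) and cross-checks the result against `np.polyfit`, which fits a degree-1 polynomial by least squares.

```python
import numpy as np

# Made-up observation pairs (x_i, y_i), for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Sample means.
x_bar, y_bar = x.mean(), y.mean()

# Least squares estimates of slope and intercept.
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar

# np.polyfit returns coefficients from highest degree down: [slope, intercept].
slope_check, intercept_check = np.polyfit(x, y, 1)

print(beta0_hat, beta1_hat)
```

For these numbers the slope comes out at about 1.99 and the intercept at about 0.05.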

### Assessing the accuracy of the coefficient estimates

We assume that the true relationship between $X$ and $Y$ takes the form $Y=f(X) + \epsilon$ for some unknown function $f$, where $\epsilon$ is a mean-zero random error term. If $f$ is to be approximated by a linear function, then we can write this relationship as

$$Y = \beta_0 + \beta_1 X + \epsilon$$

The error term $\epsilon$ is a catch-all for what we miss with this simple model: the true relationship is probably not linear, there may be other variables that cause variation in $Y$, and there may be measurement error. We typically assume that the error term is independent of $X$.
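
One way to build intuition for this model is to simulate from it. The sketch below assumes a known "true" line ($\beta_0 = 2$, $\beta_1 = 3$), uniformly distributed inputs, and standard normal errors drawn independently of $X$. All of these choices are illustrative assumptions. It repeatedly draws a data set, refits the line, and records the estimates.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed true coefficients and simulation sizes (illustrative only).
beta0, beta1 = 2.0, 3.0
n, n_sims = 100, 1000

estimates = np.empty((n_sims, 2))
for s in range(n_sims):
    x = rng.uniform(-2, 2, size=n)
    eps = rng.normal(0.0, 1.0, size=n)  # mean-zero error, independent of x
    y = beta0 + beta1 * x + eps

    # Least squares estimates for this simulated data set.
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    estimates[s] = (b0, b1)

# Average and spread of the estimates across simulated data sets.
mean_b0, mean_b1 = estimates.mean(axis=0)
sd_b0, sd_b1 = estimates.std(axis=0, ddof=1)

print(mean_b0, mean_b1)
print(sd_b0, sd_b1)
```

Each individual fit differs from the true line because of $\epsilon$, but under these assumptions the estimates should scatter around the true coefficients, and their spread gives a sense of how accurate a single fit is.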

### Assessing the accuracy of the model

The quality of a linear regression fit is typically assessed using two related quantities: the residual standard error (RSE) and the $R^2$ statistic.

#### Residual standard error

The RSE is an estimate of the standard deviation of $\epsilon$. Roughly speaking, it is the average amount that the response
will deviate from the true regression line.

$$RSE=\sqrt{\dfrac{1}{n-2}RSS}=\sqrt{\dfrac{1}{n-2}\sum^n_{i=1}(y_i-\hat{y_i})^2}$$
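
A short sketch of the RSE computation, again on made-up data (the arrays `x` and `y` are illustrative assumptions). It fits the line with the closed-form least squares estimates and then applies the formula above, dividing by $n-2$ because two coefficients were estimated.

```python
import numpy as np

# Made-up observation pairs, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x)

# Closed-form least squares fit.
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

# Residual sum of squares and residual standard error.
y_hat = beta0_hat + beta1_hat * x
rss = np.sum((y - y_hat) ** 2)
rse = np.sqrt(rss / (n - 2))

print(rss, rse)
```

For these numbers the RSS is about 0.107 and the RSE is about 0.19, in the same units as $Y$.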