# CHAPTER 3: Linear Regression

## Simple Linear Regression

Predicting a quantitative response $Y$ on the basis of a single predictor variable $X$, assuming a linear relationship:
$$
Y \approx \beta_{0} + \beta_{1}X
$$

Using training data, we compute estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ of the coefficients, and we can then predict the response at $X = x$ with the formula $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x$.

### Estimating the coefficients

There are many ways to estimate the coefficients from training data; here we use the least squares criterion. Writing $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ for the prediction at the point $x_i$, we have that

$e_i = y_i - \hat{y}_i$

is the $i$-th residual, and we define the residual sum of squares as $RSS = e_1^2 + \dots + e_n^2$.

After some computation, one can show that the coefficients that minimize the $RSS$ are:

- $\hat{\beta}_1 = \dfrac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sum(x_i-\bar{x})^2}$,

- $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$.
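The two formulas above can be sketched directly in Python. This is a minimal illustration using only the standard library; the function name `fit_simple_ols` and the five-point toy data set are assumptions made for the example, not part of the original notes.

```python
def fit_simple_ols(x, y):
    """Return (beta0_hat, beta1_hat) minimizing the RSS for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = sxy / sxx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


# Hypothetical toy data, chosen only so the arithmetic is easy to check by hand.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

beta0_hat, beta1_hat = fit_simple_ols(x, y)

# Residuals e_i = y_i - y_hat_i and the residual sum of squares.
residuals = [yi - (beta0_hat + beta1_hat * xi) for xi, yi in zip(x, y)]
rss = sum(e ** 2 for e in residuals)

print(beta0_hat, beta1_hat, rss)
```

For this toy data, $\bar{x} = 3$ and $\bar{y} = 4$, so the slope is $6/10 = 0.6$, the intercept is $4 - 0.6 \cdot 3 = 2.2$, and the RSS is $2.4$ (up to floating-point rounding).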

### Assessing the accuracy of the Coefficient Estimates

The population regression line is $Y = \beta_{0} + \beta_{1}X + \epsilon$.

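The computed regression line relates to the population regression line in the same way that the sample mean relates to the population mean: in both cases, a quantity computed from one sample is used to estimate an unknown population quantity.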

For example, the variance of the sample mean $\hat{\mu}$ is $\text{Var}(\hat{\mu}) = SE(\hat{\mu})^2 = \dfrac{\sigma^2}{n}$, where $\sigma$ is the standard deviation of each of the realizations $y_i$ of $Y$ (the formula assumes the $n$ observations are uncorrelated). The standard error tells us roughly how much, on average, the estimate $\hat{\mu}$ differs from the population mean $\mu$.
Analogously, the standard errors of the estimated coefficients are:

- $SE(\hat{\beta}_0)^2 = \sigma^2\left[\dfrac{1}{n} + \dfrac{\bar{x}^2}{\sum(x_i-\bar{x})^2}\right]$,
- $SE(\hat{\beta}_1)^2 = \dfrac{\sigma^2}{\sum(x_i-\bar{x})^2}$,

where $\sigma^2 = \text{Var}(\epsilon)$. For these formulas to be strictly valid, the errors $\epsilon_i$ for each observation must be uncorrelated and have common variance $\sigma^2$; when this does not hold exactly, the formulas often remain a good approximation.

Generally, $\sigma^2$ is unknown. The estimate of $\sigma$ is called the **residual standard error**: $RSE = \sqrt{\dfrac{RSS}{n-2}}$
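As a rough sketch of how these quantities fit together, the snippet below plugs the RSE in place of the unknown $\sigma$ to obtain the standard errors of both coefficients. The function name `simple_ols_standard_errors` and the toy data are assumptions made for illustration.

```python
import math


def simple_ols_standard_errors(x, y):
    """Return (rse, se_beta0, se_beta1) for a simple least squares fit."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)

    # Least squares estimates of slope and intercept.
    beta1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    beta0 = y_bar - beta1 * x_bar

    # RSE = sqrt(RSS / (n - 2)) estimates sigma.
    rss = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))
    rse = math.sqrt(rss / (n - 2))

    # Standard error formulas with sigma replaced by the RSE.
    se_beta0 = rse * math.sqrt(1 / n + x_bar ** 2 / sxx)
    se_beta1 = rse / math.sqrt(sxx)
    return rse, se_beta0, se_beta1


# Hypothetical toy data (n = 5).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

rse, se_beta0, se_beta1 = simple_ols_standard_errors(x, y)
print(rse, se_beta0, se_beta1)
```

Here $RSS = 2.4$ and $\sum(x_i-\bar{x})^2 = 10$, so $RSE^2 = 2.4/3 = 0.8$, $SE(\hat{\beta}_1)^2 = 0.08$ and $SE(\hat{\beta}_0)^2 = 0.8 \cdot (1/5 + 9/10) = 0.88$.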

#### Use of standard error for Confidence Intervals and Hypothesis Testing

A 95% confidence interval for $\beta_1$ is approximately $\hat{\beta}_1 \pm 2 \cdot SE(\hat{\beta}_1)$, and the analogous interval $\hat{\beta}_0 \pm 2 \cdot SE(\hat{\beta}_0)$ holds for $\beta_0$.

Standard errors can also be used for hypothesis tests on the coefficients. The null hypothesis is that there is no relationship between $X$ and $Y$, which corresponds to $\beta_1 = 0$.
To test the null hypothesis, we need to determine whether $\hat{\beta}_1$, our estimate for $\beta_1$, is sufficiently far from zero that we can be confident that $\beta_1$ is non-zero. To do so, a $t$-statistic is computed, 
$$
t = \dfrac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)},
$$
which measures the number of standard deviations that $\hat{\beta}_1$ is away from 0. If there really is no relationship between $X$ and $Y$, we expect $t$ to follow a $t$-distribution with $n-2$ degrees of freedom. The $p$-value is then the probability, assuming $\beta_1 = 0$, of observing a value at least as large as $|t|$ in absolute value. If the $p$-value is sufficiently small, we can reject the null hypothesis.
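The interval and the test can be sketched numerically as follows. The values of `beta1_hat`, `se_beta1` and `n` are hypothetical inputs, and the $p$-value is obtained by integrating the $t$ density with Simpson's rule, which is an assumption made to keep the example within the standard library; in practice a statistics library would normally supply the $t$-distribution.

```python
import math


def t_density(u, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + u * u / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_stat, df, steps=2000):
    """Approximate P(|T| >= |t_stat|) using Simpson's rule on [0, |t_stat|]."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_density(k * h, df)
    central_half = total * h / 3  # P(0 <= T <= |t_stat|)
    # The density is symmetric, so the two tails hold 1 - 2 * central_half.
    return 1.0 - 2.0 * central_half


# Hypothetical inputs: a slope estimate and its standard error from n = 5 points.
beta1_hat = 0.6
se_beta1 = math.sqrt(0.08)
n = 5

# Approximate 95% confidence interval for beta_1 (the "plus or minus 2 SE" rule).
ci_low = beta1_hat - 2 * se_beta1
ci_high = beta1_hat + 2 * se_beta1

# t-statistic for H0: beta_1 = 0, and its two-sided p-value with n - 2 df.
t_stat = (beta1_hat - 0) / se_beta1
p_value = two_sided_p_value(t_stat, df=n - 2)

print(ci_low, ci_high, t_stat, p_value)
```

With so few observations the factor 2 is only a crude stand-in for the exact $t$ quantile (about 3.18 for 3 degrees of freedom), so in this toy case the approximate interval excludes 0 even though the $p$-value, roughly 0.12, is not small.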

### Assessing the accuracy of the Model

Once the null hypothesis has been rejected, we want to quantify the extent to which the model fits the data. The quality of a linear regression fit is usually assessed using the **residual standard error** (RSE) and the $R^2$ statistic.

- The RSE is an estimate of the standard deviation of $\epsilon$. Roughly speaking, it is the average amount that the response will deviate from the true regression line. Another way to think about this is that even if the model were correct and the true values of the unknown coefficients $\beta_0$ and $\beta_1$ were known exactly, any prediction would still be off by about the RSE on average.

- $R^2 = \dfrac{\text{TSS - RSS}}{\text{TSS}} = 1- \dfrac{\text{RSS}}{\text{TSS}}$, where TSS = $\sum(y_i - \bar{y})^2$. TSS measures the total variance in the response $Y$, and can be thought of as the amount of variability inherent in the response before the regression is performed. In contrast, RSS measures the amount of variability that is left unexplained after performing the regression. Hence, TSS − RSS measures the amount of variability in the response that is explained (or removed) by performing the regression, and $R^2$ measures the proportion of variability in $Y$ that can be explained using $X$.

In simple linear regression (and only there), $R^2 = r^2$, where $r$ is the sample correlation between $X$ and $Y$.
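The identity $R^2 = r^2$ can be checked numerically. The sketch below computes $R^2$ from TSS and RSS and compares it with the squared sample correlation; the function name `r_squared_and_correlation` and the toy data are assumptions made for the example.

```python
import math


def r_squared_and_correlation(x, y):
    """Return (R^2, r) for a simple least squares fit of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    # Least squares fit.
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar

    # TSS: variability before regression; RSS: variability left unexplained.
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rss = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - rss / tss

    # Sample correlation between x and y.
    r = sxy / math.sqrt(sxx * tss)
    return r_squared, r


# Hypothetical toy data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r_squared, r = r_squared_and_correlation(x, y)
print(r_squared, r ** 2)
```

For this data $TSS = 6$ and $RSS = 2.4$, so $R^2 = 1 - 2.4/6 = 0.6$, which matches $r^2 = 36/60$.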

## Multiple Linear Regression

$Y = \beta_0 + \beta_1X_1 + \dots + \beta_pX_p + \epsilon$

We interpret $\beta_j$ as the average effect on $Y$ of a one unit increase in $X_j$, holding all other predictors fixed.
Coefficients are estimated as before, i.e. we choose the $\hat{\beta}_j$'s that minimize the RSS; the resulting formulas are more complicated and are most easily expressed using matrix algebra.
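One common way to carry out this minimization is to solve the normal equations $(X^TX)\hat{\beta} = X^Ty$, where $X$ is the design matrix with a leading column of ones. The sketch below does this with plain Gaussian elimination; the helper names `solve_linear_system` and `fit_multiple_ols` and the noise-free synthetic data are assumptions made for illustration, and a numerical library would normally be used instead.

```python
def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - tail) / m[i][i]
    return z


def fit_multiple_ols(rows, y):
    """Return [beta0_hat, beta1_hat, ..., betap_hat] minimizing the RSS."""
    # Design matrix: a column of ones for the intercept, then the predictors.
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic, noise-free data generated from Y = 1 + 2*X1 + 3*X2 (hypothetical).
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0)]
y = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in rows]

beta_hat = fit_multiple_ols(rows, y)
print(beta_hat)
```

Because the synthetic response contains no noise, the fit recovers the generating coefficients 1, 2 and 3 up to floating-point rounding; with real data the estimates would differ from the true coefficients.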