# Simple linear regression

Simple linear regression attempts to answer the question: *Is the variable $X$ related to the variable $Y$? If so, what is the relationship and can we use it to predict $Y$?* Here the variable $X$ is called the *predictor*, *explanatory variable*, *independent variable*, or *covariate*.
The variable $Y$ is called the *response* or *dependent variable*.

## The Linear Model

Simple linear regression, also called *ordinary least squares* (OLS) regression, is a good modeling tool for data where there is a *linear* relationship between an independent variable, $X$, and a dependent variable, $Y$.
We can check for a linear relationship by plotting the variables in a scatterplot and judging whether the points lie on or around a line (the same check used when looking for correlation between the variables).
Statistical modeling involves proposing a theoretical relationship between variables and using data to estimate the unknown components, called parameters, of the model.
The theoretical model used in simple linear regression is 

$$Y=\beta_0 + \beta_1 X +\epsilon,$$ 

where $\beta_0$ is the intercept of the line (where it intersects the y-axis) and $\beta_1$ is the slope of the line (how quickly it rises or falls).
The term $\epsilon$ is an error term assumed to be from a normal distribution with mean zero and (unknown) variance, $\sigma^2$.
Note that $\beta_0$, $\beta_1$ and $\sigma^2$ are the unknown parameters in this model; $\epsilon$ itself is a random error, not a parameter.
The slope and intercept are the parameters in the model that must be estimated from the data.
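
To make the theoretical model concrete, the sketch below simulates responses from $Y=\beta_0 + \beta_1 X +\epsilon$ using only the Python standard library. The function name `simulate` and the parameter values ($\beta_0 = 3$, $\beta_1 = -0.5$, $\sigma = 1$) are assumptions chosen for illustration, not values from any data set discussed here.

```python
import random


def simulate(beta0, beta1, sigma, x_values, seed=0):
    """Draw one response per x from Y = beta0 + beta1 * X + epsilon."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    # epsilon is normal with mean zero and standard deviation sigma
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in x_values]


# Hypothetical parameter values, for illustration only.
x = [i / 10 for i in range(1000)]
y = simulate(beta0=3.0, beta1=-0.5, sigma=1.0, x_values=x)

# Subtracting the true line recovers the simulated errors,
# whose average should be close to zero.
errors = [yi - (3.0 - 0.5 * xi) for xi, yi in zip(x, y)]
mean_error = sum(errors) / len(errors)
print(mean_error)
```

In practice the true parameters are unknown, so the errors cannot be recovered this way; that is why they are approximated by residuals from a fitted line.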

The term ordinary least squares regression gets its name from the way that the intercept and slope are estimated.
Figure 1 shows a scatterplot of engine displacement versus highway miles per gallon.
The best fit regression line is the line drawn in black.
This line is chosen so that the sum of the squared vertical distances from the points to the line is minimized, hence the term *least squares*.
One such distance from a point to the line is shown in red in Figure 1.

![image.png](attachment:image.png)
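
As a minimal sketch of the least squares idea: with a single predictor, the minimizing slope and intercept have the closed forms $b_1 = \sum (x_i-\bar{x})(y_i-\bar{y}) / \sum (x_i-\bar{x})^2$ and $b_0 = \bar{y} - b_1\bar{x}$. The function names `fit_line` and `sum_squared_errors` and the five data points are made up for illustration; they are not the data shown in Figure 1.

```python
def fit_line(x, y):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sum_squared_errors(x, y, b0, b1):
    """Sum of squared vertical distances from the points to the line."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up illustrative data (an assumption, not the data behind Figure 1).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
best = sum_squared_errors(x, y, b0, b1)

# A line with a slightly different slope has a larger sum of squares.
other = sum_squared_errors(x, y, b0, b1 + 0.1)
print(b0, b1, best, other)
```

Comparing `best` with `other` shows the defining property: changing the fitted slope (or intercept) can only increase the sum of squared distances.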

Once we find the best fit regression line, we can find the *predicted* response values.
If the estimate of the intercept is $b_0$ and the estimate of the slope is $b_1$, the predicted response at each value of the predictor, $X$, is found as 

$$\hat{Y} = b_0 + b_1 X$$ 

It is not recommended to predict the response for $X$ values outside of the range that was used to find the best fit line.
The relationship between $Y$ and $X$ is only valid for the observed range of $X$ values.
So, for example, if the values of $X$ used to fit the line are between 0 and 100, it would not make sense to predict $Y$ for values less than 0 or greater than 100.
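
One way to enforce this advice in code is to refuse predictions outside the fitted range. The sketch below is only an illustration: the function `predict`, the estimates `b0 = 1.0` and `b1 = 2.0`, and the range 0 to 100 are all hypothetical.

```python
def predict(b0, b1, x_new, x_min, x_max):
    """Predict Y at x_new, refusing to extrapolate outside [x_min, x_max]."""
    if not (x_min <= x_new <= x_max):
        raise ValueError(
            f"x = {x_new} is outside the fitted range [{x_min}, {x_max}]"
        )
    return b0 + b1 * x_new


# Hypothetical estimates and fitting range, chosen only for illustration.
b0, b1 = 1.0, 2.0
x_min, x_max = 0.0, 100.0

inside = predict(b0, b1, 50.0, x_min, x_max)  # within the observed range

try:
    predict(b0, b1, 150.0, x_min, x_max)  # beyond the observed range
    extrapolated = True
except ValueError:
    extrapolated = False

print(inside, extrapolated)
```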

Since the estimates, $b_0$ and $b_1$, are calculated from the data, they are statistics and can be used to test whether there is a significant relationship between $X$ and $Y$.
This type of test is called a Wald test and tests whether the true coefficient $\beta_1$ is equal to zero or not.
If $\beta_1=0$, the theoretical model becomes $Y=\beta_0 + \epsilon$ and there is no relationship between $X$ and $Y$.
The results of the Wald test will give a *p-value*.
The p-value is the probability of getting an estimate at least as extreme as the one you got if the true parameter $\beta_1$ is really zero.
The p-value is always between 0 and 1.
A large p-value means the data are consistent with $\beta_1 = 0$; it does not prove that $\beta_1$ is zero.
A small p-value means an estimate this far from zero would be unlikely if $\beta_1$ were zero.
So small p-values (typically less than 0.05) are taken as evidence of a linear relationship between $X$ and $Y$, although the p-value measures the strength of the evidence, not the strength of the relationship.
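
The test for the slope can be sketched with the standard library alone. The statistic is the estimate divided by its standard error, $z = b_1 / \mathrm{se}(b_1)$, with $\mathrm{se}(b_1) = \sqrt{s^2 / \sum (x_i-\bar{x})^2}$ and $s^2$ the residual sum of squares divided by $n-2$. As a simplifying assumption, the p-value below uses the standard normal distribution; statistical software typically reports a t-based p-value with $n-2$ degrees of freedom, which is nearly the same for large $n$. The function name `wald_test_slope` and the simulated data are made up for illustration.

```python
import math
import random


def wald_test_slope(x, y):
    """Return (b1, se_b1, z, p_value) for testing whether the slope is zero."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)            # estimate of the error variance
    se_b1 = math.sqrt(s2 / sxx)   # standard error of the slope
    z = b1 / se_b1
    # Two-sided p-value from the standard normal (an approximation).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return b1, se_b1, z, p_value


# Simulated data with a true slope of 0.5 (hypothetical values).
random.seed(0)
x = [float(i) for i in range(50)]
y = [2.0 + 0.5 * xi + random.gauss(0.0, 1.0) for xi in x]

b1, se_b1, z, p_value = wald_test_slope(x, y)
print(b1, se_b1, z, p_value)
```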

## Checking the Model Fit

### Residual Analysis

To check the fit of the model, as in KNN regression, we compute the *residuals* as the difference between the observed responses and the predicted responses.
If the model fits well, the residuals should be small.
We can assess the fit by examining plots of the residuals.
If the linear regression model fits the data well, we should observe no patterns in the plot of $X$ versus the residuals.
Figure 2 below shows a residual plot for a good model fit.
Notice that the residuals seem to be randomly scattered across the plot.
The residuals should be centered around zero (on the y-axis) with nearly as many above zero as below.
Figure 2 exhibits this type of behavior.

![image.png](attachment:image.png)
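
Computing the residuals takes only a few lines. The sketch below uses a made-up data set (not the data behind Figure 2) and a hypothetical helper, `fit_line`. When the model includes an intercept, least squares residuals always sum to zero, which is why a well-behaved residual plot is centered on zero.

```python
def fit_line(x, y):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1


# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)

# Residual = observed response minus predicted response.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

n_above = sum(1 for r in residuals if r > 0)
n_below = sum(1 for r in residuals if r < 0)
print(residuals, n_above, n_below)
```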

If the residual plot seems to be randomly scattered but not centered around zero, this could indicate the presence of *outliers* in the data.
Recall that outliers are observations that are much different than the main body of the data.
The residual plot shown below (Figure 3) is an example of this situation.
Notice in this plot that most of the residuals are centered around zero, falling between -5 and 5.
However, at low values of $X$ (less than 2) and high values of $X$ (between 5 and 7) there are much larger residuals, between 5 and 15.
The plot indicates that the observations that yielded these residuals should be examined more closely.

![image.png](attachment:image.png)

Sometimes the relationship between $X$ and $Y$ is not linear but is polynomial in nature, maybe a quadratic or cubic.
The residuals will reveal whether higher order terms in $X$ are needed to fit the data.
These polynomial models are still considered linear regression models since they are linear in the parameters ($\beta_0$, $\beta_1$, $\beta_2$, $\ldots$), although with more than one term in $X$ they are no longer *simple* linear regression.
The residual plot shown below in Figure 4 illustrates a case where a quadratic term in $X$, $X^2$, is needed.

![image.png](attachment:image.png)
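
To see how a missing quadratic term shows up in the residuals, the sketch below fits a straight line to made-up data generated exactly as $Y = X^2$ (an assumption for illustration, not the data behind Figure 4). The helper `fit_line` is hypothetical.

```python
def fit_line(x, y):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1


# Made-up data with a purely quadratic relationship.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [xi ** 2 for xi in x]

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(residuals)
```

The residuals are positive at both ends of the $X$ range and negative in the middle. This U-shaped pattern is the signature of a missing $X^2$ term.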

We make two assumptions with the linear model: 1) the $\epsilon$'s are normally distributed and 2) the variance is the same for all $X$ values (i.e. the variance is constant).
We can use residual plots to check these assumptions as well.
Figure 5 shows a *qqplot* which compares the residuals to a normal distribution.
If the normal assumption is reasonable for the data, the qqplot of the residuals will look like a 45-degree straight line from the lower left corner to the upper right corner.
The plot of these residuals looks like a straight line in the middle sections but veers off at the lower and upper ends.
This could indicate a problem with the normal assumption for this data set.
Fortunately, for data with lots of observations the normal assumption is less crucial and we can typically proceed with plots like Figure 5.
Sometimes a transformation of the observed response variable, $Y$, such as *log* or *square root*, will bring the residuals closer to a normal distribution.

![image.png](attachment:image.png)
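
The points of a qqplot can be computed by pairing the sorted, standardized residuals with the matching quantiles of a standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The function name `qq_points`, the plotting positions $(i + 0.5)/n$, and the residual values are assumptions for illustration; plotting libraries may use slightly different positions.

```python
from statistics import NormalDist


def qq_points(residuals):
    """Return (theoretical, sample) quantile pairs for a normal qqplot."""
    n = len(residuals)
    mean = sum(residuals) / n
    sd = (sum((r - mean) ** 2 for r in residuals) / (n - 1)) ** 0.5
    # Standardize and sort the residuals to get the sample quantiles.
    sample = sorted((r - mean) / sd for r in residuals)
    # Matching quantiles of the standard normal distribution.
    standard_normal = NormalDist()
    theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sample))


# Made-up residuals, for illustration only.
residuals = [0.06, -0.13, 0.18, -0.21, 0.10]
points = qq_points(residuals)
for theoretical, sample in points:
    print(round(theoretical, 3), round(sample, 3))
```

If the residuals are roughly normal, the pairs fall close to the line where the sample quantile equals the theoretical quantile.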

The other assumption, constant variance, is more crucial for successful modeling.
If the variance is not constant but instead is a function of the response, $Y$, then a plot of the predicted values, $\hat{Y}$, versus the residuals will have a funnel shape like that observed in Figure 6.
The residual plot clearly shows that for small $\hat{Y}$ the variation in residuals is small, funneling out as $\hat{Y}$ increases.
A $\log$ or square root transformation on $Y$ can sometimes fix this problem.
In this case it is called a *variance stabilizing* transformation.

![image.png](attachment:image.png)
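
A funnel shape can also be checked numerically by comparing the size of the residuals at low and high predicted values. The sketch below is an informal check under stated assumptions, not a formal test: the data are made up so that the scatter grows with $X$ (each response sits 30% above or below $2X$), and the mean absolute residual is used as a rough measure of spread.

```python
def fit_line(x, y):
    """Return (b0, b1), the least squares intercept and slope."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1


# Made-up data: responses alternate 30% above and 30% below 2x,
# so the scatter around the line grows as x increases.
x = [float(i) for i in range(1, 21)]
y = [2.0 * xi * (1.3 if i % 2 == 0 else 0.7) for i, xi in enumerate(x)]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Order the residuals by predicted value and split them into two halves.
pairs = sorted(zip(fitted, residuals))
half = len(pairs) // 2
spread_low = sum(abs(r) for _, r in pairs[:half]) / half
spread_high = sum(abs(r) for _, r in pairs[half:]) / (len(pairs) - half)
print(spread_low, spread_high)
```

A much larger spread in the upper half than in the lower half is the numerical counterpart of the funnel seen in a residual plot.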

### Coefficient of Determination

The *coefficient of determination*, $r^2$, is another way to assess the model fit.
It measures the proportion of the variation in $Y$ that is explained by the regression; the remaining variation is attributed to random noise from the error term, $\epsilon$.
$r^2$ is always between 0 and 1.
The bigger it is, the better the model fits the data.
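
As a minimal sketch, $r^2$ can be computed as $1 - \mathrm{SSE}/\mathrm{SST}$, where SSE is the sum of squared residuals and SST is the total sum of squares of $Y$ around its mean. In simple linear regression this equals the square of the correlation between $X$ and $Y$. The function name `r_squared` and the data are made up for illustration.

```python
import math


def r_squared(x, y):
    """Return (r2, correlation) for the least squares line of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)  # total variation in y
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    correlation = sxy / math.sqrt(sxx * sst)
    return 1.0 - sse / sst, correlation


# Made-up illustrative data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

r2, correlation = r_squared(x, y)
print(r2, correlation ** 2)
```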