# Linear Regression and Correlation
What is the association between two quantitative variables?   
What is the strength of association (correlation) between the two variables?   

**explanatory variable**: independent

**response variable**: dependent

Can we use a regression equation to predict the value of the response variable from the explanatory variable?

## Linear Relationships

- let $y$ denote the response variable
- let $x$ denote the explanatory variable

We want to understand the relationship between $y$ and $x$, and explain it using a mathematical equation. Because $y$ is the response variable, we can say that $y$ depends on $x$, or that $x$ dictates the value of $y$.

The equation for a straight line:

$$y = \alpha + \beta x$$

The **slope** is how "steep" the line is and is denoted by $\beta$.  
The **intercept** is where the line crosses the y-axis (when $x = 0$) and is denoted by $\alpha$.

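Solving for the coefficients: $\alpha$ is the value of $y$ at $x = 0$,   
and $\beta$ is the "rise over run": the change in $y$ for each 1-unit increase in $x$.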

**Positive Relationship**: $\beta > 0$   
**Negative Relationship**: $\beta < 0$   
**Independence**: $\beta = 0$

### Example

We are interested in the relationship between a response variable (violent crimes at the state level) and some explanatory (independent) variables:
- poverty rate
- percentage of the population living in urban areas
- percentage of residents who are high school graduates

What is $y$? the dependent variable: *violent crimes at the state level*

What is $x$? the explanatory variables, any of: poverty rate, urban population percentage, high school graduate percentage.

The key is that we are explaining differences in violent crimes at the state level (the response, or dependent variable) using variation in the different independent variables.

There is an "art" in selecting the response variable in studies. It can often be argued either way when choosing. Be careful to distinguish *correlation* from *causation*, and ask whether a relationship between variables really reflects a causal relationship.

Working the example of *violent crimes at the state level*, we have:
- poverty rate: $y = 210 + 25x$
- percentage of the population living in urban areas: $y = 26 + 8x$
- percentage of residents who are high school graduates: $y = 1756 - 16x$

### How do we interpret the equations?
- poverty rate: $y = 210 + 25x$
  - The y-intercept is 210: when $x = 0$, meaning there is no poverty, the predicted violent crime rate at the state level is 210
  - There is a positive relationship (a positive slope) relating $x$ (independent variable: poverty rate) with $y$ (dependent variable: violent crime rate at the state level)
  - this means that a 1-unit increase in poverty rate (explanatory variable) is associated with a 25-unit increase in violent crime rate at the state level (response variable)

- percentage of the population living in urban areas: $y = 26 + 8x$
  - a positive relationship
  - a 1-unit increase in population living in urban areas is associated with an 8-unit increase in violent crime rate at the state level

- percentage of residents who are high school graduates: $y = 1756 - 16x$
  - a negative relationship
  - the predicted violent crime rate at a high school graduation rate of zero is 1756
  - there is a decreasing relationship
  - a 1-unit decrease in the percentage of residents who are high school graduates is associated (on average) with a 16-unit increase in violent crime rate at the state level
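
As a small illustration of the slope interpretation, the sketch below plugs two values of $x$ that differ by one unit into each of the three equations and reports the change in the predicted $y$. The coefficients come from the equations above; the two $x$-values (10 and 11) and the helper name `predict` are assumptions made only for this example.

```python
# Coefficients (intercept, slope) copied from the three example equations.
models = {
    "poverty rate": (210, 25),
    "urban population (%)": (26, 8),
    "high school graduates (%)": (1756, -16),
}


def predict(intercept, slope, x):
    """Return the value on the line y = intercept + slope * x."""
    return intercept + slope * x


# Hypothetical x-values, one unit apart, chosen only to show the arithmetic.
x_low, x_high = 10, 11

changes = {}
for name, (intercept, slope) in models.items():
    change = predict(intercept, slope, x_high) - predict(intercept, slope, x_low)
    changes[name] = change
    print(f"{name}: a 1-unit increase in x changes predicted y by {change}")
```

Whatever starting value of $x$ is used, the change in the predicted $y$ for a 1-unit increase equals the slope.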

What do you do if there is no data point that shows a zero value for a variable in the dataset? How do you obtain the y-intercept if there is no data showing the $y$ value when $x$ is zero? Because we are using a line to relate the two variables, the line can be extended to $x = 0$ to find the y-intercept.

### What do the equations imply about causality?

These equations do not imply anything about causality. We cannot state, for example, that if we increase percentage of residents who are high school graduates we will decrease the violent crime rate at the state level. These are not relationships that are causal in any way. The only thing we can state is, given the data this is the type of relationship that we observe overall. But we cannot make any claims of one variable causing an effect on the other.

## Least Squares Prediction Equation

This section extends the straight-line equation to linear regression: finding the line that best describes the relationship between two variables.

**Step 1: generate a scatterplot**
- look at how the data relates
- does it look like we can draw a straight line through the data points?
- are there any non-linear curves? it is inappropriate to use a Straight Line Model when there is a non-linear relationship
- a **box-and-whisker plot** along the x-axis can also help visually perceive the density or distribution
- there may be "distant" data points, or outliers, which appear to be well outside the bounds

### Prediction Equation
The prediction equation for a line:

$$\hat{y} = a+bx$$

Y-hat denotes the fact that we use sample data to estimate the slope and intercept for the equation. Using this equation we can obtain a predicted value of the response variable for any given value of $x$, as long as we know the slope and intercept.

### Prediction Equation for the Best Straight Line

To obtain the y-intercept $a$:

$$a = \bar{y} - b \bar{x}$$

where $\bar{y}$ denotes the average value of the response variable (dependent variable)   
and $\bar{x}$ is the average value of $x$ across all data points in your sample

to obtain the slope $b$:

$$b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^{2}}$$

In the numerator, for each data point we multiply the deviation of $x$ from the average $\bar{x}$ by the deviation of $y$ from the average $\bar{y}$, and then sum these products over all data points.

The denominator is obtained by taking the sum of the squared deviation of each observation of $x$ from the overall mean $\bar{x}$.

Executing this for every data point, we can obtain the appropriate y-intercept and slope for the dataset.
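
A minimal sketch of these calculations in plain Python follows. The five data points are made up purely to keep the arithmetic easy to follow, and the variable names (`s_xy`, `s_xx`) are illustrative. The intercept is computed as $\bar{y} - b\bar{x}$, which forces the line through the point $(\bar{x}, \bar{y})$.

```python
# Hypothetical sample, made up purely to illustrate the arithmetic.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n  # mean of the explanatory variable
y_bar = sum(y) / n  # mean of the response variable

# Numerator of b: sum of products of deviations from the two means.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
# Denominator of b: sum of squared deviations of x from its mean.
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b = s_xy / s_xx        # slope
a = y_bar - b * x_bar  # intercept

print(f"y-hat = {a:.2f} + {b:.2f}x")  # prints: y-hat = 2.20 + 0.60x
```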

### Effects of Outliers on the Prediction Equation
Outliers exist in almost every dataset.

It is a judgement call whether to include or exclude outliers in the dataset when creating a Straight Line Model. It could be an objective decision based on some firm criteria, or it could be a subjective decision based on less quantifiable criteria.

Recall that the mean is sensitive to outliers. Removing outliers from a dataset changes $\bar{x}$ and $\bar{y}$, and therefore changes the slope and intercept of the prediction equation (above).

### Prediction Errors are called Residuals

How good is our prediction equation?

For any value of $x$, how far off would we tend to be in predicting the $y$ value compared to actual $y$ values in the dataset?

The difference between an observed value of the response variable and its predicted value, $y - \hat{y}$, is called the residual (or the error term). It can be thought of as what's left over: the unexplained variation between a data point and what the model (the prediction equation) predicts.

### Calculating Residuals
Compare actual $y$ values with $y$ values produced by prediction equation.

Example,
- take an actual $x$ value from the dataset (e.g., Poverty Rate)
- plug the actual $x$ value into the prediction equation: $\hat{y} = a + bx$
- this gives the predicted value $\hat{y}$ of the response variable (e.g., Murder Rate) associated with that $x$ value (explanatory variable, e.g., Poverty Rate)
- the predicted value will usually differ somewhat from the actual $y$ value, because predictions fall exactly on the line while the observed data points scatter around it

Larger residuals will be observed when outliers are included in the prediction equation. 
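
The steps above can be sketched as follows. The data are five made-up points, and the coefficients $a = 2.2$ and $b = 0.6$ are assumed to be the fitted intercept and slope for those points; none of these numbers come from the crime dataset.

```python
# Hypothetical data with an assumed fitted line y-hat = 2.2 + 0.6x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

# Predicted value for each observed x, then residual = observed - predicted.
predicted = [a + b * xi for xi in x]
residuals = [yi - y_hat for yi, y_hat in zip(y, predicted)]

for xi, yi, y_hat, e in zip(x, y, predicted, residuals):
    print(f"x={xi}  observed y={yi}  predicted={y_hat:.1f}  residual={e:+.1f}")
```

A positive residual means the observed value lies above the line; a negative residual means it lies below.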

### Prediction Equation has Least Squares Property

The equations given for $a$ and $b$ are the values that provide the prediction equation $\hat{y} = a + bx$ for which the residual sum of squares, $SSE = \sum(y-\hat{y})^{2}$, is a minimum.

Note: SSE = sum of squared errors

The residual sum of squares describes the variation of the data around the prediction line.

The least squares line prediction has the following properties:
- the sum and the mean of the residuals both equal 0
- the line passes through the point $(\bar{x}, \bar{y})$
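
These properties can be checked numerically. The sketch below uses five made-up points and assumes $a = 2.2$, $b = 0.6$ is their least squares fit; it then compares the SSE of that line with the SSE of a few slightly shifted lines. The helper name `sse` and the shift size 0.1 are arbitrary choices for illustration.

```python
# Hypothetical data; a and b are assumed to be the least squares fit.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)


def sse(intercept, slope):
    """Residual sum of squares for the line y-hat = intercept + slope * x."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Property 1: the residuals sum to (numerically) zero.
residual_sum = sum(yi - (a + b * xi) for xi, yi in zip(x, y))

# Property 2: the line passes through (x_bar, y_bar).
passes_through_means = abs((a + b * x_bar) - y_bar) < 1e-9

# Least squares property: any nearby line has a larger SSE.
best_sse = sse(a, b)
nearby_sse = [
    sse(a + da, b + db)
    for da in (-0.1, 0.0, 0.1)
    for db in (-0.1, 0.0, 0.1)
    if (da, db) != (0.0, 0.0)
]

print(f"sum of residuals: {residual_sum:.10f}")
print(f"passes through the means: {passes_through_means}")
print(f"SSE of fitted line: {best_sse:.2f}; smallest nearby SSE: {min(nearby_sse):.2f}")
```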

## The Linear Regression Model

When we have a prediction equation for a line, $y = \alpha + \beta x$, each value of $x$ corresponds to a single value of $y$. 

In real life, however, not all observations with the same $x$-value have the same $y$-value. Rather, there is a conditional probability distribution over the $y$ values for a fixed $x$ value which allows for variability in $y$ for each value of $x$. 

For a given value of $x$, $\alpha + \beta x$ represents the mean of the conditional probability distribution of $y$ for subjects having that value of $x$.

### Regression Function

$$E(y) = \alpha + \beta x$$

This shows the relationship between $x$ and the mean of the conditional distribution of $y$. It is called a linear regression function because it uses a straight line to relate the mean of $y$ to the values of $x$.

A **regression function** is a mathematical function that describes how the mean of a response variable changes according to the value of an explanatory variable.

**Regression coefficients**: the intercept and the slope

### Describing Variation about the Regression Line

We can describe the variability of the $y$-values for all subjects having the same $x$-value. This is the *conditional standard deviation*, or $\sigma$.

### Mean Square Error: Estimating Conditional Variation

Linear regression assumes that the conditional distribution of $y$ at each value of $x$:
- has the same standard deviation $\sigma$ at the various values of $x$
- is normal

The sample estimate of $\sigma$ is:

$$s = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum(y - \hat{y})^{2}}{n-2}}$$

At any fixed value of $x$, our model predicts that $y$ varies around a mean $E(y)$ with a standard deviation estimated by $s$.

The degrees of freedom are $df = n-2$ because the prediction equation estimates two unknown parameters: $\alpha$ and $\beta$.

### Marginal and Conditional Distributions
- The *marginal distribution* shows the overall variability in $y$ values.
- The *conditional distribution* shows how $y$ varies at a fixed value of $x$.
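
To make the contrast concrete, the sketch below computes both the marginal standard deviation $s_{y}$ (spread of $y$ around $\bar{y}$, divisor $n-1$) and the estimated conditional standard deviation $s$ (spread of $y$ around the line, divisor $n-2$). The data are made up, and the coefficients $a = 2.2$, $b = 0.6$ are assumed to be their least squares fit.

```python
import math

# Hypothetical data with an assumed least squares fit y-hat = 2.2 + 0.6x.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

n = len(y)
y_bar = sum(y) / n

# Marginal variability: deviations of y from its overall mean.
tss = sum((yi - y_bar) ** 2 for yi in y)
s_y = math.sqrt(tss / (n - 1))

# Conditional variability: deviations of y from the prediction line.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

print(f"marginal s_y = {s_y:.4f}, conditional s = {s:.4f}")
```

In this example $s$ is smaller than $s_{y}$: once $x$ is taken into account, the $y$-values vary less around the line than they do around the overall mean.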

## Measuring Linear Association: The Correlation

The slope in a regression line tells us the direction of an association between two variables, but not its strength.

Correlation tells us about the **strength of linear association** between two variables. 

The slope of a line, $b$, depends on the units of measurement.

Correlation is the value the slope would take if both $x$ and $y$ variables had equal standard deviations. In other words, it standardizes the measure of association so that it does not depend on the units of measurement.

### Calculating Correlation
We can calculate the standard deviations of $x$ and $y$:

$$s_{x} = \sqrt{\frac{\sum(x - \bar{x})^{2}}{n-1}}$$

$$s_{y} = \sqrt{\frac{\sum(y - \bar{y})^{2}}{n-1}}$$

The correlation, denoted by $r$, relates to the slope $b$ of the prediction equation by:

$$r = \left ( \frac{s_{x}}{s_{y}} \right ) b$$
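
A short sketch of this relationship, using Python's standard `statistics.stdev` (which uses the $n-1$ divisor shown above). The data are made up, and the slope $b = 0.6$ is assumed to be the least squares slope for those points.

```python
import statistics

# Hypothetical data; b is assumed to be the least squares slope for it.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b = 0.6

s_x = statistics.stdev(x)  # sample standard deviation of x (n - 1 divisor)
s_y = statistics.stdev(y)  # sample standard deviation of y (n - 1 divisor)

# Correlation as the slope rescaled by the ratio of standard deviations.
r = (s_x / s_y) * b

print(f"s_x = {s_x:.4f}, s_y = {s_y:.4f}, r = {r:.4f}")
```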

### Properties of the Correlation
- measures the **strength of linear association** between $x$ and $y$
- correlation must fall between -1 and 1, $-1 \le r \le 1$
  - a correlation of -1 occurs when two variables are perfectly negatively correlated with each other
  - a correlation of 1 occurs when two variables are perfectly positively correlated with each other
- correlation has the same sign as the slope
  - if two variables are negatively correlated, that means on average when you see an increase in one of the variables you see a decrease in the other
  - a positive sign on the slope means a positive sign on the correlation, so when you see an increase in one variable you tend to also see an increase in the other variable
- when $b = 0$, $r = 0$
- $r = \pm 1$ when all sample points fall exactly on the prediction line.
  - They correspond to perfect positive and negative linear associations where there is no prediction error.
- the larger the absolute value of $r$, the stronger the linear association
- correlation treats $x$ and $y$ symmetrically, unlike the slope
- value of $r$ does not depend on the variables' units of observation
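
Two of these properties, unit independence and symmetry, can be illustrated with a short sketch. The data are made up, the helper `slope_and_r` is a hypothetical name, and the correlation is computed as $\sum (x - \bar{x})(y - \bar{y}) / \sqrt{\sum (x - \bar{x})^{2} \sum (y - \bar{y})^{2}}$, which is algebraically the same as $(s_{x}/s_{y})\,b$.

```python
import math


def slope_and_r(x, y):
    """Return the least squares slope of y on x and the correlation r."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / s_xx, s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b, r = slope_and_r(x, y)

# Change the units of x (multiply by 100): the slope changes, r does not.
x_rescaled = [100 * xi for xi in x]
b_rescaled, r_rescaled = slope_and_r(x_rescaled, y)

# Swap the roles of x and y: the slope changes, r does not.
b_swapped, r_swapped = slope_and_r(y, x)

print(f"original:  b = {b:.3f}, r = {r:.4f}")
print(f"rescaled:  b = {b_rescaled:.3f}, r = {r_rescaled:.4f}")
print(f"swapped:   b = {b_swapped:.3f}, r = {r_swapped:.4f}")
```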

### Correlation is Useful
Correlation is useful because it allows us to compare the strength of association across multiple variables.

Example:
- we can have two linear regression equations
  - one for the relationship between murder rate and poverty rate
  - $y = 210 + 25x$
  - and another for the relationship between murder rate and percentage of residents who are high school graduates
  - $y = 1756 - 16x$
- just looking at these equations, we cannot tell whether the high school graduation rate or the poverty rate is more strongly associated with murder rates
- but if the correlation between murder rate and poverty is 0.63, and the correlation between murder rate and high school graduation rates is -0.30, we know that the association between murder rate and poverty is stronger, because $|0.63| > |-0.30|$

### R-Squared: How well can $x$ predict $y$?
We want to know how well our regression equation performs: how well can $x$ predict $y$? To what extent does variation in $x$ predict variation in $y$?

One way to assess this is by looking at the **r-squared** statistic, which measures the proportional reduction in prediction error that we get by modelling $y$ from $x$, rather than just using the average of $y$ as a prediction.

**Rule 1** (predicting $y$ without using $x$): the best predictor is $\bar{y}$, the sample mean.

$$E_{1} = TSS = \sum (y - \bar{y})^{2}$$

TSS = Total Sum of Squares

**Rule 2** (predicting $y$ using $x$): when the relationship between $x$ and $y$ is linear, the prediction equation $\hat{y} = a + bx$ provides the best predictor of $y$. For each subject, substituting the $x$-value into this equation provides the predicted value of $y$.

$$E_{2} = SSE = \sum (y - \hat{y})^{2}$$

SSE = Sum of Squared Errors, or Residual Sum of Squares

(Agresti, 4ed, Chapter 9, section 4, pp. 273-274; see specifically figure 9.13 and surrounding descriptive text)

The proportional reduction in error from using the linear prediction equation instead of $\bar{y}$ (the sample mean) to predict $y$ is:

$$r^{2} = \frac{TSS - SSE}{TSS} = \frac{\sum (y - \bar{y})^{2} - \sum (y - \hat{y})^{2}}{\sum (y - \bar{y})^{2}}$$

This is known as **r-squared**, or the **coefficient of determination**.

R-squared is the square of the correlation.

The square of the correlation tells us the proportion of the variation in $y$ that is explained when you use $x$ to predict $y$, rather than using the average value of $y$.
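
The following sketch computes $r^{2}$ from the two prediction rules and checks that it matches the square of the correlation. The data are made up, and all variable names are illustrative.

```python
import math

# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

# Least squares fit.
b = s_xy / s_xx
a = y_bar - b * x_bar

# Rule 1 error: predict every y with the sample mean.
tss = sum((yi - y_bar) ** 2 for yi in y)
# Rule 2 error: predict each y with the prediction equation.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = (tss - sse) / tss

# Cross-check against the correlation computed directly.
r = s_xy / math.sqrt(s_xx * tss)

print(f"TSS = {tss:.1f}, SSE = {sse:.1f}, r-squared = {r_squared:.2f}")
print(f"r = {r:.4f}, r * r = {r * r:.2f}")
```

For these points, using $x$ reduces the prediction error by 60% relative to using $\bar{y}$ alone.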

### Interpreting R-Squared
(see example 9.9 in textbook)

### Properties of R-Squared
- r-squared falls between 0 and 1
- if there is no prediction error, then r-squared = 1
- r-squared = 0 when there is no relationship (b = 0) between $x$ and $y$
- r-squared measures the strength of linear association

## Inference for The Slope and Correlation
Assumptions for Statistical Inference:
- the study uses randomization, such as simple random sample in a survey
- the mean of $y$ is related to $x$ by the linear equation: $E(y) = \alpha + \beta x$
- the conditional standard deviation $\sigma$ is identical at each $x$-value
- the conditional distribution of $y$ at each value of $x$ is normal

According to the first assumption the data represent a random sample, whereas the second assumption implies that the linear regression function is valid. These two are the most important of the four assumptions.

### Test of Independence
If the normal conditional distribution of $y$ is the same at each $x$ value, then the two quantitative variables are statistically independent. For the regression function $E(y) = \alpha + \beta x$, this means that the slope $\beta$ is zero. The null hypothesis is that the variables are statistically independent.

We can test independence against $H_{a}: \beta \ne 0$, or a one-sided alternative to predict the direction of the association. The test statistic equals:

$$t = \frac{b}{se}$$

The formula for the standard error is:

$$se = \frac{s}{\sqrt{\sum (x - \bar{x})^{2}}}$$
where
$$s = \sqrt{\frac{SSE}{n-2}}$$

A small $s$ occurs when the data points show little variability about the prediction equation. Also, the standard error of $b$ is inversely related to $\sqrt{\sum (x - \bar{x})^{2}}$. This sum increases as the sample size increases. The $se$ also decreases when the $x$-values are more highly spread out.

The P-value for $H_{a}: \beta \ne 0$ is the two-tail probability from the t-distribution. For large $df$ (degrees of freedom), the t-distribution is similar to the standard normal, so the P-value can be approximated using the normal probability table.
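
A minimal sketch of the test statistic calculation, starting from summary quantities. The values of `n`, `b`, `sse`, and `s_xx` are assumed numbers from a hypothetical fit, not results from the crime data. The P-value itself is not computed here; it would come from a t table or statistical software using $df = n - 2$.

```python
import math

# Assumed summary quantities from a hypothetical fit.
n = 5        # sample size
b = 0.6      # estimated slope
sse = 2.4    # residual sum of squares
s_xx = 10.0  # sum of squared deviations of x from its mean

s = math.sqrt(sse / (n - 2))  # estimated conditional standard deviation
se = s / math.sqrt(s_xx)      # standard error of the slope
t = b / se                    # test statistic for H0: beta = 0
df = n - 2                    # degrees of freedom for the t-distribution

# If the x-values were more spread out (s_xx four times larger) with the
# same s, the standard error would be halved.
se_wide = s / math.sqrt(4 * s_xx)

print(f"s = {s:.4f}, se = {se:.4f}, t = {t:.4f}, df = {df}")
print(f"se with wider x spread = {se_wide:.4f}")
```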

### Confidence Interval for the Slope
A small P-value for $H_{0}: \beta = 0$ suggests that the regression line has a non-zero slope. A confidence interval for $\beta$ has the formula:

$$b \pm t(se)$$

Recall, the null hypothesis is asserting $\beta = 0$, and so if 0 does not fall within the resulting confidence interval we may reject the null hypothesis.
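
As a sketch of the interval calculation, the example below assumes a slope $b = 0.6$ and standard error $se = 0.2828$ from a hypothetical fit with $n = 5$, and uses the t-score 3.182, the standard table value for a 95% interval with $df = 3$.

```python
# Assumed values from a hypothetical fit with n = 5 (df = 3).
b = 0.6          # estimated slope
se = 0.2828      # standard error of the slope
t_score = 3.182  # t table value for 95% confidence, df = 3

margin = t_score * se
lower, upper = b - margin, b + margin
contains_zero = lower <= 0 <= upper

print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
print(f"interval contains 0: {contains_zero}")
```

Here the interval contains 0, so with these assumed numbers the null hypothesis $\beta = 0$ would not be rejected at the 0.05 level.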

### Inference for the Correlation
The test statistic for testing $H_{0}: \rho = 0$ is:

$$t = \frac{r}{\sqrt{(1-r^{2})/(n-2)}}$$

This provides the same value as the test statistic $t = \frac{b}{se}$, since both test essentially the same hypothesis, with the same degrees of freedom.
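
The equality of the two test statistics can be checked numerically. The sketch below uses made-up data and computes $t$ both ways; all variable names are illustrative.

```python
import math

# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Route 1: t = b / se, from the slope and its standard error.
b = s_xy / s_xx
a = y_bar - b * x_bar
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
se = math.sqrt(sse / (n - 2)) / math.sqrt(s_xx)
t_from_slope = b / se

# Route 2: t computed from the correlation r.
r = s_xy / math.sqrt(s_xx * s_yy)
t_from_r = r / math.sqrt((1 - r ** 2) / (n - 2))

print(f"t from slope = {t_from_slope:.4f}, t from r = {t_from_r:.4f}")
```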

## Model Assumptions and Violations
(see lecture video and Agresti chapter 9.6)