### Assumptions of Linear Regression

- Linearity
    - Relationships between the predictors and the outcome variable should be linear
- Homogeneity of variance (homoscedasticity)
    - Error variance is constant
- Independence
    - The errors associated with one observation are not correlated with the errors of any other observation
    
    
More assumptions:
- Normality
- Independent predictors
- Model specification
    - Model should be properly specified (no extraneous predictors, and no relevant predictors left out)
    
    
#### Linearity
- The relationship between the independent and dependent variable must be linear

But how do we detect violations of this assumption?

Residual plot: a bivariate plot of the residuals against the predicted (fitted) values
- Ideally, the predicted value will be a linear function of the IVs
- The trend in the residuals should be approximately horizontal at zero
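
To make the residual-plot check concrete, here is a minimal sketch in plain Python (no plotting library). The helper names `fit_line` and `residuals_vs_fitted` and the data are made up for illustration; the data are deliberately quadratic so that a straight-line fit leaves a visible pattern in the residuals.

```python
def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residuals_vs_fitted(x, y):
    """Return the (fitted, residual) pairs that a residual plot would show."""
    intercept, slope = fit_line(x, y)
    fitted = [intercept + slope * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    return fitted, resid


# Hypothetical data with a curved (quadratic) relationship
x = [1, 2, 3, 4, 5, 6, 7]
y = [xi ** 2 for xi in x]

fitted, resid = residuals_vs_fitted(x, y)
for f, r in zip(fitted, resid):
    print(f"fitted={f:7.2f}  residual={r:7.2f}")
```

With these data the residuals are positive at both ends of the fitted range and negative in the middle. That U shape, rather than a flat band around zero, is the kind of pattern that signals a violation of linearity.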
    
#### Checking homoscedasticity
- Another residual plot: the size (spread) of the residuals should show no pattern across the fitted values

Correcting for violations against homoscedasticity:
- Transform outcome variable using log or square root transformation
- Use quantile regression or mixed-effects models

Detecting violations of homoscedasticity for ANOVA:
- Boxplots, residual plots, Levene's test, Bartlett's test, Fligner-Killeen test
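
As an illustration of one of these tests, the sketch below computes the Levene statistic $W$ by hand using only the Python standard library. The function name `levene_statistic` and the three groups are made up for this example. No p-value is computed here; under the null hypothesis of equal variances, $W$ is compared against an $F$ distribution with $k-1$ and $N-k$ degrees of freedom.

```python
from statistics import mean, median


def levene_statistic(groups, center=mean):
    """Levene's W statistic for equality of variances across groups.

    center=mean gives the classic Levene test; passing center=median
    gives the Brown-Forsythe variant.
    """
    # Absolute deviation of each observation from its own group center
    z = [[abs(v - center(g)) for v in g] for g in groups]
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    z_group_means = [mean(zg) for zg in z]
    z_grand_mean = sum(sum(zg) for zg in z) / n_total
    between = sum(
        len(zg) * (m - z_grand_mean) ** 2 for zg, m in zip(z, z_group_means)
    )
    within = sum(
        (v - m) ** 2 for zg, m in zip(z, z_group_means) for v in zg
    )
    return (n_total - k) / (k - 1) * between / within


# Hypothetical groups: a and b have the same spread, c is much more spread out
a = [10, 12, 14, 16, 18]
b = [11, 13, 15, 17, 19]
c = [0, 10, 20, 30, 40]

print(levene_statistic([a, b]))  # same spread in both groups
print(levene_statistic([a, c]))  # very different spread
print(levene_statistic([a, c], center=median))
```

The statistic is essentially a one-way ANOVA on the absolute deviations, so groups with equal spread give a value near zero and groups with unequal spread give a large value.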

What can we do if normality is violated?
- Transform the data
- Remove outliers
- Robust statistics and methods

#### Data transformation
Left skew
- e.g., test scores for an easy test (ceiling effect)
    - Many high scores, but a few people bombed it...
    - How can we transform the data to remove this left skew?

If you have a left-skewed distribution, try these:
1. Square ($Y \rightarrow Y^2$)
2. Cube ($Y \rightarrow Y^3$)
3. Quartic ($Y \rightarrow Y^4$)
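
A small sketch of how one might check whether a power transformation actually reduces left skew. The `skewness` helper uses the simple moment-based formula $m_3 / m_2^{3/2}$, and the "scores" are made-up values constructed to be left-skewed (square roots of evenly spaced numbers), so squaring happens to be exactly the right transformation here. Real data will rarely be this tidy.

```python
import math


def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical left-skewed values: most are bunched near the top
scores = [math.sqrt(v) for v in range(1, 10)]
squared = [s ** 2 for s in scores]

print("skewness before:", round(skewness(scores), 3))
print("skewness after squaring:", round(skewness(squared), 3))
```

The skewness is negative before the transformation and essentially zero after it. Comparing the statistic before and after each candidate power is one way to decide when to stop.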

#### Box-Cox transformations
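$$ y \mapsto y^\lambda $$

Strictly, the Box-Cox family uses the scaled form $\frac{y^\lambda - 1}{\lambda}$ for $\lambda \neq 0$, which differs from $y^\lambda$ only by a shift and rescaling. It requires $y > 0$.

Typically, if $|\lambda| > 1$, we round to the nearest whole number

If $\lambda = 0$, then $y \mapsto \log(y)$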

In R, we use the `bcPower` function from the `car` package (Box-Cox family of power transforms)
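
For readers without R at hand, the transformation itself is short enough to sketch in Python. This is not `bcPower`; it is a hypothetical `box_cox` helper that assumes the scaled form $\frac{y^\lambda - 1}{\lambda}$ for $\lambda \neq 0$ and $\log(y)$ for $\lambda = 0$, with strictly positive input.

```python
import math


def box_cox(y, lam):
    """Scaled Box-Cox power transform of strictly positive values."""
    if any(v <= 0 for v in y):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        # The limit of (y**lam - 1) / lam as lam -> 0 is log(y)
        return [math.log(v) for v in y]
    return [(v ** lam - 1) / lam for v in y]


y = [1, 4, 9]  # made-up positive values
for lam in (0, 0.5, 1, 2):
    print(lam, [round(v, 3) for v in box_cox(y, lam)])
```

Note that $\lambda = 1$ only shifts the data by one, so it leaves the shape of the distribution unchanged; values of $\lambda$ above 1 pull in a left tail and values below 1 pull in a right tail.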

#### How to evaluate transformation outcomes
We need to check the data before and after each level of transformation; otherwise we might "overtransform" or "undertransform" our data.

Effective transformations:
1. Improve normality by reducing skewness and kurtosis
2. Reduce variance when possible

Data transformation may make model interpretation difficult, so some transformations are best reserved for strictly predictive models.

#### Outliers, leverage points, influential datapoints
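- Outlier: a data point drastically different from the others.
- Leverage point: a data point with unusual predictor values, far from the bulk of the other cases.
- A data point is influential if its presence/absence has a big impact on the relationship/regression model.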

#### How do we detect outliers and leverage points?
- We cannot simply plot every variable...
- We cannot simply plot every pair of variables...
- Outliers in high dimensional space can be well hidden

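Method: MCD (minimum covariance determinant) algorithm, which estimates a robust center and covariance so that cases far from the bulk of the data can be flagged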

How do we detect influential datapoints?
- Use Cook's distance
- An aggregated influence measure, showing the effect of the $i$th case on all fitted values.
- The higher the value, the more influential the case.
- Rule of thumb: any case with a Cook's distance greater than $\frac{4}{n-p-1}$ (with $n$ cases and $p$ predictors) is potentially influential
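
The sketch below works through Cook's distance for the simplest case, a regression with one predictor, using the standard formula $D_i = \frac{e_i^2}{k \, s^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$, where $e_i$ is the residual, $h_{ii}$ the leverage, $s^2$ the residual mean square, and $k$ the number of estimated coefficients (here 2: intercept and slope). The function name `cooks_distances` and the data are made up; the last case is deliberately moved far off the line.

```python
def cooks_distances(x, y):
    """Cook's distance for each case in a one-predictor least-squares fit."""
    n = len(x)
    n_params = 2  # intercept + slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in resid) / (n - n_params)
    # Leverage (hat value) of each case in simple regression
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [
        e ** 2 / (n_params * mse) * h / (1 - h) ** 2
        for e, h in zip(resid, leverage)
    ]


# Hypothetical data: points on the line y = 2x, except the last case
x = list(range(1, 11))
y = [2 * xi for xi in x]
y[-1] = 40

d = cooks_distances(x, y)
threshold = 4 / (len(x) - 1 - 1)  # 4 / (n - p - 1) with p = 1 predictor
flagged = [i for i, di in enumerate(d) if di > threshold]

for i, di in enumerate(d):
    print(f"case {i}: D = {di:.3f}")
print("threshold:", threshold, "flagged cases:", flagged)
```

Only the last case exceeds the rule-of-thumb threshold in this example. It combines a large residual with high leverage (it sits at the edge of the predictor range), which is exactly the combination Cook's distance is designed to pick up.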

#### Robust methods
- Tons of options! Great for analyzing data when dropping outliers loses too much information

Two categories of robust methods:
- Downweight potential outliers
- Alternative distributions

#### Independent predictors
This assumption is violated when predictors are linear combinations of one another, or nearly so (multicollinearity).
- A high level of multicollinearity can cause these problems:
    - Large changes in estimated regression coefficients when a variable is added/deleted, or when an observation is added/deleted.
    - Finding non-significant results for important variables. (Similar to a Type II error)
    
#### Testing for multicollinearity
- VIF (variance inflation factor)
    - When predictors are uncorrelated, VIF = 1
    - When predictors are correlated, VIF > 1
    - Rule of thumb: When VIF > 10, multicollinearity has an influence on parameter estimation.
- Calculate with the `vif` function in R (e.g., `car::vif`)
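
To show where the numbers come from, here is a minimal sketch of the VIF for the special case of exactly two predictors. In general $\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$, where $R_j^2$ comes from regressing predictor $j$ on all the other predictors; with only two predictors, $R_j^2$ is just their squared correlation. The helper names and data are made up for illustration.

```python
def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / (saa * sbb) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with two predictors R^2 is the squared correlation."""
    return 1 / (1 - pearson_r(x1, x2) ** 2)


# Hypothetical predictors
x1 = [-2, -1, 0, 1, 2]
x2_uncorrelated = [2, -1, -2, -1, 2]      # orthogonal to x1
x2_collinear = [-2.1, -0.9, 0.1, 1.0, 2.0]  # almost a copy of x1

print(vif_two_predictors(x1, x2_uncorrelated))
print(vif_two_predictors(x1, x2_collinear))
```

The uncorrelated pair gives a VIF of 1, while the nearly collinear pair gives a VIF far above the rule-of-thumb cutoff of 10.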

#### Summary of assumptions
- LINEARITY (very important)
    - Linearity in terms of the regression coefficients, not the IVs themselves (we can fit a quadratic term without violating this assumption)
- Homogeneity of variance (fairly important)
- INDEPENDENCE (very important)
    - The subjects are independent from one another — each case is independent

- Normality (fairly important)
- Independent predictors (fairly important)

#### Takehome message
- Regression diagnostics are important
- We need to check whether the assumptions are satisfied and whether the analyses are correctly implemented.