# R Coding Concepts for Regression Analysis

## Theoretical Calculations

### Sum of Squares
In Regular Form 
- $SSto = \sum_{i=1}^n(Y_i-\bar{Y})^2$
    - $\frac{SSto}{\sigma^2} \sim \chi_{n-1}^2$ under $H_0:\beta_1=...=\beta_p=0$
- $SSE = \sum_{i=1}^n(Y_i-\hat{Y}_i)^2$
    - $\frac{SSE}{\sigma^2} \sim \chi_{n-p-1}^2$
- $SSR = \sum_{i=1}^n(\hat{Y}_i-\bar{Y})^2$
    - $\frac{SSR}{\sigma^2} \sim \chi_{p}^2$ under $H_0:\beta_1=...=\beta_p=0$
    
In Matrix Form
- $SSto = \underline{Y}^T\underline{Y} - n\bar{Y}^2 = \underline{Y}^T\underline{Y} - \frac{1}{n}\underline{Y}^T\underline{J}\underline{Y}$
- $SSE = \underline{Y}^T\underline{Y} - \underline{Y}^T\textbf{H}\underline{Y}$
- $SSR = \underline{Y}^T\textbf{H}\underline{Y} - n\bar{Y}^2 = \underline{Y}^T\textbf{H}\underline{Y} - \frac{1}{n}\underline{Y}^T\underline{J}\underline{Y}$
- $\underline{J}$ is an $n\times n$ matrix of 1s
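
The summation and matrix forms of the sums of squares can be compared numerically. This is a minimal sketch, assuming the built-in `mtcars` data and the model `mpg ~ wt + hp`, both chosen only for illustration.

```r
# illustrative model on built-in data (an assumption, not part of the notes)
fit <- lm(mpg ~ wt + hp, data = mtcars)
Y <- mtcars$mpg
n <- length(Y)
X <- model.matrix(fit)                    # n x (p+1) design matrix
H <- X %*% solve(t(X) %*% X) %*% t(X)     # hat matrix
J <- matrix(1, n, n)                      # n x n matrix of 1s

# matrix forms
SSto <- drop(t(Y) %*% Y - t(Y) %*% J %*% Y / n)
SSE  <- drop(t(Y) %*% Y - t(Y) %*% H %*% Y)
SSR  <- drop(t(Y) %*% H %*% Y - t(Y) %*% J %*% Y / n)

# summation forms for comparison
rbind(matrixForm = c(SSto = SSto, SSE = SSE, SSR = SSR),
      sumForm    = c(sum((Y - mean(Y))^2),
                     sum(resid(fit)^2),
                     sum((fitted(fit) - mean(Y))^2)))
```

The two rows should agree up to rounding, and `SSto` should equal `SSE + SSR`.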
    
### Matrix Form of Multiple Linear Regression
- $\underline{Y} = \textbf{X}\underline{\beta} + \underline{\epsilon}$
    - $\underline{Y}=n\times 1,\;\textbf{X}=n\times q,\;\underline{\beta}=q\times1,\;\underline{\epsilon}=n\times1,\;q=p+1$
    - $\underline{Y} \sim N_n(\textbf{X}\underline{\beta},\;\sigma^2\textbf{I}_n)$
    - $\underline{\epsilon} \sim N(\textbf{0},\;\sigma^2\textbf{I}_n)$
    - $\underline{\hat{\beta}} \sim N(\underline{\beta},\;\sigma^2(\textbf{X}^T\textbf{X})^{-1})$
- $\begin{equation}\begin{bmatrix}
Y_1 \\
Y_2\\
\vdots \\
Y_n
\end{bmatrix} = \begin{bmatrix}
1 & x_{11} & x_{12} & \dots & x_{1p} \\
1 & x_{21} & x_{22} & \dots & x_{2p} \\
\vdots & \vdots & \vdots & \ddots & \vdots\\
1 & x_{n1} & x_{n2} & \dots & x_{np} \\
\end{bmatrix} \times \begin{bmatrix}
\beta_0 \\
\beta_1\\
\vdots \\
\beta_p
\end{bmatrix} + \begin{bmatrix}
\epsilon_1 \\
\epsilon_2\\
\vdots \\
\epsilon_n
\end{bmatrix}\end{equation}$

#### Statistics Matrix Properties
- $E(\underline{a} + \textbf{A}\underline{Y}) = \underline{a} + \textbf{A}E(\underline{Y})$
- $var(\underline{a} + \textbf{A}\underline{Y}) = var(\textbf{A}\underline{Y}) = \textbf{A}var(\underline{Y})\textbf{A}^T$ 

#### Matrix Least Squares Estimates
- $\underline{\hat{\beta}}$ minimizes $Q = \sum_{i=1}^n(Y_i - \beta_0 - \sum_{j=1}^p\beta_jx_{ij})^2 = (\underline{Y}-\textbf{X}\underline{\beta})^T(\underline{Y}-\textbf{X}\underline{\beta})$
    - solution is $\textbf{X}^T\textbf{X}\underline{\hat{\beta}} = \textbf{X}^T\underline{Y}$
    - if $\textbf{X}^T\textbf{X}$ is invertible, estimates are $\underline{\hat{\beta}} = (\textbf{X}^T\textbf{X})^{-1}\textbf{X}^T\underline{Y}$
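
The normal equations can be solved directly and compared with `lm()`. This is a sketch that assumes the built-in `mtcars` data and an illustrative model.

```r
# illustrative model on built-in data (an assumption, not part of the notes)
fit <- lm(mpg ~ wt + hp, data = mtcars)
X <- model.matrix(fit)
Y <- mtcars$mpg
# solve X'X b = X'Y without forming the inverse explicitly
betaHat <- solve(crossprod(X), crossprod(X, Y))
cbind(normalEquations = drop(betaHat), lm = coef(fit))
```

`lm()` uses a QR decomposition internally rather than the normal equations, so the two columns agree up to rounding.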
    
#### Hat Matrix
- $\textbf{H} = \textbf{X}(\textbf{X}^T\textbf{X})^{-1}\textbf{X}^T$
    - $\textbf{H}$ is $n\times n$
    - symmetric: $\textbf{H}^T=\textbf{H}$
    - idempotent: $\textbf{H}^2 = \textbf{H}$
    - trace: $tr(\textbf{H}) = q,\;tr(\textbf{I}-\textbf{H}) = n-q$
        - $\textbf{I}_n-\textbf{H}$ is symmetric and idempotent
- residuals: $\underline{e} = \underline{Y} - \textbf{X}\underline{\hat{\beta}} = (\textbf{I}-\textbf{H})\underline{Y}$
    - $\underline{e} \sim N_n(\textbf{0},\;\sigma^2(\textbf{I}-\textbf{H}))$
- fitted values: $\underline{\hat{Y}} = \textbf{X}\underline{\hat{\beta}} = \textbf{H}\underline{Y}$
    - $\underline{\hat{Y}} \sim N_n(\textbf{X}\underline{\beta},\;\sigma^2\textbf{H})$
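
The hat matrix properties listed above can be checked numerically. This is a sketch that assumes the built-in `mtcars` data and an illustrative model.

```r
# illustrative model on built-in data (an assumption, not part of the notes)
fit <- lm(mpg ~ wt + hp, data = mtcars)
X <- model.matrix(fit)
n <- nrow(X)
q <- ncol(X)                              # q = p + 1
H <- X %*% solve(crossprod(X)) %*% t(X)

all.equal(H, t(H))                        # symmetric
all.equal(H %*% H, H)                     # idempotent
sum(diag(H))                              # trace of H equals q
sum(diag(diag(n) - H))                    # trace of I - H equals n - q
# H maps Y to the fitted values, and its diagonal holds the hat values
all.equal(drop(H %*% mtcars$mpg), fitted(fit), check.attributes = FALSE)
all.equal(diag(H), hatvalues(fit), check.attributes = FALSE)
```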
    
### Matrix Form for Polynomial Regression
- $\begin{equation}\begin{bmatrix}
Y_1 \\
Y_2\\
\vdots \\
Y_n
\end{bmatrix} = \begin{bmatrix}
1 & x_1 & x_1^2 & \dots & x_1^h \\
1 & x_2 & x_2^2 & \dots & x_2^h \\
\vdots & \vdots & \vdots & \ddots & \vdots\\
1 & x_n & x_n^2 & \dots & x_n^h \\
\end{bmatrix} \times \begin{bmatrix}
\beta_0 \\
\beta_1\\
\vdots \\
\beta_h
\end{bmatrix} + \begin{bmatrix}
\epsilon_1 \\
\epsilon_2\\
\vdots \\
\epsilon_n
\end{bmatrix}\end{equation}$
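
The polynomial design matrix above can be built by hand and compared with the one R constructs. This is a sketch that assumes the built-in `mtcars` data, with `wt` as the predictor and degree `h = 3` chosen for illustration.

```r
# illustrative predictor and degree (assumptions, not part of the notes)
x <- mtcars$wt
h <- 3
# columns are x^0, x^1, ..., x^h
Xpoly <- outer(x, 0:h, "^")
# design matrix R builds from the equivalent formula
Xlm <- model.matrix(~ x + I(x^2) + I(x^3))
all.equal(Xpoly, Xlm, check.attributes = FALSE)

# poly(..., raw = TRUE) gives the same fit as writing out the I() terms
cubicByHand <- lm(mpg ~ wt + I(wt^2) + I(wt^3), data = mtcars)
cubicByPoly <- lm(mpg ~ poly(wt, 3, raw = TRUE), data = mtcars)
all.equal(coef(cubicByHand), coef(cubicByPoly), check.attributes = FALSE)
```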

### Non-Constant Variance
- $\underline{Y} = \textbf{X}\underline{\beta} + \underline{\epsilon}$ where $\underline{\epsilon} \sim N(\textbf{0},\sigma^2\textbf{W}^{-1})$
    - $\textbf{W} = diag(w_1,...,w_n)$
    - $\sigma^2\textbf{W}^{-1} = \begin{bmatrix}
    \frac{\sigma^2}{w_1} & 0 & \dots & 0 \\
    0 & \frac{\sigma^2}{w_2} & \dots & 0 \\
    \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & \dots & \frac{\sigma^2}{w_n} \\ \end{bmatrix}
    = \begin{bmatrix}
    \sigma_1^2 & 0 & \dots & 0 \\
    0 & \sigma_2^2 & \dots & 0 \\
    \vdots & \vdots & \ddots & \vdots \\
    0 & 0 & \dots & \sigma_n^2 \\
    \end{bmatrix}$
    
Weighted Least Squares Estimates
- $\underline{\hat{\beta}}_{WLS}$ is an unbiased estimator that minimizes $Q_w(\underline{\beta}) = \sum_{i=1}^nw_i(Y_i - \beta_0 - \sum_{j=1}^p\beta_jx_{ij})^2$
- $\underline{\hat{\beta}}_{WLS} = (\textbf{X}^T\textbf{W}\textbf{X})^{-1}\textbf{X}^T\textbf{W}\underline{Y}$
- $var(\underline{\hat{\beta}}_{WLS}) = \sigma^2(\textbf{X}^T\textbf{W}\textbf{X})^{-1}$
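
The weighted least squares formulas can be compared with `lm(..., weights=)`. This is a sketch that assumes the built-in `mtcars` data; the weights `1/hp` are a made-up choice for illustration only.

```r
# hypothetical weights: variance assumed proportional to hp
w <- 1 / mtcars$hp
X <- cbind(1, mtcars$wt, mtcars$hp)
Y <- mtcars$mpg
W <- diag(w)

# closed-form WLS estimate (X'WX)^{-1} X'WY
betaWLS <- solve(t(X) %*% W %*% X, t(X) %*% W %*% Y)
wlsFit <- lm(mpg ~ wt + hp, data = mtcars, weights = w)
cbind(formula = drop(betaWLS), lm = coef(wlsFit))

# estimated variance: sigma^2 replaced by the weighted mean squared error
sigma2Hat <- sum(w * resid(wlsFit)^2) / wlsFit$df.residual
varWLS <- sigma2Hat * solve(t(X) %*% W %*% X)
all.equal(varWLS, vcov(wlsFit), check.attributes = FALSE)
```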

## Linear Regression
### Examine Full Linear Model
#### Create Model
$E(Y_i)=\beta_0+\beta_1x_{i1}+...+\beta_px_{ip}$

```
# create full linear model
fullLM <- lm(yVar ~ ., data=data)
# summary of the full linear model
summary(fullLM)
```

#### Plots About Predictor Correlation
Scatterplot Matrix
- look at linear and nonlinear relationships between variables
- look at whether relationships are positive or negative <br>

Added Variable Plot
- slope close to 0: predictor contains no additional information for predicting the response
- points follow a linear trend with nonzero slope: predictor contains useful information for predicting the response
- points follow a nonlinear trend: predictor contains useful information for predicting the response, but some assumptions are not met

```
library(car)
# scatterplot matrix
    # method 1 (base R)
pairs(~ yVar + xVar1 + xVar2 + xVar3, data=data)
    # method 2 (car package)
scatterplotMatrix(~ yVar + xVar1 + xVar2 + xVar3, data=data)
# added variable plots (car package)
avPlots(fullLM)
```

### Check Assumptions
Assumptions
- the mean response E($Y_i$) is a linear function of $x$
- errors are independent
- errors are normally distributed
- errors have a constant variance

#### Fitted Values and Residuals
- $\hat{Y}_i = \hat{E}(Y_i) = \hat{\beta}_0 + \hat{\beta}_1x_{i1} + ... + \hat{\beta}_px_{ip}$ 
- $e_i = Y_i - \hat{Y}_i$
    - $\sum_{i=1}^ne_i=0$

```
# fit is a fitted lm object, for example fullLM
# fitted values
fittedValues <- fit$fitted.values
# residuals
resid <- fit$residuals
```

#### Scatterplot of Residuals vs Fitted Values
- linearity: residuals scattered around e=0
- constant variance: residuals form a horizontal band around e=0
- no outliers: no residuals stand out from the basic pattern

```
# scatterplot of residuals vs fitted
    # method 1
plot(fittedValues, resid, xlab='Fitted Values', ylab='Residuals', main='Residuals vs Fitted Values')
abline(h=0, col='red')
    # method 2
plot(fit, which = 1)
```

#### Normal Probability Plot
- normality: Q-Q plot is linear

```
# normal probability plot
    # method 1
qqnorm(resid)
qqline(resid)
    # method 2
plot(fit, which = 2)
# nonconstant variance test (car package), fail to reject constant variance if p value greater than 0.05
library(car)
ncvTest(fullLM)
```

### Transform Data
- $Y = a + bx^\alpha$, let $\tilde{x} = x^\alpha$ and $Y = a + b\tilde{x}$
- $Y^\beta = a + bx$, let $\tilde{Y} = Y^\beta$ and $\tilde{Y} = a + bx$

#### Invariance Transformation (for single predictor)
- $\begin{equation}x_i^{(\lambda)}=\begin{cases}\frac{x_i^\lambda-1}{\lambda};\;\lambda \neq 0 \\ln(x_i);\;\lambda=0 \end{cases}\end{equation}$
- $Y_i = \beta_0 + \beta_1x_{i1}^{(\lambda)} + ... + \beta_px_{ip}^{(\lambda)} + \epsilon_i$

```
# invariance transformation
library(car)
invTranPlot(yVar ~ xVar, data=data, lambda = c(-1,0,1), optimal = FALSE)
```

#### Power Transform (for multiple predictors)

```
# power transformation
library(car)
predTr <- powerTransform(cbind(xVar1, xVar2, xVar3) ~ 1, data=data)
summary(predTr)
```

#### Box-Cox Transformation (for response)
- $\begin{equation}Y_i^{(\lambda)}=\begin{cases}\frac{Y_i^\lambda-1}{\lambda};\;\lambda \neq 0 \\ln(Y_i);\;\lambda=0 \end{cases}\end{equation}$
- $Y_i^{(\lambda)} = \beta_0 + \beta_1x_{i1} + ... + \beta_px_{ip} + \epsilon_i$
- choose a round number for $\hat{\lambda}$ like $-1, 0, 1, 0.5, ...$ since round numbers are easier to transform back to original units
    - $\hat{\lambda} = 1$ means no transformation is needed to stabilize the variance
    - $\hat{\lambda} = 0$ means a log transformation is needed to stabilize the variance
- try several transformations suggested by the confidence interval

```
# box cox transformation, fit is a fitted lm object
    # method 1
library(MASS)
bc <- boxcox(fit, lambda=seq(-1,1,by=.1))
bestLambda <- bc$x[which.max(bc$y)]
    # method 2
library(car)
bc <- boxCox(fit, lambda=seq(-1,1,by=.1))
bestLambda <- bc$x[which.max(bc$y)]
```

### Identify Problematic Datapoints
- Outlier: data point whose response does not follow the trend of the rest of the data.
- High Leverage: data point with extreme predictor values.
    - For multiple predictors, can be extreme for one or more predictors, or unusual combinations of predictors.
- Influential Point: data point that has too much influence on some part of the regression analysis.

#### Outliers
- $t_i$ are studentized residuals
    - $t_i = \frac{e_i}{\hat{\sigma}(i)\sqrt{1-h_{ii}}} = r_i(\frac{n-p-2}{n-p-1-r_i^2})^{1/2}$
        - $r_i = \frac{e_i}{\hat{\sigma}\sqrt{1-h_{ii}}}$ are standardized residuals
    - $\hat{\sigma}(i)$ comes from the fit that omits the $i^{th}$ element of Y and the $i^{th}$ row of x
- if $|t_i| > 2$, $Y_i$ is a potential outlier

```
# find the row with the maximum magnitude
sResid <- rstudent(fit)
which.max(abs(sResid))
# test for outliers (car package), gives largest outlier and its studentized residual
library(car)
outlierTest(fit)
```
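
The standardized and studentized residual formulas can be reproduced by hand and compared with `rstandard()` and `rstudent()`. This is a sketch that assumes the built-in `mtcars` data and an illustrative model.

```r
# illustrative model on built-in data (an assumption, not part of the notes)
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
p <- length(coef(fit)) - 1
e <- resid(fit)
h <- hatvalues(fit)
sigmaHat <- summary(fit)$sigma

# standardized residuals r_i
rStd <- e / (sigmaHat * sqrt(1 - h))
# studentized residuals t_i, obtained from r_i without refitting n times
tStud <- rStd * sqrt((n - p - 2) / (n - p - 1 - rStd^2))

all.equal(rStd, rstandard(fit))
all.equal(tStud, rstudent(fit))
# rows flagged by the |t_i| > 2 rule of thumb
names(which(abs(tStud) > 2))
```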

#### Influential Points
- if Cook's distance $D_i>\frac{4}{n-p-1}$, the $i^{th}$ data point is an influential point
    - $D_i = \frac{(\underline{\hat{Y}}-\underline{\hat{Y}}(i))^T(\underline{\hat{Y}}-\underline{\hat{Y}}(i))}{\hat{\sigma}^2(p+1)} = \frac{r_i^2\times h_{ii}}{(p+1)(1-h_{ii})}$

```
n <- nrow(data)
# number of predictors in the fitted model
p <- length(coef(fit)) - 1
# plot Cook's distance against rows to see if there is an influential point
plot(fit, which=4)
# get values for cooks distance
cooks <- cooks.distance(fit)
which(cooks > 4/(n-p-1))
```

#### Leverage Points
- if hat value $h_{ii}>\frac{2(p+1)}{n}$ (or, with a stricter cutoff, $h_{ii}>\frac{3(p+1)}{n}$), the $i^{th}$ data point is a high leverage point
    - $h_{ii} \in [0,1]$ and $\sum_{i=1}^nh_{ii}=p+1$
    - leverage in simple linear regression is $h_{ii} = \frac{1}{nSxx}\sum_{j=1}^n(x_j-x_i)^2$

```
n <- nrow(data)
# find hat values, sum(hats) equals p+1
hats <- hatvalues(fit)
which(hats > 2*sum(hats)/n)
which(hats > 3*sum(hats)/n)
```
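
The Cook's distance and hat value formulas above can be checked against `cooks.distance()` and `hatvalues()`. This is a sketch that assumes the built-in `mtcars` data and illustrative models.

```r
# illustrative model on built-in data (an assumption, not part of the notes)
fit <- lm(mpg ~ wt + hp, data = mtcars)
p <- length(coef(fit)) - 1
r <- rstandard(fit)                       # standardized residuals
h <- hatvalues(fit)

# Cook's distance from standardized residuals and hat values
cooksByHand <- r^2 * h / ((p + 1) * (1 - h))
all.equal(cooksByHand, cooks.distance(fit))
sum(h)                                    # equals p + 1

# leverage in simple linear regression
x <- mtcars$wt
n <- length(x)
Sxx <- sum((x - mean(x))^2)
hSLR <- sapply(x, function(xi) sum((x - xi)^2)) / (n * Sxx)
slrFit <- lm(mpg ~ wt, data = mtcars)
all.equal(hSLR, hatvalues(slrFit), check.attributes = FALSE)
```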

#### Impact of Points
```
library(car)
# plot for hat values and cook's distance
influenceIndexPlot(fit, vars=c('hat', 'Cook'), id=TRUE)
# changes in the beta estimates obtained from dropping each data point
betaHatNoti <- influence(fit)$coefficients
panelFun <- function(x, y, ...){
    points(x, y, ...)
    dataEllipse(x, y, plot.points=FALSE, levels=c(.90))
    showLabels(x, y, labels=rownames(data), method='mahal', n=numberOfPointsExamined)
}
# pairwise scatterplots with all the points labeled
pairs(betaHatNoti, panel=panelFun)
```

#### Remove Point
```
# remove outlier
dataWithoutOutlier <- data[-rowWithOutlier,]
# new linear model without outlier
lmWithoutOutlier <- lm(yVar ~ ., data=dataWithoutOutlier)
```

### Predictor Selection

#### Stepwise Regression
- add or take away predictors until lowest AIC or BIC achieved
- forward and backward selection can produce different models

Information Criterion
- evaluates models based on goodness of fit and complexity
- lower values indicate a better model

Akaike's Information Criterion
- $AIC = nln(\frac{SSE}{n}) + 2(p+1)$ <br>

Bayesian Information Criterion
- $BIC = nln(\frac{SSE}{n}) + ln(n)(p+1)$
- places higher penalty on number of predictors than AIC

```
# parameters
modEmpty <- lm(yVar ~ 1, data=data)
modFull <- lm(yVar ~ ., data=data)
n <- nrow(data)
# forward selection with AIC
forwardAIC <- step(modEmpty, scope=list(lower=modEmpty, upper=modFull), direction='forward')
# forward selection with BIC
forwardBIC <- step(modEmpty, scope=list(lower=modEmpty, upper=modFull), direction='forward', k=log(n), trace=0)
# backward selection with AIC
backwardAIC <- step(modFull, scope=list(lower=modEmpty, upper=modFull), direction='backward')
# backward selection with BIC
backwardBIC <- step(modFull, scope=list(lower=modEmpty, upper=modFull), direction='backward', k=log(n), trace=0)
```
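
The AIC and BIC formulas above are the ones `step()` works with through `extractAIC()`. This is a sketch that assumes the built-in `mtcars` data and an illustrative model.

```r
# illustrative model on built-in data (an assumption, not part of the notes)
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
p <- length(coef(fit)) - 1
SSE <- sum(resid(fit)^2)

aicByHand <- n * log(SSE / n) + 2 * (p + 1)
bicByHand <- n * log(SSE / n) + log(n) * (p + 1)

# extractAIC returns c(edf, criterion); k sets the penalty per parameter
c(byHand = aicByHand, extractAIC = extractAIC(fit)[2])
c(byHand = bicByHand, extractAIC = extractAIC(fit, k = log(n))[2])
```

`AIC()` and `BIC()` from `stats` use the full log-likelihood and so differ from these values by an amount that is the same for every model fitted to the same data, which leaves the model ranking unchanged.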

#### Regression Subsets
R Squared
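- $R^2=\frac{SSR}{SSto} = 1-\frac{SSE}{SSto}$
- $100 \times R^2$% of the variability of Y is explained by its linear relationship with predictors
- $R^2$ never decreases when a predictor is added, so look for the model size after which adding predictors no longer gives a large increase in R squared <br>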

Adjusted R Squared
- $adjusted\;R^2=1-\frac{n-1}{n-p-1}(1-R^2) $
- choose model with the largest adjusted R squared <br>

Mallows Cp Statistic
- $Cp = \frac{SSE_p}{MSE_{full}} + 2q - n$
- choose simplest model with smallest Cp near q=p+1

```
library(leaps)
# get regression subsets object
modReg <- regsubsets(yVar ~ xVar1 + xVar2 + xVar3, data=data)
sumReg <- summary(modReg)
# output best model for 1 to 3 predictors
sumReg$which
# output R squared
sumReg$rsq
# output adjusted R squared
sumReg$adjr2
# output Mallow's Cp statistic
sumReg$cp
# output BIC
sumReg$bic
```
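
The three criteria above can also be computed by hand in base R. This is a sketch that assumes the built-in `mtcars` data, a full model with three predictors, and a two-predictor submodel, all chosen for illustration.

```r
# illustrative full model and submodel (assumptions, not part of the notes)
fullFit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
subFit  <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
p <- length(coef(subFit)) - 1             # predictors in the submodel
q <- p + 1

SSto <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
SSEp <- sum(resid(subFit)^2)
mseFull <- sum(resid(fullFit)^2) / fullFit$df.residual

r2    <- 1 - SSEp / SSto
adjR2 <- 1 - (n - 1) / (n - p - 1) * (1 - r2)
cp    <- SSEp / mseFull + 2 * q - n
c(r2 = r2, adjR2 = adjR2, cp = cp)
```

`r2` and `adjR2` should match `summary(subFit)$r.squared` and `summary(subFit)$adj.r.squared`.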

### Hypothesis Tests for Multiple Linear Regression
#### Global F Test
Are there significant parameters in the model?
- $H_0:\beta_1=...=\beta_p=0$ vs $H_a:$ at least one $\beta_j \neq 0$ for $j=1,...,p$
    - fail to reject $H_0$ (no evidence that any parameter is significant) if p value > 0.05
- test statistic: $F = \frac{MSR}{MSE}$ and under the null hypothesis $F\sim F_{p,\;n-p-1}$

```
# global F test
    # method 1
# look at the F-statistic to see global F test statistic and p value
sumGlobal <- summary(fullLM)
    # method 2
emptyLM <- lm(yVar ~ 1, data=data)
anovaGlobal <- anova(emptyLM, fullLM)
anovaGlobal$'Pr(>F)'[2]
```
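
The global F statistic can be computed from the sums of squares and compared with `summary()`. This is a sketch that assumes the built-in `mtcars` data and an illustrative model.

```r
# illustrative model on built-in data (an assumption, not part of the notes)
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
p <- length(coef(fit)) - 1
SSE <- sum(resid(fit)^2)
SSR <- sum((fitted(fit) - mean(mtcars$mpg))^2)

# F = MSR / MSE with p and n-p-1 degrees of freedom
Fobs <- (SSR / p) / (SSE / (n - p - 1))
pValue <- pf(Fobs, p, n - p - 1, lower.tail = FALSE)
c(byHand = Fobs, summary = unname(summary(fit)$fstatistic["value"]))
pValue
```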

#### Partial F Test for One Parameter
- $(t_n)^2 = F_{1,n}$ so the t and F tests are equivalent

Is the third parameter significant?
- since it is one parameter, F and t tests yield the same result
- $H_0: \beta_3=0$ vs $H_a: \beta_3\neq 0$
    - $\beta_3$ is insignificant if p value > 0.05
- test statistic: $T_3=\frac{\hat\beta_3}{SE(\hat\beta_3)}$ and under the null hypothesis $T_3\sim t_{n-p-1}$
    - $SE(\hat\beta_k) = \sqrt{MSE[(\textbf{X}^T\textbf{X})^{-1}]_{kk}}$
    
```
# partial F test for one parameter
    # method 1
# look at the Pr(>|t|) column to see the t test (equivalent to the partial F test) for each Bk
sumFull <- summary(fullLM)
    # method 2
redB3LM <- lm(yVar ~ xVar1 + xVar2, data=data)
anovaTestB3 <- anova(redB3LM, fullLM)
anovaTestB3$'Pr(>F)'[2]
```

#### Partial F Test for Multiple Parameters
Are the second and/or third parameters significant? 
- $H_0: \beta_2=\beta_3=0$ vs $H_a:$ at least one $\beta_j\neq 0$ for $j=2,3$
    - fail to reject $H_0$ (no evidence that $\beta_2$ or $\beta_3$ is significant) if p value > 0.05
- test statistic: $F=\frac{\frac{SSE_{red}-SSE_{full}}{df_{red}-df_{full}}}{\frac{SSE_{full}}{df_{full}}}$ and under the null hypothesis $F \sim F_{df_{red}-df_{full},\;n-p-1}$, where $df_{red}-df_{full}$ is the number of parameters tested

```
# partial F test for multiple parameters
redB2B3LM <- lm(yVar ~ xVar1, data=data)
anovaTestB2B3 <- anova(redB2B3LM, fullLM)
anovaTestB2B3$'Pr(>F)'[2]
```
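
The partial F statistic can be computed from the two error sums of squares and compared with `anova()`. This is a sketch that assumes the built-in `mtcars` data and illustrative full and reduced models.

```r
# illustrative models; H0 is that the hp and qsec coefficients are both 0
fullFit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
redFit  <- lm(mpg ~ wt, data = mtcars)

sseFull <- sum(resid(fullFit)^2); dfFull <- fullFit$df.residual
sseRed  <- sum(resid(redFit)^2);  dfRed  <- redFit$df.residual
Fobs <- ((sseRed - sseFull) / (dfRed - dfFull)) / (sseFull / dfFull)
pValue <- pf(Fobs, dfRed - dfFull, dfFull, lower.tail = FALSE)

tab <- anova(redFit, fullFit)
c(byHand = Fobs, anova = tab$F[2])
c(byHand = pValue, anova = tab$`Pr(>F)`[2])

# one-parameter case: the squared t statistic equals the partial F statistic
tQsec <- summary(fullFit)$coefficients["qsec", "t value"]
Fqsec <- anova(lm(mpg ~ wt + hp, data = mtcars), fullFit)$F[2]
all.equal(tQsec^2, Fqsec)
```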

## Final Model 

### ANOVA Table for Multiple Linear Regression
|Sources of Variation|Sum of Squares|Degrees of Freedom|Mean Square Value|F Observed|
|----|----|----|----|----|
|Regression|$SSR$|$p$|$MSR = \frac{SSR}{p}$|$F=\frac{MSR}{MSE}$|
|Error|$SSE$|$n-p-1$|$MSE = \frac{SSE}{n-p-1}$| - |
|Total|$SSto$|$n-1$| - | - |

### Sequential Sum of Squares
|Sources of Variation|Sum of Squares Value|Degrees of Freedom|Mean Square Value|F Observed Value|Null Hypothesis|
|----|----|----|----|----|----|
|$x_1$|$SSR(x_1)$|$1$|$MSR(x_1)$|$F=\frac{MSR(x_1)}{MSE}$|$\beta_1$ = 0|
|$x_2\mid x_1$|$SSR(x_2\mid x_1)$|$1$|$MSR(x_2\mid x_1)$|$F=\frac{MSR(x_2\mid x_1)}{MSE}$|$\beta_2$ = 0|
|$x_3\mid x_2,\;x_1$|$SSR(x_3\mid x_2,x_1)$|$1$|$MSR(x_3\mid x_2,x_1)$|$F=\frac{MSR(x_3\mid x_2,x_1)}{MSE}$|$\beta_3$ = 0|
|Error|$SSE(x_1,x_2,x_3)$|$n-p-1$|$MSE$| - | - |
|Total|$SSto$|$n-1$| - | - | - |

- the increase in regression sum of squares when one or more predictors are added to the model
    - $SSR(x_2|x_1)$: increase in sum of squares from adding $x_2$ to a model containing $x_1$
        - $SSR(x_2|x_1) = SSR(x_2,x_1) - SSR(x_1)$
    - $SSR(x_3,x_2|x_1)$: increase in sum of squares from adding $x_3$ and $x_2$ to model containing $x_1$
    - $SSR(x_3|x_2,x_1)$: increase in sum of squares from adding $x_3$ to model containing $x_2$ and $x_1$
- regression sum of squares
    - $SSR(x_1)$ is the regression sum of squares when $x_1$ is in the model
    - $SSR(x_1,x_2,x_3)$ is the regression sum of squares when $x_1$, $x_2$, and $x_3$ are in the model
- calculations with sequential sums of squares
    - $SSR(x_1,x_2,x_3)=SSR(x_1)+SSR(x_2|x_1)+SSR(x_3|x_2,x_1)$
    - $SSto = SSR(x_1,x_2,x_3) + SSE(x_1,x_2,x_3) = SSR(x_1,x_2) + SSE(x_1,x_2) = SSR(x_1) + SSE(x_1)$
    
```
# sequential sum of squares
model <- lm(yVar ~ x1Var + x2Var + x3Var, data=data)
# anova() outputs x1, x2|x1, and x3|x2, x1 (summary() gives t tests instead)
anova(model)
```
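
The sequential sums of squares reported by `anova()` can be rebuilt from nested fits. This is a sketch that assumes the built-in `mtcars` data, with `wt`, `hp`, and `qsec` standing in for $x_1$, $x_2$, and $x_3$.

```r
# nested illustrative models (assumptions, not part of the notes)
m1   <- lm(mpg ~ wt, data = mtcars)
m12  <- lm(mpg ~ wt + hp, data = mtcars)
m123 <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# regression sum of squares of a fitted model
ssr <- function(m) sum((fitted(m) - mean(mtcars$mpg))^2)

# SSR(x1), SSR(x2|x1), SSR(x3|x2,x1)
seqByHand <- c(ssr(m1), ssr(m12) - ssr(m1), ssr(m123) - ssr(m12))
seqTable  <- anova(m123)$`Sum Sq`[1:3]
cbind(seqByHand, seqTable)

# the sequential pieces add up to SSR(x1,x2,x3)
all.equal(sum(seqByHand), ssr(m123))
```

Changing the order of the predictors in the formula changes the individual rows but not their total.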

### Least Squares Estimates for Coefficients
- estimates minimize $Q=\sum_{i=1}^n(Y_i-\beta_0-\beta_1x_{i1}-...-\beta_px_{ip})^2$
    - take partial derivatives with respect to each $\beta$
    - set each partial derivative equal to 0 and solve to find the critical point
    - check the second derivative is greater than 0 to ensure it's a minimum
- confidence interval: $\hat\beta_k \pm t_{\alpha/2,\;n-p-1}SE(\hat\beta_k)$
    - $\beta_0$ is the mean response $E(Y_i)$ when $x_{ij}=0$ for $j = 1,...,p$
    - if $x_{ij}$ increases by 1 unit, the mean response $E(Y_i)$ changes by $\beta_j$ when all other predictors are held constant

```
    # method 1
# output coefficients
(finalModelSum <- summary(finalModel))
# confidence interval for coefficients
confint(finalModel, level=0.95)
    # method 2
# output coefficients
finalModel$coefficients
# standard error for coefficients
finalModelSum$coefficients[,"Std. Error"]
```

### New Responses
- $(\underline{x}_0, Y_0)$ is a new response where $\underline{x}_0^T=[x_{01},...,x_{0p}]$ is fixed
    - assume $Y_0 = \beta_0 + \beta_1x_{01} + ... + \beta_px_{0p} + \epsilon_0$
    - $Y_0 \sim N(\beta_0 + \beta_1x_{01} + ... + \beta_px_{0p}, \sigma^2)$ and $Y_0$ independent of $Y_1,...,Y_n$
    - $\epsilon_0 \sim N(0, \sigma^2)$ and $\epsilon_0$ independent of $\epsilon_1,...,\epsilon_n$
- $\hat{E}(Y_0) = \hat{Y}_0 = \hat{\beta}_0 + \sum_{j=1}^p\hat{\beta}_jx_{0j}$

#### Confidence Interval for Mean Response
- $\hat{Y}_0 \pm t_{\alpha/2,\;n-p-1}SE(\hat{Y}_0)$
- $SE(\hat{Y}_0)=\sqrt{MSE[1\;\underline{x_0^T}](\textbf{X}^T\textbf{X})^{-1}\begin{bmatrix}1\\\underline{x_0}\end{bmatrix}}$ 
- if the response is log transformed so $\tilde{Y}_0 = ln(Y_0)$, the back-transformed estimate $e^{\hat{E}(\tilde{Y}_0)}$ cannot be used to estimate the mean of $Y_0$
    - $e^{\hat{E}(\tilde{Y}_0)}$ is used to estimate the median of $Y_0$
    - therefore, interpret the back-transformed confidence interval in terms of the median response instead of mean response

```
# mean response
newData <- data.frame(xVar=value)
predict(finalModel, newdata=newData, interval='confidence', level=0.95)
```

#### Prediction Interval for New Response
- $\hat{Y}_0 \pm t_{\alpha/2,\;n-p-1}SPE(\hat{Y}_0)$
- $SPE(\hat{Y}_0)=\sqrt{MSE(1+[1\;\underline{x_0^T}](\textbf{X}^T\textbf{X})^{-1}\begin{bmatrix}1\\\underline{x_0}\end{bmatrix})}$ 

```
# new response
newData <- data.frame(xVar=value)
predict(finalModel, newdata=newData, interval='prediction', level=0.95)
```
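
Both intervals can be computed from the $SE$ and $SPE$ formulas and compared with `predict()`. This is a sketch that assumes the built-in `mtcars` data, an illustrative model, and a hypothetical new point `wt = 3, hp = 120`.

```r
# illustrative model and hypothetical new point x0 (assumptions)
fit <- lm(mpg ~ wt + hp, data = mtcars)
newData <- data.frame(wt = 3, hp = 120)
x0 <- c(1, newData$wt, newData$hp)        # leading 1 for the intercept

X <- model.matrix(fit)
mse <- sum(resid(fit)^2) / fit$df.residual
yHat0 <- sum(x0 * coef(fit))
quadForm <- drop(t(x0) %*% solve(crossprod(X)) %*% x0)

se  <- sqrt(mse * quadForm)               # SE of the estimated mean response
spe <- sqrt(mse * (1 + quadForm))         # SPE for a new response
tCrit <- qt(0.975, fit$df.residual)

ciByHand <- yHat0 + c(-1, 1) * tCrit * se
piByHand <- yHat0 + c(-1, 1) * tCrit * spe
ciPredict <- predict(fit, newData, interval = "confidence", level = 0.95)
piPredict <- predict(fit, newData, interval = "prediction", level = 0.95)
rbind(ciByHand, ciPredict[1, c("lwr", "upr")])
rbind(piByHand, piPredict[1, c("lwr", "upr")])
```

The prediction interval is always wider because it adds the variance of the new error $\epsilon_0$ to the variance of the estimated mean.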

#### Coefficient Interpretation
- $E(Y_i) = \beta_0 + \beta_1x_{i1} + ... + \beta_kx_{ik} + ... + \beta_px_{ip}$
    - increase $x_{0k}$ by $1$ unit so $\underline{x}_0^* = [x_{01} ... x_{0k}+1 ... x_{0p}]^T$
        - $E(Y_0)$ changes by $\beta_k$ 
        - $E(Y_0^*) = \beta_0 + \beta_1x_{01} + ... + \beta_k(x_{0k}+1) + ... + \beta_px_{0p} = E(Y_0) + \beta_k$
- $E(Y_i) = \beta_0 + \beta_1x_{i1} + ... + \beta_kln(x_{ik}) + ... + \beta_px_{ip}$
    - change $x_{0k}$ by $100\times p$% where $p \in (-1,1)$ so $\underline{x}_0^* = [x_{01} ... (1+p)x_{0k} ... x_{0p}]^T$
        - $E(Y_0)$ changes by $\beta_kln(1+p)$ 
        - $E(Y_0^*) = \beta_0 + \beta_1x_{01} + ... + \beta_kln((1+p)x_{0k}) + ... + \beta_px_{0p} = E(Y_0) + \beta_kln(1+p)$
    - 10 fold increase in $x_{0k}$ so $\underline{x}_0^* = [x_{01} ... 10x_{0k} ... x_{0p}]^T$
        - $E(Y_0)$ changes by $\beta_kln(10)$ 
        - $E(Y_0^*) = \beta_0 + \beta_1x_{01} + ... + \beta_kln(10x_{0k}) + ... + \beta_px_{0p} = E(Y_0) + \beta_kln(10)$
- $E(ln(Y_i)) = \beta_0 + \beta_1x_{i1} + ... + \beta_kx_{ik} + ... + \beta_px_{ip}$
    - increase $x_{0k}$ by $1$ unit so $\underline{x}_0^* = [x_{01} ... x_{0k}+1 ... x_{0p}]^T$
        - median of $Y_0$ changes by a factor of $e^{\beta_k}$ 
        - $E(ln(Y_0^*)) = \beta_0 + \beta_1x_{01} + ... + \beta_k(x_{0k}+1) + ... + \beta_px_{0p} = E(ln(Y_0)) + \beta_k$
        - $E(ln(Y_0^*)) \rightarrow ln(median(Y_0^*))$ and $E(ln(Y_0)) \rightarrow ln(median(Y_0))$
        - then $ln(median(Y_0^*)) = ln(median(Y_0)) + \beta_k \rightarrow \beta_k = ln(\frac{median(Y_0^*)}{median(Y_0)}) \rightarrow median(Y_0^*) = e^{\beta_k}\times median(Y_0)$
- $E(ln(Y_i)) = \beta_0 + \beta_1x_{i1} + ... + \beta_kln(x_{ik}) + ... + \beta_px_{ip}$
    - change $x_{0k}$ by $100\times p$% where $p \in (-1,1)$ so $\underline{x}_0^* = [x_{01} ... (1+p)x_{0k} ... x_{0p}]^T$
        - median of $Y_0$ changes by a factor of $(1+p)^{\beta_k}$
        - $E(ln(Y_0^*)) = \beta_0 + \beta_1x_{01} + ... + \beta_kln((1+p)x_{0k}) + ... + \beta_px_{0p} = E(ln(Y_0)) + \beta_kln(1+p)$
        - $E(ln(Y_0^*)) \rightarrow ln(median(Y_0^*))$ and $E(ln(Y_0)) \rightarrow ln(median(Y_0))$
        - then $ln(median(Y_0^*)) = ln(median(Y_0)) + \beta_kln(1+p) \rightarrow \beta_kln(1+p) = ln(\frac{median(Y_0^*)}{median(Y_0)}) \rightarrow median(Y_0^*) = (1+p)^{\beta_k}\times median(Y_0)$
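
The multiplicative interpretations for a log response can be checked with fitted models. This is a sketch that assumes the built-in `mtcars` data and a hypothetical point `wt = 3, hp = 120`; a 10% increase multiplies the predictor by 1.10.

```r
# log response, untransformed predictor (illustrative model)
logFit <- lm(log(mpg) ~ wt + hp, data = mtcars)
x0     <- data.frame(wt = 3, hp = 120)    # hypothetical x0
x0Star <- data.frame(wt = 4, hp = 120)    # wt increased by 1 unit
medianRatio <- exp(predict(logFit, x0Star)) / exp(predict(logFit, x0))
c(ratio = unname(medianRatio), expBeta = unname(exp(coef(logFit)["wt"])))

# log response, log predictor: 10% increase in wt
logLogFit <- lm(log(mpg) ~ log(wt) + hp, data = mtcars)
x0Star10 <- data.frame(wt = 1.10 * 3, hp = 120)
ratio10 <- exp(predict(logLogFit, x0Star10)) / exp(predict(logLogFit, x0))
c(ratio = unname(ratio10), powerBeta = unname(1.10^coef(logLogFit)["log(wt)"]))
```

In both cases the ratio of estimated medians does not depend on the starting point `x0`, only on the coefficient and the size of the change.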

## Polynomial Regression

### Polynomial Regression Models
#### Single Predictor
$E(Y_i)=\beta_0+\beta_1x_i+\beta_2x_i^2+...+\beta_hx_i^h$

```
# linear model with single predictor
linearLM <- lm(yVar ~ xVar)
# quadratic model with single predictor
quadraticLM <- lm(yVar ~ xVar + I(xVar^2))
# cubic model with single predictor
cubicLM <- lm(yVar ~ xVar + I(xVar^2) + I(xVar^3))
```

#### Multiple Predictors
$E(Y_i)=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\beta_{11}x_{i1}^2+\beta_{22}x_{i2}^2+\beta_{12}x_{i1}x_{i2}$
```
# quadratic model with multiple predictors
multQuadLM <- lm(yVar ~ xVar1 + I(xVar1^2) + xVar2 + I(xVar2^2) + xVar1:xVar2)
# cubic model with multiple predictors
multCubeLM <- lm(yVar ~ xVar1+ I(xVar1^2) + I(xVar1^3) + xVar2 + I(xVar2^2) + I(xVar2^3) + xVar1:xVar2 + 
                I(xVar1^2):xVar2 + I(xVar2^2):xVar1)
```

#### Hierarchy Principle
- if $x^h$ is in the model, then the model must include all $x^j$ for $0 \leq j \leq h$, whether or not lower order terms are significant

### Polynomial Regression Hypothesis Tests
#### Single Predictor
Can the degree of the model be reduced?
- $H_0: \beta_3=0$ vs $H_a:\beta_3\neq0$

```
# single predictor polynomial hypothesis test
cubicNeededTest <- anova(quadraticLM, cubicLM)
```

#### Multiple Predictors
Does the first predictor have an effect?
- $H_0: \beta_1=\beta_{11}=\beta_{12}=0$ vs $H_a:$ at least one $\beta_j\neq0$ for $j=1,\;11,\;12$

```
# reduced model
secondPredOnlyLM <- lm(yVar ~ xVar2 + I(xVar2^2))
# test for effect of first predictor
anova(secondPredOnlyLM, multQuadLM)
```

## Categorical Variables
### Encoding Categorical Variables
```
# encoding categorical variables
    # method 1
x2 <- ifelse(as.character(data$catVar) == 'val1', 1, 0)
x3 <- ifelse(as.character(data$catVar) == 'val2', 1, 0)
    # method 2, best if have multiple categorical variables
catLM <- lm(yVar ~ as.factor(catVar), data=data)
catMat <- model.matrix(catLM)
```
### Parallel Model
#### Equation
$E(Y_i)=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\beta_3x_{i3}$
- predictor: $x_{i1}$
- categorical predictors: $x_{i2},\;x_{i3}$

#### Categorical Variables
- treatments: $A_i=(1,\;0),\;B_i=(0,\;1),\;C_i=(0,\;0)$
    - $\begin{equation}x_{i2}=\begin{cases}1 &\text{if in A}\\0 &\text{otherwise} \end{cases}\end{equation}$
    - $\begin{equation}x_{i3}=\begin{cases}1 &\text{if in B}\\0 &\text{otherwise} \end{cases}\end{equation}$
- first order categorical means:
    - A: $\beta_0 + \beta_2 + \beta_1x_{i1}$
    - B: $\beta_0 + \beta_3 + \beta_1x_{i1}$
    - C: $\beta_0 + \beta_1x_{i1}$
    
#### Coefficient Interpretation
- $\beta_1$ = slope for predictor after controlling for categorical variable
- $\beta_0$ = mean for C when the predictor $x_{i1}=0$ (intercept of the line for C)
- $\beta_2$ = mean difference between A and C for any predictor value
- $\beta_3$ = mean difference between B and C for any predictor value
    
```
# parallel model
parallelLM <- lm(yVar ~ xVar + as.factor(catVar))
# matrix to show which categorical values correspond to each coefficient
model.matrix(parallelLM)
```
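
The group intercepts of a parallel model can be read off the coefficients. This is a sketch that assumes the built-in `iris` data, where `Species` is already a factor and its reference level `setosa` plays the role of treatment C.

```r
# illustrative parallel model (an assumption, not part of the notes)
parallelFit <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
b <- coef(parallelFit)

# one intercept per group, reference level first
intercepts <- c(setosa     = b[["(Intercept)"]],
                versicolor = b[["(Intercept)"]] + b[["Speciesversicolor"]],
                virginica  = b[["(Intercept)"]] + b[["Speciesvirginica"]])
intercepts
# common slope shared by all three groups
b[["Sepal.Width"]]

# the gap between two groups is the same at any predictor value
gapAt <- function(w) {
    predict(parallelFit, data.frame(Sepal.Width = w, Species = "versicolor")) -
    predict(parallelFit, data.frame(Sepal.Width = w, Species = "setosa"))
}
c(gapAt(2.5), gapAt(4))
```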

### Nonparallel Model
#### Equation
$E(Y_i)=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\beta_3x_{i3}+\beta_{12}x_{i1}x_{i2}+\beta_{13}x_{i1}x_{i3}$
- predictor: $x_{i1}$
- categorical predictors: $x_{i2},\;x_{i3}$

#### Categorical Variables
- treatments: $A_i=(1,\;0),\;B_i=(0,\;1),\;C_i=(0,\;0)$
    - $\begin{equation}x_{i2}=\begin{cases}1 &\text{if in A}\\0 &\text{otherwise} \end{cases}\end{equation}$
    - $\begin{equation}x_{i3}=\begin{cases}1 &\text{if in B}\\0 &\text{otherwise} \end{cases}\end{equation}$
- second order categorical means: 
    - A: $\beta_0 + \beta_2 + (\beta_1+\beta_{12})x_{i1}$
    - B: $\beta_0 + \beta_3 + (\beta_1+\beta_{13})x_{i1}$
    - C: $\beta_0 + \beta_1x_{i1}$
    
#### Coefficient interpretation
- $\beta_2$ and $\beta_3$ are differences in intercepts (A vs C and B vs C)
- $\beta_{12}$ and $\beta_{13}$ are differences in slopes (A vs C and B vs C)

```
# nonparallel model
nonparallelLM <- lm(yVar ~ xVar + as.factor(catVar) + xVar*as.factor(catVar))
# matrix to show which categorical values correspond to each coefficient
model.matrix(nonparallelLM)
```
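
With a full set of interaction terms, each group gets its own intercept and slope, which match a separate regression fitted to that group alone. This is a sketch that assumes the built-in `iris` data, with `versicolor` standing in for treatment A.

```r
# illustrative nonparallel model (an assumption, not part of the notes)
nonparFit <- lm(Sepal.Length ~ Sepal.Width * Species, data = iris)
b <- coef(nonparFit)

# line for versicolor: (beta0 + beta2) + (beta1 + beta12) x
versicolorLine <- c(intercept = b[["(Intercept)"]] + b[["Speciesversicolor"]],
                    slope     = b[["Sepal.Width"]] + b[["Sepal.Width:Speciesversicolor"]])

# separate simple linear regression using only the versicolor rows
separateFit <- lm(Sepal.Length ~ Sepal.Width, data = iris,
                  subset = Species == "versicolor")
rbind(fromNonparallel = versicolorLine, separate = coef(separateFit))
```

The coefficients agree, but the standard errors differ because the nonparallel model pools all groups to estimate a single $\sigma^2$.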

### Hypothesis Tests for Models with Categorical Variables
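Is there a significant linear relationship between the predictor and the response for all groups of the categorical variable?
- parallel
    - $H_0:\beta_1=0$ vs $H_a:\beta_1\neq0$
- nonparallel
    - $H_0:\beta_1=\beta_{12}=\beta_{13}=0$ vs $H_a:$ at least one $\beta_{j}\neq0$ for $j=1,\;12,\;13$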

```
# reduced model
catOnlyLM <- lm(yVar ~ as.factor(catVar))
# test for linear relationship
anova(catOnlyLM, parallelLM)
anova(catOnlyLM, nonparallelLM)
```

Is there a significant difference in the mean response of treatments A, B, and C at any predictor value?
- parallel
    - $H_0:\beta_{2}=\beta_{3}=0$ vs $H_a:$ at least one $\beta_{j}\neq0$ for $j=2,\;3$
- nonparallel
    - $H_0:\beta_2=\beta_3=\beta_{12}=\beta_{13}=0$ vs $H_a:$ at least one $\beta_{j}\neq0$ for $j=2,\;3,\;12,\;13$

```
# reduced model
noCatLM <- lm(yVar ~ xVar)
# test for difference in mean response
anova(noCatLM, parallelLM)
anova(noCatLM, nonparallelLM)
```

Is there a significant interaction term between the predictor and categorical variables?
- $H_0:\beta_{12}=\beta_{13}=0$ vs $H_a:$ at least one $\beta_{1j}\neq0$ for $j=2,\;3$

```
# test for interaction effect
anova(parallelLM, nonparallelLM)
```