# Chapter 3 Lab: Diagnostics and Remedial Measures (25 points total)

The goal of this lab assignment is to walk you through a complete simple linear regression analysis involving 

<ul>
    <li> specification of a functional relationship,  </li>
    <li> regression diagnostics,  </li>
    <li> and statistical inference.  </li>
</ul>

The lab will be submitted as the next homework assignment (see Canvas for details).

### (1 point)
### 1. Load the CDI dataset

In [None]:
options(repr.plot.width=5, repr.plot.height=5)

# read in data

# view first few lines of data

### (1 point)
### 2. Visualize data

As before, we are interested in the relationship between personal income (X) and number of active physicians (Y).

- Create a scatterplot visualizing this relationship.
- Visualize the distributions of the predictor and response variables using some of the plots described in class (e.g., boxplot(), hist(), qqnorm())
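
As a rough illustration of these plotting functions, here is a minimal sketch on simulated data. The vectors `x_sim` and `y_sim` are hypothetical stand-ins and are not the CDI columns; swap in your own variables.

```r
# Simulated right-skewed data (hypothetical stand-ins for X and Y)
set.seed(42)
x_sim <- rlnorm(200, meanlog = 8, sdlog = 1)
y_sim <- 0.002 * x_sim + rlnorm(200, meanlog = 2, sdlog = 0.5)

# Arrange four plots on one page
par(mfrow = c(2, 2))
plot(x_sim, y_sim, xlab = "X", ylab = "Y", main = "Scatterplot")
hist(x_sim, main = "Histogram of X")
boxplot(y_sim, main = "Boxplot of Y")
qqnorm(y_sim)
qqline(y_sim)   # reference line for the normal Q-Q plot
par(mfrow = c(1, 1))
```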

### (1 point)
### 3. Visual Observations

- What type of functional relationship do you visually observe from the scatterplot?
- What is your observation of the distribution of the predictor and response variables?

### (1 point)
### 4. Regression

- Regress number of active physicians (Y) on personal income (X) using the lm() function
- Assume a linear functional relationship
- Store the regression as 'fit'
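
For reference, the general pattern of an `lm()` call looks like the sketch below. It uses a simulated data frame `sim` with hypothetical columns `x` and `y`, so it shows the syntax only and not the CDI fit.

```r
# Simulated data with a known linear relationship (hypothetical, not CDI)
set.seed(7)
sim <- data.frame(x = runif(50, 0, 10))
sim$y <- 1 + 2 * sim$x + rnorm(50)

# Regress y on x; the formula reads "y is modeled by x"
sim_fit <- lm(y ~ x, data = sim)

coef(sim_fit)      # estimated intercept and slope
summary(sim_fit)   # standard errors, t tests, R-squared
```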

### (2 points)
### 5. Diagnostics

- Create diagnostics plots using the plot() function (e.g. $\texttt{plot(fit)}$)
- Evaluate the first three plots (residuals vs fitted, Normal Q-Q, and Scale-Location)
- Which of the assumptions are satisfied? Which are violated? Why?
- Are we able to conduct statistical inference on the parameters?
- Please be thorough with your analysis!

In [None]:
#example code to view first three plots
options(repr.plot.width=10, repr.plot.height=3.5)
par(mfrow=c(1,3))
plot(fit, which=c(1,2,3))

### (4 points)
### 5. Test for presence of heteroskedasticity

- Use the Brown-Forsythe test for heteroskedasticity [$\texttt{leveneTest()}$]
  - Requires the car package [$\texttt{install.packages("car")}$]
  - Note that the Brown-Forsythe test is Levene's test computed with group medians instead of group means. We prefer it because it is more robust to non-normal errors
  - Also note that the test in R differs slightly from the version we went over in class, but the hypotheses are the same
  - What is your conclusion? Is the constant variance (homoskedasticity) assumption satisfied?
  
  
- Are we allowed to use the Breusch-Pagan test for heteroskedasticity? Why or why not?
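
To see how a two-group Brown-Forsythe statistic can be computed by hand, here is a sketch in base R on simulated data (the variables below are hypothetical and not the CDI data). It forms the absolute deviations of the residuals from their group medians and runs a pooled two-sample t test on them. With two groups, the square of this t statistic equals the F statistic from a one-way ANOVA on the same deviations, which is the quantity a median-centered Levene test is built on.

```r
# Simulated data whose error spread grows with x (hypothetical, not CDI)
set.seed(1)
x <- runif(100, 1, 10)
y <- 2 + 3 * x + rnorm(100, sd = x)
sim_fit <- lm(y ~ x)

e <- residuals(sim_fit)
# Split observations into two groups at the median of x
group <- factor(x < median(x), labels = c("high", "low"))

# Absolute deviations of residuals from their own group median
d <- abs(e - ave(e, group, FUN = median))

n <- tapply(d, group, length)
dbar <- tapply(d, group, mean)
# Pooled variance of the absolute deviations
s2 <- sum((d - ave(d, group))^2) / (sum(n) - 2)

# Two-sample t statistic and two-sided p-value
t_bf <- unname((dbar["high"] - dbar["low"]) /
                 sqrt(s2 * (1 / n["high"] + 1 / n["low"])))
p_value <- 2 * pt(-abs(t_bf), df = sum(n) - 2)

# F statistic from a one-way ANOVA on the same deviations
f_stat <- anova(lm(d ~ group))[1, "F value"]
c(t_squared = t_bf^2, F = f_stat, p = p_value)
```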


In [None]:
#install.packages("car")

In [None]:
library(car)

# you can play around with this cutoff (used to define two groups)
cutoff <- median(cdi$income_personal)

# We are interested in heteroskedasticity in the residuals
resid <- fit$residuals
# Divides predictor variable into two groups
var_group <- factor(cdi$income_personal < cutoff)

# create a dataframe to perform test
bf_dat <- data.frame(resid, var_group)
leveneTest(resid ~ var_group, data = bf_dat, center = median)

### (4 points)
### 6. Transformations by Hand (without Box-Cox Transformation)

- Experiment with transformations on the predictor and response variables
- Re-visualize distributions of transformed variables
- Run regressions
- Re-perform diagnostics to evaluate model assumptions
- Have fun with the data!

- What do you observe? Where are you having trouble? Which assumptions are you still violating?
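
One way to organize this experimentation is to fit the raw and transformed models side by side and compare their diagnostic plots. The sketch below does this on simulated data that is linear on the log-log scale by construction. The data frame `sim` and the choice of a log transformation are assumptions made for illustration and are not a recommendation for the CDI data.

```r
# Simulated data that is linear on the log-log scale (hypothetical, not CDI)
set.seed(11)
sim <- data.frame(x = rlnorm(300, meanlog = 3, sdlog = 1))
sim$y <- exp(1 + 0.8 * log(sim$x) + rnorm(300, sd = 0.4))

fit_raw <- lm(y ~ x, data = sim)            # untransformed model
fit_log <- lm(log(y) ~ log(x), data = sim)  # both variables log-transformed

# Top row: raw model diagnostics; bottom row: transformed model diagnostics
par(mfrow = c(2, 3))
plot(fit_raw, which = 1:3)
plot(fit_log, which = 1:3)
par(mfrow = c(1, 1))
```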

### (2 points)
### 7. Breusch-Pagan Test for Heteroskedasticity

- Use the Breusch-Pagan test to test for heteroskedasticity [$\texttt{bptest()}$]
  - Requires the lmtest package [$\texttt{install.packages("lmtest")}$]
  - Perform the Breusch-Pagan test with some of the transformations you performed above
  - Does your final proposed model satisfy the constant variance (homoskedasticity) assumption?
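
To make the mechanics of the test concrete, the sketch below computes a studentized (Koenker) Breusch-Pagan statistic by hand in base R: regress the squared residuals on the predictor and multiply the auxiliary $R^2$ by $n$. To our understanding this matches the default behavior of $\texttt{bptest()}$, but treat that as an assumption and check the package documentation. It may also differ from the form of the statistic shown in class. The data are simulated and hypothetical.

```r
# Simulated data whose error spread grows with x (hypothetical, not CDI)
set.seed(5)
x <- runif(150, 1, 10)
y <- 2 + 3 * x + rnorm(150, sd = 0.5 * x)
sim_fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the predictor
e2 <- residuals(sim_fit)^2
aux <- lm(e2 ~ x)

# Studentized statistic: n times the auxiliary R-squared,
# compared to a chi-squared distribution with 1 df (one predictor)
n <- length(e2)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(statistic = bp_stat, p_value = bp_p)
```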

In [None]:
#install.packages("lmtest")

# load library
library(lmtest)

In [None]:
# run BP test
bptest(fit)

### (2 points)
### 8. Lowess Smooth

- Plot your transformed data using a scatterplot
- Overlay a lowess smooth onto the scatterplot
- What functional relationship do you observe?

In [None]:
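# Placeholder transformations (log is only an example); replace with your own choices
trans_phys <- log(cdi$number_active_physicians)  # your transformed response
trans_inc <- log(cdi$income_personal)            # your transformed predictor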

# plot data
options(repr.plot.width=5, repr.plot.height=5)
plot(trans_inc, trans_phys,
    xlab = "Personal Income (transformed)", ylab = "Number of Active Physicians (transformed)")

# overlay regression line
abline(lm(trans_phys~trans_inc))

# overlay lowess smooth
lines(lowess(trans_inc, trans_phys, f = 0.2), col="red", lwd=2)

### (3 points)
### 9. Box-Cox Transformations

- Use Box-Cox to assist you in finding the proper transformation
  - Requires "car" package
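
To show what the Box-Cox procedure is doing, here is a base-R sketch that evaluates the profile log-likelihood over a grid of $\lambda$ values and then refits the model with the selected power. The helper names `box_cox` and `profile_loglik` and the simulated data are hypothetical. This is an illustration of the idea and not the implementation used by the car package.

```r
# Simulated data that is linear after a log transform of y (hypothetical, not CDI)
set.seed(3)
x <- runif(200, 1, 10)
y <- exp(0.5 + 0.3 * x + rnorm(200, sd = 0.3))

# Box-Cox power transformation; lambda near 0 corresponds to log(y)
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Profile log-likelihood (up to a constant) for a given lambda
profile_loglik <- function(lambda) {
  rss <- sum(residuals(lm(box_cox(y, lambda) ~ x))^2)
  n <- length(y)
  -n / 2 * log(rss / n) + (lambda - 1) * sum(log(y))
}

lambdas <- seq(-2, 2, by = 0.1)
ll <- sapply(lambdas, profile_loglik)
pow_sim <- lambdas[which.max(ll)]   # lambda with the largest log-likelihood

# Refit using the selected transformation of the response
fit_bc <- lm(box_cox(y, pow_sim) ~ x)
```

In practice a nearby convenient value (such as 0 for a log or 0.5 for a square root) is often used in place of the exact maximizer, since it is easier to interpret.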

In [None]:
# fit original regression
fit <- lm(number_active_physicians~income_personal, data=cdi)

# use Box-Cox
bc <- boxCox(fit)

# select lambda that maximizes log-likelihood
pow <- bc$x[which.max(bc$y)]

### (3 points)
### 10. Complete Analysis

- Run your final model
- Show diagnostic plots and determine if assumptions are satisfied
- Perform a statistical test on the slope parameter
- (Attempt to) interpret your findings
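
As a sketch of how the slope test can be read off a fitted model, the example below pulls the slope row from the coefficient table and reproduces the t statistic by hand. The data frame `sim` and the object `final_fit` are hypothetical; apply the same calls to your own final model.

```r
# Simulated data (hypothetical, not CDI)
set.seed(21)
sim <- data.frame(x = runif(80, 0, 10))
sim$y <- 5 + 1.5 * sim$x + rnorm(80, sd = 2)
final_fit <- lm(y ~ x, data = sim)

# Row of the coefficient table for the slope:
# estimate, standard error, t value, and two-sided p-value for H0: slope = 0
slope_row <- summary(final_fit)$coefficients["x", ]
slope_row

# 95% confidence interval for the slope
confint(final_fit, "x", level = 0.95)

# The t statistic is the estimate divided by its standard error
t_by_hand <- slope_row["Estimate"] / slope_row["Std. Error"]
```

If the final model uses transformed variables, remember that the slope is on the transformed scale when interpreting it.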