# **Exercise Session 10**
# Developed by Biljana Jonoska Stojkova, PhD
# Revised by Johnson Chen 


## **Lecture 10 - Estimating Strength and Direction of Relationships Between Variables: Assumptions, Diagnostics, and Multiple Linear Regression**

Today's exercise will build on the defined research questions from Lecture 9. We will practice determining how trustworthy the results from the analysis are. We will also see an example of a research question that adds covariates and uses a multiple linear regression model.

We will keep the same teams as those on the previous days. Each team will continue working on the `mtcars` dataset. Each team member will have to answer the questions and upload their Jupyter Notebook on Canvas.

### **Today's Learning Goals:**

- Assess the trustworthiness of the results of the analyses conducted in Lecture 9.
- Evaluate the assumptions of correlation analysis applied to research questions from Lecture 9.
- Evaluate the assumptions of simple linear regression applied to research questions from Lecture 9.
- Create diagnostic plots to assess the assumptions of the methods above and interpret them.
- Learn how to define research questions to adjust the main effects of interest with additional covariates.
- Ensure you upload your Jupyter Notebook at the end of the day.

### **Tasks for All Teams**

You will continue working within your team.

- Pull up the lecture notes with research questions from Lecture 9.
- Run diagnostics for the correlation research questions.
- Run diagnostics for the unidirectional relationships defined by research questions.
- Interpret the diagnostic results.
- Define research questions to adjust the main effects of interest with additional covariates.

#### **Assumptions of Pearson Correlation**

- **Linear Relationship:** There exists a linear relationship between the variables x and y.
  
- **No Outliers:** Extreme outliers can affect the results.

- **Related Pairs:** Each observation has two values, one for the variable x and one for the variable y.

- **Normality:** Both variables should be approximately normally distributed.

- **Interval or Ratio Type of Variables:** Both variables should be interval or ratio types of data.

#### **Assumptions of Linear Regression**

- **Linear Relationship:** There exists a linear relationship between the variables x and y.
  
- **Independence:** Observations should be measurements taken from independent observational or experimental units (i.e., true replicates).

- **Homoscedasticity:** The residuals have constant variance at every level of the variable x.

- **Normality of the Residuals:** Residuals should be approximately normally distributed.

- **No Influential Outliers:** Extreme influential outliers can affect the results.

In [None]:
#Run this code
library(tidyverse)
data("mtcars")
head(mtcars)

In [None]:
#Let's pull up R help to learn about the mtcars data set
?mtcars

### **Teams 1 - 11:**

Your task is to examine the assumptions of the appropriate statistical methods for the statistical hypotheses you defined in Lecture 9. Start by examining the dataset, running the code cells, and discussing with your team.

**Q1.** Consider the statistical problem you formulated in Q2 in Lecture 9. Assess and interpret the assumptions of Pearson correlation.

**Q2.** Consider the statistical problem you formulated in Q6 in Lecture 9. Assess and interpret the assumptions of the linear model.

**Q3.** Bonus question: Can you formulate a statistical problem where you can estimate the impact of the main effect on the response variable, adjusted for other variables? Formulate the statistical problem and suggest an appropriate method for solving this statistical problem.

Hint: Covariates can be added in a multiple regression model to adjust the main effect of interest on the response.


**Answer Q1: A1 Linear relationship, there exists a linear relationship between the variables x and y.**

In [None]:
#Run this code for question Q1 (assess linear relationship)
#Scatterplot of mpg against hp with the fitted straight line and its confidence band
g1 = mtcars %>% ggplot(aes(y=mpg, x=hp)) + geom_point() + 
     labs(title="Scatterplot of horsepower and miles per gallon",x="Horsepower (hp)",y="Miles Per Gallon (mpg)") + 
     geom_smooth(method="lm", formula=y~x)
g1
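
One way to judge linearity more directly is to overlay a flexible smoother on the straight-line fit: if the two curves diverge noticeably, the linear relationship may be questionable. The following is a minimal sketch, not part of the original exercise; the object name `g_lin` and the choice of a loess smoother are assumptions made for illustration.

```r
library(ggplot2)
data("mtcars")

# Blue: straight-line (lm) fit. Red: loess smoother (an illustrative choice).
# Large gaps between the two curves hint at a non-linear relationship.
g_lin <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "blue") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, colour = "red") +
  labs(title = "Linear fit (blue) versus loess smoother (red)",
       x = "Horsepower (hp)", y = "Miles Per Gallon (mpg)")
g_lin
```

Discuss with your team whether the red curve stays close to the blue line across the whole range of horsepower.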


**Answer Q1: A2 No outliers, extreme outliers can affect the results.**

In [None]:
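#Run this code for question Q1 (assess outliers)
#Reshape to long format: one row per car and variable (mpg or hp)
ds_long = mtcars %>% pivot_longer(c(mpg, hp), names_to="variable", values_to="value")

#Histograms of each variable; isolated bars far from the rest suggest outliers
g2 = ds_long %>% ggplot(aes(x=value)) + geom_histogram(bins=10) + 
     labs(title="Histograms of horsepower and miles per gallon",x="Value",y="Frequency") + 
     facet_wrap(~variable,scales="free")
g2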
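
Histograms show outliers only informally. As a complementary check, the sketch below flags values that a boxplot would draw as individual points (more than 1.5 times the interquartile range beyond the hinges). This rule and the object names `hp_outliers` and `mpg_outliers` are assumptions made for illustration; they are not part of the original exercise.

```r
data("mtcars")

# Values that fall beyond the boxplot whiskers (1.5 x IQR rule)
hp_outliers  <- boxplot.stats(mtcars$hp)$out
mpg_outliers <- boxplot.stats(mtcars$mpg)$out

# Show which cars the flagged values belong to
mtcars[mtcars$hp %in% hp_outliers, c("mpg", "hp")]
mtcars[mtcars$mpg %in% mpg_outliers, c("mpg", "hp")]
```

A flagged value is only a candidate outlier; decide with your team whether it is extreme enough to affect the correlation.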


**Answer Q1: A3 Related Pairs, each observation has two values, one for the variable x and one for the variable y.** Answer this based on your understanding of the data.


In [None]:
head(mtcars[,c("mpg","hp")])

**Answer Q1: A4 Normality, both variables should be approximately normally distributed.**

In [None]:
#Run this code for question Q1 (assess normality)
g3 = ds_long %>% ggplot(aes(sample=value)) + stat_qq()  + stat_qq_line(col = "red")+
     labs(title="Normal Q-Q plots of horsepower and miles per gallon",x="Theoretical quantiles",y="Sample quantiles") + 
     facet_wrap(~variable,scales="free")
g3
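
A formal test can complement the Q-Q plots. The sketch below applies the Shapiro-Wilk test to each variable; using this test is an assumption made for illustration and is not required by the exercise. A small p-value suggests a departure from normality, but with only 32 cars the test has limited power, so read it together with the plots.

```r
data("mtcars")

# Shapiro-Wilk test of normality for each variable (null hypothesis: normal)
sw_mpg <- shapiro.test(mtcars$mpg)
sw_hp  <- shapiro.test(mtcars$hp)

sw_mpg
sw_hp
```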

**Answer Q1: A5 Interval or ratio type of variables, both variables should be interval or ratio types of data.** 

Answer this based on your understanding of the data types.

In [None]:
#Run this code to answer Q2 (assess the linear regression model assumptions)
#Fit the simple linear regression of mpg on hp
m1= lm(mpg~hp,data=mtcars)
summary(m1)
#Show the four diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(m1)

**Answer Q2: A1 Linear relationship, there exists a linear relationship between the variables x and y.** 

In the Residuals vs Fitted plot, look for residuals scattered randomly around zero with no systematic pattern (for example, no curve).


**Answer Q2: A2 Independence, observations should be measurements taken from independent observational or experimental units (i.e., true replicas).** 

Answer this based on your understanding of the data and how the observations were collected.


**Answer Q2: A3 Homoscedasticity, the residuals have constant variance at every level of the variable x.** 

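Interpret the Scale-Location plot. Look for a roughly horizontal trend line with points spread evenly around it; a funnel shape or a clear trend suggests non-constant variance.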
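
For a numerical complement to the Scale-Location plot, the sketch below computes the studentized (Koenker) version of the Breusch-Pagan test by hand: it regresses the squared residuals on `hp` and compares n times the R-squared of that auxiliary regression with a chi-square distribution. Using this test, and the object names `aux`, `bp_stat`, and `bp_p`, are assumptions made for illustration; the exercise itself only asks for the plot.

```r
data("mtcars")

# Refit the model from Q2 so this cell runs on its own
m1 <- lm(mpg ~ hp, data = mtcars)

# Auxiliary regression: squared residuals on the predictor
sq_res <- residuals(m1)^2
aux <- lm(sq_res ~ hp, data = mtcars)

# Test statistic: n * R^2, chi-square with 1 df (one predictor)
bp_stat <- nrow(mtcars) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(statistic = bp_stat, p.value = bp_p)
```

A small p-value would point towards non-constant variance; interpret it alongside the plot rather than on its own.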


**Answer Q2: A4 Normality of the residuals, residuals should be approximately normally distributed.** 

In the normal Q-Q plot of the residuals, look at how closely the points follow the reference line, that is, how well the residuals match the theoretical normal distribution.


**Answer Q2: A5 No influential outliers.** 

In the Residuals vs Leverage plot, look for points that fall outside the dashed Cook's distance contours.
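
Cook's distances can also be inspected as numbers. By default, the dashed contours in the `plot()` output for an `lm` object mark Cook's distances of 0.5 and 1. The sketch below lists the cars whose Cook's distance exceeds 4/n, a common and fairly strict rule of thumb; that cutoff and the object names `cd` and `threshold` are assumptions made for illustration.

```r
data("mtcars")

# Refit the model from Q2 so this cell runs on its own
m1 <- lm(mpg ~ hp, data = mtcars)

# Cook's distance for every car
cd <- cooks.distance(m1)

# Rule-of-thumb cutoff 4/n (an assumption, not a formal test)
threshold <- 4 / nrow(mtcars)

# Cars above the cutoff, most influential first
sort(cd[cd > threshold], decreasing = TRUE)
```

Cars listed here deserve a closer look, but exceeding 4/n does not by itself mean a point is outside the dashed contours in the plot.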

**Answer Q3:**

Hint: Suppose we want to estimate the effect of horsepower (hp) on miles per gallon (mpg), adjusted for displacement (disp). Formulate the statistical problem and explain an appropriate method of analysis; consider a multiple linear regression model. Starting from the simple model formula `mpg ~ hp`, try to write out the R model formula that also includes displacement.
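
To illustrate what adjusting for a covariate looks like in R, the sketch below uses weight (`wt`) as the covariate instead of displacement, so the formula asked for in the hint is still left to you. The choice of `wt` and the object names `m_simple` and `m_adj` are assumptions made for illustration.

```r
data("mtcars")

# Unadjusted model: effect of hp on mpg
m_simple <- lm(mpg ~ hp, data = mtcars)

# Adjusted model: effect of hp on mpg, holding weight (wt) fixed
m_adj <- lm(mpg ~ hp + wt, data = mtcars)

# Compare the hp coefficient before and after adjustment
coef(summary(m_simple))["hp", ]
coef(summary(m_adj))["hp", ]
```

In the adjusted model, the `hp` coefficient is interpreted as the expected change in mpg per unit of horsepower for cars of the same weight. The diagnostics from Q2 apply to the multiple regression model as well.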

**Upload your work from Lecture 10 Exercise session**

**Note.** A Jupyter Notebook is acceptable for the class participation mark. 
          Please make sure you save your Jupyter Notebook with your answers.

- Each student will upload the Jupyter Notebook to Canvas Course 1, https://canvas.ubc.ca/courses/144703, using the following file name:

 `[Lecture_10_Exercise_Session 10]_[TeamNumber]_[student name].ipynb`
e.g., `Lecture_10_Exercise_Session 10_Team21_Biljana_Jonoska_Stojkova.ipynb`

- Please indicate in the title who was responsible for writing each paragraph. 

Navigate to the Assignments section on Canvas Course 1, and upload the Jupyter Notebook under:
`Class Participation\Lecture 10 - Estimating strength and direction of relationships between variables: Assumptions, diagnostics and multiple linear regression` 
