# DS105 Intermediate Statistics : Lesson Seven Companion Notebook

### Table of Contents <a class="anchor" id="DS105L7_toc"></a>

* [Table of Contents](#DS105L7_toc)
    * [Page 1 - Introduction to ANCOVAs](#DS105L7_page_1)
    * [Page 2 - ANCOVAs](#DS105L7_page_2)
    * [Page 3 - ANCOVA Setup in R](#DS105L7_page_3)
    * [Page 4 - ANCOVAs in R](#DS105L7_page_4)
    * [Page 5 - ANCOVA Activity](#DS105L7_page_5)
    * [Page 6 - ANCOVA Activity Solution](#DS105L7_page_6)
    * [Page 7 - Key Terms](#DS105L7_page_7)
    * [Page 8 - Lesson 7 Hands-On](#DS105L7_page_8)
    * [Page 9 - Lesson 7 Hands-On Solution](#DS105L7_page_9)    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Introduction to ANCOVAs<a class="anchor" id="DS105L7_page_1"></a>

[Back to Top](#DS105L7_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [1]:
from IPython.display import VimeoVideo
# Tutorial Video Name: ANCOVAs
VimeoVideo('388137895', width=720, height=480)

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L07overview.zip)**.

# Introduction

Now that you have the basics of ANOVAs down, you will begin to explore some of the more advanced ANOVA variations.  The ANOVA family of analyses is endlessly diverse, which makes it a great tool to have in your pocket! You can have as many independent variables as you like, as long as you have the sample size for it, and you can add in covariates that will help control for factors that may influence how your IV and your DV are related.

As you may have discovered last lesson, R is really the most appropriate program to run ANOVA tests in.  Therefore, this lesson you will dig into within-subjects ANOVAs, which help you look at changes over time; mixed-measures ANOVAs, which have both a within-subjects and a between-subjects element to them; and ANCOVAs (short for *Analysis of Covariance*), which include one or more covariates.

By the end of this lesson, you will be able to:
* Test for assumptions and compute an ANCOVA in R using the linear modeling function

This lesson will culminate in a hands-on that allows you to put your skills to use analyzing cell phone plan data.


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - ANCOVAs<a class="anchor" id="DS105L7_page_2"></a>

[Back to Top](#DS105L7_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [2]:
from IPython.display import VimeoVideo
# Tutorial Video Name: ANCOVAs
VimeoVideo('340999354', width=720, height=480)

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L07pg2tutorial.zip)**.



*ANCOVA* stands for *analysis of covariance*.  It is an analysis in the family of ANOVAs, and the big difference is that "C": an ANCOVA takes into account, or adjusts for, yet another factor in your model, aptly named a *covariate*.  Put another way, an ANCOVA controls for differences that might come up naturally.  For instance, suppose that men and women differ slightly, on average, in spatial and analytic reasoning.  If you were trying to test out a new method of studying for a math class, and you tested it on both men and women, you would want to control for any natural effect that gender might have on mathematical reasoning. You could test spatial and analytic reasoning skills at pre-test, teach everyone to study with your new method, and then test those skills again at post-test.  If you also collected everyone's gender, you could then use it as a covariate to control for the natural differences between men and women.

Covariates are typically continuous, but it is possible to use categorical covariates if you dummy code them.
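
To make dummy coding concrete, here is a minimal sketch in base R. The `gender` vector is a hypothetical stand-in rather than data from this lesson; `model.matrix()` simply reveals the 0/1 column that R builds when a factor is placed in a model formula.

```{r}
# Hypothetical categorical covariate (illustration only, not lesson data)
gender <- factor(c("female", "male", "male", "female", "male"))

# model.matrix() exposes the dummy (0/1) coding that R uses inside lm()
dummy <- model.matrix(~ gender)
dummy
```

The first level alphabetically (`female`) becomes the reference group, so the single `gendermale` column is 1 for men and 0 for women. Because R does this automatically for factors, you rarely need to build dummy columns by hand.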

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>If you would like to play around more with all the different variations of factorial ANOVAs, feel free to check out <a href="https://www.uvm.edu/~dhowell/StatPages/R/RepeatedMeasuresAnovaR.html">this website</a></p>
    </div>
</div>

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - ANCOVA Setup in R<a class="anchor" id="DS105L7_page_3"></a>

[Back to Top](#DS105L7_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [3]:
from IPython.display import VimeoVideo
# Tutorial Video Name: ANCOVAs Part I
VimeoVideo('340999354', width=720, height=480)

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L07pg3-4tutorial.zip)**.

# ANCOVA Set Up in R

Now that you have learned a little bit about covariates, it's time to put them into use! But of course, it's important to get all your ducks in a row first.

---

## Load Libraries

There are several libraries you will need to load in order to complete the ANCOVA process. ```rcompanion```, as you know by now, is for assessing normality with easy, best-fit histograms. The ```car``` library is both for assessing homogeneity of variance and for dealing with violations of said assumption. The ```effects``` library will assist you in the creation of means adjusted by your covariate, and the ```multcomp``` package is for conducting post hocs for ANCOVAs.

```{r}
library("rcompanion")
library("car")
library("effects")
library("multcomp")
```

---

## Load in Data

The data you will be conducting ANCOVAs upon is data about admissions into a graduate school program.  It is located **[here](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/graduate_admissions.zip)**. 
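
The loading step is not shown on this page, so here is a hedged sketch of one way it might look. It assumes the download unzips to a CSV file; the header names and the three rows below are made up so that the example runs on its own, and `demo_admissions` is a hypothetical object name. By default, `read.csv()` replaces spaces in column headers with periods, which is one way names such as `University.Rating` can arise.

```{r}
# Write a tiny stand-in CSV (hypothetical headers and rows, not the real file)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "University Rating,Research,Chance of Admit",
  "4,1,0.92",
  "3,1,0.76",
  "2,0,0.65"
), csv_path)

# With your own download, point read.csv() at the unzipped file instead
demo_admissions <- read.csv(csv_path)

# Spaces in the headers have been converted to periods
names(demo_admissions)
```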

---

## Question Set Up

You will answer the following question with your ANCOVA: 

```text
Controlling for students' research participation in undergrad, does the rating of the students' undergraduate university impact their chance of admittance into graduate school? 
```

For this question, the covariate is students' research participation in undergrad (```Research```), since that is what you are controlling.  The IV is the categorical variable, rated 1-5, of students' undergraduate university (```University.Rating```), and the DV is the chance of admittance into graduate school (```Chance.of.Admit```).

---

## Data Wrangling

With this particular dataset, there is very little data wrangling required. All that is necessary is to ensure that all categorical variables are factors, not integers, since the data type will throw off the code for getting adjusted means and post hocs later. The ANCOVA itself will run with either data type, but it will treat an integer as a single continuous predictor rather than as a set of groups.

---

### Ensure the IV is a Factor 

You can see what format the data is in for your IV by using the ```str()``` function: 

```{r}
str(graduate_admissions$University.Rating)
```

Note that the output shows it is an integer! That can easily be converted: 

```{r}

graduate_admissions$University.Rating <- as.factor(graduate_admissions$University.Rating)
```

---

### Ensure the CV is a Factor

Covariates can either be categorical or continuous.  Either is perfectly acceptable.  However, if you do use a categorical CV, like ```Research``` is, it needs to be formatted as factor data, not as an integer.  You can test and fix this just like the IV: 

```{r}
str(graduate_admissions$Research)
graduate_admissions$Research <- as.factor(graduate_admissions$Research)
```

You have now completed all data wrangling activities for this particular ANCOVA!
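
To see why the conversion matters, here is a small sketch on made-up numbers (the `toy` data frame is hypothetical, not the admissions data). It fits the same model twice and compares the degrees of freedom that R assigns to the rating variable.

```{r}
# Hypothetical ratings stored as integers (illustration only)
toy <- data.frame(
  rating = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L),
  score  = c(0.40, 0.55, 0.70, 0.45, 0.50, 0.75, 0.42, 0.58, 0.69)
)

# As an integer, rating is treated as one numeric slope: 1 degree of freedom
df_as_integer <- anova(lm(score ~ rating, data = toy))$Df[1]

# As a factor, rating is treated as 3 separate groups: 2 degrees of freedom
toy$rating   <- as.factor(toy$rating)
df_as_factor <- anova(lm(score ~ rating, data = toy))$Df[1]

c(integer = df_as_integer, factor = df_as_factor)
```

The model runs either way, but only the factor version compares group means, which is what the post hocs and adjusted means later in this lesson expect.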

---

## Testing Assumptions

The assumptions for an ANCOVA are similar to those for your basic ANOVAs. However, one assumption is added: the assumption of *homogeneity of regression slopes*. In this lesson, you will check it by testing whether the dependent variable (DV) and the covariate are independent of each other. The reasoning is that you are controlling for the presence of the covariate, not determining whether it has an effect on your DV. If the covariate is related to your DV, then most likely it should be an independent variable, not a covariate, because it is explaining some variance in your model. Be aware that many statistics texts define this assumption differently: the slope relating the covariate to the DV should be the same at every level of the IV, which is checked by testing the IV-by-covariate interaction.

---

### Normality

Is this one starting to get old hat yet? Take a look at the normality of your continuous variable, ```Chance.of.Admit```.

```{r}
plotNormalHistogram(graduate_admissions$Chance.of.Admit)
```

Here is the resulting figure: 

![A bar graph depicts the plot of x against the frequency. The x-axis represents x and the y-axis represents frequency. The x-axis ranges from 0.3 to 1.0. The y-axis ranges from 0 to 60. Fourteen bars and a curve are plotted.](Media/ANCOVA5.png)

It could probably be a little more normal, though it's not too bad.  Try a square transformation: 

```{r}
graduate_admissions$Chance.of.AdmitSQ <- graduate_admissions$Chance.of.Admit * graduate_admissions$Chance.of.Admit
plotNormalHistogram(graduate_admissions$Chance.of.AdmitSQ)
```

It looks like that was a pretty good choice, huh? 

![A bar graph depicts the plot of x against the frequency. The x-axis represents x and the y-axis represents frequency. The x-axis ranges from 0.2 to 1.0. The y-axis ranges from 0 to 80. Fourteen bars and a curve are plotted. The curve is exactly placed on the bars.](Media/ANCOVA6.png)

Your data now meets the assumption of normality.
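
Histograms are a judgment call, so a number can be a helpful second opinion. The sketch below is an addition to the lesson's method, not part of `rcompanion`: it defines a small hypothetical helper, `skewness()`, and applies it to made-up left-skewed values (evenly spaced quantiles of a Beta(5, 2) distribution) before and after a square transformation.

```{r}
# Hypothetical left-skewed values between 0 and 1, standing in for a
# variable shaped like Chance.of.Admit (not the lesson data)
x <- qbeta(ppoints(400), shape1 = 5, shape2 = 2)

# Simple moment-based skewness: 0 is symmetric, negative is left-skewed
skewness <- function(v) mean((v - mean(v))^3) / mean((v - mean(v))^2)^1.5

skew_raw     <- skewness(x)    # before the transformation
skew_squared <- skewness(x^2)  # after squaring

c(raw = skew_raw, squared = skew_squared)
```

Squaring stretches the upper end of the scale more than the lower end, which pulls a left-skewed distribution back toward symmetry. That is why the squared values end up with a skewness closer to zero.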

---

### Homogeneity of Variance

To test for homogeneity of variance, you will run Levene's Test.  Unfortunately, however, Levene's test will not take into account any categorical covariates (though they are perfectly valid).  So you'll include the model as best you can without it: 

```{r}
leveneTest(Chance.of.AdmitSQ~University.Rating, data=graduate_admissions)
```

Here is the result: 

```text
Levene's Test for Homogeneity of Variance (center = median)
       Df F value  Pr(>F)  
group   4  2.4283 0.04734 *
      395                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

Unfortunately, this test is significant, which means that you have violated the assumption of homogeneity of variance.  You will learn how to correct for this violation on the next page.
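
If you are curious what Levene's test is doing under the hood, the median-centered version can be sketched in base R as a one-way ANOVA on absolute deviations from each group's median. The data below are simulated stand-ins with deliberately unequal spreads, and the object names are hypothetical.

```{r}
# Simulated stand-in data: three groups, the third with a much larger spread
set.seed(42)
toy <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 30)),
  y     = c(rnorm(30, sd = 1), rnorm(30, sd = 1), rnorm(30, sd = 4))
)

# Step 1: absolute deviation of each value from its own group's median
toy$abs_dev <- abs(toy$y - ave(toy$y, toy$group, FUN = median))

# Step 2: a one-way ANOVA on those deviations gives the Levene F test
levene_by_hand <- anova(lm(abs_dev ~ group, data = toy))
levene_by_hand
```

A significant *F* here means the typical distance from the group median differs between groups, which is another way of saying the variances are not homogeneous.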

---

### Homogeneity of Regression Slopes

In order to test for homogeneity of regression slopes, you can run a one-way ANOVA, with your covariate as the IV and the DV you were planning to use for your ANCOVA. If the *F* test is non-significant, then you are good to go!

Here is the basic ANOVA information: 

```{r}
Homogeneity_RegrSlp = lm(Chance.of.AdmitSQ~Research, data=graduate_admissions)
anova(Homogeneity_RegrSlp)
```

As you can see, the ```lm()``` function creates your linear model, and the result is saved into an object called ```Homogeneity_RegrSlp```.  The dependent variable will remain the same as the one you plan to use in your ANCOVA: ```Chance.of.AdmitSQ```. The covariate you're planning to use in the ANCOVA will be the IV: ```Research```.  Specify the dataset with the argument ```data=``` and away you run! To get an ANOVA summary table out of this analysis, just call the ```anova()``` function on your saved object. 

Here are the results: 

```text
Analysis of Variance Table

Response: Chance.of.AdmitSQ
           Df  Sum Sq Mean Sq F value    Pr(>F)    
Research    1  5.2035  5.2035   189.8 < 2.2e-16 ***
Residuals 398 10.9113  0.0274                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

Unfortunately, since the *p* value is significant, your data does not meet the assumption of homogeneity of regression slopes.  That means that whether someone does research or not actually does have an impact on their chance of admittance to graduate school, and that you should NOT use ```Research``` as a covariate, but rather include it as a second independent variable in the model.  However, for the purposes of learning, you will just continue on.
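
For comparison, many textbooks check homogeneity of regression slopes in a different way from the one used on this page: they test whether the slope of the covariate is the same in every group, using the covariate-by-IV interaction. The sketch below illustrates that alternative on simulated stand-in data; every name in it is hypothetical, and the slopes are equal by construction.

```{r}
# Simulated stand-in data (not lesson data): two groups, one continuous covariate
set.seed(7)
n <- 100
toy <- data.frame(
  group     = factor(rep(c("control", "treated"), each = n / 2)),
  covariate = rnorm(n)
)
# The covariate has the same slope (0.5) in both groups by construction
toy$outcome <- 1 + 0.5 * toy$covariate + 0.8 * (toy$group == "treated") + rnorm(n)

# The covariate:group row tests whether the slope differs between groups
slopes_check <- anova(lm(outcome ~ covariate * group, data = toy))
slopes_check
```

Under this version of the check, a non-significant `covariate:group` row is what you hope to see, because it suggests one common slope describes all groups.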

---

### Sample Size

The last assumption for ANCOVAs is sample size.  There have to be at least 20 cases for every IV or CV.  Since you will have one IV and one CV, you will need at least 40 rows of data. In this case, you have 400 cases, so this assumption is more than adequately met!
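
The arithmetic behind this rule of thumb can be written out in a few lines. The variable names below are made up for illustration; the numbers come from this page.

```{r}
# Rule of thumb from this lesson: at least 20 cases per IV or CV
n_terms      <- 2    # one IV (University.Rating) plus one CV (Research)
cases_needed <- 20 * n_terms
cases_have   <- 400  # number of cases reported for this dataset

# TRUE means the sample size assumption is met
cases_have >= cases_needed
```

With your own data loaded, you could swap the hard-coded 400 for `nrow()` of your data frame.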

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - ANCOVAs in R<a class="anchor" id="DS105L7_page_4"></a>

[Back to Top](#DS105L7_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [4]:
from IPython.display import VimeoVideo
# Tutorial Video Name: ANCOVAs Part II
VimeoVideo('340997929', width=720, height=480)

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L07pg3-4tutorial.zip)**.

# ANCOVAs in R

Tada! You have run the assumption gauntlet and come out victorious! It is now time to run your ANCOVA at last!

---

## Running the Analysis

You will run an ANCOVA using the same linear model format that you have used for regressions, but playing with the model terms some.  Here is the initial code needed for your ANCOVA: 

```{r}
ANCOVA = lm(Chance.of.AdmitSQ~Research + University.Rating*Research, data=graduate_admissions)
anova(ANCOVA)
```

The ```lm()``` function specifies that this is a linear model.  Then you'll input your y variable, or DV, first. In this case, since you are predicting a student's chance of admission into graduate school, your DV is ```Chance.of.AdmitSQ```, the squared version of ```Chance.of.Admit``` that you created when testing normality.  What follows immediately after the tilde is a mash-up of things. There are a couple of important things to note: 

1. The covariate(s) ALWAYS has to come first!! This is because the variables are taken in order in this case, and you want the effects of the covariate to be partialed out first.

2. The plus sign adds in additional covariates or independent variables.  

3. The asterisk creates what is called an *interaction term*.  It looks at the effects of one thing BY another thing. Sometimes, you will find effects when looking at an interaction term that were invisible even if you had both variables together in the model, so it's typically a best practice to always include an interaction term if you are going to have more than one IV or CV in your model (basically, anything that is not a one-way ANOVA).

So this model is looking at the chance of admittance to graduate school by University Rating, holding whether students have done research constant, and examining how University Rating interacts with Research. It's possible, for instance, that research participation only matters when a school is rated very well or very poorly. 
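
If the mash-up after the tilde feels opaque, you can ask R how it reads the formula. This is a minimal sketch using base R's `terms()` function, which only parses the formula and so needs no data; `ancova_formula` and `model_terms` are hypothetical names chosen for illustration.

```{r}
# The ANCOVA formula from this page; terms() parses it without touching any data
ancova_formula <- Chance.of.AdmitSQ ~ Research + University.Rating * Research

# R expands A * B into A + B + A:B and drops the duplicated Research term
model_terms <- attr(terms(ancova_formula), "term.labels")
model_terms
```

The three labels it returns (the covariate, the IV, and their interaction) are the same three rows you will see in the ANCOVA table, in the same order.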

Once you have all your terms in place, you can call the ANCOVA with the ```anova()``` function, where you put the name of your model in as an argument.  It provides this output: 

```text
Analysis of Variance Table

Response: Chance.of.AdmitSQ
                            Df Sum Sq Mean Sq  F value    Pr(>F)    
Research                     1 5.2035  5.2035 349.7108 < 2.2e-16 ***
University.Rating            4 4.7389  1.1847  79.6203 < 2.2e-16 ***
Research:University.Rating   4 0.3694  0.0923   6.2063 7.527e-05 ***
Residuals                  390 5.8030  0.0149                       
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

This ANCOVA table should look somewhat familiar to you, as it follows the same format as other ANOVA variations.  Looking at it, you can see that there is a significant effect of the covariate, ```Research```, and a significant effect of the independent variable, ```University.Rating```, on a student's chance of admission to graduate school.  In addition, there is an interaction between ```Research``` and ```University.Rating```, indicating that those things combined also form a significant pattern that predicts admission to graduate school.

---

### When You've Failed Homogeneity of Variance

If you failed the assumption of homogeneity of variance, not to fear! There is a quick and easy additional line you can add to the base model.  Instead of running the ```anova()``` function, you can instead run the ```Anova()``` function.  Catch the difference there? The function that corrects for a violation of homogeneity of variance has a capital "A," whereas the "regular" summary function doesn't.  If it helps, think of the big "A" as standing for "assumption," so when you've violated the assumption, head for that!

Here is what using the big A ```Anova()``` function looks like: 

```{r}
Anova(ANCOVA, type="II", white.adjust=TRUE)
```

Just put in the name of the model you created above (```ANCOVA```), and then specify the type of ANOVA you want.  The ```type=``` argument (note the lowercase "t") refers to how the ANOVA is calculated, in terms of its sums of squares (noted as ```Sum Sq``` in your R output).  The ```Anova()``` function accepts ```type="II"``` or ```type="III"```; the sequential Type I sums of squares are what the base ```anova()``` function already gave you above.  Here are all the types and what they are used for. To be honest, if you have a pretty basic and well-balanced design (relatively equal sample sizes in each of the groups), then it typically doesn't matter which one you select.

Types of ANOVAs for ```Type=``` Argument: 

* **Type I:** This is automatically used when you use the ```aov()``` function you were taught in earlier lessons. The sums of squares are taken sequentially, which means that they are calculated in the order in which they are listed in the model. 

* **Type II:** This type of ANOVA tests each main effect after adjusting for the other main effects, but not for any interaction effects.  So it is best suited to models in which the interaction is absent or not significant.

* **Type III:** This type of ANOVA tests each effect after adjusting for every other term in the model, including the interactions. The results depend on how the contrasts for your factors are coded, so you will need to change the options for contrasts before relying on it. You can think of *contrasts* as built-in, planned post hocs.  Specifying them in advance and being specific as to what you are looking for means that you have less Type I error, but they can be a pain to use.  Hence they will not be covered in this course!
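
To see what "sequential" means in practice, the sketch below fits the same two predictors in both orders and compares the Type I sums of squares. The data are simulated stand-ins (not the admissions data) in which the two predictors overlap, and all object names are hypothetical.

```{r}
# Simulated stand-in data in which the covariate and the group are related
set.seed(123)
n <- 60
covariate <- rnorm(n)
group     <- factor(ifelse(covariate + rnorm(n) > 0, "high", "low"))
outcome   <- 2 * covariate + (group == "high") + rnorm(n)
toy <- data.frame(outcome, covariate, group)

# Type I (sequential): each term is adjusted only for the terms listed before it
cov_first   <- anova(lm(outcome ~ covariate + group, data = toy))
group_first <- anova(lm(outcome ~ group + covariate, data = toy))

# The sum of squares credited to the covariate changes with its position
c(entered_first  = cov_first["covariate", "Sum Sq"],
  entered_second = group_first["covariate", "Sum Sq"])
```

The residual sum of squares is identical in both fits; only the way the explained variance is divided between the predictors changes. This is the reason the covariate is listed first when sequential sums of squares are used.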

Remember from previous lessons that the ```white.adjust=TRUE``` argument is what corrects for the violation of homogeneity of variance, so it is crucial to include.

Here are the results: 

```text
Analysis of Deviance Table (Type II tests)

Response: Chance.of.AdmitSQ
                            Df       F    Pr(>F)    
Research                     1 55.1840 6.988e-13 ***
University.Rating            4 93.1993 < 2.2e-16 ***
Research:University.Rating   4  4.2862  0.002094 ** 
Residuals                  390                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

Everything remains significant, even after you've corrected for a violation of homogeneity of variance. 

---

## Post Hocs 

But where do the differences lie? To answer that question, you need to follow up with a post hoc. For an ANCOVA, you will still run a post hoc with Tukey's correction, but you will need to do so using functions from the ```multcomp``` package, because you now need to handle the covariate and interaction effects.  You will do this using the ```glht()``` function: 

```{r}
postHocs <- glht(ANCOVA,linfct=mcp(University.Rating = "Tukey"))
summary(postHocs)
```

The independent variable goes inside the second set of parentheses, before the equals sign.  ```linfct=mcp``` is standard code that you will use routinely.

Here is the result: 

```text

	 Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts


Fit: lm(formula = Chance.of.AdmitSQ ~ Research + University.Rating * 
    Research, data = graduate_admissions)

Linear Hypotheses:
           Estimate Std. Error t value Pr(>|t|)    
2 - 1 == 0  0.08523    0.03012   2.830  0.03533 *  
3 - 1 == 0  0.16909    0.03080   5.490  < 0.001 ***
4 - 1 == 0  0.20415    0.04124   4.951  < 0.001 ***
5 - 1 == 0  0.26926    0.05068   5.313  < 0.001 ***
3 - 2 == 0  0.08386    0.02094   4.005  < 0.001 ***
4 - 2 == 0  0.11892    0.03450   3.447  0.00517 ** 
5 - 2 == 0  0.18403    0.04537   4.056  < 0.001 ***
4 - 3 == 0  0.03506    0.03510   0.999  0.84483    
5 - 3 == 0  0.10017    0.04583   2.186  0.17306    
5 - 4 == 0  0.06511    0.05340   1.219  0.72370    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Adjusted p values reported -- single-step method)
```

Looking in the ```Pr(>|t|)``` column shows the *p* values associated with each of these pairwise *t*-tests.  It looks like there was a significant difference between a university rated "1" and all other university ratings, and between a university rated "2" and all other university ratings, but no differences among universities rated "3," "4," or "5." Which group was higher in each of the significant differences? To determine that, you will need to examine the means.

---

## Determine Means and Draw Conclusions 

Because you included a covariate in your model, it is important to look at adjusted means, rather than the raw means.  The means are adjusted by controlling for your covariate.  This may sound needlessly complicated to you, but never fear - the ```effects``` package makes it easy! 

```{r}
adjMeans <- effect("University.Rating", ANCOVA)
adjMeans
```

Just call the ```effect()``` function, then put your IV in quotes and specify the model that it came from.  When you run the name you gave the object, you will get this output:

```text

 University.Rating effect
University.Rating
        1         2         3         4         5 
0.3276019 0.4220928 0.5171503 0.6252554 0.7106287 
```

So with these means, and remembering that they are on the squared scale because the model used ```Chance.of.AdmitSQ```, you can see that a student who attended undergrad at a university rated a 1 has a significantly lower chance of being accepted (an adjusted mean of 0.33, or roughly 57% after taking the square root) when compared to all other university rating levels.  A student who attended undergrad at a university rated a 2 also has a significantly lower chance of being accepted (0.42, or roughly 65%) than anyone who attended a school that was rated a 3, 4, or 5.  However, among schools rated a 3, 4, or 5, a higher rating did not significantly improve the chance of being accepted to graduate school.
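
Because the model on this page was fit to ```Chance.of.AdmitSQ```, the adjusted means above are in squared units. Here is a small sketch of putting them back on the original scale; the values are copied from the output above, and the object names are hypothetical. Treat the results as approximate, since the square root of a mean of squares is not exactly the mean on the original scale.

```{r}
# Adjusted means copied from the output above (model fit to the squared DV)
adj_means_sq <- c(`1` = 0.3276019, `2` = 0.4220928, `3` = 0.5171503,
                  `4` = 0.6252554, `5` = 0.7106287)

# The square root undoes the square transformation
adj_means_original <- sqrt(adj_means_sq)
round(adj_means_original, 2)
```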

Drawing an overall conclusion from this data, you might say something like "as long as you attended a medium or better rated university for undergraduate, you have relatively good odds of being accepted to graduate school, but students who attended a poorly rated university are much less likely to be accepted, when controlling for the effects of having participated in research in undergrad."

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - ANCOVA Activity<a class="anchor" id="DS105L7_page_5"></a>

[Back to Top](#DS105L7_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


For your Activity, you will be computing an ANCOVA to predict a student's college GPA (```CGPA```), holding the TOEFL score (```TOEFL.Score```) constant and using their university's rating (```University.Rating```) as a predictor.  `This Activity will not be graded`, but you are encouraged to complete it. The best way to become a great data scientist is to practice! Once you have submitted your project, you will be able to access the solution on the next page. Note that the solution will be slightly different from yours, but should look similar.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Requirements

Using the **[graduate_admissions dataset](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/graduate_admissions.zip)**, you will answer the question: "Does University Rating significantly predict your college GPA when holding TOEFL score constant?" 

In order to do so, don't forget to: 

* Test your assumptions and correct for them if necessary
* Run the appropriate model type
* Compute post hocs if faced with omnibus significance
* Examine the means and draw conclusions to answer the question

Explain each step of the test and write your conclusions in your R script and attach it.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - ANCOVA Activity Solution<a class="anchor" id="DS105L7_page_6"></a>

[Back to Top](#DS105L7_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Solution

Below you will find the solution to the ANCOVA activity.

---

## Answer

After correcting for Type I error in the post hocs, university rating does not significantly predict college GPA when TOEFL score is held constant.

---

## Code

```{r}
#This is to predict college GPA holding TOEFL score constant and using university rating as a predictor.

## Testing Assumptions

### Normality - Need to examine both GPA and TOEFL score

plotNormalHistogram(graduate_admissions$CGPA)

# That looks approximately normal, but could use a square transformation.  Try it! 

graduate_admissions$CGPAsq <- graduate_admissions$CGPA * graduate_admissions$CGPA
plotNormalHistogram(graduate_admissions$CGPAsq)

# Looks great! 

plotNormalHistogram(graduate_admissions$TOEFL.Score)

# That looks pretty good as well, but try a square transformation just in case.

graduate_admissions$TOEFL.ScoreSQ <- graduate_admissions$TOEFL.Score * graduate_admissions$TOEFL.Score
plotNormalHistogram(graduate_admissions$TOEFL.ScoreSQ)

# That looks better as well.  Use Squared for both of them.

### Homogeneity of Variance

leveneTest(CGPAsq~University.Rating, data=graduate_admissions)

# Results were not significant, so the assumption is met!

### Homogeneity of Regression Slopes

Homogeneity_RegrSlp = lm(CGPA~TOEFL.Score, data=graduate_admissions)
anova(Homogeneity_RegrSlp)

# This isn't met, but I'll proceed anyway for learning purposes. In the real world, I would use this as an IV!

### Sample size is met - need 20 per IV or CV and I have 2, so need at least 40 and there are 400 cases!

## Running the Analysis

ANCOVA = lm(CGPA~TOEFL.Score + University.Rating*TOEFL.Score, data=graduate_admissions)
anova(ANCOVA)

# Significant interaction between TOEFL score and University Rating, and there are also significant effects of university rating and of TOEFL score on their own.

## Post Hocs

postHocs <- glht(ANCOVA,linfct=mcp(University.Rating = "Tukey"))
summary(postHocs)

# After examining the post hocs, it looks like the overall significance in the F test above was just an artifact of Type I error, since no group significantly differs from any other group.

## Examine Adjusted Means

adjMeans <- effect("University.Rating", ANCOVA)
adjMeans

# Looking at the means confirms my conclusion above - all of these have a college GPA that is about the same.
```

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - Key Terms<a class="anchor" id="DS105L7_page_7"></a>

[Back to Top](#DS105L7_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Key Terms

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Analysis of Covariance (ANCOVA)</td>
        <td>An ANOVA that controls for another variable using a covariate.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Covariate</td>
        <td>The "C" in ANCOVA.  A variable you are trying to control for in your model.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Homogeneity of Regression Slopes</td>
        <td>An assumption for ANCOVA that requires the CV and DV to be unrelated. If it is violated, the CV should be added as an additional IV rather than as a covariate.</td>
    </tr>
</table>

---

## Key R Libraries

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>effects</td>
        <td>Used to get adjusted means for ANCOVAs.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>multcomp</td>
        <td>Used to get post-hocs for ANCOVAs.</td>
    </tr>
</table>


---

## Key R Code

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>glht()</td>
        <td>Creates a Tukey's post hoc that takes into account the effect of a covariate.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>effect()</td>
        <td>A function that generates adjusted means that take into account the effect of the covariate.</td>
    </tr>
</table>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - Lesson 7 Hands-On<a class="anchor" id="DS105L7_page_8"></a>

[Back to Top](#DS105L7_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


For this Hands On, you will be performing an ANCOVA in R to determine whether having an international cell phone plan increases the number of night minutes you use. 

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Requirements

This hands on uses a dataset on cell phone plans. It is located **[here](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/cellPhone.zip)**. 

Please answer the following question with this data: 

> Many folks with international relatives often find themselves calling at odd hours to fit typical schedules in other time zones.  How does the presence or absence of an international phone plan (```International.Plan```) influence the use of nighttime minutes (```Night.Mins```), holding whether or not the client has a voicemail plan (```vMail.Plan```) constant? 

In order to answer this question, you will need to do the following:

* Test for ANCOVA assumptions
* Run an ANCOVA
* Interpret the ANCOVA results and draw conclusions
* Conduct post hocs if necessary

Please submit your RStudio file, with a one-sentence conclusion to answer the above question.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 9 - Lesson 7 Hands-On Solution<a class="anchor" id="DS105L7_page_9"></a>

[Back to Top](#DS105L7_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Lesson 7 Hands-On Solution

Below you will find the R code for the Lesson 7 Hands-On solution.

```{r}
#This is to see if the Night.minutes differ by whether they have an international phone plan, holding voice mail plans constant.  

## Testing Assumptions

### Normality - Need to examine the continuous DV, Night.Mins

library("rcompanion")
library("car")
library("effects")
library("multcomp")

plotNormalHistogram(cellPhone$Night.Mins)

#### Wow, that already looks normally distributed! No transformation necessary. 

### Homogeneity of Variance

leveneTest(Night.Mins~International.Plan, data=cellPhone)

# Results were not significant, so the assumption is met!

### Homogeneity of Regression Slopes

Homogeneity_RegrSlp = lm(Night.Mins~vMail.Plan, data=cellPhone)
anova(Homogeneity_RegrSlp)

# This assumption was met as well! 

### Sample size is met - need 20 per IV or CV and I have 2, so need at least 40 and there are over 4,000 cases!

## Running the Analysis

ANCOVA = lm(Night.Mins~vMail.Plan + International.Plan*vMail.Plan, data=cellPhone)
anova(ANCOVA)

# Whether a client has an international plan or not does not influence the number of night minutes he or she uses, even holding whether they have a voice mail plan constant.
```
