# DS105 Intermediate Statistics: Lesson Five Companion Notebook

### Table of Contents <a class="anchor" id="DS105L5_toc"></a>

* [Table of Contents](#DS105L5_toc)
    * [Page 1 - Introduction to Repeated Measures ANOVAs](#DS105L5_page_1)
    * [Page 2 - Repeated Measures ANOVAs Setup in R](#DS105L5_page_2)
    * [Page 3 - Repeated Measures ANOVAs Analysis in R](#DS105L5_page_3)
    * [Page 4 - Repeated Measures ANOVA in R Activity](#DS105L5_page_4)
    * [Page 5 - Repeated Measures ANOVA in R Activity Solution](#DS105L5_page_5)
    * [Page 6 - Key Terms](#DS105L5_page_6)
    * [Page 7 - Lesson 05 Hands-On](#DS105L5_page_7)
    * [Page 8 - Lesson 05 Hands-On Solution](#DS105L5_page_8)    

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Introduction to Repeated Measures ANOVAs<a class="anchor" id="DS105L5_page_1"></a>

[Back to Top](#DS105L5_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction to Repeated Measures ANOVAs
VimeoVideo('390081187', width=720, height=480)

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L05overview.zip)**.

# Introduction

*Repeated Measures ANOVAs*, also known as within subjects ANOVAs, are used when you measure the same person or thing repeatedly over time.  Although they are used extensively in research studies and experiments, they also often come up in real-world data science applications when looking at changes over time. For instance, did the unemployment rate increase from 2008 to 2009? 

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>If you are really into math and want to learn the theory and hand-calculations behind repeated measures ANOVAs, check out <a href="https://statistics.laerd.com/statistical-guides/repeated-measures-anova-statistical-guide.php">this resource</a>.</p>
    </div>
</div>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - Repeated Measures ANOVAs Setup in R<a class="anchor" id="DS105L5_page_2"></a>

[Back to Top](#DS105L5_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Repeated Measures ANOVAs Setup in R

Now that you have a basic understanding of what repeated measures ANOVAs are, you will jump right in!

---

## Load Libraries

Repeated measures ANOVAs come as part of the base package in R, so the only libraries you will need to load in are ```rcompanion``` because you'll use it to check for the assumption of normality, ```car``` if you need to run an ANOVA that will correct for a violation of homogeneity of variance, and ```fastR2```, which is used for some data wrangling to get your data in the right shape for repeated measures ANOVAs.

```{r}
library("rcompanion")
library("fastR2")
library("car")
```

---

## Load in Data

You will be examining **[data](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/breakfast.zip)** from a study about the effect of eating breakfast on weight loss and associated metrics, such as resting metabolic rate and waist circumference.  Most metrics were measured at baseline, and then again at follow-up, which was six weeks later.
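
The import step itself is not shown on this page, so here is a minimal sketch of it. It assumes the zip archive has been extracted into your working directory and that the file inside is named `breakfast.csv`. Both the file name and the `csv_path` variable are assumptions, so adjust them to match what you actually downloaded.

```{r}
# Assumed file name; change it if the extracted file is called something else
csv_path <- "breakfast.csv"

if (file.exists(csv_path)) {
  # Read the CSV into the data frame name used throughout this lesson
  breakfast <- read.csv(csv_path)
  # Quick look at the size and column types before doing anything else
  print(dim(breakfast))
  str(breakfast)
} else {
  message("Could not find ", csv_path, " in ", getwd())
}
```

Checking `dim()` and `str()` right after import is a cheap way to spot surprises such as extra rows or columns read in as the wrong type.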

---

## Question Setup

With this data, you will answer the question: 

```text
Overall, regardless of whether participants ate breakfast or not, did people in this study show improvement in their resting metabolic rate?   
```

In order to answer this question, your x, or independent variable, will be the time factor - baseline or follow-up.  Your y, or dependent variable, will be resting metabolic rate, measured at each time point. As with all ANOVAs, the IV will be categorical, and the DV will be continuous.

---

## Data Wrangling

Depending on the data that you're working with, it may need some wrangling! 

---

### Removing Extra Rows

In this case, take an initial look at the dataset.  See anything funny here? 

![A table has five tabs on top and the tabs read, baseline.resting.metabolic.rate, follow.up.resting.metaboli.rate, diet.induced.therogenesis, week..1.physical.activity, and week.6.physical activity. The first five row entries are as follows. Row 1, 34, fasting, 22, female, 1.66, 67.1, 65.9, and 40.1. Row 2, 35, breakfast, 28, male, 1.85, 72.3, 70.1, and 60.4. Row 3, 50, fasting, 34, male, 1.79, 82.0, 82.0, and 65.0. Row 4, 10, breakfast, 23, male, 1.93, 83.5, 82.6, and 64.9. Row 5, 23, fasting, 54, female, 1.75, 73.5, 72.8, and 44.3. The rest of the fields are labeled NA.](Media/ANCOVA1.png)

Even though your data appears to stop after 35 rows in the CSV file, a lot of NA data shows up in R if you scroll down a little ways.  It is always important to scope out your data after you import it, just in case something is not as you expect! With the red flag of all those missing values, if you examine either the CSV file or the R data frame closely enough, you will find that the original author of this spreadsheet included some extra statistical tables in the middle.  What a pain! This can be addressed in one of two ways: you can run the ```NaRV.omit()``` function (from the ```IDPmisc``` package) to clean up any missing data, which should remove everything you're not interested in, or you can just subset your data.  In this case, you'll subset the data to only include the rows you want: 

```{r}
breakfast1 <- breakfast[1:33,]
```
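
To make the two options concrete, here is a small sketch on made-up data (the `toy` data frame and its values are invented for illustration, not taken from the study). It compares subsetting by row position with dropping rows in which every column is missing, using only base R.

```{r}
# Toy data (made up): three real rows followed by two all-NA rows
toy <- data.frame(
  Participant.Code = c(34, 35, 50, NA, NA),
  Treatment.Group  = c("Fasting", "Breakfast", "Fasting", NA, NA),
  Age..y.          = c(22, 28, 34, NA, NA)
)

# Option 1: subset by position, as the lesson does
toy_by_position <- toy[1:3, ]

# Option 2: drop rows in which every single column is NA
all_missing    <- rowSums(is.na(toy)) == ncol(toy)
toy_by_content <- toy[!all_missing, ]

# Both approaches keep the same number of rows here
nrow(toy_by_position)
nrow(toy_by_content)
```

Dropping only the rows that are entirely missing is gentler than removing every row that contains any NA, since the latter would also discard a participant who is missing just one measurement.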

---

### Reshaping the Data

The other thing that needs to be done with your data is to reshape it from wide format to long format, so that you can run the ANOVA. The code to reshape only takes the variables that you want to flip and anything you want to hold constant, so you will need to subset your data again to include only the columns you are interested in: 

```{r}
keeps <- c("Participant.Code", "Treatment.Group", "Age..y.", "Sex", "Height..m.", "Baseline.Resting.Metabolic.Rate..kcal.d.", "Follow.Up.Resting.Metabolic.Rate..kcal.d.")
breakfast2 <- breakfast1[keeps]
```

First you create a vector of the names of the columns you want to keep.  ```Participant.Code```, ```Treatment.Group```, ```Age..y.```, ```Sex```, and ```Height..m.``` are all constants, meaning that they do not change as you measure over time, so they have no time one and time two measurement. You also want to keep the one change-over-time variable you are interested in, at both time points, which is the resting metabolic rate. After you've created the vector to keep, you can then apply it to your truncated dataset. When you're done, your dataset should look like this: 

![A table has seven columns labeled participant.code, treatment.group, age, sex, height, baseline.resting.metabolic.rate, and follow.up.resting.metaboli.rate. The first five row entries are as follows. Row 1, 9, breakfast, 40, female, 1.76, 1279, and 1279. Row 2, 15, breakfast, 25, female, 1.70, 1323, and 1413. Row 3, 18, breakfast, 56, female, 1.65, 1281, and1275. Row 4, 21, breakfast, 24, male, 1.79, 1605, and 1642. Row 5, 25, breakfast, 45, female, 1.69, 1196, and 1152.](Media/ANCOVA2.png)

Now comes the actual reshaping! You will need to do this for both the baseline and the follow up data. Basically, you are going to keep the first five columns that don't change by timepoint, and then add to that new columns of `repdat` and `contrasts`.  The `repdat` column will hold the actual data from the baseline section, and the `contrasts` column will hold the information that says it was from the baseline timepoint.

```{r}
breakfast3 <- breakfast2[,1:5]
breakfast3$repdat <- breakfast2$Baseline.Resting.Metabolic.Rate..kcal.d.
breakfast3$contrasts <- "T1"
```

This is now what your data should look like: 

![A table has seven columns labeled participant.code, treatment.group, age, sex, height, repdat, and contrasts. The first five entries are as follows. Row 1, 2, fasting, 27, female, 1.75, 1418, and T1. Row 2, 4, fasting, 25, female, 1.72, 1332, and T1. Row 3, 11, fasting, 44, male, 1.64, 1521, and T1. Row 4, 14, fasting, 36, female, 1.68, 1399, and T1. Row 5, 16, fasting, 28, female, 1.64, 1457, and T1.](Media/ANCOVA7.png)

You will do the same thing with your follow-up data: 

```{r}
breakfast4 <- breakfast2[,1:5]
breakfast4$repdat <- breakfast2$Follow.Up.Resting.Metabolic.Rate..kcal.d.
breakfast4$contrasts <- "T2"
```

Once you have both of those, then you need to `rbind()` them back together into one whole dataset: 

```{r}
breakfast5 <- rbind(breakfast3, breakfast4)
```
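
As an alternative to building the two halves by hand, the same wide-to-long reshaping can be done in one call with base R's `reshape()`. The sketch below uses a made-up three-row data frame; the column names `id`, `group`, `baseline`, and `followup` are hypothetical stand-ins for the real ones, and this is not the approach the lesson itself uses.

```{r}
# Toy wide data (made up): one row per participant, one column per time point
wide <- data.frame(
  id       = c(1, 2, 3),
  group    = c("Breakfast", "Fasting", "Breakfast"),
  baseline = c(1279, 1418, 1323),
  followup = c(1279, 1402, 1413)
)

# Stack the two measurement columns into one value column (repdat)
# plus one label column (contrasts) that records the time point
long <- reshape(
  wide,
  direction = "long",
  varying   = c("baseline", "followup"),
  v.names   = "repdat",
  timevar   = "contrasts",
  times     = c("T1", "T2"),
  idvar     = "id"
)

long
```

Either way, the result has one row per participant per time point, which is the shape `aov()` needs for a repeated measures model.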

With the data reshaped, you are now all prepared to run a repeated measures ANOVA.  

---

## Testing Assumptions 

The assumptions for a repeated measures ANOVA are the same as the ones you learned for a one-way between subjects ANOVA, with the addition of the assumption of sphericity. Sphericity means that the variances of the differences between every pair of time points are roughly equal. It tends to be violated because things that occur closer together in time or space may be more related than things that occur farther apart.  You'll need to test for that unequal relationship between time points and correct for it if sphericity is violated. 

Remember that if the assumptions are not met for ANOVA, but you proceeded anyway, you run the risk of biasing your results. 

---

### Normality

You only need to test for the normality of the dependent variable, but you need to do it at both timepoints. So scoot on back to ```breakfast2```, which is the dataset you truncated but had not yet reshaped.

```{r}
plotNormalHistogram(breakfast2$Baseline.Resting.Metabolic.Rate..kcal.d.)
```

Huzzah! Looks pretty normal and you'll take it, no transformation necessary!  

![A bar graph depicts the plot of x against the frequency. The x-axis represents x and the y-axis represents frequency. The x-axis ranges from 1200 to 1800. The y-axis ranges from 0 to 7. Eight bars and a curve are plotted.](Media/ANCOVA4.png)

Try the follow-up data: 

```{r}
plotNormalHistogram(breakfast2$Follow.Up.Resting.Metabolic.Rate..kcal.d.)
```

![A bar graph depicts the plot of x against the frequency. The x-axis represents x and the y-axis represents frequency. The x-axis ranges from 0.3 to 1.0. The y-axis ranges from 0 to 60. Fourteen bars and a curve are plotted.](Media/ANCOVA5.png)

And the result? Another one normal enough to use without transformation! Yes, it may be a little platykurtic, but it's roughly centered, so count your blessings and go with it!
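
If you want to see roughly what this kind of plot is doing, the sketch below draws a histogram with a normal curve on top using only base R. The data are simulated (33 made-up values, not the study data), and the variable name `rmr` is hypothetical.

```{r}
# Simulated values standing in for a resting metabolic rate column
set.seed(105)
rmr <- rnorm(33, mean = 1450, sd = 150)

# Histogram on the density scale so a density curve can be overlaid
hist(rmr, freq = FALSE, col = "gray", main = "",
     xlab = "Simulated resting metabolic rate (kcal/d)")

# Normal curve using the sample mean and standard deviation
curve(dnorm(x, mean = mean(rmr), sd = sd(rmr)),
      add = TRUE, col = "blue", lwd = 2)
```

When the bars follow the curve reasonably well, with no long tail on one side, the variable is usually treated as normal enough for an ANOVA.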

---

### Homogeneity of Variance

The tests you learned for homogeneity of variance for one-way ANOVAs will not work for repeated measures if you need to include any other information.  In this case, you are checking the variance of resting metabolic rate not just across time points, but across each combination of time point and condition (eating breakfast or skipping breakfast) the participant was placed in.  So, a *Levene's Test* can be used instead to check for homogeneity of variance, using the ```leveneTest()``` function from the ```car``` library.  Here you will specify the variable information you are testing.  Your y variable goes first, followed by a tilde, then your x variable, an asterisk, and your time variable. The time variable is ```contrasts```, and remember that it represents time 1 (baseline) or time 2 (follow up). Altogether, you can read this code as a sentence like this: "Resting metabolic rate by treatment group over time." 

```{r}
leveneTest(repdat ~ Treatment.Group*contrasts, data=breakfast5)
```

Here are the results of the Levene's Test: 

```text
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  3  1.0251  0.388
      60               
```

Just like the other tests for homogeneity of variance, you want Levene's test to be non-significant in order to pass this assumption.  And lo and behold, it is! No need for correction!
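
To see what Levene's test is doing under the hood, here is a base R sketch on simulated data (not the study data). The median-centered version, which matches the `center = median` note in the output above, amounts to a one-way ANOVA on each observation's absolute deviation from its cell median. This is meant as an illustration of the idea, not a replacement for ```leveneTest()```.

```{r}
# Simulated long-format data: 2 groups x 2 time points, 10 values per cell
set.seed(1)
toy <- data.frame(
  repdat          = c(rnorm(20, mean = 1500, sd = 100),
                      rnorm(20, mean = 1520, sd = 100)),
  Treatment.Group = rep(c("Breakfast", "Fasting"), each = 10, times = 2),
  contrasts       = rep(c("T1", "T2"), each = 20)
)

# One label per group-by-time cell (4 cells in total)
cell <- interaction(toy$Treatment.Group, toy$contrasts)

# Absolute deviation of each value from its own cell's median
abs_dev <- abs(toy$repdat - ave(toy$repdat, cell, FUN = median))

# One-way ANOVA on those deviations: a large F would mean unequal spread
levene_table <- anova(lm(abs_dev ~ cell))
levene_table
```

With four cells, the first degrees of freedom value is 3, which is where the `group  3` line in the Levene's output comes from.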

---

#### Correcting for Violations of Homogeneity of Variance

If you had violated the assumption of homogeneity of variance, you could correct for it by running a Box-Cox transformation on your data, or by running a more robust ANOVA that can handle a violation of this assumption.  

---

### Sample Size

A repeated measures ANOVA requires a sample size of at least 20 per independent variable. You have that, so this assumption has been met. 

---

### Sphericity

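Testing for sphericity in R requires taking a multivariate approach and making it work for an ANOVA.  At this time, that is a bit too complex, but it may be covered later. Note that with only two time points, as in this example, there is only one set of differences, so sphericity cannot be violated.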
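
Although a formal test is out of scope here, the idea behind sphericity can be eyeballed by hand: it concerns whether the variances of the differences between each pair of time points are similar. The sketch below uses simulated data with three time points (the variables `t1`, `t2`, and `t3` are hypothetical, since the breakfast example only has two) and is an informal look, not a formal test.

```{r}
# Simulated data: 20 people measured at three time points
set.seed(7)
n      <- 20
person <- rnorm(n, mean = 1500, sd = 100)  # each person's stable level
t1 <- person + rnorm(n, mean = 0, sd = 30)
t2 <- person + rnorm(n, mean = 0, sd = 30)
t3 <- person + rnorm(n, mean = 0, sd = 30)

# Variance of the difference scores for every pair of time points
diff_vars <- c(
  "T1-T2" = var(t1 - t2),
  "T1-T3" = var(t1 - t3),
  "T2-T3" = var(t2 - t3)
)
round(diff_vars, 1)
```

If one of these variances were several times larger than the others, that would be a hint that sphericity is in doubt and a correction may be needed.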

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - Repeated Measures ANOVAs Analysis in R<a class="anchor" id="DS105L5_page_3"></a>

[Back to Top](#DS105L5_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction to Repeated Measures ANOVAs
VimeoVideo('340999630', width=720, height=480)

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L05pg3tutorial.zip)**.

# Repeated Measures ANOVAs Analysis in R

Alright! You've done all the prep work, now it's time for the fun! 

---

## Analysis


You will continue to use the ```aov()``` function, but add some additional arguments to it to make it repeated measures. 

```{r}
RManova <- aov(repdat~contrasts+Error(Participant.Code), breakfast5)
summary(RManova)
```

So what's happening here is that you are calling the ```aov()``` function on your repeated data of metabolic rate by your timepoint, and then adding in an error term, which is what makes this a repeated measures analysis. The error term, ```Error(Participant.Code)```, tells R that it should be looking within each participant.  Finish it all off by specifying the dataset at the end and you are good to go. Call ```summary()``` on the results:  

```text
Error: Participant.Code
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals  1  105.3   105.3               

Error: Within
          Df Sum Sq Mean Sq F value Pr(>F)
contrasts  1      2    1.64   0.027  0.871
Residuals 63   3854   61.17 
```

Under the ```Error: Within``` table (since this is a within subjects ANOVA, after all), you will find your *F* value and the associated *p* value.  With *p* = 0.871, it looks like there is not a significant effect of time on resting metabolic rate.   
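
One detail worth checking in the output above: the `Error: Participant.Code` table has only 1 degree of freedom, which suggests the ID column was treated as a number rather than as a category. The sketch below uses simulated data (not the study data) with the participant ID wrapped in `factor()`, and then confirms that, with two time points, the within subjects *F* equals the square of the paired *t* statistic. The variable names `toy`, `baseline`, and `followup` are made up for this illustration.

```{r}
# Simulated paired data: 10 people measured at baseline and follow-up
set.seed(42)
n        <- 10
baseline <- rnorm(n, mean = 1500, sd = 120)
followup <- baseline + rnorm(n, mean = 10, sd = 40)

# Long format, with the participant ID stored as a factor
toy <- data.frame(
  Participant.Code = factor(rep(1:n, times = 2)),
  contrasts        = factor(rep(c("T1", "T2"), each = n)),
  repdat           = c(baseline, followup)
)

# Repeated measures ANOVA with participant as the error stratum
toy_rm     <- aov(repdat ~ contrasts + Error(Participant.Code), data = toy)
rm_summary <- summary(toy_rm)
rm_summary

# Pull out the within subjects F and compare it with the paired t test
f_value <- rm_summary[["Error: Within"]][[1]][["F value"]][1]
t_value <- unname(t.test(followup, baseline, paired = TRUE)$statistic)
all.equal(f_value, t_value^2)
```

If your own ID column is numeric, converting it with `factor()` before fitting the model is a reasonable thing to try, and then compare the degrees of freedom in the two outputs.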

---

## Post Hocs

The overall test wasn't significant, so no need to worry about post hocs.  

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - Repeated Measures ANOVA in R Activity<a class="anchor" id="DS105L5_page_4"></a>

[Back to Top](#DS105L5_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


For this Activity, you will perform a repeated measures ANOVA in R. `This Hands-On will not be graded`, but you are encouraged to complete it. The best way to become a great data scientist is to practice! Once you have submitted your project, you will be able to access the solution on the next page. Note that the solution will be slightly different from yours, but should look similar.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Requirements

Using the **[breakfast data from last page](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/breakfast.zip)**, determine whether weight changes from baseline to follow up. In order to do this, you will need to: 

* Wrangle the data
* Test for assumptions
* Run the analysis for repeated measures ANOVA

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Repeated Measures ANOVA in R Activity Solution<a class="anchor" id="DS105L5_page_5"></a>

[Back to Top](#DS105L5_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Solution

---

## Answer 

Body mass did not change significantly from baseline to follow up! Note that this model only tests the effect of time; treatment group is not part of it.  

---

## Code

```{r}
# Subsetting the data to get rid of unnecessary rows and columns

breakfast1 <- breakfast[1:33,]

# Keep the constant columns plus body mass at both time points
keeps <- c("Participant.Code", "Treatment.Group", "Age..y.", "Sex", "Height..m.", "Baseline.Body.Mass..kg.", "Follow.Up.Body.Mass..kg.")
breakfast2 <- breakfast1[keeps]

# Getting the data in the right shape for the baseline measure.
breakfast3 <- breakfast2[,1:5]
breakfast3$repdat <- breakfast2$Baseline.Body.Mass..kg.
breakfast3$contrasts <- "T1"

# Getting the data in the right shape for the follow-up measure.
breakfast4 <- breakfast2[,1:5]
breakfast4$repdat <- breakfast2$Follow.Up.Body.Mass..kg.
breakfast4$contrasts <- "T2"

# Then smoosh 'em back together with binding
breakfast5 <- rbind(breakfast3, breakfast4)

# Testing for Normality

plotNormalHistogram(breakfast1$Baseline.Body.Mass..kg.)
plotNormalHistogram(breakfast1$Follow.Up.Body.Mass..kg.)

# They look approximately normal, so don't need transformation

# Testing for Homogeneity of Variance

leveneTest(repdat ~ Treatment.Group*contrasts, data=breakfast5)

# It was not significant, which means this assumption has been met

RManova2 <- aov(repdat~contrasts+Error(Participant.Code), breakfast5)
summary(RManova2)

# Nothing was significant here either!
```

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - Key Terms<a class="anchor" id="DS105L5_page_6"></a>

[Back to Top](#DS105L5_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Key Terms

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Repeated Measures / Within Subjects ANOVAs</td>
        <td>When you measure the same person or thing repeatedly over time.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Levene's Test</td>
        <td>Used to test for homogeneity of variance for more complex ANOVAs. To meet the assumption, Levene's Test should NOT be significant.</td>
    </tr>
</table>

---

## Key R Libraries

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>fastR2</td>
        <td>Used for re-shaping data for repeated measures ANOVAs.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>car</td>
        <td>Used for advanced linear models, car contains things such as Levene's Test and an alternative way to calculate ANOVAs when homogeneity of variance is not met.</td>
    </tr>
</table>


---

## Key R Code

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>make.rm() </td>
        <td>Function that reshapes your data for repeated measures analyses.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>constant=</td>
        <td>An argument in make.rm() where you will list all the variables measured at only one time point.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>repeated=</td>
        <td>An argument in make.rm() where you will list all the variables measured at multiple time points.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>leveneTest()</td>
        <td>Performs a Levene's Test for homogeneity of variance.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>lm()</td>
        <td>Creates a linear model, which uses the same mathematics as analyses in the ANOVA family.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>anova()</td>
        <td>Creates an ANOVA table out of a linear model.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Anova()</td>
        <td>Creates an ANOVA table out of a linear model that can be adjusted for not meeting the assumption of homogeneity of variance.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>type=</td>
        <td>An argument in the Anova() function that allows you to specify how the sums of squares for the ANOVA are calculated. Options are II or III. </td>
    </tr>
</table>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - Lesson 05 Hands-On<a class="anchor" id="DS105L5_page_7"></a>

[Back to Top](#DS105L5_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


For your Practice Hands On, you will be analyzing data about honey production in R. `This Hands-On will not be graded`, but you are encouraged to complete it. The best way to become a great data scientist is to practice! Once you have submitted your project, you will be able to access the solution on the next page. Note that the solution will be slightly different from yours, but should look similar.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Requirements

This hands on uses a dataset about honey production over the years. It is located **[here](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/honey.zip)**. 

You will determine whether honey production ```totalprod``` has changed over the years (```year```) using a repeated measures ANOVA. Provide a one-sentence conclusion at the bottom of your program file about the analysis you performed. 

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>



<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - Lesson 05 Hands-On Solution<a class="anchor" id="DS105L5_page_8"></a>

[Back to Top](#DS105L5_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Lesson 5 Hands-On Solution

Below you will find the solution to the Lesson 5 Practice Hands-On:

```{r}
library("rcompanion")
library("fastR2")
library("car")
library("dplyr")

#Load in Data
honey.df <- read.csv("honey.csv")

#Data Wrangling
honey.df$year <- as.character(honey.df$year)
honey.df$year <- as.factor(honey.df$year)

#Positively skewed
plotNormalHistogram(honey.df$totalprod)

#Log transformation looks great
plotNormalHistogram(log(honey.df$totalprod))

honey.df$totalprodLOG <- log(honey.df$totalprod)

#Check for Assumptions

#Passed assumption of homogeneity of variance for normally distributed variable
leveneTest(totalprodLOG ~ year, data=honey.df)

#Run the Analysis
RManova <- aov(totalprodLOG~year+Error(state), honey.df)
summary(RManova)

#For comparison: the same model without the within-subjects error term
RManova <- aov(log(totalprod)~year, honey.df)
summary(RManova)
```