# DS105 Intermediate Statistics : Lesson Eight Companion Notebook

### Table of Contents <a class="anchor" id="DS105L8_toc"></a>

* [Table of Contents](#DS105L8_toc)
    * [Page 1 - Introduction to MANOVAs](#DS105L8_page_1)
    * [Page 2 - MANOVAs](#DS105L8_page_2)
    * [Page 3 - MANOVA Preparation in R](#DS105L8_page_3)
    * [Page 4 - MANOVAs in R](#DS105L8_page_4)
    * [Page 5 - Key Terms](#DS105L8_page_5)
    * [Page 6 - Lesson 8 Hands-On](#DS105L8_page_6)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Introduction to MANOVAs<a class="anchor" id="DS105L8_page_1"></a>

[Back to Top](#DS105L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [5]:
from IPython.display import VimeoVideo
# Tutorial Video Name: MANOVAs
VimeoVideo('388629776', width=720, height=480)

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L08overview.zip)**.

# Introduction

The MANOVA, or multivariate analysis of variance, is like an ANOVA, but on steroids! It allows for the use of multiple dependent variables that are related, as well as multiple independent variables and/or covariates. MANOVAs can be a very powerful tool, and they are especially good screening tests so that you aren't accidentally increasing your changes of Type I error.

In the following lesson, you will learn about MANOVAs and what they are used for.  You will also learn their assumptions and how to test for them in R, before actually conducting a MANOVA and associated post hocs.

This lesson will culminate with a hands on in which you will use gender to predict heart attack risk factors.


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - MANOVAs<a class="anchor" id="DS105L8_page_2"></a>

[Back to Top](#DS105L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [6]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction to MANOVAs
VimeoVideo('341609921', width=720, height=480)

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L08pg2tutorial.zip)**.

# Introduction to MANOVAs

*MANOVAs* or *multivariate analyses of variance* are in the ANOVA family as well, but now allow multiple dependent variables that are theoretically related.  *Multivariate* - meaning multiple variables! If you are thinking about testing lots of data all at once using the same predictors, or independent variables, it is a good idea to try a MANOVA, because you will decrease the chances of Type I error. 

Other than decreasing error rate, another advantage of using a MANOVA instead of multiple ANOVAs is that it has increased power to detect effects.  You may find that there is no effect of an independent variable on one dependent variable alone, but that when you look at a trend with multiple dependent variables at once, something significant pops up. 

Just like with ANOVAs, you can have all kinds of variations - one way, two-way, repeated measures, covariates, and contrasts are all fair game with MANOVAs too! 

---

# Assumptions for MANOVA

The assumptions for MANOVA are relatively similar to those for ANOVAs, but they are ramped up a notch to handle the addition of multiple dependent variables and the increased power.  

---

## Sample Size

There must be more cases than dependent variables in every cell.  In addition, there must be at least 20 cases per independent variable, as per ANOVAs.

---

## Multivariate Normality

What? You know all about normality, but what is this suspicious multivariate normality?! All it means is that your dependent variables need to be normally distributed when they are lumped all together in one uber-variable that you'll use for your MANOVA.

---

## Homogeneity of Variance 

Like ANOVAs, you need to make sure that the variables you are using have relatively equal variance. 

---

## Absence of Multicollinearity

*Multicollinearity* is when there is a significant relationship between the dependent variables in your model.  It is to be avoided, since having a lot of overlap between your DVs can again increase your chances of Type I error; finding a significant relationship between your IV and your DV when one really isn't there. Testing for multicollinearity just requires a correlation matrix, although there are specific statistics designed to test for it as well.

---

## Independence

The assumption of independence is the same for ANOVAs as it is for MANOVAs. In a nutshell, the different levels of your independent variable should NOT be related to each other! This isn't something typically tested for, but rather assessed by using your noggin as you think about the data you're about to analyze.  If there's a chance that a participant or an object will fit into more than one level of your independent variable, than chances are you have violated the assumption of independence and should not choose to run a MANOVA!

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - MANOVA Preparation in R<a class="anchor" id="DS105L8_page_3"></a>

[Back to Top](#DS105L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [7]:
from IPython.display import VimeoVideo
# Tutorial Video Name: MANOVAs Part I
VimeoVideo('341610015', width=720, height=480)

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L08pg3-4tutorial.zip)**.

# MANOVA Preparation in R

Now that you have a good idea of what a MANOVA is, you will learn how to prepare data and test assumptions for the MANOVA in R. 

---

## Load Libraries

The MANOVA function comes in the base package of R, so the libraries that you will need to load are all related to assumption testing. You will need the following: ```mvnormtest``` to test for multivariate normality, and ```car``` to test for homogeneity of variance. 

```{r}
library("mvnormtest")
library("car")
```

---

## Load in Data

You will be using **[data about Kickstarter Projects](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/kickstarter.zip)** to learn MANOVAs. This data has information about the project, it's category, the deadline and goal for fundraising, when the project was launched, the amount of money pledged, the country, and the current state of the project.

---

## Question Set Up

You will be answering the following question with this data: 

```text
Does the country the project originated in influence the number of backers and the amount of money pledged?
``` 

To answer this question, the independent variable will be the country the project originated in, ```country```.  This is a categorical variable. The two dependent variables will be the number of backers (```backers```) and the amount pledged (```pledged```). These variables are both continuous.

---

## Data Wrangling

Although no data wrangling is actually required for the MANOVA itself, some wrangling is required to test for assumptions. In order to test for multivariate normality, you will need to create a dataset containing only your two dependent variables that is in a matrix format, and you will need to ensure that they are numeric. Unfortunately, the test for normality can only handle 5,000 records, so you will also need to limit your data to 5,000 rows as well.

---

### Ensure Variables are Numeric

And then check the structure of the data to see what format your dependent variables are in.

```{r}
str(kickstarter$pledged)
str(kickstarter$backers)
```

Oh no! Both of them are currently set up as factors! Best convert them using the ```as.numeric()``` function.

```{r}
kickstarter$pledged <- as.numeric(kickstarter$pledged)
kickstarter$backers <- as.numeric(kickstarter$backers)
```

---

### Subsetting 

Next, keep only your two dependent variabes, ```pledged``` and ```backers```.

```{r}
keeps <- c("pledged", "backers")
kickstarter1 <- kickstarter[keeps]
```

Then limit the number of rows: 

```{r}
kickstarter2 <- kickstarter1[1:5000,]
```

---

### Format as a Matrix

Lastly, format the data as a matrix: 

```{r}
kickstarter3 <- as.matrix(kickstarter2)
```

You are now ready to perform the assumptions test for multivariate normality on ```kickstarter3```.

---

## Test Assumptions

With the data wrangling out of the way, it is now time to test assumptions!

---

### Sample Size

The first assumption of MANOVAs is sample size. The rule of thumb is that you must have at least 20 cases per independent variable, and that there must be more cases then dependent variables in every cell.  Meaning that there must be more than 2 cases for each country.  Happily, both of these are fulfilled with a dataset of 323,746! 

---

### Multivariate Normality

To test for multivariate normality, you will use the dataset you wrangled, ```kickstarter3```, in the Wilks-Shapiro test.  You can do that with the function ```mshapiro.test()``` pulled from the ```mvnormtest``` library: 

```{r}
mshapiro.test(t(kickstarter3))
```

And here are the results: 

```text
	Shapiro-Wilk normality test

data:  Z
W = 0.98423, p-value < 2.2e-16
```

You have violated the assumption of multivariate normality if the *p* value is significant at *p* < .05, so unfortunately, these data do not meet the assumption for MANOVAs.  However, for learning purposes, you will continue.

---

### Homogeneity of Variance

You can use Levene's Test from the ```car``` library to test for homogeneity of variance on both of your dependent variables: 

```{r}
leveneTest(pledged ~ country, data=kickstarter)
```

Here are the results for ```pledged```: 

```text
Levene's Test for Homogeneity of Variance (center = median)
          Df F value    Pr(>F)    
group    161  3.4529 < 2.2e-16 ***
      323588                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

Don't forget to test for ```backers``` as well!

```{r}
leveneTest(backers ~ country, data=kickstarter)
```

The results are as follows: 

```text
Levene's Test for Homogeneity of Variance (center = median)
          Df F value    Pr(>F)    
group    161   94.43 < 2.2e-16 ***
      323588                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

Unfortunately, neither variable met the assumption of homogeneity of variance, since they were both significant at *p* < .05.  You have violated the assumption of homogeneity of variance, but you will proceed for now for learning purposes.

---

### Absence of Multicollinearity

Typically, multicollinearity can be assessed simply by running correlations of your dependent variables with each other. A general rule of thumb is that anything above approximately .7 for correlation (i.e. a strong correlation) indicates the presence of multicollinearity.  Check out the correlation between ```pledged``` and ```backers``` with a simple ```cor.test()``` function: 

```{r}
cor.test(kickstarter$pledged, kickstarter$backers, method="pearson", use="complete.obs")
```

And voila! Finally an assumption you have met! With a correlation of *r* = .32, you have an absence of multicollinearity.

```text
	Pearson's product-moment correlation

data:  x and y
t = 194.14, df = 323750, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3198359 0.3260068
sample estimates:
      cor 
0.3229248 
```

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - MANOVAs in R<a class="anchor" id="DS105L8_page_4"></a>

[Back to Top](#DS105L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [8]:
from IPython.display import VimeoVideo
# Tutorial Video Name: MANOVAs Part II
VimeoVideo('341611130', width=720, height=480)

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L08pg3-4tutorial.zip)**.

# MANOVAs in R

Alright! With all that prep work out of the way, it's time to get down to business! 

---

## The Analysis

You will use the function ```manova``` to run a MANOVA. Go figure!

```{r}
MANOVA <- manova(cbind(pledged, backers) ~ country, data = kickstarter)
summary(MANOVA)
```

In this code, you are binding your two dependent variables together with the function ```cbind()```, so they can be examined as one.  Then you are able to specify your independent variable, ```country```, like any other ANOVA.  Here are the results when you call ```summary```:

```text
              Df   Pillai approx F num Df den Df    Pr(>F)    
country      161 0.030427   31.049    322 647176 < 2.2e-16 ***
Residuals 323588                                              
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

Tada! You have run your first MANOVA. Looks like it was significant, too - there is a significant difference in backers and the funds they have pledged by country.  

---

## Post Hocs

But which dependent variable you ask? Well, that's where the post-hocs come in. 

---

### ANOVAs as Post Hocs

The initial post-hoc for a MANOVA is, in fact, an ANOVA.  Luckily, there is code set aside to do this as a post-hoc in R, so that you don't have to create your own ANOVA models. Check out the ```summary.aov()``` function in action: 

```{r}
summary.aov(MANOVA, test = "wilks") 
```

Simply feed in the name of your MANOVA model. This function would work without the additional argument of ```test=```, but like post-hocs for ANOVAs, it is nice to use a correction for Type I error since you are doing so many multiple comparisons.  In this case, you can use the ```"wilks``` correction, specified above.

Here are the results: 

```text
 Response pledged :
                Df     Sum Sq    Mean Sq F value    Pr(>F)    
country        161 2.8502e+11 1770295565  5.5966 < 2.2e-16 ***
Residuals   323588 1.0236e+14  316318677                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

 Response backers :
                Df     Sum Sq  Mean Sq F value    Pr(>F)    
country        161 1.1936e+10 74135319  54.325 < 2.2e-16 ***
Residuals   323588 4.4159e+11  1364675                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

Isn't R wonderful? It provides you with the appropriate output for both dependent variables with one click.  As you can see here, there is a significant difference in both the amount of funds pledged and the number of backers by country.

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Key Terms<a class="anchor" id="DS105L8_page_5"></a>

[Back to Top](#DS105L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Key Terms

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Multivariate Analysis of Variance (MANOVA)</td>
        <td>A type of test like an ANOVA, but with multiple dependent variables.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Multivariate Normality</td>
        <td>An assumption for MANOVAs that requires your DVs to be normally distributed when all joined into one group dependent variable.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Multicollinearity</td>
        <td>When there is a significant relationship between your DVs. The absence of multicollinearity is an assumption for MANOVAs.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Wilks-Shapiro Test</td>
        <td>A test used to assess multivariate normality. It should NOT be significant to pass the assumption.</td>
    </tr>
</table>

---

## Key R Libraries

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>mvnormtest</td>
        <td>Used to test for multivariate normality.</td>
    </tr>
</table>

---

## Key R Code

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>mshapiro.test(t())</td>
        <td>Computes the Wilks-Shapiro test used to test for multivariate normality.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>manova()</td>
        <td>Computes a MANOVA.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>summary.aov()</td>
        <td>Computes post hocs for a MANOVA.</td>
    </tr>
</table>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - Lesson 8 Hands-On<a class="anchor" id="DS105L8_page_6"></a>

[Back to Top](#DS105L8_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


For this Hands On, you will be performing a MANOVA in R.  

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Requirements

This hands on uses a dataset that can predict heart attacks. It is located **[here](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/heartAttacks.zip)**. 

Please answer the following question with this data: 

> It is well-known that men are more likely to have heart attacks than women.  How does gender (```sex```) influence some of the heart attack predictors like resting blood pressure (```trestbps```) and cholesterol (```chol```)? 

In order to do so, you will need to do the following:

* Test for MANOVA assumptions
* Run a MANOVA

Don't worry about correcting for any violations you may encounter, since you have not yet been taught how to overcome them.

Please submit your R studio file, with a one-sentence conclusion to answer the above question in the comments.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>