# MANOVAs in R

# Load Libraries
library("mvnormtest")
library("car")

In [2]:
# Load Data
kickstarter <- read.csv("kickstarter.csv")


## Question Setup

### Does the country the project originated in influence the number of backers and the amount of money pledged?

### Details
To answer this question, the independent variable will be the country the project originated in (country), which is a categorical variable. The two dependent variables will be the number of backers (backers) and the amount pledged (pledged), both of which are continuous.

## Data Wrangling
Although the MANOVA itself requires no data wrangling, testing its assumptions does. To test for multivariate normality, you will need a dataset that contains only your two dependent variables, stored as numeric values in matrix format. The test for normality can only handle 5,000 records, so you will also need to limit your data to 5,000 rows.

In [3]:
# Ensure the variables are numeric
str(kickstarter$pledged)
str(kickstarter$backers)

 chr [1:323750] "0" "220" "1" "1283" "52375" "1205" "453" "8233" "6240.57" ...
 chr [1:323750] "0" "3" "1" "14" "224" "16" "40" "58" "43" "0" "100" "0" ...
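
Because both columns were read in as character, it can help to see which entries will fail to convert before coercing them. The sketch below uses a small made-up vector (pledged_raw is a hypothetical stand-in, not the real column) to show one way of finding the strings that as.numeric() would turn into NA.

```r
# Hypothetical stand-in for a column that was read in as character
pledged_raw <- c("0", "220", "1283", "USD", "6240.57", "")

# Coerce quietly, then flag entries that became NA only because of coercion
pledged_num <- suppressWarnings(as.numeric(pledged_raw))
bad <- is.na(pledged_num) & !is.na(pledged_raw)

pledged_raw[bad]  # the strings that could not be converted
sum(bad)          # how many values would be lost
```

Applied to the real data, the same pattern would show whether the NAs come from stray text, empty strings, or something else.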


In [4]:
kickstarter$pledged <- as.numeric(kickstarter$pledged)
kickstarter$backers <- as.numeric(kickstarter$backers)

"NAs introduced by coercion"
"NAs introduced by coercion"


In [6]:
# Subsetting
keeps <- c("pledged", "backers")
kickstarter1 <- kickstarter[keeps]

In [7]:
# Limit the number of rows
kickstarter2 <- kickstarter1[1:5000,]
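
Taking the first 5,000 rows is the simplest way to get under the limit, but those rows may not be representative if the file is sorted in some way. One possible alternative, sketched here on simulated data (toy, n_max, and the seed are all assumptions for illustration), is to draw a random sample of rows instead.

```r
set.seed(42)  # arbitrary seed so the sample is reproducible

# Simulated data standing in for the two dependent variables
toy <- data.frame(pledged = rexp(20000, rate = 1 / 5000),
                  backers = rpois(20000, lambda = 50))

# The normality test accepts at most 5,000 records
n_max <- 5000

# Draw row indices without replacement, then subset
rows <- sample(nrow(toy), size = min(n_max, nrow(toy)))
toy_sample <- toy[rows, ]

dim(toy_sample)
```

Using min() keeps the code working even when the dataset has fewer than 5,000 rows.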


In [8]:
# Format as Matrix
kickstarter3 <- as.matrix(kickstarter2)


## Test Assumptions

### Sample Size
The first assumption of MANOVAs is sample size. The rule of thumb is that you must have at least 20 cases per independent variable, and that there must be more cases than dependent variables in every cell. With two dependent variables, that means there must be more than 2 cases for each country. Happily, both of these are fulfilled with a dataset of more than 323,000 records!
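
The per-cell rule can be checked directly rather than assumed. The sketch below uses a tiny made-up data frame (toy and its values are hypothetical) to count complete cases per country and compare each count against the number of dependent variables.

```r
# Toy data standing in for the real dataset (values are made up)
toy <- data.frame(
  country = c("US", "US", "US", "GB", "GB", "GB", "CA", "CA"),
  pledged = c(0, 220, 1283, 52375, 1205, 453, 8233, 6240.57),
  backers = c(0, 3, 14, 224, 16, 40, 58, 43)
)

# Keep only rows where both dependent variables are present
complete <- toy[complete.cases(toy[, c("pledged", "backers")]), ]

# Cases per country (one cell per level of the independent variable)
cell_counts <- table(complete$country)
cell_counts

# Each cell needs more cases than there are dependent variables (2 here)
n_dv <- 2
all(cell_counts > n_dv)
```

In this toy example the check fails because one country has only 2 cases, which is exactly the situation the rule of thumb is meant to catch.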

### Multivariate Normality 

In [9]:
# Drop any missing values
kickstarter4 <- na.omit(kickstarter3)


In [10]:
mshapiro.test(t(kickstarter4))



	Shapiro-Wilk normality test

data:  Z
W = 0.07914, p-value < 2.2e-16


### Results
If the p value is significant at p < .05, the assumption of multivariate normality has been violated. Here p < 2.2e-16, so unfortunately these data do not meet the assumption for MANOVAs. However, for learning purposes, you will continue.

## Homogeneity of Variance

In [11]:
leveneTest(kickstarter$pledged, kickstarter$country, data=kickstarter)


"kickstarter$country coerced to factor."


           Df   F value        Pr(>F)
group      23  22.13075  5.661519e-93
       323102


In [12]:
# Repeat the test for the second dependent variable
leveneTest(kickstarter$backers, kickstarter$country, data=kickstarter)


"kickstarter$country coerced to factor."


### Results
Unfortunately, neither variable met the assumption of homogeneity of variance, since both tests were significant at p < .05. You will proceed for now for learning purposes.
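
To make the logic of the test more concrete: Levene's test with median centering (the default in car's leveneTest) amounts to a one-way ANOVA on each value's absolute deviation from its group median. The sketch below reproduces that idea in base R on simulated data (toy, its groups, and the seed are assumptions for illustration); it is not a replacement for leveneTest.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Toy data: three groups with deliberately different spreads
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 50)),
  y     = c(rnorm(50, sd = 1), rnorm(50, sd = 5), rnorm(50, sd = 20))
)

# Absolute deviation of each value from its own group's median
toy$abs_dev <- abs(toy$y - ave(toy$y, toy$group, FUN = median))

# A one-way ANOVA on those deviations gives the Levene F statistic
levene_fit <- anova(lm(abs_dev ~ group, data = toy))
levene_fit

f_value <- levene_fit[["F value"]][1]
p_value <- levene_fit[["Pr(>F)"]][1]
p_value < 0.05  # a significant result points to unequal variances
```

A significant F here means the typical distance from the group median differs between groups, which is why a small p value signals a violation of the assumption.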



## Absence of Multicollinearity


In [15]:
# cor.test() drops incomplete pairs on its own, so no extra argument is needed
cor.test(kickstarter$pledged, kickstarter$backers, method="pearson")


	Pearson's product-moment correlation

data:  kickstarter$pledged and kickstarter$backers
t = 615.89, df = 323124, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7332546 0.7364268
sample estimates:
      cor 
0.7348447 


## The Analysis 

In [16]:
MANOVA <- manova(cbind(pledged, backers) ~ country, data = kickstarter)
summary(MANOVA)

              Df   Pillai approx F num Df den Df    Pr(>F)    
country       23 0.032996   235.65     46 646204 < 2.2e-16 ***
Residuals 323102                                              
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## ANOVAs as Post Hocs

In [18]:
# Univariate ANOVAs for each dependent variable (no multivariate test statistic applies here)
summary.aov(MANOVA)

 Response pledged :
                Df     Sum Sq    Mean Sq F value    Pr(>F)    
country         23 4.1367e+12 1.7985e+11  22.445 < 2.2e-16 ***
Residuals   323102 2.5891e+15 8.0133e+09                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Response backers :
                Df     Sum Sq   Mean Sq F value    Pr(>F)    
country         23 4.0433e+09 175796074  198.45 < 2.2e-16 ***
Residuals   323102 2.8622e+11    885848                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

624 observations deleted due to missingness

### Results:
As the univariate ANOVAs show, both the amount pledged and the number of backers differ significantly by country (p < .001 for each).