<div align ="right">Thomas Jefferson University <b>COMP 103</b>: Data Analysis and Visualization</div>

In [2]:
import matplotlib.pyplot as plt
import pandas as pd

# Comparing means with ANOVA

The next means comparison strategy we will investigate is the ANOVA, or ANalysis Of VAriance, which is often used in the life sciences. The name can be a bit confusing: the test detects differences among means, but it does so by comparing variance components. Specifically, it calculates the ratio of the variance between samples (which corresponds to the variance among species, experimental treatments, localities, or whatever else we may be comparing) to the variance within samples (the error or residual variance, that is, variability in the data that cannot be attributed to our experimental factors).

Just as the t-test assigns statistical significance by calculating a t-statistic that is then compared to a table of t-values, the ANOVA assigns statistical significance by calculating an **F-statistic** that is compared to a table of F-values. Most simply, this can be represented as:

$$F = \frac{\text{between-sample variance}}{\text{within-sample variance}}$$
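To make the ratio concrete, here is a minimal sketch that computes F by hand and cross-checks it against SciPy. The three groups are synthetic stand-in data invented for illustration (they are not the course data set), and the variable names are our own.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (NOT the course data set): three groups of 50 values
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=4.0, size=50) for mu in (82.0, 85.0, 88.0)]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-sample variance: spread of the group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ms_between = ss_between / (k - 1)

# Within-sample variance: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ms_within = ss_within / (n_total - k)

f_by_hand = ms_between / ms_within

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"F by hand = {f_by_hand:.4f}, F from scipy = {f_scipy:.4f}, p = {p_scipy:.3g}")
```

The two F values should agree, which shows that the F-statistic is nothing more than the between-sample mean square divided by the within-sample mean square.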

In this notebook we will compare three different types of ANOVA:
* In a **one-way ANOVA**, we will compare multiple sample means
* In a **two-way ANOVA**, we will compare sample means that are affected by more than one experimental factor
* In a **randomized block ANOVA**, which we will just touch on briefly, we will look at how random factors can be analyzed

**Note 1** ANOVAs can get very complicated very quickly! The calculation depends on whether sample sizes are the same or different across groups, and on whether the factors included in the ANOVA are fixed (meaningful beyond the bounds of the experiment being considered, such as level of fertilizer applied, or sex of the study organism) or random (factors that might be expected to vary in unpredictable ways, such as the different cages in which lab mice are housed, or the fields in which plants are grown). Understanding these differences is very important, and an experiment that requires ANOVA should be planned in consultation with a statistician or an experienced practitioner.

**Note 2** Many of the resources you will find online about using ANOVA assume that you are using the R statistical software. The syntax for running an ANOVA in R differs from the syntax in Python, but the fundamentals are very similar.


# Comparing multiple means with one-way ANOVA

The simplest application of ANOVA is to compare three or more means to one another. If a two-sample t-test asks 'Does sample A differ from sample B?', a one-way ANOVA asks 'Do samples A, B, C, ... differ from one another?'

Let's load up a dataset. In the `data` folder is a file called `test_aves.csv`. In the code window below, load that dataset up and print out the header to make sure you have it. 

The data set consists of a series of students, male and female, from four universities, along with their average exam scores in three areas: calculus, chemistry, and biology.

In [6]:
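# The CSV was saved with its index, which pandas reads back as an extra
# 'Unnamed: 0' column; passing index_col=0 to read_csv would suppress it.
test_aves = pd.read_csv('data/test_aves.csv')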
test_aves.head()

|   | Unnamed: 0 | univ | sex | calc | chem | biol |
|---|---|---|---|---|---|---|
| 0 | 0 | ASU | M | 77.735773 | 87.051088 | 88.836237 |
| 1 | 1 | BSU | F | 84.40443 | 86.571663 | 81.843252 |
| 2 | 2 | CSU | M | 78.791048 | 88.657529 | 92.983174 |
| 3 | 3 | DSU | F | 76.030194 | 79.144937 | 87.765178 |
| 4 | 4 | ASU | M | 83.397229 | 83.455723 | 87.392752 |


Hopefully you see the data in front of you. For now, we are going to ignore the university and sex columns and just look at the scores as if we didn't have those other factors available to us.

First things first: before we start comparing means, let's explore the data. ANOVA has three major assumptions: 1) the observations are independent of one another, 2) the data are drawn from a normal distribution, and 3) the groups being compared have similar variances. For this exercise, assume that the scores are a random collection of scores from each university, so that each column of scores in the dataframe is an independent sample; in other words, the rows don't mean anything.
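As one possible way to check the second and third assumptions, the sketch below runs a Shapiro-Wilk test for normality on each group and Levene's test for equal variances. The choice of these two tests is ours, and the data are a synthetic stand-in for the three score columns rather than the real `test_aves.csv`. Independence cannot be tested this way; it is a property of how the data were collected.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the three score columns (NOT the real test_aves.csv)
rng = np.random.default_rng(1)
scores = pd.DataFrame({
    "calc": rng.normal(82, 4, 200),
    "chem": rng.normal(85, 4, 200),
    "biol": rng.normal(88, 4, 200),
})

# Normality: Shapiro-Wilk on each column (a small p suggests non-normality)
for col in scores.columns:
    w, p = stats.shapiro(scores[col])
    print(f"{col}: Shapiro-Wilk W = {w:.3f}, p = {p:.3f}")

# Equal variances: Levene's test across the three columns
# (a small p suggests that the group variances differ)
levene_stat, levene_p = stats.levene(scores["calc"], scores["chem"], scores["biol"])
print(f"Levene W = {levene_stat:.3f}, p = {levene_p:.3f}")
```

Histograms or boxplots of each column are a useful visual complement to these tests, especially with large samples, where even trivial departures from normality can produce small p-values.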

### Exercise 1. 

In the code window below, take a pass at using the tools we have learned so far to demonstrate that each of these three conditions is met. You can use visualizations or other statistical tests to do so. 


In [8]:
###
### Your code here
###
test_aves.describe()



|   | Unnamed: 0 | calc | chem | biol |
|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 499.5 | 82.251045 | 85.017736 | 87.96173 |
| std | 288.819436 | 3.952268 | 4.17588 | 4.042939 |
| min | 0.0 | 70.191306 | 73.188534 | 75.356028 |
| 25% | 249.75 | 79.677506 | 82.313789 | 85.203013 |
| 50% | 499.5 | 82.38943 | 85.047946 | 87.913294 |
| 75% | 749.25 | 84.763142 | 87.959997 | 90.806142 |
| max | 999.0 | 94.562253 | 98.846922 | 100.0 |


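In Python, a one-way ANOVA can be run with the `f_oneway` function in `scipy.stats`. The function takes one array of observations per group (for example, the `calc`, `chem`, and `biol` columns) and returns the F-statistic and its p-value. A small p-value indicates that at least one group mean differs from the others, but it does not tell us which one.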
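A minimal sketch of a one-way ANOVA in Python, assuming `scipy.stats.f_oneway` as the tool, is given here. The data are a synthetic stand-in for the three score columns (not the real `test_aves.csv`), and a boxplot is drawn alongside the test so the numerical result can be compared with the picture.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Synthetic stand-in for the three score columns (NOT the real test_aves.csv)
rng = np.random.default_rng(2)
scores = pd.DataFrame({
    "calc": rng.normal(82, 4, 200),
    "chem": rng.normal(85, 4, 200),
    "biol": rng.normal(88, 4, 200),
})

# One-way ANOVA: pass one array of observations per group
f_stat, p_value = stats.f_oneway(scores["calc"], scores["chem"], scores["biol"])
print(f"F = {f_stat:.2f}, p = {p_value:.3g}")

# Boxplot of the same three groups for a visual comparison
fig, ax = plt.subplots()
ax.boxplot([scores[col].to_numpy() for col in scores.columns])
ax.set_xticklabels(scores.columns)
ax.set_ylabel("Average exam score")
ax.set_title("Scores by subject (synthetic data)")
```

With the real data, replace the synthetic `scores` dataframe with the corresponding columns of `test_aves`.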

### Exercise 2. 

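Run a one-way ANOVA and visualize the results.

Compare the ANOVA result to a boxplot of the groups being compared.

Reporting an ANOVA involves more than just F and p. The results are usually presented as a table that also lists the sums of squares and degrees of freedom; see https://www.reneshbedre.com/blog/anova.html for how to make one.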

### Exercise 3. 

Make an ANOVA table for the one-way ANOVA you ran in Exercise 2.



## Post hoc tests with one-way ANOVA

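A significant one-way ANOVA tells us that at least one group mean differs from the others, but not which groups differ. Post hoc tests answer that question by comparing the groups pairwise while correcting for the number of comparisons being made. A common choice is Tukey's honestly significant difference (HSD) test, which compares every pair of group means while controlling the family-wise error rate.

### Exercise 4. 

Run a post hoc test on the ANOVA data.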
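As an illustration, the sketch below runs Tukey's HSD test with `pairwise_tukeyhsd` from `statsmodels`. Tukey's HSD is our choice of post hoc test here, the data are synthetic stand-ins rather than the real `test_aves.csv`, and the column names `subject` and `score` are our own.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic stand-in for the three score columns (NOT the real test_aves.csv)
rng = np.random.default_rng(4)
scores = pd.DataFrame({
    "calc": rng.normal(82, 4, 200),
    "chem": rng.normal(85, 4, 200),
    "biol": rng.normal(88, 4, 200),
})

# Long format: one row per score, labeled with its subject
long_scores = scores.melt(var_name="subject", value_name="score")

# Tukey's HSD compares every pair of groups at a family-wise alpha of 0.05
tukey = pairwise_tukeyhsd(
    endog=long_scores["score"],
    groups=long_scores["subject"],
    alpha=0.05,
)
print(tukey.summary())
```

Each row of the summary reports one pairwise comparison: the difference in means, an adjusted p-value, a confidence interval, and whether the null hypothesis of equal means is rejected.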

# Two-way analysis of variance

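A two-way ANOVA tests the effects of two experimental factors on a response at the same time. It partitions the variance into a main effect for each factor, an interaction between the two factors (whether the effect of one factor depends on the level of the other), and the residual variance. In our data set, university and sex are factors that may each affect exam scores.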

### Exercise 5. 

Run a two-way ANOVA on the exam scores, using sex as one of the factors. Use the `statsmodels` package for this.

![TJU logo image](images/TJU_logo_image.png "TJU logo image")