# DS105 Intermediate Statistics : Lesson Four Companion Notebook

### Table of Contents <a class="anchor" id="DS105L4_toc"></a>

* [Table of Contents](#DS105L4_toc)
    * [Page 1 - Introduction](#DS105L4_page_1)
    * [Page 2 - ANOVAs](#DS105L4_page_2)
    * [Page 3 - Assumptions for ANOVAs](#DS105L4_page_3)
    * [Page 4 - One Way ANOVAs in R](#DS105L4_page_4)
    * [Page 5 - Computing ANOVAs with Equal Variance (Met Homogeneity of Variance Assumption)](#DS105L4_page_5)
    * [Page 6 - One-Way ANOVA in R Activity](#DS105L4_page_6)
    * [Page 7 - One-Way ANOVA in R Activity Solution](#DS105L4_page_7)
    * [Page 8 - One Way Between Subjects ANOVAs in Python](#DS105L4_page_8)
    * [Page 9 - Computing ANOVAs with Equal Variance](#DS105L4_page_9)
    * [Page 10 - One Way ANOVA in Python Activity](#DS105L4_page_10)
    * [Page 11 - One Way ANOVA in Python Activity Solution](#DS105L4_page_11)
    * [Page 12 - Key Terms](#DS105L4_page_12)
    * [Page 13 - Lesson 4 Hands-On](#DS105L4_page_13)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 1 - Introduction<a class="anchor" id="DS105L4_page_1"></a>

[Back to Top](#DS105L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

In [1]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Introduction to ANOVA
VimeoVideo('257515870', width=720, height=480)

# Introduction

So far in this course, you have only dealt with statistical tests that have one x variable and one y variable.  Now, you'll begin working with statistics that can handle more than one x! This branch of statistics is called *multivariate statistics*, since it deals with multiple variables.  The first multivariate statistic you will encounter is the ANOVA, which stands for *analysis of variance*.  

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>You may want to watch this <a href="https://vimeo.com/444752189"> recorded live workshop </a> that goes over how to do ANOVAs in R. </p>
    </div>
</div>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 2 - ANOVAs<a class="anchor" id="DS105L4_page_2"></a>

[Back to Top](#DS105L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [2]:
from IPython.display import VimeoVideo
# Tutorial Video Name: ANOVAs
VimeoVideo('335514395', width=720, height=480)

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L04pg2tutorial.zip)**.

# What is an ANOVA?

ANOVAs are wonderful! You will quickly fall in love with them! They are versatile, robust, and relatively simple to understand and conduct.  ANOVAs are where your true intermediate statistics journey begins.

You can think of an ANOVA as a *t*-test on steroids.  Depending on the type of ANOVA, it can either replace your independent *t* test, your dependent *t* test, or it can even handle BOTH scenarios at once.  Is your mind blown yet? Ready to convert to ANOVAs for life? 

You will use an ANOVA to compare the means of the different levels of an independent variable(s). The independent variable will be a categorical variable.  Unlike your independent *t*-test, which can only handle two levels of the independent variable, or two groups at a time, the ANOVA can handle more than two levels of your independent variable.  The dependent variable will remain continuous, and there is still only one.  So, if you have a theoretical example in which you want to determine whether color affects the fluffiness of a dog's coat, in a *t*-test, you could only compare two different coat colors at a time.  Maybe black and white.  But with an ANOVA, you can look at more than two levels of the independent variable, so you can compare more coat colors at a time - maybe black, white, apricot, and brown.  In this scenario, the coat color is the independent variable, also sometimes called the grouping variable, because it is made up of groups, and the dependent variable is a continuous measure of coat fluffiness, for each dog you have in your sample.   

Basically, the concept behind the ANOVA is that you are seeing if the variance in the dependent variable is in any way related to the grouping of the independent variable. Is there a pattern, in which certain groups have higher or lower means? 
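
To make that idea concrete, here is a minimal sketch in Python that works out an *F* value by hand. The fluffiness scores are made up for illustration and the variable names are hypothetical; in practice you would let a statistics package do this arithmetic for you.

```python
from statistics import mean

# Hypothetical fluffiness scores for three coat colors (made-up numbers).
groups = {
    "black": [4.0, 5.0, 6.0],
    "white": [6.0, 7.0, 8.0],
    "apricot": [8.0, 9.0, 10.0],
}

all_scores = [x for scores in groups.values() for x in scores]
grand_mean = mean(all_scores)

# Between-groups sum of squares: how far each group mean sits from the grand mean.
ss_between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in groups.values())
# Within-groups sum of squares: how far each score sits from its own group mean.
ss_within = sum((x - mean(s)) ** 2 for s in groups.values() for x in s)

df_between = len(groups) - 1               # number of levels minus 1
df_within = len(all_scores) - len(groups)  # number of scores minus number of levels

# F compares the variance explained by grouping to the variance left over.
f_value = (ss_between / df_between) / (ss_within / df_within)
print(f_value)  # 12.0 for these made-up scores
```

The larger the *F* value, the more the group means differ relative to the spread inside each group.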

---

# Types of ANOVAs

There are two different types of ANOVAs: between subjects and within subjects ANOVAs.  

---

## Between Subjects ANOVAs

*Between Subjects ANOVAs* fall in the same class as independent *t*-tests and independent Chi-Squares.  All of these analyses look for differences between separate, or independent, groups.  There is no overlap between one group and the next.  It is called a between subjects ANOVA because you are looking at mean differences between people, often called subjects in a research study.  

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>Sometimes in the biological fields, this is called a Type I ANOVA instead of a between subjects ANOVA. </p>
    </div>
</div>

---

## Within Subjects ANOVAs

*Within Subjects ANOVAs* fall in the same class as dependent *t*-tests.  They are used when you have paired data or related samples.  Most often, this is done by looking at change over time.  However, a dependent *t*-test can only handle a pre- and post-test design, whereas, since ANOVAs can have multiple levels, they can handle additional timepoints.  Anything from "beginning, middle, end" designs to looking at time bins for time periods can be done with an ANOVA.  

Within subjects ANOVAs get their name because you are looking at the same person, or research subject, over and over again.  Because within subjects ANOVAs are so often used to look at things over time, they are often called *repeated measures ANOVAs* as well. 

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>Sometimes in the biological fields, this is called a Type II ANOVA instead of a within subjects ANOVA. </p>
    </div>
</div>

---

## Number of Independent Variables 

With both between and within subjects ANOVAs, you can also add a qualifier that describes the number of independent variables.  So there are also one-way and factorial ANOVAs, which can be either between or within subjects. 

---

### One-Way ANOVAs

When an ANOVA has only one independent variable, regardless of the number of levels, then this is called a *one-way ANOVA*. 

---

### Factorial ANOVAs 

When there is more than one independent variable, then as a whole, it is a *factorial ANOVA*.  However, ANOVAs are typically named by the number of independent variables they contain, so if you had two independent variables, it would be called a *two-way ANOVA*, and if you had three independent variables, it would be called a *three-way ANOVA*, etc.  Although there is no limit to the number of independent variables you can have in an ANOVA, as long as you have enough data, in practice anything above a three-way or four-way ANOVA becomes very difficult to interpret, so it is not recommended. 

---

## Mixed Measures ANOVAs 

In a factorial ANOVA, where there are multiple variables, you can mix and match your variables, so that there is one or more between subjects variables as well as a within subject variable.  This is called a *mixed measures* or *mixed design* ANOVA.  

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 3 - Assumptions for ANOVAs<a class="anchor" id="DS105L4_page_3"></a>

[Back to Top](#DS105L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [3]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Assumptions for ANOVAs
VimeoVideo('335514530', width=720, height=480)

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L04pg3tutorial.zip)**.

# Assumptions for ANOVAs

Just like any statistical test, there are assumptions for ANOVAs that you need to meet in order for them to be the most effective and the least biased.  

---

## Normality

This assumption should be very familiar to you! Just like most statistics, ANOVAs require your data to be normally distributed, or distributed as close to normal as possible.  If your data are not normal, they will need to be transformed to approximate the normal distribution. Luckily, the ANOVA is relatively robust, so especially as your sample size increases, and if you keep your group sizes relatively equal, it can handle some deviations from normality.

---

## Homogeneity of Variance

On the simplest level, *homogeneity of variance* means the variance, or the spread of data, is equal.  The root *homo* means same, so think of homogeneity as "same variance." In the context of an ANOVA assumption, it means that the variance of the dependent variable should be roughly the same in every group, or level, of the independent variable. 

It's probably easier to look at an illustration: 

![A graph depicts the comparison of GPR, Kernel Ridge, and SVR. The x-axis represents data and the y-axis represents target. There are four curves plotted to represent true, SVR, KRR, and GPR. Plotted points represent the data.](Media/ANOVA1.png)

See how the means change some along the x axis, from the first curve to the second curve, but that the actual spread of the data is relatively the same? The distance between the highest and the lowest dot on the first curve is approximately the same as the distance between the highest and lowest dot on the second curve, but the actual placement of those dots is relative to the mean.

If some of these curves had a really wide spread of data, or a really small spread, then that wouldn't be equal variance.  Instead, the variance would be unequal, which is also called *heterogeneity of variance*. The root word *hetero* means different, so you can think of heterogeneity as "different variance." 
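
As a rough illustration, the sketch below compares the largest and smallest group variances in two made-up datasets. The data and the `variance_ratio` helper are hypothetical, and the ratio is only an informal screen; it is not a substitute for a formal test of homogeneity of variance.

```python
from statistics import variance

# Two made-up datasets: the group means differ in both,
# but only the first has a similar spread in each group.
equal_spread = {"a": [1.0, 2.0, 3.0], "b": [11.0, 12.0, 13.0]}
unequal_spread = {"a": [1.0, 2.0, 3.0], "b": [2.0, 12.0, 22.0]}

def variance_ratio(groups):
    """Return the largest group variance divided by the smallest."""
    variances = [variance(scores) for scores in groups.values()]
    return max(variances) / min(variances)

print(variance_ratio(equal_spread))    # 1.0, the spreads match
print(variance_ratio(unequal_spread))  # 100.0, one group is far more spread out
```

A ratio near 1 is what homogeneity of variance looks like; a ratio far above 1 signals heterogeneity.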

---

## Sample Size

You must have an adequate sample size in order for ANOVAs to effectively test for differences between groups. Typically, a higher sample size is required the more complex your analysis is, and ANOVAs are no exception.  You must have at least 20 cases per independent variable. 

---

## Independence 

The assumption of independence means that your groups must be unrelated, or independent, of each other (except for within subjects designs). You could theoretically test this by correlating each level of your independent variables with each other, but it's typically not done.  Really, meeting the assumption of independence is just more about how you are setting up your "experiment" or choosing your data to analyze.  Make sure that there is no overlap between your groups and that the levels you are testing are not related in some way, like having the same people.

If you don't meet the assumption of independence, you are much more likely to commit a Type I error, saying something is significant when it's really not. And there's no way to correct for violating this assumption - it's better just to not run an ANOVA if you feel you have violated the assumption of independence.  

---

## Sphericity (for Within Subjects Designs Only)

Sphericity is a lot like homogeneity of variance, but applies specifically to repeated measures, or within subjects, designs.  Say, for instance, you are measuring the activity level of dogs in their first year, second year, third year, and fourth year.  If there is sphericity, then the association between each set of years should be approximately the same: 1-2, 1-3, 1-4, 2-3, 2-4, and 3-4. However, when you're dealing with changes over time, it is often likely that things that happen closer together in time are more closely associated with each other.  As your dog ages, you would expect that he or she would become a little less energetic.  So the correlation between year 1 and year 2 activity rates is probably much higher than the correlation between the year 1 and year 4 activity rates, for instance.  When sphericity is violated in a within subjects design, there are many ways in which you can correct for it.  
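
Formally, sphericity is usually stated as the differences between every pair of time points having roughly equal variance. The sketch below, using made-up activity scores for four hypothetical dogs, computes that variance for each pair of years so you can eyeball whether they are similar. It is an informal look, not a formal sphericity test.

```python
from itertools import combinations
from statistics import variance

# Made-up activity scores: one list per year, one position per dog.
activity = {
    "year1": [10.0, 12.0, 14.0, 16.0],
    "year2": [9.0, 11.0, 14.0, 15.0],
    "year3": [8.0, 11.0, 12.0, 15.0],
    "year4": [5.0, 10.0, 9.0, 15.0],
}

diff_variances = {}
for first, second in combinations(activity, 2):
    # Difference score for each dog between the two years.
    diffs = [a - b for a, b in zip(activity[first], activity[second])]
    diff_variances[(first, second)] = variance(diffs)

for pair, value in diff_variances.items():
    print(pair, round(value, 2))
```

If some pairs show much larger variances than others, the sphericity assumption is in doubt.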

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 4 - One Way ANOVAs in R<a class="anchor" id="DS105L4_page_4"></a>

[Back to Top](#DS105L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [4]:
from IPython.display import VimeoVideo
# Tutorial Video Name: One Way ANOVAs in R
VimeoVideo('335516500', width=720, height=480)

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L04pg4-5tutorial.zip)**.

# One Way Between Subjects ANOVAs in R

Now that you have a basic idea about what an ANOVA is, you will learn how to create ANOVAs in R, starting with the One Way ANOVA.

---

## Load Libraries

ANOVAs come as part of the base package in R, so the only libraries you will need to load in are ```dplyr``` because you'll need it for some data wrangling, ```rcompanion``` because you'll use it to check for the assumption of normality, and ```car``` if you need to run an ANOVA that will correct for a violation of homogeneity of variance. 

```{r}
library("dplyr")
library("rcompanion")
library("car")
```

---

## Load in Data

You will be examining data about the apps in the **[Google Play Store](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/NEWgoogleplaystore.zip)**.  

---

## Question Setup

With this data, you will answer the question: 

```text
Is there a difference in price among the three app categories of beauty, food and drink, and photography? 
```

In order to answer this question, your x, or independent variable, will be the app categories, which has three levels: beauty, food and drink, and photography.  Your y, or dependent variable, will be the price.  As with all ANOVAs, the IV will be categorical, and the DV will be continuous.

---

## Data Wrangling

Depending on the data that you've been given, it may need some wrangling! 

---

### Filter the Data and Remove Missing Values


In this case, the data has many more categories than three, so you will need to filter the dataset by the categories you want: beauty, food and drink, and photography.

```{r}
apps <- na.omit(NEWgoogleplaystore %>% filter(Category %in% c("BEAUTY", "FOOD_AND_DRINK", "PHOTOGRAPHY")))
```

Now your dataset contains only the three categories you want to compare.

---

### Make Price Numeric

You will also need to make your dependent variable, price, numeric, so that you can test for some of your assumptions:

```{r}
apps$Price <- as.numeric(apps$Price)
```

---

## Test Assumptions

Before you go any further, it's important to test for assumptions.  If the assumptions are not met for ANOVA, but you proceeded anyway, you run the risk of biasing your results. 

---

### Normality

You only need to test for the normality of the dependent variable, since the IV is categorical. 

```{r}
plotNormalHistogram(apps$Price)
```

Here is the result: 

![A graph depicts the plot of x against frequency. A curve and a bar is plotted on the graph. The x-axis ranges from 0 to 100 and the y-axis ranges from 0 to 500. The peak position of the curve reaches the tallest bar.](Media/GPSANOVAHisto.png)

Looks like that isn't normal in any way - it is very highly positively skewed. So, you'll try to transform price by square rooting or cube rooting the column. 

```{r}
apps$PriceSQRT <- sqrt(apps$Price)
plotNormalHistogram(apps$PriceSQRT)
```

Here's the result of the square root transformation: 

![A graph depicts the plot of x against frequency. A curve and a bar is plotted on the graph. The x-axis ranges from 0 to 8000 and the y-axis ranges from 0 to 500. The peak position of the curve reaches the tallest bar.](Media/GPSANOVAsqrt.png)

So that hasn't made any improvements.  Try the cube root: 

```{r}
apps$PriceCUBE <- apps$Price ^ (1/3)
plotNormalHistogram(apps$PriceCUBE)
```

With this result: 

![A graph depicts the plot of x against frequency. A curve and a bar is plotted on the graph. The x-axis ranges from 0 e plus 00 to 8 e plus 05 and the y-axis ranges from 0 to 500. The peak position of the curve reaches the tallest bar.](Media/GPSANOVAcubed.png)

Looks like neither of these is really any better than the original, so you might as well keep the original data to ease interpretation.  ANOVA is somewhat tolerant of violations of normality when you have a large sample size. Your other option would be to run another analysis that does not require normality.
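
To see why roots are the usual first choice for positively skewed data, here is a small Python sketch that measures skewness before and after each transformation. The prices are made-up numbers, not the Google Play data, and the `skewness` helper is a hypothetical function written for this illustration.

```python
from math import sqrt
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: the average cubed z-score (0 means symmetric)."""
    center = mean(values)
    spread = pstdev(values)
    return mean([((x - center) / spread) ** 3 for x in values])

# Made-up, positively skewed prices: mostly small, with a few large values.
prices = [1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 8.0, 27.0, 64.0]
sqrt_prices = [sqrt(x) for x in prices]
cube_root_prices = [x ** (1 / 3) for x in prices]

for label, values in [("raw", prices), ("square root", sqrt_prices),
                      ("cube root", cube_root_prices)]:
    print(label, round(skewness(values), 2))
```

Each root pulls the long right tail in a bit further, so the skewness shrinks toward 0. With real data, as you saw above, the improvement may be too small to be worth the harder interpretation.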

---

### Homogeneity of Variance

You can test for homogeneity of variance easily using either Bartlett's test or Fligner's test.  Bartlett's test is for when your data is normally distributed, and Fligner's test, which is non-parametric, is for when your data is not normally distributed. No matter which test you are using, you are looking for a non-significant test.  The null hypothesis for both of these is that the data has equal variance, so you'd like to have a *p* value of > .05.  You have already determined your data is not normally distributed, so ordinarily you would just perform Fligner's test, but just for learning purposes, you'll try both here.

---

#### Bartlett's Test

To do Bartlett's test, use the function ```bartlett.test()```, with the argument of the y data separated by a tilde, followed by the x data.  Then there's an argument ```data=```, which is where you will specify the name of your dataset. 

```{r}
bartlett.test(Price ~ Category, data=apps)
```

Here is the output: 

```text
	Bartlett test of homogeneity of variances

data:  Price by Category
Bartlett's K-squared = Inf, df = 2, p-value < 2.2e-16
```

The *p* value associated with this test is < .05, which means that unfortunately, you have violated the assumption of homogeneity of variance.

---

#### Fligner's Test

To perform Fligner's test, use the function ```fligner.test()```, with the argument of the y data separated by a tilde, followed by the x data.  Then there's an argument ```data=```, which is where you will specify the name of your dataset. 

```{r}
fligner.test(Price ~ Category, data=apps)
```

Here is the output: 

```text
	Fligner-Killeen test of homogeneity of variances

data:  Price by Category
Fligner-Killeen:med chi-squared = 8.1952, df = 2, p-value = 0.01661
```

Although this test is less significant than Bartlett's, because you have run the correct test for your data, the *p* value is still < .05, which means you have violated the assumption of homogeneity of variance.

---

#### Correcting for Violations of Homogeneity of Variance

There are two ways that you can correct for a violation of homogeneity of variance.  The first is the BoxCox transformation of your data, and the second is running a slightly different type of ANOVA, one that was created specifically to handle this violation. That test is called the *Welch One-Way Test*, and you'll learn about this ANOVA option.

---

### Sample Size

An ANOVA requires a sample size of at least 20 per independent variable.  In this case, you only have one independent variable, so as long as you have at least 20 cases, you are fine.  Looking at the data, the *n* is 515, so you are fine to proceed with this assumption!

---

### Independence 

There is no statistical test for the assumption of independence.  

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>If you would like to get deeper into the assumptions for ANOVA, check out <a href="https://sites.ualberta.ca/~lkgray/uploads/7/3/6/2/7362679/slides_-_anova_assumptions.pdf">this resource</a> from the University of Alberta. </p>
    </div>
</div>


---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 5 - Computing ANOVAs with Equal Variance (Met Homogeneity of Variance Assumption)<a class="anchor" id="DS105L4_page_5"></a>

[Back to Top](#DS105L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [5]:
from IPython.display import VimeoVideo
# Tutorial Video Name: One Way Between Subjects ANOVAs in R
VimeoVideo('335515713', width=720, height=480)

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L04pg4-5tutorial.zip)**.

# Computing ANOVAs with Equal Variance (Met Homogeneity of Variance Assumption)

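The standard one-way ANOVA assumes equal variance across groups, so it is the appropriate ANOVA to compute when your data meets that assumption.  The Google Play data did not meet it, but you will run the standard ANOVA first to see how it works.    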

Below is the code to run a one-way ANOVA in R.  You can give your ANOVA a name; this one is named ```appsANOVA```.  Then you want to use the function ```aov()```. The argument for this function is your y variable, which is continuous, followed by a tilde and then your x variable, which is categorical.  Remember that the tilde reads as "by," so you can think of this as analyzing price by category.

```{r}
appsANOVA <- aov(apps$Price ~ apps$Category)
```

To view the results, pass the named ANOVA to the ```summary()``` function: 

```{r}
summary(appsANOVA)
```

Which will provide the following output: 

```text
               Df Sum Sq Mean Sq F value Pr(>F)
apps$Category   2    5.2   2.617   1.601  0.203
Residuals     465  760.2   1.635               
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```
The first column of the output has the ```Df```, or degrees of freedom. The value for your category is calculated as # of Levels - 1, so that is always a good gut check. Next, you have columns for the ```Sum Sq``` and ```Mean Sq```; these are just some of the calculations that went into getting your *F*-value, which is the test statistic for an ANOVA. The real meat that you want to pay attention to is the *F* value itself and the associated *p*-value next to it. Like anything else, if this value is less than .05, the test was significant. If you ever need a reminder of that, you can look for a star and the ```Signif. codes``` down at the bottom. The *p* value is above .05 and there's no star next to it, so the test is not significant.  
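
If you would like to check how the columns relate to each other, the short Python sketch below redoes the arithmetic using the numbers printed in the output above. The variable names are made up for this illustration, and because the printed values are rounded, the results only match to about three decimal places.

```python
# Values copied from the summary() output above.
levels = 3                  # beauty, food and drink, photography
df_category = levels - 1    # matches the Df of 2 for the category row
df_residuals = 465
sum_sq_residuals = 760.2
mean_sq_category = 2.617

# Mean Sq is Sum Sq divided by Df.
mean_sq_residuals = sum_sq_residuals / df_residuals
# The F value is the category Mean Sq divided by the residual Mean Sq.
f_value = mean_sq_category / mean_sq_residuals

print(df_category)                  # 2
print(round(mean_sq_residuals, 3))  # 1.635
print(round(f_value, 3))            # 1.601
```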

---

# Computing ANOVAs with Unequal Variance (Violated Homogeneity of Variance Assumption)

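If you need to correct for violating the assumption of homogeneity of variance, you can run an ANOVA that is adjusted for that violation.  To do this, you will actually create a linear model first, and then use the function ```Anova()``` from the ```car``` package on it.  Strictly speaking, this approach applies White's heteroscedasticity correction rather than Welch's One-Way Test (which base R provides as ```oneway.test()```), but both serve the same purpose: relaxing the equal variance assumption.  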

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Fun Fact!</h3>
    </div>
    <div class="panel-body">
        <p>The distribution and statistics behind regression and ANOVAs are the same! However, the way you use and conceptualize them is very different. </p>
    </div>
</div>

Here's how this will look: 

```{r}
ANOVA <- lm(Price ~ Category, data=apps)
Anova(ANOVA, type="II", white.adjust=TRUE)
```

First, create and name a linear model that uses the same set up as the ANOVA with equal variance.  Then, call the ```Anova()``` function on that named model, include the argument ```type=``` (lowercase, since R is case sensitive) set to ```"II"``` to request Type II sums of squares, and then use the argument ```white.adjust=TRUE```.  This last part, setting ```white.adjust=``` to ```TRUE```, applies a heteroscedasticity-corrected covariance matrix, which is what makes this ANOVA appropriate when you have unequal variance. 

Here is the output R provides you with: 

```text
Analysis of Deviance Table (Type II tests)

Response: Price
           Df    F   Pr(>F)    
Category    2 6.3142 0.00197 **
Residuals 465                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

It provides a little less information in terms of the math behind calculating the *F* statistic, but really, all the information you need to interpret the data is there. This is significant at *p* < .01, so you can conclude that there is a significant difference in price somewhere between the three levels of your independent variable.

---

## Post Hocs

Now the problem with an ANOVA is that you have multiple groups.  When you found significance with a *t*-test, you were able to just look at the means and you knew where the significant differences lay.  You knew what was higher, and what was lower.  But with an ANOVA, you can't just look at the means right away, because the *F* and associated *p* value just let you know that there is a difference between at least one pair of the three categories.  In your example, the mean prices could be different between the beauty and food and drink category, the beauty and photography category, the food and drink and photography category, or some combination of those three! 

That's where *post hocs* come in.  They are specifically designed to test all the pairs between your data, which is why they are also often known as *pairwise comparisons*. This is done with *t*-tests.  But the inherent problem with using multiple *t*-tests is that the more analyses you run, the more you increase your chances of Type I error. So you're more likely to find something significant when it really isn't. So, typically a post hoc will apply a correction, or adjustment, so that the *t*-tests become more stringent, and you are therefore correcting for doing multiple *t*-tests by applying stricter criteria.  That way your Type I error doesn't run rampant! 
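
To get a feel for how quickly Type I error grows, here is a small Python sketch. It counts the pairs among the three app categories and then estimates the chance of at least one false positive across all of the tests. The calculation assumes the tests are independent, which is a simplification, so treat the result as a rough guide.

```python
from itertools import combinations

categories = ["BEAUTY", "FOOD_AND_DRINK", "PHOTOGRAPHY"]
pairs = list(combinations(categories, 2))  # every pair of levels

alpha = 0.05
# Chance of at least one false positive across all pairwise tests,
# assuming the tests are independent (a simplifying assumption).
familywise_error = 1 - (1 - alpha) ** len(pairs)
# A Bonferroni-style stricter cutoff for each individual test.
bonferroni_alpha = alpha / len(pairs)

print(len(pairs))                  # 3
print(round(familywise_error, 4))  # 0.1426
print(round(bonferroni_alpha, 4))  # 0.0167
```

With only three tests, the chance of a false positive has already nearly tripled from 5% to about 14%, and it keeps climbing as you add levels.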

There are many different corrections you can apply. But the most common ones you'll hear about are Tukey, Bonferroni, Holm, and Scheffe.  All named after the people who came up with them, by the way.  They differ in how much correction they apply, with Tukey generally applying the least correction and Scheffe applying the most.  Unfortunately, the ```pairwise.t.test()``` function used in this lesson does not offer Tukey or Scheffe adjustments, so you'll stick to exploring the difference between no correction at all, and a Bonferroni correction. 

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>If you would like to learn how to calculate Tukey and Scheffe in R, check out <a href="https://psych.wisc.edu/moore/Rpdf/610-R3_post-hoc_one-way_betw.pdf">this resource</a> from the University of Wisconsin.</p>
    </div>
</div>

---

### Computing Post Hocs with No Adjustment

Here is the code for computing a post hoc in R: 

```{r}
pairwise.t.test(apps$Price, apps$Category, p.adjust="none")
```

Use the ```pairwise.t.test()``` function, with the arguments of the two variables you are crossing, and the argument ```p.adjust=```.  To show you why a correction is necessary, you will start out with a value of ```"none"```, which means that no correction is being made for Type I error. Here are the results: 

```text

	Pairwise comparisons using t tests with pooled SD 

data:  apps$Price and apps$Category 

               BEAUTY FOOD_AND_DRINK
FOOD_AND_DRINK 0.74  -             
PHOTOGRAPHY    0.19  0.16         

P value adjustment method: none 
```

The matrix above presents the *p*-values for each *t*-test between the pairs of levels of your independent variable. Reading this, you can see that there was not a *significant difference* in price between any of the pairs of apps. 

---

### Computing Post Hocs with Bonferroni Adjustment

With no adjustment, any significant difference you find in price between app categories may not really exist, because by running three *t*-tests, you have increased your chance of Type I error.  So, it is better to stick with some form of correction, like Bonferroni.  It is simple, but gets the job done! 

```{r}
pairwise.t.test(apps$Price, apps$Category, p.adjust="bonferroni")
```

And here are the results: 

```text 
	Pairwise comparisons using t tests with pooled SD 

data:  apps$Price and apps$Category 

               BEAUTY FOOD_AND_DRINK
FOOD_AND_DRINK 1.000  -             
PHOTOGRAPHY    0.56   0.48         

P value adjustment method: bonferroni 
```

Gasp! You find that your findings are even less significant. Notice the comparison between food and drink and beauty apps. Since a *p*-value can only be between 0 and 1, that's the end of the line; as non-significant as something gets. This has just demonstrated why it's important to always, always, apply a correction to your post hocs!
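
The Bonferroni adjustment itself is simple arithmetic: each *p*-value is multiplied by the number of tests and capped at 1. The Python sketch below applies that rule to the rounded, unadjusted *p*-values printed earlier on this page. Because it starts from rounded values, one result comes out as 0.57 rather than the 0.56 that R reports from the unrounded numbers.

```python
# Unadjusted p-values as printed (rounded) in the earlier output.
unadjusted = {
    ("FOOD_AND_DRINK", "BEAUTY"): 0.74,
    ("PHOTOGRAPHY", "BEAUTY"): 0.19,
    ("PHOTOGRAPHY", "FOOD_AND_DRINK"): 0.16,
}

number_of_tests = len(unadjusted)

# Bonferroni: multiply each p-value by the number of tests, capped at 1.
adjusted = {
    pair: min(1.0, p * number_of_tests) for pair, p in unadjusted.items()
}

for pair, p in adjusted.items():
    print(pair, round(p, 2))
```

The cap at 1 is why the food and drink versus beauty comparison shows up as exactly 1.000 in the output above.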

---

### Computing Post Hocs When You've Violated the Assumption of Homogeneity of Variance

There is an easy solution to computing post hocs when you have violated the assumption of homogeneity of variance.  You'll use the same code as above, but include the argument ```pool.sd = FALSE``` at the end.  Like this: 

```{r}
pairwise.t.test(apps$Price, apps$Category, p.adjust="bonferroni", pool.sd = FALSE)
```

This provides a very similar output, the only difference being that it was calculated with non-pooled standard deviations, as noted at the top.

```text
	Pairwise comparisons using t tests with non-pooled SD 

data:  apps$Price and apps$Category 

               BEAUTY  FOOD_AND_DRINK
FOOD_AND_DRINK 0.4943 -             
PHOTOGRAPHY    0.0035 0.1470       

P value adjustment method: bonferroni 
```

As you can see, once you've corrected for this assumption, your results have changed and your pairwise comparison between photography and beauty apps is significant.  
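
To show what "non-pooled" means, here is a minimal Python sketch of a *t* statistic that lets each group keep its own variance, often called Welch's *t*-test. As far as the R documentation describes, this is what each pairwise test falls back to when ```pool.sd = FALSE```. The data and the `welch_t` helper are made up for illustration, and the sketch stops short of computing a *p*-value.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Each group contributes its own variance; nothing is pooled.
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t_value = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_value, df

# Made-up groups with very different spreads.
narrow_group = [1.0, 2.0, 3.0]
wide_group = [10.0, 20.0, 30.0]

t_value, df = welch_t(narrow_group, wide_group)
print(round(t_value, 2), round(df, 2))
```

Notice that the degrees of freedom come out close to those of the smaller, noisier comparison rather than the full sample, which is how the test stays honest when the variances differ.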

---

## Determine Means and Draw Conclusions

If you had found a significant difference after correction, you would want to then finish interpreting the results and draw some conclusions.  To do that, you need to examine the means! Again, ```dplyr``` nicely comes to the rescue. 

```{r}
appsMeans <- apps %>% group_by(Category) %>% summarize(Mean = mean(Price))
```

Here's the result: 

![A table with two columns and three rows. The column headings are labeled category and mean. The row entries are as follows. Row 1, beauty, 0.00000000. Row 2, food and drink, 0.07779817. Row 3, photography, 0.27835962.](Media/DSO105L04P5.png)

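The post hoc tests that assume homogeneity of variance did not find any significant differences between app categories.  Looking at these means, that makes sense, because the differences are very small! Once the violation of homogeneity of variance was taken into account, though, photography apps (mean price 0.28) came out significantly more expensive than beauty apps (mean price 0.00).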

<div class="panel panel-success">
    <div class="panel-heading">
        <h3 class="panel-title">Additional Info!</h3>
    </div>
    <div class="panel-body">
        <p>If you would like a little more review of ANOVAs in R, done a slightly different way, check out <a href="https://www.youtube.com/watch?v=lpdFr5SZR0Q">this video</a>.</p>
    </div>
</div>

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 6 - One-Way ANOVA in R Activity<a class="anchor" id="DS105L4_page_6"></a>

[Back to Top](#DS105L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


For your Activity, you will be computing a one-way, between subjects ANOVA to see if the number of video views on YouTube differs by the grade a channel received. `This Hands-On will not be graded`, but you are encouraged to complete it. The best way to become a great data scientist is to practice! Once you have submitted your project, you will be able to access the solution on the next page. Note that the solution may be slightly different from yours, but should look similar.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Requirements

Using the **[YouTube Channels dataset](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/YouTubeChannels.zip)**, determine if the number of views (```Video.views```) differs between the different grade categories (```Grade```). To do this, you will need to: 

* Test for all assumptions and correct for them if necessary
* Run the appropriate ANOVA based on your assumptions
* If significant, run the appropriate post hoc based on your assumptions
* Interpret your results 

Then write an overall, one-sentence conclusion about this data analysis. 

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 7 - One-Way ANOVA in R Activity Solution<a class="anchor" id="DS105L4_page_7"></a>

[Back to Top](#DS105L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


# Solution

```{r}
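# See how the views differ by Grade

# Packages used below: rcompanion (plotNormalHistogram), car (Anova), dplyr (group_by, summarize)
library(rcompanion)
library(car)
library(dplyr)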

# Testing assumptions

# Normality
plotNormalHistogram(YouTubeChannels$Video.views)

#Square root it
YouTubeChannels$Video.viewsSQRT <- sqrt(YouTubeChannels$Video.views)
plotNormalHistogram(YouTubeChannels$Video.viewsSQRT)

#Better, try log just in case
YouTubeChannels$Video.viewsLOG <- log(YouTubeChannels$Video.views)
plotNormalHistogram(YouTubeChannels$Video.viewsLOG)
#Log went too far, stick with SQRT

# Homogeneity of Variance
bartlett.test(Video.viewsSQRT ~ Grade, data=YouTubeChannels)
# Does not meet the assumption for homogeneity of variance

# Do the Test, with unequal variance
ANOVA1 <- lm(Video.viewsSQRT ~ Grade, data=YouTubeChannels)
Anova(ANOVA1, type="II", white.adjust=TRUE)

# Do the Post Hocs with unequal variance
pairwise.t.test(YouTubeChannels$Video.viewsSQRT, YouTubeChannels$Grade, p.adjust="bonferroni", pool.sd = FALSE)

# Find means and draw conclusions
YouTubeChannelsMeans <- YouTubeChannels %>% group_by(Grade) %>% summarize(Mean = mean(Video.views))
# All grades significantly differ from all other grades in the number of views they receive, with the higher grades typically getting more views. 
```

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 8 - One Way Between Subjects ANOVAs in Python<a class="anchor" id="DS105L4_page_8"></a>

[Back to Top](#DS105L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [6]:
from IPython.display import VimeoVideo
# Tutorial Video Name: One Way Between Subjects ANOVAs in Python
VimeoVideo('335518678', width=720, height=480)

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L04pg8-9tutorial.zip)**.

# One Way Between Subjects ANOVAs in Python

Now that you have a basic idea about what an ANOVA is, you will learn how to create ANOVAs in Python, starting with the One Way ANOVA. 

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Python was not created specifically with statistics in mind, like R was, so there are far fewer options and it is more difficult to test assumptions.  You will work within these limitations, but know that if you need to fully explore ANOVAs, R will be your best bet. </p>
    </div>
</div>

---

## Import Packages

ANOVAs can be done as part of ```scipy```, so you will need that package.  You will also need some functions from ```statsmodels``` in order to run post hocs, and ```seaborn``` to plot your data when testing for normality.  In addition, you'll need ```pandas``` to load in your data and ```numpy``` to transform your data to meet the assumption of normality. 

```python
import pandas as pd
import numpy as np
import seaborn as sns  # used for the normality plot later on
import scipy
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multicomp import MultiComparison
```

---

## Load in Data

You will be examining **[data about the apps in the Google Play Store](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/googleplaystore.zip)**.  
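
The later code refers to a data frame named ```apps```, so the data needs to be read in under that name first. Below is a minimal sketch of that step. The file name ```googleplaystore.csv``` is an assumption about what the zip contains, and the three-row stand-in is made up so that the example runs on its own.

```python
import io
import pandas as pd

# Made-up stand-in for the unzipped file, so this sketch runs anywhere.
sample_csv = io.StringIO(
    "App,Category,Reviews\n"
    "App A,BEAUTY,120\n"
    "App B,FOOD_AND_DRINK,3400\n"
    "App C,PHOTOGRAPHY,56000\n"
)

# With the real data you would pass the path to the unzipped file instead,
# for example (assumed file name): apps = pd.read_csv("googleplaystore.csv")
apps = pd.read_csv(sample_csv)

print(apps.head())
```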

---

## Question Setup

With this data, you will answer the question: 

```text
Is there a difference in the number of reviews among the three app categories of beauty, food and drink, and photography? 
```

In order to answer this question, your x, or independent variable, will be the app category, which has three levels: beauty, food and drink, and photography.  Your y, or dependent variable, will be the number of reviews.  As with all ANOVAs, the IV will be categorical, and the DV will be continuous.

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>You aren't using "Price" like before with R because that variable has an embedded dollar sign value that makes things tricky in Python.</p>
    </div>
</div>

---

## Data Wrangling

Depending on the data that you've been given, it may need some wrangling! In this case, although you can run the actual ANOVA using the original data, you can't test the assumptions or run the post hocs unless the data is wrangled.  

---

### Focusing on the Three Categories

The data has many more categories than three, so you will need to filter the dataset by the categories you want: beauty, food and drink, and photography.

```python
categories = ['BEAUTY', 'FOOD_AND_DRINK','PHOTOGRAPHY']
apps1 = apps['Category'].isin(categories)
apps2 = apps[apps1].copy()
```

The code above makes a list of the categories you want to keep, then uses the ```isin()``` function on the ```Category``` column to flag each row as ```True``` or ```False```, depending on whether its category is in that list.  Then, you can apply those flags to your actual data frame to keep only the matching rows, being sure to use the ```.copy()``` function to change this from a slice into a data frame. 

---

### Subsetting to Only the Variables Needed

You only want to keep the two variables you'll need in your test: ```Category``` and ```Reviews```.

```python
apps3 = apps2[['Category','Reviews']]
```

---

### Changing ```Reviews``` to an Integer 

Your dependent variable will need to be numeric.  You can check what data type it currently has by using the ```.info()``` function: 

```python
apps3.info()
```

Here is the result: 

```text
<class 'pandas.core.frame.DataFrame'>
Int64Index: 515 entries, 98 to 10740
Data columns (total 2 columns):
Category    515 non-null object
Reviews     515 non-null object
dtypes: object(2)
memory usage: 12.1+ KB
```

Note that both ```Category``` and ```Reviews``` are non-null objects (strings). You'll want to convert ```Reviews``` to an integer, then: 

```python
apps3.Reviews = apps3.Reviews.astype(int)
```

It will give you a warning, because you still technically have a slice masquerading as a data frame:

```text
C:\Users\meredith.dodd\Anaconda3\lib\site-packages\pandas\core\generic.py:4401: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self[name] = value
```

But it's ok, because the command has still worked just fine: 

```python
apps3.info()
```

```text
<class 'pandas.core.frame.DataFrame'>
Int64Index: 515 entries, 98 to 10740
Data columns (total 2 columns):
Category    515 non-null object
Reviews     515 non-null int32
dtypes: int32(1), object(1)
memory usage: 10.1+ KB
```
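
If you would rather not see the warning at all, one option is to filter the rows, pick the columns, and make the copy in a single step, so that every later assignment happens on an independent data frame. The sketch below assumes a made-up four-row version of the data; the name ```subset``` is only for illustration.

```python
import pandas as pd

# Hypothetical miniature of the Play Store data, with Reviews stored as strings.
apps = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Category": ["BEAUTY", "FOOD_AND_DRINK", "PHOTOGRAPHY", "GAME"],
    "Reviews": ["120", "3400", "56000", "99"],
})

categories = ["BEAUTY", "FOOD_AND_DRINK", "PHOTOGRAPHY"]

# Filter rows and choose columns in one .loc call, then copy the result so it
# is its own data frame rather than a slice of the original.
subset = apps.loc[apps["Category"].isin(categories), ["Category", "Reviews"]].copy()

# Assigning to a column of an explicit copy does not trigger the warning.
subset["Reviews"] = subset["Reviews"].astype(int)

print(subset.dtypes)
```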

---

### Recoding ```Category``` to a Number

The post-hocs and assumptions won't take any string values, so you'll need to recode ```Category``` as well: 

```python
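def recode(category):
    # Return the numeric code for a single category name
    if category == "BEAUTY":
        return 0
    if category == "FOOD_AND_DRINK":
        return 1
    if category == "PHOTOGRAPHY":
        return 2

# Apply the function to every value in the Category column
apps3['CategoryR'] = apps3['Category'].apply(recode)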
```

You get the same warning as above, but again, if you use ```.head()``` to examine the data, you see that things have worked ok, so you can proceed.
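
The same recode can also be written with a dictionary and the pandas ```.map()``` function, which some people find easier to read and to look back at later. This is a sketch on a made-up four-row data frame called ```demo```; the codes match the ones used in this lesson.

```python
import pandas as pd

# Category name -> numeric code, matching the recode used in this lesson
category_codes = {"BEAUTY": 0, "FOOD_AND_DRINK": 1, "PHOTOGRAPHY": 2}

# Made-up example rows
demo = pd.DataFrame(
    {"Category": ["PHOTOGRAPHY", "BEAUTY", "FOOD_AND_DRINK", "BEAUTY"]}
)

# .map() looks each value up in the dictionary; names missing from the
# dictionary would come back as NaN
demo["CategoryR"] = demo["Category"].map(category_codes)

print(demo)
```

Keeping the dictionary around also gives you a ready-made key for reading the post hoc output later.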

---

### Dropping the Original ```Category``` Variable

But wait! You now have three variables again! Go ahead and drop the original ```Category``` variable, since its mere presence will throw off the work you'll do later.

```python
apps4 = apps3[['CategoryR','Reviews']]
```

And finally, eons later, you are all prepared to run a one-way ANOVA and all its assumptions and post-hoc tests. Phew! R required a lot less wrangling, because it is specifically meant for advanced statistics.

---

## Test Assumptions

Before you go any further, it's important to test for assumptions.  If the assumptions are not met for ANOVA, but you proceed anyway, you run the risk of biasing your results. 

---

### Normality

You only need to test for the normality of the dependent variable, since the IV is categorical. 

```python
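# distplot() is deprecated in newer seaborn releases;
# sns.histplot(apps4['Reviews'], kde=True) draws a similar plot
sns.distplot(apps4['Reviews'])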
```

Here is the result: 

![A graph with the representation of reviews on the x-axis and it ranges from 0.0 to 1.0 in six units. The Y-axis ranges from 0.00000000 to 0.0000035. The highest bar reaches the point 0.0000035.](Media/ANOVA6.png)

Looks like that isn't normal in any way - it is very highly positively skewed.  So, you'll need to transform reviews by taking the square root or the log.  Start with the square root: 

```python
apps4['ReviewsSQRT'] = np.sqrt(apps4['Reviews'])
```

That looks relatively normal, so keep it there: 

![A graph depicts the plot reviews SQRT. The x-axis represents reviewsSQRT and it ranges from 0 to 3500 in 8 units. The y-axis ranges from 0.000 to 0.005. A curve is plotted and it has a peak point that crosses 0.002 on the y-axis.](Media/ANOVA7.png)
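
Alongside the plots, a skewness number can help you compare transformations. The sketch below uses made-up, strongly right-skewed counts in place of ```Reviews```; the pandas ```.skew()``` function returns a value near 0 for a symmetric distribution and a large positive value for positive skew. Note that ```np.log()``` of a count of 0 is ```-inf```, so the sketch uses ```np.log1p()```, which takes the log of 1 plus the value; that choice is an assumption of this example, not something the lesson prescribes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up, heavily right-skewed counts standing in for Reviews
reviews = pd.Series(rng.lognormal(mean=6, sigma=2, size=500)).round()

skew_raw = reviews.skew()
skew_sqrt = np.sqrt(reviews).skew()
skew_log = np.log1p(reviews).skew()  # log1p avoids -inf when a count is 0

print("raw:", round(skew_raw, 2))
print("sqrt:", round(skew_sqrt, 2))
print("log1p:", round(skew_log, 2))
```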

---

### Homogeneity of Variance

Just like in R, you can test for homogeneity of variance using either Bartlett's test or Fligner's test.  Bartlett's test is for when your data is normally distributed, and Fligner's test is for when your data is not normally distributed. No matter which test you are using, you are looking for a non-significant result.  The null hypothesis for both of these is that the data has equal variance, so you'd like to have a *p* value of > .05.  Since you have corrected your data, you can use Bartlett's test, but just for learning purposes, you'll try both here.

---

#### Bartlett's Test

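To do Bartlett's test, use the function ```scipy.stats.bartlett()```.  This function treats every argument as a separate sample, so it is normally given one array of y values per group.  The call below instead passes the y data followed by the x codes, which means it compares the spread of ```ReviewsSQRT``` with the spread of the category codes rather than comparing the three categories, so read its result with that in mind.  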

```python
scipy.stats.bartlett(apps4['ReviewsSQRT'], apps4['CategoryR'])
```

Here is the output: 

```text
BartlettResult(statistic=6187.981817647605, pvalue=0.0)
```

The *p* value associated with this test is < .05, which means that unfortunately, you have violated the assumption of homogeneity of variance.
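
To compare the categories themselves, ```scipy.stats.bartlett()``` can be given one array of dependent-variable values per category. The sketch below shows that form on made-up data, where the third group is deliberately more spread out; the data frame ```demo``` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)

# Made-up data: three groups of 40, the last with a much larger spread
demo = pd.DataFrame({
    "CategoryR": np.repeat([0, 1, 2], 40),
    "ReviewsSQRT": np.concatenate([
        rng.normal(50, 5, 40),
        rng.normal(160, 5, 40),
        rng.normal(470, 50, 40),
    ]),
})

# One array of y values per level of the IV
group0 = demo.loc[demo["CategoryR"] == 0, "ReviewsSQRT"]
group1 = demo.loc[demo["CategoryR"] == 1, "ReviewsSQRT"]
group2 = demo.loc[demo["CategoryR"] == 2, "ReviewsSQRT"]

result = stats.bartlett(group0, group1, group2)
print(result.statistic, result.pvalue)
```

A *p* value below .05 here would again mean the groups do not have equal variance.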

---

#### Fligner's Test

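To perform Fligner's test, use the function ```scipy.stats.fligner()```.  Like ```scipy.stats.bartlett()```, it treats every argument as a separate sample, so passing the y data followed by the x codes compares those two columns with each other rather than comparing the three categories.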

```python
scipy.stats.fligner(apps4['ReviewsSQRT'], apps4['CategoryR'])
```

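Here is the output: 

```text
data: Price By Category
Fligner-Killeen: med chi-squared = 4.878, df = 2, p-value = 0.08725
```

Note that this output is in R's format and describes price by category, so it appears to be carried over from the R version of this lesson; its *p* value of .087 is above .05.  In Python, ```scipy.stats.fligner()``` returns a ```FlignerResult``` with a ```statistic``` and a ```pvalue```.  If that ```pvalue``` is < .05, you have violated the assumption of homogeneity of variance, which is what Bartlett's test already indicated for this data.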
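
When there are many levels, writing out one subset per group gets tedious. A possible shortcut is to let ```groupby()``` split the dependent variable and then unpack the pieces into ```scipy.stats.fligner()```. This is a sketch on made-up, non-normal data in which the third group is far more spread out; the names ```demo``` and ```samples``` are illustrative.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(5)

# Made-up skewed data: the third group has ten times the scale of the others
demo = pd.DataFrame({
    "CategoryR": np.repeat([0, 1, 2], 60),
    "ReviewsSQRT": np.concatenate([
        rng.exponential(1.0, 60),
        rng.exponential(1.0, 60),
        rng.exponential(10.0, 60),
    ]),
})

# Build one array per category, then pass them all as separate samples
samples = [group.to_numpy() for _, group in demo.groupby("CategoryR")["ReviewsSQRT"]]
result = stats.fligner(*samples)

print(result.statistic, result.pvalue)
```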

---

#### Correcting for Violations of Homogeneity of Variance

As you know, there are many different ways to correct for this violation in the general field of statistics.  However, the Python functions used in this lesson do not apply any of them! This means that you can run the ANOVA, but there is a good chance it will be inaccurate.  If you do choose to proceed with the analysis in Python, ensure that everyone consuming your results understands that the analysis could be inaccurate! 

It is recommended, however, that if you violate the assumption of homogeneity of variance that you switch over to R, and proceed from there.  You are becoming a guru in both languages for a reason!

---

### Sample Size

An ANOVA requires a sample size of at least 20 per independent variable.  In this case, you only have one independent variable, so as long as you have at least 20 cases, you are fine.  Looking at the data, the *n* is 515, so you are fine to proceed with this assumption!
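
The overall *n* does not tell you how the cases are spread across the categories, so it can be worth counting the rows in each group as well. A minimal sketch, using a made-up ten-row ```demo``` data frame:

```python
import pandas as pd

# Made-up codes: 3 beauty apps, 5 food and drink apps, 2 photography apps
demo = pd.DataFrame({"CategoryR": [0] * 3 + [1] * 5 + [2] * 2})

# Number of rows in each category, ordered by code
counts = demo["CategoryR"].value_counts().sort_index()

print(counts)
print("Total n:", len(demo))
print("Smallest group:", counts.min())
```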

---

### Independence 

There is no statistical test for the assumption of independence, so you can proceed!

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 9 - Computing ANOVAs with Equal Variance<a class="anchor" id="DS105L4_page_9"></a>

[Back to Top](#DS105L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">


In [7]:
from IPython.display import VimeoVideo
# Tutorial Video Name: Computing ANOVAs with Equal Variance
VimeoVideo('335514599', width=720, height=480)

The transcript for the above overview video **[is located here](https://repo.exeterlms.com/documents/V2/DataScience/Video-Transcripts/DSO105L04pg8-9tutorial.zip)**.

# Computing ANOVAs with Equal Variance (Met Homogeneity of Variance Assumption)

In this case, your data did not meet this assumption, but for the purposes of learning, you'll be shown what to do if you had.  

Below is the code to run a one-way ANOVA in Python. It uses the function ```stats.f_oneway()```, which takes one argument per level of your IV. Each argument holds the values of your DV for that level, so here ```Reviews``` is subset to the beauty apps, then to the food and drink apps, then to the photography apps.  The levels are separated by commas: 

```python
stats.f_oneway(apps['Reviews'][apps['Category']=='BEAUTY'],
                    apps['Reviews'][apps['Category']=='FOOD_AND_DRINK'],
                    apps['Reviews'][apps['Category']=='PHOTOGRAPHY'])
```

Which will provide the following output: 

```text
F_onewayResult(statistic=11.467490725511773, pvalue=1.342932747373518e-05)
```

Not much here, is there? Just the *F* value, under the name ```statistic```, and the *p* value.  Since the *p* value is less than .05, there is a significant difference in Reviews between these three categories.
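
If you have transformed your dependent variable to correct for normality, you may prefer to run the ANOVA on the transformed column. The sketch below does that on made-up data, building the per-category samples with ```groupby()``` and then unpacking the *F* and *p* values into their own names; ```demo```, ```f_stat```, and ```p_value``` are illustrative names.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2)

# Made-up transformed review counts for three categories of 30 apps each
demo = pd.DataFrame({
    "Category": np.repeat(["BEAUTY", "FOOD_AND_DRINK", "PHOTOGRAPHY"], 30),
    "ReviewsSQRT": np.concatenate([
        rng.normal(50, 10, 30),
        rng.normal(55, 10, 30),
        rng.normal(90, 10, 30),
    ]),
})

# One sample per level of the IV
samples = [group.to_numpy() for _, group in demo.groupby("Category")["ReviewsSQRT"]]

# The result unpacks into the F statistic and the p value
f_stat, p_value = stats.f_oneway(*samples)

print("F =", round(f_stat, 2), "p =", p_value)
```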

---

# Computing ANOVAs with Unequal Variance (Violated Homogeneity of Variance Assumption)

The ```stats.f_oneway()``` call shown above assumes equal variance and does not correct for unequal variance! Either switch over to R or be VERY CAUTIOUS when interpreting your results, and don't use them for anything high stakes!
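
That said, recent releases of ```statsmodels``` do appear to include a Welch-type one-way ANOVA, ```anova_oneway()```, which belongs to the same family as the Welch's one-way test listed in this lesson's key terms. The sketch below is an illustration on made-up data, not part of the lesson's workflow; check that your installed ```statsmodels``` version provides this function before relying on it.

```python
import numpy as np
from statsmodels.stats.oneway import anova_oneway

rng = np.random.default_rng(3)

# Made-up groups with clearly unequal spreads
beauty = rng.normal(50, 5, 40)
food_and_drink = rng.normal(60, 20, 40)
photography = rng.normal(90, 60, 40)

# use_var="unequal" asks for the version that does not assume equal variance
welch = anova_oneway(
    [beauty, food_and_drink, photography],
    use_var="unequal",
    welch_correction=True,
)

print(welch.statistic, welch.pvalue)
```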

---

## Post Hocs

It's important to run post hocs to figure out which groups significantly differed from each other. The post hoc covered in this lesson for Python is Tukey's, which is built into ```statsmodels```, so that is what you will learn.

---

### Computing Post Hocs with Tukey's

Here is the code for computing a Tukey's post hoc in Python: 

```python
postHoc = MultiComparison(apps4['ReviewsSQRT'], apps4['CategoryR'])
postHocResults = postHoc.tukeyhsd()
print(postHocResults)
```

First you use the ```MultiComparison()``` function to specify the variables to use.  Then, you call the ```tukeyhsd()``` function to run the Tukey's correction on the data.  Finally, you can print the results, which are shown below: 

```text
Multiple Comparison of Means - Tukey HSD,FWER=0.05
===============================================
group1 group2 meandiff  lower    upper   reject
-----------------------------------------------
  0      1     111.89  -93.3166 317.0966 False 
  0      2    419.474  233.9713 604.9768  True 
  1      2    307.584  176.819  438.3491  True 
-----------------------------------------------
```

Interpreting this is a little harder than in R, because you've been forced to recode your categorical IV as numbers.  So, make sure you refer back to the recode function to remember which number is which: 0 stands for beauty apps, 1 stands for food and drink apps, and 2 stands for photography apps. This output provides you with the mean difference for each comparison (here in square-root units, since the DV is ```ReviewsSQRT```), plus the confidence interval (```lower``` and ```upper``` columns), and whether or not you can reject the null hypothesis. If the value in the ```reject``` column is ```True```, then there was a significant difference in the means between those groups.  So, there is a significant difference in the number of reviews between photography apps and both beauty apps and food and drink apps, but not between beauty apps and food and drink apps. What is that difference? Well, you will have to examine the means.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Because you are using a transformed dependent variable, it is difficult to interpret the mean difference presented in the table above.</p>
    </div>
</div>
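
One way to make the Tukey table easier to read is to pass the category names as the group labels, so the rows are labeled with names instead of codes. Recent ```statsmodels``` versions appear to accept string labels in ```pairwise_tukeyhsd()```; if yours does not, the numeric codes still work. The data below is made up for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(4)

# Made-up data: photography apps sit far above the other two categories
demo = pd.DataFrame({
    "Category": np.repeat(["BEAUTY", "FOOD_AND_DRINK", "PHOTOGRAPHY"], 30),
    "ReviewsSQRT": np.concatenate([
        rng.normal(50, 10, 30),
        rng.normal(52, 10, 30),
        rng.normal(120, 10, 30),
    ]),
})

# endog is the dependent variable, groups holds the label for each row
tukey = pairwise_tukeyhsd(
    endog=demo["ReviewsSQRT"],
    groups=demo["Category"],
    alpha=0.05,
)

print(tukey.summary())
```

The comparisons are listed in alphabetical order of the labels, so here the rows are beauty versus food and drink, beauty versus photography, and food and drink versus photography.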

---

### Computing Post Hocs When You've Violated the Assumption of Homogeneity of Variance

Tukey's post hoc assumes equal variance, and this lesson does not cover a post hoc for unequal variance in Python! Either switch over to R or be VERY CAUTIOUS when interpreting your results, and don't use them for anything high stakes!

---

## Determine Means and Draw Conclusions

The last step is just to examine the means, to determine which apps had the highest and lowest number of reviews.  

```python
apps4.groupby('CategoryR').mean()
```

The ```groupby()``` function allows you to specify a grouping variable for an entire dataset, and you can then call the ```.mean()``` function on top of it.

Here's the result: 

![A table has four columns labeled category R, reviews, reviews SQRT, and reviews LOG. The row entries are as follows. Row 1, 0, 7476.226415, 48.854024, minus inf. Row 2, 1, 69947.480315, 160.744038, minus inf. Row 3, 2, 637363.134328, 468.328067, minus inf.](Media/ANOVA8.png)

Looking at the reviews column, which has the means, you can say that photography apps had significantly more reviews than both beauty and food and drink apps. 
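
To avoid flipping back to the recode step when reading the means, you could map the codes back to names before grouping, and show the group sizes next to the means. This is a sketch with made-up numbers; the ```labels``` dictionary simply reverses the recode used in this lesson.

```python
import pandas as pd

# Made-up review counts, two apps per category code
demo = pd.DataFrame({
    "CategoryR": [0, 0, 1, 1, 2, 2],
    "Reviews": [10, 30, 100, 300, 1000, 3000],
})

# Reverse of the recode used earlier: code -> readable name
labels = {0: "beauty", 1: "food and drink", 2: "photography"}

summary = (
    demo.assign(Category=demo["CategoryR"].map(labels))
        .groupby("Category")["Reviews"]
        .agg(["count", "mean"])
)

print(summary)
```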

---

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 10 - One Way ANOVA in Python Activity<a class="anchor" id="DS105L4_page_10"></a>

[Back to Top](#DS105L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

For your Activity, you will be computing a one-way, between subjects ANOVA in Python to see if the number of video views on YouTube differs by the grade a channel received. `This Activity will not be graded`, but you are encouraged to complete it. The best way to become a great data scientist is to practice! Once you have submitted your project, you will be able to access the solution on the next page. Note that the solution will be slightly different from yours, but should look similar.

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Requirements

Using the **[YouTube Channels dataset edited for use in Python](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/YouTubeChannels_Python.zip)**, determine if there is a difference in the number of views (```Video views```) between the different grade categories (```Grade```). To do this, you will need to: 

* Test for all assumptions and correct for them if possible
* Run an ANOVA 
* If significant, run a post hoc
* Interpret your results 

Then write an overall, one-sentence conclusion about this data analysis. 

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>


<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 11 - One Way ANOVA in Python Activity Solution<a class="anchor" id="DS105L4_page_11"></a>

[Back to Top](#DS105L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Solution

Please look in **[this Python Notebook](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/DSO105oneWayANOVAsActivity.zip)** for the activity solution! 

<div class="panel panel-info">
    <div class="panel-heading">
        <h3 class="panel-title">Tip!</h3>
    </div>
    <div class="panel-body">
        <p>You will need to extract the zip file and save its contents to your computer, then open the notebook in Jupyter Notebook in order to view the solution.</p>
    </div>
</div>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 12 - Key Terms<a class="anchor" id="DS105L4_page_12"></a>

[Back to Top](#DS105L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Key Terms

Below is a list and short description of the important keywords learned in this lesson. Please read through and go back and review any concepts you do not fully understand. Great Work!

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Multivariate</td>
        <td>Multiple variables.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Analysis of Variance (ANOVA)</td>
        <td>A statistical analysis with one or more categorical independent variables and one continuous dependent variable.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Between Subjects</td>
        <td>A type of ANOVA where the different levels of the IV are independent; like an independent t-test.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Within Subjects</td>
        <td>A type of ANOVA where the different levels of the IV are related; like a dependent t-test.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Repeated Measures</td>
        <td>Another term for within subjects.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>One-Way ANOVA</td>
        <td>An ANOVA with only one independent variable.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Factorial ANOVA</td>
        <td>An ANOVA with more than one independent variable (also called a factor).</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Mixed Measures ANOVA</td>
        <td>An ANOVA that uses both between and within subjects independent variables.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Homogeneity of Variance</td>
        <td>Equal variance.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Heterogeneity of Variance</td>
        <td>Unequal variance.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Independence</td>
        <td>An assumption that independent variables are not related to each other.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Sphericity</td>
        <td>An assumption that each level of independent variable is not related to any other level. Only pertains to repeated measures designs.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Bartlett's Test</td>
        <td>A test of homogeneity of variance to be used when your data is normally distributed.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Fligner's Test</td>
        <td>A test of homogeneity of variance to be used when your data is not normally distributed.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>Welch's One-Way Test</td>
        <td>A type of ANOVA that can be used when you don't meet the assumption of homogeneity of variance.</td>
    </tr>
</table>

---

## Key R Libraries

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>car</td>
        <td>A library that provides the Anova() function, which runs ANOVAs on linear models.</td>
    </tr>
</table>

---

## Key R Code

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>bartlett.test()</td>
        <td>Performs Bartlett's Test for homogeneity of variance.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>fligner.test()</td>
        <td>Performs Fligner's Test for homogeneity of variance.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>aov()</td>
        <td>Performs an ANOVA.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>as.numeric()</td>
        <td>Makes a variable numeric.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>lm()</td>
        <td>Creates a linear model.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>pairwise.t.test()</td>
        <td>Conducts a post-hoc.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>p.adjust=</td>
        <td>An argument for pairwise.t.test() that allows you to choose the type of post hoc.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>pool.sd=</td>
        <td>An argument for pairwise.t.test() that determines whether or not you have equal variance.  Choose =TRUE if you met the assumption of homogeneity of variance.</td>
    </tr>
</table>

---

## Key Python Packages

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>statsmodels.stats.multicomp</td>
        <td>A module that provides post hoc tests.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>pairwise_tukeyhsd</td>
        <td>A function that runs Tukey's post hoc.</td>
    </tr>
</table>

---

## Key Python Code

<table class="table table-striped">
    <tr>
        <th>Keyword</th>
        <th>Description</th>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>.copy()</td>
        <td>Makes a copy of your dataset.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>.isin()</td>
        <td>A function that flags the rows whose values appear in a given list.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>.astype()</td>
        <td>A function to change the type of variable.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>sns.distplot()</td>
        <td>Creates a histogram with a fitted curve.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>scipy.stats.bartlett()</td>
        <td>Performs a Bartlett's Test for homogeneity of variance.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>scipy.stats.fligner()</td>
        <td>Performs a Fligner's Test for homogeneity of variance.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>stats.f_oneway()</td>
        <td>Performs a one-way ANOVA.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>MultiComparison()</td>
        <td>Performs post-hocs.</td>
    </tr>
    <tr>
        <td style="font-weight: bold;" nowrap>.groupby()</td>
        <td>Group data based on other variables.</td>
    </tr>
</table>

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

# Page 13 - Lesson 4 Hands-On<a class="anchor" id="DS105L4_page_13"></a>

[Back to Top](#DS105L4_toc)

<hr style="height:10px;border-width:0;color:gray;background-color:gray">

For this Hands On, you will be analyzing avocado prices and sales by location in both Python and R.  

<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Do not submit your project until you have completed all requirements, as you will not be able to resubmit.</p>
    </div>
</div>

---

## Requirements

This hands on uses a dataset about avocado sales across the country. It is located **[here](https://repo.exeterlms.com/documents/V2/DataScience/Intermediate-Stats/avocados.zip)**. For your amusement, here is the flavor text describing this dataset: 

> "It is a well known fact that Millenials LOVE Avocado Toast. It's also a well known fact that all Millenials live in their parents basements.
>
> Clearly, they aren't buying home because they are buying too much Avocado Toast!
>
> But maybe there's hope... if a Millenial could find a city with cheap avocados, they could live out the Millenial American Dream."

For each part, check and correct for assumptions if possible, perform the appropriate ANOVA, and provide a one-sentence conclusion at the bottom of your program files about the analysis you performed. 

---

### Part I: ANOVAs in R

Please answer the following question.

>Does the average price of avocados differ between Albany, Houston, and Seattle?

---

### Part II: ANOVAs in Python

Please answer the following question.

>Does the total volume of avocados sold differ between Indianapolis, Orlando, and PhoenixTuscon?


<div class="panel panel-danger">
    <div class="panel-heading">
        <h3 class="panel-title">Caution!</h3>
    </div>
    <div class="panel-body">
        <p>Be sure to zip and submit your entire directory when finished!</p>
    </div>
</div>
