<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#ANOVA---An-Acronym,-Not-a-Stellar-Object" data-toc-modified-id="ANOVA---An-Acronym,-Not-a-Stellar-Object-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>ANOVA - An Acronym, Not a Stellar Object</a></span><ul class="toc-item"><li><span><a href="#How-it-works---(let's-try-to-avoid-some-math)" data-toc-modified-id="How-it-works---(let's-try-to-avoid-some-math)-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>How it works - (let's try to avoid some math)</a></span><ul class="toc-item"><li><span><a href="#How-much-the-groups-vary" data-toc-modified-id="How-much-the-groups-vary-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>How much the groups vary</a></span></li><li><span><a href="#How-much-do-the-groups-vary-from-within" data-toc-modified-id="How-much-do-the-groups-vary-from-within-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>How much do the groups vary from within</a></span></li><li><span><a href="#Calculate-F-statistic" data-toc-modified-id="Calculate-F-statistic-1.1.3"><span class="toc-item-num">1.1.3&nbsp;&nbsp;</span>Calculate F-statistic</a></span></li></ul></li><li><span><a href="#Assumptions" data-toc-modified-id="Assumptions-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Assumptions</a></span></li></ul></li><li><span><a href="#Code-It-Up!" data-toc-modified-id="Code-It-Up!-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Code It Up!</a></span><ul class="toc-item"><li><span><a href="#SciPy:-Using-f-Oneway-for-ANOVA-Test" data-toc-modified-id="SciPy:-Using-f-Oneway-for-ANOVA-Test-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>SciPy: Using f-Oneway for ANOVA Test</a></span></li><li><span><a href="#Knowledge-Check!-🧠-ANOVA-Table-with-Statsmodels" data-toc-modified-id="Knowledge-Check!-🧠-ANOVA-Table-with-Statsmodels-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Knowledge Check! 
🧠 ANOVA Table with Statsmodels</a></span><ul class="toc-item"><li><span><a href="#Let's-get-only-some-of-the-columns-(start-simple)" data-toc-modified-id="Let's-get-only-some-of-the-columns-(start-simple)-2.2.1"><span class="toc-item-num">2.2.1&nbsp;&nbsp;</span>Let's get only some of the columns (start simple)</a></span></li><li><span><a href="#Any-cleaning-of-the-columns-(renaming-too?)" data-toc-modified-id="Any-cleaning-of-the-columns-(renaming-too?)-2.2.2"><span class="toc-item-num">2.2.2&nbsp;&nbsp;</span>Any cleaning of the columns (renaming too?)</a></span></li></ul></li></ul></li></ul></div>

# ANOVA - An Acronym, Not a Stellar Object 

> https://en.wikipedia.org/wiki/Analysis_of_variance

_Well, it is pretty stellar but not in the space sense_

> Stands for "analysis of variance"

Looking to explain the total variance in the data as a combination of variance between the groups and variance within the groups

> test all the things!

## How it works - (let's try to avoid some math)

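Like all tests, we calculate a statistic (here the F-ratio, a.k.a. the F-statistic) and use it to get a p-value, which we compare with our significance level $\alpha$ (equivalently, we can compare the statistic itself with the critical value)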

### How much the groups vary 

We're after the between-group sum of squares, and we get there by starting from the total variation

$SS_{mean}$: Squared residuals from the grand mean (all the groups together), summed

### How much do the groups vary from within 

the within-group sum of squares

$SS_{resid}$: Each group's squared residuals from its own group mean, summed

### Calculate F-statistic

Kinda-Ratio: $$\frac{SS_{mean}-SS_{resid}}{SS_{resid}}$$


This is (basically) the ratio behind the F-statistic, but it leaves out the degrees of freedom that turn each sum of squares into a mean square
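
For reference, here is the full statistic with the degrees of freedom put back in. The symbols $k$ (number of groups) and $N$ (total number of observations) are introduced here for illustration and do not appear in the notes above:

$$F = \frac{(SS_{mean}-SS_{resid})/(k-1)}{SS_{resid}/(N-k)}$$

The numerator is the between-group mean square and the denominator is the within-group mean square.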

Check out this source for more: https://pythonfordatascience.org/anova-python/
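
To make the "kinda-ratio" concrete, here is a minimal sketch that computes both sums of squares by hand and then rescales by the degrees of freedom (number of groups minus one, and number of observations minus number of groups). The three toy groups are made up for illustration and are not part of this notebook's data; the result is compared against `scipy.stats.f_oneway` as a sanity check.

```python
import numpy as np
from scipy import stats

# Hypothetical toy groups (made up for illustration)
groups = [
    np.array([1.0, 2.0, 3.0]),
    np.array([2.0, 3.0, 4.0]),
    np.array([5.0, 6.0, 7.0]),
]

all_values = np.concatenate(groups)
n_total = all_values.size   # total number of observations
n_groups = len(groups)      # number of groups

# SS_mean: squared residuals from the grand mean, all groups pooled
ss_mean = np.sum((all_values - all_values.mean()) ** 2)

# SS_resid: squared residuals from each group's own mean, summed over groups
ss_resid = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# The "kinda-ratio" from the notes (no degrees of freedom yet)
kinda_ratio = (ss_mean - ss_resid) / ss_resid

# Divide each piece by its degrees of freedom to get the F-statistic
df_between = n_groups - 1
df_within = n_total - n_groups
f_manual = ((ss_mean - ss_resid) / df_between) / (ss_resid / df_within)

# SciPy should agree with the hand calculation
f_scipy, p_scipy = stats.f_oneway(*groups)

print(f"kinda-ratio = {kinda_ratio:.4f}")
print(f"F (by hand) = {f_manual:.4f}, F (SciPy) = {f_scipy:.4f}")
```

The only difference between the kinda-ratio and the F-statistic is the factor `df_within / df_between`.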

## Assumptions 

1. The samples are independent.
2. Each sample is from a normally distributed population.
3. The population standard deviations of the groups are all equal. This property is known as homoscedasticity.

No good? To another test! (The suggested fallback is the Kruskal-Wallis H-test, though it has less power: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kruskal.html#scipy.stats.kruskal)
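
One rough way to act on these assumptions is sketched below. It is a heuristic, not a definitive procedure: it uses the Shapiro-Wilk test (`scipy.stats.shapiro`) as a normality check per group and Levene's test (`scipy.stats.levene`) as an equal-variance check, then falls back to Kruskal-Wallis if either looks violated. The samples are synthetic, and the choice of these two checks and of `alpha = 0.05` are assumptions made for illustration. Independence cannot be tested this way; it comes from how the data were collected.

```python
import numpy as np
from scipy import stats

# Synthetic samples (made up for illustration): three groups, shifted means
rng = np.random.default_rng(42)
samples = [rng.normal(loc=mu, scale=1.0, size=30) for mu in (0.0, 0.5, 1.0)]

alpha = 0.05  # assumed significance level

# Normality check per group (Shapiro-Wilk); small p-value suggests non-normality
shapiro_pvalues = [stats.shapiro(s)[1] for s in samples]

# Equal-variance check across groups (Levene); small p-value suggests unequal spread
levene_stat, levene_p = stats.levene(*samples)

assumptions_ok = all(p > alpha for p in shapiro_pvalues) and levene_p > alpha

if assumptions_ok:
    test_used = "one-way ANOVA"
    statistic, p_value = stats.f_oneway(*samples)
else:
    test_used = "Kruskal-Wallis H-test"
    statistic, p_value = stats.kruskal(*samples)

print(f"{test_used}: statistic = {statistic:.3f}, p = {p_value:.4f}")
```

Plots (like the box and violin plots later in this notebook) are a useful complement to these checks, since formal tests can miss problems in small samples and flag trivial ones in large samples.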

# Code It Up!

## SciPy: Using f-Oneway for ANOVA Test

SciPy time: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html

In [None]:
import scipy.stats as stats

import numpy as np
import pandas as pd

# Plotting 
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Data found from above URL
tillamook = [0.0571, 0.0813, 0.0831, 0.0976, 0.0817, 0.0859, 0.0735,
             0.0659, 0.0923, 0.0836]
newport = [0.0873, 0.0662, 0.0672, 0.0819, 0.0749, 0.0649, 0.0835,
           0.0725]
petersburg = [0.0974, 0.1352, 0.0817, 0.1016, 0.0968, 0.1064, 0.105]
magadan = [0.1033, 0.0915, 0.0781, 0.0685, 0.0677, 0.0697, 0.0764,
           0.0689]
tvarminne = [0.0703, 0.1026, 0.0956, 0.0973, 0.1039, 0.1045]

# Long-format (area, value) pairs so we can build a nice DF
data = []
data += [('tillamook', v) for v in tillamook]
data += [('newport', v) for v in newport]
data += [('petersburg', v) for v in petersburg]
data += [('magadan', v) for v in magadan]
data += [('tvarminne', v) for v in tvarminne]

In [None]:
df = pd.DataFrame(data=data, columns=['area','shell_standardized'])

In [None]:
df.head(20)

In [None]:
fig, ax = plt.subplots(figsize=(20,10))

sns.boxplot(
    x="area",
    y="shell_standardized",
    data=df,
    ax=ax,
    color='aqua', 
    linewidth=4
)

sns.swarmplot(
    x="area",
    y="shell_standardized",
    data=df,
    ax=ax,
    color='orange', 
    alpha=0.9, 
    size=12
)

In [None]:
fig, ax = plt.subplots(figsize=(20,10))

sns.violinplot(
    y="shell_standardized", 
    x="area", 
    data=df, 
    ax=ax,
    color='aqua',
    inner="quartile",  # Draw the quartiles (median included) inside each violin
    bw=.3              # KDE bandwidth: how much smoothing (seaborn >= 0.13 prefers bw_method)
)
sns.swarmplot(ax=ax, x="area", y="shell_standardized", data=df, color='orange', alpha=0.9, size=12)

In [None]:
# Do the group means differ significantly? Returns (F-statistic, p-value)
stats.f_oneway(tillamook, newport, petersburg, magadan, tvarminne)
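
To read the result, a minimal sketch: unpack the statistic and p-value and compare the p-value with a significance level. The data are repeated so the snippet runs on its own; `alpha = 0.05` is an assumed threshold, not something this notebook prescribes.

```python
import scipy.stats as stats

# Same shell measurements as in the cells above
tillamook = [0.0571, 0.0813, 0.0831, 0.0976, 0.0817, 0.0859, 0.0735,
             0.0659, 0.0923, 0.0836]
newport = [0.0873, 0.0662, 0.0672, 0.0819, 0.0749, 0.0649, 0.0835,
           0.0725]
petersburg = [0.0974, 0.1352, 0.0817, 0.1016, 0.0968, 0.1064, 0.105]
magadan = [0.1033, 0.0915, 0.0781, 0.0685, 0.0677, 0.0697, 0.0764,
           0.0689]
tvarminne = [0.0703, 0.1026, 0.0956, 0.0973, 0.1039, 0.1045]

alpha = 0.05  # assumed significance level

f_stat, p_value = stats.f_oneway(tillamook, newport, petersburg,
                                 magadan, tvarminne)

# Null hypothesis: all five areas share the same population mean
reject_null = p_value < alpha

print(f"F = {f_stat:.3f}, p = {p_value:.5f}")
print("Reject H0 (at least one area's mean differs):", reject_null)
```

A significant result only says that at least one group mean differs; it does not say which one.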

## Knowledge Check! 🧠 ANOVA Table with Statsmodels

In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [None]:
# Read in data/flavors_of_cacao.csv
df_cacao = pd.read_csv('data/flavors_of_cacao.csv')
df_cacao.head()

### Let's get only some of the columns (start simple)

In [None]:
# Let's see the columns
df_cacao.columns

In [None]:
# Choose target as 'Rating' & at least one feature


### Any cleaning of the columns (renaming too?)

In [None]:
# Renaming?

In [None]:
# Anything else? (type?)

In [None]:
# Perform the ANOVA (statsmodels ols)