# ANOVA

In statistics, **Analysis of Variance (ANOVA)** is used to analyze the differences among group means. The difference between the t-test and ANOVA is that the former is used to compare two groups, whereas the latter is used to compare three or more groups. [Read more about the difference between t-test and ANOVA](http://b.link/anova24).

An ANOVA test returns two numbers. The first is the **F-value**, which is compared against a critical value to decide whether the null hypothesis can be rejected. The critical F-value depends on the total number of subjects and the number of subject groups in your experiment. You can find the critical values of the F-distribution in [this table](http://b.link/eda14). **If you are confused by the massive F-distribution table, don't worry. Skip the F-value for now and study it later. In this challenge you only need to look at the p-value.**

The second number is the **p-value**, which already takes the total number of subjects and the number of experiment groups into account. **Typically, if your p-value is less than 0.05, you can reject the null hypothesis.**
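To make the link between the F-value, its critical value, and the p-value concrete, here is a minimal sketch using `scipy.stats.f`. The group count (18) and sample size (800) are assumptions taken from the `Type 1` grouping used later in this notebook, and `f_observed` is a made-up number for illustration only.

```python
from scipy import stats

# Assumed setup: 18 groups (Type 1 values) and 800 observations in total.
n_groups = 18
n_total = 800

# Degrees of freedom for a one-way ANOVA.
df_between = n_groups - 1        # numerator degrees of freedom
df_within = n_total - n_groups   # denominator degrees of freedom

# Critical F-value at the 5% significance level.
f_critical = stats.f.ppf(0.95, df_between, df_within)

# p-value for a hypothetical observed F statistic (illustrative number only).
f_observed = 2.5
p_value = stats.f.sf(f_observed, df_between, df_within)

print(f"critical F: {f_critical:.3f}")
print(f"p-value for F = {f_observed}: {p_value:.5f}")
```

An observed F above `f_critical` corresponds to a p-value below 0.05, which is why reading the p-value alone is enough in this challenge.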

In this challenge, we want to understand whether there are significant differences in the `Total` value among the various pokemon types, i.e. Grass vs Poison vs Fire vs Dragon... There are many pokemon types, which makes this a perfect use case for ANOVA.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Load the data:
pokemon = pd.read_csv('pokemon.txt', sep=',')

In [3]:
pokemon.head(30)

,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
5,5,Charmeleon,Fire,,405,58,64,58,80,65,80,1,False
6,6,Charizard,Fire,Flying,534,78,84,78,109,85,100,1,False
7,6,CharizardMega Charizard X,Fire,Dragon,634,78,130,111,130,85,100,1,False
8,6,CharizardMega Charizard Y,Fire,Flying,634,78,104,78,159,115,100,1,False
9,7,Squirtle,Water,,314,44,48,65,50,64,43,1,False


In [4]:
pokemon.shape

(800, 13)

**To achieve our goal, we use three steps:**

1. **Extract the unique values of the pokemon types.**

1. **Select dataframes for each unique pokemon type.**

1. **Conduct ANOVA analysis across the pokemon types.**

#### First let's obtain the unique values of the pokemon types. These values should be extracted from Type 1 and Type 2 aggregated. Assign the unique values to a variable called `unique_types`.

*Hint: the correct number of unique types is 19, including `NaN`. You can disregard `NaN` in the next step.*

In [5]:
unique_types1 = list(pokemon['Type 1'].unique())

In [6]:
unique_types1

['Grass',
 'Fire',
 'Water',
 'Bug',
 'Normal',
 'Poison',
 'Electric',
 'Ground',
 'Fairy',
 'Fighting',
 'Psychic',
 'Rock',
 'Ghost',
 'Ice',
 'Dragon',
 'Dark',
 'Steel',
 'Flying']

In [7]:
# 18 unique values

In [8]:
unique_types2 = list(pokemon['Type 2'].unique())

In [9]:
unique_types2

['Poison',
 nan,
 'Flying',
 'Dragon',
 'Ground',
 'Fairy',
 'Grass',
 'Fighting',
 'Psychic',
 'Steel',
 'Ice',
 'Rock',
 'Dark',
 'Water',
 'Electric',
 'Fire',
 'Ghost',
 'Bug',
 'Normal']

In [48]:
# NC: 19 unique values - we go with unique_types2

# If you actually aggregate Type 1 and Type 2, you get over 130 Types out, so that can't be correct?

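# Stack Type 1 on top of Type 2, then take the unique values (Series.append was removed in pandas 2.0)
unique_types = pd.concat([pokemon['Type 1'], pokemon['Type 2']]).unique()
len(unique_types)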


19
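A small made-up frame can show the difference between stacking the two type columns and combining them row by row. The second variant counts type pairs rather than individual types, which may be one way to end up with a much larger number on the full data. The names `toy`, `stacked`, and `pairs` are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the Type 1 / Type 2 columns.
toy = pd.DataFrame({
    "Type 1": ["Grass", "Fire", "Fire", "Water"],
    "Type 2": ["Poison", np.nan, "Flying", np.nan],
})

# Stacking: put both columns in one long Series, then take unique values.
stacked = pd.concat([toy["Type 1"], toy["Type 2"]]).unique()

# Row-wise combination: this yields unique *pairs*, not unique types.
pairs = (toy["Type 1"] + "/" + toy["Type 2"].fillna("none")).unique()

# Valid type names are strings; NaN is a float and is skipped here.
valid_types = [t for t in stacked if isinstance(t, str)]

print(len(stacked), len(pairs), len(valid_types))
```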

In [None]:
# Because Type 2 gives us a lot of NaNs and is empty after we get rid of the NaNs, I try again with Type 1 only:

In [38]:
pokemon_total = pokemon.pivot_table(index = ['Type 1'], values = 'Total', aggfunc= {'Total':'sum'})

In [41]:
pokemon_total_2 = pokemon_total.T

In [42]:
pokemon_total_2.head()

Type 1,Bug,Dark,Dragon,Electric,Fairy,Fighting,Fire,Flying,Ghost,Grass,Ground,Ice,Normal,Poison,Psychic,Rock,Steel,Water
Total,26146,13818,17617,19510,7024,11244,23820,1940,14066,29480,14000,10403,39365,11176,27129,19965,13168,48211


In [58]:
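# Alternative: reshape to one column per Type 1, so each column holds that type's Total values

# Number each pokemon within its Type 1 group (0, 1, 2, ...) to use as the row index
pokemon['pokemon_count'] = pokemon.groupby('Type 1').cumcount()

# Wide table: rows = position within the group, columns = Type 1, cells = Total
pokemon_pivot = pokemon.pivot(index='pokemon_count', columns='Type 1', values='Total')
pokemon_pivot.columns = [str(x) for x in pokemon_pivot.columns.values]
pokemon_pivot.head(50)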

pokemon_count,Bug,Dark,Dragon,Electric,Fairy,Fighting,Fire,Flying,Ghost,Grass,Ground,Ice,Normal,Poison,Psychic,Rock,Steel,Water
0,195.0,525.0,300.0,320.0,323.0,305.0,309.0,580.0,310.0,318.0,300.0,455.0,251.0,288.0,310.0,300.0,510.0,314.0
1,205.0,405.0,420.0,485.0,483.0,455.0,405.0,580.0,405.0,405.0,450.0,580.0,349.0,438.0,400.0,390.0,610.0,405.0
2,395.0,430.0,600.0,325.0,218.0,305.0,534.0,245.0,500.0,525.0,265.0,250.0,479.0,275.0,500.0,495.0,465.0,530.0
3,195.0,330.0,490.0,465.0,245.0,405.0,634.0,535.0,600.0,625.0,405.0,450.0,579.0,365.0,590.0,385.0,380.0,630.0
4,205.0,500.0,590.0,330.0,405.0,505.0,634.0,,435.0,320.0,320.0,330.0,253.0,505.0,328.0,355.0,480.0,320.0
5,395.0,600.0,300.0,480.0,300.0,455.0,299.0,,295.0,395.0,425.0,305.0,413.0,273.0,483.0,495.0,330.0,500.0
6,495.0,220.0,420.0,490.0,450.0,455.0,505.0,,455.0,490.0,345.0,300.0,262.0,365.0,460.0,355.0,430.0,300.0
7,285.0,420.0,600.0,525.0,545.0,210.0,350.0,,555.0,300.0,485.0,480.0,442.0,505.0,680.0,495.0,530.0,385.0
8,405.0,380.0,700.0,580.0,303.0,455.0,555.0,,295.0,390.0,430.0,580.0,270.0,245.0,780.0,515.0,630.0,510.0
9,305.0,480.0,600.0,205.0,371.0,237.0,410.0,,455.0,490.0,330.0,290.0,435.0,455.0,780.0,615.0,300.0,335.0


In [59]:
pokemon_pivot.shape

(112, 18)

In [61]:
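# Caution: filling NaN with 0 adds artificial zero-valued observations to the smaller groups,
# which lowers their means and affects the ANOVA result
pokemon_pivot = pokemon_pivot.fillna(0)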

#### Second: we will create a list named `pokemon_totals` to contain the `Total` values of each unique pokemon type.

Why do we use a list instead of a dictionary to store the pokemon `Total` values? Because ANOVA only tells us whether there is a significant difference among the group means; it does not tell us which group(s) are significantly different. Therefore, we don't need to know which `Total` values belong to which pokemon type.

*Hints:*

* Loop through `unique_types` and append the selected type's `Total` values to `pokemon_totals`.
* Skip the `NaN` value in `unique_types`. `NaN` is a `float`, which you can check with `type()`. All valid pokemon type values are of type `str`.
* At the end, the length of your `pokemon_totals` should be 18.

In [16]:
# Don't understand these steps, didn't work for me, left it - please explain this in class, if we have time!

list

[[26146,
  13818,
  17617,
  19510,
  7024,
  11244,
  23820,
  1940,
  14066,
  29480,
  14000,
  10403,
  39365,
  11176,
  27129,
  19965,
  13168,
  48211]]
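The hints above can be sketched as follows. This is one reading of the exercise, not necessarily the intended solution: it assumes a pokemon belongs to a type if that type appears in either `Type 1` or `Type 2`, and it runs on a small made-up frame instead of `pokemon.txt`. The helper name `totals_by_type` is hypothetical.

```python
import numpy as np
import pandas as pd

def totals_by_type(df, unique_types):
    """Return one Series of Total values per valid (string) type."""
    groups = []
    for t in unique_types:
        # NaN is a float, valid type names are strings: skip anything else.
        if not isinstance(t, str):
            continue
        # Assumption: a row counts for type t if either type column matches.
        mask = (df["Type 1"] == t) | (df["Type 2"] == t)
        groups.append(df.loc[mask, "Total"])
    return groups

# Made-up rows in the same layout as the pokemon data.
toy = pd.DataFrame({
    "Type 1": ["Grass", "Fire", "Fire", "Water", "Bug"],
    "Type 2": ["Poison", np.nan, "Flying", np.nan, "Poison"],
    "Total": [318, 309, 534, 314, 195],
})

unique_types = pd.concat([toy["Type 1"], toy["Type 2"]]).unique()
pokemon_totals = totals_by_type(toy, unique_types)

print(len(pokemon_totals))
```

On the full dataset the same loop should produce a list of length 18, one entry per type, with each entry holding many `Total` values rather than a single sum.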

#### Now we run an ANOVA test on `pokemon_totals`.

*Hints:*

* To conduct ANOVA, you can use `scipy.stats.f_oneway()`. Here's the [reference](http://b.link/scipy44).

* What if `f_oneway` throws an error because it does not accept `pokemon_totals` as a list? The trick is to add a `*` in front of `pokemon_totals`, e.g. `st.f_oneway(*pokemon_totals)`. This unpacks the list and passes each list item to `f_oneway` as a separate argument.
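As a minimal illustration of the `*` trick, the sketch below runs `f_oneway` on three made-up groups, once with unpacking and once with the groups written out explicitly. The numbers are invented and have nothing to do with the pokemon data.

```python
import scipy.stats as st

# Three made-up groups of measurements.
groups = [
    [300, 320, 310, 305],
    [500, 520, 510],
    [400, 410, 405],
]

# Unpacking the list passes each inner list as its own argument...
unpacked = st.f_oneway(*groups)

# ...which is equivalent to writing the groups out by hand.
explicit = st.f_oneway(groups[0], groups[1], groups[2])

print(unpacked.statistic, unpacked.pvalue)
```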

In [62]:
# Pass every type column of the pivot table to f_oneway as a separate group
st.f_oneway(*[pokemon_pivot[col] for col in pokemon_pivot.columns])

F_onewayResult(statistic=28.674566584818066, pvalue=2.254687684597403e-82)
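One point worth checking in this result: the groups have different sizes, so the wide pivot table contains `NaN` cells, and replacing them with 0 adds observations that do not exist. A hedged sketch on made-up numbers compares zero-padding with dropping the missing cells per group. The frame `wide` and both result names are illustrative, and no claim is made here about how the pokemon result would change.

```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Made-up wide table: groups of unequal size leave NaN cells.
wide = pd.DataFrame({
    "A": [300, 320, 310, 305],
    "B": [500, 520, np.nan, np.nan],
    "C": [400, 410, 405, np.nan],
})

# Zero-padding treats every NaN as a real observation with value 0.
padded = st.f_oneway(*[wide[col].fillna(0) for col in wide.columns])

# Dropping NaN per column keeps only the observations that exist.
clean = st.f_oneway(*[wide[col].dropna() for col in wide.columns])

print("padded:", padded.statistic, padded.pvalue)
print("clean: ", clean.statistic, clean.pvalue)
```

In this toy case the two results disagree strongly, so applying `dropna()` to each column before calling `f_oneway` is the safer choice when group sizes differ.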

#### Interpret the ANOVA test result. Is the difference significant?

In [None]:
# The p-value (about 2.25e-82) is far below 0.05, which means that we can reject H0: the difference in mean Total across the pokemon types is significant.