# Challenge 2 - ANOVA

In statistics, **Analysis of Variance (ANOVA)** is also used to analyze the differences among group means. The difference between t-test and ANOVA is the former is ued to compare two groups whereas the latter is used to compare three or more groups. [Read more about the difference between t-test and ANOVA](https://keydifferences.com/difference-between-t-test-and-anova.html).

From the ANOVA test, you receive two numbers. The first number is called the **F-value** which indicates whether your null-hypothesis can be rejected. The critical F-value that rejects the null-hypothesis varies according to the number of total subjects and the number of subject groups in your experiment. In [this table](https://www.itl.nist.gov/div898/handbook/eda/section3/eda3673.htm) you can find the critical values of the F distribution. **If you are confused by the massive F-distribution table, don't worry. Skip F-value for now and study it at a later time. In this challenge you only need to look at the p-value.**

The p-value is another number yielded by ANOVA which already takes the number of total subjects and the number of experiment groups into consideration. **Typically if your p-value is less than 0.05, you can declare the null-hypothesis is rejected.**

In this challenge, we want to understand whether there are significant differences among various types of pokemons' `Total` value, i.e. Grass vs Poison vs Fire vs Dragon... There are many types of pokemons which makes it a perfect use case for ANOVA.

In [1]:
# Import libraries
import pandas as pd
from scipy.stats import f_oneway


In [2]:
# Import dataset

pokemon = pd.read_csv('../../lab-df-calculation-and-transformation/your-code/Pokemon.csv')

pokemon.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


**To achieve our goal, we use three steps:**

1. **Extract the unique values of the pokemon types.**

1. **Select dataframes for each unique pokemon type.**

1. **Conduct ANOVA analysis across the pokemon types.**

#### First let's obtain the unique values of the pokemon types. These values should be extracted from Type 1 and Type 2 aggregated. Assign the unique values to a variable called `unique_types`.

*Hint: the correct number of unique types is 19 including `NaN`. You can disregard `NaN` in next step.*

In [3]:
unique_types_pre = pokemon['Type 1'].unique().tolist()+ pokemon['Type 2'].unique().tolist()
unique_types_1= set(unique_types_pre)
unique_types=list(unique_types_1)
unique_types
# you should see 19

['Fairy',
 nan,
 'Steel',
 'Water',
 'Dragon',
 'Rock',
 'Dark',
 'Poison',
 'Ground',
 'Fire',
 'Ghost',
 'Grass',
 'Fighting',
 'Ice',
 'Normal',
 'Flying',
 'Psychic',
 'Electric',
 'Bug']

#### Second we will create a list named `pokemon_totals` to contain the `Total` values of each unique type of pokemons.

Why we use a list instead of a dictionary to store the pokemon `Total`? It's because ANOVA only tells us whether there is a significant difference of the group means but does not tell which group(s) are significantly different. Therefore, we don't need know which `Total` belongs to which pokemon type.

*Hints:*

* Loop through `unique_types` and append the selected type's `Total` to `pokemon_groups`.
* Skip the `NaN` value in `unique_types`. `NaN` is a `float` variable which you can find out by using `type()`. The valid pokemon type values are all of the `str` type.
* At the end, the length of your `pokemon_totals` should be 18.

In [66]:
pokemon_totals = []
for i in unique_types:
    if type(i)!=float:
        pokemon_totals.append(pokemon[(pokemon['Type 2']==i)]['Total'])or(pokemon[(pokemon['Type 1']==i)]['Total'])

len(pokemon_totals) # you should see 18

18

In [67]:
pokemon_totals

[44     270
 45     435
 131    460
 188    210
 198    250
 199    420
 303    198
 304    278
 305    518
 306    618
 322    190
 328    380
 329    480
 366    590
 487    310
 591    545
 606    280
 607    480
 772    431
 773    500
 777    470
 795    600
 796    700
 Name: Total, dtype: int64, 88     325
 89     465
 220    465
 228    500
 229    600
 440    530
 455    350
 456    495
 460    424
 497    525
 498    625
 513    535
 528    525
 542    600
 589    508
 650    495
 658    305
 659    489
 685    340
 686    490
 693    484
 717    600
 Name: Total, dtype: int64, 149    355
 150    495
 151    355
 152    495
 307    269
 398    290
 399    410
 400    530
 445    410
 533    520
 758    306
 759    500
 760    320
 799    600
 Name: Total, dtype: int64, 7      634
 196    610
 249    540
 275    630
 360    340
 361    520
 540    680
 541    680
 544    680
 545    680
 694    300
 695    420
 696    600
 761    494
 766    362
 767    521
 790    245
 791   

#### Now we run ANOVA test on `pokemon_totals`.

*Hints:*

* To conduct ANOVA, you can use `scipy.stats.f_oneway()`. Here's the [reference](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html).

* What if `f_oneway` throws an error because it does not accept `pokemon_totals` as a list? The trick is to add a `*` in front of `pokemon_totals`, e.g. `stats.f_oneway(*pokemon_groups)`. This trick breaks the list and supplies each list item as a parameter for `f_oneway`.

In [68]:
# Your code here
f_oneway(*pokemon_totals)

F_onewayResult(statistic=2.7463260975846784, pvalue=0.00024695147909881604)

#### Interpret the ANOVA test result. Is the difference significant?

In [None]:
# Your comment here