# Practical - ANOVA
This practical session demonstrates how to perform a one-way ANOVA (analysis of variance) in Python. We assume you have an adequate understanding of the Python programming language. If you would like to refresh your Python skills, we recommend our <b>"Programming for Data Science Series"</b>, which covers almost all aspects of Python programming in the data science domain.
See the URL below for the full playlist of almost 10 hours of video lessons in Burmese.
URL : https://www.youtube.com/watch?v=jOZNjVVZIVs&list=PLD_eiqVVLZDi9GZZJDC8Zx4-3Np8LHs52

In [1]:
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/myanmards/resource_files/master/sample_anova.csv')
data.head()

Unnamed: 0,emp_id,first_name,last_name,gender,salary,work_exp
0,1,Georgi,Facello,M,500000,Mid
1,2,Bezalel,Simmel,F,120000,Junior
2,3,Parto,Bamford,M,350000,Junior
3,4,Chirstian,Koblick,M,400000,Mid
4,5,Kyoichi,Maliniak,M,200000,Junior


In [2]:
data.head(20)

Unnamed: 0,emp_id,first_name,last_name,gender,salary,work_exp
0,1,Georgi,Facello,M,500000,Mid
1,2,Bezalel,Simmel,F,120000,Junior
2,3,Parto,Bamford,M,350000,Junior
3,4,Chirstian,Koblick,M,400000,Mid
4,5,Kyoichi,Maliniak,M,200000,Junior
5,6,Anneke,Preusig,F,300000,Junior
6,7,Tzvetan,Zielinski,F,150000,Junior
7,8,Saniya,Kalloufi,M,750000,Mid
8,9,Sumant,Peac,F,750000,Senior
9,10,Duangkaew,Piveteau,F,200000,Junior


<b>First, we will create a dataframe containing only the features we want to use in the ANOVA, i.e. salary and work_exp in this case</b>

In [3]:
df = data[['salary', 'work_exp']]
df.head()

Unnamed: 0,salary,work_exp
0,500000,Mid
1,120000,Junior
2,350000,Junior
3,400000,Mid
4,200000,Junior
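
Before running the test, it can help to look at the size, mean and spread of each group, since ANOVA compares group means. The following is a minimal sketch, assuming a small made-up frame (`sample` and its values are hypothetical and only mirror the `salary` and `work_exp` columns above):

```python
import pandas as pd

# Hypothetical rows that mirror the salary / work_exp columns (not the real dataset)
sample = pd.DataFrame({
    "salary": [500000, 120000, 350000, 400000, 200000, 750000],
    "work_exp": ["Mid", "Junior", "Junior", "Mid", "Junior", "Senior"],
})

# Number of employees, mean salary and standard deviation per experience level
summary = sample.groupby("work_exp")["salary"].agg(["count", "mean", "std"])
print(summary)
```

A group with a single observation (here "Senior") has no defined standard deviation, which is worth checking before trusting the test on a small sample.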


<b>Next, we will import the required library and perform a one-way ANOVA</b>

In [4]:
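from scipy import stats

# Compare mean salary across the three work_exp groups;
# f_oneway returns the F-statistic and the p-value
F, p = stats.f_oneway(df[df.work_exp == 'Junior'].salary,
                      df[df.work_exp == 'Mid'].salary,
                      df[df.work_exp == 'Senior'].salary)
F, p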

(43.767065333017584, 2.84255584357536e-14)
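
To see what `stats.f_oneway` computes, the F-statistic can be reproduced by hand as the ratio of the between-group mean square to the within-group mean square. The sketch below assumes a small made-up sample (the values in `groups` are hypothetical, not taken from the dataset) and compares the manual result with SciPy:

```python
import numpy as np
from scipy import stats

# Hypothetical salaries for three experience levels (illustrative values only)
groups = {
    "Junior": np.array([120000, 150000, 200000, 300000, 350000], dtype=float),
    "Mid": np.array([400000, 500000, 750000], dtype=float),
    "Senior": np.array([700000, 750000, 800000], dtype=float),
}

samples = list(groups.values())
all_values = np.concatenate(samples)
grand_mean = all_values.mean()

k = len(samples)      # number of groups
n = all_values.size   # total number of observations

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(s.size * (s.mean() - grand_mean) ** 2 for s in samples)
# Within-group sum of squares: spread of the observations around their own group mean
ss_within = sum(((s - s.mean()) ** 2).sum() for s in samples)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
# p-value: upper tail of the F distribution with (k - 1, n - k) degrees of freedom
p_manual = stats.f.sf(f_manual, k - 1, n - k)

f_scipy, p_scipy = stats.f_oneway(*samples)
print(f_manual, f_scipy)
print(p_manual, p_scipy)
```

A large F with a small p-value, as in the output above, suggests that at least one group mean differs from the others. It does not say which groups differ.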

Now we will do a few things to understand the concept better:
* Create 3 bins based on salary and name them Low, Medium and High
* Note that these bins are derived from the salary values in the dataset itself
* Perform a one-way ANOVA on salary again, this time grouped by the newly created bins

In [5]:
import numpy as np
bins = np.linspace(min(df['salary']), max(df['salary']), 4)
bins

array([120000., 330000., 540000., 750000.])
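
The four edges returned by `np.linspace` define three equal-width bins. As a rough sketch, assuming only the minimum and maximum salary shown in the output above, the same edges can be derived by hand:

```python
import numpy as np

# Minimum and maximum salary, taken from the bin edges printed above
low, high = 120000, 750000
n_bins = 3

# Each bin has the same width: (max - min) / number of bins
width = (high - low) / n_bins

# n_bins bins need n_bins + 1 edges
manual_edges = np.array([low + i * width for i in range(n_bins + 1)])
linspace_edges = np.linspace(low, high, n_bins + 1)

print(width)
print(manual_edges)
print(linspace_edges)
```

Equal-width bins are only one choice. They do not guarantee that each bin holds a similar number of employees.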

In [6]:
bin_names = ['Low', 'Medium', 'High']
data['new_salary_group'] = pd.cut(data['salary'], bins, labels=bin_names, include_lowest=True)
data.head(10)

Unnamed: 0,emp_id,first_name,last_name,gender,salary,work_exp,new_salary_group
0,1,Georgi,Facello,M,500000,Mid,Medium
1,2,Bezalel,Simmel,F,120000,Junior,Low
2,3,Parto,Bamford,M,350000,Junior,Medium
3,4,Chirstian,Koblick,M,400000,Mid,Medium
4,5,Kyoichi,Maliniak,M,200000,Junior,Low
5,6,Anneke,Preusig,F,300000,Junior,Low
6,7,Tzvetan,Zielinski,F,150000,Junior,Low
7,8,Saniya,Kalloufi,M,750000,Mid,High
8,9,Sumant,Peac,F,750000,Senior,High
9,10,Duangkaew,Piveteau,F,200000,Junior,Low
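
The `include_lowest=True` argument matters because `pd.cut` builds right-closed intervals by default, so a value equal to the first edge would otherwise fall outside every bin. A minimal sketch, assuming a few hypothetical salaries (`salaries` is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical salaries, including the minimum and a value exactly on an inner edge
salaries = pd.Series([120000, 330000, 500000, 750000])
edges = np.linspace(salaries.min(), salaries.max(), 4)
labels = ["Low", "Medium", "High"]

# Default: intervals are (left, right], so the minimum value gets no bin (NaN)
without_lowest = pd.cut(salaries, edges, labels=labels)
# include_lowest=True closes the first interval on the left as well
with_lowest = pd.cut(salaries, edges, labels=labels, include_lowest=True)

print(pd.DataFrame({
    "salary": salaries,
    "without_lowest": without_lowest,
    "with_lowest": with_lowest,
}))
```

Values that sit exactly on an inner edge, such as 330000 here, go to the lower bin because each interval includes its right edge.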


<b>Because the new groups are defined by the salary values themselves (the lowest salaries fall into "Low", the highest into "High"), the group means are far apart and the F-statistic will be high</b>

In [7]:
F, p = stats.f_oneway(data[data.new_salary_group == 'Low'].salary,
                      data[data.new_salary_group == 'Medium'].salary,
                      data[data.new_salary_group == 'High'].salary)
F, p

(579.7277387424342, 1.1214878408179476e-54)
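
The very large F-statistic here is a consequence of how the groups were built, not evidence of a real effect. The sketch below illustrates this on synthetic data (all names and values are hypothetical): the same random values are grouped once by unrelated random labels and once by bins cut from the values themselves.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic values with no underlying group structure
values = pd.Series(rng.normal(loc=0.0, scale=1.0, size=300))

# Grouping 1: labels assigned at random, unrelated to the values
random_labels = pd.Series(rng.integers(0, 3, size=300))
f_random, p_random = stats.f_oneway(
    *[values[random_labels == g] for g in range(3)]
)

# Grouping 2: three equal-width bins cut from the values themselves
edges = np.linspace(values.min(), values.max(), 4)
bin_labels = pd.cut(values, edges, labels=["Low", "Medium", "High"],
                    include_lowest=True)
f_binned, p_binned = stats.f_oneway(
    *[values[bin_labels == g] for g in ["Low", "Medium", "High"]]
)

print("random labels:", f_random, p_random)
print("bins from values:", f_binned, p_binned)
```

Grouping by bins of the dependent variable separates the groups by construction, so this kind of test should be read as a demonstration of how the F-statistic behaves rather than as a finding about the data.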