# lab-inferential-statistics-anova

## Part 1

#### In this activity, we will look at another example. Your task is to understand the problem and write down all the steps to set up ANOVA. After the next lesson, we will ask you to solve this problem using Python. Here are the steps that you would need to work on:

- Null hypothesis
- Alternate hypothesis
- Level of significance
- Test statistic
- P-value
- F table

## Context

#### Suppose you are working as an analyst in a microprocessor chip manufacturing plant. You have been given the task of analyzing a plasma etching process with respect to changing the power (in watts) of the plasma beam. Data was collected and provided to you so that you can conduct a statistical analysis and check whether changing the power of the plasma beam has any effect on the etching rate of the machine. You will conduct ANOVA and check whether there is any difference in the mean etching rate for different levels of power. You can find the data in the anova_lab_data.xlsx file in the files_for_lab folder.

In [1]:
# importing libraries
import pandas as pd
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Reading the Excel file into a pandas DataFrame
anova_lab_data_df = pd.read_excel("anova_lab_data.xlsx")
anova_lab_data_df

Unnamed: 0,Power,Etching Rate
0,160 W,5.43
1,180 W,6.24
2,200 W,8.79
3,160 W,5.71
4,180 W,6.71
5,200 W,9.2
6,160 W,6.22
7,180 W,5.98
8,200 W,7.9
9,160 W,6.01


In [3]:
# Standardizing the column names: lowercase letters only, spaces replaced with underscores
# (the trailing underscore in 'power_' suggests the original header ends with a space)
cols = [col.lower().replace(' ', '_') for col in anova_lab_data_df.columns]
cols

['power_', 'etching_rate']

In [4]:
# assigning the standardized column names to the dataframe and displaying them
anova_lab_data_df.columns=cols
anova_lab_data_df.columns

Index(['power_', 'etching_rate'], dtype='object')

In [5]:
# renaming 'power_' to 'power' to drop the trailing underscore
anova_lab_data_df = anova_lab_data_df.rename(columns={'power_':'power'})
anova_lab_data_df.columns

Index(['power', 'etching_rate'], dtype='object')

In [6]:
# describing the dataframe; count=15 is the total sample size n, which we will use for the degrees of freedom
anova_lab_data_df.describe()

Unnamed: 0,etching_rate
count,15.0
mean,6.782667
std,1.228643
min,5.43
25%,5.845
50%,6.24
75%,7.725
max,9.2


In [7]:
# we have 3 different groups with 5 observations per group
anova_lab_data_df_count = anova_lab_data_df['power'].value_counts()
anova_lab_data_df_count

160 W    5
180 W    5
200 W    5
Name: power, dtype: int64

In [8]:
# replacing the labels '160 W', '180 W', '200 W' in the 'power' column with the numbers 160, 180, 200 (dropping the unit)
anova_lab_data_df['power'] = np.where(anova_lab_data_df['power'].isin(['160 W']) , 160, anova_lab_data_df['power'])
anova_lab_data_df['power'] = np.where(anova_lab_data_df['power'].isin(['180 W']) , 180, anova_lab_data_df['power'])
anova_lab_data_df['power'] = np.where(anova_lab_data_df['power'].isin(['200 W']) , 200, anova_lab_data_df['power'])
anova_lab_data_df

Unnamed: 0,power,etching_rate
0,160,5.43
1,180,6.24
2,200,8.79
3,160,5.71
4,180,6.71
5,200,9.2
6,160,6.22
7,180,5.98
8,200,7.9
9,160,6.01


In [9]:
# grouping by power and computing the mean etching_rate of each group
anova_lab_data_df.groupby('power').mean()

power,etching_rate
160,5.792
180,6.238
200,8.318


In [10]:
# or a different way: a function that splits the etching rates by power level, which we then use to build a new dataframe with the columns 160, 180, 200
def separation(df):
    # boolean masks select the etching rates for each power level, in row order
    first = df.loc[df['power'] == 160, 'etching_rate'].tolist()
    second = df.loc[df['power'] == 180, 'etching_rate'].tolist()
    third = df.loc[df['power'] == 200, 'etching_rate'].tolist()
    return first, second, third

first, second, third = separation(anova_lab_data_df)
first


[5.43, 5.71, 6.22, 6.01, 5.59]

In [11]:
anova={160:first,180:second,200:third}

In [12]:
# the new dataframe with the columns 160,180,200 and their etching_rate,respectively
new_df=pd.DataFrame.from_dict(anova)
new_df

Unnamed: 0,160,180,200
0,5.43,6.24,8.79
1,5.71,6.71,9.2
2,6.22,5.98,7.9
3,6.01,5.66,8.15
4,5.59,6.6,7.55


In [13]:
# n, mean and standard deviation of the first group (160 W); np.std defaults to ddof=0, i.e. the population formula
print('n:', len(new_df[160]),', mean:', np.mean(new_df[160]), ', standard deviation: ',np.std(new_df[160]))

n: 5 , mean: 5.792 , standard deviation:  0.28610487587596267


In [14]:
# n, mean and standard deviation of the second group (180 W); np.std defaults to ddof=0, i.e. the population formula
print('n:', len(new_df[180]),', mean:', np.mean(new_df[180]), ', standard deviation: ',np.std(new_df[180]))

n: 5 , mean: 6.2379999999999995 , standard deviation:  0.3884533434017525


In [15]:
# n, mean and standard deviation of the third group (200 W); np.std defaults to ddof=0, i.e. the population formula
print('n:', len(new_df[200]),', mean:', np.mean(new_df[200]), ', standard deviation: ',np.std(new_df[200]))

n: 5 , mean: 8.318 , standard deviation:  0.5989123475100505


#### State the null hypothesis

In [16]:
#### The assumptions for a one-way between-groups ANOVA are:
#### 1. The samples are independent
#### 2. The response variable is approximately normally distributed within each group, or every group has a sample size of at least 30
#### 3. The population variances are equal across the groups (rule of thumb: if the largest sample standard deviation divided
#### by the smallest sample standard deviation is not greater than two, assume that the population variances are equal)
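
The rule of thumb in assumption 3 can be checked directly. The block below is a minimal sketch, not part of the original lab: it hard-codes the etching rates shown in the tables above, uses the sample standard deviation (`ddof=1`), and adds Levene's test from SciPy as a more formal check. The variable names (`groups`, `sample_sds`, `sd_ratio`) are illustrative assumptions.

```python
import numpy as np
import scipy.stats as st

# Etching rates per power level, copied from the lab data shown above
groups = {
    160: [5.43, 5.71, 6.22, 6.01, 5.59],
    180: [6.24, 6.71, 5.98, 5.66, 6.60],
    200: [8.79, 9.20, 7.90, 8.15, 7.55],
}

# Sample standard deviation (ddof=1) of each group
sample_sds = {power: np.std(values, ddof=1) for power, values in groups.items()}

# Rule of thumb: largest SD divided by smallest SD should not exceed 2
sd_ratio = max(sample_sds.values()) / min(sample_sds.values())
print("largest SD / smallest SD:", round(sd_ratio, 3))

# Levene's test (H0: all groups have equal variances) as a second opinion
levene_stat, levene_p = st.levene(*groups.values())
print("Levene p-value:", round(levene_p, 4))
```

With these values the ratio comes out slightly above 2 (about 2.09), so the rule of thumb is borderline here and the equal-variance assumption deserves a second look before relying on the F-test.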

In [17]:
#### Given that we are comparing 3 independent groups, the null hypothesis is:
#### H0: μ1=μ2=μ3 (all population means are equal), where μ1 is the mean etching rate of the first group (160 W), μ2 is the mean of the second group (180 W),
#### and μ3 is the mean of the third group (200 W)

#### State the alternate hypothesis

In [18]:
#### Alternative hypothesis -> H1 (or Ha): not all μ are equal,
#### i.e. at least one power level has a mean etching rate that differs from the others.

#### What is the significance level

In [19]:
#### Level of significance: this defines the rejection (critical) region. It is the probability of rejecting the null hypothesis
#### when it is actually true (a Type I error). It is denoted by the Greek letter alpha and is usually set to 0.05.
#### Nothing in this problem calls for a stricter threshold (as a medical study might), so we use alpha=0.05

#### What are the degrees of freedom of model, error terms, and total DoF

In [20]:
#### We will use ANOVA and the F distribution (F-test)
#### df1 (between-groups, numerator degrees of freedom): k-1, where k=number of groups. We have 3 groups, so df1=k-1=3-1=2
#### df2 (within-groups, denominator or error degrees of freedom): n-k, where n=total sample size with all groups combined
#### and k=number of groups. So, n=15, k=3 and df2=n-k=15-3=12
#### The total degrees of freedom are df=df1+df2=2+12=14, which equals n-1
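
The "F table" step amounts to looking up the critical value of the F distribution for these degrees of freedom. As a rough sketch (the variable names are illustrative, not from the lab), `scipy.stats.f.ppf` can stand in for the printed table:

```python
import scipy.stats as st

k = 3        # number of groups (power levels)
n = 15       # total number of observations
alpha = 0.05 # significance level chosen above

df_between = k - 1   # numerator degrees of freedom
df_within = n - k    # denominator (error) degrees of freedom
df_total = n - 1     # total degrees of freedom

# Critical value: reject H0 if the F statistic exceeds this number
f_critical = st.f.ppf(1 - alpha, df_between, df_within)
print("df:", df_between, df_within, df_total)
print("F critical:", round(f_critical, 3))
```

For alpha=0.05 with 2 and 12 degrees of freedom the critical value is roughly 3.89, which is the number an F table would give.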

## Part 2

#### In this section, use Python to conduct ANOVA.

In [21]:
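# Analysis of Variance (ANOVA): one-way model with power as the factor.
# power is treated as a categorical factor here (df = 2 in the table below); writing C(power) in the formula would make that explicit.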
import statsmodels.api as sm
from statsmodels.formula.api import ols

model = ols('etching_rate~power', data=anova_lab_data_df).fit()
sm.stats.anova_lm(model)

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
power,2.0,18.176653,9.088327,36.878955,8e-06
Residual,12.0,2.95724,0.246437,,
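
As a cross-check on the table above, the sums of squares and the F statistic can be reproduced by hand. This is a sketch under the assumption that the 15 values shown earlier are the complete data set; it hard-codes them rather than reading the Excel file, and compares the manual result with `scipy.stats.f_oneway`.

```python
import numpy as np
import scipy.stats as st

# Etching rates per power level, copied from the lab data shown above
g160 = [5.43, 5.71, 6.22, 6.01, 5.59]
g180 = [6.24, 6.71, 5.98, 5.66, 6.60]
g200 = [8.79, 9.20, 7.90, 8.15, 7.55]
groups = [np.array(g) for g in (g160, g180, g200)]

k = len(groups)                  # number of groups
n = sum(len(g) for g in groups)  # total sample size
grand_mean = np.concatenate(groups).mean()

# Between-groups and within-groups sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares and F statistic
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n - k)
f_manual = ms_between / ms_within

# P-value: upper tail of the F distribution
p_manual = st.f.sf(f_manual, k - 1, n - k)

# Same test via SciPy's one-way ANOVA
f_scipy, p_scipy = st.f_oneway(*groups)

print("SS between:", round(ss_between, 6), "SS within:", round(ss_within, 6))
print("F (manual):", round(f_manual, 4), "F (scipy):", round(f_scipy, 4))
print("p-value:", p_manual)
```

The manual sums of squares should agree with the `sum_sq` column, and both F values should agree with the `F` column of the statsmodels output.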


#### What conclusions can you draw from the experiment and why?

In [22]:
# At a significance level of 0.05, the p-value is below alpha (0.000008 < 0.05), so we reject the null hypothesis.
# The data support the alternative hypothesis -> H1 (or Ha): not all μ are equal.
# We conclude that at least one power level has a mean etching rate different from the others
# (we can't tell which ones without a post-hoc test such as Tukey's pairwise comparisons).
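
The Tukey comparison mentioned above could look roughly like the following. This is a sketch, not part of the original lab: it assumes SciPy 1.8 or newer (where `scipy.stats.tukey_hsd` is available) and hard-codes the etching rates from the tables above; `labels` and the printed wording are illustrative.

```python
import scipy.stats as st

# Etching rates per power level, copied from the lab data shown above
g160 = [5.43, 5.71, 6.22, 6.01, 5.59]
g180 = [6.24, 6.71, 5.98, 5.66, 6.60]
g200 = [8.79, 9.20, 7.90, 8.15, 7.55]

labels = [160, 180, 200]
alpha = 0.05

# Tukey's HSD test for all pairwise differences between group means
result = st.tukey_hsd(g160, g180, g200)

# result.pvalue is a 3x3 matrix; entry [i, j] compares group i with group j
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        p = result.pvalue[i, j]
        verdict = "different" if p < alpha else "not significantly different"
        print(f"{labels[i]} W vs {labels[j]} W: p = {p:.4f} ({verdict})")
```

With these data, the 200 W group should differ significantly from both other groups, while the 160 W and 180 W groups should not differ significantly from each other.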