# A Simple Introduction to ANOVA  

**Analysis of variance (ANOVA)** is a statistical technique that is used to check if the means of two or more groups are significantly different from each other. ANOVA checks the impact of one or more factors by comparing the means of different samples.

For example, we can use ANOVA to test whether attendance group has an effect on the increase in SAT scores.

## Terminology related to ANOVA you need to know


### Grand Mean

We use two kinds of means in ANOVA calculations: the separate sample means ($\mu_1, \mu_2, \mu_3$) and the grand mean $\mu$. The grand mean is the mean of all observations combined, irrespective of the sample. When every sample has the same size, this equals the mean of the sample means.
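The difference between the two ways of averaging can be seen on a small example. This is a minimal sketch with made-up samples of unequal size; the variable names are illustrative and not part of the article's dataset.

```python
import numpy as np

# Hypothetical samples, deliberately of unequal size.
samples = [np.array([1.0, 2.0, 3.0]), np.array([10.0, 20.0])]

# Mean of each sample on its own.
sample_means = [s.mean() for s in samples]

# Unweighted mean of the sample means.
mean_of_means = np.mean(sample_means)

# Grand mean: pool every observation, then average.
grand_mean = np.concatenate(samples).mean()

print("sample means:", sample_means)
print("mean of sample means:", mean_of_means)
print("grand mean (all observations):", grand_mean)
```

With unequal sample sizes the two values differ, because the pooled mean gives the larger sample more weight.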

### Hypothesis

The null hypothesis in ANOVA states that all the group means are equal, that is, they do not differ significantly. The samples can then be considered part of one larger population. The alternate hypothesis, on the other hand, states that at least one of the means is different from the rest. In mathematical form, they can be represented as:

$H_0: \mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$

$H_a: \mu_i \neq \mu_j$ for at least one pair $(i, j)$

In other words, the null hypothesis states that all the sample means are equal, or that the factor did not have any significant effect on the results. The alternate hypothesis states that at least one of the sample means is different from another.

Rejecting the null hypothesis still doesn’t tell us which mean differs. For that, we will use other methods that we will discuss later in this article.

### Between Group Variability

Consider the distributions of the two samples below. As these samples overlap, their individual means won’t differ by a great margin. Hence the difference between each individual mean and the grand mean won’t be large either.

<img src="img/between.png" width="300"/>

In contrast, the samples below differ from each other by a big margin, so their individual means also differ. The difference between the individual means and the grand mean is therefore large as well.

<img src="img/very_different.png" width="300"/>

Such variability between the distributions is called between-group variability. It refers to variation between the distributions of individual groups (or levels), measured by how far each group mean lies from the grand mean.

<img src="img/comparison_within.png" width="400"/>

To quantify it, we take the squared deviation of each sample mean from the grand mean, multiply it by that sample's size, and add the results up. This is called the **sum-of-squares for between-group variability**.

<img src="img/ss_between.png" width="400"/>

For our between-group variability, we find each squared deviation, weight it by its sample size, sum the results, and divide by the degrees of freedom, which in the case of between-group variability is the number of sample means (k) minus 1.

<img src="img/ms_between.png" width="400"/>
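The between-group calculation described above can be sketched in a few lines of NumPy. The three samples are small illustrative values typed in by hand (an assumption, not read from the article's CSV), and the variable names are hypothetical.

```python
import numpy as np

# Illustrative samples (assumed values, one array per group).
groups = {
    "placebo": np.array([3, 2, 1, 1, 4], dtype=float),
    "low": np.array([5, 2, 4, 2, 3], dtype=float),
    "high": np.array([7, 4, 5, 3, 6], dtype=float),
}

# Grand mean over all observations pooled together.
grand_mean = np.concatenate(list(groups.values())).mean()

# Squared deviation of each sample mean from the grand mean,
# weighted by the sample size, then summed.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())

# Degrees of freedom: number of groups (k) minus 1.
df_between = len(groups) - 1
ms_between = ss_between / df_between

print("SS between:", round(ss_between, 4))
print("df between:", df_between)
print("MS between:", round(ms_between, 4))
```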

### Within Group Variability

<img src="img/within_group.png" width="400"/>

Such variation within a sample is called within-group variability. It refers to variation caused by differences within individual groups (or levels), as not all the values within each group are the same. Each sample is looked at on its own, and the variability between the individual points in the sample is calculated.


We can measure Within-group variability by looking at how much each value in each sample differs from its respective sample mean. So first, we’ll take the squared deviation of each value from its respective sample mean and add them up. This is the sum of squares for within-group variability.

<img src="img/ss_within.png" width="500"/>


As with between-group variability, we then divide the sum of squared deviations by the degrees of freedom to find a less-biased estimator for the average squared deviation.

<img src="img/df_within.png" width="700"/>

<img src="img/ms_within.png" width="400"/>
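The within-group calculation can be sketched the same way. This assumes the usual within-group degrees of freedom, the total number of observations (N) minus the number of groups (k); the samples are illustrative values, not read from the article's CSV.

```python
import numpy as np

# Illustrative samples (assumed values, one array per group).
groups = [
    np.array([3, 2, 1, 1, 4], dtype=float),
    np.array([5, 2, 4, 2, 3], dtype=float),
    np.array([7, 4, 5, 3, 6], dtype=float),
]

# Squared deviation of every value from its own sample mean, summed.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom: total observations (N) minus number of groups (k).
n_total = sum(len(g) for g in groups)
df_within = n_total - len(groups)
ms_within = ss_within / df_within

print("SS within:", round(ss_within, 4))
print("df within:", df_within)
print("MS within:", round(ms_within, 4))
```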


## F-Statistic

The statistic that measures whether the means of different samples are significantly different is called the F-ratio. The lower the F-ratio, the more similar the sample means are.

#### F = Between group variability / Within group variability



<img src="img/betweeN_and_within.png" width="400"/>


The calculated F-statistic is then compared with the F-critical value to reach a conclusion. If the calculated F-statistic is greater than the F-critical value (for a specific significance level α), we reject the null hypothesis and can say that the treatment had a significant effect.
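The comparison against the critical value can be sketched as follows. The samples are illustrative, α = 0.05 is an assumed significance level, and the critical value is taken from SciPy's F distribution (`stats.f.ppf`) rather than from a printed table.

```python
import numpy as np
from scipy import stats

# Illustrative samples (assumed values, one array per group).
groups = [
    np.array([3, 2, 1, 1, 4], dtype=float),
    np.array([5, 2, 4, 2, 3], dtype=float),
    np.array([7, 4, 5, 3, 6], dtype=float),
]

k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

# Mean squares between and within groups.
ms_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)
ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n_total - k)

# F-ratio: between-group variability over within-group variability.
f_stat = ms_between / ms_within

# Critical value and p-value from the F distribution with (k - 1, N - k) df.
alpha = 0.05  # assumed significance level
f_crit = stats.f.ppf(1 - alpha, k - 1, n_total - k)
p_value = stats.f.sf(f_stat, k - 1, n_total - k)

reject_null = f_stat > f_crit
print("F =", round(f_stat, 3), "| F critical =", round(f_crit, 3))
print("p-value =", round(p_value, 4), "| reject H0:", reject_null)
```

Comparing F with the critical value and comparing the p-value with α are two views of the same decision.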

In [None]:
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
    
import matplotlib.pyplot as plt

# Load the data and drop the participant identifier column
df = pd.read_csv("https://raw.githubusercontent.com/Opensourcefordatascience/Data-sets/master/difficile.csv")
df = df.drop(columns='person')
df.head()

# Recode dose from numeric codes to descriptive labels
# (assign the result back instead of using inplace on a column selection)
df['dose'] = df['dose'].replace({1: 'placebo', 2: 'low', 3: 'high'})

# Summary statistics for the outcome variable
df['libido'].describe()

In [None]:
df['libido'].groupby(df['dose']).describe()


In [None]:
stats.f_oneway(df['libido'][df['dose'] == 'high'], 
             df['libido'][df['dose'] == 'low'],
             df['libido'][df['dose'] == 'placebo'])

The F-statistic is 5.119 and the p-value is 0.025, which indicates an overall significant effect of medication on libido. However, we don’t yet know which dosing groups differ from each other; that is covered in the post-hoc section.

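The same test can also be run as a regression model with statsmodels. The general structure of the call, with placeholder names, is:

model_name = ols('outcome_variable ~ group1 + group2 + groupN', data=your_data).fit()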



In [None]:
results = ols('libido ~ C(dose)', data=df).fit()
results.summary()

In [None]:
# The keyword is `typ`, not `type`; an unrecognised `type` is ignored and Type 1 is used
aov_table = sm.stats.anova_lm(results, typ=2)
aov_table

### ANOVA Assumptions
There are 3 assumptions that need to be met for the results of an ANOVA test to be considered accurate and trustworthy. It’s important to note that the assumptions apply to the residuals and not the variables themselves. The ANOVA assumptions are the same as for linear regression and are:

- Normality

- Homogeneity of variance

- Independent observations

In [None]:
results.diagn

These are the same diagnostics shown at the bottom of the regression table from before. The Durbin-Watson test detects the presence of autocorrelation (it is not provided when calling the diagnostics this way). Jarque-Bera (jb; jbpv is the p-value) and Omnibus (omni; omnipv is the p-value) both test the assumption of normality, using the skewness and kurtosis of the residuals. Neither tests homogeneity of variance, which has to be checked separately. The Condition Number (condno) assesses multicollinearity; values over 20 are commonly taken as indicative of multicollinearity.
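The normality and homogeneity-of-variance assumptions listed above can also be checked directly with SciPy. This is a minimal sketch on illustrative data; the Shapiro-Wilk and Levene tests are swapped in here as common choices and are not the diagnostics reported by `results.diagn`.

```python
import numpy as np
from scipy import stats

# Illustrative samples (assumed values, one array per group).
groups = [
    np.array([3, 2, 1, 1, 4], dtype=float),
    np.array([5, 2, 4, 2, 3], dtype=float),
    np.array([7, 4, 5, 3, 6], dtype=float),
]

# Residuals of a one-way model: each value minus its own group mean.
residuals = np.concatenate([g - g.mean() for g in groups])

# Normality of the residuals (Shapiro-Wilk).
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variance across groups (Levene).
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk: W =", round(shapiro_stat, 3), "p =", round(shapiro_p, 3))
print("Levene: W =", round(levene_stat, 3), "p =", round(levene_p, 3))
```

For both tests, a small p-value suggests that the corresponding assumption may be violated.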