# Colab notebook to perform 1-way ANOVA
In the following, you will first upload a data set of three fluorescence measurements in each of four groups stored under different conditions (Table 3.2 in Statistics and Chemometrics for Analytical Chemistry, Sixth edition) and display it in two different plots. Next, you will perform a 1-way ANOVA followed by a least significant difference (LSD) test.

In [None]:
# Code cell 1

# Import needed libraries
import pandas as pd                       # To handle the data in a dataframe
import numpy as np                        # To calculate
import plotly.graph_objects as go         # To create plots
from scipy.stats import f                 # To look up F-values
from scipy.stats import t                 # To look up t-values
import statsmodels.api as sm              # To perform statistical calculations
from statsmodels.formula.api import ols   # To define a statistical model

## Upload and read in the data
First of all, we need to upload the data to make it available here in Colab, which can be done in the following way:
1. Click the folder icon on the left side of the browser window, which will open the file view in the side bar.
2. Click the left-most icon (page with arrow pointing up), which will open a window to select a file from your local computer.
3. Select the file called anova_data.xlsx, which has been distributed with this Colab notebook (you first need to download it to your computer).
4. The file will now appear in the side bar and we can use it in the code below.
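If the upload step is skipped, reading the file in the next code cell fails with a `FileNotFoundError`. The following is a minimal sketch of a check that could be run first. The helper name `is_uploaded` is hypothetical, and it assumes the uploaded file sits in the notebook's current working directory.

```python
from pathlib import Path


def is_uploaded(filename: str = "anova_data.xlsx") -> bool:
    """Return True if the file is present in the current working directory."""
    return Path(filename).is_file()


# Report whether the data file is available before trying to read it
if is_uploaded():
    print("anova_data.xlsx found, ready to read.")
else:
    print("anova_data.xlsx not found, upload it via the file view first.")
```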


In [None]:
# Code cell 2

# Read the uploaded Excel file into a pandas dataframe and store it in an object called data
data = pd.read_excel('anova_data.xlsx')

# Show the content of the dataframe
data

## Display the data in a dot plot
First we take a look at the data by displaying the three individual measurements in each group, which makes it visually easy to compare the four groups.

In [None]:
# Code cell 3

# Create an empty figure object
fig1 = go.Figure()

# Add the data
fig1.add_scatter(x=data['Condition'], y=data['Flourescence'], mode='markers')

# Update the layout
fig1.update_layout(title='Fluorescence from solutions stored under different conditions',
                  xaxis_title='Storage condition', yaxis_title='Fluorescence',
                  width=600, height=500)

# Show the figure
fig1.show()

## Display the data in a bar plot
Since we actually compare the means in the following ANOVA, let us calculate and display the means in a bar plot with the standard deviation as a measure of the variation within groups.
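Note that the pandas `std` aggregation returns the sample standard deviation (dividing by $n - 1$), whereas NumPy's `np.std` divides by $n$ unless `ddof=1` is passed. The sketch below illustrates the difference on three hypothetical replicate values, which are an assumption for illustration and are not read from the uploaded file.

```python
import numpy as np
import pandas as pd

# Hypothetical replicate measurements for a single group (assumption)
values = pd.Series([102.0, 100.0, 101.0])

# pandas default: ddof=1, i.e. divide by n - 1 (sample standard deviation)
sample_std = values.std()

# NumPy default: ddof=0, i.e. divide by n (population standard deviation)
population_std = np.std(values.to_numpy())

print("Sample standard deviation (n - 1):", round(sample_std, 3))
print("Population standard deviation (n):", round(population_std, 3))
```

With only three replicates per group the two definitions differ noticeably, so it is worth knowing which one the error bars show.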

In [None]:
# Code cell 4

# Group the data by condition and calculate the mean and standard deviation of the fluorescence
# (the column in the data file is spelled 'Flourescence')
f_data = data.groupby('Condition')['Flourescence'].agg(['mean', 'std']).round(2).reset_index()

# Display the dataframe
f_data

In [None]:
# Code cell 5

# Let us present the data in a bar diagram with mean as the bar height
# Create empty figure object
fig2 = go.Figure()

# Add the data
fig2.add_bar(x=f_data['Condition'], y=f_data['mean'],
            error_y=dict(type='data', array=f_data['std']))

# Update the layout and y-axis (show only from 80 and up)
fig2.update_layout(title='Fluorescence from solutions stored under different conditions',
                  xaxis_title='Storage condition', yaxis_title='Fluorescence',
                  width=600, height=600)
fig2.update_yaxes(range=[80, 105])

# Show the figure
fig2.show()

## Perform the 1-way ANOVA
Test whether the fluorescence measurements depend on the storage condition: are the means significantly different across the four conditions?

In [None]:
# Code cell 6

# First we define the statistical model and the data to use
model = ols('Flourescence ~ C(Condition)', data=data).fit()

# Then we perform the 1-way anova of the above model
aov_table = sm.stats.anova_lm(model, typ=2)

# Show the output
aov_table

Please note that the names in the output table differ from those in the textbook.
* C(Condition) = Between-sample
* Residual = Within-sample

Additionally, to match the output in the textbook, we calculate and add the mean square, MS:
$$MS = \frac{SS}{df}$$

Where:
- $SS =$ Sum of squares
- $df =$ Degrees of freedom
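To make the formula concrete, the sketch below computes the sums of squares, degrees of freedom, and mean squares by hand with NumPy. The values are illustrative and typed in by hand (an assumption); the uploaded file may contain different numbers.

```python
import numpy as np

# Illustrative values typed in by hand (assumption): 4 conditions, 3 replicates each
groups = {
    "A": [102, 100, 101],
    "B": [101, 101, 104],
    "C": [97, 95, 99],
    "D": [90, 92, 94],
}

all_values = np.concatenate([np.asarray(v, dtype=float) for v in groups.values()])
grand_mean = all_values.mean()

# Between-sample SS: group size times squared distance of each group mean from the grand mean
ss_between = sum(len(v) * (np.mean(v) - grand_mean) ** 2 for v in groups.values())

# Within-sample SS: squared distance of each value from its own group mean
ss_within = sum(((np.asarray(v, dtype=float) - np.mean(v)) ** 2).sum() for v in groups.values())

# Degrees of freedom
df_between = len(groups) - 1                 # number of groups minus one
df_within = len(all_values) - len(groups)    # total observations minus number of groups

# Mean squares (MS = SS / df) and the F ratio
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_ratio = ms_between / ms_within

print("MS between:", round(ms_between, 2))
print("MS within: ", round(ms_within, 2))
print("F ratio:   ", round(f_ratio, 2))
```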

In [None]:
# Code cell 7

# Let us also calculate and add the mean square as a column
aov_table.insert(2, 'mean_sq', aov_table['sum_sq']/aov_table['df'])

# Show the updated table
aov_table


In [None]:
# Code cell 8

# Now let us find the critical value of F(3,8) at p=0.05 using scipy.
# Define the significance level (stored in p_val)
p_val = 0.05

# Define the degrees of freedom between samples and print it
df_between = aov_table.iloc[0,1]
print('Degrees of freedom between samples: ', round(df_between))

# Define the degrees of freedom within samples and print it
df_within = aov_table.iloc[1,1]
print('Degrees of freedom within samples: ', round(df_within))

# Look up the critical value using f.ppf from scipy and print it
crit_f = f.ppf(q=1 - p_val, dfn=df_between, dfd=df_within)
print(f'Critical value F({round(df_between)},{round(df_within)}) at p={p_val}: ', round(crit_f, 3))

As the calculated value of F (20.7) is greater than the critical value, the null hypothesis is rejected.

Thus, the sample means differ significantly.
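The same decision can be cross-checked with `scipy.stats.f_oneway`, which returns the F statistic and its p-value directly. This is a sketch on illustrative, hand-typed values (an assumption), not on the uploaded file, and the variable names are chosen here for illustration.

```python
from scipy.stats import f, f_oneway

# Illustrative values typed in by hand (assumption): 4 conditions, 3 replicates each
a = [102, 100, 101]
b = [101, 101, 104]
c = [97, 95, 99]
d = [90, 92, 94]

# One-way ANOVA: returns the F statistic and the corresponding p-value
result = f_oneway(a, b, c, d)

# Critical value of F at the 5 % significance level
alpha = 0.05
df_between = 4 - 1      # number of groups minus one
df_within = 12 - 4      # total observations minus number of groups
crit_f = f.ppf(1 - alpha, dfn=df_between, dfd=df_within)

# Reject the null hypothesis when the calculated F exceeds the critical value
reject_null = result.statistic > crit_f

print("Calculated F:", round(result.statistic, 2))
print("Critical F:  ", round(crit_f, 3))
print("p-value:     ", round(result.pvalue, 5))
print("Reject null hypothesis:", bool(reject_null))
```

Comparing F with the critical value and comparing the p-value with the significance level are equivalent decisions.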

## Find the least significant difference (LSD)
A significant difference can arise for several different reasons: for example, one mean may differ from all the others or all the means may differ from each other.

A simple way of deciding the reason for a significant result is to arrange the means in increasing order and compare
the difference between adjacent values with a quantity called the least significant difference, LSD.

$$LSD = s \times \sqrt{\frac{2}{n}} \times t_{h(n-1)}$$
Where:
- $n =$ number of observations in each *condition*
- $s = \sqrt{MS_{within}}$, the square root of the within-sample mean square
- $h(n-1) =$ degrees of freedom for $s$
  - $h =$ number of *conditions*
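The formula and the adjacent-means comparison described above can be sketched as follows. The function name `least_significant_difference`, the group means, and the within-sample mean square are assumptions chosen for illustration; they are not read from the ANOVA table.

```python
import numpy as np
import pandas as pd
from scipy.stats import t


def least_significant_difference(ms_within, n, h, alpha=0.05):
    """LSD = s * sqrt(2/n) * t, with s = sqrt(ms_within) and h(n - 1) degrees of freedom."""
    s = np.sqrt(ms_within)
    crit_t = t.ppf(1 - alpha / 2, df=h * (n - 1))   # two-tailed critical value
    return s * np.sqrt(2 / n) * crit_t


# Hypothetical group means, arranged in increasing order (assumption)
means = pd.Series({"A": 101.0, "B": 102.0, "C": 97.0, "D": 92.0}).sort_values()

# Hypothetical within-sample mean square with n = 3 replicates in h = 4 conditions
lsd = least_significant_difference(ms_within=3.0, n=3, h=4)

# Difference between each mean and the next smaller one (the first entry has no neighbour)
adjacent = means.diff().dropna()
significant = adjacent > lsd

print("LSD:", round(lsd, 2))
print(pd.DataFrame({"diff to previous": adjacent, "diff > LSD": significant}))
```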

In [None]:
# Code cell 9

# To find the LSD, we first set the variables
n = 3                               # Observations in each condition
h = 4                               # Number of conditions
ms_within = aov_table.iloc[1,2]     # Within-sample mean square (Residual row, mean_sq column)
s = np.sqrt(ms_within)              # Within-sample standard deviation

# Calculate df of s, h(n - 1) and print it
df_s = h*(n-1)
print('Degrees of freedom for s: ',df_s)

# Find the critical value of t at p=0.05 for a two-tailed test (p_val/2 in each tail)
crit_t = t.ppf(q=1 - p_val/2, df=df_s)
print(f'Critical value of t({df_s}) for two-tailed test at p={p_val}: ',round(crit_t,2))

# Calculate the LSD value
lsd = s*np.sqrt(2/n)*crit_t
print('Least significant difference: ', round(lsd, 2))

In [None]:
# Code cell 10

# Create a new dataframe with just conditions and mean
mean_data = f_data[['Condition', 'mean']].reset_index(drop=True)

# Sort by increasing mean
mean_data.sort_values(by=['mean'], inplace=True)

# Show the dataframe
mean_data

In [None]:
# Code cell 11

# Then let us calculate the differences
# Create a new dataframe with every pairing of conditions and means
diff_data = mean_data.merge(mean_data, how='cross')

# Keep each pair once and drop comparisons of a condition with itself
# (.copy() avoids a pandas warning when adding a column to the filtered dataframe)
diff_data = diff_data[diff_data['Condition_x'] < diff_data['Condition_y']].copy()

# Calculate the difference between means
diff_data['diff'] = abs(diff_data['mean_x'] - diff_data['mean_y'])

# Sort by the conditions or by means - uncomment one of the below
diff_data.sort_values(by=['Condition_x', 'Condition_y'], ascending=True, inplace=True)
#diff_data.sort_values(by='diff', ascending=False, inplace=True)

# Reset the index
diff_data.reset_index(drop=True, inplace=True)

# Show the data
diff_data

Comparing the LSD value with the differences between means indicates whether two compared conditions differ significantly.

In [None]:
# Code cell 12

# Add a new column with the LSD
diff_data['LSD'] = round(lsd, 2)

# Flag whether each difference between means exceeds the LSD
diff_data['diff > LSD'] = diff_data['diff'].apply(lambda x: 'Yes' if x > lsd else 'No')

diff_data

From the above table, we can see that conditions A and B cannot be considered different, while all other pairs of conditions differ significantly from each other.