# One-way ANOVA

Imagine you have three groups, and you want to do a one-level ANOVA to test for
overall differences across the groups.

The general technique for a permutation test is:

* You decide on your metric
* You get your metric for the actual data - observed metric
* You permute your data and take the same metric from the permuted data, and
  repeat many times - fake metrics
* You compare your observed metric to your fake metrics, to see how unusual it
  is.

For a two-sample permutation test, your metric is the difference in the two
sample means.

For a three sample version of the test — we need a metric that will be big
where there are big differences between the three groups, and small when there
are small differences.

Let us reflect on what what we want from the metric.  It should be a single
number to summarize all the values from the groups. It should be be large for
big differences between the means for the various groups.  It should be larger
when more observations are in the groups with large difference in means.

Consider the following metric.  We will soon see this is the metric that the F-test uses.

* Get the sample means for each of the three groups A, B, C, to give `mean_a`,
  `mean_b`, `mean_c`
* Get the mean across all the observations regardless of group
  (`mean_overall`)
* Subtract `mean_overall` from each of `mean_a`, `mean_b`, `mean_c` to give
  `mean_a_diff`, `mean_b_diff`, `mean_c_diff`.
* We are interested in positive as well as negative differences, so we do not
  want to add these mean differences, otherwise the positive and negative means
  differences will cancel out. So we next square the differences to give:
  `sq_mean_a_diff`, `sq_mean_b_diff`, `sq_mean_c_diff`.
* We want larger groups to have greater weight than small groups.  Call the
  number in groups A, B, and C `n_a`, `n_b`, `n_c`. To weight the squared mean
  differences we multiply each square mean difference by the number in each
  group: `sq_mean_a_diff * n_a`, `sq_mean_b_diff * n_b`, `sq_mean_c_diff *
  n_c`, to give `nsq_mean_a_diff`, `nsq_mean_b_diff`, `nsq_mean_c_diff`.
*   Finally, we add up the group `nsq` scores to give our metric:

    ```
    our_metric = nsq_mean_a_diff + nsq_mean_b_diff + nsq_mean_c_diff
    ```

We will call this the SNSQGMD metric (Sum of N times SQuared Group Mean
Difference).

SNSQGMD will be large and positive when the individual groups have different
means from each other and small when the means for the groups are pretty
similar to each other, and therefore, to the overall mean.  It will be larger
when larger groups have means with bigger deviations from the overall mean.

To follow the recipe above, we calculate SNSQGMD for the actual groups A, B, C.
Permute the group labels to give random groups A, B, C, and recalculate the
metric.   See whether SNSQGMD in the real data is unusual in the distribution of
the same metric for the permuted groups.

This is the permutation equivalent of the one-way ANOVA.   The one-way ANOVA
just uses some assumptions from the normal distribution to estimate the spread
in the random distribution of SNSQGMD, instead of using permutation to calculate
the random distribution.


## An example

In [126]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Our dataset is a table giving the amount of weight lost for a group of people allocated to one of three possible diets, called `A`, `B` and `C`.

See: [the dataset page](https://github.com/odsti/datasets/tree/master/sheffield_diet) for more detail.

In [127]:
# Read the raw dataset
diet_data = pd.read_csv('sheffield_diet.csv')
diet_data.head()

Unnamed: 0,gender,diet,weight_lost
0,Female,A,3.8
1,Female,A,6.0
2,Female,A,0.7
3,Female,A,2.9
4,Female,A,2.8


In [128]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

mfit = smf.ols('weight_lost ~ gender * diet', data=diet_data).fit()

sm.stats.anova_lm(mfit, typ=2)

Unnamed: 0,sum_sq,df,F,PR(>F)
gender,0.168696,1.0,0.031379,0.85991
diet,60.41722,2.0,5.619026,0.005456
gender:diet,33.904068,2.0,3.153204,0.048842
Residual,376.329043,70.0,,


Here we are only interested in the last two columns:

In [129]:
# The group means by diet
diet_data.groupby('diet')['weight_lost'].mean()

diet
A    3.300000
B    3.268000
C    5.148148
Name: weight_lost, dtype: float64

In [130]:
# Subtract group means for diet.
# This column makes a new data frame that has one row as per
# the original data frame, with new column that contains
# the group means
diet_means = diet_data.groupby('diet')['weight_lost'].transform('mean')
diet_means

0     3.300000
1     3.300000
2     3.300000
3     3.300000
4     3.300000
        ...   
71    5.148148
72    5.148148
73    5.148148
74    5.148148
75    5.148148
Name: weight_lost, Length: 76, dtype: float64

In [131]:
minus_gms = diet_data['weight_lost'] - diet_means
minus_gms

0     0.500000
1     2.700000
2    -2.600000
3    -0.400000
4    -0.500000
        ...   
71   -2.348148
72   -1.048148
73    0.151852
74    4.051852
75    0.951852
Name: weight_lost, Length: 76, dtype: float64

In [132]:
# Exactly the same as subtracting group by group.
is_A = diet_data['diet'] == 'A'
only_A = diet_data[is_A]
a_demeaned = only_A['weight_lost'] - np.mean(only_A['weight_lost'])
a_demeaned.head()

0    0.5
1    2.7
2   -2.6
3   -0.4
4   -0.5
Name: weight_lost, dtype: float64

In [133]:
# Put this into a function
def subtract_means(df, group_col, value_col):
    mean_col = df.groupby(group_col)[value_col].transform('mean')
    return df[value_col] - mean_col

In [134]:
subtract_means(diet_data, 'diet', 'weight_lost')

0     0.500000
1     2.700000
2    -2.600000
3    -0.400000
4    -0.500000
        ...   
71   -2.348148
72   -1.048148
73    0.151852
74    4.051852
75    0.951852
Name: weight_lost, Length: 76, dtype: float64

In [135]:
# Now we've subtracted the means by group, the group means
# should all be very close to 0
new_df = diet_data.copy()
new_df['diet_demeaned'] = subtract_means(new_df, 'diet', 'weight_lost')
new_df.head()

Unnamed: 0,gender,diet,weight_lost,diet_demeaned
0,Female,A,3.8,0.5
1,Female,A,6.0,2.7
2,Female,A,0.7,-2.6
3,Female,A,2.9,-0.4
4,Female,A,2.8,-0.5


In [136]:
# Means by group now very very close to 0.
new_df.groupby('diet')['diet_demeaned'].mean()

diet
A    6.106227e-16
B    5.506706e-16
C    1.019760e-15
Name: diet_demeaned, dtype: float64

In [137]:
# This implies that the overall mean is going to be very very close to 0
new_df['diet_demeaned'].mean()

7.829993963145841e-16

In [138]:
# Effect of gender.
gender_group = new_df.groupby('gender')['diet_demeaned']
gender_means = gender_group.mean()
gender_means

gender
Female   -0.041261
Male      0.053764
Name: diet_demeaned, dtype: float64

In [139]:
# Sum of squared effect of gender
ss_mg = np.sum(gender_group.count() * gender_means ** 2)
ss_mg

0.16859598416551386

In [153]:
sm.stats.anova_lm(mfit, typ=2)

Unnamed: 0,sum_sq,df,F,PR(>F)
gender,0.168696,1.0,0.031379,0.85991
diet,60.41722,2.0,5.619026,0.005456
gender:diet,33.904068,2.0,3.153204,0.048842
Residual,376.329043,70.0,,


In [155]:
sm.stats.anova_lm(mfit, typ=1)

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
gender,1.0,0.278485,0.278485,0.0518,0.820623
diet,2.0,60.41722,30.20861,5.619026,0.005456
gender:diet,2.0,33.904068,16.952034,3.153204,0.048842
Residual,70.0,376.329043,5.376129,,


In [156]:
get_sn_sq_gmd(diet_data, 'gender', 'weight_lost')

0.27848457030525375

In [154]:
mf = smf.ols('weight_lost ~ diet * gender', data=diet_data).fit()
sm.stats.anova_lm(mf, typ=2)

Unnamed: 0,sum_sq,df,F,PR(>F)
diet,60.41722,2.0,5.619026,0.005456
gender,0.168696,1.0,0.031379,0.85991
diet:gender,33.904068,2.0,3.153204,0.048842
Residual,376.329043,70.0,,


In [151]:
get_sn_sq_gmd(diet_data, 'diet', 'weight_lost')

60.52700838206624

In [141]:
# Make this into a function
def get_adj_sn_sq_gmd(df, group_col, remove_col, val_col):
    df_adj = df.copy()
    df_adj['adj'] = subtract_means(df_adj, remove_col, val_col)
    return get_sn_sq_gmd(df_adj, group_col, 'adj')

In [142]:
get_adj_sn_sq_gmd(diet_data, 'gender', 'diet', 'weight_lost')

0.16859598416551386

In [143]:
get_adj_sn_sq_gmd(diet_data, 'diet', 'gender', 'weight_lost')

60.40356943825723

In [144]:
ndf2 = diet_data.copy()
ndf2['gender_demeaned'] = subtract_means(ndf2, 'gender', 'weight_lost')
ndf2.groupby('gender')['gender_demeaned'].mean()

gender
Female   -6.196594e-17
Male      4.306320e-16
Name: gender_demeaned, dtype: float64

In [145]:
ndf2['gender_demeaned'].mean()

1.8698493046318425e-16

In [146]:
diet_group = ndf2.groupby('diet')['gender_demeaned']
ss_mg_gender = np.sum(diet_group.count() * diet_group.mean() ** 2)
ss_mg_gender

60.40356943825723

In [157]:
# Order matters.
# Subtract gender, then diet means
g_then_d = diet_data.copy()
g_then_d['g_adj'] = subtract_means(g_then_d, 'gender', 'weight_lost')
g_then_d['both_adj'] = subtract_means(g_then_d, 'diet', 'g_adj')
g_then_d.head()

Unnamed: 0,gender,diet,weight_lost,g_adj,both_adj
0,Female,A,3.8,-0.093023,0.550887
1,Female,A,6.0,2.106977,2.750887
2,Female,A,0.7,-3.193023,-2.549113
3,Female,A,2.9,-0.993023,-0.349113
4,Female,A,2.8,-1.093023,-0.449113


In [158]:
# Subtract diet, then gender means
d_then_g = diet_data.copy()
d_then_g['d_adj'] = subtract_means(d_then_g, 'diet', 'weight_lost')
d_then_g['both_adj'] = subtract_means(d_then_g, 'gender', 'd_adj')
d_then_g.head()

Unnamed: 0,gender,diet,weight_lost,d_adj,both_adj
0,Female,A,3.8,0.5,0.541261
1,Female,A,6.0,2.7,2.741261
2,Female,A,0.7,-2.6,-2.558739
3,Female,A,2.9,-0.4,-0.358739
4,Female,A,2.8,-0.5,-0.458739


In [159]:
# We need to use optimize to deal with this.
# First we need dummy columns.
gender_dummies = pd.get_dummies(diet_data['gender'])
gender_dummies

Unnamed: 0,Female,Male
0,1,0
1,1,0
2,1,0
3,1,0
4,1,0
...,...,...
71,0,1
72,0,1
73,0,1
74,0,1


In [178]:
# Use optimize to get the best least-squares parameters
def rmse(params, y, cols):
    predicted = np.sum(np.array(params) * np.array(cols), axis=1)
    error = y - predicted
    return np.sqrt(np.mean(error ** 2))

In [187]:
just_ones = np.ones((len(diet_data), 1))

In [189]:
y = diet_data['weight_lost']
rmse([1], y, just_ones)

3.8568974441559685

In [197]:
overall_mean = np.mean(y)
overall_mean

3.9460526315789486

In [198]:
rmse(overall_mean, y, just_ones)

2.4892633020039487

In [199]:
from scipy.optimize import minimize
res = minimize(rmse, [0], args=(y, just_ones))
res

      fun: 2.4892633020039505
 hess_inv: array([[2.48991414]])
      jac: array([-5.96046448e-08])
  message: 'Optimization terminated successfully.'
     nfev: 14
      nit: 5
     njev: 7
   status: 0
  success: True
        x: array([3.94605253])

In [194]:
rmse([1, 1], y, gender_dummies)

3.8568974441559685

In [200]:
g_means = diet_data.groupby('gender')['weight_lost'].mean()
g_means

gender
Female    3.893023
Male      4.015152
Name: weight_lost, dtype: float64

In [201]:
rmse(g_means, diet_data['weight_lost'], gender_dummies)

2.4885271780797753

In [181]:
from scipy.optimize import minimize
res = minimize(rmse, [0, 0], args=(diet_data['weight_lost'], gender_dummies))
res

      fun: 2.488527178085491
 hess_inv: array([[4.27660579, 0.09271357],
       [0.09271357, 5.66055176]])
      jac: array([-1.34110451e-06,  7.74860382e-07])
  message: 'Optimization terminated successfully.'
     nfev: 33
      nit: 8
     njev: 11
   status: 0
  success: True
        x: array([3.89301735, 4.01515599])

In [203]:
all_dummies = pd.get_dummies(diet_data.loc[:, ['gender', 'diet']])
all_dummies

Unnamed: 0,gender_Female,gender_Male,diet_A,diet_B,diet_C
0,1,0,1,0,0
1,1,0,1,0,0
2,1,0,1,0,0
3,1,0,1,0,0
4,1,0,1,0,0
...,...,...,...,...,...
71,0,1,0,0,1
72,0,1,0,0,1
73,0,1,0,0,1
74,0,1,0,0,1


In [204]:
starting = np.zeros(len(all_dummies.columns))
res = minimize(rmse, starting, args=(y, all_dummies))
res

      fun: 2.3233174771190814
 hess_inv: array([[ 3.07139822, -1.27309257,  0.26237771,  0.30858282,  0.22731302],
       [-1.27309257,  3.82104338,  0.66446696,  0.48205817,  0.40142805],
       [ 0.26237771,  0.66446696,  5.47407237, -1.83141112, -1.71722954],
       [ 0.30858282,  0.48205817, -1.83141112,  5.28343428, -1.66006448],
       [ 0.22731302,  0.40142805, -1.71722954, -1.66006448,  5.00610079]])
      jac: array([ 2.98023224e-08,  0.00000000e+00,  2.98023224e-08, -2.98023224e-08,
        2.98023224e-08])
  message: 'Optimization terminated successfully.'
     nfev: 84
      nit: 11
     njev: 14
   status: 0
  success: True
        x: array([2.29947103, 2.39455244, 0.96091195, 0.926693  , 2.80641883])

## Cheating

In [218]:
grouped = diet_data.groupby(['gender', 'diet'])
groups = grouped.groups
groups

{('Female', 'A'): [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13], ('Female', 'B'): [14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], ('Female', 'C'): [28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42], ('Male', 'A'): [43, 44, 45, 46, 47, 48, 49, 50, 51, 52], ('Male', 'B'): [53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], ('Male', 'C'): [64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75]}

In [222]:
equal_indices = []
for group_ids in groups:
    indices = groups[group_ids]
    equal_indices += list(indices)[:10]
diet_cheat = diet_data.loc[equal_indices]
diet_cheat.head()

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13], dtype='int64')
Int64Index([14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27], dtype='int64')
Int64Index([28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42], dtype='int64')
Int64Index([43, 44, 45, 46, 47, 48, 49, 50, 51, 52], dtype='int64')
Int64Index([53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63], dtype='int64')
Int64Index([64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75], dtype='int64')


Unnamed: 0,gender,diet,weight_lost
0,Female,A,3.8
1,Female,A,6.0
2,Female,A,0.7
3,Female,A,2.9
4,Female,A,2.8


In [215]:
groups = []
for element in group_ns.index:
    group = grouped.get_group(element).iloc[:10]
    groups.append(group)
diet_cheat = pd.concat(groups)
diet_cheat.head()

0    3.8
1    6.0
2    0.7
3    2.9
4    2.8
Name: weight_lost, dtype: float64

## Means by group

Here are the data, plotted by group.

In [None]:
diets.plot.scatter('diet', 'weight_lost');

These are the means for each of the three groups.

In [None]:
group_means = diets.groupby('diet').mean()
group_means

These are the number of observations per group:

In [None]:
group_ns = diets.groupby('diet').count()
group_ns

Here is the overall mean, ignoring the group membership:

In [None]:
overall_mean = np.mean(diets['weight_lost'])
overall_mean

The next plot shows the data, the group means, and the overall mean:

In [None]:
diets.plot.scatter('diet', 'weight_lost',
                         label='Data')
plt.scatter(group_means.index, np.array(group_means), color='red',
            label='Group means')
# A dashed line at the overall mean.
plt.plot(group_means.index,
         [overall_mean, overall_mean, overall_mean],
         ':', color='green',
         label='Overall mean')
# A dashed line between each group mean and the overall mean.
for group in group_means.index:
    xs = [group, group]
    ys = [float(group_means.loc[group]), overall_mean]
    plt.plot(xs, ys, ':', color='red')
plt.legend();

Notice the red dashed lines between the group means and the
overall mean.  To make these easier to see, here is the same
plot, without the individual data points:

In [None]:
plt.scatter(group_means.index, np.array(group_means), color='red',
            label='Group means')
# A dashed line at the overall mean.
plt.plot(group_means.index,
         [overall_mean, overall_mean, overall_mean],
         ':', color='green',
         label='Overall mean')
# A dashed line between each group mean and the overall mean.
for group in group_means.index:
    xs = [group, group]
    ys = [float(group_means.loc[group]), overall_mean]
    plt.plot(xs, ys, ':', color='red')
plt.legend();

We designed our SNSQGMD metric to be large when the sum of the squared lengths of
these lines are large.  The N in  SNSQGMD reminds us we multiply each squared
length by the number in the group, to give more weight to large groups.

To calculate SNSQGMD we get the Group Mean Difference.

In [None]:
gmd = group_means - overall_mean
gmd

We square these differences:

In [None]:
sq_gmd = gmd ** 2
sq_gmd

We want to give more weight to groups with more members, so we multiply each
squared difference by the number in the group:

In [None]:
n_sq_gmd = sq_gmd * group_ns
n_sq_gmd

Finally, we add up these weighted squares to get the final metric:

In [None]:
observed_sn_sq_gmd = np.sum(n_sq_gmd)
observed_sn_sq_gmd

To make the process a bit clearer, we put the calculation of our
metric into its own function so we can re-use it on different data frames.

In [None]:
def get_sn_sq_gmd(df, group_col, val_col):
    overall_mean = np.mean(df[val_col])
    grouped = df.groupby(group_col)[val_col]
    sq_gmd = (grouped.mean() - overall_mean) ** 2
    return np.sum(sq_gmd * grouped.count())

Check that we get the same answer from the function as we did with the
step-by-step calculation:

In [None]:
get_sn_sq_gmd(diets, 'diet', 'weight_lost')

Next we consider a single trial in our ideal, null, fake world.  We do this by
making a copy of the data frame, and then permuting the diet labels, so
the association between the diet and the change values is random.

In [None]:
fake_data = diets.copy()
# Permute the treatment labels
fake_data['diet'] = np.random.permutation(fake_data['diet'])
fake_data.head()

We calculate our metric on these new data, step by step.

In [None]:
fake_grouped = fake_data.groupby('diet')['weight_lost']
# Notice that the overall_mean cannot change because we did not
# change these values.
fake_sq_gmd = (fake_grouped.mean() - overall_mean) ** 2
fake_sn_sq_gmd = np.sum(fake_sq_gmd * fake_grouped.count())
fake_sn_sq_gmd

We can also use the function above to do that calculation, and get the same
answer:

In [None]:
get_sn_sq_gmd(fake_data, 'diet', 'weight_lost')

Now we are ready to do our simulation.  We do 10000 trials. In each trial, we
make a new random association, and recalculate the sum of squares metric.

In [None]:
n_iters = 10000
fake_sn_sq_gmds = np.zeros(n_iters)
for i in np.arange(n_iters):
    # Make sample from null world.
    fake_data['diet'] = np.random.permutation(fake_data['diet'])
    # Calculate corresponding metric.
    fake_sn_sq_gmds[i] = get_sn_sq_gmd(fake_data, 'diet', 'weight_lost')

Of course, because these are sums of squares, they must all be positive.

In [None]:
plt.hist(fake_sn_sq_gmds, bins=100);

How does our observed sum of squares metric compare to the distribution of fake
sum of square metrics?

In [None]:
p = np.count_nonzero(fake_sn_sq_gmds >= float(observed_sn_sq_gmd)) / n_iters
p

The p value tells us that this observed metric is very unlikely to have come
about in a random world.


## Comparing to standard one-way ANOVA F tests

In this section, we do the standard F-test calculations to show that we get a
similar p value to the permutation version above.  This is the Statsmodels
implementation of the one-way F test:

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

mod = smf.ols('weight_lost ~ diet', data=diets).fit()

sm.stats.anova_lm(mod, typ=1)

Here is the same calculation in Scipy:

In [None]:
from scipy.stats import f_oneway

In [None]:
# Get the values from the individual groups.
treatment = diets['diet']
change = diets['weight_lost']
diet_a_values = np.array(change[treatment == 'A'])
diet_b_values = np.array(change[treatment == 'B'])
diet_c_values = np.array(change[treatment == 'C'])

Do the F-test:

In [None]:
f_result = f_oneway(diet_a_values, diet_b_values, diet_c_values)
f_result

## The F statistic and the SNSQGMD metric

In this section, we go into more detail about the calculation of the F value
that you see above.  Here is the F statistic we got from Scipy (and
Statsmodels):

In [None]:
F_stat = f_result.statistic
F_stat

This section goes through the calculation of the F statistic from the SNSQGMD
metric.  This is the SNSQGMD value we calculated:

In [None]:
observed_sn_sq_gmd

You can get the F statistic above by dividing the SNSQGMD metric by a scaled
estimate of the variation still present in the data.

The variation still present in the data are the remaining distances between the
data (in the plot above) and their corresponding group means.  Call these
remaining distances the "residuals".

Subtract each group mean from their respective group values:

In [None]:
# Calculate the residuals from the group means.
diet_a_resid = diet_a_values - np.mean(diet_a_values)
diet_b_resid = diet_b_values - np.mean(diet_b_values)
diet_c_resid = diet_c_values - np.mean(diet_c_values)

Finally, we sum up the squares of these residuals:

In [None]:
# We concatenate the three sets of residuals into one long array.
all_group_resid = np.concatenate(
    [diet_a_resid, diet_b_resid, diet_c_resid])
# Sum of squared residuals from group means.
ssq_resid_groups = np.sum(all_group_resid ** 2)
ssq_resid_groups

The F-statistic results from dividing this measure of remaining variation into
the SNSQGMD metric, with some scaling.  The scaling comes from the number of
observations, and the number of groups.

In [None]:
n_obs = len(diets)
n_groups = len(group_means)

Here is the full calculation of the F-statistic. Notice that it is exactly the
same as we got from Scipy and Statsmodels.

In [None]:
# Calculate of the F value by scaling and dividing by residual variation
# metric.
df_groups = n_groups - 1  # Degrees of freedom for groups.
df_error = n_obs - n_groups  # Degrees of freedom for residuals.
# The F statistic
(observed_sn_sq_gmd / df_groups) / (ssq_resid_groups / df_error)

Scaling and dividing by the residual variation gives a value that we can reason
about with some standard mathematics, as long as we are prepared to assume that
the values come from a normal distribution.  Specifically, with those
assumptions, we can get a p value by comparing the observed F value to a
standard F distribution with the same "degrees of freedom".  These are the
`df_groups` and `df_error` values above.

In [None]:
# Get standard F distribution object from Scipy.
from scipy.stats import f as f_dist

# Look up p value for a particular F statistic and degrees of freedom.
# Notice the p value is the same as the p value from Scipy f_oneway
# and from Statsmodels.
f_dist(df_groups, df_error).sf(F_stat)

As you have seen, the permutation estimate gives a very similar answer.  We
would argue that it is also a lot easier to explain.

## F tests in terms of explained variation

You will often see explanations of the F-value in terms of the amount of
variation explained by the overall mean, compared to the amount of variation
explained with the individual group means.  In fact, this "variance" way of
thinking is what gave the test the name ANOVA (Analysis of Variance).

The explained variation path (literally) adds up to the same thing as the
SNSQGMD metric version of the F statistic above.  The current section goes
through the explained variation way of thinking of the F statistic, and shows
that it gives the same value for the SNSQGMD metric.

The "variance" way of thinking about the F looks at the sum of squared
"residual" variation in two situations.  First we get the residual variation
when we subtract the group means.  We already have this from the F test
calculation above.  Here we repeat the calculation as a reminder of what the
value means:

In [None]:
ssq_resid_groups = np.sum(all_group_resid ** 2)
ssq_resid_groups

This is the sum of squared remaining variation when using the group means.

We compare this to the squared remaining variation when just using the overall
mean.  Here is that calculation.

In [None]:
# Sum of squared residuals using overall mean
# Subtract the overall mean from the original values to get residuals.
resid_overall = diets['weight_lost'] - overall_mean
# Square and sum the residuals to get the squared variation from overall mean.
ssq_resid_overall = np.sum(resid_overall ** 2)
ssq_resid_overall

The variance way of thinking says that we should be particularly interested in
our group means, when using them does a very good job of reducing the
variation.  This will happen when the group values are a lot closer to their
individual group means, than they are to the overall mean.  In that case,
`ssq_resid_groups` will be much lower than `ssq_resid_overall`, so we will get
a fairly high value for `ssq_resid_overall - ssq_resid_groups`.  The result of
this subtraction is called the *extra sum of squares* explained by the sample
means:

In [None]:
ess = ssq_resid_overall - ssq_resid_groups
ess

Remember, `ssq_resid_overall` is the (sum of squared) variation remaining after
accounting for the overall mean, and `ssq_resid_groups` is the (sum of squared)
variation remaining after accounting for the sample means, so `ess` is the
*extra* variation accounted for by using the sample means.

But — wait — the `ess` value is *exactly* the same as the SNSQGMD metric we
were already using!

In [None]:
observed_sn_sq_gmd

This striking fact is true for any possible values and groups, and arises from
the algebra of adding squared deviations from means.  The equivalence gives us
two different ways of thinking of the same SNSQGMD metric value.  The SNSQGMD
value is both:

* A measure of how far the sample means are from the overall mean, AND
* A measure of how much variation the sample means explain, over and above the
  overall mean.

In this second "explained variance" interpretation, we think of the F test
calculation as being a scaled ratio of the extra variance explained by the
sample means to the variance still remaining when we use the sample means.  If
the sample means explain a lot of variation, then the top half of the F
statistic will be large, and the bottom half will be small, giving a large F
value.