source: [video](https://www.khanacademy.org/math/statistics-probability/analysis-of-variance-anova-library/analysis-of-variance-anova/v/anova-1-calculating-sst-total-sum-of-squares)

### ANOVA 
* Analysis of Variance
* SST (Sum of Squares Total)

In [30]:
data = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]
points_per_sample = 3

# we can take the mean of all the data points pooled together
flattened = [x for xs in data for x in xs]
grand_mean = sum(flattened)/len(flattened)

# or we can take the mean of the group means
mean_of_means = sum([sum(x)/len(x) for x in data])/len(data)

# and they should be the same, because every group has the same number of points
assert grand_mean == mean_of_means

# to get the SST
SST = sum([(x - grand_mean)**2 for x in flattened])

print(f"\nSST: {SST}\n")

# How many degrees of freedom do we have?
# If we know the grand mean, then there are only m*n - 1 independent
# measurements among the m*n data points (m groups of n points each), so to
# calculate the variance of the data above we divide SST by m*n - 1

degrees_of_freedom = len(data) * len(data[0]) - 1
variance = SST / degrees_of_freedom 
print(f"\nVariance for entire group of data: {variance}")


SST: 30.0


Variance for entire group of data: 3.75
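
As a quick cross-check of the `m*n - 1` divisor, the hand-computed value can be compared with the standard library's sample variance. This is a minimal sketch that rebuilds the same data on its own; the names `variance_by_hand` and `variance_stdlib` are only illustrative.

```python
import statistics

data = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]
flattened = [x for group in data for x in group]

# SST: squared distance of every point from the grand mean
grand_mean = sum(flattened) / len(flattened)
sst = sum((x - grand_mean) ** 2 for x in flattened)

# Sample variance by hand: SST divided by (number of points - 1)
variance_by_hand = sst / (len(flattened) - 1)

# statistics.variance also uses the (number of points - 1) denominator
variance_stdlib = statistics.variance(flattened)

print(variance_by_hand, variance_stdlib)
```

Both values should agree, which confirms that dividing SST by `m*n - 1` gives the usual sample variance of the pooled data.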


* So we know there is variation across all 9 data points above, but some of it may come from differences between the groups and some from differences within the groups

In [20]:
# SSW (Sum of Squares Within): find the distance of each data point 
# from its respective group mean 
means = [sum(x) / len(x) for x in data]
SSW = sum([sum((xi - mean)**2 for xi in group) for group, mean in zip(data, means)])
print(f"\nSSW: {SSW}\n")


SSW: 6.0



* We have calculated SSW as 6 and SST (the total sum of squares, not the variance) as 30, so we can think of it as 6 of the 30 units of total variation coming from variation within the groups
* For each group we have n data points, which means we have n - 1 degrees of freedom: if we know the mean of the n data points, we can calculate the nth data point using only the mean and the other n - 1 points
* Across all m groups, SSW therefore has $m\cdot(n-1)$ degrees of freedom, or $3\cdot(3-1) = 6$ here
* Now we want to calculate the variation that comes from between the sample groups, or between the means (the central tendencies of each group)

In [35]:
# SSB (Sum of Squares Between): squared distance of each group mean from the
# grand mean, counted once for every data point in that group
SSB = sum(points_per_sample * (mean - grand_mean)**2 for mean in means)
print(f"\nSSB: {SSB}\n")

print(f"\ntotal variation\nSST = SSB + SSW = {SSB + SSW}\n")


SSB: 24.0


total variation
SST = SSB + SSW = 30.0
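
The three sums of squares and their degrees of freedom can be checked together. The sketch below wraps the calculations from the cells above in a hypothetical helper, `sums_of_squares` (not something from the video), and confirms that both the sums of squares and the degrees of freedom split into a between part and a within part for this data.

```python
def sums_of_squares(groups):
    """Return (SST, SSW, SSB) for a list of groups of numbers."""
    pooled = [x for group in groups for x in group]
    grand_mean = sum(pooled) / len(pooled)
    group_means = [sum(group) / len(group) for group in groups]

    # Total: every point against the grand mean
    sst = sum((x - grand_mean) ** 2 for x in pooled)
    # Within: every point against its own group mean
    ssw = sum((x - gm) ** 2 for group, gm in zip(groups, group_means) for x in group)
    # Between: every group mean against the grand mean, once per point in the group
    ssb = sum(len(group) * (gm - grand_mean) ** 2 for group, gm in zip(groups, group_means))
    return sst, ssw, ssb


data = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]
m, n = len(data), len(data[0])  # m groups of n points each

sst, ssw, ssb = sums_of_squares(data)
df_total, df_within, df_between = m * n - 1, m * (n - 1), m - 1

print(f"SST = {sst}, SSW + SSB = {ssw + ssb}")
print(f"df total = {df_total}, df within + df between = {df_within + df_between}")
```

For the data above this gives 30 = 6 + 24 for the sums of squares and 8 = 6 + 2 for the degrees of freedom.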



* Now imagine that each of the groups is the result of an experiment: I gave a different type of food to each of three groups of people taking a test, and the numbers are their scores on the test
* I want to know whether the food actually affects the scores, or whether the difference in the means between the groups is just due to random chance
* Are the true population means the same? $\mu_1 = \mu_2 = \mu_3$
    * If they are not equal, then the food does have some effect 
* $H_0$: food doesn't make a difference -> $\mu_1 = \mu_2 = \mu_3$
* $H_1$: it does
* Assume $H_0$. 
* Use the F-statistic (F comes from Fisher), which is the ratio of SSB and SSW, each divided by its respective degrees of freedom $$F = \frac{\frac{SSB}{m-1}}{\frac{SSW}{m(n-1)}}$$
    * If the numerator is much larger than the denominator, the variation is mostly due to differences between the group means and less to the variation within the groups--this suggests there is a difference in the true population means, making it easier to reject the null hypothesis
    * If the denominator is larger, the variation within each sample dominates, which tells us that any difference between the means is likely just random, making it harder to reject the null hypothesis
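
To make the ratio concrete, here is a minimal sketch that evaluates the F-statistic for the data used in these notes. The names `msb` and `msw` (each sum of squares divided by its degrees of freedom) are only illustrative labels and do not appear in the cells above.

```python
data = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]
m, n = len(data), len(data[0])  # m groups of n points each

flattened = [x for group in data for x in group]
grand_mean = sum(flattened) / len(flattened)
means = [sum(group) / len(group) for group in data]

# Sums of squares within and between the groups
ssw = sum((x - mean) ** 2 for group, mean in zip(data, means) for x in group)
ssb = sum(n * (mean - grand_mean) ** 2 for mean in means)

# Divide each sum of squares by its degrees of freedom
msb = ssb / (m - 1)        # numerator of F
msw = ssw / (m * (n - 1))  # denominator of F

f_stat = msb / msw
print(f"F = {f_stat}")  # prints "F = 12.0"
```

Here the numerator (24 / 2 = 12) is twelve times the denominator (6 / 6 = 1). Whether that is large enough to reject $H_0$ depends on comparing it with the F-distribution for 2 and 6 degrees of freedom at a chosen significance level, a step these notes do not cover.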
    