# Unbalanced two-way ANOVA

This page follows on from the [two-way unbalanced ANOVA
notebook](./twoway_unbalanced.ipynb).

Please work through that notebook before reading this one, because we re-use
its notation and machinery here.

## Back again to the example


In [1]:
# Array, data frame and plotting libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Statsmodels ANOVA machinery.
import statsmodels.api as sm
import statsmodels.formula.api as smf

We return again to our dataset giving amount of weight lost (in kilograms)
after 10 weeks of one of three possible diets.

As in the one-way ANOVA page, we use the `gender` column to define one factor,
and the `diet` column to define another, so we can classify the rows
(individuals) into `Female` or `Male` (`gender`) and into `A`, `B` and `C`
(`diet`).

See: [the dataset page](https://github.com/odsti/datasets/tree/master/sheffield_diet) for more detail.

In [2]:
# Read the dataset
diet_data = pd.read_csv('sheffield_diet.csv')
diet_data.head()

|   | gender | diet | weight_lost |
|---|--------|------|-------------|
| 0 | Female | A    | 3.8         |
| 1 | Female | A    | 6.0         |
| 2 | Female | A    | 0.7         |
| 3 | Female | A    | 2.9         |
| 4 | Female | A    | 2.8         |


Pandas `groupby` can classify each row using *both* the `gender` label (level) and the `diet` label (level).  Notice that the six possible sub-groups contain different numbers of individuals; this is what makes the design *unbalanced*.

In [3]:
grouped = diet_data.groupby(['gender', 'diet'])
# Show the counts in each of the six groups.
grouped['weight_lost'].count()

gender  diet
Female  A       14
        B       14
        C       15
Male    A       10
        B       11
        C       12
Name: weight_lost, dtype: int64
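
For a two-factor layout, a cross-tabulation can be easier to read than the stacked `groupby` counts. The sketch below uses a small made-up data frame (`toy`, which is not the diet data) and `pd.crosstab` to put one factor on the rows and the other on the columns.

```python
import pandas as pd

# Hypothetical toy labels, standing in for the diet dataset.
toy = pd.DataFrame({
    'gender': ['Female'] * 7 + ['Male'] * 5,
    'diet': list('AABBCCC') + list('ABBCC'),
})

# Rows are gender levels, columns are diet levels, cells are counts.
counts = pd.crosstab(toy['gender'], toy['diet'])
print(counts)
```

Unequal cell counts in such a table are the sign of an unbalanced design.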

## General linear model notation

Let us say we want to do an F-test for the interaction term.

We partition the design matrix into the main effect columns and the interaction columns, like this:

In [4]:
# Main effects
gender = diet_data['gender']
diet = diet_data['diet']
# Dummy columns for each effect.
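# dtype=int gives 0 / 1 columns; recent Pandas versions default to True / False.
g_cols = pd.get_dummies(gender, dtype=int)
d_cols = pd.get_dummies(diet, dtype=int)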
# The main effect design matrix.
X1 = np.hstack((g_cols, d_cols))
X1

array([[1, 0, 1, 0, 0],
       [1, 0, 1, 0, 0],
       [1, 0, 1, 0, 0],
       [1, 0, 1, 0, 0],
       [1, 0, 1, 0, 0],
       [1, 0, 1, 0, 0],
       [1, 0, 1, 0, 0],
       [1, 0, 1, 0, 0],
       [1, 0, 1, 0, 0],
       [1, 0, 1, 0, 0],
       [1, 0, 1, 0, 0],
       [1, 0, 1, 0, 0],
       [1, 0, 1, 0, 0],
       [1, 0, 1, 0, 0],
       [1, 0, 0, 1, 0],
       [1, 0, 0, 1, 0],
       [1, 0, 0, 1, 0],
       [1, 0, 0, 1, 0],
       [1, 0, 0, 1, 0],
       [1, 0, 0, 1, 0],
       [1, 0, 0, 1, 0],
       [1, 0, 0, 1, 0],
       [1, 0, 0, 1, 0],
       [1, 0, 0, 1, 0],
       [1, 0, 0, 1, 0],
       [1, 0, 0, 1, 0],
       [1, 0, 0, 1, 0],
       [1, 0, 0, 1, 0],
       [1, 0, 0, 0, 1],
       [1, 0, 0, 0, 1],
       [1, 0, 0, 0, 1],
       [1, 0, 0, 0, 1],
       [1, 0, 0, 0, 1],
       [1, 0, 0, 0, 1],
       [1, 0, 0, 0, 1],
       [1, 0, 0, 0, 1],
       [1, 0, 0, 0, 1],
       [1, 0, 0, 0, 1],
       [1, 0, 0, 0, 1],
       [1, 0, 0, 0, 1],
       [1, 0, 0, 0, 1],
       [1, 0, 0,

Next we need the design matrix for the interaction terms:

In [5]:
# Series giving sub-group labels for each row.
sub_groups = gender.str.cat(diet, sep='-')
# Corresponding design matrix modeling mean for each group.
# dtype=int gives 0 / 1 columns; recent Pandas versions default to True / False.
X2 = pd.get_dummies(sub_groups, dtype=int)
X2

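|     | Female-A | Female-B | Female-C | Male-A | Male-B | Male-C |
|-----|----------|----------|----------|--------|--------|--------|
| 0   | 1        | 0        | 0        | 0      | 0      | 0      |
| 1   | 1        | 0        | 0        | 0      | 0      | 0      |
| 2   | 1        | 0        | 0        | 0      | 0      | 0      |
| 3   | 1        | 0        | 0        | 0      | 0      | 0      |
| 4   | 1        | 0        | 0        | 0      | 0      | 0      |
| ... | ...      | ...      | ...      | ...    | ...    | ...    |
| 71  | 0        | 0        | 0        | 0      | 0      | 1      |
| 72  | 0        | 0        | 0        | 0      | 0      | 1      |
| 73  | 0        | 0        | 0        | 0      | 0      | 1      |
| 74  | 0        | 0        | 0        | 0      | 0      | 1      |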


Full design:

In [6]:
X = np.hstack((X1, X2))
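
Stacking the main effect and sub-group columns gives a design with more columns than independent directions. The sketch below checks this with `np.linalg.matrix_rank` on made-up labels with the same two-by-three layout; the `_toy` names are ours and this is not the diet data.

```python
import numpy as np
import pandas as pd

# Hypothetical unbalanced toy labels (not the diet data).
gender_toy = pd.Series(['Female'] * 7 + ['Male'] * 5)
diet_toy = pd.Series(list('AABBCCC') + list('ABBCC'))
cells_toy = gender_toy.str.cat(diet_toy, sep='-')

# Main effect, sub-group and full designs, built in the same way as X1, X2, X.
X1_toy = np.hstack((pd.get_dummies(gender_toy, dtype=float).to_numpy(),
                    pd.get_dummies(diet_toy, dtype=float).to_numpy()))
X2_toy = pd.get_dummies(cells_toy, dtype=float).to_numpy()
X_toy = np.hstack((X1_toy, X2_toy))

rank_main = np.linalg.matrix_rank(X1_toy)
rank_cells = np.linalg.matrix_rank(X2_toy)
rank_full = np.linalg.matrix_rank(X_toy)
print(X_toy.shape, rank_main, rank_cells, rank_full)
```

In this toy layout the full design has 11 columns but the same rank as the sub-group design alone, because every main effect column is a sum of sub-group columns.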

The data we are fitting:

In [7]:
y = diet_data['weight_lost']

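Least-squares parameters for the full design.  The columns of `X` are linearly dependent, so there is no unique solution; the pseudoinverse picks one parameter vector that minimizes the sum of squared errors: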

In [8]:
pX = np.linalg.pinv(X)
B = pX @ y
B

array([ 1.90388709,  2.01770743,  0.92613516,  0.93154642,  2.06391294,
        0.21997775, -0.22829064,  1.91219998,  0.70615741,  1.15983706,
        0.15171296])
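
Because the design is rank deficient, many parameter vectors give exactly the same fit. The sketch below illustrates this on made-up data (the `_toy` names and the random `y_toy` values are assumptions, not the diet data): adding a vector `v` from the null space of the design changes the parameters but not the fitted values, and the pseudoinverse solution is the one with the smaller norm.

```python
import numpy as np
import pandas as pd

# Hypothetical unbalanced toy data (not the diet data).
rng = np.random.default_rng(0)
gender_toy = pd.Series(['Female'] * 7 + ['Male'] * 5)
diet_toy = pd.Series(list('AABBCCC') + list('ABBCC'))
cells_toy = gender_toy.str.cat(diet_toy, sep='-')
y_toy = rng.normal(3, 2, size=len(gender_toy))

# Columns: Female, Male, A, B, C, then the six sub-group columns.
X_toy = np.hstack([pd.get_dummies(s, dtype=float).to_numpy()
                   for s in (gender_toy, diet_toy, cells_toy)])
B_toy = np.linalg.pinv(X_toy) @ y_toy

# Add 1 to both gender parameters, subtract 1 from the three diet parameters.
v = np.array([1, 1, -1, -1, -1, 0, 0, 0, 0, 0, 0], dtype=float)
B_other = B_toy + v

in_null_space = np.allclose(X_toy @ v, 0)
same_fit = np.allclose(X_toy @ B_toy, X_toy @ B_other)
pinv_is_shorter = np.linalg.norm(B_toy) < np.linalg.norm(B_other)
print(in_null_space, same_fit, pinv_is_shorter)
```

This is why the individual values in `B` should not be read as effect estimates on their own; only the fitted values and errors are uniquely determined.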

Fitted y values:

In [9]:
y_hat = X @ B
y_hat

array([3.05      , 3.05      , 3.05      , 3.05      , 3.05      ,
       3.05      , 3.05      , 3.05      , 3.05      , 3.05      ,
       3.05      , 3.05      , 3.05      , 3.05      , 2.60714286,
       2.60714286, 2.60714286, 2.60714286, 2.60714286, 2.60714286,
       2.60714286, 2.60714286, 2.60714286, 2.60714286, 2.60714286,
       2.60714286, 2.60714286, 2.60714286, 5.88      , 5.88      ,
       5.88      , 5.88      , 5.88      , 5.88      , 5.88      ,
       5.88      , 5.88      , 5.88      , 5.88      , 5.88      ,
       5.88      , 5.88      , 5.88      , 3.65      , 3.65      ,
       3.65      , 3.65      , 3.65      , 3.65      , 3.65      ,
       3.65      , 3.65      , 3.65      , 4.10909091, 4.10909091,
       4.10909091, 4.10909091, 4.10909091, 4.10909091, 4.10909091,
       4.10909091, 4.10909091, 4.10909091, 4.10909091, 4.23333333,
       4.23333333, 4.23333333, 4.23333333, 4.23333333, 4.23333333,
       4.23333333, 4.23333333, 4.23333333, 4.23333333, 4.23333
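
The fitted values repeat within each sub-group because the full design can do no better than fit each sub-group's mean. A minimal check of that claim on made-up data (the `toy` frame and its random `weight_lost` values are assumptions, not the diet data):

```python
import numpy as np
import pandas as pd

# Hypothetical unbalanced toy data (not the diet data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'gender': ['Female'] * 7 + ['Male'] * 5,
    'diet': list('AABBCCC') + list('ABBCC'),
})
toy['weight_lost'] = rng.normal(3, 2, size=len(toy))

# Full design: main effect columns followed by sub-group columns.
cells = toy['gender'].str.cat(toy['diet'], sep='-')
X_toy = np.hstack([pd.get_dummies(s, dtype=float).to_numpy()
                   for s in (toy['gender'], toy['diet'], cells)])
y_toy = toy['weight_lost'].to_numpy()
y_hat_toy = X_toy @ np.linalg.pinv(X_toy) @ y_toy

# Mean of each row's own sub-group, repeated for every row in that sub-group.
cell_means = (toy.groupby(['gender', 'diet'])['weight_lost']
              .transform('mean').to_numpy())
fit_is_cell_mean = np.allclose(y_hat_toy, cell_means)
print(fit_is_cell_mean)
```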

Errors:

In [10]:
e = y - y_hat
e

0     0.750000
1     2.950000
2    -2.350000
3    -0.150000
4    -0.250000
        ...   
71   -1.433333
72   -0.133333
73    1.066667
74    4.966667
75    1.866667
Name: weight_lost, Length: 76, dtype: float64

Sum of squared errors:

In [11]:
np.sum(e ** 2)

376.3290432900432

Compare this to the residual sum of squares in the Statsmodels ANOVA table:

In [12]:
# Fit
sm_fit = smf.ols('weight_lost ~ gender * diet', data=diet_data).fit()
# Type II (2) sum of squares calculation ANOVA table.
sm.stats.anova_lm(sm_fit, typ=2)

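|             | sum_sq     | df   | F        | PR(>F)   |
|-------------|------------|------|----------|----------|
| gender      | 0.168696   | 1.0  | 0.031379 | 0.85991  |
| diet        | 60.41722   | 2.0  | 5.619026 | 0.005456 |
| gender:diet | 33.904068  | 2.0  | 3.153204 | 0.048842 |
| Residual    | 376.329043 | 70.0 |          |          |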
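
As a check on how the columns of the table relate, each F value is the effect's mean square (sum of squares divided by degrees of freedom) divided by the residual mean square. The arithmetic below uses only numbers copied from the table, plus the 76 rows and six sub-groups seen earlier; the variable names are ours.

```python
# Values copied from the ANOVA table.
ss_inter, df_inter = 33.904068, 2
ss_resid = 376.329043

# Residual degrees of freedom: number of rows minus number of sub-group means.
df_resid = 76 - 6

ms_inter = ss_inter / df_inter
ms_resid = ss_resid / df_resid
F_inter = ms_inter / ms_resid
print(df_resid, round(F_inter, 4))
```

The result agrees with the `gender:diet` row of the table to the precision shown.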


We can write that all in one go:

In [13]:
e_again = y - X @ pX @ y
np.sum(e_again ** 2)

376.3290432900432

That is also the same calculation as:

In [14]:
H = X @ pX
y_hat_again = H @ y
e_3 = (y - y_hat_again)
np.sum(e_3 ** 2)

376.3290432900432

The $H$ matrix above is called the *hat* matrix, because it "puts the hat on"
the y values, meaning, it calculates the best fit predictions for the y values.
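
A hat matrix built this way is a projection onto the column space of the design, which gives it some properties that are easy to check numerically. The sketch below does so on a made-up design (the `_toy` names are ours, not the diet data): the matrix is symmetric, applying it twice is the same as applying it once, and its trace equals the rank of the design.

```python
import numpy as np
import pandas as pd

# Hypothetical unbalanced toy labels (not the diet data).
gender_toy = pd.Series(['Female'] * 7 + ['Male'] * 5)
diet_toy = pd.Series(list('AABBCCC') + list('ABBCC'))
cells_toy = gender_toy.str.cat(diet_toy, sep='-')
X_toy = np.hstack([pd.get_dummies(s, dtype=float).to_numpy()
                   for s in (gender_toy, diet_toy, cells_toy)])

H_toy = X_toy @ np.linalg.pinv(X_toy)

is_symmetric = np.allclose(H_toy, H_toy.T)
is_idempotent = np.allclose(H_toy @ H_toy, H_toy)
trace_is_rank = np.isclose(np.trace(H_toy), np.linalg.matrix_rank(X_toy))
print(is_symmetric, is_idempotent, trace_is_rank)
```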

In fact, using matrix algebra, we can rewrite the calculation in the `e_again` cell above like this:

In [15]:
n = len(y)
# Because:
# y - X @ pX @ y  === y - H @ y === (I - H) @ y
rfm = np.eye(n) - H
e_4 = rfm @ y
np.sum(e_4 ** 2)

376.32904329004316

`rfm` above is the *residual-forming matrix*, so called because, as you see, it packs the whole calculation that turns the `y` values into the errors (residuals) into one matrix multiplication.  Multiplying by this matrix removes everything in `y` that the design matrix `X` can model, and leaves only the part it cannot.
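
The sketch below checks these statements about the residual-forming matrix on made-up data (the `_toy` names and the random `y_toy` values are assumptions, not the diet data). It confirms that the matrix sends every column of the design to zero, that applying it twice changes nothing further, and that the sum of squared errors can be written as the single expression `y @ R @ y`.

```python
import numpy as np
import pandas as pd

# Hypothetical unbalanced toy data (not the diet data).
rng = np.random.default_rng(0)
gender_toy = pd.Series(['Female'] * 7 + ['Male'] * 5)
diet_toy = pd.Series(list('AABBCCC') + list('ABBCC'))
cells_toy = gender_toy.str.cat(diet_toy, sep='-')
X_toy = np.hstack([pd.get_dummies(s, dtype=float).to_numpy()
                   for s in (gender_toy, diet_toy, cells_toy)])
n_toy = len(gender_toy)
y_toy = rng.normal(3, 2, size=n_toy)

# Residual-forming matrix for the toy design.
R_toy = np.eye(n_toy) - X_toy @ np.linalg.pinv(X_toy)
e_toy = R_toy @ y_toy

removes_design = np.allclose(R_toy @ X_toy, 0)
is_idempotent = np.allclose(R_toy @ R_toy, R_toy)
sse_as_quadratic_form = np.isclose(y_toy @ R_toy @ y_toy, np.sum(e_toy ** 2))
print(removes_design, is_idempotent, sse_as_quadratic_form)
```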