# Modeling groups with dummy variables

## Introduction and definitions

In [1]:
#: Import numerical and plotting libraries
import numpy as np
# Print to four digits of precision
np.set_printoptions(precision=4, suppress=True)
import numpy.linalg as npl
import matplotlib.pyplot as plt

We return to the psychopathy of students from Berkeley and MIT.

We get psychopathy questionnaire scores from another set of 5 students from
Berkeley:

In [2]:
#: Psychopathy scores from UCB students
ucb_psycho = np.array([2.9277, 9.7348, 12.1932, 12.2576, 5.4834])
n_ucb = len(ucb_psycho)

We do the same for another set of 5 students from MIT:

In [3]:
#: Psychopathy scores from MIT students
mit_psycho = np.array([7.2937, 11.1465, 13.5204, 15.053, 12.6863])
n_mit = len(mit_psycho)

Concatenate these into a `psychopathy` vector:

In [4]:
#: Concatenate UCB and MIT student scores
psychopathy = np.concatenate((ucb_psycho, mit_psycho))

$\newcommand{\yvec}{\vec{y}}$

The `psychopathy` values will be our `y` vector $\yvec$.

In [5]:
# Give name Y to psychopathy, for reading convenience.
Y = psychopathy
# Call the number of observations "n"
n = len(psychopathy)
n

We will use the general linear model to run a two-level (UCB, MIT)
single-factor (university) analysis of variance on these data.

Our model is that the Berkeley student data are drawn from some distribution
with a mean value that is characteristic for Berkeley: $y_i = \mu_{Berkeley} +
e_i$, where $i$ corresponds to a student from Berkeley.  There is also a
characteristic but possibly different mean value for MIT, $\mu_{MIT}$:

$$
\newcommand{\xvec}{\vec{x}}
\newcommand{\evec}{\vec{\varepsilon}}
\newcommand{\Xmat}{\boldsymbol X}
\newcommand{\bvec}{\vec{\beta}}
\newcommand{\bhat}{\hat{\bvec}}
\newcommand{\yhat}{\hat{\yvec}}
$$

$$
y_i = \mu_{Berkeley} + e_i  \space\mbox{if}\space 1 \le i \le 5 \\
y_i = \mu_{MIT} + e_i \space\mbox{if}\space 6 \le i \le 10
$$
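To make the model concrete, here is a minimal simulation sketch.  The group
means, the error standard deviation, and the choice of normally distributed
errors are all assumptions made up for illustration; the model above only says
that the errors come from "some distribution".

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up values for illustration only; these are not estimates from the data.
mu_berkeley = 8.0
mu_mit = 12.0
error_sd = 3.0

# Each simulated score is its group mean plus a random error e_i.
# Normal errors are an assumption of this sketch.
sim_ucb = mu_berkeley + rng.normal(0, error_sd, size=5)
sim_mit = mu_mit + rng.normal(0, error_sd, size=5)

# Observations 1 to 5 are Berkeley, 6 to 10 are MIT, as in the model.
sim_scores = np.concatenate((sim_ucb, sim_mit))
print(sim_scores)
```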

We saw in [introduction to the general linear
model](https://textbook.nipraxis.org/glm_intro.html) that we can encode this
group membership with dummy variables.  There is one dummy variable for each
group.  The dummy variables are *indicator* variables, in that they have 1 in
the rows corresponding to observations in the group, and 0 elsewhere.
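As a small sketch of what "indicator" means, we can build one dummy variable
per group from a vector of group labels.  The `labels` array and the
`ucb_dummy` / `mit_dummy` names are hypothetical, introduced only for this
example; the ordering (5 UCB students, then 5 MIT students) follows the data
above.

```python
import numpy as np

# Hypothetical group labels: the first 5 students are from UCB, the last 5
# from MIT, matching the ordering of the scores on this page.
labels = np.array(['UCB'] * 5 + ['MIT'] * 5)

# One indicator (dummy) variable per group: 1 where the label matches the
# group, 0 elsewhere.
ucb_dummy = (labels == 'UCB').astype(float)
mit_dummy = (labels == 'MIT').astype(float)

# Stack the two indicator variables as columns.
dummies = np.column_stack((ucb_dummy, mit_dummy))
print(dummies)
```

Each row has exactly one 1, because each student belongs to exactly one group.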

We will compile a design matrix $\Xmat$ and use the matrix formulation of the
general linear model to do estimation and testing:

$$
\yvec = \Xmat \bvec + \evec
$$
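Written out in full for our 10 students, and assuming the rows are ordered
with the 5 Berkeley students first and the 5 MIT students second, the matrix
equation looks like this, where the parameter vector holds the two group
means:

$$
\begin{bmatrix} y_1 \\ \vdots \\ y_5 \\ y_6 \\ \vdots \\ y_{10} \end{bmatrix}
=
\begin{bmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} \mu_{Berkeley} \\ \mu_{MIT} \end{bmatrix}
+
\begin{bmatrix} e_1 \\ \vdots \\ e_5 \\ e_6 \\ \vdots \\ e_{10} \end{bmatrix}
$$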

## ANOVA design

Create the design matrix for this ANOVA, with dummy variables corresponding to
the UCB and MIT student groups:

In [6]:
#- Create design matrix for UCB / MIT ANOVA
X = np.zeros((n, 2))
X[:n_ucb, 0] = 1  # UCB indicator
X[n_ucb:, 1] = 1  # MIT indicator
# Show the result
X

In [7]:
assert X.shape == (n, 2)
assert np.all(np.sum(X, axis=0) == (n_ucb, n_mit))

Remember that, when $\Xmat^T \Xmat$ is invertible, our least-squares parameter
estimates $\bhat$ are given by:

$$
\bhat = (\Xmat^T \Xmat)^{-1} \Xmat^T \yvec
$$
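As a cross-check on this formula, here is a sketch that computes the estimates
two ways: directly from the formula, and with NumPy's least-squares solver
`numpy.linalg.lstsq`.  The names `design` and `scores` are hypothetical,
chosen for this example; the values are the UCB and MIT scores from above.

```python
import numpy as np
import numpy.linalg as npl

# Scores copied from this page: 5 UCB students, then 5 MIT students.
scores = np.array([2.9277, 9.7348, 12.1932, 12.2576, 5.4834,
                   7.2937, 11.1465, 13.5204, 15.053, 12.6863])

# Two-column indicator design: column 0 for UCB, column 1 for MIT.
design = np.zeros((10, 2))
design[:5, 0] = 1
design[5:, 1] = 1

# The least-squares formula, written out directly.
b_formula = npl.inv(design.T @ design) @ design.T @ scores

# The same estimates from NumPy's least-squares solver.
b_lstsq, _, _, _ = npl.lstsq(design, scores, rcond=None)

print(b_formula)
print(b_lstsq)
```

The two results should agree to within floating point error.  In practice a
solver such as `lstsq` is often preferred to forming the inverse explicitly,
but the explicit formula is easier to follow step by step.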

First calculate $\Xmat^T \Xmat$.

In [8]:
#- Multiply the transpose of the design by the design itself.
#- Are the design columns orthogonal?
XtX = X.T @ X
# Show the result
XtX

In [9]:
assert XtX.shape == (X.shape[1], X.shape[1])
assert np.allclose(XtX, [[n_ucb, 0], [0, n_mit]])

## For reflection

* Are the columns of this `X` design orthogonal?
* How did we know what the values would be in the test above?

## Estimation

Calculate the matrix inverse of $\Xmat^T \Xmat$.

In [10]:
#- Calculate the inverse of (transpose of the design times the design).
iXtX = npl.inv(XtX)
# Show the result
iXtX

In [11]:
assert iXtX.shape == (2, 2)
# Compare with a tolerance; exact equality on floating point values is fragile.
assert np.allclose(iXtX, [[1 / n_ucb, 0], [0, 1 / n_mit]])

**For reflection**: How did we know what the values would be in the test above?  Maybe think for a bit, and then see [diagonal matrix inverse](https://matthew-brett.github.com/teaching/diag_inverse.html).
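Once you have thought about it, you might check your answer numerically.  This
is a small sketch using an arbitrary diagonal matrix; the values 2 and 4 are
made up for illustration and are not from the data.

```python
import numpy as np
import numpy.linalg as npl

# An arbitrary diagonal matrix with no zeros on the diagonal.
D = np.diag([2.0, 4.0])

# Candidate inverse: a diagonal matrix holding the reciprocals of the
# diagonal values of D.
D_inv_by_hand = np.diag(1 / np.diag(D))

# Compare with the inverse that NumPy calculates.
print(npl.inv(D))
print(D_inv_by_hand)
```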

Now continue to calculate the betas.  As you remember, these are given by:

$$
\bhat = (\Xmat^T \Xmat)^{-1}\Xmat^T \yvec
$$

In [12]:
#- Calculate transpose of design matrix multiplied by data, and therefore
#- calculate beta vector
B = iXtX @ X.T @ Y
# Show the result
B

In [13]:
assert B.shape == (2,)

Compare this vector to the means of the values in `ucb_psycho` and
`mit_psycho`:

In [14]:
#- Compare beta vector to means of each group
both_means = [np.mean(ucb_psycho), np.mean(mit_psycho)]
# Show the result
both_means

In [15]:
assert np.allclose(B, both_means)
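To see why the parameter estimates match the group means, it can help to look
at the two pieces of the formula separately.  In this sketch, the names
`group_sums` and `group_counts` are hypothetical, introduced for illustration;
the data are the scores from above.

```python
import numpy as np

# Scores copied from this page.
ucb = np.array([2.9277, 9.7348, 12.1932, 12.2576, 5.4834])
mit = np.array([7.2937, 11.1465, 13.5204, 15.053, 12.6863])
y = np.concatenate((ucb, mit))

# Indicator design: column 0 for UCB, column 1 for MIT.
X = np.zeros((len(y), 2))
X[:len(ucb), 0] = 1
X[len(ucb):, 1] = 1

# X.T @ y adds up the scores within each group.
group_sums = X.T @ y

# The diagonal of X.T @ X counts the observations in each group.
group_counts = np.diag(X.T @ X)

# Multiplying by the inverse of the diagonal X.T @ X divides each sum by
# its count, giving the group means.
means_from_design = group_sums / group_counts
print(means_from_design)
```

This only works so simply because the indicator columns do not overlap, so
$\Xmat^T \Xmat$ is diagonal.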