In [None]:
# initialization cell
import numpy as np

### Matrix multiplication

Let
$$
A = \begin{pmatrix}
a_{11} & \dots & a_{1k} \\
\vdots & \ddots & \vdots \\
a_{m1} & \dots & a_{mk}
\end{pmatrix}
,  \qquad
B = \begin{pmatrix}
b_{11} & \dots & b_{1n} \\
\vdots & \ddots & \vdots \\
b_{k1} & \dots & b_{kn}
\end{pmatrix}
$$

Then,
$$
(AB)_{ij} = \sum_{l=1}^k a_{il}b_{lj}.
$$
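
As a quick check of the entry-wise formula, the sketch below computes $(AB)_{ij}$ with an explicit sum and compares it with `A.dot(B)`. The matrices `A` and `B` are small hypothetical examples chosen for illustration; they do not appear elsewhere in these notes.

```python
import numpy as np

# Hypothetical example matrices: A is m x k, B is k x n
A = np.array([[1., 2., 3.],
              [4., 5., 6.]])   # 2 x 3
B = np.array([[1., 0.],
              [0., 1.],
              [2., 3.]])       # 3 x 2

m, k = A.shape
n = B.shape[1]

# Entry-by-entry definition: (AB)_{ij} = sum_l a_{il} b_{lj}
AB_loop = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        AB_loop[i, j] = sum(A[i, l] * B[l, j] for l in range(k))

# NumPy's built-in matrix product
AB = A.dot(B)
print(AB)
print(np.allclose(AB, AB_loop))
```

The product is only defined when the number of columns of $A$ equals the number of rows of $B$; the result has shape $m \times n$.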

### Transpose

$$
A^T = \begin{pmatrix}
a_{11} & \dots & a_{m1} \\
\vdots & \ddots & \vdots \\
a_{1k} & \dots & a_{mk}
\end{pmatrix}
$$
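
In NumPy the transpose is available as the `.T` attribute. The following minimal sketch uses a hypothetical $2 \times 3$ matrix to show that the shape flips and that $(A^T)_{ij} = a_{ji}$.

```python
import numpy as np

# Hypothetical 2 x 3 matrix, for illustration only
A = np.array([[1., 2., 3.],
              [4., 5., 6.]])
At = A.T

# Rows and columns swap: (2, 3) becomes (3, 2)
print(A.shape, At.shape)

# Entry (i, j) of the transpose is entry (j, i) of the original
print(At[2, 0] == A[0, 2])
```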

## Design matrices

Important examples of matrices for us are design matrices such as
$$
D = \begin{pmatrix}
1 & X_1 \\
\vdots & \vdots \\
1 & X_n
\end{pmatrix}
$$

(*Note:* later on, the design matrix will be called $X$)

In [3]:
# Hooke's law data
X = np.array([0., 2., 4., 6., 8., 10.])
Y = np.array([439.00, 439.12, 439.21, 439.31, 439.40, 439.50])

# design matrix
D = np.vstack([np.ones_like(X), X]).T
D

array([[  1.,   0.],
       [  1.,   2.],
       [  1.,   4.],
       [  1.,   6.],
       [  1.,   8.],
       [  1.,  10.]])

### Matrix-vector multiplication

In [4]:
beta = np.array([1., 2])
print(D.dot(beta))

# An alternative way
print(beta[0] * D[:,0] + beta[1] * D[:,1])

[  1.   5.   9.  13.  17.  21.]
[  1.   5.   9.  13.  17.  21.]


### Inner (dot) product

$$
\langle v, u \rangle = v^Tu = \sum_{i=1}^n v_i u_i
$$

### Euclidean length

$$
\|v\|^2 = \langle v, v \rangle = \sum_{i=1}^n v_i^2
$$

(*Note:* sometimes we write $\|v\|^2_2$ to emphasize that it is Euclidean length
when there are other lengths we might consider.)

In [5]:
# dot product
print(X.dot(Y), (X*Y).sum())

# length squared
print((Y**2).sum(), np.linalg.norm(Y)**2)

(13181.139999999999, 13181.139999999999)
(1157678.6845999998, 1157678.6845999998)


### SSE in matrix form

Let 
$$
\beta = \begin{pmatrix}
\beta_0 \\
\beta_1
\end{pmatrix}.
$$
Then,
$$
D\beta = \beta_0 \cdot 1 + \beta_1 \cdot X
$$
and
$$
\begin{aligned}
SSE(\beta) &= \|Y - D\beta\|^2 \\
&= \|Y\|^2 - 2 Y^T(D\beta) + \beta^T(D^TD)\beta
\end{aligned}
$$
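
The expansion of $\|Y - D\beta\|^2$ can be checked numerically. The sketch below reuses the Hooke's law data and evaluates both forms at an arbitrary trial value of $\beta$ (the value `[439., 0.05]` is an assumption made only for illustration).

```python
import numpy as np

# Hooke's law data and design matrix, as above
X = np.array([0., 2., 4., 6., 8., 10.])
Y = np.array([439.00, 439.12, 439.21, 439.31, 439.40, 439.50])
D = np.vstack([np.ones_like(X), X]).T

# Arbitrary trial coefficients (illustrative, not the least squares fit)
beta = np.array([439., 0.05])

# Direct form: squared length of the residual vector
resid = Y - D.dot(beta)
sse_direct = resid.dot(resid)

# Expanded form: ||Y||^2 - 2 Y^T (D beta) + beta^T (D^T D) beta
sse_expanded = Y.dot(Y) - 2 * Y.dot(D.dot(beta)) + beta.dot(D.T.dot(D)).dot(beta)

print(sse_direct, sse_expanded)
```

The two numbers agree only up to rounding error: the expanded form subtracts quantities of order $10^6$ to obtain a result of order $10^{-3}$, so it loses several digits of precision compared with the direct form.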

In [6]:
S = D.T.dot(D)
S.shape

(2, 2)

## Least squares estimators in matrix form

### Normal equations

$$
\begin{aligned}
-\frac{1}{2} \frac{\partial}{\partial \beta} SSE(\beta) = D^TY - (D^TD)\beta
\end{aligned}
$$

At $\hat{\beta}$:
$$
0 = -\frac{1}{2} \frac{\partial}{\partial \beta} SSE(\beta) \biggr|_{\beta=\hat{\beta}} = D^TY - (D^TD)\hat{\beta}.
$$

Or,
$$
\hat{\beta} = (D^TD)^{-1}D^TY.
$$
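
A minimal sketch of the normal equations on the Hooke's law data: it solves $(D^TD)\beta = D^TY$ with `np.linalg.solve` and confirms that $D^TY - (D^TD)\hat{\beta}$ vanishes up to rounding. Using `np.linalg.solve` rather than an explicit inverse is a choice made here for illustration.

```python
import numpy as np

# Hooke's law data and design matrix, as above
X = np.array([0., 2., 4., 6., 8., 10.])
Y = np.array([439.00, 439.12, 439.21, 439.31, 439.40, 439.50])
D = np.vstack([np.ones_like(X), X]).T

# Solve the linear system (D^T D) beta = D^T Y
beta_hat = np.linalg.solve(D.T.dot(D), D.T.dot(Y))

# What is left of the normal equations at beta_hat; should be ~0
score = D.T.dot(Y) - D.T.dot(D).dot(beta_hat)

print(beta_hat)
print(score)
```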

### Inverse

$$
AA^{-1} = I = \text{diag}(1,\dots, 1).
$$

In [7]:
Sinv = np.linalg.inv(S)
S.dot(Sinv)

array([[  1.00000000e+00,   0.00000000e+00],
       [ -1.77635684e-15,   1.00000000e+00]])

### Solving the normal equations

We see
$$
\hat{\beta} = (D^TD)^{-1}D^TY,
$$
where
$$
D^TD = \begin{pmatrix} n & \sum_{i=1}^n X_i \\
\sum_{i=1}^n X_i & \sum_{i=1}^n X_i^2
\end{pmatrix}
$$
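
The entries of $D^TD$ can be confirmed directly on the Hooke's law data. In this sketch `S_formula` is a name introduced only for illustration.

```python
import numpy as np

# Hooke's law covariate and design matrix, as above
X = np.array([0., 2., 4., 6., 8., 10.])
D = np.vstack([np.ones_like(X), X]).T
n = len(X)

# Build D^T D from the closed-form entries
S_formula = np.array([[n,       X.sum()],
                      [X.sum(), (X**2).sum()]])

print(D.T.dot(D))
print(np.allclose(D.T.dot(D), S_formula))
```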

In [8]:
Sinv.dot(D.T.dot(Y))

array([  4.39010952e+02,   4.91428571e-02])

Compare to:

In [9]:
import statsmodels.api as sm
design = sm.add_constant(X)
sm.OLS(Y,design).fit().params

array([  4.39010952e+02,   4.91428571e-02])
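
As a further cross-check, the same coefficients can be obtained without forming $(D^TD)^{-1}$ at all. The sketch below compares the explicit-inverse formula with `np.linalg.lstsq`, which solves the least squares problem directly; the variable names are illustrative.

```python
import numpy as np

# Hooke's law data and design matrix, as above
X = np.array([0., 2., 4., 6., 8., 10.])
Y = np.array([439.00, 439.12, 439.21, 439.31, 439.40, 439.50])
D = np.vstack([np.ones_like(X), X]).T

# Explicit inverse, as in the formula for beta-hat
beta_inv = np.linalg.inv(D.T.dot(D)).dot(D.T.dot(Y))

# Least squares solver; also returns the residual sum of squares,
# the rank of D, and the singular values of D
beta_lstsq, _, rank, sing_vals = np.linalg.lstsq(D, Y, rcond=None)

print(beta_inv)
print(beta_lstsq)
print(rank)
```

For a small, well-conditioned problem like this one the two approaches agree; for nearly collinear design matrices, avoiding the explicit inverse is generally the numerically safer option.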

## SVD 

Any $n \times p$ matrix $X$ of rank $k \leq \min(n,p)$ can be written in terms of its [SVD](https://en.wikipedia.org/wiki/Singular_value_decomposition) as
$$
X_{n \times p} = U_{n \times k} D_{k \times k} V^T_{k \times p}
$$
where
$$
U^TU = V^TV = I_{k \times k}
$$
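and $D = \text{diag}(D_1, \dots, D_k)$ with $D_1 \geq \dots \geq D_k > 0$ is the diagonal matrix of singular values (not to be confused with the design matrix $D$ used above and in the code below).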

In [10]:
# note: the third value returned by np.linalg.svd is V^T, not V
u, d, v = np.linalg.svd(D, full_matrices=False)
u.shape, d, v.shape

((6, 2), array([ 14.97084014,   1.36892127]), (2, 2))

In [11]:
u.T.dot(u), v.T.dot(v)

(array([[  1.00000000e+00,   5.55111512e-17],
        [  5.55111512e-17,   1.00000000e+00]]), array([[ 1.,  0.],
        [ 0.,  1.]]))
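
The factorization itself can also be checked by rebuilding the matrix from its three factors. This is a sketch on the Hooke's law design matrix; the third return value of `np.linalg.svd` is $V^T$, so it is named `vt` here (a name chosen for this example).

```python
import numpy as np

# Hooke's law covariate and design matrix, as above
X = np.array([0., 2., 4., 6., 8., 10.])
D = np.vstack([np.ones_like(X), X]).T

# Thin SVD: u is 6 x 2, d holds the singular values, vt is V^T (2 x 2)
u, d, vt = np.linalg.svd(D, full_matrices=False)

# Rebuild the design matrix as U diag(d) V^T
D_rebuilt = u.dot(np.diag(d)).dot(vt)
print(np.allclose(D, D_rebuilt))

# Squared singular values are the eigenvalues of D^T D
# (eigvalsh returns them in ascending order, so reverse)
eigs = np.linalg.eigvalsh(D.T.dot(D))[::-1]
print(d**2, eigs)
```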

### Projections

The matrices $P_C = UU^T$ and $P_R=VV^T$ are
- symmetric $P=P^T$
- idempotent $P^2=P$

These are orthogonal projection matrices onto $\text{col}(X)$ (for $P_C$) and $\text{row}(X)$ (for $P_R$), respectively.

In [12]:
P_C = u.dot(u.T)
np.linalg.norm(P_C - P_C.T), np.linalg.norm(P_C.dot(P_C) - P_C)

(0.0, 3.9582117297575385e-16)

When $X$ has full column rank (so that $X^TX$ is invertible), the matrix $P_C$ can also be written as
$$
P_C = X(X^TX)^{-1}X^T.
$$

As 
$$
\hat{\beta} = (X^TX)^{-1}X^TY
$$
we see that the vector of fitted values satisfies
$$
X\hat{\beta}= X(X^TX)^{-1}X^TY = P_CY.
$$
That is, the model finds fitted values by projecting $Y$ onto $\text{col}(X)$.
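
A minimal numerical sketch of this statement on the Hooke's law data. In the code the design matrix is still stored in the variable `D` (the array `X` holds the covariate), so `D` plays the role of $X$ in the formulas of this section.

```python
import numpy as np

# Hooke's law data; D is the design matrix (called X in the formulas)
X = np.array([0., 2., 4., 6., 8., 10.])
Y = np.array([439.00, 439.12, 439.21, 439.31, 439.40, 439.50])
D = np.vstack([np.ones_like(X), X]).T

# Projection onto the column space, built from the thin SVD
u, d, vt = np.linalg.svd(D, full_matrices=False)
P_C = u.dot(u.T)

# The same projection from the formula X (X^T X)^{-1} X^T
P_formula = D.dot(np.linalg.inv(D.T.dot(D))).dot(D.T)
print(np.allclose(P_C, P_formula))

# Fitted values: design matrix times beta-hat equals P_C Y
beta_hat = np.linalg.solve(D.T.dot(D), D.T.dot(Y))
fitted = D.dot(beta_hat)
print(np.allclose(P_C.dot(Y), fitted))
```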

The vector of residuals satisfies
$$
Y- X\hat{\beta} = Y - P_CY = (I - P_C)Y.
$$
The matrix $(I-P_C)$ is also an orthogonal projection matrix (i.e. it is symmetric and idempotent). It corresponds
to projection onto the orthogonal complement of $\text{col}(X)$.

In [13]:
R_C = np.identity(P_C.shape[0]) - P_C
print(np.linalg.norm(R_C - R_C.T), np.linalg.norm(R_C.dot(R_C) - R_C))
print(np.linalg.norm(R_C.dot(P_C)))

(0.0, 4.7479479777112764e-16)
4.66869692243e-16
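
One consequence of the two projections being orthogonal is that the residual vector is orthogonal to every column of the design matrix, and that $\|Y\|^2$ splits into a fitted part and a residual part. The sketch below checks both on the Hooke's law data; as before, the design matrix is the variable `D` in the code.

```python
import numpy as np

# Hooke's law data; D is the design matrix (called X in the formulas)
X = np.array([0., 2., 4., 6., 8., 10.])
Y = np.array([439.00, 439.12, 439.21, 439.31, 439.40, 439.50])
D = np.vstack([np.ones_like(X), X]).T

# Projection onto the column space and its complement
u, _, _ = np.linalg.svd(D, full_matrices=False)
P_C = u.dot(u.T)
R_C = np.identity(P_C.shape[0]) - P_C

fitted = P_C.dot(Y)
resid = R_C.dot(Y)

# Residuals are orthogonal to each column of the design matrix (~0)
print(D.T.dot(resid))

# Pythagoras: ||Y||^2 = ||P_C Y||^2 + ||(I - P_C) Y||^2
print(Y.dot(Y), fitted.dot(fitted) + resid.dot(resid))
```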
