# Maximum variance formulation

consider a dataset $\{x_{i}\}$ where $i=1,...,n$ and $x_{i} \in \mathbb{R}^{d}$.

our goal is to project the data onto a space having dimensionality $k < d$ while maximizing the variance of the projected data. 

to begin with, consider the projection onto a one-dimensional space $(k=1)$.

we can define the direction of this space by a vector $u_{1} \in \mathbb{R}^{d}$. since only the direction matters, we can choose $u_{1}$ to be a unit vector so that $u_{1}^{T}u_{1} = 1$.

each data point $x_{i}$ is then projected onto a scalar value $u_{1}^{T}x_{i}$. the mean of the projected data is:

$$\frac{1}{n}\sum_{i=1}^{n}u_{1}^{T}x_{i} = u_{1}^{T}\overline{x}$$

where $\overline{x} = \frac{1}{n}\sum_{i=1}^{n}x_{i}$ is the sample mean. the variance of the projected data is:

$$\frac{1}{n}\sum_{i=1}^{n}(u_{1}^{T}x_{i} - u_{1}^{T}\overline{x})^{2} = \frac{1}{n}\sum_{i=1}^{n}u_{1}^{T}(x_{i} - \overline{x})(x_{i} - \overline{x})^{T}u_{1} = u_{1}^{T}Su_{1}$$

where $S$ is the data covariance matrix:

$$S = \frac{1}{n}\sum_{i=1}^{n}(x_{i} - \overline{x})(x_{i} - \overline{x})^{T}$$
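as a quick numerical check of the identity above, the following sketch compares the variance of the projected scalars with the quadratic form $u_{1}^{T}Su_{1}$. the synthetic data, the random direction, and all variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
# synthetic data (an assumption for illustration); rows of X are the points x_i
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))

x_bar = X.mean(axis=0)          # sample mean
Xc = X - x_bar                  # centered data
S = Xc.T @ Xc / n               # covariance matrix with the 1/n normalization

# an arbitrary unit direction u_1
u1 = rng.normal(size=d)
u1 /= np.linalg.norm(u1)

proj = X @ u1                                     # scalars u_1^T x_i
var_direct = np.mean((proj - proj.mean()) ** 2)   # variance of the projections
var_quadratic = u1 @ S @ u1                       # u_1^T S u_1

assert np.isclose(var_direct, var_quadratic)
```

note that `S` here uses the $\frac{1}{n}$ normalization from the text, not the $\frac{1}{n-1}$ default of `np.cov`.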

now we can formalize our problem as:

$$\underset{u_{1}}{\min}\ -u_{1}^{T}Su_{1}$$
$$\text{s.t.}\quad u_{1}^{T}u_{1} = 1$$

the Lagrangian of this optimization problem is:

$$L(u_{1}, \lambda_{1}) = -u_{1}^{T}Su_{1} + \lambda_{1}(u_{1}^{T}u_{1} - 1)$$

the primal problem is:

$$\underset{u_{1}}{\min}\ \underset{\lambda_{1}}{\max}\ L(u_{1}, \lambda_{1})$$

the primal problem satisfies the KKT conditions, so it is equivalent to the dual:

$$\underset{\lambda_{1}}{\max}\ \underset{u_{1}}{\min}\ L(u_{1}, \lambda_{1})$$

setting the derivative with respect to $u_{1}$ equal to zero, we have:

$$Su_{1} = \lambda_{1}{u_{1}}$$
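for reference, the intermediate step can be sketched as follows. it relies on the symmetry of $S$, so that the gradient of $u_{1}^{T}Su_{1}$ with respect to $u_{1}$ is $2Su_{1}$:

$$\frac{\partial L}{\partial u_{1}} = -2Su_{1} + 2\lambda_{1}u_{1} = 0 \quad\Longrightarrow\quad Su_{1} = \lambda_{1}u_{1}$$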

which says that $u_{1}$ must be an eigenvector of $S$. if we left-multiply by $u_{1}^{T}$ and make use of $u_{1}^{T}u_{1} = 1$, we get:

$$u_{1}^{T}Su_{1} = \lambda_{1}$$

and so the variance will be a maximum when we set $u_{1}$ equal to the eigenvector having the largest eigenvalue $\lambda_{1}$. this eigenvector is known as the first principal component.
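the claim can be illustrated numerically. the sketch below uses `np.linalg.eigh` (which returns eigenvalues in ascending order for a symmetric matrix) to obtain the leading eigenvector, then compares its projected variance against a sample of random unit directions. the data and the number of random directions are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 300, 4
# synthetic data with different spread per axis (an assumption for illustration)
X = rng.normal(size=(n, d)) * np.array([3.0, 1.0, 0.5, 0.1])
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n

# eigh is meant for symmetric matrices; eigenvalues come back in ascending order
eigvals, eigvecs = np.linalg.eigh(S)
lam1 = eigvals[-1]        # largest eigenvalue
u1 = eigvecs[:, -1]       # first principal component (unit norm)

# projected variance u^T S u for 1000 random unit directions
U = rng.normal(size=(1000, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)
random_vars = np.einsum("ij,jk,ik->i", U, S, U)

# no random direction should beat the leading eigenvector
assert np.all(random_vars <= lam1 + 1e-12)
```

the sign of `u1` is arbitrary: both $u_{1}$ and $-u_{1}$ are unit eigenvectors and give the same projected variance.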

we can define the additional principal components in an incremental fashion by choosing each new direction to be that which maximizes the projected variance amongst all possible directions orthogonal to those already considered.

the second principal component solves:

$$\underset{u_{2}}{\min}\ -u_{2}^{T}Su_{2}$$
$$\text{s.t.}\quad u_{2}^{T}u_{2} = 1,\ u_{1}^{T}u_{2} = 0$$

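as before, using the Lagrangian (with multipliers $\lambda_{2}$ and $\phi$ for the two constraints) and setting the derivative with respect to $u_{2}$ to zero, we derive: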

$$Su_{2} = \lambda_{2}{u_{2}} + \phi{u_{1}}$$
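one way to arrive at this condition is sketched below. the factor $2$ in front of $\phi$ is a scaling choice assumed here so that the result matches the form above; any rescaling of the multiplier leads to the same conclusion.

$$L(u_{2}, \lambda_{2}, \phi) = -u_{2}^{T}Su_{2} + \lambda_{2}(u_{2}^{T}u_{2} - 1) + 2\phi\,u_{1}^{T}u_{2}$$

$$\frac{\partial L}{\partial u_{2}} = -2Su_{2} + 2\lambda_{2}u_{2} + 2\phi u_{1} = 0 \quad\Longrightarrow\quad Su_{2} = \lambda_{2}u_{2} + \phi u_{1}$$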

left-multiplying by $u_{1}^{T}$:

$$u_{1}^{T}Su_{2} = \lambda_{2}u_{1}^{T}{u_{2}} + \phi{u_{1}^{T}}{u_{1}}$$

analyzing each term, using the symmetry of $S$ and $Su_{1} = \lambda_{1}u_{1}$:

$$u_{1}^{T}Su_{2} = u_{2}^{T}Su_{1} = u_{2}^{T}\lambda_{1}u_{1} = \lambda_{1}u_{2}^{T}u_{1} = 0$$
$$u_{1}^{T}{u_{2}} = 0$$
$${u_{1}^{T}}{u_{1}} = 1$$

we get:

$$\phi = 0$$

substituting $\phi = 0$ back into the zero-derivative condition, we have:

$$Su_{2} = \lambda_{2}{u_{2}}$$
$$u_{2}^{T}Su_{2} = \lambda_{2}$$

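so the projected variance equals $\lambda_{2}$, and it is maximized, subject to orthogonality with $u_{1}$, when $u_{2}$ is the eigenvector of $S$ having the second largest eigenvalue $\lambda_{2}$.

by induction, we can show that the $i$-th principal component is the eigenvector of $S$ having the $i$-th largest eigenvalue.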
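putting the result together, a minimal sketch of computing the first $k$ principal components is given below. the helper name `principal_components` and the synthetic data are hypothetical and chosen only for illustration; the eigendecomposition uses `np.linalg.eigh`.

```python
import numpy as np

def principal_components(X, k):
    """Return the k largest eigenvalues of the 1/n covariance of X
    and the matching unit eigenvectors (as columns), in descending order."""
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(S)        # ascending order
    order = np.argsort(eigvals)[::-1][:k]       # indices of the k largest
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(2)
# synthetic data (an assumption for illustration); rows are data points
X = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))

lams, U = principal_components(X, k=3)
Z = (X - X.mean(axis=0)) @ U    # projected data, one column per component

# the components are orthonormal, and each projected variance is an eigenvalue
assert np.allclose(U.T @ U, np.eye(3))
assert np.allclose(Z.var(axis=0), lams)
```

here the $i$-th column of `U` plays the role of $u_{i}$, and the variance of the $i$-th column of `Z` equals $\lambda_{i}$.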

# Properties of non-negative definite symmetric real matrices

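the covariance matrix

$$S = \frac{1}{n}\sum_{i=1}^{n}(x_{i} - \overline{x})(x_{i} - \overline{x})^{T}$$

is of that kind. it is symmetric by construction, and for any $v \in \mathbb{R}^{d}$ we have $v^{T}Sv = \frac{1}{n}\sum_{i=1}^{n}(v^{T}(x_{i} - \overline{x}))^{2} \geq 0$. consequently its eigenvalues are real and non-negative, and its eigenvectors can be chosen to form an orthonormal basis.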
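these properties are easy to check numerically. the sketch below, on synthetic data assumed for illustration, tests symmetry, the sign of the eigenvalues (via `np.linalg.eigvalsh`), and the quadratic form for one random vector $v$.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))     # synthetic data, rows are points (assumption)
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / X.shape[0]

is_symmetric = np.allclose(S, S.T)
eigvals = np.linalg.eigvalsh(S)  # real eigenvalues of a symmetric matrix

# the quadratic form v^T S v is a mean of squares, hence non-negative
v = rng.normal(size=4)
quad = v @ S @ v

assert is_symmetric
assert eigvals.min() >= -1e-12   # non-negative up to rounding error
assert quad >= 0
```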

# Minimum-error formulation

consider a complete orthonormal set of basis vectors $u_{i}$, $i = 1, ..., d$, in $\mathbb{R}^{d}$:

$$u_{i}^{T}u_{j} = \delta_{ij}$$

the coordinate of $x_{k}$ with respect to $u_{i}$ is $x_{k}^{T}u_{i}$, so:

$$x_{k} = \sum_{i=1}^{d}(x_{k}^{T}u_{i})u_{i}$$

each $x_{k}$ can be approximated by a representation in an $m$-dimensional subspace ($m < d$) plus a constant component in the remaining directions:

$$\tilde{x}_{k} = \sum_{i=1}^{m}z_{ki}u_{i} + \sum_{i=m+1}^{d}b_{i}u_{i}$$

where the $\{z_{ki}\}$ depend on the particular data point, whereas the $\{b_{i}\}$ are constants that are the same for all data points.

our goal is to minimize the mean squared reconstruction error:

$$J = \frac{1}{n}\sum_{k=1}^{n}\left \| x_{k} - \tilde{x}_{k} \right \|^{2} $$

setting the derivative with respect to $z_{ki}$ to zero, and making use of the orthonormality conditions, we obtain:

$$z_{ki} = x_{k}^{T}u_{i}$$

similarly, setting the derivative with respect to $b_{i}$ to zero, we obtain:

$$b_{i} = \overline{x}^{T}u_{i}$$
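both results can be verified by expanding the residual in the basis $\{u_{i}\}$. the following is a sketch of that intermediate step, writing the per-point coefficients as $z_{ki}$ to match the definition of $\tilde{x}_{k}$:

$$x_{k} - \tilde{x}_{k} = \sum_{i=1}^{m}(x_{k}^{T}u_{i} - z_{ki})u_{i} + \sum_{i=m+1}^{d}(x_{k}^{T}u_{i} - b_{i})u_{i}$$

by orthonormality the cross terms vanish, so:

$$J = \frac{1}{n}\sum_{k=1}^{n}\left[\sum_{i=1}^{m}(x_{k}^{T}u_{i} - z_{ki})^{2} + \sum_{i=m+1}^{d}(x_{k}^{T}u_{i} - b_{i})^{2}\right]$$

$$\frac{\partial J}{\partial z_{ki}} = -\frac{2}{n}(x_{k}^{T}u_{i} - z_{ki}) = 0, \qquad \frac{\partial J}{\partial b_{i}} = -\frac{2}{n}\sum_{k=1}^{n}(x_{k}^{T}u_{i} - b_{i}) = 0$$

the second condition gives $b_{i} = \frac{1}{n}\sum_{k=1}^{n}x_{k}^{T}u_{i} = \overline{x}^{T}u_{i}$.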

substituting for $z_{ki}$ and $b_{i}$, we obtain:

$$x_{k} - \tilde{x}_{k} = \sum_{i=m+1}^{d}((x_{k} - \overline{x})^{T}u_{i})u_{i}$$

finally, our goal is to minimize:

$$J = \frac{1}{n}\sum_{k=1}^{n}\sum_{i=m+1}^{d}(x_{k}^{T}u_{i} - \overline{x}^{T}u_{i})^{2} = \sum_{i=m+1}^{d}u_{i}^{T}Su_{i}$$

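this mirrors the maximum variance formulation. minimizing $J$ subject to the orthonormality constraints means choosing the discarded directions $u_{m+1}, ..., u_{d}$ to be the eigenvectors of $S$ with the smallest eigenvalues, which is equivalent to keeping the $m$ eigenvectors with the largest eigenvalues.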
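the link between the two formulations can be checked numerically. in the sketch below, on synthetic data assumed for illustration, the points are reconstructed from the $m$ leading eigenvectors of $S$ with $z_{ki} = x_{k}^{T}u_{i}$ and $b_{i} = \overline{x}^{T}u_{i}$, and the resulting error $J$ is compared with the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, m = 400, 5, 2
# synthetic data (an assumption for illustration); rows are the points x_k
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
x_bar = X.mean(axis=0)
Xc = X - x_bar
S = Xc.T @ Xc / n

eigvals, eigvecs = np.linalg.eigh(S)       # ascending order
order = np.argsort(eigvals)[::-1]          # reorder to descending
eigvals, U = eigvals[order], eigvecs[:, order]

# with z_ki = x_k^T u_i (i <= m) and b_i = x_bar^T u_i (i > m), the
# approximation simplifies to x_bar + U_m U_m^T (x_k - x_bar)
U_m = U[:, :m]
X_tilde = x_bar + Xc @ U_m @ U_m.T

J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))   # mean squared error
discarded = eigvals[m:].sum()                     # sum of the d - m smallest eigenvalues

assert np.isclose(J, discarded)
```

since the eigenvalues sum to the trace of $S$, the retained variance and the reconstruction error always add up to the total variance, which is why maximizing one minimizes the other.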