# Principal Components Analysis

K. Leighly 2017

This lecture was drawn from the following sources:
 - Ivezic chapter 7
 - Bishop Chapter 12
 - Jolliffe "Principal Components Analysis"
 - Numerical Recipes


## Motivation - The Curse of Dimensionality

Chapter 7 in Ivezic is called "Dimensionality and its Reduction".  What is meant by this?  A clear discussion is included in Ivezic section 7.1.

_The curse of dimensionality impacts the size of the data set required to constrain a model._ 

Example: Imagine that you want to buy a car.  You want to buy it off the lot rather than order it. At the same time, you want to specify lots of different options. Each option has a probability between zero and 1 of being represented on a given car dealer's lot. But the product rule of probability ensures that the probability of finding a car with $N$ options becomes small, as the probability of the car that you want would be $P=p_1\times p_2 \times \ldots \times p_N$.

Another way to look at this is by asking the question: how much data do I need to constrain a multi-parameter model? 

Another way to look at this is to think about the following problem.

Imagine that points are distributed uniformly through a volume. What proportion of points fall within 1 unit distance from the origin? A good way to look at this is to estimate the ratio of the volume of a hypersphere to the volume of the side-length=2 hypercube that it is embedded in. 

- For two dimensions:

  $$f_2=\frac{\pi r^2}{(2r)^2} = \pi/4 \approx 78.5\%.$$

- For three dimensions it is 52.3%. 

It can be shown that the volume of a hypersphere of radius $r$ in $D$ dimensions is:

$$V_D(r) =\frac{2 r^D \pi^{D/2}}{D \Gamma(D/2)},$$

where $\Gamma(z)$ is the complete gamma function. Then the ratio becomes

$$f_D=\frac{V_D(r)}{(2r)^D} = \frac{\pi^{D/2}}{D\, 2^{D-1} \Gamma(D/2)}.$$

As $D \rightarrow \infty$, $f_D$ goes to zero. _This means that the number of points in a data set required to evenly sample this hypervolume will grow exponentially with dimension._
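To get a feel for how quickly this ratio collapses, here is a minimal sketch that evaluates the formula for $f_D$ above at a few dimensions. The function name `hypersphere_fraction` is just an illustrative choice; only the standard-library `math` module is used.

```python
import math

def hypersphere_fraction(D):
    """Ratio of the volume of a radius-r hypersphere to that of the
    side-2r hypercube enclosing it (the r dependence cancels)."""
    return math.pi ** (D / 2) / (D * 2 ** (D - 1) * math.gamma(D / 2))

# Print the fraction for a handful of dimensions.
for D in (2, 3, 5, 10, 20):
    print(f"D = {D:2d}: f_D = {hypersphere_fraction(D):.3e}")
```

The first two values reproduce the two- and three-dimensional results quoted above, and the fraction drops by many orders of magnitude by $D=20$.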

This calculation motivates the need for dimensionality reduction, which Google defines as "the process of reducing the number of variables under consideration". We can't handle all those variables, so let's get rid of some of them, for example, if some of them are redundant.

Fortunately, there are often good physical reasons for reducing the dimensionality of a problem. The example that Ivezic gives is the following. The SDSS (imaging) comprises a sample of 357 million sources. Each source has 448 measured attributes. If we took it upon ourselves to analyze all those sources with all those attributes, it would be impossible.   

Quoting Lewis Carroll:

> The Walrus and the Carpenter  
> Were walking close at hand;  
> They wept like anything to see  
> Such quantities of sand:  
> "If this were only cleared away,"  
> They said, "it would be grand!"  
> 
> "If seven maids with seven mops  
> Swept it for half a year.  
> Do you suppose," the Walrus said,  
> "That they could get it clear?"  
> "I doubt it," said the Carpenter,  
> And shed a bitter tear.  

But a person might reasonably not choose to study all those sources, and in addition, many of the attributes might be correlated or degenerate.

Bishop gives an additional example. 

Consider $64\times 64$ digital images of the numeral "3" embedded in a $100\times 100$ matrix, where the position and rotation of the embedded image are randomly varied. 

Each resulting image is a point in a $100\times 100 = 10{,}000$ dimensional space, so in principle, you need 10,000 numbers to specify each image. 

But in fact, the variation among the images has just three degrees of freedom - two degrees to represent the location, and one to represent the rotation. The data points then live on a subspace of data space whose intrinsic dimensionality is three.

There are plenty of similar examples, and many examples of the use of PCA throughout astronomy. Moreover, sometimes we do PCA not only to make a problem more tractable but also to try to infer the underlying physics. We will talk about some examples of this, as well as the many other uses for PCA.

Bishop Chapter 12 does a very nice job of developing the PCA. Since this technique is quite important and commonly used, we will follow their development, and whack at the topic from various directions. Ivezic provides a subset of this lecture material.



## Principal Component Analysis - General Points

To understand what we are after in dimensionality reduction, it helps to imagine the dependent-variable data situation.

Imagine that you have a large, $D$-dimensional data space. The attributes defining this space include _several mutually dependent variables_.  Plotting these variables in the $D$-dimensional space would produce a long cigar-shaped cloud of points.

Scientifically, we may be interested in the distribution of these points in our $D$-dimensional space. 

**An equally good description of the data would be attained by a rotation of the axes along and perpendicular to the long axis of the cigar.** Arguably, the rotated coordinates provide a more natural description of the data. 

![](http://www.nlpca.org/fig_pca_principal_component_analysis.png)

How can we identify the rotated coordinate system?   There are two ways:

- **Maximum Variance:** Identify the direction where the _variance is the largest_ (the long axis of the data cigar). That will define the first coordinate, and subsequent ones can be defined from it.

- **Minimum Distance:** An equivalent, complementary method is to find the direction where the _distance of the data points from the new vector is minimized_. 

We will follow Bishop and look at these two developments in turn. Ivezic provides the derivation using the maximum variance.



## Maximum Variance Formulation

Consider a data set of observations $\{\mathbf{x}_n\}$ where $n=1,\ldots,N$, and $\mathbf{x}_n$ is a data point with dimensionality $D$. 

The goal is to project the data onto a space with dimensionality $M$, where $M <D$. 

In order to best describe the data after this transformation, given the fact that we will lose some dimensions, it is best to maximize the variance of the projected data; then, presumably, the information lost in the ignored dimensions will not be critical.

Another way to think about this is that the directions of maximum variance represent real differences from object to object, while the directions with small variance may represent the variance due to noise.  Clearly it is safe to ignore the variance due to noise.

We will assume initially that the value of $M$ is given.  Methods to determine the appropriate value of $M$ will be discussed later.

First consider projection onto a one-dimensional space ($M=1$). Define the direction of this space using a $D$-dimensional vector $\mathbf{u}_1$ which is a unit vector, so that $\mathbf{u}_1^T \mathbf{u}_1 = 1$.  ($\mathbf{u}_1$ is as yet unknown; it will be determined so that it lies along the direction of greatest variance.)

Each data point $\mathbf{x}_n$ is then projected onto $\mathbf{u}_1$ by $\mathbf{u}_1^T \mathbf{x}_n$ (this is just the dot product). 

The mean of the projected data is $\mathbf{u}_1^T \mathbf{\bar{x}}$, where $\mathbf{\bar{x}}$ is the sample set mean, given by:

$$\mathbf{\bar{x}} = \frac{1}{N} \sum_{n=1}^N \mathbf{x}_n$$

and the variance of the projected data is

$$\frac{1}{N} \sum_{n=1}^{N} \left \{\mathbf{u}_1^T
\mathbf{x}_n-\mathbf{u}_1^T \mathbf{\bar{x}}\right \}^2 =
\mathbf{u}_1^T \mathbf{S} \mathbf{u}_1,$$

where the data covariance matrix is defined by (outer product here)

$$\mathbf{S}=\frac{1}{N} \sum_{n=1}^{N} (\mathbf{x}_n -
\mathbf{\bar{x}})(\mathbf{x}_n-\mathbf{\bar{x}})^T. $$

It is actually not hard to show the above equations are true.
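As a numerical sanity check, the sketch below verifies on synthetic data that the variance of the projected points equals $\mathbf{u}_1^T \mathbf{S} \mathbf{u}_1$ for an arbitrary unit vector. The data set, the random seed, and all variable names are assumptions made for illustration; rows of `X` play the role of the $\mathbf{x}_n$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 500, 4
# Synthetic correlated data: rows are the data points x_n.
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))

x_bar = X.mean(axis=0)                     # sample mean
S = (X - x_bar).T @ (X - x_bar) / N        # covariance matrix with 1/N normalization

# An arbitrary unit vector u_1 (not yet the principal direction).
u1 = rng.normal(size=D)
u1 /= np.linalg.norm(u1)

proj = X @ u1                              # u_1^T x_n for every n
var_direct = np.mean((proj - u1 @ x_bar) ** 2)
var_quadratic = u1 @ S @ u1

print(var_direct, var_quadratic)           # the two numbers should agree
```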

Now we maximize $\mathbf{u}_1^T \mathbf{S} \mathbf{u}_1$ with respect to $\mathbf{u}_1$. This is a constrained maximization because $\mathbf{u}_1$ is a unit vector. Constrained maximizations are done using Lagrange multipliers.

### Aside: Lagrange multipliers

Since this is the second time that we have seen this method, we need to review how it works. This follows Mathews & Walker "Mathematical Methods of Physics".

Lagrange multipliers are used when you need to maximize the value of a function _subject to conditions expressed by another function_.

Suppose we want to maximize a function of two variables $f(x,y)$. We need to satisfy the conditions $f_x=0$ and $f_y=0$, where $f_x$ is shorthand for $\partial f/\partial x$.

We will maximize $f(x,y)$ subject to the condition $g(x,y)=$constant. The idea is illustrated by the following figure, taken from Wikipedia: the solution must lie on the red line, i.e., it is a constrained optimum, rather than lying at the absolute peak of the distribution.

![](https://upload.wikimedia.org/wikipedia/commons/5/55/LagrangeMultipliers3D.png)


To begin with

$$df=f_x dx + f_y dy =0.$$

If $dx$ and $dy$ were independent, then we would have $f_x=f_y=0$, and this problem would proceed along the usual lines for an unconstrained maximization. But they are not independent. Rather, they are constrained by the equation involving $g$, so that $dx$ and $dy$ are related by

$$dg=g_x dx + g_y dy = 0.$$

Combining these two equations, we find:

$$f_x/g_x = f_y / g_y.$$

If we call the common ratio $\lambda$, we have:

$$f_x - \lambda g_x =0$$
and
$$f_y-\lambda g_y = 0.$$

These are just the equations that would result if we tried to maximize the function $f-\lambda g$ without the constraint. $\lambda$, a constant, is called the Lagrange Multiplier. Note that the results will depend on $\lambda$, and it is adjusted so that $g(x,y)$ takes on the correct value.
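A short worked example may help. The function here is chosen only for convenience and is not taken from Mathews & Walker. Maximize $f(x,y)=xy$ subject to $g(x,y)=x+y=1$. The partial derivatives are $f_x=y$, $f_y=x$, and $g_x=g_y=1$, so the two conditions become

$$y-\lambda=0, \qquad x-\lambda=0,$$

i.e., $x=y=\lambda$. Adjusting $\lambda$ so that the constraint is satisfied gives $2\lambda=1$, so

$$x=y=\lambda=\tfrac{1}{2}, \qquad f=\tfrac{1}{4}.$$

Any other point on the line $x+y=1$ gives a smaller product, e.g., $(x,y)=(0.4,0.6)$ gives $f=0.24$.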

Back to our problem. The Lagrange multiplier method then leads to the maximization of this function with respect to $\mathbf{u}_1$:

$$\mathbf{u}_1^T \mathbf{S} \mathbf{u}_1 + \lambda_1 (1-\mathbf{u}_1^T
\mathbf{u}_1).$$

Taking the derivative with respect to $\mathbf{u}_1$ and setting it equal to zero yields

$$\mathbf{S} \mathbf{u}_1 =\lambda_1 \mathbf{u}_1$$

This type of equation should be familiar to everyone - it is an eigenvector equation. 

So this says that $\mathbf{u}_1$ must be an eigenvector of $\mathbf{S}$. We can then left-multiply by $\mathbf{u}_1^T$ and use $\mathbf{u}_1^T \mathbf{u}_1 = 1$ to find

$$\mathbf{u}_1^T \mathbf{S} \mathbf{u}_1 = \lambda_1$$

which says in turn that the variance will be the maximum when we set $\mathbf{u}_1$ equal to the eigenvector having the largest eigenvalue $\lambda_1$. This eigenvector is known as the first principal component.

Let us pause to reflect upon what we have here.

- We are trying to locate the direction of greatest variance.  If $\mathbf{u}_1$ is along that direction, then multiplying it by the covariance matrix (i.e., applying a linear transformation) and dotting the result into $\mathbf{u}_1$ itself ($\mathbf{u}_1^T \mathbf{S} \mathbf{u}_1$) will produce the maximum value attainable over all possible choices of the unit vector $\mathbf{u}_1$.  

- So we can view the eigenvectors as defining a new, rotated coordinate system.

One can imagine finding additional principal components in an incremental fashion by choosing each new direction to be the one that maximizes the variance among all directions orthogonal to the previously determined principal components. 

So, for the $M$-dimensional projection space, the optimal linear projection for which the variance of the projected data is maximized is defined by the $M$ eigenvectors $\mathbf{u}_1,\ldots,\mathbf{u}_M$ of the data covariance matrix  $\mathbf{S}$ corresponding to the $M$ largest eigenvalues $\lambda_1,\ldots,\lambda_M$.

So, principal components analysis involves evaluating the mean $\mathbf{\bar{x}}$ and covariance matrix $\mathbf{S}$ and then finding the eigenvectors corresponding to the $M$ largest eigenvalues. You can do this yourself using tools for finding eigenvectors and eigenvalues available in numpy etc; see Section 7.3.1 in Ivezic.  More details are provided below.  You will be asked to do this on the homework.
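The recipe just described can be sketched in a few lines of NumPy. This is a minimal illustration on a synthetic two-dimensional "cigar" whose long axis is known in advance; the data, the seed, and the variable names are all assumptions for the example, not part of Ivezic's or Bishop's treatment. `np.linalg.eigh` is used because $\mathbf{S}$ is symmetric.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000

# Synthetic cigar: large scatter along (1,1)/sqrt(2), small scatter perpendicular to it.
long_axis = np.array([1.0, 1.0]) / np.sqrt(2)
short_axis = np.array([-1.0, 1.0]) / np.sqrt(2)
t = rng.normal(scale=3.0, size=N)
s = rng.normal(scale=0.3, size=N)
X = np.outer(t, long_axis) + np.outer(s, short_axis) + np.array([5.0, -2.0])

# Step 1: mean and covariance matrix.
x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / N

# Step 2: eigen-decomposition, then sort by descending eigenvalue.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 3: keep the M leading eigenvectors and project the centered data onto them.
M = 1
U_M = eigvecs[:, :M]                 # D x M matrix of principal directions
Z = (X - x_bar) @ U_M                # N x M projected data

print("first principal direction:", eigvecs[:, 0])
print("eigenvalues:", eigvals)
```

The first eigenvector should come out parallel (up to an overall sign) to the long axis used to generate the data, and the variance of the projected data should equal the largest eigenvalue.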


## Minimum-error Formulation

Let's look at PCA in the complementary way - through projection error minimization. That is, if I have a candidate basis vector and project the data points onto it, what direction should it have in order to produce the _smallest projected distance_ to the data points?

If you think about it,  you can see why this is an equivalent formulation to the maximum variance method.

Consider a complete orthonormal set of $D$-dimensional basis vectors $\{\mathbf{u}_i\}$, where $i=1,\ldots,D$, that satisfy

$$\mathbf{u}_i^T \mathbf{u}_j = \delta_{ij}.$$

Because the basis set is complete, each data point can be represented exactly by a linear combination of basis vectors

$$\mathbf{x}_n = \sum_{i=1}^D \alpha_{ni} \mathbf{u}_i$$

where the coefficients $\alpha_{ni}$ are unique to each data point. 

This represents simply a rotation to a new coordinate system defined by $\{\mathbf{u}_i\}$ with the original $D$ components $\{x_{n1},\ldots,x_{nD}\}$ replaced by the equivalent set $\{\alpha_{n1},\ldots,\alpha_{nD}\}$. 

The $\alpha_{nj}$ can be found for a particular data point indexed by $n$ by projecting $\mathbf{x}_n$ onto the new basis vectors, i.e., $\alpha_{nj}=\mathbf{x}_n^T \mathbf{u}_j$. We can then eliminate the $\alpha_{ni}$ and write

$$\mathbf{x}_n = \sum_{i=1}^D ( \mathbf{x}_n^T \mathbf{u}_i)
\mathbf{u}_i.$$

The goal, however, is to reduce the dimensionality by using a representation involving a restricted number $M < D$ of variables corresponding to a projection onto a lower-dimensional subspace. Let's approximate $\mathbf{x}_n$ with the first $M$ basis vectors as follows:

$$\tilde{\mathbf{x}}_n = \sum_{i=1}^{M} z_{ni} \mathbf{u}_i +
\sum_{i=M+1}^D b_i \mathbf{u}_i$$

where the $\{z_{ni}\}$ depend on the particular data point, whereas the $\{b_i\}$ are constants that are the same for all the data points. 

We choose the $\{\mathbf{u}_i\}$, the $\{z_{ni}\}$ and the $\{b_i\}$ to minimize the distortion introduced by the reduction in dimensionality.  

What can we use for the distortion?  The distortion measure that we will minimize will be the squared distance between the original data point $\mathbf{x}_n$ and its approximation $\tilde{\mathbf{x}}_n$, averaged over the data set, i.e.,

$$J = \frac{1}{N} \sum_{n=1}^N \lVert \mathbf{x}_n -
\tilde{\mathbf{x}}_n \rVert^2.$$

This is a least-squares minimization problem.

We minimize $J$ with respect to the various coefficients in turn. 

To minimize with respect to $\{z_{ni}\}$, we substitute in for $\tilde{\mathbf{x}}_n$ above, and set the derivative with respect to $\{z_{ni}\}$ equal to zero, and using $\mathbf{u}_i^T \mathbf{u}_j = \delta_{ij}$, we obtain (I was able to derive this):

$$z_{nj} = \mathbf{x}_n^T \mathbf{u}_j$$

where $j=1,\ldots,M$.

Next, set the derivative of $J$ with respect to $b_i$ equal to zero, and that yields

$$b_j=\bar{\mathbf{x}}^T \mathbf{u}_j$$

where $j=M+1,\ldots,D$, and recalling that $\bar{\mathbf{x}} = (1/N) \sum_{n=1}^N \mathbf{x}_n$. 

Substituting these back into the original equation for $\tilde{\mathbf{x}}_n$, and using the exact expansion of $\mathbf{x}_n$ in the full basis, we obtain

$$\mathbf{x}_n-\tilde{\mathbf{x}}_n = \sum_{i=M+1}^D \left\{ (\mathbf{x}_n-\bar{\mathbf{x}})^T \mathbf{u}_i \right\} \mathbf{u}_i.$$

(This was also relatively straightforward to derive.) This result makes sense because it is saying that the difference between our estimate of $\mathbf{x}_n$ and the real value depends only on the projections against the basis vectors **not** included in the estimate. 

The approximation $\tilde{\mathbf{x}}$  lies in the principal subspace, and the error must lie in the subspace orthogonal to that. So the distortion measure becomes

$$ J=\frac{1}{N} \sum_{n=1}^N \sum_{i=M+1}^D (\mathbf{x}_n^T \mathbf{u}_i - \bar{\mathbf{x}}^T \mathbf{u}_i)^2 $$

$$ =\sum_{i=M+1}^D \mathbf{u}_i^T \mathbf{S} \mathbf{u}_i.$$

This revised formulation for $J$ takes into account the minimization with respect to $\{z_{ni}\}$ and $b_i$, but we still need to minimize $J$ with respect to $\{\mathbf{u}_i\}$.

Again, we need to solve this with the constraint $\mathbf{u}_i^T \mathbf{u}_i = 1$, and therefore need to again use a Lagrange multiplier. Bishop approaches the general problem by first considering just the $D=2$ case, with a one-dimensional subspace $M=1$. In this case, the contributions to $J$ come from $\mathbf{u}_2$, i.e., $i=M+1=2$. We choose the direction $\mathbf{u}_2$ to minimize $J=\mathbf{u}_2^T \mathbf{S} \mathbf{u}_2$, subject to the normalization constraint $\mathbf{u}_2^T \mathbf{u}_2 = 1$. The appropriate Lagrange multiplier equation is:

$$\tilde{J} = \mathbf{u}_2^T \mathbf{S} \mathbf{u}_2 + \lambda_2
(1-\mathbf{u}_2^T \mathbf{u}_2).$$

Setting the derivative with respect to $\mathbf{u}_2$ to zero, we obtain $\mathbf{S}\mathbf{u}_2 =\lambda_2 \mathbf{u}_2$, i.e., $\mathbf{u}_2$ is an eigenvector of $\mathbf{S}$ with eigenvalue $\lambda_2$. So, any eigenvector yields a stationary point (a maximum or minimum) of $J$. To find the value of $J$ at the minimum, substitute the solution for $\mathbf{u}_2$ back in to yield $J = \lambda_2$. 

Thus, the minimum value of $J$ is obtained when $\mathbf{u}_2$ is chosen to be the eigenvector corresponding to the smaller of the two eigenvalues. Conversely, then, we should choose the principal subspace to be aligned along $\mathbf{u}_1$, which corresponds to the larger of the two eigenvalues. This result underlines the critical role that the eigenvalues and eigenvectors play in principal component analysis.

Generalizing, the minimization of $J$ for arbitrary $D$ and arbitrary $M < D$ is obtained by choosing the $\{\mathbf{u}_i\}$ to be the eigenvectors of the covariance matrix given by

$$\mathbf{S}\mathbf{u}_i = \lambda_i \mathbf{u}_i,$$

where $i=1,\ldots,D$. The corresponding value of the distortion measure is then

$$J=\sum_{i=M+1}^D \lambda_i,$$

i.e., simply the sum of the eigenvalues of the eigenvectors that are orthogonal to the principal subspace.

So, the bottom line is:

1. Obtain the minimum value of $J$ by choosing these eigenvectors to be the ones having the $D-M$ smallest eigenvalues.
2. Use the eigenvectors corresponding to the $M$ largest eigenvalues to define the principal subspace.

So we have gotten to the same answer using two methods. It is interesting to compare the two approaches to appreciate their complementarity:

 - Maximum variance formulation: Find the direction of maximum variance, which represents the principal eigenvector.  The associated eigenvalue will have the largest value of all the eigenvalues. 
 - Minimum error formulation: Find the directions with the smallest eigenvalues.  These have the smallest variance, and can be safely neglected.
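The result $J=\sum_{i=M+1}^D \lambda_i$ is easy to check numerically. The sketch below builds the approximation $\tilde{\mathbf{x}}_n$ from the $M$ leading eigenvectors, computes the mean squared distortion directly, and compares it with the sum of the discarded eigenvalues. The synthetic data and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, M = 400, 5, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))   # rows are the x_n

x_bar = X.mean(axis=0)
Xc = X - x_bar
S = Xc.T @ Xc / N

# Eigenvectors of S sorted by descending eigenvalue (columns of U).
eigvals, U = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, U = eigvals[order], U[:, order]

# Approximation: keep each point's projections on the first M eigenvectors and
# use the mean's projections (the b_i) for the rest. This is equivalent to
# x_bar + (projection of the centered point onto the principal subspace).
U_keep = U[:, :M]
X_tilde = x_bar + Xc @ U_keep @ U_keep.T

J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
print(J, eigvals[M:].sum())     # the two numbers should agree
```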

## How to Compute Principal Components

Now that we know what principal components analysis is, how do we actually do it? 
There are a number of methods, but we will discuss two basic methods first, and another, more flexible method later. 

The first is to compute the eigenvectors of the covariance matrix. That we use the covariance matrix makes sense from our understanding of PCA - we are looking for the directions of greatest variance. 

The other method is a linear algebra method called singular value decomposition. We will go through each in turn. This discussion partly follows a PCA tutorial found [here](https://arxiv.org/abs/1404.1100) (This tutorial also gives a simple explanation for PCA), but includes material from Bishop, Jolliffe, and Numerical Recipes.

### Eigenvector and Eigenvalue Method

Extracting the eigenvectors and eigenvalues is done through an eigenvalue decomposition of the covariance or correlation matrix. 

Which should you use? This turns out to be a somewhat complicated question that we may discuss later. The brief answer is that if the data points have different units or different scales, you may be better off using correlation, because otherwise the first eigenvector will be dominated by the data that has the largest numerical variance (even if its scaled variance is the same as that of the other data). (See also the discussion in Ivezic 7.3.1.)  (And there is always the data analyst's friend: try it both ways.)

Let the data set equal $\mathbf{X}$, an $m \times n$ matrix, where $m$ is the number of parameters that are measured, and $n$ is the number of measurements. The goal is to find some orthonormal matrix $\mathbf{P}$ for $\mathbf{Y} = \mathbf{PX}$ such that $\mathbf{C}_Y = \frac{1}{n} \mathbf{Y}\mathbf{Y}^T$ is a diagonal matrix. The rows of $\mathbf{P}$ are the principal components of $\mathbf{X}$. Here, $\mathbf{C}_Y$ is the covariance matrix of the data expressed in the rotated coordinate system defined by the eigenvectors. So, the off-diagonal terms in $\mathbf{C}_Y$ are zero, and the elements along the diagonal are the eigenvalues.

To see how this works write:

\begin{equation}
\begin{split}
\mathbf{C}_Y & = \frac{1}{n} \mathbf{Y} \mathbf{Y}^T \\
& =\frac{1}{n} (\mathbf{PX}) (\mathbf{PX})^T \\
& =\frac{1}{n} \mathbf{P}\mathbf{X}\mathbf{X}^T\mathbf{P}^T \\
& = \mathbf{P}( \frac{1}{n} \mathbf{X} \mathbf{X}^T ) \mathbf{P}^T \\
& =\mathbf{P} \mathbf{C}_X \mathbf{P}^T
\end{split}
\end{equation}

where $\mathbf{C}_X$ is the covariance matrix of $\mathbf{X}$, and where we have used the linear algebra identity that $(\mathbf{A}\mathbf{B})^T = \mathbf{B}^T \mathbf{A}^T$.

It is a fact of linear algebra that any symmetric matrix $\mathbf{A}$ is diagonalized by an orthogonal matrix of its eigenvectors. Moreover, $\mathbf{A} = \mathbf{E} \mathbf{D} \mathbf{E}^T$, where $\mathbf{D}$ is a diagonal matrix and $\mathbf{E}$ is a matrix of eigenvectors arranged as columns.

Choose the matrix $\mathbf{P}$ to be a matrix where each row $\mathbf{p}_i$ is an eigenvector of $\frac{1}{n} \mathbf{X} \mathbf{X}^T$. Then, $\mathbf{P} \equiv \mathbf{E}^T$. 

In addition, there is a theorem which holds that the inverse of an orthogonal matrix is its transpose, so $\mathbf{P}^{-1} = \mathbf{P}^T$. Thus

\begin{equation}
\begin{split}
\mathbf{C}_Y & = \mathbf{P} \mathbf{C}_X \mathbf{P}^T \\
& = \mathbf{P}(\mathbf{E} \mathbf{D} \mathbf{E}^T) \mathbf{P}^T \\
& = \mathbf{P} (\mathbf{P}^T \mathbf{D} \mathbf{P}) \mathbf{P}^T \\
& = (\mathbf{P} \mathbf{P}^T) \mathbf{D} (\mathbf{P} \mathbf{P}^T) \\
& = (\mathbf{P}\mathbf{P}^{-1}) \mathbf{D} (\mathbf{P}\mathbf{P}^{-1}) \\
& = \mathbf{D}
\end{split}
\end{equation}



Thus, our choice of $\mathbf{P}$ diagonalizes $\mathbf{C}_Y$.  So:

1. The principal components of $\mathbf{X}$ are the eigenvectors of $\mathbf{C}_X = \frac{1}{n} \mathbf{X}\mathbf{X}^T$.
2. The $i^{th}$ diagonal value of $\mathbf{C}_Y$ is the variance of $\mathbf{X}$ along $\mathbf{p}_i$.


_Note that it is generally assumed that the mean of each measurement type has been subtracted off before this procedure is performed._
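The diagonalization argument can be verified directly. In this sketch the data matrix follows the $m \times n$ convention of this subsection (rows are measured parameters, columns are measurements); the synthetic data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 2000
# Synthetic correlated data: m parameters (rows) by n measurements (columns).
X = rng.normal(size=(m, m)) @ rng.normal(size=(m, n))
X = X - X.mean(axis=1, keepdims=True)      # subtract the mean of each parameter

C_X = X @ X.T / n                          # m x m covariance matrix
eigvals, E = np.linalg.eigh(C_X)           # columns of E are eigenvectors of C_X

P = E.T                                    # rows of P are the principal components
Y = P @ X                                  # data in the rotated coordinate system
C_Y = Y @ Y.T / n

print(np.round(C_Y, 6))                    # should be diagonal
print(eigvals)                             # should match the diagonal of C_Y
```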



### Singular Value Decomposition

The alternative method is that of singular value decomposition (SVD).  

According to Numerical Recipes, SVD is the method of choice for solving linear least-squares problems.  As we have shown above, PCA can be thought of as a solution of a minimum least-squares problem to minimize the distortion $J$.  

In addition, singular value decomposition rests on a theorem of linear algebra: any matrix can be represented by a rotation, then a stretch, then another rotation (see below).  Moreover, singular value decomposition will work in situations when the data matrix is less than ideal.

- Let $\mathbf{X}$ be an arbitrary $n \times m$ matrix (note that the rows and columns have been switched this time as a matter of convention) and $\mathbf{X}^T \mathbf{X}$ be a rank $r$ square symmetric $m \times m$ matrix.

- Let $\{\mathbf{\hat{v}}_1,\mathbf{\hat{v}}_2,\ldots,\mathbf{\hat{v}}_r\}$ be the set of orthonormal $m \times 1$ eigenvectors with associated eigenvalues $\{\lambda_1, \lambda_2, \ldots ,\lambda_r\}$ for the symmetric matrix $\mathbf{X}^T \mathbf{X}$, i.e.,

$$(\mathbf{X}^T \mathbf{X}) \mathbf{\hat{v}}_i = \lambda_i
\mathbf{\hat{v}}_i.$$

- Let $\sigma_i=\sqrt{\lambda_i}$. These are positive and are called singular values.

- Let $\{\mathbf{\hat{u}}_1,\mathbf{\hat{u}}_2,\ldots,\mathbf{\hat{u}}_r\}$ be the set of $n\times 1 $ vectors defined by $\mathbf{\hat{u}}_i = \frac{1}{\sigma_i} \mathbf{X} \mathbf{\hat{v}}_i$.

Note that

$$\mathbf{\hat{u}}_i \centerdot \mathbf{\hat{u}}_j = \begin{cases} 1 &
\text{if $i=j$}, \\ 0 & \text{otherwise} \end{cases}$$

and the magnitude of $\mathbf{X} \mathbf{\hat{v}}_i$ is

$$\lVert \mathbf{X} \mathbf{\hat{v}}_i \rVert = \sigma_i. $$



We then write

$$\mathbf{X} \mathbf{\hat{v}}_i = \sigma_i \mathbf{\hat{u}_i}.$$

This result says that $\mathbf{X}$, the data matrix, when multiplied by an eigenvector of $\mathbf{X}^T \mathbf{X}$ is equal to a scalar (that is related to the eigenvalue) times another vector. Moreover, both the eigenvectors $\{\mathbf{\hat{v}}_1,\mathbf{\hat{v}}_2,\ldots,\mathbf{\hat{v}}_r\}$ and the vectors $\{\mathbf{\hat{u}}_1,\mathbf{\hat{u}}_2,\ldots,\mathbf{\hat{u}}_r\}$ are orthonormal sets or bases in the $r$-dimensional space.



Next, define a new diagonal matrix $\boldsymbol{\Sigma}$ which has the rank-ordered singular values (i.e., the square roots of the eigenvalues) along the diagonal.

Consider also orthogonal matrices:

$$\mathbf{V}=[\mathbf{\hat{v}}_{\tilde{1}}\, \mathbf{\hat{v}}_{\tilde{2}}\,\ldots\,\mathbf{\hat{v}}_{\tilde{m}}]$$
$$\mathbf{U}=[\mathbf{\hat{u}}_{\tilde{1}}\,\mathbf{\hat{u}}_{\tilde{2}}\,\ldots\,\mathbf{\hat{u}}_{\tilde{n}}]$$

Here, the matrices have been filled in by an additional $(m-r)$ and $(n-r)$ orthonormal vectors.   See Figure 4 of the PCA tutorial mentioned above for an illustration of how these matrices are constructed.  

Then, the matrix version of the singular value decomposition (SVD) is

$$\mathbf{X} \mathbf{V} = \mathbf{U} \boldsymbol{\Sigma}$$

where each column of $\mathbf{V}$ and $\mathbf{U}$ operate like the equation above, $\mathbf{X} \mathbf{\hat{v}}_i = \sigma_i \mathbf{\hat{u}_i}.$ $\mathbf{V}$ is orthogonal, so multiplying both sides by $\mathbf{V}^{-1} = \mathbf{V}^T$ yields:

$$\mathbf{X}=\mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T.$$

This equation states that any arbitrary matrix $\mathbf{X}$ can be written as an orthogonal matrix times a diagonal matrix times another orthogonal matrix (i.e., a rotation, a stretch, and another rotation).  This is a theorem of linear algebra: any matrix can be written this way.
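The construction above can be checked against a library SVD routine. The sketch below uses `np.linalg.svd` on a small random matrix (an assumption for illustration) and confirms that the singular values are the square roots of the eigenvalues of $\mathbf{X}^T\mathbf{X}$ and that $\mathbf{X}\mathbf{V} = \mathbf{U}\boldsymbol{\Sigma}$. Note that NumPy returns $\mathbf{V}^T$ rather than $\mathbf{V}$, and returns the singular values as a 1-D array.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 6, 4
X = rng.normal(size=(n, m))                 # an arbitrary n x m matrix

# Full SVD: U is n x n, Vt (= V^T) is m x m, sigma holds the singular values.
U, sigma, Vt = np.linalg.svd(X, full_matrices=True)

# Build the n x m matrix Sigma with the singular values on its diagonal.
Sigma = np.zeros((n, m))
Sigma[:m, :m] = np.diag(sigma)

# Eigenvalues of X^T X, sorted in descending order for comparison.
lam = np.linalg.eigvalsh(X.T @ X)[::-1]

print(np.allclose(X, U @ Sigma @ Vt))       # X = U Sigma V^T
print(np.allclose(X @ Vt.T, U @ Sigma))     # X V = U Sigma
print(np.allclose(sigma, np.sqrt(lam)))     # sigma_i = sqrt(lambda_i)
```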

What does this have to do with PCA? Let us return to the original $m\times n$ data matrix $\mathbf{X}$. We can define a new matrix $\mathbf{Y}$, an $n\times m$ matrix:

$$\mathbf{Y}\equiv \frac{1}{\sqrt{n}} \mathbf{X}^T$$

where each column of $\mathbf{Y}$ has zero mean. This form of $\mathbf{Y}$ is chosen since it is easy to show that

$$\mathbf{Y}^T \mathbf{Y} = \frac{1}{n} \mathbf{X} \mathbf{X}^T = \mathbf{C}_X$$

i.e., the covariance matrix of $\mathbf{X}$. We know that the principal components of $\mathbf{X}$ are the eigenvectors of $\mathbf{C}_X$. If we compute the SVD of $\mathbf{Y}$, then the columns of matrix $\mathbf{V}$ contain the eigenvectors of $\mathbf{Y}^T \mathbf{Y} = \mathbf{C}_X$. We know this because of the construction of the SVD, applied now to $\mathbf{Y}$: $\mathbf{Y} \mathbf{V} = \mathbf{U} \boldsymbol{\Sigma}$, i.e., for each column $\mathbf{Y} \mathbf{\hat{v}}_i = \sigma_i \mathbf{\hat{u}}_i$, where the $\mathbf{\hat{v}}_i$ are by definition the eigenvectors of $\mathbf{Y}^T \mathbf{Y}$.

The advantage of SVD seems to be that, because of the theorem of linear algebra, this decomposition can always be done, no matter how singular the matrix.  Moreover, routines to do this are relatively readily available, e.g., there is one with Numerical Recipes.  In addition, the stuff that you may be interested in getting out of PCA is present in the results of SVD, ready to be harvested. That is, the interpretations of the SVD results are as follows (see this [link](https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca) for  this and more info).

Let the singular value decomposition of the centered $n \times p$ data matrix $\mathbf{X}$ (rows are observations, columns are variables) be as follows, where $\mathbf{S}$ here denotes the diagonal matrix of singular values, not the covariance matrix:

$$\mathbf{X} = \mathbf{U} \mathbf{S} \mathbf{V}^T.$$

Then various results can be inferred.

1. The columns of $\mathbf{V}$ are the principal directions/axes.
2. The columns of $\mathbf{US}$ are principal components, also called scores. Since $\mathbf{X} \mathbf{v}_i = \sigma_i \mathbf{u}_i$, i.e., $\mathbf{X}\mathbf{V} = \mathbf{U}\mathbf{S}$, they are the projections of the data matrix onto the principal directions.
3. To reduce dimensionality from $p$ to $k<p$, select the $k$ first columns of $\mathbf{U}$ and the $k \times k$ upper left part of $\mathbf{S}$. Their product $\mathbf{U}_k \mathbf{S}_k$ is the required $n \times k$ matrix containing the first $k$ principal components.
4. Then, multiplying the first $k$ PCs by the corresponding principal directions $\mathbf{V}_k^T$ yields $\mathbf{X}_k = \mathbf{U}_k \mathbf{S}_k \mathbf{V}_k^T$, which has the original $n\times p$ size, but is of lower rank. So this matrix $\mathbf{X}_k$ is the reconstruction of the original data from the first $k$ PCs.

The advantages of SVD also seem to be computational, in that one doesn't have to compute the SVD for the whole problem, but rather one can compute just the first few singular values and vectors.  This is my impression.
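The four interpretations listed above can be sketched with `np.linalg.svd`. The data matrix here is synthetic, with rows as observations and columns as variables (the $n \times p$ convention of the list), and all names are assumptions for illustration. NumPy returns the singular values as a 1-D array `s` and the principal directions as the rows of `Vt`.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, k = 200, 5, 2
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
X = X - X.mean(axis=0)                      # center each column (variable)

# Economy-size SVD: X = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

directions = Vt.T                           # 1. columns are the principal directions
scores = U * s                              # 2. principal components (scores), equal to U @ diag(s)
scores_k = scores[:, :k]                    # 3. first k principal components, n x k
X_k = scores_k @ Vt[:k, :]                  # 4. rank-k reconstruction, n x p

# The squared singular values divided by n are the covariance eigenvalues.
eigvals = np.linalg.eigvalsh(X.T @ X / n)[::-1]
print(s ** 2 / n)
print(eigvals)
```

The scores computed this way are the same as projecting the data onto the principal directions, `X @ directions`.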



## PCA for High Dimensional Data

In some applications, the number of data points is smaller than the dimensionality of the data space. For example, one might have a few hundred images (the samples) but each image may have several million dimensions, corresponding to the pixels in the images. Let the dimension of the space equal $D$ and the number of samples equal $N$, with $N < D$.

If we try to perform PCA in the $D$ dimensional space, many of the eigenvalues will be zero, corresponding to eigenvectors along directions where the data has zero variance.

So consider $\mathbf{X}$ to be the $N\times D$ data matrix. The covariance matrix is $\mathbf{S} = N^{-1} \mathbf{X}^T \mathbf{X}$, which is a $D \times D$ matrix. The eigenvector equation is

$$\frac{1}{N} \mathbf{X}^T \mathbf{X} \mathbf{u}_i = \lambda_i \mathbf{u}_i.$$

Now, left-multiply both sides by $\mathbf{X}$:

$$\frac{1}{N} \mathbf{X} \mathbf{X}^T (\mathbf{X}\mathbf{u}_i) = \lambda_i (\mathbf{X} \mathbf{u}_i).$$

Define $\mathbf{v}_i = \mathbf{X} \mathbf{u}_i$, and substitute it in:

$$\frac{1}{N} \mathbf{X} \mathbf{X}^T \mathbf{v}_i = \lambda_i \mathbf{v}_i$$

which is the eigenvector equation for the $N \times N$ matrix $N^{-1} \mathbf{X} \mathbf{X}^T$. This will have the same eigenvalues as the original covariance matrix (which will have the additional $D-N+1$ eigenvalues with a value of zero). So this means that the eigenvector problem can be solved in the lower-dimensionality space with cost $O(N^3)$ rather than $O(D^3)$.

To get the eigenvectors, multiply both sides by $\mathbf{X}^T$ to yield:

$$\left ( \frac{1}{N} \mathbf{X}^T \mathbf{X} \right ) (\mathbf{X}^T
\mathbf{v}_i) = \lambda_i (\mathbf{X}^T \mathbf{v}_i) $$

which shows that $(\mathbf{X}^T \mathbf{v}_i)$ is an eigenvector of $\mathbf{S}$ with eigenvalue $\lambda_i$. These eigenvectors will not be normalized. Define $\mathbf{u}_i$ so that $\lVert \mathbf{u}_i \rVert = 1$, and assuming that $\mathbf{v}_i$ have been normalized, yields

$$\mathbf{u}_i = \frac{1}{(N\lambda_i ) ^{1/2}} \mathbf{X}^T \mathbf{v}_i.$$

To summarize, an $N \times D$ type of problem with $N < D$ can be solved by evaluating $\mathbf{X} \mathbf{X}^T$ (which is $N \times N$), finding its eigenvalues and eigenvectors, and then using the last equation to recast the eigenvectors in the original data space.
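Here is a minimal sketch of that procedure for a case with $N < D$. The data are random numbers standing in for, e.g., a small set of images; the sizes, the seed, and the tolerance used to discard the zero eigenvalue are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
N, D = 20, 500                              # far fewer samples than dimensions
X = rng.normal(size=(N, D))
X = X - X.mean(axis=0)                      # center each of the D variables

# Solve the small N x N eigenproblem instead of the D x D one.
K = X @ X.T / N
lam, V = np.linalg.eigh(K)                  # columns of V are normalized v_i
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

# Centering removes one degree of freedom, so drop the (numerically) zero eigenvalue.
keep = lam > 1e-10 * lam[0]
lam, V = lam[keep], V[:, keep]

# Recast in the original data space: u_i = X^T v_i / sqrt(N lambda_i).
U = X.T @ V / np.sqrt(N * lam)              # D x (N-1), columns are unit vectors

print(U.shape)
print(np.linalg.norm(U, axis=0))            # should all be 1
```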



## What about Errors?

All measurements have errors, but there has been no discussion yet on the influence of errors on the principal components. 

Likewise, what can be done if there is missing data? 

That is, say for example you are doing spectral PCA, and there is a region of the spectrum which is bad due to bad pixels or a bad cosmic ray hit. Do you have to throw out the whole spectrum?  

In many cases, this may not be important, but one can imagine a situation where you have, say, a time series of spectra, each with equal exposure, where you suspect that the PC might probe something having to do with variations in high or low states.  The lower flux spectra would have larger error bars, and you'd want to be able to take that fact into account in order to not get skewed PCs.

### Bailey 2012

I found a paper, [Bailey 2012](http://adsabs.harvard.edu/abs/2012PASP..124.1015B), that addresses this problem in a very clear way, and, in addition, will lend further insight to PCA and its calculation.  Note that Ivezic has a look at "PCA with Missing Data" in section 7.3.3, but it seems to me that they tackle a different problem: reconstruction of data that has missing values after the eigenvectors have been constructed from a training set.  

Bailey first lays out the equations needed, and then presents an expectation-maximization method for solving them.


In Section 3, he introduces the classical PCA, i.e., without noise, but from a bit of a different point of view. Consider the principal components $\{ \boldsymbol{\phi_k}\}$ of a data set, i.e., the eigenvectors of the covariance of that data set that have been sorted by their descending eigenvalues. One may then express a new observation $\mathbf{y}$ as

$$\mathbf{y} = \boldsymbol{\mu} + \sum_k c_k \boldsymbol{\phi}_k,$$

where $\boldsymbol{\mu}$ is the mean of the initial data set, and $c_k$ is the reconstruction coefficient for the eigenvector $\boldsymbol{\phi}_k$. It is easier if the mean has been subtracted off from the beginning (it is basically always assumed that you have subtracted the mean), so henceforth, it will be assumed that $\mathbf{y}$ has had the mean subtracted.

To find a particular coefficient $c_{k^\prime}$, take the dot product of both sides with $\boldsymbol{\phi}_{k^\prime}$. Since the eigenvectors are orthonormal,

$$\mathbf{y}\centerdot \boldsymbol{\phi}_{k^\prime} = \sum_k c_k \boldsymbol{\phi}_k \centerdot \boldsymbol{\phi}_{k^\prime} = \sum_k c_k \delta_{kk^\prime} = c_{k^\prime},$$

noting that $\delta_{kk^\prime}$ is the Kronecker delta.

Written this way, we can cast the problem as solving the following minimization problem:

$$\chi^2 = \sum_{ij} [\mathbf{X}-\mathbf{P}\mathbf{C}]^2_{ij},$$

where 
- $\mathbf{X}$ is the data set matrix whose columns are observations and rows are variables
- $\mathbf{P}$ is a matrix whose columns are the principal components $\{\boldsymbol{\phi}_k\}$ to be found
- $\mathbf{C}$ is the matrix of coefficients to fit $\mathbf{X}$ using $\mathbf{P}$.   

This seems somewhat similar to $J$ that we considered above, but it is different in that now all the variables have been summed over.  

For clarity, Bailey notes that the dimensions of the matrices are: $\mathbf{X}$: $[n_{var},n_{obs}]$, $\mathbf{P}$: $[n_{var},n_{vec}]$, and $\mathbf{C}$: $[n_{vec},n_{obs}]$, where $n_{obs}$, $n_{var}$, and $n_{vec}$ are the number of observations, variables, and eigenvectors, respectively. Implicitly, $n_{vec}$ can be smaller than the total number of possible eigenvectors, to effect dimensionality reduction.
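To make the matrix shapes concrete, here is a minimal sketch on synthetic data (the sizes and the random data are assumptions for illustration only). It takes the eigenvectors from a singular value decomposition, obtains the coefficients by dot products as derived above, and evaluates $\chi^2$ for a truncated set of $n_{vec}$ eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(1)
nvar, nobs, nvec = 8, 100, 3
# Columns are observations, rows are variables (Bailey's convention).
X = rng.normal(size=(nvar, nobs)) * np.arange(1, nvar + 1)[:, None]
X -= X.mean(axis=1, keepdims=True)       # subtract the mean of each variable

# Columns of U are the eigenvectors of X X^T, ordered by descending singular value.
U, s, _ = np.linalg.svd(X, full_matrices=False)
P = U[:, :nvec]                          # [nvar, nvec]
C = P.T @ X                              # [nvec, nobs]; c_kj = phi_k . x_j

chi2 = np.sum((X - P @ C) ** 2)
print(P.shape, C.shape)
print(chi2, np.sum(s[nvec:] ** 2))       # the two numbers agree
```

For the unweighted problem, the residual $\chi^2$ is just the sum of the squared singular values of the discarded components, which is why truncating in order of decreasing eigenvalue is optimal.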

In Section 4, Bailey adds weight matrices, which take care of the missing data and uncertainties.  The structure of the resulting $\chi^2$ equations changes with the increasing complexity and generality of the weights specification:

First: 

$$\chi^2 = \sum_{ij} \mathbf{W}_{ij} [\mathbf{X}-\mathbf{P}\mathbf{C}]^2_{ij},$$

where $\mathbf{W}$ is the matrix of weights on the dataset $\mathbf{X}$. This looks quite similar to the usual definition of $\chi^2$.

Looking at the individual per-observation case (i.e., the weights depend on the observation, not just on the individual data points), we use the covariances $\mathbf{V}_j$:

$$\chi^2=\sum_{\text{obs } j} (\boldsymbol{\mathit{X}}^\text{col}_j - \mathbf{P} \boldsymbol{\mathit{C}}^\text{col}_j)^T \mathbf{V}^{-1}_j (\boldsymbol{\mathit{X}}^\text{col}_j-\mathbf{P} \boldsymbol{\mathit{C}}^\text{col}_j),$$

where $\boldsymbol{\mathit{X}}^\text{col}_j$ is the vector formed from the $j$th column of the matrix $\mathbf{X}$, and $\boldsymbol{\mathit{C}}^\text{col}_j$ is similarly described.

The even more general case, in which there are measurement covariances between different observations, can be written as:

$$ \chi^2=(\boldsymbol{\mathit{X}}- [\mathbf{P}]\boldsymbol{\mathit{C}})^T \mathbf{V}^{-1} (\boldsymbol{\mathit{X}} - [\mathbf{P}] \boldsymbol{\mathit{C}}), $$

where $\boldsymbol{\mathit{X}}$ and $\boldsymbol{\mathit{C}}$ are vectors formed by concatenating all of the columns of $\mathbf{X}$ and $\mathbf{C}$, and $[\mathbf{P}]$ is the matrix formed by stacking $\mathbf{P}$ $n_{obs}$ times.  See Bailey 2012 for details.

## EMPCA

Section 5 of the paper explains how to minimize the nasty equations above. Basically, Bailey does it using the expectation-maximization method. Recall that EM is an iterative technique for solving for parameters that maximize a likelihood function for models with hidden or latent variables.  Also note that EMPCA is discussed by Bishop.

As applied to PCA, the parameters to solve for are the eigenvectors, the latent variables are the coefficients $\{c\}$ for fitting the data using those eigenvectors, and the likelihood is the ability of the eigenvectors to describe the data. Recall that each iteration involves two steps: finding the expectation value of the hidden variables given the current model (E-step), then finding the parameters that maximize the likelihood given the estimates of the hidden variables (M-step).

So, to solve for the single most significant eigenvector, start with a random vector of length $n_{var}$ (i.e., it has $n_{var}$ components). For each observation $\mathbf{x}_j$, solve for the coefficient $c_j = \mathbf{x}_j \centerdot \boldsymbol{\phi}$ that best fits that observation. Using those coefficients, update $\boldsymbol{\phi}$ i.e.,

$$\boldsymbol{\phi}_{new} = \sum_j c_j \mathbf{x}_j / \sum_j c_j^2$$

Then, normalize $\boldsymbol{\phi}$ to unit length, and iterate the solutions to $\{c\}$ and $\boldsymbol{\phi}$ until converged.

Summarizing these steps:

1. Choose $\boldsymbol{\phi}$, a random vector of length $n_{var}$.

2. Repeat the following until converged:

    (a) For each observation $\mathbf{x}_j$: $c_j = \mathbf{x}_j
\centerdot \boldsymbol{\phi}$. (E-step)  
    (b) $\boldsymbol{\phi}_{new} = \sum_j c_j \mathbf{x}_j / \sum_j c_j^2$ (M-step;
summing over observations)  
    (c) $\boldsymbol{\phi}_{new,norm}=\boldsymbol{\phi}_{new}/|\boldsymbol{\phi}_{new}|$  

This will generate a vector $\boldsymbol{\phi}$ which is the dominant PCA eigenvector of dataset $\mathbf{X}$, where the observations $\mathbf{x}_j$ are the columns of $\mathbf{X}$. In the long run, it minimizes

$$\chi^2 = \sum_{\text{var } i,\ \text{obs } j}(\mathbf{X}_{ij} - c_j \boldsymbol{\phi}_i)^2.$$

To find subsequent eigenvectors, subtract the projection of $\boldsymbol{\phi}$ from $\mathbf{X}$ and repeat. Continue until the remaining variance is consistent with the expected noise of the data, or until there are sufficient eigenvectors to reproduce the data with the desired accuracy. If only a few eigenvectors are needed for a large data set, this method can be much faster than classical PCA.
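The unweighted iteration is short enough to write out directly. This is a minimal sketch, not Bailey's implementation: the function name `empca_first_vector`, the fixed iteration count (used in place of a convergence test), and the synthetic data are all assumptions made for illustration. The result is compared with the singular value decomposition, which gives the classical PCA eigenvectors.

```python
import numpy as np

def empca_first_vector(X, niter=200, seed=0):
    """Dominant eigenvector of mean-subtracted X ([nvar, nobs]) by EM iteration."""
    rng = np.random.default_rng(seed)
    phi = rng.normal(size=X.shape[0])     # step 1: random starting vector
    phi /= np.linalg.norm(phi)
    for _ in range(niter):                # fixed count instead of a convergence test
        c = phi @ X                       # E-step: c_j = x_j . phi
        phi = X @ c / np.sum(c ** 2)      # M-step: sum_j c_j x_j / sum_j c_j^2
        phi /= np.linalg.norm(phi)        # normalize to unit length
    return phi, phi @ X

# Synthetic data: 6 variables with different variances, 500 observations.
rng = np.random.default_rng(0)
scales = np.array([5.0, 2.0, 1.0, 1.0, 1.0, 1.0])[:, None]
X = scales * rng.normal(size=(6, 500))
X -= X.mean(axis=1, keepdims=True)

phi1, c1 = empca_first_vector(X)
X_deflated = X - np.outer(phi1, c1)       # subtract the projection of phi1
phi2, c2 = empca_first_vector(X_deflated)

# Compare with classical PCA; eigenvectors are only defined up to a sign.
U = np.linalg.svd(X, full_matrices=False)[0]
print(abs(phi1 @ U[:, 0]), abs(phi2 @ U[:, 1]))
```

Both printed overlaps should be very close to 1. Written this way, the unweighted iteration is essentially a power iteration on $\mathbf{X}\mathbf{X}^T$, so it converges quickly when the leading eigenvalues are well separated.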



#### EMPCA with per-Observation Weights

Next add the weights. We first weight per observation; e.g., a spectrum observed under marginal conditions may be weighted less. The observations $\mathbf{X}$ should first have the weighted mean subtracted. Then the M-step is replaced with

$$\boldsymbol{\phi}_{new} = \sum_j w_j c_j \mathbf{x}_j.$$



#### EMPCA with per-Variable Weights

Weights per variable might be appropriate when there are not only low signal-to-noise observations, but also missing data (for example). This is a much more complicated situation, because a dot product can no longer be used to derive $\{ c_j \}$. Rather, a set of linear equations must be solved. Similarly, the likelihood maximization step must solve a set of linear equations to update $\boldsymbol{\phi}$ rather than performing a sum.

So the weighted EMPCA starts with a set of random orthonormal vectors $\{ \boldsymbol{\phi}_k\}$ and iterates through the following steps. Note that both of these steps should be performed with the appropriate weights on $\mathbf{x}_j$.

1. For each observation $\mathbf{x}_j$ solve for the coefficients $c_{kj}$:

   $$\mathbf{x}_j = \sum_k c_{kj}\boldsymbol{\phi}_k.$$

2. For the given $\{ c_{kj} \}$, solve each $\boldsymbol{\phi}_k$ one-by-one for $k$ in $\{1,\ldots,n_{vec}\}$:

   $$\mathbf{x}_j - \sum_{k^\prime < k} c_{k^\prime j} \boldsymbol{\phi}_{k^\prime} = c_{kj} \boldsymbol{\phi}_k.$$

In the first part, the $\boldsymbol{\phi}_k$ vectors are fixed and the coefficients are solved for with separate equations for each observation $\mathbf{x}_j$. In other words, $\mathbf{X} = \mathbf{P}\mathbf{C}$,  and each independent observation column indexed by $j$ can be written as:

$$\boldsymbol{\mathit{X}}^{\text{col}}_j = \mathbf{P} \boldsymbol{\mathit{C}}^{\text{col}}_j + \text{noise}.$$

See Bailey Figure 1 for a schematic.

This equation is solved for $\boldsymbol{\mathit{C}}^{\text{col}}_j$ with noise weighting by measurement covariance $\mathbf{V}_j$. It is a linear least-squares problem which can be solved with singular value decomposition, for example.

If the noise is independent between variables, the inverse covariance $\mathbf{V}^{-1}_j$ is just a diagonal matrix of weights $\mathbf{W}^{\text{col}}_j$. Note that the covariance discussed here is the measurement covariance, not the covariance of the data set.

In principle, there could be measurement covariance between different observations, in which case one can solve $\boldsymbol{\mathit{X}} = [\mathbf{P}] \boldsymbol{\mathit{C}}$ with the full measurement covariance matrix $\mathbf{V}$, remembering that $\boldsymbol{\mathit{X}}$ stands for all of the columns of $\mathbf{X}$ concatenated into a single vector, and $[\mathbf{P}]$ is formed by stacking $\mathbf{P}$ $n_{obs}$ times.

For the second part, one uses the coefficients $\mathbf{C}$ obtained in the first part, and then solves for the eigenvectors $\mathbf{P}$ one by one. Recalling that $k$ indexes the eigenvector, then

$$\mathbf{X} = \boldsymbol{\mathit{P}}^{\text{col}}_k \otimes \boldsymbol{\mathit{C}}^{\text{row}}_k,$$

where $\otimes$ is the outer product. If the variables (rows) of $\mathbf{X}$ are independent, then one can solve for a single element of $\boldsymbol{\mathit{P}}^{\text{col}}_k$ at a time:

$$\boldsymbol{\mathit{X}}^{\text{row}}_i = \mathbf{P}_{ik} \boldsymbol{\mathit{C}}^{\text{row}}_k.$$

See Bailey Figure 2.

With independent weights $\mathbf{W}^{\text{row}}_i$ on the data $\boldsymbol{\mathit{X}}^{\text{row}}_i$, the $i$th component of the $k$th eigenvector can be solved for with:

$$\mathbf{P}_{ik}= \frac{\sum_j\mathbf{W}_{ij} \mathbf{X}_{ij} \mathbf{C}_{kj}}{\sum_j \mathbf{W}_{ij} \mathbf{C}_{kj} \mathbf{C}_{kj}}.$$

If there are measurement covariances between the data, one can solve for all elements of $\boldsymbol{\mathit{P}}^{\text{col}}_k$ simultaneously, using the full measurement covariance matrix $\mathbf{V}$.

After solving for $\boldsymbol{\mathit{P}}^{\text{col}}_k$, subtract its projection from the data:

$$\mathbf{X}_{new} = \mathbf{X} - \boldsymbol{\mathit{P}}^{\text{col}}_k \otimes \boldsymbol{\mathit{C}}^{\text{row}}_k$$

to remove any variation of the data in the direction of $\boldsymbol{\mathit{P}}^{\text{col}}_k$ so that additional eigenvectors will be orthogonal to the prior ones. Then solve for the next eigenvector $k+1$.
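The two steps can be sketched for the simplest weighted case, independent per-point weights $\mathbf{W}_{ij}$. This is a minimal illustration under stated assumptions, not the `empca` package code: the function names `solve_coeffs` and `weighted_empca`, the fixed iteration count, and the QR re-orthonormalization at the end of each iteration are all choices made here. The usage example applies it to a noise-free rank-one data set in which about 10% of the points are missing (weight zero).

```python
import numpy as np

def solve_coeffs(X, W, P):
    """Step 1: weighted least-squares coefficients, one observation at a time."""
    nvec, nobs = P.shape[1], X.shape[1]
    C = np.zeros((nvec, nobs))
    for j in range(nobs):
        w = W[:, j]
        A = P.T @ (w[:, None] * P)            # [nvec, nvec] normal matrix
        b = P.T @ (w * X[:, j])
        C[:, j] = np.linalg.lstsq(A, b, rcond=None)[0]
    return C

def weighted_empca(X, W, nvec, niter=50, seed=0):
    """X, W: [nvar, nobs] data and independent per-point weights."""
    rng = np.random.default_rng(seed)
    # Random orthonormal starting vectors.
    P, _ = np.linalg.qr(rng.normal(size=(X.shape[0], nvec)))
    for _ in range(niter):
        C = solve_coeffs(X, W, P)
        R = X.copy()
        for k in range(nvec):                 # step 2: one eigenvector at a time
            num = (W * R) @ C[k]              # sum_j W_ij X_ij C_kj
            den = W @ C[k] ** 2               # sum_j W_ij C_kj C_kj
            P[:, k] = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
            R -= np.outer(P[:, k], C[k])      # subtract this vector's projection
        P, _ = np.linalg.qr(P)                # re-orthonormalize (assumed step)
    return P, solve_coeffs(X, W, P)

# Usage: rank-one data with ~10% of the points missing (weight 0).
rng = np.random.default_rng(2)
truth = np.outer(np.sin(np.linspace(0, 3, 40)) + 1.5, rng.normal(size=60))
W = (rng.random(truth.shape) > 0.10).astype(float)
X = np.where(W > 0, truth, 0.0)               # missing values replaced by zeros
P, C = weighted_empca(X, W, nvec=1)
print(np.max(np.abs(P @ C - truth)))          # small: missing points are recovered
```

Because the zero-weight points never enter either step, the placeholder zeros in `X` have no effect on the fit, and the model $\mathbf{P}\mathbf{C}$ fills in the missing values.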

Bailey then applies this method to some simulated data, first a toy model consisting of sines, then some QSO data.

Another implementation of these ideas has been recently published by Delchambre (2015).

## EMPCA Example

I use the Ivezic data set spec4000_corrected.npz to illustrate EMPCA.  This is a sample of spectra of different types of objects all shifted to the rest frame and sampled on the same wavelength range.  The data does not come with errors, so I dummied up some weights using approximate spectrograph efficiency.  Some of the spectra have fluxes less than zero, and the weights in those bins are set to zero.

**Note** The data set was too big to upload to github.  So please obtain it from [here](http://www.nhn.ou.edu/~leighly/spec4000_corrected_weights).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import pylab, mlab, pyplot
from scipy import stats
%matplotlib inline
import pickle

In [None]:
# pickle files must be opened in binary mode
with open('spec4000_corrected_weights','rb') as file:
    temp=pickle.load(file)

In [None]:
flux=temp[0]
weights=temp[1]
wavelengths=temp[2]
labels=temp[3]
index=temp[4]

In [None]:
print(flux.shape)
print(weights.shape)
print(wavelengths.shape)
print(labels)
print(index.shape)

The fluxes are the observed fluxes, which have a wide range, e.g., depending on how far away the object is.  So we should normalize the data.  Here, I normalize by dividing by the integrated flux.  Another way to normalize might be to use the flux in a particular wavelength range.

In [None]:
# NormalizeSpectra below uses scipy.integrate.trapz
# (recent SciPy versions name this function scipy.integrate.trapezoid)
import scipy.integrate

def GetWeightedMean(data,weights):
    meanout=np.zeros(data.shape[1])
    clustsize=data.shape[0]
    for i in range(0,data.shape[1]):
        datatemp=data[0:clustsize,i]
        weightstemp=weights[0:clustsize,i]
        meanout[i]=(np.sum(datatemp*weightstemp))/(np.sum(weightstemp))
    return meanout

def SubtractWeightedMean(data,mean):
    dataout=np.zeros([data.shape[0],data.shape[1]])
    clustsize=data.shape[0]
    meansize=mean.shape[0]
    for i in range(0,clustsize):
        dataout[i,0:meansize]=data[i,0:meansize]-mean[0:meansize]
    return dataout

def NormalizeSpectra(data,weights,wavelengths):
    datanorm=np.zeros([data.shape[0],data.shape[1]])
    weightsnorm=np.zeros([data.shape[0],data.shape[1]])
    numspec=data.shape[0]
    for i in range(0,numspec):
        factor=scipy.integrate.trapz(data[i,:],wavelengths)
        datanorm[i,:]=data[i,:]/(factor/1000.0)
        weightsnorm[i,:]=weights[i,:]/(factor/1000.0)
    return datanorm,weightsnorm

In [None]:
datanorm,weightsnorm = NormalizeSpectra(flux,weights,wavelengths)

Let's analyze just the absorption line galaxies.  Those are indexed by the value "2".

In [None]:
data2=datanorm[index==2]
weights2=weightsnorm[index==2]
print(data2.shape)

You don't necessarily have to subtract the mean, but I usually do because otherwise the first eigenvector will be dominated by the mean spectrum itself rather than by the variations about it.

In [None]:
pylab.rcParams['figure.figsize'] = (15, 6)
data2_mean=GetWeightedMean(data2,weights2)
print(data2_mean.shape)
plt.plot(wavelengths,data2_mean)

In [None]:
data2_sub=SubtractWeightedMean(data2,data2_mean)
plt.plot(wavelengths,data2_sub[0,:])
plt.plot(wavelengths,data2_sub[10,:])
plt.plot(wavelengths,data2_sub[20,:])
plt.plot(wavelengths,data2_sub[30,:])
plt.plot(wavelengths,data2_sub[40,:])
plt.plot(wavelengths,data2_sub[50,:])

In [None]:
from empca import empca
m=empca(data2_sub, weights=weights2, niter=25, nvec=25, smooth=0, randseed=1, silent=False)

fractional=np.zeros(25)
for i in range(0,25):
    fractional[i]=m.R2vec(i)

In [None]:
print(fractional)

print(fractional[0:4].sum())

plt.plot(fractional)

EMPCA doesn't give the eigenvalues; instead, it outputs the fraction of the variance that is accounted for by each eigenvector.  So, for the above, the first eigenvector accounts for 47% of the variance, the second one for 12.5% of the variance, the third one 9.4%, and so on. The first four eigenvectors account for 72% of the variance.  The rest doesn't seem to have any clearly discernible pattern.

In [None]:
plt.plot(wavelengths,m.eigvec[0])

In [None]:
plt.plot(wavelengths,m.eigvec[1])

In [None]:
plt.plot(wavelengths,m.eigvec[2])

In [None]:
plt.plot(wavelengths,m.eigvec[3])

The coefficients can be examined.  As you will see, the dispersion of the coefficients reflects the fraction of the variance accounted for.  

In [None]:
plt.plot(m.coeff[:,0],m.coeff[:,1],'.')

EMPCA offers the ability to compute classic PCA, so we can see how well the two do for reconstructing a spectrum.

In [None]:
from empca import classic_pca

m2=classic_pca(data2_sub)

In [None]:
set_index=500

plt.plot(wavelengths,data2[set_index,:])

In [None]:
#comparison with the mean

plt.plot(wavelengths,data2[set_index,:])
plt.plot(wavelengths,data2_mean)

In [None]:
#the mean plus one eigenvector

plt.plot(wavelengths,data2[set_index,:])
plt.plot(wavelengths,data2_mean+m.coeff[set_index,0]*m.eigvec[0,:])

In [None]:
#the mean plus two eigenvectors

plt.plot(wavelengths,data2[set_index,:])
plt.plot(wavelengths,data2_mean+m.coeff[set_index,0]*m.eigvec[0,:]+m.coeff[set_index,1]*m.eigvec[1,:])


In [None]:
#the mean plus three eigenvectors

plt.plot(wavelengths,data2[set_index,:])
plt.plot(wavelengths,data2_mean+m.coeff[set_index,0]*m.eigvec[0,:]+m.coeff[set_index,1]*m.eigvec[1,:]+m.coeff[set_index,2]*m.eigvec[2,:])

In [None]:
#the mean plus 4 eigenvectors.

plt.plot(wavelengths,data2[set_index,:])
plt.plot(wavelengths,data2_mean+m.coeff[set_index,0]*m.eigvec[0,:]+m.coeff[set_index,1]*m.eigvec[1,:]+ \
         m.coeff[set_index,2]*m.eigvec[2,:]+m.coeff[set_index,3]*m.eigvec[3,:])

In [None]:
plt.plot(wavelengths,data2[set_index,:])
plt.plot(wavelengths,data2_mean+m.coeff[set_index,0]*m.eigvec[0,:]+m.coeff[set_index,1]*m.eigvec[1,:]+ \
         m.coeff[set_index,2]*m.eigvec[2,:]+m.coeff[set_index,3]*m.eigvec[3,:])
plt.plot(wavelengths,data2_mean+m2.coeff[set_index,0]*m2.eigvec[0,:]+m2.coeff[set_index,1]*m2.eigvec[1,:]+ \
         m2.coeff[set_index,2]*m2.eigvec[2,:]+m2.coeff[set_index,3]*m2.eigvec[3,:])

In [None]:
plt.plot(wavelengths,data2[set_index,:]-(data2_mean+m.coeff[set_index,0]*m.eigvec[0,:]+m.coeff[set_index,1]*m.eigvec[1,:]+ \
         m.coeff[set_index,2]*m.eigvec[2,:]+m.coeff[set_index,3]*m.eigvec[3,:]),'b')
plt.plot(wavelengths,data2[set_index,:]-(data2_mean+m2.coeff[set_index,0]*m2.eigvec[0,:]+m2.coeff[set_index,1]*m2.eigvec[1,:]+ \
         m2.coeff[set_index,2]*m2.eigvec[2,:]+m2.coeff[set_index,3]*m2.eigvec[3,:]),'r')

The classic PCA result ends up being a bit noisier for the same number of eigenvectors.

## How Many Components Should You Keep?

If you are using PCA for dimensionality reduction, you must decide how many principal components to keep. It turns out that there is no hard and fast rule for this; rather, there are some rules of thumb and other guidelines, and we will investigate them. Possibly a rigorous approach to some problems might come through a Bayesian PCA, using a model selection approach, but it probably depends on the problem and what you are trying to do.

This section references Ivezic section 7.3, Bishop Chapter 12, and Jolliffe Chapter 6. It is worth noting that Jolliffe does a thorough review of the literature; nearly 10% of his book is devoted to references.



### Rules of Thumb

The question that we are addressing is: how many principal components adequately account for the total variation in $\mathbf{X}$? That is, for dimensionality reduction, we desire to replace the $D$ elements of $\mathbf{X}$ with $M$ principal components, which nevertheless discard very little information. For example, if we want to model spectra using principal components, then using too few principal components will not model features adequately, and using too many will contribute to noise in the reconstructed spectrum. An example is seen in Ivezic figure 7.6 (below).

![Ivezic fig 7.6](http://www.astroml.org/_images/fig_spec_reconstruction_1.png)


### Fraction of Total Variance:  

Perhaps the most obvious criterion is to choose the percentage of the total variation that the selected PCs should account for, say 70% to 90%. Recall that the sum of the eigenvalues equals the total variance of the data set. So, for example, to model 90% of the variance, include principal components until the sum of the associated eigenvalues is 0.9 times the total variance. Whether you use 70% or 90% or another value depends on the problem and what features of the data you need to retain.

It is worth emphasizing that, if this test is to be used, one might want to pre-process the data not only by subtracting the mean (which is always assumed to have been done) but also by normalizing. For example, for PCA of galaxy spectra, one wants to normalize all the spectra so that the integrated flux (or perhaps flux at a particular point) is the same. Otherwise, the first eigenvector will be dominated by the trivial property that some objects are brighter than others because they are simply closer.
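As a minimal sketch of the fraction-of-variance criterion (the synthetic data and the 90% target are assumptions chosen for illustration), the cumulative sum of the sorted eigenvalues gives the number of components $M$ directly:

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic data: 500 observations of 10 variables with decreasing variance.
scales = np.array([10, 4, 3, 1, 1, 1, 1, 1, 1, 1.0])
X = rng.normal(size=(500, 10)) * scales
X -= X.mean(axis=0)

# Eigenvalues of the covariance matrix, sorted in descending order.
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
frac = np.cumsum(eigvals) / eigvals.sum()     # cumulative fraction of variance

target = 0.9
M = int(np.searchsorted(frac, target) + 1)    # smallest M with frac[M-1] >= target
print(frac.round(3))
print("components needed for", target, "of the variance:", M)
```

For this data set the first two components fall just short of 90%, so three are kept; a slightly different target would change the answer, which is why the threshold should be chosen with the problem in mind.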

### Size of Variance of Principal Components:  

This rule applies when the correlation matrix is used rather than the covariance matrix. The idea is that if all elements of $\mathbf{X}$ are independent, then the principal components are the same as the original variables and all will have unit variances in the case of a correlation matrix. So, in a real case where not all variables are independent, some principal components will have variances greater than one, and some will have variances less than one. The rule, called Kaiser's rule, in its simplest form, suggests retaining principal components whose variances exceed 1.

Another case might be if the data set contains groups of variables with large within-group correlations, and small between-group correlations. This method would save only one principal component associated with each group.

The discussion in Jolliffe points out that some people think that keeping only principal components with variances greater than 1 retains too few components. Some simulation-based studies suggest that the cutoff should be more like 0.7.

The rule can be generalized to covariance matrix analysis by using the average value of the eigenvalues as the cutoff, or 0.7 times the average value.
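Here is a minimal sketch of Kaiser's rule on synthetic data (the data set, with two groups of correlated variables plus two independent variables, is an assumption constructed for illustration). It compares the cutoff of 1 with the relaxed cutoff of 0.7.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
# Two groups of three strongly correlated variables plus two independent ones.
g1, g2 = rng.normal(size=(2, n, 1))
X = np.hstack([g1 + 0.3 * rng.normal(size=(n, 3)),
               g2 + 0.3 * rng.normal(size=(n, 3)),
               rng.normal(size=(n, 2))])

# Eigenvalues of the correlation matrix, sorted in descending order.
eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
n_kaiser = int(np.sum(eigvals > 1.0))
n_relaxed = int(np.sum(eigvals > 0.7))

print(eigvals.round(2))
print("retained with cutoff 1.0:", n_kaiser)
print("retained with cutoff 0.7:", n_relaxed)
```

Each correlated group contributes one large eigenvalue, while the two independent variables give eigenvalues scattered close to 1. Whether they survive a cutoff of exactly 1 is then decided by sampling noise, which is one way to see why a lower cutoff such as 0.7 has been suggested.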

### Scree Graph:

The scree graph simply plots the eigenvalues as a function of eigenvalue number (remembering that we conventionally write these in decreasing order). An example of a scree plot is seen in Ivezic figure 7.5.

![Ivezic 7.5](http://www.astroml.org/_images/fig_eigenvalues_1.png)


The idea of the scree plot is that, in the region where the eigenvalues are falling rapidly, a lot of variance is being modeled by each successive principal component, so those principal components should be kept. When the scree plot becomes flat, each successive principal component is not contributing much to modeling the total variance, and those principal components can be dropped. Note, however, the scree plot may not look as nice as the one shown above. As an alternative, the log of the eigenvalue can be plotted.

There are other ad hoc rules; see Jolliffe for discussion.

### Cross Validation

Cross validation offers a more rigorous yet computationally intensive method to determine how many principal components to keep. Note, however, there is still no formal significance test with this method either. Cross validation is discussed in more detail in the next lecture, so we will keep this discussion basic.

The idea is to model a spectrum using an increasing number of principal components, then stop when the overall prediction of the $x_{ij}$ is no longer significantly improved by the addition of extra terms.

Express the singular value decomposition as $\mathbf{X} = \mathbf{U} \mathbf{L} \mathbf{A}^\prime$ (slightly different notation is used in Jolliffe), where $\mathbf{X}$ is $n\times p$, $\mathbf{U}$ is $n\times r$, and $\mathbf{A}$ is $p \times r$. Here $\mathbf{L}$ is the $r \times r$ diagonal matrix with the elements $l_1^{1/2},l_2^{1/2},\ldots,l_r^{1/2}$, where the $l_k$ are the eigenvalues of $\mathbf{X}^\prime \mathbf{X}$. Then $x_{ij}$ can be written as:

$$x_{ij} = \sum_{k=1}^r u_{ik} l_k^{1/2} a_{jk},$$

where $r$ is the rank of $\mathbf{X}$. Then the estimate of $x_{ij}$, based on the first $m$ PCs (and using all the data) is:

$$_m\tilde{x}_{ij} =\sum_{k=1}^m u_{ik} l_k^{1/2} a_{jk}.$$

For cross validation, we need the estimate based on the PCA of the data set that lacks $x_{ij}$. That estimate is written:

$$_m\hat{x}_{ij} =\sum_{k=1}^m \hat{u}_{ik} \hat{l}_k^{1/2} \hat{a}_{jk}.$$

To clarify, $\hat{u}_{ik}$, $\hat{l}_k$, and $\hat{a}_{jk}$ are derived from the singular value decomposition performed on the subset of $\mathbf{X}$ that lacks $x_{ij}$.

Define

$$\text{PRESS}(m) = \sum_{i=1}^{n} \sum_{j=1}^{p} (_m \hat{x}_{ij} - x_{ij})^2.$$

PRESS stands for PREdiction Sum of Squares; the name makes sense, since it is the sum of the squared differences between the model and the data.

How PRESS is used depends on the statistician. One person suggests using

$$R=\frac{\text{PRESS}(m)}{\sum_{i=1}^n \sum_{j=1}^p
(_{(m-1)}\tilde{x}_{ij} - x_{ij})^2}.$$

$R$ compares the prediction error sum of squares (top) obtained after fitting $m$ components with the sum of squared differences between observed and estimated data points based on all the data but using $m-1$ components. If $R < 1$, then the implication is that a better prediction is achieved if $m$ rather than $m-1$ PCs are used, so the $m$th PC should be kept.

Another statistician suggests computation of $W$, where $W$ is defined as:

$$W= \frac{[\text{PRESS}(m-1) -
\text{PRESS}(m)]/\nu_{m,1}}{\text{PRESS} (m)/\nu_{m,2}},$$

where $\nu_{m,1}$ and $\nu_{m,2}$ are the number of degrees of freedom associated with the numerator and denominator, respectively. This looks a little like an $F$-test. Again, if $W>1$ for some $m$, that principal component should be included.

As you can imagine, this type of cross-validation is extremely expensive computationally, since new sets of eigenvectors have to be computed for each data matrix element. See Jolliffe for further details and examples.

