### Principal Component Analysis (PCA)
Consider a sample $X_1, \dots, X_n$ which forms a set of points in $\mathbb{R}^d$.

Is it possible to project this set onto a linear subspace of dimension $d' < d$ while retaining as much information as possible?

PCA does this by preserving as much of the covariance structure as possible: it identifies orthogonal directions along which the points of the set are most spread out.

- The idea: write the empirical covariance matrix as $S = PDP^T$, where:

- $P = (v_1, \dots, v_d)$ is an orthogonal matrix, i.e., $\|v_j\|_2 = 1$ and $v_j^T v_k = 0$ for $j \neq k$.
- $D = \text{diag}(\lambda_1, \dots, \lambda_d)$, with $\lambda_1 \geq \dots \geq \lambda_d \geq 0$.

- Note that $D$ is the empirical covariance matrix of $P^T X_i$'s, for $i = 1, \dots, n$.
- In particular, $\lambda_1$ is the empirical variance of $v_1^T X_i$, and $\lambda_d$ is the empirical variance of $v_d^T X_i$.
- Each $\lambda_j$ measures the spread of the set in the direction $v_j$.
- In particular, $v_1$ is the direction of maximal spread.
- Indeed, $v_1$ maximizes the empirical variance of $a^T X_1, \dots, a^T X_n$ over all $a \in \mathbb{R}^d$ such that $\|a\|_2 = 1$.

- Proof: For any unit vector $a \in \mathbb{R}^d$, set $b = P^T a$, so that $\|b\|_2 = \|a\|_2 = 1$ because $P$ is orthogonal. Then

  $$a^T S a = a^T (PDP^T) a = (P^T a)^T D (P^T a) = \sum_{j=1}^d \lambda_j b_j^2 \leq \lambda_1 \sum_{j=1}^d b_j^2 = \lambda_1,$$

  with equality when $a = v_1$.
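
The bound $a^T S a \leq \lambda_1$ can be checked numerically. The sketch below is only an illustration: the data are synthetic, and the sample size, dimension, and number of random unit vectors are arbitrary assumptions, not values from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sample (assumption): n = 200 points in dimension d = 3,
# stretched so that the spread differs across coordinates.
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.5])

# Empirical covariance with the 1/n normalization.
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / X.shape[0]

# eigh returns eigenvalues in ascending order, so the last one is lambda_1.
eigvals, eigvecs = np.linalg.eigh(S)
lam1 = eigvals[-1]
v1 = eigvecs[:, -1]

# Evaluate a^T S a for many random unit vectors a (one per row of A).
A = rng.normal(size=(1000, 3))
A /= np.linalg.norm(A, axis=1, keepdims=True)
quad = np.einsum("ij,jk,ik->i", A, S, A)

max_random = quad.max()   # never exceeds lambda_1 (up to rounding)
var_v1 = v1 @ S @ v1      # attains lambda_1
```

Random search only approaches the maximum from below, whereas $v_1$ attains it exactly.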

### Principal Component Analysis Main Principle
Idea of PCA: find the collection of orthogonal directions in which the set is most spread out.

- $v_1 \in \underset{u}{\operatorname{argmax}} \; u^T \Sigma u \quad \text{subject to } \|u\|_2 = 1$

  $v_2 \in \underset{u}{\operatorname{argmax}} \; u^T \Sigma u \quad \text{subject to } \|u\|_2 = 1, \quad u \perp v_1$

  $\vdots$

  $v_d \in \underset{u}{\operatorname{argmax}} \; u^T \Sigma u \quad \text{subject to } \|u\|_2 = 1, \quad u \perp v_j, \quad j = 1, \dots, d-1$

  where $\Sigma$ is the covariance matrix.

- The $k$ orthogonal directions in which the set is most spread out correspond to the eigenvectors associated with the $k$-largest eigenvalues of $\Sigma$. They are called principal directions.

### Principal Component Analysis Algorithm
- Input $X_1, \dots, X_n$, a set of $n$ points in dimension $d$.
- Compute the empirical covariance matrix $S$.
- Compute the spectral decomposition $S = PDP^T$, where $D = \text{diag}(\lambda_1, \dots, \lambda_d)$ with $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d$, and $P = (v_1, \dots, v_d)$ is an orthogonal matrix.
- Choose $k < d$ and set $P_k = (v_1, \dots, v_k) \in \mathbb{R}^{d \times k}$.
- Output $Y_1, \dots, Y_n$, where $Y_i = P_k^T X_i \in \mathbb{R}^k$, for $i = 1, \dots, n$.
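
The steps above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `pca_project` and the convention that rows of `X` are the points are assumptions. It follows the notes in projecting the raw points, $Y_i = P_k^T X_i$; many implementations subtract the empirical mean before projecting.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X (shape (n, d)) onto the top-k principal directions.

    Returns (Y, P_k, eigvals): Y has shape (n, k), P_k has shape (d, k),
    and eigvals holds all d eigenvalues of S in decreasing order.
    """
    n, d = X.shape

    # Empirical covariance matrix S with the 1/n normalization.
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n

    # Spectral decomposition; eigh returns ascending eigenvalues,
    # so reorder to get lambda_1 >= ... >= lambda_d.
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals = eigvals[order]
    P = eigvecs[:, order]

    # Keep the first k principal directions.
    P_k = P[:, :k]

    # Row i of Y is Y_i^T = X_i^T P_k, i.e. Y_i = P_k^T X_i.
    Y = X @ P_k
    return Y, P_k, eigvals
```

For example, `Y, P_k, lam = pca_project(X, 2)` reduces an `(n, d)` array to `(n, 2)`. The empirical covariance of the rows of `Y` is then $\text{diag}(\lambda_1, \lambda_2)$.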


### Applications
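PCA is used in statistics for estimation, and in machine learning.
- In genomics, sparse PCA is used to analyze gene expression data.
- It may be known beforehand that $\Sigma$ has low rank.
- In that case, run PCA on $S$ and write $S \approx S'$, where $S' = P D' P^T$ and $D' = \text{diag}(\lambda_1, \dots, \lambda_k, 0, \dots, 0)$ keeps only the $k$ largest eigenvalues.
- $S'$ is then a better estimator of $\Sigma$ than $S$ under the low-rank assumption.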
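
A minimal NumPy sketch of a low-rank estimator of this kind, assuming it is built by keeping the $k$ largest eigenvalues of $S$ and setting the others to zero. The function name `low_rank_covariance` and the choice of `k` are illustrative, not taken from these notes.

```python
import numpy as np

def low_rank_covariance(S, k):
    """Return the rank-k truncation of a symmetric PSD matrix S.

    Assumption: the truncation keeps the k largest eigenvalues of S
    and replaces the remaining ones by zero.
    """
    eigvals, eigvecs = np.linalg.eigh(S)      # ascending order
    order = np.argsort(eigvals)[::-1]         # reorder to decreasing
    lam = eigvals[order]
    P = eigvecs[:, order]

    lam_trunc = lam.copy()
    lam_trunc[k:] = 0.0                       # zero out the d - k smallest

    # (P * lam_trunc) scales column j of P by lam_trunc[j], i.e. P @ diag(lam_trunc).
    return (P * lam_trunc) @ P.T
```

The result is symmetric with rank at most `k`, and its Frobenius distance to `S` is the Euclidean norm of the discarded eigenvalues.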


__________________________

### **Appendix**

### Linear Algebra

- $\Sigma$ and $S$ are symmetric, positive semi-definite matrices.
- Any real symmetric matrix $A \in \mathbb{R}^{d \times d}$ has the spectral decomposition $A = PDP^T$
    where:
    - $P$ is a $d \times d$ orthogonal matrix, i.e., $PP^T = P^TP = I_d$.
    - $D$ is a diagonal matrix.


- The diagonal elements of $D$ are the eigenvalues of $A$, and the columns of $P$ are the eigenvectors of $A$.
- $A$ is positive semi-definite if and only if all its eigenvalues are non-negative.
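
As a small illustration of these facts, the sketch below decomposes a $2 \times 2$ symmetric matrix with `numpy.linalg.eigh`. The matrix is an arbitrary example chosen for this sketch, and the tolerance in the positive semi-definiteness check is an assumption to absorb rounding error.

```python
import numpy as np

# Arbitrary symmetric example matrix; its eigenvalues are 1 and 3.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is intended for symmetric matrices: it returns real eigenvalues
# in ascending order and orthonormal eigenvectors as the columns of P.
eigvals, P = np.linalg.eigh(A)
D = np.diag(eigvals)

reconstructed = P @ D @ P.T               # should reproduce A
orthogonality = P.T @ P                   # should be the identity I_2
is_psd = bool(np.all(eigvals >= -1e-12))  # all eigenvalues non-negative
```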

### Multivariate Statistics (1)
- Let $X$ be a $d$-dimensional r.v. and $X_1, \ldots, X_n$ be $n$ independent copies of $X$.

$$X_i = (X_i^{(1)}, \ldots, X_i^{(d)})^T, \quad i = 1, \ldots, n$$

- With a slight abuse of notation, also denote by $X$ the random $n \times d$ matrix whose rows are $X_1^T, \ldots, X_n^T$:

$$X = \begin{bmatrix}
X_1^T \\
\vdots \\
X_n^T
\end{bmatrix}$$

- Assume $\mathbb{E}[\|X\|_2^2] < \infty$.

- Mean of $X$: $\mathbb{E}[X] = (\mathbb{E}[X^{(1)}], \ldots, \mathbb{E}[X^{(d)}])^T$.

- Covariance matrix of $X$, $\Sigma = (\sigma_{j,k})_{j,k}$, where $j, k = 1, \ldots, d$, and $\sigma_{j,k} = \mathrm{cov}(X^{(j)}, X^{(k)})$.

$$\Sigma = \mathbb{E}[XX^T] - \mathbb{E}[X]\mathbb{E}[X]^T = \mathbb{E}[(X - \mathbb{E}[X])(X - \mathbb{E}[X])^T]$$

- Empirical mean of $X_1, \ldots, X_n$:

$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i = (\bar{X}_n^{(1)}, \ldots, \bar{X}_n^{(d)})^T$$

- Empirical covariance of $X_1, \ldots, X_n$, the matrix $S = (s_{j,k})$, where $s_{j,k}$ is the empirical covariance of $X_i^{(j)}, X_i^{(k)}$, $i = 1, \ldots, n$.

$$S = \frac{1}{n} \sum_{i=1}^n (X_i X_i^T - \bar{X}_n \bar{X}_n^T) = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X}_n)(X_i - \bar{X}_n)^T$$
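
The two expressions for $S$ can be compared numerically. The sketch below uses synthetic data (the sample size and dimension are arbitrary assumptions) and also compares against `numpy.cov` with `bias=True`, which applies the same $1/n$ normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # synthetic sample: rows are X_1, ..., X_n
n = X.shape[0]

X_bar = X.mean(axis=0)          # empirical mean, a vector in R^d

# First form: (1/n) sum_i X_i X_i^T  -  X_bar X_bar^T
S_raw = X.T @ X / n - np.outer(X_bar, X_bar)

# Second form: (1/n) sum_i (X_i - X_bar)(X_i - X_bar)^T
Xc = X - X_bar
S_centered = Xc.T @ Xc / n

# NumPy's estimator; bias=True selects the 1/n normalization
# (the default divides by n - 1 instead).
S_numpy = np.cov(X, rowvar=False, bias=True)
```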

_______________________

### Multivariate Statistics (2)
- Note that $\bar{X}_n = \frac{1}{n} X^T \mathbf{1}_n$, where $\mathbf{1}_n = (1, \dots, 1)^T \in \mathbb{R}^n$ and $X$ is the $n \times d$ data matrix.

- Note that $S = \frac{1}{n} X^T X - \frac{1}{n^2} X^T \mathbf{1}_n \mathbf{1}_n^T X = \frac{1}{n} X^T H X, \quad \text{where} \quad H = I_n - \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^T.$

- $H$ is an orthogonal projector: $H^2 = H$, $H^T = H$.

- If $u \in \mathbb{R}^d$,

    - $u^T \Sigma u = \text{var}(u^T X).$

    - $u^T S u$ is the sample variance of $u^T X_1, \dots, u^T X_n$.

- $u^T S u$ measures how spread out (i.e., diverse) the points are in the direction $u$.

- If $u^T S u = 0$, then all $X_i$'s are in an affine subspace orthogonal to $u$.

- If $u^T S u$ is large with $\|u\|_2 = 1$, then the direction $u$ explains much of the spread of the sample.
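
The identities in this section can be checked with a short NumPy sketch. The data, the sample size, and the direction `u` below are arbitrary assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))     # synthetic n x d data matrix, rows are X_i^T

# Centering matrix H = I_n - (1/n) 1 1^T.
ones = np.ones((n, 1))
H = np.eye(n) - ones @ ones.T / n

# Empirical covariance written as (1/n) X^T H X.
S = X.T @ H @ X / n

# An arbitrary unit direction u in R^d.
u = np.array([1.0, 2.0, -1.0])
u /= np.linalg.norm(u)

quad_form = u @ S @ u           # u^T S u
proj_var = np.var(X @ u)        # empirical variance (1/n) of u^T X_1, ..., u^T X_n
```

Here `H @ H` equals `H` and `H` is symmetric, which is what makes it an orthogonal projector, and `quad_form` agrees with `proj_var` up to rounding.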