# Principal Component Analysis

Working directly with high-dimensional data, such as text and images, comes with several difficulties: the data are hard to analyze, interpretation is difficult, visualization is nearly impossible, and (from a practical point of view) storing the data vectors can be very expensive.

However, high-dimensional data often have properties we can exploit. For example, they are often overcomplete, that is, many dimensions are redundant and can be explained by a combination of other dimensions.

Moreover, the dimensions of high-dimensional data are often correlated, so that the data possess an intrinsic lower-dimensional structure. Dimensionality reduction exploits this structure and correlation and lets us work with a more compact representation of the data, ideally without losing information. We can think of dimensionality reduction as a compression technique, similar to JPEG or MP3, which are compression algorithms for images and music.

In this chapter, we discuss principal component analysis (PCA), an algorithm for linear dimensionality reduction. PCA, proposed by Pearson (1901) and Hotelling (1933), has been around for more than 100 years and is still one of the most widely used techniques for data compression and visualization. It is also used to identify simple patterns, latent factors, and structures in high-dimensional data.


In this notebook we explore how to find the principal components in some of the datasets seen in previous classes. The idea behind principal components (and many other dimensionality reduction techniques) is to find a combination of the original features that condenses a large part of the variability in our data. This is useful because it lets us:
- visualize the data in a space much smaller than the original one;
- find directions that condense the variation of strongly correlated features and thereby remove redundant information;
- feed regression or classification models with fewer independent variables;
- compress information (part 2).

Principal component decomposition belongs to the family of algorithms known as **unsupervised learning**. These algorithms work on the set of features alone, without any variable that we want to predict (a *target* variable).

### Defining the Problem


  $X = (x_1 , x_2 , \dots , x_K )_{N \times K}$


Factor: 


  $F = X\delta, \quad \delta \in \mathbb{R}^K$



- Idea: summarize the $K$ variables in a single one ($F$).
- Vocabulary: the coefficients of $\delta$ are the loadings: how much each $x_s$ 'matters' in the factor.
- Dimensionality: summarize the original $K$ variables in a few $q < K$ factors.
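
To make the definition concrete, here is a minimal sketch in Python with NumPy. The data matrix and the loadings are made up for illustration (they are assumptions, not values from the course datasets), and the loadings are not optimal in any sense; the point is only the shapes involved.

```python
import numpy as np

# Hypothetical toy data: N = 5 observations of K = 3 variables.
X = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 1.0, 1.5],
    [3.0, 4.0, 2.5],
    [4.0, 3.0, 3.5],
    [5.0, 6.0, 4.5],
])

# Illustrative loadings: one weight per variable (arbitrary, not optimal).
delta = np.array([0.5, 0.3, 0.2])

# The factor is one column of N values that summarizes the K variables.
F = X @ delta

print(X.shape, delta.shape, F.shape)
```

The $N \times K$ matrix collapses to a vector of length $N$: each observation gets a single score.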



### Algebra Review







### Factors via Principal Components




- $x_1, x_2, \dots, x_K$ , K vectors of N observations each.

- Factor: $F = X\delta$

- What is the 'best' linear combination of $x_1, x_2, \dots, x_K$ ?

- Best? Maximum variance. Why? It is the combination that best reproduces the original variability of all the $x$s.






  - Let
  
    - $X = (x_1 , \dots , x_K)_{N \times K}$  , 
    - $\Sigma = V(X)$ 
    - $\delta \in \mathbb{R}^K$
 
  
  - $F = X\delta$ is a linear combination of $X$, with $V (X\delta) = \delta' \Sigma \delta$.
  
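  - Let's set up the problem as 
  \begin{align}
  \underset{\delta}{\max}\,\,\, \delta' \Sigma \delta
  \end{align}
 
  - Without a constraint this problem has no finite solution: scaling $\delta$ by $c$ multiplies the objective by $c^2$, so the variance can be made arbitrarily large by letting the norm of $\delta$ grow without bound.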
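
A quick numerical check of why the unconstrained problem is ill-posed. This is only a sketch: the $2 \times 2$ matrix `Sigma` and the vector `delta` are hypothetical, chosen for easy arithmetic.

```python
import numpy as np

# Hypothetical 2 x 2 covariance matrix, chosen only for illustration.
Sigma = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
delta = np.array([1.0, 0.0])

# Scaling delta by c multiplies delta' Sigma delta by c**2.
objective = {c: float((c * delta) @ Sigma @ (c * delta)) for c in (1, 10, 100)}
print(objective)
```

The objective grows with the square of the scale, so some normalization of $\delta$ is needed for the maximum to exist.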
 

 



- Let's "fix" the problem by normalizing $\delta$:

\begin{align}
\underset{\delta}{\max}\,\, \delta' \Sigma \delta \quad \text{subject to} \quad \delta' \delta = 1 \nonumber
\end{align}
- Let us call the solution to this problem $\delta^*$. 

- $F^* = X\delta^*$ is the 'best' linear combination of $X$. 

- Result: $\delta^*$ is the eigenvector corresponding to the largest eigenvalue of $\Sigma = V (X)$.

- $F^* = X\delta^*$ is the first principal component of $X$.

- Intuition: $X$ has $K$ columns and $F = X\delta$ has only one. The factor built with the first principal component is the best way to represent the $K$ variables of $X$ using a single variable.
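
The result above can be checked numerically. The following is a minimal sketch with NumPy on simulated data (the mixing matrix `A` and the sample size are assumptions made for the example): it takes the eigenvector with the largest eigenvalue as $\delta^*$, confirms that the variance of $F^*$ equals that eigenvalue, and compares it against many random unit-norm directions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (an assumption for illustration): 3 correlated variables.
N, K = 500, 3
A = np.array([[2.0, 0.0, 0.0],
              [1.5, 1.0, 0.0],
              [0.5, 0.3, 0.2]])
X = rng.normal(size=(N, K)) @ A.T

Sigma = np.cov(X, rowvar=False)            # K x K sample covariance matrix

# eigh is meant for symmetric matrices and returns ascending eigenvalues.
eigvals, eigvecs = np.linalg.eigh(Sigma)
delta_star = eigvecs[:, -1]                # eigenvector of the largest eigenvalue
F_star = X @ delta_star                    # first principal component

var_first = F_star.var(ddof=1)

# Variance attained by 1000 random directions with delta' delta = 1.
trials = rng.normal(size=(1000, K))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
var_trials = np.einsum("ik,kl,il->i", trials, Sigma, trials)

print(var_first, eigvals[-1], var_trials.max())
```

No random direction should beat the top eigenvector, and the variance of $F^*$ should match the largest eigenvalue up to floating-point error.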









### Solution to the First Principal Component Problem

- Problem: 
\begin{align}
\underset{\delta}{\max}\,\, \delta' \Sigma \delta \,\, \text{  s.t.}  \,\, \delta' \delta = 1 \nonumber
\end{align}
- Setting up the Lagrangian: $$\mathcal{L}(\delta,\lambda) = \delta' \Sigma \delta + \lambda(1-\delta'\delta)$$

- First-order condition (FOC): differentiating with respect to $\delta$ gives $2\Sigma\delta - 2\lambda\delta = 0$, that is,

\begin{align}
\Sigma \delta = \lambda \delta
\end{align}

- At the optimum, $\delta$ is an eigenvector of $\Sigma$ with eigenvalue $\lambda$. 
- Premultiplying by $\delta'$ and remembering that $\delta'\delta = 1$:
\begin{align}
\delta' \Sigma \delta = \lambda
\end{align}
- To maximize $\delta' \Sigma \delta$ we must therefore choose $\lambda$ equal to the largest eigenvalue of $\Sigma$, and $\delta$ is the corresponding eigenvector.
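
As a worked example of this derivation, consider a hypothetical $2 \times 2$ covariance matrix (chosen only because the arithmetic is easy, not taken from any dataset):

\begin{align}
\Sigma = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad
\det(\Sigma - \lambda I) = (2-\lambda)^2 - 1 = 0 \;\Rightarrow\; \lambda \in \{3, 1\}
\end{align}

For $\lambda = 3$, the condition $\Sigma\delta = 3\delta$ forces both entries of $\delta$ to be equal, and the normalization $\delta'\delta = 1$ then gives

\begin{align}
\delta^* = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad
\delta^{*'} \Sigma \delta^* = \frac{1}{2}(3 + 3) = 3
\end{align}

For $\lambda = 1$ the eigenvector is $\frac{1}{\sqrt{2}}(1, -1)'$, which attains a variance of only $1$. The maximum is therefore reached at the eigenvector of the largest eigenvalue.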





### Factor Construction Is Unsupervised Learning



- Recall that in regression we had
\begin{align}
y = X \beta + u
\end{align}
- We minimized the MSE (equivalently, the sum of squared residuals):
\begin{align}
\min_{\beta} \sum_{i} (y_i-\hat{y}_i)^2
\end{align}

- That learning is supervised: the difference between $y$ and $\hat y$ "guides" the learning.

- The factor construction problem is unsupervised: we construct an index (the factor) without ever observing it.

- We start with 
\begin{align}
X_{N\times K}
\end{align}
- We end with 
\begin{align}
F^*_{N\times 1} = X\delta^*
\end{align}







### $q$ Principal Components



- The first principal component? Are there others?

- Let's consider the following problem:
\begin{align}
\underset{\delta_2}{\max}\,\, & \delta_2' \Sigma \delta_2 \nonumber \\
\text{subject to} \,\, & \delta_2' \delta_2 = 1 \nonumber \\
\text{and} \,\, & Cov(X\delta_2, X\delta^{*}) = 0 \nonumber
\end{align}

- $F_2^*=X\delta^*_2$ is the second principal component: the best linear combination that is uncorrelated with the first one. Since $Cov(X\delta_2, X\delta^*) = \delta_2'\Sigma\delta^* = \lambda\,\delta_2'\delta^*$, with $\lambda$ the eigenvalue associated with $\delta^*$, this amounts to requiring $\delta_2$ to be orthogonal to $\delta^*$ (when $\lambda > 0$).
- Recursively, the same logic yields $q$ principal components. 
- Note that algebraically we could construct up to $q = K$ factors.






- Let $\lambda_1,\dots,\lambda_K$ be the eigenvalues of $\Sigma = V(X)$, ordered from largest to smallest, and $p_1 , \dots , p_K$ the corresponding orthonormal eigenvectors. Let $P$ be the matrix whose columns are the eigenvectors and $\Lambda$ the diagonal matrix of eigenvalues, so that $\Sigma = P \Lambda P'$.

- Result: $\delta_j = p_j$, $\forall j$ (the 'loadings' of the principal components are the ordered eigenvectors of $\Sigma$).

- Let $F_j = X \delta_j$, $j = 1, \dots, K$, be the $j$-th principal component. It's easy to see that
\begin{align}
V (F_j ) = \delta'_j \Sigma \delta_j = p_j' P\Lambda P' p_j = \lambda_j
\end{align}

(the variance of the $j$-th principal component is the $j$-th ordered eigenvalue of $\Sigma$).
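
These properties can be verified numerically. The sketch below uses NumPy on simulated data (the data-generating step is an assumption for illustration): it builds all $K$ components at once and checks that their covariance matrix is diagonal, with the ordered eigenvalues on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated correlated data (an assumption for illustration).
N, K = 400, 4
X = rng.normal(size=(N, K)) @ rng.normal(size=(K, K))

Sigma = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma)   # ascending order

# Reorder from largest to smallest eigenvalue, as in the text.
order = np.argsort(eigvals)[::-1]
lam = eigvals[order]
P = eigvecs[:, order]                      # column j holds the loadings delta_j

F = X @ P                                  # N x K matrix of principal components
cov_F = np.cov(F, rowvar=False)

print(np.round(lam, 4))
print(np.round(np.diag(cov_F), 4))
```

The off-diagonal entries of `cov_F` should be zero up to floating-point error, which is the "uncorrelated with the previous components" constraint at work.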







### Relative Importance of Factors



- The total variance of $X$ is the sum of the variances of $x_j$, $j = 1, \dots, K$, that is, $trace(\Sigma)$.
- It is easy to show that:
\begin{align}
trace(\Sigma) = trace(P \Lambda P')= trace(\Lambda P'P) = trace(\Lambda) = \sum_{j=1}^K \lambda_j= \sum_{j=1}^K V(F_j)
\end{align}
- Then

\begin{align}
\frac{\lambda_j}{\sum_{k=1}^K \lambda_k}
\end{align}

measures the relative importance of the $j$-th principal component.
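
A short sketch of this calculation with NumPy. The covariance matrix is hypothetical (made up for the example); the names `share` and `cumulative` are just illustrative.

```python
import numpy as np

# Hypothetical covariance matrix of K = 3 variables (illustration only).
Sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])

# eigvalsh returns ascending eigenvalues for a symmetric matrix; reverse them.
lam = np.linalg.eigvalsh(Sigma)[::-1]

share = lam / lam.sum()          # relative importance of each component
cumulative = np.cumsum(share)    # variance explained by the first j components

print(np.round(share, 3))
print(np.round(cumulative, 3))
```

The eigenvalues add up to the trace of `Sigma` (here $4 + 3 + 2 = 9$), so the shares sum to one.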




### Selection of Factors




- Look at the importance of the first principal components. If the first one explains a lot, there is really only one dimension (one dimension explains almost everything).

- The coefficients of the eigenvectors are weights. See how each of the variables 'contributes' to each factor.

- Beware of differences in scale. Always standardize.







- Let the columns of $X$ be standardized, so that each variable has unit variance. 
- In this case:

\begin{align}
trace(\Sigma) =  \sum_{j=1}^K V(x_j) = K
\end{align}

- and, recalling that $trace(\Sigma) = \sum_{j=1}^K \lambda_j= \sum_{j=1}^K V(F_j)$, we get

\begin{align}
 \sum_{j=1}^K \lambda_j = K
\end{align}

- On average, each factor contributes one unit of variance. When $\lambda_j>1$, that factor explains more of the total variance than the average factor. $\rightarrow$ Retain the factors with $\lambda_j > 1$.
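
The retention rule can be sketched as follows with NumPy. The data are simulated under an assumed structure (one common driver plus noise, with columns on very different scales); nothing here comes from the course datasets.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data (assumption): one common driver plus noise,
# with columns measured on very different scales.
N, K = 300, 5
common = rng.normal(size=(N, 1))
scales = np.array([1.0, 10.0, 100.0, 0.1, 5.0])
X = (common + 0.5 * rng.normal(size=(N, K))) * scales

# Standardize each column to zero mean and unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# The covariance of standardized data is the correlation matrix of X.
R = np.cov(Z, rowvar=False)
lam = np.linalg.eigvalsh(R)[::-1]          # eigenvalues, largest first

keep = lam > 1.0                           # retain factors with eigenvalue > 1
print(np.round(lam, 3), int(keep.sum()))
```

After standardizing, the eigenvalues sum to $K$ regardless of the original scales, so the threshold of one has the "above average" interpretation described above.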






## Factor Computation

Useful tips:





- As a practical aside, note that `prcomp` (in R) converts its input `x` from sparse to dense matrix storage.

- For really big text document-term matrices (DTMs), which will be very sparse, this will cause you to run out of memory. 

- A big data strategy for PCA is to first calculate the covariance matrix of `x` and then obtain the PC rotations as the eigenvectors of this covariance matrix. 

- The first step can be done using sparse matrix algebra. 

- The rotations are then available from the eigendecomposition of that matrix.
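
The strategy can be sketched in Python with NumPy; this is an illustration under stated assumptions, not the R code the tips refer to. The count matrix is simulated as a stand-in for a DTM and is stored densely here for simplicity; the point is that the covariance is obtained from $X'X$ and the column means, $\Sigma = (X'X - N\mu\mu')/(N-1)$, so the centered (dense) matrix never has to be formed. With a sparse matrix type, the product `X.T @ X` is the step that would use sparse algebra.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated, mostly-zero count matrix standing in for a DTM (assumption).
N, K = 200, 6
X = rng.poisson(0.2, size=(N, K)).astype(float)

mu = X.mean(axis=0)                        # column means
XtX = X.T @ X                              # K x K; no centering of X required

# Covariance without building the centered matrix X - mu.
Sigma = (XtX - N * np.outer(mu, mu)) / (N - 1)

# Rotations (loadings) are the eigenvectors, ordered by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, rotations = eigvals[order], eigvecs[:, order]

print(rotations.shape)
```

Only the $K \times K$ matrix `Sigma` is dense, which is usually manageable even when $N$ is very large.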




## Factor Interpretation




- $F_s = X\delta_s$: the 'loadings' often suggest that a factor works as an 'index' of a group of variables.

- Idea: look at the 'loadings'.

- Caution: factors built via principal components are orthogonal by construction, since each one is required to be uncorrelated with all the previous ones.






### Factor Interpretation: Example



- **Congress and Roll Call Voting**

  - Votes in which names and positions are recorded are called 'roll calls'.
  
  - The site `voteview.com` archives vote records and the R package `pscl` has tools for this data.
  
  - 445 members in the last US House (the $111^{th}$).
  
  - 1647 votes: nay = -1, yea = +1, missing = 0.
  
  - This leads to a large matrix of observations that can probably be reduced to simple factors (party).





