https://en.wikipedia.org/wiki/Principal_component_analysis

## Introduction 
PCA goes by many names in different fields. For example, in linear algebra it corresponds to the singular value decomposition (SVD) of $X$ or the eigenvalue decomposition (EVD) of $X^TX$; it is also sometimes called factor analysis (for a discussion of the differences between PCA and factor analysis, see Ch. 7 of Jolliffe's *Principal Component Analysis*). These names suggest that PCA is at least strongly related to SVD.

PCA is mostly used as a tool in exploratory data analysis and for making predictive models. It is often used to visualize genetic distance and relatedness between populations. PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or by singular value decomposition of a data matrix, usually after a normalization step on the initial data. **Comments: a covariance or correlation matrix is always square (and symmetric), so eigenvalue decomposition can be applied to it**. The normalization of each attribute consists of mean centering (subtracting the variable's measured mean from each data value, so that the empirical mean is zero) and, possibly, scaling each variable to unit variance.
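
The normalization step can be sketched as follows. This is a minimal illustration on made-up data, not part of the referenced sources; the array `X` and the chosen scales are assumptions. It checks that the covariance of centered data is $X^TX/(n-1)$ and that the covariance of standardized data is the correlation matrix of the raw data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 50 samples, 3 variables on very different scales
X = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0]) + np.array([5.0, -2.0, 0.5])
n = X.shape[0]

# Mean centering: subtract each variable's mean from its values
X_centered = X - X.mean(axis=0)

# Optional second step: scale each variable to unit sample variance
X_standardized = X_centered / X.std(axis=0, ddof=1)

# Covariance of centered data is X^T X / (n - 1)
cov_from_centered = X_centered.T @ X_centered / (n - 1)

# Covariance of standardized data is the correlation matrix of the raw data
corr_from_standardized = X_standardized.T @ X_standardized / (n - 1)

print(np.allclose(cov_from_centered, np.cov(X, rowvar=False)))
print(np.allclose(corr_from_standardized, np.corrcoef(X, rowvar=False)))
```

Whether to standardize is a modelling choice: without it, the variable with the largest scale tends to dominate the first principal component.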

PCA is the simplest of the true eigenvector-based multivariate analyses. Often, its operation can be thought of as revealing the internal structure of the data in a way that best explains the variance in the data. If a multivariate dataset is visualized as a set of coordinates in a high-dimensional data space (1 axis per variable), PCA can supply the user with a lower-dimensional picture, a projection of this object when viewed from its most informative viewpoint. This is done by using only the first few principal components so that the dimensionality of the transformed data is reduced.
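
As a rough illustration of keeping only the first few principal components, the sketch below builds synthetic data that is assumed to be driven by two latent factors (the data, the noise level and the choice `k = 2` are all hypothetical) and measures how much of the total variance the leading components carry.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 200 samples, 5 variables driven by 2 latent factors plus small noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))
X = X - X.mean(axis=0)

# Eigendecomposition of the covariance matrix (eigh is meant for symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]          # eigh returns eigenvalues in ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Fraction of the total variance carried by each principal component
explained_ratio = eigvals / eigvals.sum()

# Keep the first k components: an n x k lower-dimensional picture of the data
k = 2
X_reduced = X @ eigvecs[:, :k]

print("explained variance ratio:", np.round(explained_ratio, 4))
print("reduced shape:", X_reduced.shape)
```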

PCA is closely related to factor analysis. Factor analysis typically incorporates more domain-specific assumptions about the underlying structure and solves for the eigenvectors of a slightly different matrix. **So FA is also based on eigenvector analysis**.

PCA is also related to canonical correlation analysis (CCA). CCA defines coordinate systems that optimally describe the cross-covariance between two datasets while PCA defines a new orthogonal coordinate system that optimally describes variance in a single dataset. 


## Relation of PCA and SVD
https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca
Check the above link and the several related links at its end.
### Mapping PCA terms to SVD
* With the usual convention of an $n\times p$ data matrix $X$, where $n$ is the number of samples and $p$ is the number of variables, the $p\times p$ covariance matrix is $C = \frac{X^TX}{n-1}$, not $\frac{XX^T}{n-1}$. This form holds only when the data are centered. The covariance matrix $C$ is symmetric and can therefore always be diagonalized as $C=VLV^T$, where the eigenvalues $\lambda_i$ appear in decreasing order on the diagonal of $L$.
* The eigenvectors in $V$ are called **principal axes** or **principal directions** of the data. 
* Projections of the data on the principal axes are called **principal components** or **PC scores**. 
* The $j$th principal component is given by the $j$th column of $XV$. The coordinates of the $i$th data point in the new PC space are given by the $i$th row of $XV$.
* Given the SVD $X = USV^T$, it follows that $C = \frac{X^TX}{n-1} = \frac{VSU^TUSV^T}{n-1} = V\frac{S^2}{n-1}V^T$, because $U^TU = I$. This means that the right singular vectors $V$ are the principal directions and that the singular values are related to the eigenvalues of the covariance matrix via $\lambda_i = \frac{s_i^2}{n-1}$. Note that the eigenvalues $\lambda_i$ are the variances of the respective PCs. Principal components (scores) are given by $XV = USV^TV = US$.
* Standardized scores are given by the columns of $\sqrt{n-1}U$ and loadings are given by the columns of $\frac{VS}{\sqrt{n-1}}$. See https://stats.stackexchange.com/questions/125684/how-does-fundamental-theorem-of-factor-analysis-apply-to-pca-or-how-are-pca-l and https://stats.stackexchange.com/questions/143905/loadings-vs-eigenvectors-in-pca-when-to-use-one-or-another for why 'loadings' should not be confused with principal directions.
* **The above is correct only** if $X$ is centered. Only then is the covariance matrix equal to $X^TX/(n-1)$.
* **The above is correct only** for $X$ having samples in rows and variables in columns. Otherwise, $U$ and $V$ exchange interpretations.
* If one wants to perform PCA on a correlation matrix (instead of a covariance matrix), then columns of $X$ should not only be centered, but standardized as well, i.e., divided by their standard deviations. 
* To reduce the dimensionality of the data from $p$ to $k \lt p$, select $k$ first columns of $U$, and $k\times k$ upper-left part of $S$. Their product $U_kS_k$ is the required $n\times k$ matrix containing first $k$ PCs. 
* Further multiplying the first $k$ PCs by the corresponding principal axes $V_k^T$ yields $X_k = U_kS_kV_k^T$ matrix that has the original $n\times p$ size but is of lower rank (of rank $k$). This matrix $X_k$ provides a reconstruction of the original data from the first $k$ PCs. It has the lowest possible reconstruction error, as shown in https://stats.stackexchange.com/questions/130721/what-norm-of-the-reconstruction-error-is-minimized-by-the-low-rank-approximation. 
* Strictly speaking, $U$ is of $n\times n$ size and $V$ is of $p\times p$ size. However, if $n\gt p$ then the last $n-p$ columns of $U$ are arbitrary (and the corresponding rows of $S$ are zero); one should therefore use an economy size (or thin) SVD that returns $U$ of $n\times p$ size, dropping the useless columns. For large $n\gg p$ the matrix $U$ would otherwise be unnecessarily huge. The same applies to the opposite situation of $n\ll p$.
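
The items on standardized scores, loadings and rank-$k$ reconstruction can be checked numerically. The following is a minimal sketch on random data; the sizes `n`, `p`, `k` and all variable names are chosen here for illustration and do not come from the linked answer.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 8, 4, 2
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)                      # centering is required for everything below

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD: U is n x p
V = Vt.T
eigvals = s**2 / (n - 1)                    # lambda_i = s_i^2 / (n - 1)

scores = U * s                              # U @ diag(s), equal to X @ V
standardized_scores = np.sqrt(n - 1) * U    # scores rescaled to unit variance
loadings = V * s / np.sqrt(n - 1)           # V @ diag(sqrt(eigvals))

# Rank-k reconstruction X_k = U_k S_k V_k^T
X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
frobenius_error = np.linalg.norm(X - X_k)   # Frobenius norm of the residual

print("variance of standardized scores:", np.var(standardized_scores, axis=0, ddof=1))
print("loadings reproduce C:", np.allclose(loadings @ loadings.T, np.cov(X, rowvar=False)))
print("reconstruction error:", frobenius_error, "vs", np.sqrt(np.sum(s[k:] ** 2)))
```

With these scalings the data factor as standardized scores times transposed loadings, and the reconstruction error in the Frobenius norm equals the root of the sum of the discarded squared singular values.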

### Is there any advantage of SVD over PCA?
SVD is a numerical method and PCA is an analysis approach (like least squares). You can do PCA using SVD, or by eigen-decomposition of $X^TX$ (or $XX^T$), or by many other methods, just as you can solve least squares with a dozen different algorithms such as Newton's method, gradient descent, or SVD.

So there is no "advantage" to SVD over PCA because it's like asking whether Newton's method is better than least squares: the two aren't comparable.
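
To make the point concrete, here is a small sketch in which three numerical routes give the same PCA on the same centered matrix. The data and variable names are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 6, 3
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)

# Route 1: SVD of X
U, s, Vt = np.linalg.svd(X, full_matrices=False)
var_svd = s**2 / (n - 1)
scores_svd = U * s

# Route 2: eigendecomposition of the p x p matrix X^T X
w_cov, _ = np.linalg.eigh(X.T @ X)
var_cov = np.sort(w_cov)[::-1] / (n - 1)

# Route 3: eigendecomposition of the n x n matrix X X^T.
# It shares its nonzero eigenvalues with X^T X; the remaining n - p are (numerically) zero.
w_gram, vecs_gram = np.linalg.eigh(X @ X.T)
order = np.argsort(w_gram)[::-1][:p]
var_gram = w_gram[order] / (n - 1)
# Its eigenvectors are the left singular vectors, so scores follow up to a sign per column
scores_gram = vecs_gram[:, order] * np.sqrt(w_gram[order])

print(np.allclose(var_svd, var_cov), np.allclose(var_svd, var_gram))
print(np.allclose(np.abs(scores_svd), np.abs(scores_gram)))
```

The routes differ in cost and numerical behaviour (forming $X^TX$ squares the condition number, for instance), but they describe the same analysis.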

### Intuitive picture of PCA 
https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/140579#140579

* Does PCA check which characteristics are redundant and discard them? No, PCA is not selecting some characteristics and discarding the others. Instead, it constructs some new characteristics that turn out to summarize our list of wines well. Of course these new characteristics are constructed using the old ones; for example, a new characteristic might be computed as wine age minus wine acidity level or some other combination like that (we call them linear combinations). In fact, PCA finds the best possible characteristics, the ones that summarize the list of wines as well as possible (among all conceivable linear combinations). This is why it is so useful.

* What do you actually mean when you say that these new PCA characteristics "summarize" the list of wines? (A) The first answer is that you are looking for some wine properties (characteristics) that strongly differ across wines. PCA looks for properties that show as much variation across wines as possible. (B) The second answer is that you look for the properties that would allow you to predict, or "reconstruct", the original wine characteristics. So PCA looks for properties that allow you to reconstruct the original characteristics as well as possible. (C) Surprisingly, it turns out that these two aims are equivalent and so PCA can kill two birds with one stone.

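* Why would they be equivalent? See the excellent animation in the linked answer. 
If the line points along the direction of greatest variance, the total projection error is minimized. For each centered point, the squared distance from the origin is fixed and equals the squared projection onto the line plus the squared distance to the line, so maximizing the spread of the projections is the same as minimizing the distances to the line.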

* How does this relate to the Pythagorean theorem, and to eigenvectors and eigenvalues? See the explanation in the linked answer. Is it related to orthogonal projection? Also check https://stats.stackexchange.com/questions/217995/what-is-an-intuitive-explanation-for-how-pca-turns-from-a-geometric-problem-wit to see why maximizing the variance is equivalent to finding eigenvectors.
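
The equivalence of the two aims can be verified numerically. The sketch below uses made-up 2-D data and a hypothetical helper `variance_and_error`; it sweeps a line through many directions and checks that projected variance plus reconstruction error always adds up to the total variance, with the first principal axis attaining the largest variance.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical 2-D data with correlated variables
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.2, 0.5]])
X = X - X.mean(axis=0)
n = X.shape[0]
total_variance = np.sum(X**2) / (n - 1)

def variance_and_error(w):
    """Return (variance of projections onto w, summed squared distance to the line / (n - 1))."""
    w = w / np.linalg.norm(w)
    proj = X @ w                             # coordinate of each point along the line
    residual = X - np.outer(proj, w)         # component perpendicular to the line
    return np.sum(proj**2) / (n - 1), np.sum(residual**2) / (n - 1)

# Sweep directions over half a turn (a line and its opposite are the same line)
angles = np.linspace(0.0, np.pi, 180, endpoint=False)
results = np.array([variance_and_error(np.array([np.cos(a), np.sin(a)])) for a in angles])
variances, errors = results[:, 0], results[:, 1]

# First principal axis: eigenvector of the covariance matrix with the largest eigenvalue
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
pc1_variance, pc1_error = variance_and_error(eigvecs[:, -1])

print("variance + error constant:", np.allclose(variances + errors, total_variance))
print("PC1 variance equals top eigenvalue:", np.isclose(pc1_variance, eigvals[-1]))
```

Because the sum is constant, whichever direction maximizes the variance necessarily minimizes the error, and the maximal variance is the top eigenvalue of the covariance matrix.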


In [2]:
# The code is from the following link
#https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca
import numpy as np
from numpy import linalg as la
np.random.seed(42)


def flip_signs(A, B):
    """
    utility function for resolving the sign ambiguity in SVD
    http://stats.stackexchange.com/q/34396/115202
    """
    signs = np.sign(A) * np.sign(B)
    return A, B * signs


# Let the data matrix X be of n x p size,
# where n is the number of samples and p is the number of variables
n, p = 5, 3
X = np.random.rand(n, p)
# Center the data: subtract each column's mean
X -= np.mean(X, axis=0)

# the p x p covariance matrix
C = np.cov(X, rowvar=False)
print ("C = \n", C)
# C is a symmetric matrix and so it can be diagonalized:
l, principal_axes = la.eig(C)
# sort results wrt. eigenvalues
idx = l.argsort()[::-1]
l, principal_axes = l[idx], principal_axes[:, idx]
# the eigenvalues in decreasing order
print( "l = \n", l)
# a matrix of eigenvectors (each column is an eigenvector)
print( "V = \n", principal_axes)
# projections of X on the principal axes are called principal components
principal_components = X.dot(principal_axes)
print ("Y = \n", principal_components)

# we now perform singular value decomposition of X
# "economy size" (or "thin") SVD
U, s, Vt = la.svd(X, full_matrices=False)
V = Vt.T
S = np.diag(s)

# 1) then columns of V are principal directions/axes.
assert np.allclose(*flip_signs(V, principal_axes))

# 2) columns of US are principal components
assert np.allclose(*flip_signs(U.dot(S), principal_components))

# 3) singular values are related to the eigenvalues of covariance matrix
assert np.allclose((s ** 2) / (n - 1), l)

# 8) dimensionality reduction
k = 2
PC_k = principal_components[:, 0:k]
US_k = U[:, 0:k].dot(S[0:k, 0:k])
assert np.allclose(*flip_signs(PC_k, US_k))

# 10) we used "economy size" (or "thin") SVD
assert U.shape == (n, p)
assert S.shape == (p, p)
assert V.shape == (p, p)

C = 
 [[ 0.09338628 -0.11086559 -0.02943783]
 [-0.11086559  0.18770817  0.0336127 ]
 [-0.02943783  0.0336127   0.12511719]]
l = 
 [0.27418905 0.11232653 0.01969604]
V = 
 [[ 0.53435576  0.10510519 -0.83869948]
 [-0.79577968 -0.27194755 -0.54109078]
 [-0.28495372  0.95655498 -0.06167616]]
Y = 
 [[-0.5382821   0.04170504 -0.17101639]
 [ 0.37801268 -0.26959854  0.10654358]
 [-0.60281427 -0.09375913  0.14821045]
 [ 0.31232627  0.5572872   0.03786103]
 [ 0.45075742 -0.23563458 -0.12159868]]
