# Principal Component Analysis
## Basics
http://www.sthda.com/english/wiki/principal-component-analysis-the-basics-you-should-read-r-software-and-data-mining
### Covariance
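**Variance** of x: $\sigma^{2}_{xx}=\frac{\sum_{i=1}^{n}(x_{i}-\mu_{x})^{2}}{n-1}$

**Covariance** of x and y: $\sigma^{2}_{xy}=\frac{\sum_{i=1}^{n}(x_{i}-\mu_{x})(y_{i}-\mu_{y})}{n-1}$

Here $\mu_{x}$ and $\mu_{y}$ are the means of x and y, and $n$ is the number of observations. The covariance measures how x and y vary together: it is positive when they tend to increase together, negative when one tends to decrease as the other increases, and close to zero when there is no linear relationship.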
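
As a quick check of the two formulas, the sketch below computes them by hand for two columns of the built-in `iris` data and compares the results with R's `var()` and `cov()`. The variable names (`x`, `y`, `var.x`, `cov.xy`) are illustrative choices, not part of the original notebook.

```r
# Hand-computed sample variance and covariance for two iris columns
x <- iris$Sepal.Length
y <- iris$Petal.Length
n <- length(x)

var.x  <- sum((x - mean(x))^2) / (n - 1)
cov.xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Both should agree with the built-in functions
all.equal(var.x, var(x))
all.equal(cov.xy, cov(x, y))
```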

### Covariance matrix
A covariance matrix contains the covariances between all possible pairs of variables in the data set. It is not the same as the **correlation matrix**, which is the covariance matrix of the standardized (centered and scaled) variables.

In [33]:
df <- iris[, -5]
(res.cov <- round(cov(df), 2))

| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|---|
| Sepal.Length | 0.69 | -0.04 | 1.27 | 0.52 |
| Sepal.Width | -0.04 | 0.19 | -0.33 | -0.12 |
| Petal.Length | 1.27 | -0.33 | 3.12 | 1.3 |
| Petal.Width | 0.52 | -0.12 | 1.3 | 0.58 |


*The covariance matrix is symmetric: cov(x, y) = cov(y, x)*

The **diagonal** elements are the variances of the different variables.

In [34]:
diag(res.cov)

The **off-diagonal** values are the covariances between variables.

In [35]:
# Keep the lower triangle only; mask the diagonal and upper triangle with NA
res.cov.off <- res.cov
res.cov.off[upper.tri(res.cov.off, diag = TRUE)] <- NA
res.cov.off

| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|---|
| Sepal.Length | | | | |
| Sepal.Width | -0.04 | | | |
| Petal.Length | 1.27 | -0.33 | | |
| Petal.Width | 0.52 | -0.12 | 1.3 | |


They reflect **distortions** in the data (noise, redundancy, ...).

Values different from zero indicate the presence of **redundancy** in the data, i.e. there is a certain amount of **correlation** between variables.

### Minimize distortion

The covariance matrix is a **non-diagonal matrix** (its off-diagonal values are different from zero).

A diagonal matrix:

In [36]:
dm <- diag(x = 1, 4)
dimnames(dm) <- dimnames(res.cov)
dm

| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|---|
| Sepal.Length | 1 | 0 | 0 | 0 |
| Sepal.Width | 0 | 1 | 0 | 0 |
| Petal.Length | 0 | 0 | 1 | 0 |
| Petal.Width | 0 | 0 | 0 | 1 |


To **diagonalize** the covariance matrix (i.e. transform it so that the off-diagonal elements are zero, meaning zero correlation between pairs of distinct variables), we need to redefine our initial variables as linear combinations of the original ones.

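$\text{Sepal.Length}' = a_1 \cdot \text{Sepal.Length} + a_2 \cdot \text{Sepal.Width} + a_3 \cdot \text{Petal.Length} + a_4 \cdot \text{Petal.Width}$

$\text{Sepal.Width}' = b_1 \cdot \text{Sepal.Length} + b_2 \cdot \text{Sepal.Width} + b_3 \cdot \text{Petal.Length} + b_4 \cdot \text{Petal.Width}$

$\text{Petal.Length}' = c_1 \cdot \text{Sepal.Length} + c_2 \cdot \text{Sepal.Width} + c_3 \cdot \text{Petal.Length} + c_4 \cdot \text{Petal.Width}$

$\text{Petal.Width}' = d_1 \cdot \text{Sepal.Length} + d_2 \cdot \text{Sepal.Width} + d_3 \cdot \text{Petal.Length} + d_4 \cdot \text{Petal.Width}$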

The constants $a_i$, $b_i$, $c_i$, $d_i$ are chosen so that the covariance matrix of the new variables is diagonal.
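
The four equations above can be written compactly in matrix form. As a sketch, assume $X$ is the $n \times 4$ data matrix with one column per original variable, $A$ is the $4 \times 4$ matrix whose columns hold the coefficients $(a_1,\dots,a_4)$, $(b_1,\dots,b_4)$, $(c_1,\dots,c_4)$ and $(d_1,\dots,d_4)$, and $\Sigma$ is the covariance matrix of the original variables. These symbols are introduced here for illustration only.

$$
X' = XA, \qquad \Sigma' = A^{\top}\,\Sigma\,A
$$

Choosing the constants so that $\Sigma'$, the covariance matrix of the new variables, has zeros everywhere off the diagonal is exactly the diagonalization described above.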

### Eigenvalues, eigenvectors

**Eigenvalues** are the numbers on the diagonal of the diagonalized covariance matrix.

**Eigenvectors** of the covariance matrix are the directions of the new rotated axes.

In [37]:
eigen(res.cov)

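Eigenvectors (`$vectors`, one per column):

| [,1] | [,2] | [,3] | [,4] |
|---|---|---|---|
| 0.3607931 | -0.6665999 | -0.5952227 | 0.2668011 |
| -0.08416404 | -0.71962541 | 0.6193041 | -0.3025196 |
| 0.8566357 | 0.1848382 | 0.1011551 | -0.4709328 |
| 0.35905424 | 0.06015556 | 0.50193627 | 0.78455168 |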
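
To see that the eigenvalues really are the diagonal of the diagonalized covariance matrix, the sketch below rotates the covariance matrix into the basis of its eigenvectors. It assumes the unrounded `cov(df)` rather than the rounded `res.cov` used above, and the names `S`, `e` and `D` are illustrative.

```r
df <- iris[, -5]
S  <- cov(df)      # unrounded covariance matrix
e  <- eigen(S)

# Rotate the covariance matrix into the eigenvector basis
D <- t(e$vectors) %*% S %*% e$vectors

round(D, 10)                   # off-diagonal entries are numerically zero
all.equal(diag(D), e$values)   # the diagonal holds the eigenvalues
```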


### Compute principal component analysis

In [38]:
head(df)

| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|---|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 |


#### 1-Center and scale data

In [39]:
df.scaled <- scale(df, center = TRUE, scale = TRUE)
head(round(df.scaled, 2))

| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|
| -0.9 | 1.02 | -1.34 | -1.31 |
| -1.14 | -0.13 | -1.34 | -1.31 |
| -1.38 | 0.33 | -1.39 | -1.31 |
| -1.5 | 0.1 | -1.28 | -1.31 |
| -1.02 | 1.25 | -1.34 | -1.31 |
| -0.54 | 1.93 | -1.17 | -1.05 |
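
What `scale()` does can be reproduced by hand: subtract each column's mean, then divide by its standard deviation. The sketch below is one way to check this; `manual` is an illustrative name.

```r
df <- iris[, -5]
df.scaled <- scale(df, center = TRUE, scale = TRUE)

# Same transformation by hand: center each column, then divide by its sd
centered <- sweep(as.matrix(df), 2, colMeans(df), "-")
manual   <- sweep(centered, 2, apply(df, 2, sd), "/")

all.equal(manual, df.scaled, check.attributes = FALSE)

# After scaling, every column has mean 0 and standard deviation 1
round(colMeans(df.scaled), 10)
apply(df.scaled, 2, sd)
```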


#### 2-Compute correlation matrix

In [40]:
res.cor <- cor(df.scaled)
round(res.cor, 2)

| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|---|
| Sepal.Length | 1.0 | -0.12 | 0.87 | 0.82 |
| Sepal.Width | -0.12 | 1.0 | -0.43 | -0.37 |
| Petal.Length | 0.87 | -0.43 | 1.0 | 0.96 |
| Petal.Width | 0.82 | -0.37 | 0.96 | 1.0 |


*Diagonal values = 1: after scaling (step 1) every variable has unit variance, so the covariance matrix of the scaled data is the correlation matrix.*
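
The link between scaling, covariance and correlation can be checked directly. The sketch below assumes the same `iris` columns as above and compares three routes to the same matrix.

```r
df <- iris[, -5]
df.scaled <- scale(df)

# Covariance of the scaled data equals the correlation of the raw data
all.equal(cov(df.scaled), cor(df))

# cov2cor() rescales a covariance matrix into a correlation matrix
all.equal(cov2cor(cov(df)), cor(df))
```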

#### 3-Calculate eigenvectors/eigenvalues of correlation matrix

In [41]:
(res.eig <- eigen(res.cor))

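Eigenvectors (`$vectors`, one per column):

| [,1] | [,2] | [,3] | [,4] |
|---|---|---|---|
| 0.5210659 | -0.3774176 | 0.7195664 | 0.2612863 |
| -0.2693474 | -0.9232957 | -0.2443818 | -0.1235096 |
| 0.5804131 | -0.02449161 | -0.14212637 | -0.80144925 |
| 0.56485654 | -0.06694199 | -0.63427274 | 0.52359713 |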


`eigen()` returns the eigenvalues in decreasing order, so the first eigenvector is the direction of greatest variance.
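
Because each eigenvalue is the variance carried by its component, dividing by their sum gives the proportion of variance explained. The sketch below is one way to tabulate this; `prop.var` and `cum.var` are illustrative names.

```r
df <- iris[, -5]
res.eig <- eigen(cor(df))

# Share of the total variance carried by each component
prop.var <- res.eig$values / sum(res.eig$values)
cum.var  <- cumsum(prop.var)

data.frame(eigenvalue = res.eig$values,
           proportion = prop.var,
           cumulative = cum.var)
```

For a correlation matrix the eigenvalues sum to the number of variables (here 4), since every variable contributes a variance of 1.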

#### 4-Compute new dataset

In [42]:
df.new <- df.scaled %*% res.eig$vectors
colnames(df.new) <- c("PC1", "PC2", "PC3", "PC4")
head(df.new)

| PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|
| -2.25714118 | -0.47842383 | 0.12727962 | 0.02408751 |
| -2.074013 | 0.6718827 | 0.2338255 | 0.1026628 |
| -2.35633511 | 0.34076642 | -0.0440539 | 0.02828231 |
| -2.29170679 | 0.59539986 | -0.0909853 | -0.06573534 |
| -2.3818627 | -0.64467566 | -0.01568565 | -0.03580287 |
| -2.068700608 | -1.484205297 | -0.02687825 | 0.006586116 |
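
The manual steps can be cross-checked against R's built-in `prcomp()`. The sketch below repeats steps 1 to 4 and compares the result with `prcomp(df, center = TRUE, scale. = TRUE)$x`. The sign of an eigenvector is arbitrary, so the comparison uses absolute values. `res.pca` is an illustrative name.

```r
df <- iris[, -5]
df.scaled <- scale(df, center = TRUE, scale = TRUE)
res.eig <- eigen(cor(df.scaled))
df.new <- df.scaled %*% res.eig$vectors
colnames(df.new) <- paste0("PC", 1:4)

res.pca <- prcomp(df, center = TRUE, scale. = TRUE)

# Component signs may differ between the two methods, so compare magnitudes
all.equal(abs(unname(df.new)), abs(unname(res.pca$x)))

# The new variables are uncorrelated: their covariance matrix is diagonal
round(cov(df.new), 10)
```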


## prcomp() versus princomp()
http://www.sthda.com/english/wiki/principal-component-analysis-in-r-prcomp-vs-princomp-r-software-and-data-mining