# PCA from scratch

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## Algorithm
**Input:**  
- `X`: the data, an array of shape `(N,d)` whose rows are samples and whose columns are features
- `r`: target dimension 

**Output:**
- an array of shape `(N, r)`  

**Steps:**
1. Shift $X$ so that the rows of $X$ are centered at the origin.  
2. Let $C = \frac{1}{N}X^\top X$.  
3. Suppose ${\bf u}_1,\ldots, {\bf u}_d$ form an orthonormal eigenbasis of $C$ corresponding to the eigenvalues $\lambda_1\geq\cdots\geq\lambda_d$.  Let $U$ be the matrix whose columns are ${\bf u}_1,\ldots, {\bf u}_r$.  
4. Return $XU$.

## Pseudocode
Translate the algorithm into pseudocode.  
This helps you identify the parts that you do not yet know how to do.  

    1. Shift X so that it is centered at the origin
       mu = X.mean(axis=0)
       X = X - mu
       Or, in one line: "X = X - X.mean(axis=0)"
    2. C = 1/N * X.T.dot(X) (covariance matrix)
    3. Diagonalize C --> vals,vecs (sorted from largest to smallest eigenvalue)
       U = vecs[:,:r]
    4. Return X.dot(U)
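
Step 3 needs the eigenvalues sorted from largest to smallest, while `np.linalg.eigh` returns them in ascending order. The sketch below illustrates the reordering on a small made-up matrix (the matrix `C` here is only an illustration, not the data of this notebook).

```python
import numpy as np

# A small symmetric matrix with known eigenvalues 2 and 5 (illustrative only).
C = np.array([[2.0, 0.0],
              [0.0, 5.0]])

vals, vecs = np.linalg.eigh(C)   # ascending order: [2., 5.]

# Reverse both the eigenvalues and the columns of eigenvectors together,
# so that vecs[:, i] still belongs to vals[i].
vals = vals[::-1]
vecs = vecs[:, ::-1]

print(vals)                                   # [5. 2.]
print(np.allclose(C.dot(vecs), vecs * vals))  # True
```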

## Code

In [None]:
def PCAincode(X,r): ## X: data of shape (N,d); r: target dimension
    X = X - X.mean(axis=0) ## center the rows at the origin
    N = X.shape[0] ## N is the number of samples
    C = (X.T.dot(X))/N ## covariance matrix
    vals,vecs = np.linalg.eigh(C) ## eigh returns eigenvalues in ascending order
    vals = vals[::-1]
    vecs = vecs[:,::-1] ## reorder from largest to smallest
    U = vecs[:,:r]
    return X.dot(U)
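
As a quick sanity check of these steps, the following sketch runs them on synthetic data. The function name `pca_sketch` and the random data are assumptions made for illustration; the body mirrors the steps of the algorithm. If the projection is correct, the output has shape `(N, r)` and its columns are uncorrelated, with variances in decreasing order.

```python
import numpy as np

def pca_sketch(X, r):
    """Hypothetical stand-in that follows the four steps of the algorithm."""
    X = X - X.mean(axis=0)            # step 1: center the rows
    N = X.shape[0]                    # number of samples
    C = X.T.dot(X) / N                # step 2: covariance matrix
    vals, vecs = np.linalg.eigh(C)    # step 3: ascending eigenvalues
    U = vecs[:, ::-1][:, :r]          # keep the r leading eigenvectors
    return X.dot(U)                   # step 4: project

# Synthetic data: 200 samples, 5 features with very different spreads.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

Y = pca_sketch(X_demo, 2)
cov_Y = Y.T.dot(Y) / Y.shape[0]       # covariance of the projected data

print(Y.shape)                            # (200, 2)
print(np.isclose(cov_Y[0, 1], 0.0))       # off-diagonal entry vanishes
print(cov_Y[0, 0] >= cov_Y[1, 1])         # variances are in decreasing order
```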

## Test
Take some sample data from [PCA-with-scikit-learn](PCA-with-scikit-learn.ipynb) and check whether your code produces output similar to that of existing packages.

##### ML1_Exercise 2
Let  
```python
X = np.genfromtxt('hidden_text.csv', delimiter=',')
```
All points in this data set lie on a two-dimensional plane embedded in a much higher-dimensional space.  

In [None]:
X = np.genfromtxt('hidden_text.csv', delimiter=',')

In [None]:
### results with your code
%matplotlib inline
plt.axis('equal')
print(PCAincode(X,2).shape)
plt.scatter(*PCAincode(X,2).T,color='blue')
## the picture may be mirrored: each principal component is determined only up to sign, so there are 4 possible sign choices in 2 dimensions

In [None]:
## with scikit learn
from sklearn.decomposition import PCA
model = PCA(2)
X_new = model.fit_transform(X)
%matplotlib inline
plt.axis('equal')
plt.scatter(*X_new.T,color='blue')
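
Since each principal component is determined only up to sign, two correct implementations may produce plots that are mirror images of each other. One possible way to compare such outputs is to flip the columns of one result so that they point the same way as the other. The helper `align_signs` below is a hypothetical name, and the data is random, purely for illustration.

```python
import numpy as np

def align_signs(A, B):
    """Flip each column of A so that it has a positive dot product
    with the matching column of B (hypothetical helper)."""
    signs = np.sign(np.sum(A * B, axis=0))
    signs[signs == 0] = 1             # leave orthogonal columns untouched
    return A * signs

rng = np.random.default_rng(1)
Y = rng.normal(size=(50, 2))              # stands in for one PCA output
Y_flipped = Y * np.array([1.0, -1.0])     # same output with the second axis mirrored

Y_aligned = align_signs(Y_flipped, Y)
print(np.allclose(Y_flipped, Y))          # False
print(np.allclose(Y_aligned, Y))          # True
```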

## Comparison

##### Exercise 1
The center of rows of $X$ (before shift) is supposed to be `model.mean_` .  
Check if this is true.

In [None]:
X = np.genfromtxt('hidden_text.csv', delimiter=',')

from sklearn.decomposition import PCA
model = PCA(2)
X_new = model.fit_transform(X)

In [None]:
print((model.mean_==X.mean(axis=0)).all())
## The center of rows of X (before shift) is supposed to be model.mean_.

##### Alex:
It is better to use `np.isclose` instead of `==`.
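
A small illustration of why exact comparison is fragile for floating-point arrays (the numbers are made up for the example):

```python
import numpy as np

a = np.array([0.1 + 0.2, 1.0])   # 0.1 + 0.2 is not exactly 0.3 in floating point
b = np.array([0.3, 1.0])

print((a == b).all())            # False
print(np.isclose(a, b).all())    # True
print(np.allclose(a, b))         # True, shorthand for the line above
```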

##### Exercise 2
The matrix $U^\top$ is supposed to be `model.components_` .  
(Up to some negations.)   
Check if this is true.

In [None]:
X = np.genfromtxt('hidden_text.csv', delimiter=',')

r = 2
X = X - X.mean(axis=0) ## center the rows at the origin
N = X.shape[0] ## N is the number of samples
C = (X.T.dot(X))/N ## covariance matrix
vals,vecs = np.linalg.eigh(C)
vals = vals[::-1]
vecs = vecs[:,::-1] ## reorder from largest to smallest
U = vecs[:,:r]

from sklearn.decomposition import PCA
model = PCA(2)
X_new = model.fit_transform(X)

In [None]:
for i in range(model.components_.shape[0]): ## 2*100
    u , c = U.T[i] , model.components_[i]
    if np.isclose(u,c).all() or np.isclose(u,-c).all(): ## the sign may be flipped, as in the plots above
        print(True)

##### Alex:
With the code above, you have to count the number of `True` lines in the output and check whether it equals the target dimension.
Instead, you may use the following code. It prints `False` for every row of `U.T` that does not match the corresponding row of `model.components_` up to sign.
```Python
for u,c in zip(U.T,model.components_):
    print(np.isclose(u,c).all() or np.isclose(u,-c).all())
```
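
If a single answer is preferred over one line per component, the row-by-row checks can be combined with `all`. This is only a sketch: the function name `same_up_to_sign` and the small test matrices are assumptions for illustration.

```python
import numpy as np

def same_up_to_sign(A, B):
    """True if every row of A equals the matching row of B or its negation."""
    return all(np.allclose(a, b) or np.allclose(a, -b) for a, b in zip(A, B))

A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
B_flipped = A * np.array([[1.0], [-1.0]])   # second row negated
B_wrong = A[::-1]                           # rows swapped

print(same_up_to_sign(A, B_flipped))        # True
print(same_up_to_sign(A, B_wrong))          # False
```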

##### Exercise 3
The values $\lambda_1,\ldots,\lambda_r$ are supposed to be the values in `explained_variance_` .  
Check if this is true.

In [None]:
X = np.genfromtxt('hidden_text.csv', delimiter=',')

r = 2
X = X - X.mean(axis=0) ## center the rows at the origin
N = X.shape[0] ## N is the number of samples
C = (X.T.dot(X))/N ## covariance matrix
vals,vecs = np.linalg.eigh(C)
vals = vals[::-1]
vecs = vecs[:,::-1] ## reorder from largest to smallest
U = vecs[:,:r]

from sklearn.decomposition import PCA
model = PCA(2)
X_new = model.fit_transform(X)

In [None]:
print(np.isclose(vals[:r],model.explained_variance_).all())
## numpy.isclose(a, b, rtol=1e-05, atol=1e-08, equal_nan=False)
## the gap here is larger than the default tolerances allow, so atol is loosened below

In [None]:
print(np.isclose(vals[:r],model.explained_variance_,atol=20).all())
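
A likely source of the gap is the normalization: the algorithm above divides by $N$, whereas scikit-learn's `explained_variance_` is based on the sample covariance, which divides by $N-1$. If that is the cause, rescaling the eigenvalues by $N/(N-1)$ should make the check pass at the default tolerance, without loosening `atol`. The sketch below shows the effect on random data, using `np.cov` (which divides by $N-1$ by default) in place of scikit-learn; the data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(30, 4))             # illustrative data only
Xc = X_demo - X_demo.mean(axis=0)
N = Xc.shape[0]

# Eigenvalues with the 1/N convention used in this notebook.
vals_N = np.linalg.eigvalsh(Xc.T.dot(Xc) / N)[::-1]

# Eigenvalues with the 1/(N-1) convention (np.cov default).
vals_N1 = np.linalg.eigvalsh(np.cov(X_demo, rowvar=False))[::-1]

print(np.allclose(vals_N, vals_N1))                 # False
print(np.allclose(vals_N * N / (N - 1), vals_N1))   # True
```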

##### Exercise 4
Let $t = \operatorname{tr}(C)$.  
The values $\lambda_1/t,\ldots,\lambda_r/t$ are supposed to be the values in `explained_variance_ratio_` .  
Check if this is true.

In [None]:
X = np.genfromtxt('hidden_text.csv', delimiter=',')

r = 2
X = X - X.mean(axis=0) ## center the rows at the origin
N = X.shape[0] ## N is the number of samples
C = (X.T.dot(X))/N ## covariance matrix
vals,vecs = np.linalg.eigh(C)
vals = vals[::-1]
vecs = vecs[:,::-1] ## reorder from largest to smallest
U = vecs[:,:r]

from sklearn.decomposition import PCA
model = PCA(2)
X_new = model.fit_transform(X)

In [None]:
t = np.trace(C)
print(np.isclose(vals[:r]/t,model.explained_variance_ratio_).all())
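
Because $\operatorname{tr}(C)$ equals the sum of all eigenvalues of $C$, the full list of ratios $\lambda_i/t$ sums to $1$. The ratios also do not change when $C$ is multiplied by a constant, so they are the same whether the covariance matrix is normalized by $N$ or by $N-1$. A minimal check on random data (illustrative values, not the notebook's data set):

```python
import numpy as np

rng = np.random.default_rng(4)
X_demo = rng.normal(size=(25, 3))
Xc = X_demo - X_demo.mean(axis=0)
N = Xc.shape[0]

C = Xc.T.dot(Xc) / N
vals = np.linalg.eigvalsh(C)[::-1]
ratio = vals / np.trace(C)

# Same ratios computed from the covariance matrix rescaled to 1/(N-1).
C1 = C * N / (N - 1)
ratio1 = np.linalg.eigvalsh(C1)[::-1] / np.trace(C1)

print(np.isclose(ratio.sum(), 1.0))   # True
print(np.allclose(ratio, ratio1))     # True
```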

##### Exercise 5
The singular values of the shifted `X` are supposed to be `model.singular_values_` .  
Check if this is true.

In [None]:
X = np.genfromtxt('hidden_text.csv', delimiter=',')

r = 2
X = X - X.mean(axis=0) ## center the rows at the origin
N = X.shape[0] ## N is the number of samples
C = (X.T.dot(X))/N ## covariance matrix
vals,vecs = np.linalg.eigh(C)
vals = vals[::-1]
vecs = vecs[:,::-1] ## reorder from largest to smallest
U = vecs[:,:r]

from sklearn.decomposition import PCA
model = PCA(2)
X_new = model.fit_transform(X)

In [None]:
s = np.linalg.svd(X, compute_uv=False) ## only the singular values are needed; this also avoids overwriting U
print(np.isclose(s[:r],model.singular_values_).all())
## print(np.allclose(s[:2],model.singular_values_))
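
The singular values and the eigenvalues are related: if $X$ is the shifted data and $C = \frac{1}{N}X^\top X$, then $X^\top X = NC$, so $\sigma_i^2 = N\lambda_i$. The following sketch checks this relation on random data; the data and variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(40, 6))             # illustrative data only
Xc = X_demo - X_demo.mean(axis=0)             # shifted data
N = Xc.shape[0]

# Eigenvalues of C = X^T X / N, from largest to smallest.
vals = np.linalg.eigvalsh(Xc.T.dot(Xc) / N)[::-1]

# Singular values of the shifted data, already in descending order.
s = np.linalg.svd(Xc, compute_uv=False)

print(np.allclose(s**2 / N, vals))            # True
```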