# PCA from scratch

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## Algorithm
**Input:**  
- `X`: an array of shape `(N,d)` whose rows are samples and columns are features
- `r`: target dimension

**Output:**
- an array of shape `(N, r)`  

**Steps:**
1. Shift $X$ so that the rows of $X$ are centered at the origin.  
2. Let $C = \frac{1}{N}X^\top X$.  
3. Suppose ${\bf u}_1,\ldots, {\bf u}_d$ form an orthonormal eigenbasis of $C$ corresponding to the eigenvalues $\lambda_1\geq\cdots\geq\lambda_d$.  Let $U$ be the matrix whose columns are ${\bf u}_1,\ldots, {\bf u}_r$.  
4. Return $XU$.
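
As a sanity check on the steps above, the following minimal sketch verifies that the columns of $XU$ are uncorrelated and have variances $\lambda_1,\ldots,\lambda_r$, since $\frac{1}{N}(XU)^\top(XU) = U^\top C U$. The synthetic data, the seed, and the name `C_Y` are assumptions made only for illustration.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 samples, 3 correlated features.
rng = np.random.default_rng(0)
mix = np.array([[3.0, 0.0, 0.0],
                [1.0, 1.0, 0.0],
                [0.0, 0.5, 0.2]])
X = rng.normal(size=(200, 3)) @ mix

# Step 1: center the rows.
X = X - X.mean(axis=0)
N = X.shape[0]

# Step 2: covariance matrix with the 1/N convention.
C = X.T @ X / N

# Step 3: orthonormal eigenbasis, reordered so eigenvalues are descending.
vals, vecs = np.linalg.eigh(C)
vals, vecs = vals[::-1], vecs[:, ::-1]
r = 2
U = vecs[:, :r]

# Step 4: project, then look at the covariance of the projected data.
Y = X @ U
C_Y = Y.T @ Y / N  # equals U^T C U, which should be diag(lambda_1, ..., lambda_r)
print(np.allclose(C_Y, np.diag(vals[:r])))
```

If the check prints `True`, the new coordinates carry the largest variances and no cross-correlation, which is the property the algorithm is after.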

## Pseudocode
Translate the algorithm into pseudocode.  
This helps you identify the steps that you do not yet know how to carry out.  

    1. X = X - X.mean(axis=0)
    2. N,d = X.shape
       C = (X.T.dot(X))/N
    3. vals,vecs = np.linalg.eigh(C)
       vals = vals[::-1]
       vecs = vecs[:,::-1]
       U = vecs[:,:r]
    4. return X@U
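
The two reversals in step 3 are needed because `np.linalg.eigh` returns eigenvalues in ascending order, with the matching eigenvectors as columns. A small sketch on a hand-picked $2\times 2$ matrix (an assumption chosen for illustration, with eigenvalues $1$ and $3$) shows the ordering and confirms that reversing both arrays keeps each eigenvector paired with its eigenvalue.

```python
import numpy as np

# A small symmetric matrix chosen for illustration; its eigenvalues are 1 and 3.
C = np.array([[2.0, 1.0],
              [1.0, 2.0]])

vals, vecs = np.linalg.eigh(C)
print(vals)  # ascending order

# Reverse both the eigenvalues and the eigenvector columns together.
vals_desc = vals[::-1]
vecs_desc = vecs[:, ::-1]

# Column j of vecs_desc should still satisfy C u_j = lambda_j u_j.
print(np.allclose(C @ vecs_desc, vecs_desc * vals_desc))
```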

## Code

In [None]:
def PCAscratch(X,r): # X is the data array of shape (N,d); r is the target dimension.
    X = X - X.mean(axis=0) # Shift X so that the rows of X are centered at the origin.
    N,d = X.shape # N is the number of samples, which is needed to compute C.
    C = (X.T.dot(X))/N # C is the covariance matrix.
    vals,vecs = np.linalg.eigh(C) # Get the eigenvalues (in ascending order) and eigenvectors of C.
    vals = vals[::-1] # Reorder the eigenvalues from largest to smallest.
    vecs = vecs[:,::-1] # Reorder the eigenvector columns to match.
    U = vecs[:,:r] # U keeps only the first r columns of vecs.
    return X@U # Return XU.

## Test
Take some sample data from [PCA-with-scikit-learn](PCA-with-scikit-learn.ipynb) and check whether your code produces output similar to that of the existing packages.

##### Name of the data
`hidden_text.csv`, loaded with `X = np.genfromtxt('hidden_text.csv', delimiter=',')`

In [None]:
X = np.genfromtxt('hidden_text.csv', delimiter=',')
# I also tried load_digits, but the differences were too difficult to check.

In [None]:
### results with your code
print(PCAscratch(X,2))

%matplotlib inline
plt.axis('equal')
plt.scatter(*PCAscratch(X,2).T)

In [None]:
### with scikit learn
from sklearn.decomposition import PCA
model = PCA(2)
X_new = model.fit_transform(X)
print(X_new)

%matplotlib inline
plt.axis('equal')
plt.scatter(*X_new.T)

##### Check: The two plots differ only by a vertical flip, so the results are very similar.
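
A flip like this corresponds to negating one column of the output, so the visual comparison can also be made numerically. The sketch below is only an illustration under stated assumptions: it uses synthetic data instead of `hidden_text.csv`, compares the eigenvector-based projection with an SVD-based one (a common alternative way to compute PCA), and relies on a hypothetical helper `same_up_to_column_signs` that is not part of any library.

```python
import numpy as np

def same_up_to_column_signs(A, B):
    """Hypothetical helper: True if every column of A equals the matching
    column of B or its negation (within np.allclose tolerance)."""
    return all(np.allclose(a, b) or np.allclose(a, -b)
               for a, b in zip(A.T, B.T))

# Synthetic data (an assumption): 100 samples, 4 features with distinct scales.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4)) * np.array([5.0, 3.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)

# Projection from the eigenvectors of the covariance matrix.
vals, vecs = np.linalg.eigh(Xc.T @ Xc / len(Xc))
Y_eig = Xc @ vecs[:, ::-1][:, :2]

# Projection from the right singular vectors of the centered data.
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
Y_svd = Xc @ Vt[:2].T

print(same_up_to_column_signs(Y_eig, Y_svd))
```

The same helper could be applied to `PCAscratch(X,2)` and `X_new` from the cells above, assuming the two leading eigenvalues are distinct so that each direction is determined up to sign.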

## Comparison

##### Exercise 1
The center of the rows of $X$ (before the shift) is supposed to be `model.mean_`.  
Check if this is true.

In [None]:
X = np.genfromtxt('hidden_text.csv', delimiter=',')

r = 2
mu = X.mean(axis=0) # Center of the rows of X, computed before the shift.
Xc = X - mu # Shift X so that its rows are centered at the origin.
N,d = Xc.shape # N is the number of samples, which is needed to compute C.
C = (Xc.T.dot(Xc))/N # C is the covariance matrix.
vals,vecs = np.linalg.eigh(C) # Get the eigenvalues (in ascending order) and eigenvectors of C.
vals = vals[::-1] # Reorder the eigenvalues from largest to smallest.
vecs = vecs[:,::-1] # Reorder the eigenvector columns to match.
U = vecs[:,:r] # U keeps only the first r columns of vecs.

from sklearn.decomposition import PCA
model = PCA(r)
X_new = model.fit_transform(X) # Fit on the unshifted X so that model.mean_ is the original center.

Ans = np.allclose(mu,model.mean_)
# Ans is True if the center of the rows of X (before the shift) and model.mean_ are the same (very close).
print(Ans)

##### Exercise 2
The matrix $U^\top$ is supposed to be `model.components_`.  
(Up to some negations.)  
Check if this is true.

In [None]:
X = np.genfromtxt('hidden_text.csv', delimiter=',')

r = 2
X = X - X.mean(axis=0) # Shift X so that the rows of X are centered at the origin.
N,d = X.shape # N is the number of samples, which is needed to compute C.
C = (X.T.dot(X))/N # C is the covariance matrix.
vals,vecs = np.linalg.eigh(C) # Get the eigenvalues (in ascending order) and eigenvectors of C.
vals = vals[::-1] # Reorder the eigenvalues from largest to smallest.
vecs = vecs[:,::-1] # Reorder the eigenvector columns to match.
U = vecs[:,:r] # U keeps only the first r columns of vecs.

from sklearn.decomposition import PCA
model = PCA(r)
X_new = model.fit_transform(X)

Ans = True # Assume Ans is True at first.
# Each principal component (vector) may point in either direction, as long as it spans the same line.
# So the comparison must also accept the case where the two vectors differ by a factor of -1.
for i in range(r) :
    if not (np.allclose(U.T[i],model.components_[i]) or np.allclose(-U.T[i],model.components_[i])) :
        Ans = False # If any principal component (a row of U.T) matches in neither case, U.T is not model.components_; set Ans to False and break out of the loop.
        break
# If Ans is still True after the loop (every principal component has been compared), we say they are the same (very close).
print(Ans)
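
The row-by-row loop can also be written without explicit sign handling: if each row of one matrix equals plus or minus the matching row of the other, and the rows are orthonormal, then the absolute values of their pairwise inner products form the identity matrix. The sketch below illustrates this on synthetic data (an assumption), with the right singular vectors of the centered data standing in for a second set of components; the names `W` and `G` are introduced here only for illustration.

```python
import numpy as np

# Synthetic data (an assumption): 80 samples, 4 features with distinct scales.
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 4)) * np.array([6.0, 3.0, 1.5, 0.5])
Xc = X - X.mean(axis=0)
r = 2

# First set of directions: top-r eigenvectors of the covariance matrix (columns of U).
vals, vecs = np.linalg.eigh(Xc.T @ Xc / Xc.shape[0])
U = vecs[:, ::-1][:, :r]

# Second set of directions: top-r right singular vectors (rows of W).
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:r]

# Entry (i, j) of G is |u_i . w_j|; it should be 1 on the diagonal and 0 elsewhere.
G = np.abs(U.T @ W.T)
print(np.allclose(G, np.eye(r)))
```

Replacing `W` with `model.components_` would give a one-line version of the check above, under the assumption that the leading eigenvalues are distinct.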

##### Exercise 3
The sequence $\lambda_1,\ldots,\lambda_r$ is supposed to be the values in `explained_variance_`.  
Check if this is true.

In [None]:
X = np.genfromtxt('hidden_text.csv', delimiter=',')

r = 2
X = X - X.mean(axis=0) # Shift X so that the rows of X are centered at the origin.
N,d = X.shape # N is the number of samples, which is needed to compute C.
C = (X.T.dot(X))/N # C is the covariance matrix.
vals,vecs = np.linalg.eigh(C) # Get the eigenvalues (in ascending order) and eigenvectors of C.
vals = vals[::-1] # Reorder the eigenvalues from largest to smallest.
vecs = vecs[:,::-1] # Reorder the eigenvector columns to match.
U = vecs[:,:r] # U keeps only the first r columns of vecs.

from sklearn.decomposition import PCA
model = PCA(r)
X_new = model.fit_transform(X)

print(np.allclose(vals[:r],model.explained_variance_))
# The errors are larger than model.explained_variance_*rtol (rtol = 1.e-5), so the two are not the same under rtol = 1.e-5.
print(model.explained_variance_) # Inspect the values of model.explained_variance_.
print(vals[:r]) # Inspect the values of vals[:r].
# These two arrays look very close, so try adjusting the rtol of np.allclose.
Ans = np.allclose(vals[:r],model.explained_variance_,rtol = 1.e-3)
print(Ans)
# Hence, under rtol = 1.e-3, we can say they are the same (very close).
# The remaining gap is expected: scikit-learn estimates the variance with N-1 degrees of freedom, while C divides by N.
print(np.allclose(vals[:r]*N/(N-1),model.explained_variance_))
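
A small gap of this kind is what one would expect when two covariance conventions are mixed: $C$ above divides by $N$, whereas the sample covariance divides by $N-1$, and the scikit-learn documentation states that `explained_variance_` uses `n_samples - 1` degrees of freedom. The sketch below reproduces the effect with NumPy alone, on synthetic data (an assumption), using `np.cov`, whose default normalization is $N-1$.

```python
import numpy as np

# Synthetic data (an assumption): 50 samples, 3 features with distinct scales.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3)) * np.array([4.0, 2.0, 1.0])
Xc = X - X.mean(axis=0)
N = Xc.shape[0]

# Eigenvalues under the 1/N convention used for C in this notebook.
vals_N = np.linalg.eigvalsh(Xc.T @ Xc / N)[::-1]

# Eigenvalues under the 1/(N-1) convention (np.cov default).
vals_N1 = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

# The raw values differ by a relative factor of 1/(N-1), about 2% here.
print(np.allclose(vals_N, vals_N1))

# After rescaling by N/(N-1) the two conventions agree.
print(np.allclose(vals_N * N / (N - 1), vals_N1))
```

With larger $N$ the factor $N/(N-1)$ approaches $1$, which is why a looser `rtol` can hide the difference.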

##### Exercise 4
Let $t = \operatorname{tr}(C)$.  
The sequence $\lambda_1/t,\ldots,\lambda_r/t$ is supposed to be the values in `explained_variance_ratio_`.  
Check if this is true.

In [None]:
X = np.genfromtxt('hidden_text.csv', delimiter=',')

r = 2
X = X - X.mean(axis=0) # Shift X so that the rows of X are centered at the origin.
N,d = X.shape # N is the number of samples, which is needed to compute C.
C = (X.T.dot(X))/N # C is the covariance matrix.
vals,vecs = np.linalg.eigh(C) # Get the eigenvalues (in ascending order) and eigenvectors of C.
vals = vals[::-1] # Reorder the eigenvalues from largest to smallest.
vecs = vecs[:,::-1] # Reorder the eigenvector columns to match.
U = vecs[:,:r] # U keeps only the first r columns of vecs.

from sklearn.decomposition import PCA
model = PCA(r)
X_new = model.fit_transform(X)

t = np.trace(C)
Ans = np.allclose(vals[:r]/t,model.explained_variance_ratio_)
# Ans is True if the sequence lambda_1/t,...,lambda_r/t and explained_variance_ratio_ are the same (very close).
print(Ans)
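
Two facts sit behind this check: the trace of $C$ equals the sum of all its eigenvalues, so the ratios $\lambda_i/t$ sum to $1$ over $i=1,\ldots,d$, and rescaling $C$ by any positive constant (such as $N/(N-1)$) leaves the ratios unchanged. The sketch below illustrates both on synthetic data; the data and the names `ratios`, `C1`, and `ratios1` are assumptions introduced for illustration.

```python
import numpy as np

# Synthetic data (an assumption): 60 samples, 3 features.
rng = np.random.default_rng(4)
X = rng.normal(size=(60, 3)) * np.array([3.0, 2.0, 1.0])
Xc = X - X.mean(axis=0)
N = Xc.shape[0]

C = Xc.T @ Xc / N
vals = np.linalg.eigvalsh(C)[::-1]
t = np.trace(C)

# The trace equals the sum of the eigenvalues, so the ratios sum to 1.
ratios = vals / t
print(np.isclose(t, vals.sum()))
print(np.isclose(ratios.sum(), 1.0))

# Rescaling C (for example to the N-1 convention) does not change the ratios.
C1 = C * N / (N - 1)
ratios1 = np.linalg.eigvalsh(C1)[::-1] / np.trace(C1)
print(np.allclose(ratios, ratios1))
```

This scale invariance is consistent with the ratio check passing at the default tolerance regardless of which normalization the covariance uses.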

##### Exercise 5
The singular values of the shifted `X` are supposed to be `model.singular_values_`.  
Check if this is true.

In [None]:
X = np.genfromtxt('hidden_text.csv', delimiter=',')

r = 2
X = X - X.mean(axis=0) # Shift X so that the rows of X are centered at the origin.
N,d = X.shape # N is the number of samples, which is needed to compute C.
C = (X.T.dot(X))/N # C is the covariance matrix.
vals,vecs = np.linalg.eigh(C) # Get the eigenvalues (in ascending order) and eigenvectors of C.
vals = vals[::-1] # Reorder the eigenvalues from largest to smallest.
vecs = vecs[:,::-1] # Reorder the eigenvector columns to match.
U = vecs[:,:r] # U keeps only the first r columns of vecs.

from sklearn.decomposition import PCA
model = PCA(r)
X_new = model.fit_transform(X)

s = np.linalg.svd(X, compute_uv=False) # s is the array of singular values (from largest to smallest).
Ans = np.allclose(s[:r],model.singular_values_)
# Ans is True if the first r singular values of the shifted X and model.singular_values_ are the same (very close).
print(Ans)
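
The singular values also tie back to the eigenvalues of $C$: writing the shifted $X$ as $X = P\Sigma V^\top$ gives $X^\top X = V\Sigma^2 V^\top$, so $\lambda_i = s_i^2/N$ under the $\frac{1}{N}$ convention used here. The sketch below checks this relation on synthetic data (an assumption made for illustration).

```python
import numpy as np

# Synthetic data (an assumption): 60 samples, 3 features.
rng = np.random.default_rng(5)
X = rng.normal(size=(60, 3)) * np.array([4.0, 2.0, 1.0])
Xc = X - X.mean(axis=0)
N = Xc.shape[0]

# Singular values of the centered data, largest first.
s = np.linalg.svd(Xc, compute_uv=False)

# Eigenvalues of C = X^T X / N, reordered to be largest first.
vals = np.linalg.eigvalsh(Xc.T @ Xc / N)[::-1]

# lambda_i should equal s_i^2 / N.
print(np.allclose(s**2 / N, vals))
```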