# PCA from scratch

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## Algorithm
**Input:**  
- `X`: an array of shape `(N,d)` whose rows are samples and columns are features
- `r`: target dimension

**Output:**
- an array of shape `(N, r)`  

**Steps:**
1. Shift $X$ so that the rows of $X$ are centered at the origin.  
2. Let $C = \frac{1}{N}X^\top X$.  
3. Suppose ${\bf u}_1,\ldots, {\bf u}_d$ form an orthonormal eigenbasis of $C$ corresponding to the eigenvalues $\lambda_1\geq\cdots\geq\lambda_d$.  Let $U$ be the matrix whose columns are ${\bf u}_1,\ldots, {\bf u}_r$.  
4. Return $XU$.
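
As a quick check on these steps, the columns of $XU$ are uncorrelated and have variances $\lambda_1,\ldots,\lambda_r$. This follows from $C{\bf u}_i = \lambda_i{\bf u}_i$ and the orthonormality of the ${\bf u}_i$, assuming $X$ has already been centered as in step 1:

$$
\frac{1}{N}(XU)^\top(XU) = U^\top\left(\frac{1}{N}X^\top X\right)U = U^\top C U = \operatorname{diag}(\lambda_1,\ldots,\lambda_r).
$$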

## Pseudocode
Translate the algorithm into pseudocode.  
This helps you identify the steps you do not yet know how to implement.  

    1. X = X - X.mean(axis=0)
    2. C = (X.T.dot(X))/N
    3. vals,vecs = np.linalg.eigh(C)
    4. vecs = vecs[:,::-1]
    5. U = vecs[:,:r]
    6. return X@U
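
Step 4 is needed because `np.linalg.eigh` returns eigenvalues in ascending order, while the algorithm wants $\lambda_1\geq\cdots\geq\lambda_d$. A minimal sketch of the reversal, using a small diagonal matrix chosen only for illustration (it is not part of the exercise data):

```python
import numpy as np

# Hypothetical 3x3 symmetric matrix, chosen only for illustration.
C = np.array([[2.0, 0.0, 0.0],
              [0.0, 5.0, 0.0],
              [0.0, 0.0, 1.0]])

vals, vecs = np.linalg.eigh(C)   # eigenvalues come back in ascending order
print(vals)

# Reverse both arrays so that column i of vecs_desc still pairs with vals_desc[i].
vals_desc = vals[::-1]
vecs_desc = vecs[:, ::-1]
print(vals_desc)

# Each reversed column is still an eigenvector for its reversed eigenvalue.
print(np.allclose(C @ vecs_desc, vecs_desc * vals_desc))
```

Reversing only `vals` or only `vecs` would break the pairing between eigenvalues and eigenvectors, so the two are flipped together.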

## Code

In [None]:
def my_pca(X,r=2):
    X = X - np.mean(X,axis=0)
    N = X.shape[0]
    C = (X.T.dot(X))/N
    vals,vecs = np.linalg.eigh(C)
    vals = vals[::-1]  
    vecs = vecs[:,::-1] 
    U = vecs[:,:r]  
    return X@U #XU
X = np.genfromtxt('hidden_text.csv', delimiter=',')
plt.scatter(my_pca(X)[:,0],my_pca(X)[:,1])
plt.axis('equal')

##### Jephian:
Only put the definition of the function here.  
You may test your function later.  

Also, it is inefficient to call `my_pca` twice, as in 
```python
plt.scatter(my_pca(X)[:,0],my_pca(X)[:,1])
```

## Test
Take some sample data from [PCA-with-scikit-learn](PCA-with-scikit-learn.ipynb) and check whether your code produces outputs similar to those of existing packages.

##### Name of the data
Description of the data.

##### Jephian:
Put your data and its description here.  
Here is an example:  

We will load the data from `hidden_text.csv`.  
It contains 1261 points in $\mathbb{R}^{100}$;  
however, these points lie on a two-dimensional plane in $\mathbb{R}^{100}$.

```python
X = np.genfromtxt('hidden_text.csv', delimiter=',')
```

In [None]:
### results with your code
r = 2
X = X - X.mean(axis=0)
N = X.shape[0]
C = (1/N)*(X.T)@X
vals,vecs = np.linalg.eigh(C)
vals = vals[::-1]
vecs = vecs[:,::-1]
U = vecs[:,:r]
## for this data set, the components differ by a negative sign
U[:,1]=-U[:,1]

plt.scatter(X@U[:,0],X@U[:,1])
plt.axis('equal')

##### Jephian:
Since you have the function already, use it!  

For example,  
```python
X_new = my_pca(X)
plt.axis('equal')
plt.scatter(*X_new.T)
```

No need to manually do `U[:,1]=-U[:,1]` .
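
The sign flip is harmless because a unit eigenvector is only determined up to sign: if ${\bf u}$ is an eigenvector of $C$, so is $-{\bf u}$, and different solvers may return either one. A small sketch on synthetic data (random points standing in for `hidden_text.csv`, which is an assumption of this example):

```python
import numpy as np

rng = np.random.default_rng(0)        # synthetic data, not hidden_text.csv
X = rng.normal(size=(50, 3))
X = X - X.mean(axis=0)
C = X.T @ X / X.shape[0]

vals, vecs = np.linalg.eigh(C)
u = vecs[:, -1]                       # eigenvector of the largest eigenvalue
lam = vals[-1]

# Both u and -u satisfy C u = lam u, so either may be returned by a solver.
print(np.allclose(C @ u, lam * u))
print(np.allclose(C @ (-u), lam * (-u)))

# Projecting onto -u only mirrors the coordinates along that axis.
print(np.allclose(X @ (-u), -(X @ u)))
```

The scatter plot is therefore the same up to a reflection, which is why the comparison in Exercise 2 is stated "up to some negations".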

In [None]:
### results with existing packages
from sklearn.decomposition import PCA
model = PCA(n_components=2)
X_new = model.fit_transform(X)
plt.scatter(X_new[:,0],X_new[:,1])
plt.axis('equal')

## Comparison

##### Exercise 1
The center of rows of $X$ (before shift) is supposed to be `model.mean_` .  
Check if this is true.

In [None]:
### your answer here
print((model.mean_==X.mean(axis=0)).all())
#true

##### Jephian:
Make this cell independent of the other cells.  

```python
X = np.genfromtxt('hidden_text.csv', delimiter=',')

from sklearn.decomposition import PCA
model = PCA(n_components=2)
X_new = model.fit_transform(X)

print((model.mean_==X.mean(axis=0)).all())
```

##### Exercise 2
The matrix $U^\top$ is supposed to be `model.components_` .  
(Up to some negations.)  
Check if this is true.

In [None]:
### your answer here
print(U.T.shape)
print(model.components_.shape)
print(np.allclose(U.T,model.components_))
#true

##### Jephian:
Again, make this cell independent.  

```python
X = np.genfromtxt('hidden_text.csv', delimiter=',')

r = 2
N = X.shape[0]
X_shift = X - X.mean(axis=0)
C = (1/N)*(X_shift.T)@X_shift
vals,vecs = np.linalg.eigh(C)
vals = vals[::-1]
vecs = vecs[:,::-1]
U = vecs[:,:r]

from sklearn.decomposition import PCA
model = PCA(n_components=2)
X_new = model.fit_transform(X)

np.set_printoptions(precision=2, suppress=True)
for i in range(model.components_.shape[0]):
    u,c = U.T[i],model.components_[i]
    if np.isclose(u,c).all() or np.isclose(u,-c).all():
        print(True)
```

##### Exercise 3
The values $\lambda_1,\ldots,\lambda_r$ are supposed to be the values in `explained_variance_` .  
Check if this is true.

In [None]:
### your answer here
print([vals[0],vals[1]])
print(model.explained_variance_)
print(np.allclose([vals[0],vals[1]],model.explained_variance_))
#false
print(np.allclose([vals[0],vals[1]],model.explained_variance_,atol=20))
#true

##### Jephian:

```python
X = np.genfromtxt('hidden_text.csv', delimiter=',')

r = 2
N = X.shape[0]
X_shift = X - X.mean(axis=0)
C = (1/N)*(X_shift.T)@X_shift
vals,vecs = np.linalg.eigh(C)
vals = vals[::-1]
vecs = vecs[:,::-1]
U = vecs[:,:r]

from sklearn.decomposition import PCA
model = PCA(n_components=2)
X_new = model.fit_transform(X)

print(np.isclose(vals[:2],model.explained_variance_,atol=20).all())
```
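
The need for `atol=20` is most likely a normalization difference rather than numerical noise: scikit-learn documents `explained_variance_` as using `n_samples - 1` degrees of freedom, while $C$ here divides by $N$. If that is the cause, `vals[:2] * N / (N - 1)` should match `model.explained_variance_` under the default tolerance. The sketch below illustrates the effect on synthetic data, with `np.cov` (which divides by $N-1$ by default) standing in for the scikit-learn side; both the data and the stand-in are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)        # synthetic stand-in for the real data
X = rng.normal(size=(200, 5))
N = X.shape[0]
X_shift = X - X.mean(axis=0)

C = X_shift.T @ X_shift / N           # the 1/N convention used in this notebook
vals = np.linalg.eigvalsh(C)[::-1]    # eigenvalues in descending order

C_unbiased = np.cov(X, rowvar=False)  # np.cov divides by N - 1 by default
vals_unbiased = np.linalg.eigvalsh(C_unbiased)[::-1]

# Without rescaling, the two conventions disagree at the default tolerance.
print(np.allclose(vals, vals_unbiased))

# Rescaling by N / (N - 1) removes the gap without a loose atol.
print(np.allclose(vals * N / (N - 1), vals_unbiased))
```

A tolerance as large as `atol=20` would also accept many wrong answers, so rescaling is the safer check.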

##### Exercise 4
Let $t = \operatorname{tr}(C)$.  
The values $\lambda_1/t,\ldots,\lambda_r/t$ are supposed to be the values in `explained_variance_ratio_` .  
Check if this is true.

In [None]:
### your answer here
t = np.trace(C)
print([vals[0]/t,vals[1]/t])
print(model.explained_variance_ratio_)
print(np.allclose([vals[0]/t,vals[1]/t],model.explained_variance_ratio_))
#true

##### Jephian:

```python
X = np.genfromtxt('hidden_text.csv', delimiter=',')

r = 2
N = X.shape[0]
X_shift = X - X.mean(axis=0)
C = (1/N)*(X_shift.T)@X_shift
vals,vecs = np.linalg.eigh(C)
vals = vals[::-1]
vecs = vecs[:,::-1]
U = vecs[:,:r]

from sklearn.decomposition import PCA
model = PCA(n_components=2)
X_new = model.fit_transform(X)

t = np.trace(C)
print(np.isclose(vals[:2] / t, model.explained_variance_ratio_).all())
```
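
This check passes at the default tolerance, unlike Exercise 3, because the ratio does not depend on how $C$ is normalized: $t = \operatorname{tr}(C)$ equals the sum of all eigenvalues, so any constant factor in front of $C$ cancels in $\lambda_i/t$. A short sketch on synthetic data (an assumption of this example, not the exercise data):

```python
import numpy as np

rng = np.random.default_rng(0)        # synthetic stand-in for the real data
X = rng.normal(size=(200, 5))
N = X.shape[0]
X_shift = X - X.mean(axis=0)

C = X_shift.T @ X_shift / N
vals = np.linalg.eigvalsh(C)[::-1]
t = np.trace(C)

# The trace equals the sum of the eigenvalues, so the ratios sum to 1.
print(np.isclose(t, vals.sum()))
print(np.isclose((vals / t).sum(), 1.0))

# Dividing by N - 1 instead of N changes the eigenvalues but not the ratios.
C2 = X_shift.T @ X_shift / (N - 1)
vals2 = np.linalg.eigvalsh(C2)[::-1]
print(np.allclose(vals / t, vals2 / np.trace(C2)))
```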

##### Exercise 5
The singular values of the shifted `X` are supposed to be `model.singular_values_` .  
Check if this is true.

In [None]:
### your answer here
from scipy.linalg import svd
u, s, VT = svd(X)
print(s[:2])
print(model.singular_values_)
print(np.allclose(s[:2],model.singular_values_))

##### Jephian:

```python
X = np.genfromtxt('hidden_text.csv', delimiter=',')

r = 2
N = X.shape[0]
X_shift = X - X.mean(axis=0)
u, s, vh = np.linalg.svd(X_shift)

from sklearn.decomposition import PCA
model = PCA(n_components=2)
X_new = model.fit_transform(X)

print(np.allclose(s[:2],model.singular_values_))
```
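
The singular values and the eigenvalues from the earlier exercises are two views of the same quantity: if $s_i$ is a singular value of the shifted $X$, then $s_i^2$ is an eigenvalue of $X^\top X$, so $s_i^2/N = \lambda_i$ under the $\frac{1}{N}$ convention used here. A minimal sketch on synthetic data (an assumption of this example):

```python
import numpy as np

rng = np.random.default_rng(0)        # synthetic stand-in for the real data
X = rng.normal(size=(200, 5))
N = X.shape[0]
X_shift = X - X.mean(axis=0)

C = X_shift.T @ X_shift / N
vals = np.linalg.eigvalsh(C)[::-1]    # eigenvalues in descending order

# compute_uv=False returns only the singular values, already in descending order.
s = np.linalg.svd(X_shift, compute_uv=False)

print(np.allclose(s**2 / N, vals))
```

Because the singular values involve no division by $N$ or $N-1$, this comparison does not depend on the normalization convention.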