# PCA with scikit-learn

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from mpl_toolkits import mplot3d

## Code
```python
from sklearn.decomposition import PCA
model = PCA(<parameters>)
X_new = model.fit_transform(X)
```

[Official Reference](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)

## Parameters
- `n_components`: target dimension

## Attributes
- `n_samples_`: height of `X`
- `n_features_in_`: width of `X` (named `n_features_` in older scikit-learn releases)
- `n_components_`: target dimension
- `components_`: `n_components` rows of principal components
- `mean_`: `X.mean(axis=0)`
- `explained_variance_`: importance of each component (variance of the data along it)
- `explained_variance_ratio_`: importance of each component as a ratio of the total variance
- `singular_values_`: singular values of shifted `X`  
(`singular_values_**2 / (n_samples_ - 1) == explained_variance_`)  
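
As a quick numerical check of the attributes listed above, the following sketch fits PCA on a small random matrix. The data, the seed, and the variable names are illustrative assumptions and not part of the exercises. Note that scikit-learn normalizes the variance by `n_samples - 1`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative data: 50 samples with 4 features (arbitrary choice).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))

model = PCA(n_components=2)
X_new = model.fit_transform(X)

n_samples = X.shape[0]
print(X_new.shape)              # (n_samples, n_components)
print(model.components_.shape)  # (n_components, n_features)

# mean_ is the column mean of X.
mean_ok = np.allclose(model.mean_, X.mean(axis=0))
# explained_variance_ is the squared singular value over (n_samples - 1).
var_ok = np.allclose(model.singular_values_**2 / (n_samples - 1),
                     model.explained_variance_)
print(mean_ok, var_ok)
```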

## Sample data

##### Exercise 1
Let  
```python
mu = np.array([3,4])
cov = np.array([[1.1,1],
                [1,1.1]])
X = np.random.multivariate_normal(mu, cov, 100)
```
Let `X_new` be the result of PCA on `X` .

In [None]:
mu = np.array([3,4])
cov = np.array([[1.1,1],
                [1,1.1]])
X = np.random.multivariate_normal(mu, cov, 100)

model = PCA(2)
X_new = model.fit_transform(X)

###### 1(a)
Plot points (rows) in `X` .  
Plot points (rows) in `X_new` .

In [None]:
### your answer here
fig, ax = plt.subplots()
ax.scatter(X[:,0], X[:,1], c = 'r')
ax.scatter(X_new[:,0], X_new[:,1], c = 'b')
#the red points are before PCA
#the blue points are after PCA

###### 1(b)
Adding on top of the previous figure, draw vectors for the rows in `model.components_` with the tails at `model.mean_` .

In [None]:
### your answer here
ax.arrow(*model.mean_, *(model.components_[0]), width = 0.05, length_includes_head = True)
ax.arrow(*model.mean_, *(model.components_[1]), width = 0.05, length_includes_head = True)
fig

##### Jephian:
Adding `plt.axis('equal')` is recommended so that you can see the principal components are mutually orthogonal.

###### 1(c)
Print `model.explained_variance_ratio_` .  
How important is the first component, as a percentage?

In [None]:
### your answer here
print("importance of first component : %.4f%%" % (model.explained_variance_ratio_[0]*100))
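
For comparison, the population value can be computed from `cov` itself, since its eigenvalues are the variances along the principal directions. This is only a sketch: the ratio printed from a random sample of 100 points should land near this value, but it varies from run to run.

```python
import numpy as np

cov = np.array([[1.1, 1.0],
                [1.0, 1.1]])

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
eigenvalues = np.linalg.eigvalsh(cov)
# Reverse to put the largest first, then normalize by the total variance.
ratio = eigenvalues[::-1] / eigenvalues.sum()

print(eigenvalues)  # approximately [0.1, 2.1]
print(ratio)        # approximately [0.9545, 0.0455]
```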

##### Exercise 2
Let  
```python
X = np.genfromtxt('hidden_text.csv', delimiter=',')
```
All points in this data set lie in a two-dimensional plane embedded in a much higher-dimensional space.  

###### 2(a)
Can you find out what this data says?

In [None]:
### your answer here
X = np.genfromtxt('hidden_text.csv', delimiter=',')

model = PCA(n_components=2)
X_new = model.fit_transform(X)


plt.axis("equal")
plt.scatter(X_new[:, 0], X_new[:, 1])

In [None]:
'NSYSU'

###### 2(b)
How much of the variance do the first two components explain, as a ratio?

In [None]:
### your answer here
print("Ans :\nimportance of component1 %.4f%%\nimportance of component2 %.4f%%" % (model.explained_variance_ratio_[0]*100, model.explained_variance_ratio_[1]*100))

##### Jephian:
To make the code readable, you may do  
```python
print("""Ans :
importance of component1 %.4f%%
importance of component2 %.4f%%"""
      % (model.explained_variance_ratio_[0]*100,
         model.explained_variance_ratio_[1]*100))
```
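
To see why the first two ratios should account for essentially everything in this exercise, here is a small sketch on synthetic data. Since `hidden_text.csv` is not reproduced here, `X_demo` is a hypothetical stand-in: random 2D points pushed into 10 dimensions by a random linear map. All names and sizes are assumptions made for illustration. The printing uses an f-string as one more readable alternative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-in for hidden_text.csv:
# 200 points in 2D, mapped into 10 dimensions by a random linear map.
points_2d = rng.normal(size=(200, 2))
embedding = rng.normal(size=(2, 10))
X_demo = points_2d @ embedding + rng.normal(size=10)  # shift off the origin

# Ask for one component more than the dimension of the plane.
model_demo = PCA(n_components=3).fit(X_demo)
ratios = model_demo.explained_variance_ratio_

for i, r in enumerate(ratios, start=1):
    print(f"importance of component{i}: {r * 100:.4f}%")
```

The third ratio comes out numerically zero, because the data has no variance outside its plane.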

##### Exercise 3
Let  
```python
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data
y = digits.target
```

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data
y = digits.target

###### 3(a)
Let `X_new` be the result of applying PCA to `X` with 30 components.  
Use the first two columns as the $x$,$y$-coordinates and plot the points with `c=y` .

In [None]:
### your answer here
model = PCA(n_components=30)
X_new = model.fit_transform(X)
plt.scatter(X_new[:,0], X_new[:,1], c = y)

###### 3(b)
Use the first three columns as the $x$,$y$,$z$-coordinates and plot the points with `c=y` .

In [None]:
### your answer here
%matplotlib notebook
ax = plt.axes(projection='3d')
ax.scatter(X_new[:,0], X_new[:,1], X_new[:,2], c = y)

###### 3(c)
Use `plt.plot` to plot `model.explained_variance_` .  
What is an appropriate choice of the target dimension?

In [None]:
### your answer here
%matplotlib inline
n = len(model.explained_variance_)
plt.plot([i for i in range(n)], model.explained_variance_)
#plot explained_variance_ as the points
#(0, explained_variance_[0]), (1, explained_variance_[1]), ..., (29, explained_variance_[29])

##### Jephian:
In fact, the following two lines produce the same plot.
```python
plt.plot([i for i in range(n)], model.explained_variance_)
plt.plot(model.explained_variance_)
```

In [None]:
#plot the cumulative fraction of explained_variance_ (relative to the 30 components kept)
cumu = model.explained_variance_
cumu = np.cumsum(cumu)/np.sum(cumu)
plt.plot([i for i in range(n)], cumu)
#the tail of the plot is nearly flat, so the later components add little
#if we only need 80% of the information, a target dimension of about 10 is enough

##### Jephian:
Nice!
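
The same reasoning can be done without reading the plot. The sketch below assumes an 80% threshold (an arbitrary choice) and measures the ratios against the total variance of all 64 features, so its answer may differ from one read off the 30-component plot. It then compares the result with passing a float `n_components` to `PCA`, which asks scikit-learn to pick the number of components for that fraction of variance.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data

# Fit with all components so the ratios are relative to the total variance.
full = PCA(svd_solver="full").fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 80%.
k = int(np.argmax(cumulative > 0.8)) + 1
print(k, cumulative[k - 1])

# A float n_components in (0, 1) lets scikit-learn make the same choice.
auto = PCA(n_components=0.8, svd_solver="full").fit(X)
print(auto.n_components_)
```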

## Experiments

##### Exercise 4
Let  
```python
mu = np.array([3,4])
cov = np.array([[1.1,1],
                [1,1.1]])
X = np.random.multivariate_normal(mu, cov, 100)

model = PCA(2)
X_new = model.fit_transform(X)
```

In [None]:
mu = np.array([3,4])
cov = np.array([[1.1,1],
                [1,1.1]])
X = np.random.multivariate_normal(mu, cov, 100)

model = PCA(2)
X_new = model.fit_transform(X)

###### 4(a)
Let  
```python
X_shifted = X - model.mean_
```
Plot the points (rows) of `X` .  
Plot the points (rows) of `X_shifted` .

In [None]:
### your answer here
%matplotlib inline
X_shifted = X - model.mean_

plt.scatter(X[:,0], X[:,1], c = 'r')
plt.scatter(X_shifted[:,0], X_shifted[:,1], c = 'b')

###### 4(b)
Check that the rows of `model.components_` are mutually orthogonal and of length 1.

In [None]:
### your answer here
leng = np.linalg.norm(model.components_, axis = 1)
print("length of row1 :", leng[0])
print("length of row2 :", leng[1])

print("\n<row1, row2> =", np.dot(model.components_[0], model.components_[1]))
#The inner product of row1 and row2 is 0 (up to floating-point error), so they are orthogonal
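
Both conditions can also be checked at once: the rows of a matrix `U` are orthonormal exactly when `U @ U.T` is the identity matrix. The following is a minimal sketch that regenerates data like that of Exercise 4 with a fixed seed (the seed is an assumption added for reproducibility).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.multivariate_normal([3, 4], [[1.1, 1], [1, 1.1]], size=100)
model = PCA(2).fit(X)

# Entry (i, j) of the Gram matrix is the inner product of rows i and j.
gram = model.components_ @ model.components_.T
# allclose tolerates the tiny floating-point deviations from the identity.
is_orthonormal = np.allclose(gram, np.eye(2))
print(is_orthonormal)
```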

###### 4(c)
Let  
```python
X_proj = X_shifted.dot(model.components_.T)
```
Plot the points (rows) of `X_shifted` .
Plot the points (rows) of `X_proj` .

In [None]:
### your answer here
X_proj = X_shifted.dot(model.components_.T)

plt.subplot(121)
plt.scatter(X_shifted[:,0], X_shifted[:,1], c = 'r')
plt.scatter(X_proj[:,0], X_proj[:,1], c = 'b')
plt.title("shifted & proj")
plt.subplot(122)
plt.scatter(X_shifted[:,0], X_shifted[:,1], c = 'r')
plt.scatter(X_new[:,0], X_new[:,1], c = 'g')
plt.title("shifted & new")

#the red points are X_shifted
#the blue points are X_proj, and they coincide with the points after PCA(dim=2) (i.e. X_new)
#the green points are X_new

##### Jephian:
Nice.  
You may consider doing `plt.figure(figsize=(6,3))` if necessary.
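
The two panels look identical because `X_proj` and `X_new` are the same array up to rounding. A numerical comparison makes this explicit. The sketch below regenerates data like that of Exercise 4 with a fixed seed, which is an assumption added for reproducibility.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.multivariate_normal([3, 4], [[1.1, 1], [1, 1.1]], size=100)

model = PCA(2)
X_new = model.fit_transform(X)

# Center the data, then take coordinates along the principal components.
X_shifted = X - model.mean_
X_proj = X_shifted @ model.components_.T

same = np.allclose(X_proj, X_new)
print(same)
```

In other words, `fit_transform` amounts to shifting by `mean_` and then multiplying by `components_.T`.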

###### 4(d)
Suppose `model.mean_` and `model.components_` are given.  
Can you find a way to obtain `X` from `X_new` ?

In [None]:
### your answer here
#let U is the matrix model.components_
#since X_new = (X - mean)⋅U
#so X = X_new⋅U^-1 + mean
inv = np.linalg.inv(model.components_.T)
recover_X = X_new.dot(inv)+ model.mean_

plt.subplot(121)
plt.scatter(X[:,0], X[:,1], c = 'r')
plt.title("original X")
plt.subplot(122)
plt.scatter(recover_X[:,0], recover_X[:,1], c = 'b')
plt.title("recover_X")

##### Jephian:
The second line  

    #let U is the matrix model.components_

should be 

    #let U is the matrix model.components_.T

##### Jephian:
Well done!
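
Because the rows of `model.components_` are orthonormal, the inverse of `model.components_.T` is `model.components_` itself, so no explicit matrix inversion is needed. The sketch below shows this shortcut next to `model.inverse_transform`, which scikit-learn provides for the same purpose. It regenerates data like that of Exercise 4 with a fixed seed (an assumption added for reproducibility), and exact recovery relies on keeping all components, as `PCA(2)` does for two-dimensional data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.multivariate_normal([3, 4], [[1.1, 1], [1, 1.1]], size=100)

model = PCA(2)
X_new = model.fit_transform(X)

# Undo the rotation with the transpose, then undo the shift.
recovered = X_new @ model.components_ + model.mean_
# The built-in method performs the same reconstruction.
builtin = model.inverse_transform(X_new)

print(np.allclose(recovered, X), np.allclose(builtin, X))
```

With fewer components than features, the same formula returns the projection of `X` onto the principal subspace rather than `X` itself.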