# PCA with scikit-learn

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## Code
```python
from sklearn.decomposition import PCA  # import the model
model = PCA(<parameters>)  # <parameters>: placeholder for settings such as n_components
X_new = model.fit_transform(X)  # fit the model on X and return the transformed data
```

[Official Reference](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)

## Parameters
- `n_components`: target dimension
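
As a small illustration of how `n_components` can be set, the sketch below fits PCA twice on made-up data: once with an integer and once with a fraction between 0 and 1. The array `X_demo` and the names `model_int` and `model_frac` are assumptions for this example only. With a fraction (and `svd_solver='full'`), scikit-learn keeps the fewest components whose cumulative explained-variance ratio exceeds that fraction.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples in 5 dimensions, with a different spread per axis.
X_demo = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.1, 0.1])

# Integer: keep exactly this many components.
model_int = PCA(n_components=2).fit(X_demo)

# Fraction in (0, 1): keep enough components to explain more than 95% of the variance.
model_frac = PCA(n_components=0.95, svd_solver='full').fit(X_demo)

print(model_int.n_components_)
print(model_frac.n_components_, model_frac.explained_variance_ratio_.sum())
```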

## Attributes
```python
n_samples_: height of `X` ## number of rows
n_features_in_: width of `X` ## number of columns
n_components_: target dimension ## r
components_: `n_components` rows of principal components ## r * d
mean_: `X.mean(axis=0)`
explained_variance_: variance of the data along each component
explained_variance_ratio_: fraction of the total variance explained by each component
singular_values_: singular values of shifted `X`
(`singular_values_**2 / (n_samples_ - 1) == explained_variance_`)
```
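
The relations between these attributes can be checked numerically. The following is a minimal sketch on made-up data (`X_demo` and `model_demo` are names assumed for this example). It relies on scikit-learn computing `explained_variance_` with the sample-variance convention, that is, dividing by `n_samples - 1`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))  # hypothetical data: 50 samples, 3 features
model_demo = PCA().fit(X_demo)     # keep all components
n_samples = X_demo.shape[0]

# mean_ is the column-wise mean of the data
mean_ok = np.allclose(model_demo.mean_, X_demo.mean(axis=0))

# explained_variance_ comes from the singular values of the shifted data
var_ok = np.allclose(model_demo.singular_values_**2 / (n_samples - 1),
                     model_demo.explained_variance_)

# with all components kept, the ratios are the variances divided by their sum
ratio_ok = np.allclose(
    model_demo.explained_variance_ratio_,
    model_demo.explained_variance_ / model_demo.explained_variance_.sum())

print(mean_ok, var_ok, ratio_ok)
print(model_demo.components_.shape)  # (n_components, n_features)
```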

## Sample data

##### Exercise 1
Let  
```python
mu = np.array([3,4])
cov = np.array([[1.1,1],
                [1,1.1]])
X = np.random.multivariate_normal(mu, cov, 100)
```
Let `X_new` be the result of PCA on `X` .

In [None]:
mu = np.array([3,4])
cov = np.array([[1.1,1],
                [1,1.1]])
X = np.random.multivariate_normal(mu, cov, 100)

In [None]:
X.shape

In [None]:
from sklearn.decomposition import PCA
model = PCA()
X_new = model.fit_transform(X)

In [None]:
X_new.shape

###### 1(a)
Plot points (rows) in `X` .  
Plot points (rows) in `X_new` .

In [None]:
%matplotlib inline
plt.axis('equal')
plt.scatter(*X.T,color='black') ## X
plt.scatter(*X_new.T,color='blue') ## X_new
## X_new is X shifted so that it is centered at the origin, then rotated onto the principal axes

###### 1(b)
Adding on top of the previous figure, draw vectors for the rows in `model.components_` with the tails at `model.mean_` .

In [None]:
model.components_

In [None]:
%matplotlib inline
plt.axis('equal')
plt.scatter(*X.T,color='black') ## X
plt.scatter(*X_new.T,color='blue') ## X_new
plt.arrow(*model.mean_,*model.components_[0],color='red',head_width=0.2)
plt.arrow(*model.mean_,*model.components_[1],color='red',head_width=0.2)
## The two principal components are perpendicular (mutually orthogonal).

###### 1(c)
Print `model.explained_variance_ratio_` .  
How important is the first component in percentage?

In [None]:
model.explained_variance_ratio_

In [None]:
print('The first component explains', model.explained_variance_ratio_[0]*100, '% of the variance')

##### Exercise 2
Let  
```python
X = np.genfromtxt('hidden_text.csv', delimiter=',')
```
All points in this data set lie in a two-dimensional plane embedded in a much higher-dimensional space.  

In [None]:
X = np.genfromtxt('hidden_text.csv', delimiter=',')
X.shape ## 1261 points in a high dimension (100-dim): 1261*100

In [None]:
from sklearn.decomposition import PCA
model = PCA(2)
X_new = model.fit_transform(X)
X_new.shape ## 1261 points in 2 dimensions: 1261*2

###### 2(a)
Can you find out what this data says?

In [None]:
%matplotlib inline
plt.axis('equal')
plt.scatter(*X_new.T,color='blue')
print('The data spells out NSYSU.')

###### 2(b)
How much of the variance do the first two components explain, as a ratio?

In [None]:
model.explained_variance_ratio_

In [None]:
print('The first component explains', model.explained_variance_ratio_[0]*100, '% of the variance')
print('The second component explains', model.explained_variance_ratio_[1]*100, '% of the variance')

##### Exercise 3
Let  
```python
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data
y = digits.target
```

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data
y = digits.target

In [None]:
X.shape ## 1797*64

###### 3(a)
Let `X_new` be the result of applying PCA to `X` with 30 components.  
Use the first two columns as the $x$,$y$-coordinates and plot the points with `c=y` .

In [None]:
from sklearn.decomposition import PCA
model = PCA(30)
X_new = model.fit_transform(X)
X_new.shape

In [None]:
## use the first two columns
%matplotlib inline
plt.axis('equal')
plt.scatter(X_new[:,0],X_new[:,1],c=y)
## plt.scatter(X_new.T[0],X_new.T[1],c=y)

###### 3(b)

Use the first three columns as the $x$,$y$,$z$-coordinates and plot the points with `c=y` .

In [None]:
%matplotlib notebook
ax = plt.axes(projection='3d')
ax.scatter(X_new[:,0],X_new[:,1],X_new[:,2],c=y)
## ax.scatter(X_new.T[0],X_new.T[1],X_new.T[2],c=y)

###### 3(c)
Use `plt.plot` to plot `model.explained_variance_` .  
What is an appropriate choice of the target dimension?

In [None]:
%matplotlib inline
plt.plot(model.explained_variance_)

In [None]:
%matplotlib inline
c = model.explained_variance_
c = np.cumsum(c)/np.sum(c)
## np.cumsum(arr, axis=None, dtype=None, out=None) returns the cumulative sum of arr
plt.plot(c)
## The appropriate target dimension depends on how much of the variance we want to keep.

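
One way to turn the cumulative curve into a concrete target dimension is to pick the smallest number of components that reaches a chosen fraction of the total variance. The sketch below is only an illustration: the threshold of 0.90 and the names `model_full`, `cum_ratio`, and `k` are assumptions, not part of the exercise. It fits all 64 components so that the ratios are relative to the total variance, whereas the cell above normalizes by the variance of the 30 kept components.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits = load_digits().data
model_full = PCA().fit(X_digits)  # keep all components

# cumulative fraction of the total variance explained by the first k components
cum_ratio = np.cumsum(model_full.explained_variance_ratio_)

threshold = 0.90  # hypothetical choice; adjust to the accuracy you want
# first index whose cumulative ratio reaches the threshold, converted to a count
k = int(np.searchsorted(cum_ratio, threshold) + 1)

print(k, cum_ratio[k - 1])
```

The resulting `k` could then be passed as `n_components` when refitting the model.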
## Experiments

##### Exercise 4
Let  
```python
mu = np.array([3,4])
cov = np.array([[1.1,1],
                [1,1.1]])
X = np.random.multivariate_normal(mu, cov, 100)

model = PCA(2)
X_new = model.fit_transform(X)
```

In [None]:
mu = np.array([3,4])
cov = np.array([[1.1,1],
                [1,1.1]])
X = np.random.multivariate_normal(mu, cov, 100)

model = PCA(2)
X_new = model.fit_transform(X)

In [None]:
X.shape

In [None]:
X_new.shape

###### 4(a)
Let  
```python
X_shifted = X - model.mean_
```
Plot the points (rows) of `X` .  
Plot the points (rows) of `X_shifted` .

In [None]:
X_shifted = X - model.mean_
%matplotlib inline
plt.axis('equal')
plt.scatter(*X.T,color='black') ## X
plt.scatter(*X_shifted.T,color='blue') ## X_shifted: the data shifted so that its mean is zero

###### 4(b)
Check that the rows of `model.components_` are mutually orthogonal and of length 1.

In [None]:
model.components_

In [None]:
model.components_.dot(model.components_.T)

In [None]:
np.isclose(model.components_.dot(model.components_.T),np.eye(2)).all()
## The rows of model.components_ are mutually orthogonal and of length 1.

###### 4(c)
Let  
```python
X_proj = X_shifted.dot(model.components_.T)
```
Plot the points (rows) of `X_shifted` .  
Plot the points (rows) of `X_proj` .

In [None]:
X_proj = X_shifted.dot(model.components_.T)

In [None]:
%matplotlib inline
plt.axis('equal')
plt.scatter(*X_shifted.T,color='black') # X_shifted
plt.scatter(*X_proj.T,color='green') # X_proj

In [None]:
%matplotlib inline
plt.axis('equal')
plt.scatter(*X_shifted.T,color='black') # X_shifted
plt.scatter(*X_new.T,color='blue') # X_new
## X_new (blue) coincides with X_proj (green)

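
The same projection can be reproduced without scikit-learn by taking the singular value decomposition of the shifted data. This is a sketch under assumptions: the names `X_demo`, `X_centered`, `X_svd`, and `X_sk` are introduced here, and the comparison uses absolute values because the sign of each principal component is arbitrary and may differ between the two computations.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
mu = np.array([3, 4])
cov = np.array([[1.1, 1],
                [1, 1.1]])
X_demo = rng.multivariate_normal(mu, cov, 100)

# shift the data so that every column has mean zero
X_centered = X_demo - X_demo.mean(axis=0)

# rows of Vt are the principal directions, s holds the singular values
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# project the shifted data onto the principal directions
X_svd = X_centered @ Vt.T

# compare with scikit-learn, up to the sign of each column
X_sk = PCA(2).fit_transform(X_demo)
same_up_to_sign = np.allclose(np.abs(X_svd), np.abs(X_sk))
print(same_up_to_sign)
```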
###### 4(d)
Suppose `model.mean_` and `model.components_` are given.  
Can you find a way to obtain `X` from `X_new` ?

In [None]:
model.mean_

In [None]:
model.components_

In [None]:
inv = np.linalg.inv(model.components_.T)
X_recover = X_new.dot(inv) + model.mean_
np.isclose(X_recover,X).all()

In [None]:
%matplotlib inline
plt.scatter(*X_recover.T,color='blue',s=50)
plt.scatter(*X.T,color='black',s=10)

##### Alex:
In fact, `np.linalg.inv(model.components_.T)` is the same as `model.components_`, because the rows of `model.components_` are orthonormal (as checked in 4(b)).
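
This observation can be checked directly, and it means the recovery needs no matrix inverse. The sketch below uses made-up names (`X_demo`, `model_demo`, `C`, `X_back`, `model_1`, `X_approx`) and also compares against `inverse_transform`, the method scikit-learn provides for mapping transformed data back. The last part, keeping only one component, is an added illustration showing that the recovery becomes an approximation once components are dropped.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.multivariate_normal([3, 4], [[1.1, 1], [1, 1.1]], 100)

model_demo = PCA(2)
X_new_demo = model_demo.fit_transform(X_demo)
C = model_demo.components_

# rows of C are orthonormal, so C @ C.T is the identity and inv(C.T) equals C
inv_is_C = np.allclose(np.linalg.inv(C.T), C)

# recover X without computing an inverse
X_back = X_new_demo @ C + model_demo.mean_
manual_ok = np.allclose(X_back, X_demo)

# the built-in method performs the same recovery
builtin_ok = np.allclose(model_demo.inverse_transform(X_new_demo), X_demo)
print(inv_is_C, manual_ok, builtin_ok)

# with only one component kept, the recovery is an approximation of X
model_1 = PCA(1)
X_approx = model_1.inverse_transform(model_1.fit_transform(X_demo))
print(np.linalg.norm(X_demo - X_approx))  # reconstruction error
```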