# PCA with scikit-learn

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA

## Code
```python
from sklearn.decomposition import PCA
model = PCA(<parameters>)
X_new = model.fit_transform(X)
```

[Official Reference](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)

## Parameters
- `n_components`: target dimension

## Attributes
- `n_samples_`: number of rows (samples) of `X`
- `n_features_in_`: number of columns (features) of `X`
- `n_components_`: target dimension
- `components_`: `n_components_` rows, one principal component per row
- `mean_`: `X.mean(axis=0)`
- `explained_variance_`: variance of the data along each component
- `explained_variance_ratio_`: fraction of the total variance explained by each component
- `singular_values_`: singular values of the centered `X`  
(`singular_values_**2 / (n_samples_ - 1) == explained_variance_`)  
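The relations between these attributes can be checked numerically. The following is a minimal sketch on hypothetical toy data (`X_demo` and `demo` are illustrative names, not part of the exercises). It assumes the default `whiten=False` and relies on scikit-learn normalizing the variance by `n_samples_ - 1`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: 50 samples with 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))

demo = PCA()
demo.fit(X_demo)

# mean_ is the column-wise mean of the training data.
mean_ok = np.allclose(demo.mean_, X_demo.mean(axis=0))

# explained_variance_ uses the unbiased (n_samples - 1) normalization.
var_from_sv = demo.singular_values_**2 / (demo.n_samples_ - 1)
var_ok = np.allclose(var_from_sv, demo.explained_variance_)

# With all components kept, the ratios sum to 1.
ratio_sum = demo.explained_variance_ratio_.sum()

print(mean_ok, var_ok, ratio_sum)
```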

## Sample data

##### Exercise 1
Let  
```python
mu = np.array([3,4])
cov = np.array([[1.1,1],
                [1,1.1]])
X = np.random.multivariate_normal(mu, cov, 100)
```
Let `X_new` be the result of PCA on `X` .

In [None]:
from sklearn.decomposition import PCA
mu = np.array([3,4])
cov = np.array([[1.1,1],
                [1,1.1]])
X = np.random.multivariate_normal(mu, cov, 100)

model = PCA()
X_new = model.fit_transform(X)

###### 1(a)
Plot points (rows) in `X` .  
Plot points (rows) in `X_new` .

In [None]:
fig, ax = plt.subplots()
ax.scatter(X[:,0], X[:,1]) # blue
ax.scatter(X_new[:,0], X_new[:,1]) # orange
plt.legend(['X','X_new'])

###### 1(b)
Adding on top of the previous figure, draw vectors for the rows in `model.components_` with the tails at `model.mean_` .

In [None]:
fig, ax = plt.subplots()
ax.scatter(X[:,0], X[:,1]) # blue
ax.scatter(X_new[:,0], X_new[:,1]) # orange
ax.arrow(*model.mean_, *model.components_[0], color='red', head_width=0.2)
ax.arrow(*model.mean_, *model.components_[1], color='red', head_width=0.2)
plt.axis('equal')
plt.legend(['X','X_new'])
# The two principal components are orthogonal.

###### 1(c)
Print `model.explained_variance_ratio_` .  
How important is the first component in percentage?

In [None]:
print(model.explained_variance_ratio_)
print(model.explained_variance_ratio_[0])
# The first component explains around 96% of the variance in the dataset.
# The larger the variance explained by a principal component, the more important that component is.

The first component accounts for around 96% of the variance.
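As a sanity check on that figure, the population value can be computed directly from `cov`: the variances along the principal directions are the eigenvalues of the covariance matrix. This is a small sketch, and `expected_ratio` is an illustrative name. The ratio obtained from 100 random samples will fluctuate around this value.

```python
import numpy as np

cov = np.array([[1.1, 1.0],
                [1.0, 1.1]])

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
eigvals = np.linalg.eigvalsh(cov)       # approximately [0.1, 2.1]

# Fraction of the total variance along the leading direction.
expected_ratio = eigvals[-1] / eigvals.sum()
print(expected_ratio)                   # approximately 0.9545
```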

##### Exercise 2
Let  
```python
X = np.genfromtxt('hidden_text.csv', delimiter=',')
```
All points in this data set lie in a two-dimensional plane embedded in a much higher-dimensional space.  

In [None]:
X = np.genfromtxt('hidden_text.csv', delimiter=',')

###### 2(a)
Can you find out what this data says?

In [None]:
X.shape
# The shape alone does not reveal any message.
# The points live in a high-dimensional space, so we need another way to look at them.

In [None]:
plt.scatter(X[:, 0], X[:, 1])
# Plotting the first two coordinates hints at some text,
# but it looks distorted and may need to be rotated.
# Since the points lie in a plane, two dimensions should capture most of the structure.

In [None]:
model = PCA(n_components=2)
X_new = model.fit_transform(X)
plt.scatter(X_new[:, 0],X_new[:, 1])
plt.axis('equal')
# After reducing to 2 dimensions the message is readable: the plot shows NSYSU.

###### 2(b)
What fraction of the variance do the first two components explain?

In [None]:
print(model.explained_variance_ratio_)
# With 2 components, about 99.9% of the variance in this data is explained.
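The same effect can be reproduced without `hidden_text.csv`. The sketch below builds a hypothetical stand-in: random points in a plane, embedded in 10 dimensions through two orthonormal directions. All names, sizes, and the seed are assumptions made for illustration. PCA with two components should then explain essentially all of the variance and preserve distances between points.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-in for hidden_text.csv: 200 points in a 2-D plane.
flat = rng.uniform(-1, 1, size=(200, 2))

# Embed the plane in 10 dimensions with two orthonormal directions
# (the Q factor of a random 10 x 2 matrix) plus a constant offset.
Q, _ = np.linalg.qr(rng.normal(size=(10, 2)))
X_high = flat @ Q.T + rng.normal(size=10)

demo = PCA(n_components=2)
X_low = demo.fit_transform(X_high)

# The two kept components capture essentially all of the variance.
total_ratio = demo.explained_variance_ratio_.sum()

# Distances between points are preserved, so the picture is recovered
# up to a rotation or reflection.
d_flat = np.linalg.norm(flat[0] - flat[1])
d_low = np.linalg.norm(X_low[0] - X_low[1])
print(total_ratio, d_flat, d_low)
```

Because only the shape is preserved, the recovered text may appear rotated or mirrored compared with the original.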

##### Exercise 3
Let  
```python
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data
y = digits.target
```

In [None]:
from sklearn.datasets import load_digits
digits = load_digits()
X = digits.data
y = digits.target

###### 3(a)
Let `X_new` be the result of applying PCA to `X` with 30 components.  
Use the first two columns as the $x$,$y$-coordinates and plot the points with `c=y` .

In [None]:
### your answer here
model = PCA(n_components=30)
X_new = model.fit_transform(X)
plt.scatter(X_new[:,0],X_new[:,1], c=y)

###### 3(b)
Use the first three columns as the $x$,$y$,$z$-coordinates and plot the points with `c=y` .

In [None]:
### your answer here
from mpl_toolkits import mplot3d
ax = plt.axes(projection='3d')
# call scatter on the 3D axes so the third array is used as z rather than as marker size
ax.scatter(X_new[:,0], X_new[:,1], X_new[:,2], c=y)

###### 3(c)
Use `plt.plot` to plot `model.explained_variance_` .  
What is an appropriate choice of the target dimension?

In [None]:
### your answer here
plt.plot(model.explained_variance_)
# 10 is an appropriate choice
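Reading the elbow off the plot is a judgment call. One complementary approach is to look at the cumulative explained variance ratio and pick the smallest dimension that reaches a chosen threshold. The sketch below uses a threshold of 0.8, which is an assumption chosen for illustration. `demo`, `cumulative`, and `k` are illustrative names, and the result need not agree with the value read from the plot.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits = load_digits().data
demo = PCA(n_components=30).fit(X_digits)

# Cumulative fraction of variance explained by the first k components.
cumulative = np.cumsum(demo.explained_variance_ratio_)

# Smallest k whose cumulative ratio reaches the threshold.
# The 0.8 threshold is an assumption chosen for illustration.
threshold = 0.8
k = int(np.searchsorted(cumulative, threshold) + 1)
print(k, cumulative[k - 1])
```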

## Experiments

##### Exercise 4
Let  
```python
mu = np.array([3,4])
cov = np.array([[1.1,1],
                [1,1.1]])
X = np.random.multivariate_normal(mu, cov, 100)

model = PCA(2)
X_new = model.fit_transform(X)
```

In [None]:
mu = np.array([3,4])
cov = np.array([[1.1,1],
                [1,1.1]])
X = np.random.multivariate_normal(mu, cov, 100)

model = PCA(2)
X_new = model.fit_transform(X)

###### 4(a)
Let  
```python
X_shifted = X - model.mean_
```
Plot the points (rows) of `X` .  
Plot the points (rows) of `X_shifted` .

In [None]:
X_shifted = X - model.mean_

In [None]:
fig, ax = plt.subplots()
ax.scatter(X[:,0], X[:,1]) # blue
ax.scatter(X_shifted[:,0], X_shifted[:,1]) # orange
plt.legend(['X', 'X_shifted'])


###### 4(b)
Check that the rows of `model.components_` are mutually orthogonal and of length 1.

In [None]:
model.components_

In [None]:
# From linear algebra, two vectors are orthogonal
# exactly when their inner product is 0.
print(model.components_[0].dot(model.components_[1]))

In [None]:
# use a name other than `len` to avoid shadowing the built-in function
lengths = np.linalg.norm(model.components_, axis=1)
print(lengths)
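Both checks can be combined into one: the rows are orthonormal exactly when the product of `components_` with its transpose is the identity matrix. A minimal sketch on freshly generated data follows. `X_demo`, `demo`, and `gram` are illustrative names and the seed is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.multivariate_normal([3, 4], [[1.1, 1], [1, 1.1]], 100)
demo = PCA(2).fit(X_demo)

# Entry (i, j) of the Gram matrix is the inner product of rows i and j,
# so the diagonal holds squared lengths and the rest holds the cross terms.
gram = demo.components_ @ demo.components_.T
is_orthonormal = np.allclose(gram, np.eye(2))
print(is_orthonormal)
```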

###### 4(c)
Let  
```python
X_proj = X_shifted.dot(model.components_.T)
```
Plot the points (rows) of `X_shifted` .  
Plot the points (rows) of `X_proj` .

In [None]:
X_proj = X_shifted.dot(model.components_.T)

In [None]:
fig, ax = plt.subplots()
ax.scatter(X_shifted[:,0], X_shifted[:,1]) # blue
ax.scatter(X_proj[:,0], X_proj[:,1]) # orange
plt.legend(['X_shifted','X_proj'])
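The projection computed by hand should coincide with what `fit_transform` returns, since (with the default `whiten=False`) the transform centers the data and then takes inner products with the principal components. A short sketch to confirm this numerically, using illustrative names and an arbitrary seed:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.multivariate_normal([3, 4], [[1.1, 1], [1, 1.1]], 100)

demo = PCA(2)
X_new_demo = demo.fit_transform(X_demo)

# Center the data, then take inner products with each principal direction.
X_proj_demo = (X_demo - demo.mean_) @ demo.components_.T

same_as_transform = np.allclose(X_proj_demo, X_new_demo)
print(same_as_transform)
```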

###### 4(d)
Suppose `model.mean_` and `model.components_` are given.  
Can you find a way to obtain `X` from `X_new` ?

In [None]:
# Compute the inverse transformation of X_new to obtain X
X_recover = np.dot(X_new, model.components_) + model.mean_
model.components_

In [None]:
fig, ax = plt.subplots()
ax.scatter(X_shifted[:,0], X_shifted[:,1]) # blue
ax.scatter(X_recover[:,0], X_recover[:,1]) # orange
plt.legend(['X_shifted','X_recover'])

In [None]:
model.inverse_transform(X_new)
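The manual formula and `inverse_transform` can be compared numerically rather than by eye. The sketch below (illustrative names, arbitrary seed, default `whiten=False` assumed) also shows that recovery is exact only when all components are kept. With a single component, the result is the projection of each point onto a line.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.multivariate_normal([3, 4], [[1.1, 1], [1, 1.1]], 100)

# Keeping both components: the transform is invertible.
full = PCA(2)
Z = full.fit_transform(X_demo)
manual = Z @ full.components_ + full.mean_
matches_sklearn = np.allclose(manual, full.inverse_transform(Z))
exact_recovery = np.allclose(manual, X_demo)

# Keeping one component: the recovery is only the projection onto a line.
partial = PCA(1)
Z1 = partial.fit_transform(X_demo)
approx = partial.inverse_transform(Z1)
max_error = np.abs(approx - X_demo).max()
print(matches_sklearn, exact_recovery, max_error)
```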

##### Veronica  

It would be better to plot the graph with the following code to check whether the original `X` and `X_recover` are close.

```python
%matplotlib inline
plt.scatter(*X_recover.T,color='blue',s=50)
plt.scatter(*X.T,color='red',s=10)
```