# Principal Component Analysis

PCA isn't a predictive model in its own right; it is an unsupervised learning technique. It is often used to **preprocess** data before it goes into a supervised learning method. Traditionally it is used to solve problems involving too many features and multicollinearity.

## Let's dig into how it works 
Suppose you have $p$ feature columns, $x_1, \dots, x_p$. The **first principal component** is the linear combination of all $p$ columns (with its coefficients normalized to unit length) that accounts for the **maximum variance** among them. That is,

$$ z_1 = c_{11}x_1 + c_{12}x_2 + c_{13}x_3 + \dots + c_{1p}x_p $$

The **second principal component** is another linear combination of the $p$ features that accounts for as much of the _remaining_ variance as possible after the first. It must also be **orthogonal (perpendicular)** to the first.

$$ z_2 = c_{21}x_1 + c_{22}x_2 + c_{23}x_3 + \dots + c_{2p}x_p $$

The **third principal component** maximizes the remaining variance while being orthogonal to (read: _uncorrelated_ with) the first two, and so on.

$$ z_3 = c_{31}x_1 + c_{32}x_2 + c_{33}x_3 + \dots + c_{3p}x_p $$
$$\vdots$$
$$ z_i = c_{i1}x_1 + c_{i2}x_2 + c_{i3}x_3 + \dots + c_{ip}x_p $$
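To make the definitions above concrete, here is a minimal NumPy sketch that finds the coefficients by eigendecomposing the covariance matrix of a small synthetic data set. The data and the names `X_demo`, `coefs`, and `Z` are made up for illustration, and this is only one common way to compute principal components, not necessarily what scikit-learn does internally.

```python
import numpy as np

# Synthetic data: 200 samples, 3 features, two of them strongly correlated
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X_demo = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Center the columns, then eigendecompose the covariance matrix
X_centered = X_demo - X_demo.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# Reorder from largest to smallest variance
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Row i of `coefs` holds the coefficients of the (i+1)-th principal component
coefs = eigvecs.T
Z = X_centered @ coefs.T  # column i holds the scores of the (i+1)-th component

print("variance of each z:", Z.var(axis=0, ddof=1))
print("eigenvalues:       ", eigvals)
print("coefs @ coefs.T (should be close to the identity):")
print(np.round(coefs @ coefs.T, 6))
```

The variance of each $z_i$ matches the corresponding eigenvalue, the variances come out in decreasing order, and the coefficient vectors are mutually orthogonal with unit length.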

![MachineLearning](assets/PCA.png)

Hands-on visuals are a great way to learn. Here is an interactive website to help you understand how [PCA works](http://setosa.io/ev/principal-component-analysis/).

## Libraries

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline

## The Data

Let's work with the cancer data set again since it had so many features.

In [None]:
from sklearn.datasets import load_breast_cancer

In [None]:
cancer = load_breast_cancer()

In [None]:
cancer.keys()

In [None]:
print(cancer['DESCR'])

In [None]:
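X = pd.DataFrame(cancer['data'],columns=cancer['feature_names'])
# Other keys available: 'DESCR', 'data', 'feature_names', 'target_names', 'target'
# In this dataset target 0 = malignant and 1 = benign, so the column is named 'benign'
ydf = pd.DataFrame(cancer['target'],columns=['benign'])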

In [None]:
y=cancer.target

In [None]:
df=pd.concat([X,ydf],axis=1)

In [None]:
df.head()

## PCA Visualization

As we've noticed before, it is difficult to visualize high-dimensional data. We can use PCA to find the first two principal components and visualize the data in this new, two-dimensional space with a single scatter plot. Before we do this, though, we'll need to scale our data so that each feature has unit variance.

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()
scaler.fit(X)

In [None]:
scaled_data = scaler.transform(X)
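As a quick sanity check on what scaling does, the sketch below reloads the data and confirms that every standardized column has mean 0 and standard deviation 1, while the raw columns live on very different scales. The names `X_raw` and `X_std` are illustrative and not part of the cells above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X_raw = load_breast_cancer().data
X_std = StandardScaler().fit_transform(X_raw)

# Raw features are measured on very different scales
raw_std = X_raw.std(axis=0)
print("raw std, smallest / largest:", raw_std.min(), raw_std.max())

# After scaling, every column has mean ~0 and standard deviation ~1
print("scaled means close to 0:", np.allclose(X_std.mean(axis=0), 0))
print("scaled stds close to 1: ", np.allclose(X_std.std(axis=0), 1))
```

Because PCA looks for directions of maximum variance, a feature with a much larger raw scale would otherwise dominate the first component.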

PCA in scikit-learn follows the same pattern as the other preprocessing classes that come with scikit-learn. We instantiate a PCA object, find the principal components using the `fit` method, then apply the rotation and dimensionality reduction by calling `transform()`.

We can also specify how many components we want to keep when creating the PCA object.

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca = PCA(n_components=2)

In [None]:
pca.fit(scaled_data)

Now we can transform this data to its first 2 principal components.

In [None]:
x_pca = pca.transform(scaled_data)

In [None]:
scaled_data.shape

In [None]:
x_pca.shape

In [None]:
feat_import=pd.DataFrame(pca.components_,columns=cancer['feature_names'],index = ['PC-1','PC-2'])
feat_import

Great! We've reduced 30 dimensions to just 2! Let's plot these two dimensions out!

In [None]:
plt.figure(figsize=(8,6))
plt.scatter(x_pca[:,0],x_pca[:,1],c=cancer['target'],cmap='plasma')
plt.xlabel('First principal component')
plt.ylabel('Second principal component')

Clearly, these two components alone go a long way toward separating the two classes.
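It is also worth asking how much of the total variance those two components actually capture. The sketch below refits the scaler and a two-component PCA from scratch (under the illustrative names `scaled_demo` and `pca_demo`) and reads the fitted `explained_variance_ratio_` attribute.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaled_demo = StandardScaler().fit_transform(load_breast_cancer().data)
pca_demo = PCA(n_components=2).fit(scaled_demo)

# Fraction of the total variance captured by each kept component
ratios = pca_demo.explained_variance_ratio_
print("per component:", ratios)
print("total for both:", ratios.sum())
```

The total is less than 1 because the remaining 28 components, which we dropped, account for the rest of the variance.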

## Interpreting the components 

Unfortunately, this great power of dimensionality reduction comes at a cost: it is no longer easy to understand what these components represent.

The components correspond to combinations of the original features. The components themselves are stored as an attribute of the fitted PCA object:

In [None]:
pca.components_

In [None]:
# Pair each coefficient of the first component with its feature name
list(zip(pca.components_[0],X.columns))

In this NumPy array, each row represents a principal component, and each column relates back to one of the original features. For example, here is the first component:

$$z_1 = 0.21890244x_1 + 0.10372458x_2 + 0.22753729x_3 + 0.22099499x_4 + 0.14258969x_5 + ... + 0.13178394x_{30}$$

We can visualize this relationship with a heatmap:

In [None]:
df_comp = pd.DataFrame(pca.components_,columns=cancer['feature_names'])

In [None]:
plt.figure(figsize=(12,6))
sns.heatmap(df_comp,cmap='plasma',)

This heatmap shows how each variable contributes to each of our two principal components.
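A heatmap of 30 columns can be hard to read, so a complementary view is to list the features with the largest absolute coefficients in each component. This is a small sketch that rebuilds the fitted objects under illustrative names (`pca_demo`, `loadings`); the choice of showing five features is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
scaled = StandardScaler().fit_transform(data.data)
pca_demo = PCA(n_components=2).fit(scaled)

# One row per component, one column per original feature
loadings = pd.DataFrame(pca_demo.components_,
                        columns=data.feature_names,
                        index=['PC-1', 'PC-2'])

# Show the five features with the largest absolute coefficient in each component
for name, row in loadings.iterrows():
    top = row.abs().sort_values(ascending=False).head(5).index
    print(name)
    print(row[top])
    print()

# Each component's coefficient vector has unit length
print(np.sqrt((loadings ** 2).sum(axis=1)))
```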

## What's the right number of n_components?

One way to find the right number of components is to feed the PCA output into a model and see which number of components gives the best score.

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
# How does it look with the whole feature set?

lg=LogisticRegression()
lg.fit(X,y)
lg.score(X,y)

In [None]:
# How about reducing it to our two pca features?
lg=LogisticRegression()
lg.fit(x_pca,y)
lg.score(x_pca,y)

#### What does this tell us?

...

## Introducing Pipeline

To find the right value of `n_components`, we could repeat this process a few times. To save us some time, though, let's introduce scikit-learn's [Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class.

A Pipeline sequentially applies a list of transforms to the data and passes the result to a final estimator. Intermediate steps of the pipeline must be ‘transforms’, that is, they must implement `fit` and `transform` methods. The final estimator only needs to implement `fit`. The transformers in the pipeline can be cached using the `memory` argument.

Included **Methods**

| Method                     | Application   |
|:---------------------------|------------:|
|decision_function(X)        |	Apply transforms, and decision_function of the final estimator|
|fit(X[, y])                 |	Fit the model|
|fit_predict(X[, y])         |	Applies fit_predict of last step in pipeline after transforms.|
|fit_transform(X[, y])       |	Fit the model and transform with the final estimator|
|get_params([deep])          |	Get parameters for this estimator.|
|predict(X)                  |	Apply transforms to the data, and predict with the final estimator|
|predict_log_proba(X)        |	Apply transforms, and predict_log_proba of the final estimator|
|predict_proba(X)            |	Apply transforms, and predict_proba of the final estimator|
|score(X[, y, sample_weight])|	Apply transforms, and score with the final estimator|
|set_params(**kwargs)        |	Set the parameters of this estimator.|
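To see what "apply transforms, then predict with the final estimator" means in practice, the sketch below fits a small pipeline and then reproduces its `predict` call by hand from the fitted steps. The names `demo_pipe`, `X_demo`, and `y_demo` are illustrative; `named_steps` is the Pipeline attribute that gives access to each step by name.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)

demo_pipe = Pipeline([
    ('sc', StandardScaler()),
    ('pc', PCA(n_components=2)),
    ('lg', LogisticRegression()),
])
demo_pipe.fit(X_demo, y_demo)

# Pull out the fitted steps and chain them manually
sc = demo_pipe.named_steps['sc']
pc = demo_pipe.named_steps['pc']
lg = demo_pipe.named_steps['lg']
manual_pred = lg.predict(pc.transform(sc.transform(X_demo)))

# The pipeline does exactly the same thing in a single call
same = np.array_equal(manual_pred, demo_pipe.predict(X_demo))
print("manual chain matches pipe.predict:", same)
```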

In [None]:
# Optimizing for variance of n_components could take awhile. 
# Let's make our coding easier by putting all these models into a pipeline

from sklearn.pipeline import Pipeline
pipe = Pipeline([
    ('sc', StandardScaler()),
    ('pc', PCA(n_components=2)),
    ('lg', LogisticRegression())
])

#### What will the StandardScaler do to our data before we fit each model?

...

In [None]:
pipe.fit(X,y)
pipe.score(X,y)

In [None]:
#What's our model look like?
pipe.get_params()

#### We can use the train/test split procedure from the previous lesson to see how each of these 30 versions performs.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(X,y, random_state=1)

In [None]:
# This for loop goes through every number of possible components and tests the accuracy of a model fitted to that many components.

acc_list = []
k_range = range(1,X.shape[1] + 1)
for k in k_range:
    pipe.set_params(pc__n_components=k)
    pipe.fit(X_train, y_train)
    acc = pipe.score(X_test, y_test)
    acc_list.append(acc)
    print(f"k = {k}: Acc = {acc}")

In [None]:
plt.figure(figsize=(10,8))
# Plot against k itself so the x-axis runs from 1 to 30 rather than 0 to 29
plt.plot(list(k_range), acc_list);
plt.ylabel('Accuracy - higher is better')
plt.xlabel('Number of Components Included');

### Which number of n_components would you choose? why?

#### Hint: what were we trying to reduce with PCA?

In [None]:
# we can visualize the list
k_list=dict(zip(k_range,acc_list))
k_list

In [None]:
# We can also find the smallest number of components that reaches the highest test accuracy

best_k, best_acc = max(k_list.items(), key=lambda item: item[1])
print((best_k, best_acc))
print(best_k)

In [None]:
#What happens when we add that back into our pipe?
pipe_best_k = Pipeline([
    ('sc', StandardScaler()),
    ('pc', PCA(n_components=best_k)),  # or insert the value you want
    ('lg', LogisticRegression())
])
pipe_best_k.fit(X,y)
pipe_best_k.score(X,y)
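A single train/test split can be noisy, so another option is to let scikit-learn's `GridSearchCV` try each value of `pc__n_components` with cross-validation on the training data. The sketch below is one possible setup, not the only one: the range of 1 to 10 components, `cv=5`, and `max_iter=1000` are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

search_pipe = Pipeline([
    ('sc', StandardScaler()),
    ('pc', PCA()),
    ('lg', LogisticRegression(max_iter=1000)),
])

# Step name + double underscore + parameter name addresses a step's parameter
param_grid = {'pc__n_components': list(range(1, 11))}

search = GridSearchCV(search_pipe, param_grid, cv=5)
search.fit(X_tr, y_tr)

print("best n_components:", search.best_params_['pc__n_components'])
print("mean cross-validated accuracy:", search.best_score_)
print("accuracy on held-out test set:", search.score(X_te, y_te))
```

Because the scaler and PCA sit inside the pipeline, they are refit on each training fold, so no information from the validation fold leaks into the preprocessing.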

## Instead of setting n_components

Another approach is to tell PCA how much of the variance you want to explain and let it choose the number of components.

**Note - need a new dataset for this one!**

In [None]:
# The first approach is to plot how much variance the components explain cumulatively. After the first 4... not so much
pca_graph=PCA().fit(cancer.data)
plt.plot(np.cumsum(pca_graph.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');

In [None]:
# Let's set a model that'll explain 95% of variance
pca=PCA(.95)
pca.fit(X_train)
pca.n_components_

#### Question: Why did our model only include one component?  Would the model be more or less accurate if more components were included?

...
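One way to investigate the question above is to run the same 95% variance target on both the raw and the standardized features and compare how many components PCA keeps. The sketch below uses the full data set and the illustrative names `pca_raw` and `pca_scaled`; passing a float between 0 and 1 as `n_components` asks PCA to keep just enough components to reach that share of the variance.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_demo = load_breast_cancer().data

# Same variance target, with and without scaling
pca_raw = PCA(n_components=0.95).fit(X_demo)
pca_scaled = PCA(n_components=0.95).fit(StandardScaler().fit_transform(X_demo))

print("components kept on raw data:   ", pca_raw.n_components_)
print("components kept on scaled data:", pca_scaled.n_components_)
```

On the raw data, the features with the largest numeric scale dominate the total variance, so very few components are needed to reach the target. After scaling, every feature contributes equally and more components are required.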

# Summary:

### What does it do?
* Creates *linearly uncorrelated* predictors
* Allows you to keep only the components that capture the most variance

### What are the components?
* Values calculated from the raw observations:

$$ z_1 = c_{11}x_1 + c_{12}x_2 + c_{13}x_3 + \dots + c_{1p}x_p $$
$$ z_2 = c_{21}x_1 + c_{22}x_2 + c_{23}x_3 + \dots + c_{2p}x_p $$
$$\vdots$$
$$ z_i = c_{i1}x_1 + c_{i2}x_2 + c_{i3}x_3 + \dots + c_{ip}x_p $$

* $z_1$ always captures the most variance. Because PCA never looks at the target, this does not guarantee it is the strongest predictor.
* $z_2$ captures the most remaining variance while being uncorrelated with $z_1$.
* $z_3$ captures the most remaining variance while being uncorrelated with $z_1$ and $z_2$.
* We can keep this going until the number of components equals the number of predictors.

### Why do we exclude components from our model?
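* If we included all components, we would retain exactly the same information as the original features, just rotated onto new axes, so there would be no dimensionality reduction.
* By excluding components we can help avoid overfitting the model: we are essentially ignoring the low-variance directions, which often carry the least reliable information.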
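The points above can be checked numerically. In this sketch (with the illustrative names `pca_full`, `Z_full`, and `pca_2`), keeping all components lets `inverse_transform` recover the scaled data exactly, the component scores are uncorrelated with one another, and keeping only two components leaves a nonzero reconstruction error.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_breast_cancer().data)

# Keep every component: a pure rotation, nothing is lost
pca_full = PCA().fit(X_std)
Z_full = pca_full.transform(X_std)
lossless = np.allclose(pca_full.inverse_transform(Z_full), X_std)
print("all components reconstruct the data:", lossless)

# The component scores are uncorrelated with each other
corr = np.corrcoef(Z_full, rowvar=False)
uncorrelated = np.allclose(corr, np.eye(corr.shape[0]), atol=1e-6)
print("component scores are uncorrelated:", uncorrelated)

# Keep only two components: some information is discarded
pca_2 = PCA(n_components=2).fit(X_std)
X_approx = pca_2.inverse_transform(pca_2.transform(X_std))
mse_2 = np.mean((X_std - X_approx) ** 2)
print("mean squared reconstruction error with 2 components:", mse_2)
```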

### Points to Remember
* PCA is used to overcome feature redundancy in a data set.
* The resulting features are lower-dimensional than the original data.
* These features, a.k.a. components, are normalized linear combinations of the original predictor variables.
* These components aim to capture as much information as possible with high explained variance.
* The first component has the highest variance, followed by the second, the third, and so on.
* The components must be uncorrelated (remember the orthogonal directions?). See above.
* Normalizing data becomes extremely important when the predictors are measured in different units.
* PCA works best on data sets with 3 or more dimensions, because with higher dimensions it becomes increasingly difficult to make interpretations from the resulting cloud of data.
* PCA is applied on a data set with numeric variables.
* PCA is a tool which helps to produce better visualizations of high dimensional data.