# Principal Component Analysis

Principal Component Analysis (PCA) is a linear dimensionality reduction technique that extracts information from a high-dimensional space by projecting it onto a lower-dimensional subspace. It tries to preserve the directions along which the data varies the most and to discard the directions with little variation.

Dimensions are simply the features that represent the data. For example, a 28 x 28 image has 784 picture elements (pixels), which are the dimensions or features that together represent that image.
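
As a small illustration of the pixel example, the sketch below flattens a 28 x 28 array into a single feature vector. The image is random data made up for this example, not a real picture.

```python
import numpy as np

# Hypothetical 28 x 28 grayscale image filled with random intensities
# (an assumption for illustration only).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28))

# Flattening turns the 2D pixel grid into one feature vector of length 784.
features = image.reshape(-1)
print(features.shape)
```

Each entry of `features` is one dimension of the image in the sense used throughout this notebook.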

One important thing to note about PCA is that it is an unsupervised dimensionality reduction technique: it uses only the correlations between features and needs no supervision (or labels). Similar data points often end up close together in the projected space, so groups can become visible without any labels. You will learn how to achieve this in practice using Python in later sections of this tutorial!

PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. The terms features, dimensions, and variables all refer to the same thing in this notebook.
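
One common way to write this transformation down is sketched here. The symbols are notation introduced for this illustration and are not used elsewhere in the notebook: $X$ is the centered data matrix with $n$ samples and $p$ features, $C$ its covariance matrix, $w_k$ the principal axes, and $\lambda_k$ the variance along each axis.

$$
C = \frac{1}{n-1} X^\top X, \qquad C\,w_k = \lambda_k w_k, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0
$$

$$
Z = X W_d, \qquad W_d = [\,w_1 \;\cdots\; w_d\,], \qquad \text{explained variance ratio of component } k = \frac{\lambda_k}{\sum_{j=1}^{p} \lambda_j}
$$

Keeping only the first $d$ columns of $Z$ gives the reduced representation.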


#### Main usage of PCA
* Data Visualization
When working on any data-related problem, extensive data exploration, such as finding out how the variables are correlated or understanding the distribution of a few variables, is crucial. When the data is distributed along a large number of variables or dimensions, visualization can be challenging or almost impossible. Using dimensionality reduction, the data can be projected into a lower dimension, allowing you to visualize it in a 2D or 3D space.


* Speeding Up Machine Learning Algorithms
Since PCA's main idea is dimensionality reduction, you can use it to speed up the training and testing time of your machine learning algorithm when your data has a lot of features and learning is too slow.
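
A minimal sketch of the second use case, assuming a scikit-learn pipeline: the classifier is trained on 5 principal components instead of the 30 original features. The dataset, the number of components, and the choice of `LogisticRegression` are assumptions made for this example, and the sketch shows the pattern rather than measuring any speed-up.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Scale the features, compress them to 5 components, then classify.
model = make_pipeline(StandardScaler(), PCA(n_components=5),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

n_inputs = model.named_steps['pca'].n_components_
accuracy = model.score(X_test, y_test)
print(f"classifier sees {n_inputs} features instead of {X.shape[1]}")
print(f"test accuracy: {accuracy:.3f}")
```

Putting PCA inside the pipeline means the projection is learned from the training split only and then reused on the test split.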

#### Principal Component
Principal components are the key to PCA; they represent what's underneath the hood of your data. In layman's terms, when the data is projected from a higher-dimensional space into a lower dimension (say, three dimensions), the three dimensions are the three principal components that capture (or hold) most of the variance (information) of your data.

Principal components have both direction and magnitude. The direction is the principal axis along which the data is most spread out (has the most variance), and the magnitude is the amount of variance the principal component captures when the data is projected onto that axis. Each principal component is a straight line (an axis), and the first principal component holds the most variance in the data. Each subsequent principal component is orthogonal to all previous ones and captures less variance. In this way, given a set of x correlated variables over y samples, you obtain a set of u uncorrelated principal components (u ≤ x) over the same y samples.

The reason you achieve uncorrelated principal components from the original features is that the correlated features contribute to the same principal component, thereby reducing the original data features into uncorrelated principal components; each representing a different set of correlated features with different amounts of variation.

Each principal component represents a percentage of total variation captured from the data.
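
The properties described in this section can be checked numerically. The following is a minimal NumPy sketch, not the scikit-learn implementation: it builds synthetic data (an assumption for illustration), takes the eigenvectors of the covariance matrix as principal axes, and verifies that the axes are orthogonal and the projected data is uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (an assumption for illustration): two strongly correlated
# features plus one independent feature, 200 samples.
base = rng.normal(size=200)
data = np.column_stack([
    base + 0.1 * rng.normal(size=200),
    2.0 * base + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

# Center the data and compute the feature covariance matrix.
centered = data - data.mean(axis=0)
covariance = np.cov(centered, rowvar=False)

# eigh returns eigenvalues in ascending order; reorder to descending variance.
eigenvalues, eigenvectors = np.linalg.eigh(covariance)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Project the data onto the principal axes.
scores = centered @ eigenvectors
explained_ratio = eigenvalues / eigenvalues.sum()

print("explained variance ratio:", np.round(explained_ratio, 3))
print("axes orthonormal:",
      np.allclose(eigenvectors.T @ eigenvectors, np.eye(3)))
print("scores uncorrelated:",
      np.allclose(np.cov(scores, rowvar=False), np.diag(eigenvalues)))
```

Because the first two features are nearly copies of each other, most of their shared variance lands in a single component.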

#### PCA on iris dataset
In this section we will use PCA to decompose a very simple 4-dimensional data set. This is one of the best-known pattern recognition datasets. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

In [None]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
#from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
#from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso

# Render matplotlib figures inline in the notebook
%matplotlib inline

In [None]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

In [None]:
iris_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

In [None]:
# loading dataset into Pandas DataFrame
df_iris = pd.read_csv(iris_url, names=['sepal length', 'sepal width', 'petal length', 'petal width', 'target'])

In [None]:
df_iris.head(15)

When the dimensionality of the data allows it, it is good practice to see how each pair of features correlates. The following link describes more methods for visualizing multidimensional data using the matplotlib and seaborn libraries:
https://towardsdatascience.com/the-art-of-effective-visualization-of-multi-dimensional-data-6c7202990c57

In [None]:
sns.pairplot(df_iris, hue='target')

You can immediately see that the features petal length and petal width are strongly correlated.
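
The visual impression can be backed by a number. This sketch computes the Pearson correlation matrix, assuming scikit-learn's bundled copy of the iris data instead of the UCI file loaded above (so the column names carry a `(cm)` suffix).

```python
import pandas as pd
from sklearn.datasets import load_iris

# scikit-learn's bundled iris data (an assumption; the notebook loads the UCI file).
iris = load_iris()
frame = pd.DataFrame(iris.data, columns=iris.feature_names)

# Pearson correlation between every pair of features.
corr = frame.corr()
petal_corr = corr.loc['petal length (cm)', 'petal width (cm)']

print(corr.round(2))
print(f"petal length vs petal width: {petal_corr:.2f}")
```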


### Standardize the Data

Since PCA yields a feature subspace that maximizes the variance along the axes, it makes sense to standardize the data, especially if it was measured on different scales. Although all features in the Iris dataset were measured in centimeters, let us continue with the transformation of the data onto unit scale (mean=0 and variance=1), which is a requirement for the optimal performance of many machine learning algorithms.


In [None]:
features_iris = ['sepal length', 'sepal width', 'petal length', 'petal width']
x_iris = df_iris.loc[:, features_iris].values

In [None]:
y_iris = df_iris.loc[:, ['target']].values

In [None]:
x_iris = StandardScaler().fit_transform(x_iris)

In [None]:
df_iris_standarize = pd.DataFrame(data=x_iris, columns=features_iris)
df_iris_standarize['target'] = df_iris['target']
df_iris_standarize.head(15)

In [None]:
sns.pairplot(df_iris_standarize, hue='target')

We can see that every feature is now on the same scale (mean 0, variance 1); the shapes of the distributions are unchanged.
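
To make the transformation concrete, the sketch below reproduces `StandardScaler` by hand: subtract each feature's mean and divide by its standard deviation. It uses scikit-learn's bundled iris data, an assumption made to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

scaled = StandardScaler().fit_transform(X)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for std().
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print("matches StandardScaler:", np.allclose(scaled, manual))
print("means:", scaled.mean(axis=0).round(6))
print("standard deviations:", scaled.std(axis=0).round(6))
```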

### PCA Projection to 2D

In [None]:
pca_iris = PCA(n_components=2)

In [None]:
principalComponents_iris = pca_iris.fit_transform(x_iris)

In [None]:
principalDf_iris = pd.DataFrame(data=principalComponents_iris,
                                columns=['principal component 1', 'principal component 2'])


In [None]:
finalDf_iris = pd.concat([principalDf_iris, df_iris[['target']]], axis=1)
finalDf_iris.head(15)


### Visualize 2D Projection

Use a PCA projection to 2D to visualize the entire data set, plotting the different classes with different colors or shapes.

In [None]:


fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('Principal Component 1', fontsize=15)
ax.set_ylabel('Principal Component 2', fontsize=15)
ax.set_title('2 Component PCA', fontsize=20)

iris_targets = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
colors = ['r', 'g', 'b']
for target, color in zip(iris_targets, colors):
    indicesToKeep = finalDf_iris['target'] == target
    ax.scatter(finalDf_iris.loc[indicesToKeep, 'principal component 1']
               , finalDf_iris.loc[indicesToKeep, 'principal component 2']
               , c=color
               , s=50)
ax.legend(iris_targets)
ax.grid()



Iris-setosa is linearly separable from the other classes.

### Explained Variance

The explained variance tells us how much information (variance) can be attributed to each of the principal components.

In [None]:
pca_iris.explained_variance_ratio_

Together, the first two principal components contain 95.80% of the information. The first principal component contains 72.77% of the variance and the second principal component contains 23.03% of the variance. The third and fourth principal components contain the rest of the variance of the dataset.
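
To see how all four components share the variance, one option is to fit PCA without limiting the number of components and accumulate the ratios. This sketch assumes scikit-learn's bundled iris data, which differs from the UCI file in two samples, so the printed values may deviate slightly from the percentages quoted above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)

# Without n_components, PCA keeps all four components.
pca_full = PCA().fit(X)
ratios = pca_full.explained_variance_ratio_
cumulative = np.cumsum(ratios)

for i, (ratio, total) in enumerate(zip(ratios, cumulative), start=1):
    print(f"component {i}: {ratio:.4f} (cumulative {total:.4f})")
```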

### Limitations of PCA

* PCA is not scale invariant, so the data should be scaled first.
    
* The directions with largest variance are assumed to be of the most interest

* Only considers orthogonal transformations (rotations) of the original variables
 
* PCA is only based on the mean vector and covariance matrix. Some distributions (multivariate normal) are characterized by this, but some are not.

* If the variables are correlated, PCA can achieve dimension reduction. If not, PCA just orders them according to their variances.
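
The first and last points can be illustrated together. In this sketch, the data is synthetic and made up for the example: two independent features with equal variance, one of which is then expressed in units 1000 times smaller. Without scaling, PCA attributes almost all variance to the rescaled feature; after standardization the two components share it roughly equally.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two independent features with the same variance (synthetic, for illustration).
X = rng.normal(size=(500, 2))

# Change the unit of the second feature, e.g. meters to millimeters.
X_rescaled = X.copy()
X_rescaled[:, 1] *= 1000

raw_ratio = PCA().fit(X_rescaled).explained_variance_ratio_
scaled_ratio = PCA().fit(
    StandardScaler().fit_transform(X_rescaled)).explained_variance_ratio_

print("without scaling:", raw_ratio.round(4))
print("with scaling:   ", scaled_ratio.round(4))
```

Since the features are uncorrelated, PCA cannot compress them; it only ranks them by variance, and that ranking depends entirely on the units.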




### Exercises - Perform PCA for breast cancer dataset

* This dataset is available in the scikit-learn library: import it and convert it to a pandas DataFrame. The original labels are 0 and 1; for better readability, replace them with the class names (in scikit-learn's encoding, 0 is 'malignant' and 1 is 'benign').

In [None]:
from sklearn.datasets import load_breast_cancer

In [None]:
def sklearn_dataset_to_pandas(sklearn_dataset, target_map = None):
    cancer_features = sklearn_dataset['data']
    cancer_features_norm = StandardScaler().fit_transform(cancer_features)

    pandas_df = pd.DataFrame(data=cancer_features_norm, columns=sklearn_dataset['feature_names'])
    pandas_df['target'] = pd.Series(sklearn_dataset['target'])

    if target_map is not None:
        pandas_df['target'] = pandas_df['target'].map(lambda t: target_map[t])

    return pandas_df

In [None]:
cancer_dataset = load_breast_cancer()
# scikit-learn encodes the classes as target_names = ['malignant', 'benign']
cancer_target_map = {0: 'malignant', 1: 'benign'}
cancer_df = sklearn_dataset_to_pandas(cancer_dataset, cancer_target_map)
cancer_df.head(5)


* Visualize the correlations between pairs of features (because of the larger number of features, use the pandas corr() function or a seaborn heatmap() instead of pairplot)

In [None]:
# Drop the string-valued target column: corr() needs numeric data
cancer_df.drop(columns='target').corr()

* Perform PCA and visualize the data

In [None]:
def create_pca(n_components, df):
    pca = PCA(n_components=n_components)
    pca_components = pca.fit_transform(df.loc[:, df.columns != 'target'])

    columns_names = ['Component ' + str(i) for i in range(1, n_components+1)]
    pca_df = pd.DataFrame(data=pca_components, columns=columns_names)
    pca_df = pd.concat([pca_df, df['target']], axis=1)

    return pca_df, pca

In [None]:
def visualize_pca(df):
    targets = df['target'].unique()
    fig = plt.figure(figsize=(8, 8))
    ax = fig.add_subplot(1, 1, 1)
    ax.set_xlabel('Component 1', fontsize=15)
    ax.set_ylabel('Component 2', fontsize=15)
    ax.set_title('2 Component PCA', fontsize=20)

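    # One scatter call per class; matplotlib's color cycle gives each class
    # its own color, so any number of classes is plotted.
    for target in targets:
        indicator = df['target'] == target
        ax.scatter(
            df.loc[indicator, 'Component 1'],
            df.loc[indicator, 'Component 2'],
            s=50, label=target)

    ax.legend()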
    ax.grid()

In [None]:
pca_cancer_df, pca_cancer = create_pca(2, cancer_df)
visualize_pca(pca_cancer_df)

* Examine the explained variance and draw a plot showing the relation between total explained variance and the number of principal components used


In [None]:
def compare_pc(df):
    x, y = [], []
    for i in range(2, 8):
        pca_df, pca = create_pca(i, df)
        total_variance = np.sum(pca.explained_variance_ratio_)
        x.append(i)
        y.append(total_variance)

    plt.plot(x, y)
    plt.xlabel('Number of principal components')
    plt.ylabel('Total explained variance ratio')
    plt.show()

In [None]:
compare_pc(cancer_df)
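
Refitting PCA for every number of components is not strictly necessary. As a sketch of an alternative, a single fit with all components followed by a cumulative sum yields the same totals. The example loads and scales the breast cancer data itself so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_breast_cancer().data)

# One fit with all 30 components; entry k-1 of the cumulative sum is the
# total explained variance ratio of the first k components.
pca_all = PCA(svd_solver='full').fit(X)
cumulative = np.cumsum(pca_all.explained_variance_ratio_)

# Cross-check against a separate fit that keeps only 3 components.
refit_total = PCA(n_components=3, svd_solver='full').fit(
    X).explained_variance_ratio_.sum()

print("first 3 components (single fit):", round(float(cumulative[2]), 4))
print("first 3 components (refit):     ", round(float(refit_total), 4))
```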

* Use recursive feature elimination (available in scikit-learn) or another feature ranking algorithm to split the 30 features into 15 "more important" and 15 "less important" features. Then repeat the last step: draw a plot showing the relation between total explained variance and the number of principal components used for all 3 cases (the full data set and the two subsets). Explain the result briefly.

In [None]:
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier
from operator import itemgetter

In [None]:
rfe = RFE(estimator=DecisionTreeClassifier(), n_features_to_select=15)
rfe.fit(cancer_df[cancer_dataset['feature_names']].values, cancer_df['target'])

In [None]:
sorted_predictors_names = []
for x, y in (sorted(zip(rfe.ranking_ , cancer_dataset['feature_names']), key=itemgetter(0))):
    print(x, y)
    sorted_predictors_names.append(y)

In [None]:
cancer_predictors = rfe.transform(cancer_df[cancer_dataset['feature_names']].values)

cancer_df_reduce = pd.DataFrame(data=cancer_predictors, columns=sorted_predictors_names[:15])
cancer_df_reduce['target'] = cancer_df['target']
cancer_df_reduce.head(5)

In [None]:
pca_cancer_reduce_df, pca_cancer_reduce = create_pca(2, cancer_df_reduce)
visualize_pca(pca_cancer_reduce_df)

In [None]:
compare_pc(cancer_df_reduce)

## Kernel PCA

PCA is a linear method: it can only find linear projections of the data, so it works well when the structure of interest, such as the separation between classes, can be captured linearly. If we apply it to data with non-linear structure, the resulting projection may not be the optimal dimensionality reduction. Kernel PCA uses a kernel function to implicitly map the dataset into a higher-dimensional feature space where it becomes linearly separable, and performs PCA there. The idea is similar to the kernel trick used in Support Vector Machines.
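
The mechanics behind this idea can be sketched by hand, assuming an RBF kernel: build the kernel matrix, center it in feature space, and use its leading eigenvectors scaled by the square roots of their eigenvalues as the projection. This is an illustrative reimplementation, not scikit-learn's code; the sample size and `gamma` are choices made for the example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.decomposition import KernelPCA

X, _ = make_moons(n_samples=100, noise=0.02, random_state=417)
gamma = 15

# RBF kernel matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)
sq_norms = (X ** 2).sum(axis=1)
sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
K = np.exp(-gamma * np.maximum(sq_dists, 0))

# Center the kernel matrix, i.e. center the data in feature space.
n = K.shape[0]
one_n = np.full((n, n), 1.0 / n)
K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

# Leading eigenvectors, scaled by sqrt(eigenvalue), give the projection.
eigenvalues, eigenvectors = np.linalg.eigh(K_centered)
top = np.argsort(eigenvalues)[::-1][:2]
top_values = eigenvalues[top]
manual_projection = eigenvectors[:, top] * np.sqrt(top_values)

sklearn_projection = KernelPCA(
    n_components=2, kernel='rbf', gamma=gamma).fit_transform(X)

# Eigenvector signs are arbitrary, so compare squared column norms.
print("manual:      ", (manual_projection ** 2).sum(axis=0).round(4))
print("scikit-learn:", (sklearn_projection ** 2).sum(axis=0).round(4))
```

The kernel matrix has one row and column per sample, so this direct approach scales with the number of samples, not the number of features.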

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.02, random_state=417)

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()


Let’s apply PCA to this dataset.

In [None]:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.title("PCA")
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.show()

PCA fails to separate the two classes: since the input is already 2D, the projection only centers and rotates the data.

In [None]:
from sklearn.decomposition import KernelPCA

kpca = KernelPCA(kernel='rbf', gamma=15)
X_kpca = kpca.fit_transform(X)

plt.title("Kernel PCA")
plt.scatter(X_kpca[:, 0], X_kpca[:, 1], c=y)
plt.show()


The plot above shows kernel PCA applied to this dataset with an RBF kernel and a gamma value of 15.


### KernelPCA exercises

* Visualize in 2D the datasets used in this lab and experiment with the parameters of the KernelPCA method: change the kernel and gamma params. Docs: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html

In [None]:
from sklearn import preprocessing

def apply_kernel_pca(df, gamma=15):
    features = df.loc[:, df.columns != 'target']
    k_pca = KernelPCA(kernel='rbf', gamma=gamma)
    features_k_pca = k_pca.fit_transform(features)
    label_encoder = preprocessing.LabelEncoder()
    label_encoder.fit(df.target)
    label_colors= label_encoder.transform(df.target)

    fig = plt.figure(figsize=(8, 8))
    ax = fig.add_subplot(1, 1, 1)
    ax.set_title('Kernel PCA', fontsize=20)

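    scatter = ax.scatter(features_k_pca[:, 0], features_k_pca[:, 1], c=label_colors)
    # Build one legend entry per class from the encoded color values
    handles, _ = scatter.legend_elements(num=None)
    ax.legend(handles, [str(name) for name in label_encoder.classes_])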
    ax.grid()

In [None]:
apply_kernel_pca(df_iris_standarize, gamma=2)

In [None]:
apply_kernel_pca(cancer_df_reduce, gamma=0.5)

## Homework

* Download the MNIST data set (libraries such as scikit-learn and keras provide functions to load it). It is a collection of grayscale images of handwritten digits with a resolution of 28x28 pixels, which together gives 784 dimensions.

* Try to visualize this dataset using PCA and KernelPCA; don't expect full separation of the data.

* As in the exercises, examine the explained variance and draw a plot of explained variance vs. number of principal components.

* Find the number of principal components needed for 99%, 95%, 90%, and 85% of explained variance.

* Draw some sample MNIST digits, then use PCA to transform the images back to their original space (https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.inverse_transform). Apply the inverse transformation for the numbers of components corresponding to the explained variance levels listed above and draw the reconstructed images. The idea of this exercise is to see visually how much information is lost depending on the number of components.

* Perform the same reconstruction using KernelPCA (compare the results for the same numbers of components; the estimator must be created with `fit_inverse_transform=True`)
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.KernelPCA.html#sklearn.decomposition.KernelPCA.inverse_transform
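
As a starting point for the reconstruction tasks, here is a minimal sketch of the PCA round trip. It assumes scikit-learn's small 8x8 digits set (`load_digits`, 64 features) as a lightweight stand-in for MNIST, and the helper `reconstruction_mse` is a hypothetical name introduced for this example. It measures how much is lost when the data is compressed and restored.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

images = load_digits().data  # shape (1797, 64): flattened 8x8 images


def reconstruction_mse(data, n_components):
    """Compress with PCA, map back, and return the mean squared error."""
    pca = PCA(n_components=n_components, svd_solver='full')
    reduced = pca.fit_transform(data)
    restored = pca.inverse_transform(reduced)
    return float(np.mean((data - restored) ** 2))


component_counts = [2, 8, 16, 32, 64]
errors = [reconstruction_mse(images, k) for k in component_counts]

for k, err in zip(component_counts, errors):
    print(f"{k:2d} components -> mean squared error {err:.4f}")
```

To inspect a reconstruction visually, a restored row can be reshaped with `reshape(8, 8)` and passed to `plt.imshow`.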


## Useful links
https://scikit-learn.org
https://towardsdatascience.com/introduction-to-principal-component-analysis-pca-with-python-code-69d3fcf19b57
https://towardsdatascience.com/kernel-pca-vs-pca-vs-ica-in-tensorflow-sklearn-60e17eb15a64

In [None]:
from sklearn.datasets import load_digits

In [None]:
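# load_digits is scikit-learn's small 8x8 digits set (64 features),
# used here as a lightweight stand-in for the full 28x28 MNIST data
mnist_dataset = load_digits()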
mnist_df = sklearn_dataset_to_pandas(mnist_dataset)
mnist_df.head(5)

In [None]:
pca_mnist_df, pca_mnist = create_pca(2, mnist_df)
visualize_pca(pca_mnist_df)

In [None]:
apply_kernel_pca(mnist_df, gamma=0.01)

In [None]:
compare_pc(mnist_df)

In [None]:
# Find number of principal components for 99%, 95%, 90%, and 85% of explained variance.
mnist_df.head()
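
One possible way to find these component counts is sketched below, assuming the standardized 8x8 digits data used in the cells above: fit PCA once with all components and look up the first position where the cumulative explained variance reaches each threshold.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_digits().data)

# Cumulative explained variance ratio over all components.
cumulative = np.cumsum(PCA(svd_solver='full').fit(X).explained_variance_ratio_)

thresholds = [0.99, 0.95, 0.90, 0.85]
components_needed = {}
for threshold in thresholds:
    # argmax returns the first index where the condition holds;
    # add 1 to turn the zero-based index into a component count.
    components_needed[threshold] = int(np.argmax(cumulative >= threshold)) + 1

for threshold, count in components_needed.items():
    print(f"{threshold:.0%} of variance: {count} components")
```

scikit-learn can also do this lookup itself: passing a float between 0 and 1 as `n_components` with `svd_solver='full'` keeps just enough components to exceed that share of variance.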
