# Principal Component Analysis - PCA
> Step-by-step use of PCA for dimensionality reduction.

- toc: true 
- badges: true
- comments: true
- categories: [jupyter, Dimensionality, PCA]

In [9]:
from IPython.display import Image

# Background

Data compression is an important topic in machine learning, as it allows us to analyse and interpret large amounts of data. In ML, *feature extraction* techniques allow us to reduce the number of features in a dataset. Unlike feature selection, which keeps a subset of the original features, feature extraction transforms or projects the data onto a new feature space.

Feature extraction can improve the predictive performance of a model by mitigating the *curse of dimensionality*.
PCA exploits the correlation between features: it finds the directions of maximum variance in high-dimensional data and projects the data onto a new subspace with equal or fewer dimensions. The orthogonal axes (principal components) of the new subspace can be interpreted as the directions of maximum variance, given the constraint that the new feature axes are orthogonal to each other [1].

Image(filename='images/GaussianScatterPCA.png')


As described in [1], the main steps behind PCA are:

1) Standardize the d-dimensional dataset.
2) Construct the covariance matrix.
3) Decompose the covariance matrix into its eigenvectors and eigenvalues.
4) Sort the eigenvalues in decreasing order to rank the corresponding eigenvectors.
5) Select k eigenvectors, which correspond to the k largest eigenvalues, where k is the dimensionality of the new feature space ($k \le d$).
6) Construct a projection matrix, W, from the top k eigenvectors.
7) Transform the d-dimensional input dataset, X, using the projection matrix W, to obtain the new k-dimensional feature space.
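
The steps above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not the code from [1]: the helper name `pca_project` and the synthetic array `X_demo` are hypothetical and stand in for a real dataset.

```python
import numpy as np


def pca_project(X, k):
    """Project X (n_samples x d) onto its top-k principal components."""
    # 1) Standardize each feature to zero mean and unit variance
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2) Covariance matrix of the standardized features (d x d)
    cov = np.cov(X_std, rowvar=False)
    # 3) Eigendecomposition; eigh is intended for symmetric matrices
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4) Sort eigenvalues (and matching eigenvectors) in decreasing order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 5-6) Keep the top-k eigenvectors as the projection matrix W (d x k)
    W = eigvecs[:, :k]
    # 7) Project the standardized data onto the k-dimensional subspace
    return X_std @ W, W, eigvals


# Synthetic stand-in data (assumption): 100 samples, 5 features,
# with the second feature made strongly correlated with the first
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
X_demo[:, 1] = 2 * X_demo[:, 0] + 0.1 * X_demo[:, 1]

X_pca, W, eigvals = pca_project(X_demo, k=2)
print(X_pca.shape)  # (100, 2)
```

The columns of `W` are orthonormal, and each eigenvalue equals the sample variance of the data along the corresponding principal component, which is why sorting by eigenvalue ranks the components by how much variance they capture.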




[1] Sebastian Raschka, *Python Machine Learning*.



In [2]:
import pandas as pd
df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/'
                      'machine-learning-databases/wine/wine.data',
                      header=None)

In [3]:
df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue',
                   'OD280/OD315 of diluted wines', 'Proline']

df_wine.head()

| | Class label | Alcohol | Malic acid | Ash | Alcalinity of ash | Magnesium | Total phenols | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |


In [5]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

# The first column (class label) is the target; the remaining columns are the features
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.3, 
                     stratify=y,
                     random_state=0)
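
Step 1 of the list (standardization) would typically follow the split. The sketch below is one possible way to do it, assuming scikit-learn's `StandardScaler`; to stay self-contained it loads the copy of the Wine data bundled with scikit-learn via `load_wine` instead of the UCI URL used above (an assumption, and its class labels are 0 to 2 rather than 1 to 3). The scaler is fitted on the training set only, so no statistics from the test set leak into the preprocessing.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled scikit-learn copy of the Wine data stands in
# for the DataFrame read from the UCI repository
X, y = load_wine(return_X_y=True)

# Same split settings as in the cell above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Estimate mean and standard deviation on the training data only,
# then apply the same transformation to both sets
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)
```

After this step, each training feature has zero mean and unit variance; the test features are scaled with the training statistics, so their means and variances will only be approximately 0 and 1.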
