# Principal Components Analysis

The amount of data generated each day by sources such as scientific experiments, cell phones, and smart watches is growing rapidly. Certain machine learning tasks involve analyzing high-dimensional datasets (i.e., datasets with a large number of features). Examples include image data (where the features could be image pixels) and recommender data (where the features are all of the movies rated by users). Machine learning models can be used to classify or cluster these data in order to predict future events. Yet datasets with a large number of features pose a unique challenge. They add complexity to certain machine learning models (e.g., linear models), and as a result, models trained on them are more prone to error from overfitting. Therefore, to reduce the likelihood of error, it can be helpful to compress the data and describe it using only a few values before applying supervised or unsupervised machine learning models.

Features within a large dataset may vary in a similar way. For example, consider data from a movie rating system where the ratings from different users are instances and the various movies are features (<b>Figure 1</b>).

![Figure 1](https://raw.githubusercontent.com/lesleymaraina/PCA/master/Slide1.png)

In this dataset, you might observe that users who rate <i>Star Wars Episode IV</i> highly also tend to rate <i>Rogue One: A Star Wars Story</i> highly. In other words, the ratings for <i>Star Wars Episode IV</i> are positively correlated with the ratings for <i>Rogue One: A Star Wars Story</i>. One could imagine that all movies rated in a similar way by certain users share a common attribute (e.g., they are all classic sci-fi movies) and could ultimately be grouped to form a new feature (<b>Figure 2</b>).

![Figure 2](https://raw.githubusercontent.com/lesleymaraina/PCA/master/Slide2.png)

This is an intuitive way of grouping data, but it would take quite some time to read through all of the data and group it by similar attributes. Fortunately, there are algorithms that can automatically combine features that vary in a similar way within high-dimensional datasets. These methods are called dimensionality reduction algorithms.

Principal component analysis (PCA) is a dimensionality reduction technique that transforms a high-dimensional dataset into one with fewer variables, chosen so that the resulting variables explain the maximum variance within the dataset[<b>1</b>]. PCA is often used before unsupervised and supervised machine learning steps to reduce the number of features used in the analysis[<b>2</b>]. The overall goal of PCA is to reduce the number of dimensions <b>d</b> (i.e., features) in a dataset by projecting it onto a <b>k</b>-dimensional subspace where <b>k < d</b> [<b>3</b>]. The approach can be summarized as follows:


1.	Standardize the data 
2.	Use the standardized data to generate a covariance matrix (or perform Singular Value Decomposition)
3.	Obtain eigenvectors (principal components) and eigenvalues from the covariance matrix; each eigenvector will have a corresponding eigenvalue
4.	Sort the eigenvalues in descending order
5.	Select the <b>k</b> eigenvectors with the largest eigenvalues;  <b>k</b> is the number of dimensions used in the new feature space (<b>k≤d</b>)
6.	Construct a new matrix with the selected <b>k</b> eigenvectors

[<b>steps 1-6 source: Raschka (2015)</b>]
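
The six steps above can be sketched with plain NumPy on a small synthetic dataset. This is only an illustration under stated assumptions: the toy data, the variable names (`X_toy`, `Z`, `W`), and the choice of `k = 2` are made up for the example and are not part of the MovieLens analysis that follows.

```python
import numpy as np

# Toy data (an assumption for illustration): 200 samples, 5 correlated features
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X_toy = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

# 1. Standardize each feature to mean 0 and variance 1
Z = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# 2. Covariance matrix of the standardized features (d x d)
toy_cov = np.cov(Z, rowvar=False)

# 3. Eigendecomposition; eigh is intended for symmetric matrices
toy_vals, toy_vecs = np.linalg.eigh(toy_cov)

# 4. Sort eigenvalues (and their eigenvectors) in descending order
order = np.argsort(toy_vals)[::-1]
toy_vals, toy_vecs = toy_vals[order], toy_vecs[:, order]

# 5. Keep the k eigenvectors with the largest eigenvalues
k = 2

# 6. Stack them into a d x k projection matrix
W = toy_vecs[:, :k]

# Project the standardized data onto the k-dimensional subspace
Z_reduced = Z @ W
print(Z_reduced.shape)  # (200, 2)
```

Multiplying the standardized data by `W` is what actually moves each sample from the original 5-dimensional space into the new 2-dimensional one.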


The following is an example of PCA. We will use PCA to reduce the number of dimensions in the MovieLens movie ratings dataset. The data used in this tutorial can be found [here](https://movielens.org).

## Part 1: Load and Standardize Data

We’ll load the MovieLens data and store it in a pandas DataFrame. The dataset contains ratings from 718 users for 8913 movies (features). Even though all of the features are measured on the same scale (all ratings are on a 0-5 scale), we still standardize the data by transforming each feature to unit scale (mean = 0 and variance = 1). All NaN values (movies a user has not rated) are converted to 0 first, because PCA can only be applied to complete numerical data[<b>4</b>].

In [13]:
# Load dependencies
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
# Imported under the name PCA because the cells below call PCA(...)
from sklearn.decomposition import PCA

In [17]:
# Load movie names and movie ratings
movies = pd.read_csv('https://raw.githubusercontent.com/lesleymaraina/PCA/master/movies.csv')
ratings = pd.read_csv('https://raw.githubusercontent.com/lesleymaraina/PCA/master/ratings.csv')
ratings = ratings.drop(columns=['timestamp'])

# Replace each movieId with its title so the pivoted columns are readable
def replace_name(x):
    return movies[movies['movieId'] == x].title.values[0]
ratings['movieId'] = ratings['movieId'].map(replace_name)

# Build a user x movie matrix; movies a user has not rated become NaN
M = ratings.pivot_table(index=['userId'], columns=['movieId'], values='rating')
print(M.shape)

# Fill missing ratings with 0, then standardize each movie column
df1 = M.fillna(0)
X_std = StandardScaler().fit_transform(df1)
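
To make the reshaping step easier to follow, here is a minimal sketch of the same pivot, fill, and standardize sequence on a tiny hand-made table. The three users, three movie titles, and their ratings are hypothetical values invented for the example; only the column names mirror the ratings file used above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy ratings in the same long format (one row per rating)
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3, 3],
    'movieId': ['Movie A', 'Movie B', 'Movie A', 'Movie C', 'Movie B', 'Movie C'],
    'rating':  [5.0, 3.0, 4.0, 2.0, 1.0, 5.0],
})

# One row per user, one column per movie; unrated movies become NaN
toy_M = toy_ratings.pivot_table(index='userId', columns='movieId', values='rating')

# Replace missing ratings with 0, then standardize each movie column
toy_filled = toy_M.fillna(0)
toy_std = StandardScaler().fit_transform(toy_filled)

print(toy_M.shape)           # (3, 3)
print(toy_std.mean(axis=0))  # approximately 0 for every movie
print(toy_std.std(axis=0))   # approximately 1 for every movie
```

Note that filling with 0 treats an unrated movie as if it had received the lowest possible score, which is a modeling choice rather than a neutral default.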

## Part 2: Covariance Matrix and Eigendecomposition

Next, a covariance matrix is created from the standardized data. The covariance matrix represents the covariance between each pair of features in the original dataset[<b>3</b>]. The covariance matrix can be found as follows:

In [None]:
#Create a covariance matrix
mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot((X_std - mean_vec)) / (X_std.shape[0]-1)
print('Covariance matrix \n%s' %cov_mat)

#Create the same covariance matrix with 1 line of code
print('NumPy covariance matrix: \n%s' %np.cov(X_std.T))

Covariance matrix 
[[ 1.0013947  -0.00276421 -0.00195661 ..., -0.00858289 -0.00321221
  -0.01055463]
 [-0.00276421  1.0013947  -0.00197311 ...,  0.14004611 -0.0032393
  -0.01064364]
 [-0.00195661 -0.00197311  1.0013947  ..., -0.00612653 -0.0022929
  -0.00753398]
 ..., 
 [-0.00858289  0.14004611 -0.00612653 ...,  1.0013947   0.02888777
   0.14005644]
 [-0.00321221 -0.0032393  -0.0022929  ...,  0.02888777  1.0013947
   0.01676203]
 [-0.01055463 -0.01064364 -0.00753398 ...,  0.14005644  0.01676203
   1.0013947 ]]
NumPy covariance matrix: 
[[ 1.0013947  -0.00276421 -0.00195661 ..., -0.00858289 -0.00321221
  -0.01055463]
 [-0.00276421  1.0013947  -0.00197311 ...,  0.14004611 -0.0032393
  -0.01064364]
 [-0.00195661 -0.00197311  1.0013947  ..., -0.00612653 -0.0022929
  -0.00753398]
 ..., 
 [-0.00858289  0.14004611 -0.00612653 ...,  1.0013947   0.02888777
   0.14005644]
 [-0.00321221 -0.0032393  -0.0022929  ...,  0.02888777  1.0013947
   0.01676203]
 [-0.01055463 -0.01064364 -0.00753398 ...,  
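
The diagonal entries in the output above are 1.0013947 rather than exactly 1. A likely explanation is that `StandardScaler` scales by the population standard deviation (dividing by n), while `np.cov` divides by n - 1, so each diagonal entry becomes n / (n - 1). The sketch below checks this on random data; the 4-feature random matrix is an assumption made for illustration, and only the row count of 718 is taken from the text.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n_samples = 718                          # same number of users as in the text
toy = rng.normal(size=(n_samples, 4))    # hypothetical data with 4 features

toy_std = StandardScaler().fit_transform(toy)
toy_cov = np.cov(toy_std.T)

print(np.diag(toy_cov))                  # every entry equals n / (n - 1)
print(n_samples / (n_samples - 1))
```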

After the covariance matrix is generated, eigendecomposition is performed on it. Eigenvectors and eigenvalues are found as a result of the eigendecomposition. Each eigenvector has a corresponding eigenvalue, and the sum of the eigenvalues represents all of the variance within the dataset. The eigendecomposition can be completed as follows:

In [None]:
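# Perform eigendecomposition on the covariance matrix
cov_mat = np.cov(X_std.T)
# The covariance matrix is symmetric, so eigh applies; it returns real
# eigenvalues in ascending order
eig_vals, eig_vecs = np.linalg.eigh(cov_mat)
print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)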
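
The claim that the eigenvalues add up to the total variance can be checked numerically. The following sketch uses a small random matrix (an assumption for illustration, not the movie data) and compares the sum of the eigenvalues with the trace of the covariance matrix, which is the sum of the per-feature variances.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: 100 samples, 3 correlated features
toy = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))
toy_cov = np.cov(toy.T)

toy_vals, toy_vecs = np.linalg.eigh(toy_cov)

# Sum of eigenvalues equals the sum of the feature variances (the trace)
total_variance = np.trace(toy_cov)
print(np.isclose(toy_vals.sum(), total_variance))    # True

# Each eigenpair satisfies: cov @ v = eigenvalue * v
v0 = toy_vecs[:, 0]
print(np.allclose(toy_cov @ v0, toy_vals[0] * v0))   # True

# Share of the total variance carried by each component, largest first
explained = toy_vals[::-1] / total_variance
print(explained)
```

Dividing each eigenvalue by the total in this way gives the same kind of quantity that scikit-learn reports as `explained_variance_ratio_`.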

## Part 3: Selecting Principal Components

Eigenvectors, or principal components, are normalized linear combinations of the features in the original dataset[<b>4</b>]. The first principal component captures the most variance in the original variables, and the second component captures the second highest variance within the dataset. For example, if you were to plot a dataset that contains 2 features, principal component 1 (PC1) represents the direction of the most variation between the 2 features, and PC2 represents the direction of the second most variation (<b>Figure 3</b>). Our movies data contains over 8000 features and would be difficult to visualize, which is why we used eigendecomposition to generate the eigenvectors.

![Figure 3](https://raw.githubusercontent.com/lesleymaraina/PCA/master/Slide3.png)
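
The two-feature picture can also be reproduced in code. This is a minimal sketch on synthetic data (the two correlated features `x` and `y` are assumptions invented for the example): PC1 should point roughly along the diagonal where the points spread the most, and PC2 should be perpendicular to it.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Two hypothetical, strongly correlated features
x = rng.normal(size=500)
y = x + 0.3 * rng.normal(size=500)
toy_2d = np.column_stack([x, y])

toy_pca = PCA(n_components=2).fit(toy_2d)
pc1, pc2 = toy_pca.components_

print(pc1)                                # roughly along the diagonal, up to sign
print(toy_pca.explained_variance_ratio_)  # PC1 carries most of the variance
print(np.isclose(pc1 @ pc2, 0.0))         # True: the components are orthogonal
```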

The eigenvectors with the lowest eigenvalues describe the least amount of variation within the dataset. Therefore, these eigenvectors can be dropped. First, let's order the eigenvalues in descending order:

In [None]:
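# Pair each eigenvalue with its eigenvector, then sort by decreasing eigenvalue
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])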

To get a better idea of how principal components describe the variance in the data, we will look at the explained variance ratio of the first 2 principal components. 

In [None]:
# Fit on the zero-filled ratings matrix (df1) and report the variance ratios
pca = PCA(n_components=2)
pca.fit(df1)
print(pca.explained_variance_ratio_)

The first 2 principal components describe ~14% of the variance in the data. To gain a more comprehensive idea of how each principal component explains the variance within the data, we will construct a scree plot. A scree plot displays the variance explained by each principal component within the analysis[<b>5</b>]. The version below plots the cumulative explained variance.

In [None]:
#Explained variance
pca = PCA().fit(X_std)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')
plt.show()

Our scree plot shows that the first 480 principal components describe most of the variation (information) within the data. This is a major reduction from the initial 8913 features. Therefore, the first 480 eigenvectors should be used to construct the dimensions for the new feature space.
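
Instead of reading the cutoff from the plot, the number of components can also be picked programmatically from the cumulative explained variance. The sketch below is one way to do that under stated assumptions: the synthetic data and the 0.95 threshold are hypothetical choices for illustration and are not the criterion used to arrive at 480 above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# Hypothetical data: 300 samples, 50 features driven by 10 underlying factors
signal = rng.normal(size=(300, 10)) @ rng.normal(size=(10, 50))
toy = signal + 0.05 * rng.normal(size=(300, 50))

# Cumulative share of variance explained by the first 1, 2, 3, ... components
cumulative = np.cumsum(PCA().fit(toy).explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches the threshold
threshold = 0.95  # assumed cutoff, chosen only for this example
k = int(np.searchsorted(cumulative, threshold) + 1)

# Project the data onto the first k principal components
toy_reduced = PCA(n_components=k).fit_transform(toy)
print(k, toy_reduced.shape)
```

scikit-learn's `PCA` also accepts a fraction between 0 and 1 as `n_components`, which selects the number of components in a similar way.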

##### References:
1.	http://scikit-learn.org/stable/modules/decomposition.html  
2.	http://scikit-learn.org/stable/modules/unsupervised_reduction.html
3.	Raschka, Sebastian. Principal Component Analysis in 3 Simple Steps, (2015). http://sebastianraschka.com/Articles/2015_pca_in_3_steps.html 
4.	http://www.analyticsvidhya.com/blog/2016/03/practical-guide-principal-component-analysis-python/ 
5.	http://support.minitab.com/en-us/minitab/17/topic-library/modeling-statistics/multivariate/principal-components-and-factor-analysis/what-is-a-scree-plot/ 