# Principal component analysis
Machine learning generally works best when the dataset provided for training is large and clean. Having a good amount of data usually lets us build a better predictive model, since there is more data to train on. However, using a large dataset has its own pitfalls, and the biggest is the curse of dimensionality.

It turns out that high-dimensional datasets often contain inconsistent or redundant features, which only increase computation time and make data processing and exploratory data analysis (EDA) more convoluted.

To get around the curse of dimensionality, a process called dimensionality reduction was introduced. Dimensionality reduction techniques keep only a limited number of significant features for training, and this is where PCA comes in.

## What is PCA?
Principal Component Analysis (PCA) is a dimensionality reduction technique that identifies correlations and patterns in a data set so that it can be transformed into a data set of significantly lower dimension while retaining most of the important information.

## Step-by-Step Computation of PCA
The following steps perform dimensionality reduction using PCA:

- Standardization of the data
- Computing the covariance matrix
- Calculating the eigenvectors and eigenvalues
- Computing the Principal Components
- Reducing the dimensions of the data set

Here are all the steps in detail: 

### Step 1: Standardization of the data 
Standardization is all about scaling your data in such a way that all the variables and their values lie within a similar range. 

Consider an example: say we have 2 variables in our data set, one with values ranging from 0 to 100 and the other with values from 1,000 to 5,000. In such a scenario, the output calculated from these predictor variables is going to be biased, since the variable with the larger range will have a disproportionate impact on the outcome.

Therefore, standardizing the data into a comparable range is very important. Standardization is carried out by subtracting the mean of a variable from each of its values and dividing the result by the variable's standard deviation.

It can be calculated like so:

$$Z = \frac{\text{Variable value} - \text{mean}}{\text{Standard deviation}}$$

After this step, all the variables in the data are on a standard, comparable scale.
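
As a minimal sketch of the formula above, the snippet below standardizes a small, made-up two-column array with NumPy. The array `X` and its values are hypothetical and chosen only to mirror the 0 - 100 and 1,000 - 5,000 ranges from the example.

```python
import numpy as np

# Hypothetical toy data: column 0 ranges over 0-100, column 1 over 1,000-5,000.
X = np.array([
    [10.0, 1000.0],
    [40.0, 2500.0],
    [70.0, 4000.0],
    [100.0, 5000.0],
])

# Z = (value - mean) / standard deviation, computed separately for each column.
mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)
Z = (X - mean) / std

print(Z.mean(axis=0))  # approximately 0 for each column
print(Z.std(axis=0))   # 1 for each column
```

After standardization both columns have mean 0 and standard deviation 1, so neither dominates purely because of its original range.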

### Step 2: Computing the covariance matrix
As mentioned earlier, PCA helps to identify the correlations and dependencies among the features in a data set. A covariance matrix expresses how each pair of variables in the data set varies together. It is essential to identify heavily dependent variables because they carry redundant information, which can reduce the overall performance of the model.

Mathematically, a covariance matrix is a p x p matrix, where p represents the dimensions of the data set. Each entry in the matrix represents the covariance of the corresponding variables. 
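
For reference, the sample covariance between two variables a and b over n observations can be written as shown here. The symbols $a_i$, $b_i$ (individual observations) and $\bar{a}$, $\bar{b}$ (means) are notation introduced for illustration; the $n-1$ divisor is the one used in the Python computation later in this article.

$$Cov(a,b) = \frac{1}{n-1}\sum_{i=1}^{n}(a_i - \bar{a})(b_i - \bar{b})$$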

Consider a case where we have a 2D data set with variables a and b. The covariance matrix is then a 2 x 2 matrix, as shown below:

$$\begin{bmatrix}
Cov(a,a) & Cov(a,b) \\
Cov(b,a) & Cov(b,b) 
\end{bmatrix}$$

In the above matrix:
- Cov(a,a) represents the covariance of a variable with itself, which is simply the variance of the variable 'a'
- Cov(a,b) represents the covariance of the variable 'a' with the variable 'b'. Since covariance is symmetric, Cov(a,b) = Cov(b,a)

Here are the key takeaways from the covariance matrix:

- A covariance value denotes how two variables vary with respect to each other
- A negative covariance denotes that the two variables tend to move in opposite directions: as one increases, the other tends to decrease
- A positive covariance denotes that the two variables tend to increase or decrease together
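
The points above can be checked on a small example. The sketch below uses three made-up variables (`a`, `b`, `c` are hypothetical) and `np.cov`, which treats each row as one variable and divides by n - 1 by default.

```python
import numpy as np

# Hypothetical toy variables: b tends to rise with a, c tends to fall as a rises.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 4.0, 5.0, 4.0, 6.0])
c = np.array([9.0, 7.0, 6.0, 4.0, 1.0])

# Each row passed to np.cov is one variable, so the result is a 3 x 3 matrix.
cov = np.cov(np.vstack([a, b, c]))
print(cov)

# The matrix is symmetric, and its diagonal holds the variances.
print(np.allclose(cov, cov.T))
print(np.isclose(cov[0, 0], np.var(a, ddof=1)))
```

Here `cov[0, 1]` is positive (a and b move together) and `cov[0, 2]` is negative (a and c move in opposite directions).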

### Step 3: Calculating the Eigenvectors and Eigenvalues
Eigenvectors and eigenvalues are the mathematical constructs that must be computed from the covariance matrix in order to determine the principal components of the data set. But first, let's understand more about principal components.

### What are principal components?
Simply put, principal components are a new set of variables obtained from the initial set of variables. They are computed in such a manner that the newly obtained variables are highly significant and uncorrelated with each other. The principal components compress and retain most of the useful information that was scattered among the initial variables.

If your data set has 5 dimensions, then 5 principal components are computed, such that the first principal component captures the maximum possible information (variance), the second captures the maximum of what remains, and so on.

Now, where do Eigenvectors fall into this whole process?

Assuming a basic understanding of eigenvectors and eigenvalues, we know that these two algebraic formulations are always computed as a pair, i.e., for every eigenvector there is an eigenvalue. The number of dimensions in the data determines the number of eigenvectors that you need to calculate.

Consider a 2-dimensional data set, for which 2 eigenvectors (and their respective eigenvalues) are computed. The idea behind eigenvectors is to use the covariance matrix to understand where in the data there is the most variance. Since more variance in the data denotes more information about the data, eigenvectors are used to identify and compute the principal components.

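Eigenvalues, on the other hand, are the scalars associated with the respective eigenvectors: each eigenvalue gives the amount of variance in the data along the direction of its eigenvector. Together, the eigenvectors and eigenvalues determine the principal components of the data set.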
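
The pairing of eigenvectors and eigenvalues can be checked numerically. The following is a minimal sketch on a made-up 2 x 2 covariance matrix (the values in `cov` are hypothetical). It uses `np.linalg.eigh`, which is intended for symmetric matrices such as a covariance matrix.

```python
import numpy as np

# Hypothetical 2 x 2 covariance matrix for two positively related variables.
cov = np.array([[2.0, 1.2],
                [1.2, 1.0]])

# eigh returns real eigenvalues in ascending order and the matching
# eigenvectors as the columns of eig_vecs.
eig_vals, eig_vecs = np.linalg.eigh(cov)

for i in range(len(eig_vals)):
    v = eig_vecs[:, i]
    # Defining property of an eigenpair: cov @ v equals eigenvalue * v.
    print(eig_vals[i], np.allclose(cov @ v, eig_vals[i] * v))

# The eigenvalues add up to the total variance (the trace of the matrix).
print(eig_vals.sum(), np.trace(cov))
```

Because the eigenvalues sum to the total variance, each one can be read as the share of variance that lies along its eigenvector.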

### Step 4: Computing the Principal Components
Once we have computed the eigenvectors and eigenvalues, all we have to do is sort them in descending order of eigenvalue, where the eigenvector with the highest eigenvalue is the most significant and thus forms the first principal component. The principal components of lesser significance can then be removed in order to reduce the dimensions of the data.

The final step in computing the principal components is to form a matrix known as the feature matrix, whose columns are the eigenvectors of the components that were kept, that is, those that carry the most information about the data.
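
A short sketch of the ordering and selection described in this step follows. The eigenvalues, the placeholder eigenvectors and the choice `k = 2` are all hypothetical; in practice they come from the covariance matrix of your own data.

```python
import numpy as np

# Hypothetical eigenvalues and eigenvectors (as columns) for a 3-feature data set.
eig_vals = np.array([0.3, 2.1, 0.6])
eig_vecs = np.eye(3)  # placeholder orthonormal eigenvectors

# Indices that order the eigenvalues from largest to smallest.
order = np.argsort(eig_vals)[::-1]
eig_vals_sorted = eig_vals[order]
eig_vecs_sorted = eig_vecs[:, order]

# Share of the total variance carried by each component.
explained_ratio = eig_vals_sorted / eig_vals_sorted.sum()

# Keep the top k eigenvectors as the columns of the feature matrix.
k = 2
feature_matrix = eig_vecs_sorted[:, :k]

print(explained_ratio)
print(feature_matrix.shape)  # (3, 2)
```

Sorting the eigenvalues and eigenvectors with the same index array keeps each pair together.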

### Step 5: Reducing the dimensions of the data set
The last step in performing PCA is to re-express the original data in terms of the final principal components, which represent the most significant information in the data set. In order to replace the original data axes with the newly formed principal components, you simply multiply the transpose of the feature matrix by the transpose of the standardized original data set.
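
The projection can be sketched as follows on randomly generated data. The data, the seed and the choice of two components are assumptions made for illustration. The snippet computes the transposed product and the equivalent sample-per-row form, and shows that they agree.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 samples, 3 features, with the second feature tied to the first.
X = rng.normal(size=(100, 3))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]

# Standardize, then take the eigenvectors of the covariance matrix.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
eig_vals, eig_vecs = np.linalg.eigh(np.cov(X_std.T))
order = np.argsort(eig_vals)[::-1]
feature_matrix = eig_vecs[:, order[:2]]   # shape (3, 2): top two eigenvectors as columns

# Transposed form: rows are components, columns are samples.
projected_T = feature_matrix.T @ X_std.T  # shape (2, 100)
# Equivalent form with one sample per row.
projected = X_std @ feature_matrix        # shape (100, 2)

print(projected.shape)
print(np.allclose(projected, projected_T.T))
```

The sample-per-row form is usually more convenient in Python, because libraries such as pandas and scikit-learn expect one observation per row.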

So that was the theory behind the entire PCA process. It's time to get our hands dirty and perform all these steps by using a real data set. 

## Principal Component Analysis Using Python
In this section, we will be performing PCA by using Python.

#### Problem statement: 
To perform step-by-step Principal Component Analysis in order to reduce the dimension of the data set.

#### Data Set Description:
Movies rating data set that contains ratings from 700+ users for approximately 9,000 movies (features).

#### Logic:
Perform PCA by finding the most significant features in the data. PCA will be performed by following the steps defined above.

#### Step 1: Import required packages

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA as sklearnPCA

#### Step 2: Import data set

In [12]:
#Load movie names and movie ratings
movies = pd.read_csv('~/Documents/GitHub/professional-development/PCA/movies.csv')
ratings = pd.read_csv('~/Documents/GitHub/professional-development/PCA/ratings.csv')
ratings.drop(['timestamp'], axis=1, inplace=True)

#### Step 3: Formatting the data

In [13]:
movies.head()

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [14]:
movies.columns.to_list()

['movieId', 'title', 'genres']

In [15]:
ratings.head()

| | userId | movieId | rating |
|---|---|---|---|
| 0 | 1 | 1 | 4.0 |
| 1 | 1 | 3 | 4.0 |
| 2 | 1 | 6 | 4.0 |
| 3 | 1 | 47 | 5.0 |
| 4 | 1 | 50 | 5.0 |


In [16]:
def replace_name(x):
    return movies[movies['movieId'] == x].title.values[0]

In [17]:
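# Replace each movieId with its title, then build a user x movie rating matrix
ratings.movieId = ratings.movieId.map(replace_name)
M = ratings.pivot_table(index=['userId'], columns=['movieId'], values='rating')
# Movies a user has not rated are NaN; treat them as 0
df1 = M.fillna(0)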
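
To make the reshaping step easier to follow, here is a minimal sketch on made-up miniature tables. The names `movies_toy`, `ratings_toy`, `id_to_title` and the values in them are hypothetical. The sketch also maps IDs to titles through a lookup Series, which is one possible alternative to filtering the movies table once per rating.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature versions of the movies and ratings tables.
movies_toy = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Movie A', 'Movie B', 'Movie C'],
})
ratings_toy = pd.DataFrame({
    'userId': [1, 1, 2, 3],
    'movieId': [1, 3, 2, 1],
    'rating': [4.0, 5.0, 3.0, 2.0],
})

# Map each movieId to its title through a lookup Series.
id_to_title = movies_toy.set_index('movieId')['title']
ratings_toy['title'] = ratings_toy['movieId'].map(id_to_title)

# One row per user, one column per movie; unrated movies become NaN.
user_movie = ratings_toy.pivot_table(index='userId', columns='title', values='rating')

# Treat missing ratings as 0 so that every cell holds a number.
user_movie_filled = user_movie.fillna(0)
print(user_movie_filled)
```

Note that filling with 0 treats "not rated" the same as a rating of 0, which is a simplification worth keeping in mind when interpreting the components.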

In [18]:
# Standardize every movie column to zero mean and unit variance
X_std = StandardScaler().fit_transform(df1)

In [19]:
# Covariance matrix of the standardized data (sample covariance, dividing by n - 1)
mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot((X_std - mean_vec)) / (X_std.shape[0]-1)
print('Covariance matrix \n%s' %cov_mat)

Covariance matrix
[[ 1.00164204 -0.00164473 -0.00232791 ...  0.32582147 -0.00819887
  -0.00164473]
 [-0.00164473  1.00164204  0.70768614 ... -0.00360024 -0.00819887
  -0.00164473]
 [-0.00232791  0.70768614  1.00164204 ... -0.00509569 -0.01160448
  -0.00232791]
 ...
 [ 0.32582147 -0.00360024 -0.00509569 ...  1.00164204 -0.01794692
  -0.00360024]
 [-0.00819887 -0.00819887 -0.01160448 ... -0.01794692  1.00164204
  -0.00819887]
 [-0.00164473 -0.00164473 -0.00232791 ... -0.00360024 -0.00819887
   1.00164204]]


In [None]:
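# Calculate the eigenvectors and eigenvalues of the covariance matrix.
# eigh is used because a covariance matrix is symmetric; it returns real eigenvalues in ascending order.
cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eigh(cov_mat)
print('Eigenvectors \n%s' %eig_vecs)
print('\nEigenvalues \n%s' %eig_vals)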

In [None]:
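# Pair each eigenvalue with its eigenvector and sort by decreasing eigenvalue
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
eig_pairs.sort(key=lambda pair: pair[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])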

In [None]:
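# PCA was imported as sklearnPCA above, so use that name here.
# Note: df1 is the unstandardized matrix; scikit-learn's PCA centers the data but does not scale it.
# Pass X_std instead to match the manual steps.
pca = sklearnPCA(n_components=2)
pca.fit_transform(df1)
print(pca.explained_variance_ratio_)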
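
As a sanity check, the sketch below compares the manual route (eigenvalues of the covariance matrix) with scikit-learn's `explained_variance_ratio_`. It uses randomly generated data rather than the movie ratings. The data, the seed and the 95% threshold are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical data: 200 samples, 5 features, two of them built from the first.
X = rng.normal(size=(200, 5))
X[:, 3] = X[:, 0] + 0.1 * X[:, 3]
X[:, 4] = -X[:, 0] + 0.1 * X[:, 4]

X_std = StandardScaler().fit_transform(X)

# Manual route: eigenvalues of the covariance matrix, largest first.
eig_vals = np.linalg.eigvalsh(np.cov(X_std.T))[::-1]
manual_ratio = eig_vals / eig_vals.sum()

# scikit-learn route on the same standardized data.
pca = PCA(n_components=5).fit(X_std)
print(np.allclose(manual_ratio, pca.explained_variance_ratio_))

# Smallest number of components whose cumulative share reaches 95%.
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.argmax(cumulative >= 0.95)) + 1
print(n_keep)
```

Looking at the cumulative explained variance is one common way to decide how many components to keep, instead of fixing `n_components` in advance.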