# Principal Component Analysis (PCA)

Topics:
1. Need for PCA
2. What is PCA?
3. Step by step computation of PCA
4. PCA with Python

***

## 1. Need for PCA

![](pca.PNG)

High-dimensional data is hard to process: redundant and inconsistent features increase computation time and make data processing and exploratory data analysis (EDA) more convoluted.

High-dimensional data is common in areas such as image processing, NLP and image translation. To get around this curse of dimensionality we use a process called __dimensionality reduction__. Dimensionality reduction techniques keep only a limited number of significant features, the ones needed to train your ML model. This is where PCA comes into the picture: with fewer features, models are faster to train and the data is easier to explore.

## 2. What is PCA?

__Principal Component Analysis (PCA)__ is a dimensionality reduction technique that uses the correlations and patterns in a dataset to transform it into a dataset of significantly lower dimension while losing as little important information as possible.

![](pca2.PNG)

When performing dimensionality reduction using PCA or any other method, the significant information must be retained in the new dataset. Basically, we are narrowing the variables of the original dataset down to a smaller final dataset that still holds the important information.


## 3. Step by step computation of PCA

![](pca3.PNG)

#### Step 1: Standardization of the data

Standardization means scaling your data so that all the variables and their values lie within a similar range.

Skipping standardization biases the outcome: the variable with the larger values dominates (e.g. salary compared with age), because PCA looks for directions of maximum variance and a larger scale means a larger variance. The output will then not be accurate.

$$ z = \frac{x - \mu}{\sigma} $$
where,  
$\mu$ - mean  
$\sigma$ - standard deviation  
$x$ - variable value  
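
As a minimal sketch of the formula above, the snippet below standardizes a small made-up dataset with NumPy. The data (age, salary, years of experience) and the names `X`, `mu`, `sigma` and `Z` are assumptions chosen for illustration only.

```python
import numpy as np

# Hypothetical toy data: one row per person, columns = [age, salary, experience].
X = np.array([
    [25, 30000,  2],
    [32, 48000,  7],
    [47, 85000, 20],
    [51, 62000, 25],
    [38, 54000, 12],
    [29, 41000,  4],
], dtype=float)

mu = X.mean(axis=0)      # mean of every column
sigma = X.std(axis=0)    # standard deviation of every column (population, ddof=0)
Z = (X - mu) / sigma     # z = (x - mu) / sigma, applied column by column

print(Z.mean(axis=0))    # close to 0 for every column
print(Z.std(axis=0))     # close to 1 for every column
```

After this step salary no longer dominates age just because it is measured in larger numbers.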



#### Step 2. Compute the covariance matrix

A covariance matrix is a square matrix that expresses how each pair of variables in the dataset varies together. When the data has been standardized first, its entries are the correlations between the variables (up to the choice of dividing by $n$ or $n-1$). It is essential to identify heavily dependent variables because they contain redundant information, which reduces the overall performance of the model.

![](pca4.PNG)
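
The covariance matrix of standardized data can be sketched as follows. The toy data and variable names are hypothetical, and dividing by `n` (rather than `n - 1`) is an assumption made here so that the result matches the correlation matrix exactly.

```python
import numpy as np

# Hypothetical toy data: columns = [age, salary, experience].
X = np.array([
    [25, 30000,  2],
    [32, 48000,  7],
    [47, 85000, 20],
    [51, 62000, 25],
    [38, 54000, 12],
    [29, 41000,  4],
], dtype=float)

# Step 1: standardize every column.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix. The columns of Z have zero mean,
# so the covariance is simply Z^T Z divided by the number of rows.
n = Z.shape[0]
C = (Z.T @ Z) / n

print(C.shape)   # square: one row and one column per variable
print(C)         # diagonal = variances, off-diagonal = covariances
```

Large off-diagonal values point to pairs of variables that carry overlapping information.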

#### Step 3. Calculating the eigenvalues and eigenvectors

__Eigenvectors and eigenvalues__ are the mathematical constructs that must be computed from the covariance matrix in order to determine the __principal components__ of the dataset.

__Principal components__ are the new set of variables obtained from the initial set of variables. Each one is a linear combination of the original variables, and together they compress most of the useful information that was scattered among the initial variables. Principal components are uncorrelated with each other, and the first few carry most of the variance.


Example: say you initially had 50 variables and formed 5 principal components from them. This means that most of the necessary information scattered across these 50 variables can be represented using just these 5 variables (the principal components). So you need to narrow down the principal components in such a way that you don't lose important information. (Redundant information can lower your accuracy and performance.)

If your data has, say, 5 variables, you first form 5 principal components, so you start with the same dimension as your dataset. The 1st principal component PC1 is formed so that it stores the maximum possible information, the 2nd principal component PC2 stores the maximum of the remaining information, and so on. Hence PC3 is the 3rd most significant variable, PC4 the 4th most significant, and so on.

$$ \mathrm{Var}(PC_1) \ge \mathrm{Var}(PC_2) \ge \mathrm{Var}(PC_3) \ge \mathrm{Var}(PC_4) \ge \dots $$


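Important:
- An eigenvector of a linear transformation is a non-zero vector that stays on the same line when the transformation is applied to it; it is only stretched, shrunk or flipped.
- An eigenvalue is the scalar by which its eigenvector is scaled: for a matrix $A$ with eigenvector $v$ and eigenvalue $\lambda$, $Av = \lambda v$.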


Eigenvalues and eigenvectors always come in pairs, i.e. for every eigenvector there is an eigenvalue. The dimension of the data tells you how many eigenvectors to calculate: a dataset with $p$ variables has a $p \times p$ covariance matrix and therefore $p$ eigenvector/eigenvalue pairs.

The idea behind the eigenvectors is to find where in the data we have the most variance, since the covariance matrix describes the overall variance in your data. More variance along a direction means more information about the data along it. The eigenvectors of the covariance matrix give the directions of maximum variance, and each eigenvalue gives the amount of variance along its eigenvector. This is the idea behind PCA, since we want to keep the maximum information. The principal components are then arranged in descending order of information, i.e. of eigenvalue.
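
A minimal sketch of this step, assuming the same hypothetical toy data as a starting point: `np.linalg.eigh` is used because a covariance matrix is symmetric. It returns the eigenvalues in ascending order and the eigenvectors as the columns of a matrix.

```python
import numpy as np

# Hypothetical toy data: columns = [age, salary, experience].
X = np.array([
    [25, 30000,  2],
    [32, 48000,  7],
    [47, 85000, 20],
    [51, 62000, 25],
    [38, 54000, 12],
    [29, 41000,  4],
], dtype=float)

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # Step 1: standardize
C = (Z.T @ Z) / Z.shape[0]                 # Step 2: covariance matrix

# Step 3: eigen decomposition of the symmetric matrix C.
# eigenvectors[:, i] is the eigenvector paired with eigenvalues[i].
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Check the defining property C v = lambda v for every pair.
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(C @ v, lam * v)

print(eigenvalues)         # one eigenvalue per original variable
print(eigenvalues.sum())   # total variance, equal to the trace of C
```

Each eigenvalue is the variance of the data along its eigenvector, which is why the eigenvalues add up to the total variance.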


#### Step 4. Computing the Principal components

Once we have computed the eigenvectors and eigenvalues, all we have to do is sort them in descending order of eigenvalue: the eigenvector with the highest eigenvalue is the most significant and forms the first principal component. This gives an ordered list of principal components, from which we can drop the less significant ones to reduce the dimension of the data.

![](var_pca.PNG)


In this step we create a matrix known as the __feature matrix__, whose columns are the eigenvectors of the significant principal components we keep, i.e. the ones that hold the most information about the data.
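
The sorting and selection can be sketched like this. The toy data, the choice `k = 2` and the names `explained` and `W` are assumptions for illustration; in practice `k` is usually picked by looking at how much variance each component explains.

```python
import numpy as np

# Hypothetical toy data: columns = [age, salary, experience].
X = np.array([
    [25, 30000,  2],
    [32, 48000,  7],
    [47, 85000, 20],
    [51, 62000, 25],
    [38, 54000, 12],
    [29, 41000,  4],
], dtype=float)

Z = (X - X.mean(axis=0)) / X.std(axis=0)        # Step 1: standardize
C = (Z.T @ Z) / Z.shape[0]                      # Step 2: covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(C)   # Step 3: eigen decomposition

# Step 4: sort the pairs in descending order of eigenvalue.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of the total variance explained by each principal component.
explained = eigenvalues / eigenvalues.sum()
print(explained)
print(np.cumsum(explained))   # running total, useful for choosing k

# Keep the k most significant components (k = 2 is an arbitrary choice here).
k = 2
W = eigenvectors[:, :k]       # feature matrix: one column per kept component
print(W.shape)
```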

#### Step 5: Reducing the dimensions of the dataset

The last step in performing PCA is to re-express the original (standardized) data in terms of the final principal components, which represent the most significant information of the dataset. In practice this means multiplying the standardized data by the feature matrix.

So the original dataset is narrowed down to a reduced dataset that contains only the most important information.

![](stp5.PNG)
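
Putting the five steps together, a minimal NumPy sketch of the projection might look like the following. The toy data, `k = 2` and all variable names are hypothetical, and this is only one way to arrange the computation.

```python
import numpy as np

# Hypothetical toy data: columns = [age, salary, experience].
X = np.array([
    [25, 30000,  2],
    [32, 48000,  7],
    [47, 85000, 20],
    [51, 62000, 25],
    [38, 54000, 12],
    [29, 41000,  4],
], dtype=float)

Z = (X - X.mean(axis=0)) / X.std(axis=0)        # Step 1: standardize
n = Z.shape[0]
C = (Z.T @ Z) / n                               # Step 2: covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(C)   # Step 3: eigen decomposition

order = np.argsort(eigenvalues)[::-1]           # Step 4: sort, keep top k
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]
k = 2
W = eigenvectors[:, :k]                         # feature matrix

# Step 5: project the standardized data onto the kept components.
Z_reduced = Z @ W

print(Z.shape)           # (rows, original variables)
print(Z_reduced.shape)   # (rows, k)

# The new variables are uncorrelated: their covariance matrix is diagonal,
# with the kept eigenvalues on the diagonal.
print(np.round((Z_reduced.T @ Z_reduced) / n, 6))
```

Each row of `Z_reduced` is the same observation as before, now described by `k` principal components instead of the original variables.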


### [PCA using Python](https://www.analyticsvidhya.com/blog/2016/03/pca-practical-guide-principal-component-analysis-python/)

### [Difference between PCA and factor analysis](https://discuss.analyticsvidhya.com/t/what-is-difference-between-factor-analysis-and-principal-component-analysis/2991/2)