**Principal component analysis (PCA)** is one of the most commonly used dimensionality reduction techniques in the industry. By converting a large data set into a smaller one with fewer variables, it helps with improving model performance, visualising complex data sets, and more.

**Why PCA**

A couple of situations where having a lot of features poses problems are as follows:

1. The predictive model setup: Having a lot of correlated features leads to the multicollinearity problem. Iteratively removing features is time-consuming and also leads to some information loss.
2. Data visualisation: It is not possible to visualise more than two variables at the same time using a 2-D plot. Therefore, finding relationships between the observations in a data set with several variables through visualisation is quite difficult.

**Benefits of PCA**

1. For data visualisation and EDA

2. For creating uncorrelated features that can be input to a prediction model:  With a smaller number of uncorrelated features, the modelling process is faster and more stable as well.

3. Finding latent themes in the data: If you have a data set containing the ratings given to different movies by Netflix users, PCA would be able to find latent themes like genre and, consequently, the ratings that users give to a particular genre.

4. Noise reduction

**What is PCA**

PCA is fundamentally a dimensionality reduction technique; it transforms a data set into one with fewer variables.

In simple terms, dimensionality reduction is the exercise of dropping the unnecessary variables, i.e., the ones that add no useful information. Now, this is something that you must have done in the previous modules. In EDA, you dropped columns that had a lot of nulls or duplicate values, and so on. In linear and logistic regression, you dropped columns based on their p-values and VIF scores in the feature elimination step.

Similarly, PCA transforms the data by creating new features from the old ones, in such a way that it becomes easier to decide which features to keep and which to discard.

PCA is a statistical procedure to convert observations of possibly correlated variables to ‘principal components’ such that:

1. They are uncorrelated with each other.
2. They are linear combinations of the original variables.
3. They help in capturing maximum information in the data set.
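These three properties can be written compactly. As a sketch, assume $x$ is a centred observation with $p$ original variables and $w_k$ is the unit-length weight vector of the $k$-th principal component; both symbols are introduced here for illustration and do not appear in the original notes.

```latex
z_k = w_k^{\top} x = \sum_{j=1}^{p} w_{kj}\, x_j ,
\qquad
\operatorname{Cov}(z_i, z_k) = 0 \quad \text{for } i \neq k ,
\qquad
\operatorname{Var}(z_1) \ge \operatorname{Var}(z_2) \ge \dots \ge \operatorname{Var}(z_p)
```

The first expression says each component is a linear combination of the original variables, the second that the components are uncorrelated, and the third that they are ordered by how much variance (information) they capture.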


**Basis**

A basis plays the role of a unit: it is the set of reference vectors in terms of which we express every other vector.
For example, we describe the weight of an object in kilograms, grams, and so on; to describe length, we use metres, centimetres, etc. When you say that an object has a length of 23 cm, what you are essentially saying is that the object’s length is 23×1 cm. Here, 1 cm is the unit in which you are expressing the length of the object.
Similarly, vectors in a space of any dimension can be represented as a linear combination of basis vectors.

Formally, basis vectors are a set of linearly independent vectors whose linear combinations can express any other vector in that space.

In vectors and vector spaces, we use basis vectors to represent the points in space. You understood how every observation in the space can be represented by scaling and adding the scaled basis vectors. This process is also called a linear combination.

 

Then you learnt one of the key ideas that helped you connect basis vectors and the idea of dimensionality reduction: using different basis vectors to represent the same points.
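As a minimal sketch of both ideas, the snippet below builds the point (3, 2) as a linear combination of the standard basis vectors and then of a second basis. The vectors `b1` and `b2` are made up for illustration.

```python
import numpy as np

# Standard basis vectors of the 2-D plane
e1 = np.array([1, 0])
e2 = np.array([0, 1])

# The point (3, 2) is 3 units along e1 plus 2 units along e2
v = 3 * e1 + 2 * e2
print(v)  # [3 2]

# The same point expressed in a different (hypothetical) basis b1, b2
b1 = np.array([1, 1])
b2 = np.array([1, -1])
v_again = 2.5 * b1 + 0.5 * b2
print(v_again)  # [3. 2.]
```

The point itself does not move; only the coefficients used to describe it change, from (3, 2) to (2.5, 0.5).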

 

From there, you learnt how to change from one basis space to another using matrices. Here's a list of rules to help you revise the same.

1.) If you're moving from a basis B to the standard basis, then the change of basis matrix M has the basis vectors of B as its columns. Therefore, if a vector v is represented in B and you want its representation in the standard basis, you compute $ Mv $.

2.) If you want to go the other way, where v is represented in the standard basis and you want its representation in B, you multiply it by the inverse of M: $ M^{-1} v $.

3.) Finally, to move between two non-standard bases, say from B1 to B2, the change of basis matrix is $ B_{2}^{-1} B_{1} $, where B1 and B2 are the matrices whose columns are the respective basis vectors. Note that in all the above cases, the basis vectors should be represented in the same units.
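The three rules can be checked numerically. This is a small sketch with two made-up bases `B1` and `B2`, whose columns are the basis vectors written in standard coordinates.

```python
import numpy as np

# Columns of B1 and B2 are the basis vectors, in standard coordinates
# (both bases are hypothetical, chosen only for illustration)
B1 = np.array([[1.0, 1.0],
               [1.0, -1.0]])
B2 = np.array([[2.0, 0.0],
               [1.0, 1.0]])

v_in_B1 = np.array([2.5, 0.5])

# Rule 1: from basis B1 to the standard basis, multiply by M = B1
v_std = B1 @ v_in_B1

# Rule 2: from the standard basis back to B1, multiply by the inverse of M
v_back = np.linalg.inv(B1) @ v_std

# Rule 3: from B1 directly to B2, multiply by inv(B2) @ B1
M_12 = np.linalg.inv(B2) @ B1
v_in_B2 = M_12 @ v_in_B1

print(v_std)         # coordinates in the standard basis
print(v_back)        # recovers v_in_B1
print(B2 @ v_in_B2)  # the same point as v_std
```

For larger matrices, `np.linalg.solve(B2, B1 @ v_in_B1)` is usually preferred over forming the inverse explicitly; `inv` is used here only because it mirrors the rules as written.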

**PCA and Change of Basis**

1. PCA finds new basis vectors for us. These new basis vectors are also known as Principal Components.
2. We represent the data using these new Principal Components by performing the change of basis calculations.
3. After doing the change of basis, we can perform dimensionality reduction. In fact, PCA finds new basis vectors in such a way that it becomes easier for us to discard a few of the features.
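To make the link concrete, the sketch below fits scikit-learn's `PCA` on synthetic data (an assumption, not the data used in these notes) and checks that its `transform` is a change of basis: the centred data multiplied by the matrix of principal components.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Synthetic 3-feature data with correlated columns (illustrative only)
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.2]])

pca = PCA().fit(X)

# Rows of components_ are the new basis vectors (the principal components)
manual = (X - pca.mean_) @ pca.components_.T
library = pca.transform(X)

print(np.allclose(manual, library))  # True

# The new basis vectors are unit length and mutually perpendicular
print(np.allclose(pca.components_ @ pca.components_.T, np.eye(3)))  # True
```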

**Introduction to Variance**

As mentioned previously, you have already learnt certain methods through which you delete columns – by checking the number of null values, unnecessary information and in modelling by checking the p-values and VIF scores.

 

PCA gauges the importance of a column by another metric called ‘variance’ or how varied a column’s values are.

**PCA measures the importance of a column by checking its variance. If a column has more variance, then it contains more information.**
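As a quick illustration, the variance of each column can be computed directly. The small array below is hypothetical: the first column varies a lot, the second barely varies, and the third is constant.

```python
import numpy as np

# Hypothetical data set: three columns with very different spreads
data = np.array([[10.0, 5.0, 1.0],
                 [20.0, 5.1, 1.0],
                 [30.0, 4.9, 1.0],
                 [40.0, 5.0, 1.0]])

# Sample variance of each column (ddof=1 divides by n - 1)
variances = data.var(axis=0, ddof=1)
print(variances)

# Index of the column with the largest variance
print(int(np.argmax(variances)))  # 0
```

The constant third column has zero variance and so carries no information for telling the rows apart.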


**Directions of Maximum Variance**

Basically, the steps of PCA for finding the principal components can be summarised as follows.

1. First, it finds the basis vector which is along the best-fit line that maximises the variance. This is our first principal component or PC1.
2. The second principal component is perpendicular to the first principal component and contains the next highest amount of variance in the dataset.
3. This process continues iteratively, i.e. each new principal component is perpendicular to all the previous principal components and should explain the next highest amount of variance.
4. If the dataset contains n independent features, then PCA will create n Principal components.
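One common way to find these directions is to take the eigenvectors of the covariance matrix. The notes do not say which method is used, so the sketch below is one possible route, shown on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic 3-feature data with correlated columns (illustrative assumption)
latent = rng.normal(size=(300, 3))
X = latent @ np.array([[3.0, 1.0, 0.5],
                       [0.0, 1.0, 0.5],
                       [0.0, 0.0, 0.3]])

Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# eigh returns eigenvalues in ascending order, so reorder to put PC1 first
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pc1 = eigvecs[:, 0]
print(eigvals)                   # variance along PC1, PC2, PC3 (decreasing)
print(eigvecs.T @ eigvecs)       # approximately the identity: PCs are perpendicular
print(np.var(Xc @ pc1, ddof=1))  # matches eigvals[0]
```

With n = 3 features the decomposition yields 3 principal components, each perpendicular to the others and ordered by the variance it explains.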

**The Workings of PCA**

1. Find n new features: choose a different set of n basis vectors (non-standard). These basis vectors are essentially the directions of maximum variance and are called principal components.
2. Express the original dataset using these new features.
3. Transform the dataset from the original basis to this PCA basis.
4. Perform dimensionality reduction: choose only a certain number k (where k < n) of the PCs to represent the data.
5. Remove those PCs which have less variance (explain less information) than the others.
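The steps above can be sketched end to end with NumPy. The data, the mixing matrix and the choice `k = 2` are all assumptions made for illustration. SVD of the centred data is used to find the new basis.

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic data: 200 rows, 4 features driven by 2 underlying factors plus small noise
factors = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.5, -0.3, 0.8],
                   [0.2, -1.0, 0.7, 0.4]])
X = factors @ mixing + 0.05 * rng.normal(size=(200, 4))

# Find the n new basis vectors (rows of Vt) via SVD of the centred data
mean = X.mean(axis=0)
Xc = X - mean
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Express the data in the PCA basis
scores = Xc @ Vt.T

# Keep only the first k components and drop the low-variance ones
k = 2
reduced = scores[:, :k]

# Variance explained by each component, as a fraction of the total
explained_ratio = S**2 / np.sum(S**2)
print(explained_ratio.round(4))
print(reduced.shape)  # (200, 2)

# Mapping back to the original space shows how little was lost
X_approx = reduced @ Vt[:k] + mean
print(np.abs(X - X_approx).max())
```

Because the synthetic data really has only two underlying factors, the first two components explain almost all of the variance and the reconstruction error stays close to the noise level.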
 

PCA's role in the ML pipeline is almost solely that of a dimensionality reduction tool. You choose the number of PCs needed to explain a chosen threshold of variance and then use only that many columns to represent the original dataset. This modified dataset is then passed on to the rest of the ML pipeline for prediction. PCA can help improve model performance and also helps in visualising higher-dimensional datasets.
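A possible way to wire this into a scikit-learn pipeline is sketched below. The synthetic data from `make_classification`, the 95% threshold and the logistic regression model are illustrative choices, not part of the original notes.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data with redundant (correlated) features
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=10, random_state=42)

# A float n_components keeps just enough PCs to reach that share of variance
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver="full"),
    LogisticRegression(max_iter=1000),
)
model.fit(X, y)

pca_step = model.named_steps["pca"]
n_kept = pca_step.n_components_
print(n_kept, "components kept out of", X.shape[1])
print(pca_step.explained_variance_ratio_.sum())  # at least 0.95
```

Putting the scaler and PCA inside the pipeline means they are fitted only on the training data whenever the pipeline is cross-validated.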

**Practical Considerations and Alternatives**

1. Most software packages use SVD to compute the principal components and expect the data to be centred and on a comparable scale, so it is important to standardise/normalise the features first.
2. PCA is a linear transformation method and works well in tandem with linear models such as linear regression and logistic regression, though it can also be used with non-linear models for computational efficiency.
3. It should not be forced onto a data set just to reduce dimensionality when the features are not correlated.
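The effect of scale in the first point can be seen in a small experiment. The two features below (`height_m` and `income`) are hypothetical and deliberately placed on very different scales.

```python
import numpy as np

rng = np.random.default_rng(3)
# Two hypothetical features on very different scales
height_m = rng.normal(1.7, 0.1, size=300)      # metres
income = rng.normal(50_000, 10_000, size=300)  # currency units
X = np.column_stack([height_m, income])

def first_pc(data):
    """Return the unit vector along the direction of maximum variance."""
    centred = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return vt[0]

# Without scaling, PC1 points almost entirely along the large-scale column
pc1_raw = np.abs(first_pc(X))
print(pc1_raw.round(4))

# After standardising, both columns carry equal weight in PC1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
pc1_std = np.abs(first_pc(X_std))
print(pc1_std.round(4))
```

Without standardisation, the column measured in the largest units dominates the variance, so the first component mostly reflects the choice of units.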

**Shortcomings**

1. PCA is limited to linear transformations; when the structure in the data is non-linear, techniques such as t-SNE can be used instead.
2. PCA needs the components to be perpendicular, though in some cases, that may not be the best solution. An alternative technique is Independent Component Analysis (ICA).
3. PCA assumes that columns with low variance are not useful, which might not be true in prediction setups (especially classification problems with a high class imbalance).
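The third shortcoming can be illustrated with a contrived example. In the synthetic data below (an assumption made for illustration), a high-variance column is unrelated to the class label while a low-variance column separates the classes almost perfectly.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400
y = rng.integers(0, 2, size=n)  # binary class label

# High-variance column unrelated to the class; low-variance column that separates the classes
noise_col = rng.normal(0, 10, size=n)
signal_col = y * 0.5 + rng.normal(0, 0.05, size=n)
X = np.column_stack([noise_col, signal_col])

# First principal component of the centred data
Xc = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
pc1_scores = Xc @ vt[0]

# Absolute correlation with the label: weak for PC1, strong for the low-variance column
corr_pc1 = abs(np.corrcoef(pc1_scores, y)[0, 1])
corr_signal = abs(np.corrcoef(signal_col, y)[0, 1])
print(round(corr_pc1, 3), round(corr_signal, 3))
```

Keeping only PC1 here would discard nearly all of the information that predicts the class, because PCA never looks at the target variable.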


In [5]:
import numpy as np
a=np.array([[2,7,1],[-2,1,8],[3,4,-2]])
b=np.array([[8,1,3],[3,5,8],[7,-2,-4]])

In [6]:
a@b

array([[ 44,  35,  58],
       [ 43, -13, -30],
       [ 22,  27,  49]])

In [23]:
a=np.array([[1,2],[2,-1]])
b=np.array([[1,0],[0,1]])

In [47]:
A=np.array([[3],[2]])
B2=np.array([[3,-3],[4,-5]])
B=np.linalg.inv(B2)
B@A

array([[3.],
       [2.]])

In [25]:
a@B

array([[ 1.,  1.],
       [ 2., -1.]])

In [11]:
20*0.25

5.0

In [None]:
# Scratch working (not executable Python)
# 30+20c+30d+5=30

In [None]:
# 30+20d=120
# d=3
# 20c+5=35

In [51]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("Ratings.csv")
df.head()

Unnamed: 0,B1,B2,B3,B4,B5
0,1,0,4,0,3
1,2,3,4,3,2
2,3,3,2,4,2
3,4,4,3,5,4
4,5,1,4,2,2


In [52]:
from sklearn.decomposition import PCA
# Fit PCA on every column of df; note that this includes the 'Unnamed: 0' column read from the CSV
pca = PCA(random_state=42)
pca.fit(df)

PCA(random_state=42)

In [53]:
# Share of the total variance explained by each principal component
pca.explained_variance_ratio_

array([6.51886873e-01, 1.52836342e-01, 1.30717403e-01, 6.43810220e-02,
       1.78359842e-04])
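To turn these ratios into a choice of k, one option is to look at the cumulative explained variance. The sketch below reuses the values printed above. The 90% threshold is an illustrative choice, not one stated in the notes.

```python
import numpy as np

# Ratios copied from the explained_variance_ratio_ output above
ratios = np.array([6.51886873e-01, 1.52836342e-01, 1.30717403e-01,
                   6.43810220e-02, 1.78359842e-04])

cumulative = np.cumsum(ratios)
print(cumulative.round(4))  # [0.6519 0.8047 0.9354 0.9998 1.    ]

# Smallest number of components whose cumulative share reaches the threshold
threshold = 0.90  # illustrative choice
k = int(np.argmax(cumulative >= threshold)) + 1
print(k)  # 3
```

With this threshold, the first three components would be kept, since together they explain about 93.5% of the variance.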