# Principal Component Analysis (PCA)

**Prerequisites**:
- covariance matrix
- eigenvalues and eigenvectors

**Principal Components**: Principal components are new variables constructed as linear combinations of the initial variables. The combinations are chosen so that the new variables (the principal components) are uncorrelated and most of the information in the initial variables is compressed into the first components. For example, 10-dimensional data gives 10 principal components, but PCA puts the maximum possible information into the first component, the maximum remaining information into the second, and so on, which is the decreasing pattern a scree plot shows.


## Introduction
PCA is a dimensionality reduction procedure. It maps a high-dimensional dataset to a lower-dimensional one so that it is easier to visualize and cheaper to work with. PCA does not drop features by hand; instead it finds the principal components of the dataset and uses these components as the new features. This is done with the help of the covariance matrix and its eigenvectors. The trade-off is that discarding components loses some information in exchange for a more efficient model.

## Algorithm
- Standardize the range of continuous initial variables
- Compute the covariance matrix to identify correlations
- Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components
- Create a feature vector to decide which principal components to keep
- Recast the data along the principal components axes
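
The five steps above can be sketched end to end in a few lines of NumPy. This is a minimal illustration, not a reference implementation: `pca_sketch` and its parameter names are hypothetical, it centers the data without dividing by the standard deviation, and it uses `np.linalg.eigh` (intended for symmetric matrices) rather than `np.linalg.eig`.

```python
import numpy as np

def pca_sketch(samples, n_components):
    """Hypothetical helper: project samples (one example per row) onto the leading components."""
    # Step 1: center each feature (also divide by the std if features use different scales)
    centered = samples - samples.mean(axis=0)
    # Step 2: covariance matrix, population form (divide by n)
    cov = centered.T @ centered / centered.shape[0]
    # Step 3: eigen-decomposition of the symmetric covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Step 4: sort by descending eigenvalue and keep the leading columns
    order = np.argsort(eigenvalues)[::-1]
    feature_vector = eigenvectors[:, order[:n_components]]
    # Step 5: recast the centered data along the kept components
    projected = centered @ feature_vector
    return projected, eigenvalues[order]

samples = np.array([[68, 29], [60, 26], [58, 30], [40, 35]], dtype=float)
projected, eigenvalues = pca_sketch(samples, n_components=1)
print(projected.shape)
print(eigenvalues)
```

The sign of an eigenvector is arbitrary, so the projected values may come out with the opposite sign from one routine to another; the magnitudes and the eigenvalues are what should agree.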

### Step 1:
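**Standardization**: Before applying PCA it is important to standardize the data, because PCA is sensitive to the variance of each variable. For example, a variable that ranges over 1-100 will dominate a variable that ranges over 0-1, which leads to a biased result.

Formula: $z = (x - \text{mean}) / \text{std}$

Note that the cell below only subtracts the mean (centering); it does not divide by the standard deviation, so the covariance matrix in Step 2 is computed on centered, unscaled data.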

In [2]:
import numpy as np

# Step 1: center the data (subtract the mean of each feature)
# initializing data
X=np.array([68,60,58,40],dtype=float)
Y=np.array([29,26,30,35],dtype=float)
data=np.vstack((X,Y))
data=data.T # now each row is an example and each column is a feature
data-=np.mean(data,axis=0)
print(data)


[[ 11.5  -1. ]
 [  3.5  -4. ]
 [  1.5   0. ]
 [-16.5   5. ]]
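
The cell above centers the data but leaves the scale of each feature unchanged. A sketch of the full standardization formula on the same four points might look like the following; the variable names are illustrative, and the population standard deviation (`ddof=0`, NumPy's default) is an assumption.

```python
import numpy as np

raw = np.array([[68, 29], [60, 26], [58, 30], [40, 35]], dtype=float)

mean = raw.mean(axis=0)   # per-feature mean
std = raw.std(axis=0)     # per-feature population standard deviation (ddof=0)
standardized = (raw - mean) / std

# Each column should now have mean ~0 and standard deviation ~1
print(standardized.mean(axis=0))
print(standardized.std(axis=0))
```

With fully standardized data the covariance matrix becomes the correlation matrix, so the eigenvalues would differ from the ones obtained in the rest of this walkthrough, which uses centered data only.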


### Step 2:
**Covariance matrix**:
The covariance matrix describes how the variables vary around their means and with respect to each other. Its off-diagonal entries show which variables are correlated, which helps identify redundant features.

We discussed the covariance matrix earlier in another section.

In [3]:
# covariance matrix of the centered data (divides by n, i.e. the population covariance)
cov_matrix=np.dot(data.T,data)/data.shape[0]
print(cov_matrix)

[[104.75 -27.  ]
 [-27.    10.5 ]]
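
As a cross-check, the manual computation above can be compared with `np.cov`. The sketch below assumes the same four data points; `rowvar=False` tells NumPy that columns are variables, and `bias=True` makes it divide by n instead of the default n-1.

```python
import numpy as np

data = np.array([[68, 29], [60, 26], [58, 30], [40, 35]], dtype=float)
centered = data - data.mean(axis=0)

# Manual population covariance, as in the cell above
manual_cov = centered.T @ centered / centered.shape[0]

# Same quantity from NumPy (np.cov centers the data itself)
numpy_cov = np.cov(data, rowvar=False, bias=True)

# Default np.cov divides by n-1 (sample covariance)
sample_cov = np.cov(data, rowvar=False)

print(np.allclose(manual_cov, numpy_cov))
print(sample_cov)
```

The choice of divisor scales every eigenvalue by the same factor, so the principal directions and the explained-variance percentages are the same either way.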


### Step 3: Computing the eigenvectors and eigenvalues of the covariance matrix to identify principal components
Principal components represent the directions of the data that explain a maximal amount of variance, that is to say, the lines that capture the most information in the data.

To find them by hand, with $A$ the covariance matrix, first solve the characteristic equation $\det(A-\lambda I)=0$ for the eigenvalues $\lambda$, then for each $\lambda$ solve $(A-\lambda I)v=0$ for the eigenvector $v$. In Python, np.linalg.eig returns both at once.

In [4]:
# eigenvalues and eigenvectors (each column of eigen_vec is one eigenvector)
eigen_val, eigen_vec=np.linalg.eig(cov_matrix)
print(f"eigen values : {eigen_val}")
print(f"eigen vectors \n {eigen_vec}")




eigen values : [111.93674482   3.31325518]
eigen vectors 
 [[ 0.96635295  0.25721971]
 [-0.25721971  0.96635295]]


In [5]:
for i in range(len(eigen_val)):
    print(f"for eigen value {i+1} the eigen vector is {eigen_vec[:,i]} ")
                     


for eigen value 1 the eigen vector is [ 0.96635295 -0.25721971] 
for eigen value 2 the eigen vector is [0.25721971 0.96635295] 


 **Summary**: The dataset has two principal components. The first principal component explains the majority of the variance in the data, and its direction is given by the first eigenvector. The second principal component explains less variance and is orthogonal to the first principal component, with its direction given by the second eigenvector.

  - Eigenvectors of the covariance matrix are the directions of the axes along which the data has the most variance (most information); these are what we call principal components.
  - Eigenvalues are the coefficients attached to the eigenvectors; each one gives the amount of variance carried by its principal component.
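
The properties in the summary can be checked numerically. The sketch below rebuilds the covariance matrix from the values printed in Step 2 and verifies that each pair satisfies $Av=\lambda v$ and that the eigenvectors are orthonormal.

```python
import numpy as np

# Covariance matrix printed in Step 2
cov_matrix = np.array([[104.75, -27.0], [-27.0, 10.5]])
eigen_val, eigen_vec = np.linalg.eig(cov_matrix)

# Each column v of eigen_vec should satisfy A v = lambda v
for i in range(len(eigen_val)):
    v = eigen_vec[:, i]
    print(np.allclose(cov_matrix @ v, eigen_val[i] * v))

# For a symmetric matrix with distinct eigenvalues the eigenvectors are orthonormal
gram = eigen_vec.T @ eigen_vec
print(np.allclose(gram, np.eye(2)))
```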

### PCA analysis
In our example the first eigenvalue is greater than the second. This means that the eigenvector corresponding to the first principal component (PC1) is v1, and the one corresponding to the second principal component (PC2) is v2.

After having the principal components, to compute the percentage of variance (information) accounted for by each component, we divide the eigenvalue of each component by the sum of eigenvalues.

In [6]:
first_component_var=(eigen_val[0]/np.sum(eigen_val))*100
second_component_var=(eigen_val[1]/np.sum(eigen_val))*100
print(f"PC1 = {first_component_var:.3f}%, PC2 = {second_component_var:.3f}%")

PC1 = 97.125%, PC2 = 2.875%
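
The cell above relies on the largest eigenvalue coming first. `np.linalg.eig` does not guarantee any particular order, so a more defensive sketch sorts the eigenvalues (and the matching eigenvector columns) before computing the percentages. The variable names here are illustrative.

```python
import numpy as np

cov_matrix = np.array([[104.75, -27.0], [-27.0, 10.5]])
eigen_val, eigen_vec = np.linalg.eig(cov_matrix)

# Sort eigenvalues in descending order and reorder the eigenvector columns to match
order = np.argsort(eigen_val)[::-1]
eigen_val = eigen_val[order]
eigen_vec = eigen_vec[:, order]

# Percentage of total variance carried by each component
explained_pct = 100 * eigen_val / eigen_val.sum()
for i, pct in enumerate(explained_pct, start=1):
    print(f"PC{i} = {pct:.3f}%")
```

For this dataset the result should match the percentages printed above, since the eigenvalues already happen to be in descending order.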


### Step 4: Feature vector
As we saw in the previous step, computing the eigenvectors and ordering them by their eigenvalues in descending order gives the principal components in order of significance. In this step we choose whether to keep all of these components or discard the less significant ones (those with low eigenvalues), and form from the remaining ones a matrix of vectors that we call the feature vector.

So, the feature vector is simply a matrix that has as columns the eigenvectors of the components that we decide to keep. This makes it the first step towards dimensionality reduction, because if we choose to keep only p eigenvectors (components) out of n, the final data set will have only p dimensions.
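
One common way to decide how many components to keep is a cumulative explained-variance threshold. The sketch below is only an illustration of that idea: the 95% threshold is a hypothetical choice rather than something prescribed here, and `np.linalg.eigh` is used because it returns eigenvalues in ascending order, which makes the reordering explicit.

```python
import numpy as np

cov_matrix = np.array([[104.75, -27.0], [-27.0, 10.5]])

# eigh returns eigenvalues in ascending order; reverse to get descending order
values, vectors = np.linalg.eigh(cov_matrix)
sorted_val = values[::-1]
sorted_vec = vectors[:, ::-1]

threshold = 0.95  # hypothetical: keep enough components to explain 95% of the variance
cumulative = np.cumsum(sorted_val) / np.sum(sorted_val)

# Smallest p whose cumulative share reaches the threshold
p = int(np.searchsorted(cumulative, threshold) + 1)

# Feature vector: the p leading eigenvectors as columns
feature_vector = sorted_vec[:, :p]
print(p, feature_vector.shape)
```

Here the first component alone exceeds the threshold, so the feature vector has a single column.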


In the example above we can keep both eigenvectors and construct the feature vector as follows:

$$
\begin{bmatrix}
0.96635295 & 0.25721971\\
-0.25721971 & 0.96635295
\end{bmatrix}
$$

where each column is an eigenvector.

Or we can keep only the first eigenvector and construct the feature vector as follows:

$$
\begin{bmatrix}
0.96635295\\
-0.25721971
\end{bmatrix}
$$




This is how two dimensions can be reduced to one. In real-world problems the number of dimensions is usually much higher.

In [7]:
# keep only the first eigenvector (the one with the largest eigenvalue)
featur_vec=np.array(eigen_vec[:,0])
print(featur_vec)

[ 0.96635295 -0.25721971]


In [8]:
print(featur_vec.shape)
featur_vec=featur_vec.reshape(2,1)
print(featur_vec.shape)

(2,)
(2, 1)


### Step 5: Recast the data along the principal component axes

In this last step we reorient the data from the original axes to the principal component axes. This is done by multiplying the centered data matrix (one example per row) by the feature vector: $\text{FinalData} = \text{Data} \times \text{FeatureVector}$.

In [9]:
Final_dataset=np.dot(data,featur_vec)
print(Final_dataset)

[[ 11.37027863]
 [  4.41111415]
 [  1.44952942]
 [-17.2309222 ]]
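
To see what the projection keeps and what it loses, the sketch below starts from the centered data printed in Step 1, projects it onto the first eigenvector, and then maps it back to two dimensions. The names `projected`, `reconstructed`, and `mse` are illustrative. Under these assumptions, the variance of the projected values should equal the largest eigenvalue, and the mean squared reconstruction error should equal the discarded eigenvalue.

```python
import numpy as np

# Centered data from Step 1 (one example per row)
data = np.array([[11.5, -1.0], [3.5, -4.0], [1.5, 0.0], [-16.5, 5.0]])
cov_matrix = data.T @ data / data.shape[0]
eigen_val, eigen_vec = np.linalg.eig(cov_matrix)

# Feature vector: the eigenvector with the largest eigenvalue, as a (2, 1) column
top = np.argmax(eigen_val)
feature_vec = eigen_vec[:, [top]]

# Project onto the first principal component
projected = data @ feature_vec
print(np.var(projected))   # variance kept by PC1

# Map the 1-D projection back to the original 2-D space
reconstructed = projected @ feature_vec.T
mse = np.mean(np.sum((data - reconstructed) ** 2, axis=1))
print(mse)                 # variance lost by dropping PC2
```

This makes the trade-off concrete: reducing two dimensions to one keeps the variance along PC1 and discards only the small amount along PC2.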
