#### Principal Component Analysis (PCA) 

PCA is a dimensionality reduction technique that helps with feature extraction and data visualization. It transforms data from a higher-dimensional space to a lower-dimensional space while keeping as much of the variance as possible.


Advantages of PCA:

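1) Removes correlated features: the principal components are uncorrelated with one another.

2) Can reduce overfitting by lowering the number of features a model has to fit.

3) Can improve model performance, for example faster training with fewer features.

4) Improves visualization: high-dimensional data can be plotted in two or three dimensions.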

Disadvantages of PCA:

1) Less interpretable. After PCA, each principal component is a linear combination of the original features, which makes it harder to read and interpret.

2) Data should be scaled before applying PCA; otherwise features with small variance will be neglected in favor of features measured on larger scales.
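The effect of scaling can be seen in a small experiment. The sketch below uses synthetic data (an assumption made purely for illustration): two independent features, one with a standard deviation of 1 and one with a standard deviation of 1000. Without scaling, the first component should capture almost all of the variance simply because of the units.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): two independent features
# measured on very different scales.
rng = np.random.default_rng(0)
small = rng.normal(loc=0.0, scale=1.0, size=500)
large = rng.normal(loc=0.0, scale=1000.0, size=500)
X = np.column_stack([small, large])

# PCA on the raw data: the large-scale feature dominates the first component.
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

# PCA after standardizing: both features contribute comparably.
X_scaled = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print("raw:   ", raw_ratio)
print("scaled:", scaled_ratio)
```

After standardizing, the two ratios should be close to each other, since neither feature carries more information than the other.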



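Key terms:

1) Dimensionality - the number of features in a dataset.

2) Correlation - the strength of the linear relationship between two features.

3) Orthogonal - two vectors are orthogonal when their dot product is zero. The principal components are orthogonal, so the correlation between any pair of them is zero.

4) Eigenvectors - a nonzero vector $v$ is an eigenvector of $A$ if $ A v = \lambda v $, where the scalar $\lambda$ is the corresponding eigenvalue.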

PCA steps:

1) Normalize the data

2) Calculate the covariance matrix

<img src="covariance1.png", width=300, height=200>

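3) Calculate the eigenvalues and eigenvectors of the covariance matrix $A$

The eigenvalues $\lambda$ solve

$ \det( A - \lambda I)  = 0 $

and, for each eigenvalue, the eigenvector $v$ solves

$ ( A - \lambda I)v  = 0 $

4) Choosing components - arrange the eigenvalues from highest to lowest and choose the components with the highest eigenvalues.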

5) Forming Principal components 

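new data = feature vector * scaled data

Here, new data is the matrix containing the principal components, feature vector is the matrix of the eigenvectors we chose to keep, and scaled data is the scaled version of the original data. For this product to be defined, the feature vector holds one eigenvector per row and the scaled data holds one sample per column. With one sample per row, the equivalent product is scaled data * feature vector, with one eigenvector per column.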
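The five steps above can be sketched directly in NumPy. This is a minimal illustration, not a replacement for a library implementation: the data is synthetic (an assumption), the names `feature_vector` and `new_data` are chosen to mirror the text, and samples are stored as rows, so the projection is written as scaled data times feature vector.

```python
import numpy as np

# Synthetic data (an assumption): 100 samples, 3 features, two of them correlated.
rng = np.random.default_rng(42)
base = rng.normal(size=(100, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(100, 1)),
    2 * base + 0.1 * rng.normal(size=(100, 1)),
    rng.normal(size=(100, 1)),
])

# 1) Normalize: zero mean and unit variance per feature.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# 2) Covariance matrix (features are in columns, hence rowvar=False).
cov = np.cov(X_scaled, rowvar=False)

# 3) Eigenvalues and eigenvectors. eigh is meant for symmetric matrices
#    and returns the eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4) Sort from highest to lowest eigenvalue and keep the top k components.
order = np.argsort(eigenvalues)[::-1]
k = 2
feature_vector = eigenvectors[:, order[:k]]  # shape (3, 2), one eigenvector per column

# 5) Form the principal components by projecting the scaled data.
new_data = X_scaled @ feature_vector  # shape (100, 2)

print(new_data.shape)
```

The columns of `new_data` should be uncorrelated with one another, which is one way to check the result.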

Where is PCA used:

1) Facial recognition, computer vision and image compression

2) Finding patterns in data mining, bioinformatics, psychology, finance 

Let's find the eigenvalues and eigenvectors of

$ A = \begin{bmatrix} 2 & -1 \\ -1 & 2 \end{bmatrix} $
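By hand, $\det(A - \lambda I) = (2 - \lambda)^2 - 1 = 0$ gives $\lambda = 1$ and $\lambda = 3$. The short check below confirms this numerically with `np.linalg.eigh`, which applies here because $A$ is symmetric, and verifies $A v = \lambda v$ for each pair.

```python
import numpy as np

A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])

# eigh returns eigenvalues in ascending order and the matching
# unit-length eigenvectors as the columns of the second result.
eigenvalues, eigenvectors = np.linalg.eigh(A)

# Verify the defining relation A v = lambda v for every pair.
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)

print(eigenvalues)
print(eigenvectors)
```

The sign of each eigenvector is arbitrary, so a result that differs from a hand calculation by a factor of -1 is still correct.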

In [None]:
"""
In-class activity: If A = [[ 1, 2], [4, 3]]
"""

Let's perform PCA on the below dataset

<img src="matrixA.png", width=300, height=200>

This dataset has three features: Math, English, and Art.

The mean of the features is

<img src="meanofA.png", width=300, height=200>

The covariance formula is

<img src="conv_formula.png", width=300, height=200>


The covariance matrix is 

<img src="covariance1.png", width=300, height=200>

The original dataset along with the covariance matrix

<img src="con2.png", width=300, height=200>

From the covariance matrix we can say that:

1) Art has the biggest variance and English has the least.

2) The covariance between Math and English is 360, while the covariance between Math and Art is 180.

3) The covariance between Art and English is 0.
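These numbers can be reproduced with `np.cov`. The scores below are an assumption: the dataset appears only as a figure, so the values here were chosen because they give exactly the covariances quoted above (360, 180, and 0). The sketch also assumes the population form of the covariance, dividing by N rather than N - 1.

```python
import numpy as np

# Assumed scores for five students; columns are Math, English, Art.
# (Hypothetical values chosen to match the covariances quoted in the text.)
scores = np.array([
    [90, 60, 90],
    [90, 90, 30],
    [60, 60, 60],
    [60, 60, 90],
    [30, 30, 30],
], dtype=float)

# rowvar=False: each column is a feature.
# bias=True: divide by N (population covariance) instead of N - 1.
cov = np.cov(scores, rowvar=False, bias=True)

print(scores.mean(axis=0))
print(cov)
```

The diagonal of `cov` holds the variances of Math, English, and Art, and the off-diagonal entries hold the pairwise covariances.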

Let's compute the eigenvalues and eigenvectors of the covariance matrix

<img src="eigen1.png", width=300, height=200>

<img src="eigen2.png", width=400, height=300>

<img src="eigen3.png", width=300, height=200>

<img src="eigen4.png", width=300, height=200>

<img src="eigen5.png", width=300, height=200>

<img src="eigen6.png", width=300, height=200>

<img src="eigen7.png", width=300, height=200>

<img src="eigen8.png", width=300, height=200>

With PCA, we transformed a three-dimensional space into a two-dimensional space.
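The reduction from three dimensions to two can be sketched numerically as follows. The scores are an assumption (hypothetical values consistent with the covariances quoted earlier, since the dataset is shown only as a figure), and the data is mean-centered but not rescaled, which is also an assumption about how the worked example was computed.

```python
import numpy as np

# Assumed scores for five students; columns are Math, English, Art.
scores = np.array([
    [90, 60, 90],
    [90, 90, 30],
    [60, 60, 60],
    [60, 60, 90],
    [30, 30, 30],
], dtype=float)

# Center each feature, then form the population covariance matrix.
centered = scores - scores.mean(axis=0)
cov = centered.T @ centered / len(scores)

# Eigen-decomposition of the symmetric covariance matrix (ascending order).
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Keep the two eigenvectors with the largest eigenvalues.
top2 = eigenvectors[:, ::-1][:, :2]

# Project the 3-D data onto the 2-D subspace.
projected = centered @ top2

print(eigenvalues[::-1])
print(projected.shape)
```

The variance of each projected column equals the eigenvalue of the component it was projected onto, which is why the components with the smallest eigenvalues are the ones to drop.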

References:
https://towardsdatascience.com/the-mathematics-behind-principal-component-analysis-fff2d7f4b643

In [None]:
%matplotlib inline
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
# load dataset into Pandas DataFrame
df = pd.read_csv(url, names=['sepal length','sepal width','petal length','petal width','target'])

In [None]:
print(df.head())

In [None]:
# Standardizing the data to mean zero and variance 1

from sklearn.preprocessing import StandardScaler
features = ['sepal length', 'sepal width', 'petal length', 'petal width']
# Separating out the features
x = df.loc[:, features].values
# Separating out the target
y = df.loc[:,['target']].values
# Standardizing the features
x = StandardScaler().fit_transform(x)

In [None]:
print(x[0:5])

In [None]:
# using PCA to reduce the dimensionality from 4 to 2

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(
    data=principalComponents,
    columns=['principal component 1', 'principal component 2'],
)

In [None]:
print(principalDf.head())

In [None]:
# concatenating the target values and the new principal components

finalDf = pd.concat([principalDf, df[['target']]], axis = 1)

In [None]:
# Visualization

import matplotlib.pyplot as plt

fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1) 
ax.set_xlabel('Principal Component 1', fontsize = 15)
ax.set_ylabel('Principal Component 2', fontsize = 15)
ax.set_title('2 component PCA', fontsize = 20)
targets = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
colors = ['r', 'g', 'b']
for target, color in zip(targets,colors):
    indicesToKeep = finalDf['target'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'principal component 1']
               , finalDf.loc[indicesToKeep, 'principal component 2']
               , c = color
               , s = 50)
ax.legend(targets)
ax.grid()

#### Important measure

Explained variance tells us how much information (variance) 
can be attributed to each principal component.

In [None]:
pca.explained_variance_ratio_

#### Conclusion

This means that the first principal component contains 72.77% of the variance and the second principal component contains 23.03% of the variance. Together they contain 95.80% of the information.
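The same idea can be used to decide how many components to keep. The sketch below is one possible approach, under two assumptions: it loads scikit-learn's bundled copy of Iris through `load_iris` (which may differ slightly from the UCI file used above, so the ratios need not match exactly), and it uses a 95% threshold that is chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Bundled copy of Iris (assumption: close to, but not identical with, the UCI file).
X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components to see the full variance breakdown.
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95  # illustrative choice
n_needed = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative)
print(n_needed)
```

Plotting `cumulative` against the number of components is a common way to look for the point where adding more components stops helping.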

References:
https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60

In [None]:
"""
In-class activity:

Apply PCA to the income dataset that you used for the last
homework (homework 2) and find the explained variance ratio for each component when
n_components is 2 and when n_components is 3.
"""