# Principal Component Analysis (PCA) in Python #

Killian McKee

### Overview ###

1. [What is PCA?](#section1)
2. [Key Terms](#section2) 
3. [Pros and Cons of PCA](#section3)
4. [When to use PCA](#section4)
5. [Key Parameters](#section5)
6. [Walkthrough: PCA for data visualization](#section6)
7. [Walkthrough: PCA w/ Random Forest](#section7)
8. [Additional Reading](#section8)
9. [Conclusion](#section9)
10. [Sources](#section10)

<a id='section1'></a>

### What is Principal Component Analysis? ###

Principal component analysis (PCA) is a non-parametric technique for summarizing a dataset made up of many correlated variables with a smaller set of new variables. In more technical terms, PCA reduces the dimensionality of our feature space by building new features (the principal components), each a linear combination of the original variables, that are orthogonal to one another and ordered by how much of the data's variance they capture. PCA is typically applied before a model is built, so that the model trains on a handful of uncorrelated components rather than on many overlapping features. It provides two primary benefits. First, it can help our models avoid overfitting by discarding low-variance directions that most likely reflect noise in the training data rather than patterns that hold for new data seen in the real world. Second, it can drastically improve model training speed in high-dimensional settings (when a dataset has lots of features).
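To make the idea concrete, here is a minimal NumPy sketch of PCA done by hand: center the data, compute the covariance matrix, and project onto the eigenvectors with the largest eigenvalues. The helper name `pca_numpy` and the toy data are made up for illustration; scikit-learn's `PCA` is what we use in the walkthroughs, and it computes the decomposition differently under the hood.

```python
import numpy as np


def pca_numpy(X, n_components):
    """Project X (n_samples, n_features) onto its top principal components."""
    # center each feature so the covariance is measured around the mean
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)

    # eigh returns eigenvalues in ascending order, so sort them descending
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    components = eigvecs[:, :n_components]   # orthonormal directions
    scores = X_centered @ components         # the data in the new coordinates
    explained_ratio = eigvals[:n_components] / eigvals.sum()
    return scores, components, explained_ratio


# toy data (illustrative): feature 2 is almost a copy of feature 1, feature 3 is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X_toy = np.column_stack([a, a + 0.1 * rng.normal(size=200), rng.normal(size=200)])

scores, components, ratio = pca_numpy(X_toy, n_components=2)
print(ratio)  # share of the total variance captured by each kept component
```

Because two of the three toy features are nearly identical, two components are enough to retain almost all of the variance.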

<a id='section2'></a>

### Key Terms ### 

1. **Dimensionality**: the number of features in a dataset (the number of columns in a tidy dataset). PCA aims to reduce excessive dimensionality to improve model performance.
2. **Correlation**: a measure of the strength and direction of the linear relationship between two variables, ranging from -1 to +1. A negative correlation indicates that when one variable goes up, the other tends to go down; a positive correlation indicates that they tend to move in the same direction. PCA helps us eliminate the redundancy that correlated variables introduce.
3. **Orthogonal**: at right angles to one another. The principal components are orthogonal directions, so the new features PCA produces are uncorrelated, i.e. they have a correlation of 0. PCA seeks an orthogonal set of components that still captures most or all of the important information for our model.
4. **Covariance Matrix**: a matrix showing how each pair of variables varies together; rescaling it by the variables' standard deviations gives the correlation matrix. It can be a helpful tool for seeing which features carry redundant information that PCA will merge.
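The terms above can be checked directly on the iris data used later in this guide. The sketch below (variable names are our own) computes the covariance and correlation matrices of the raw features, then confirms that the features produced by PCA are uncorrelated with one another.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X_iris = datasets.load_iris().data  # 150 samples, 4 features

# rowvar=False tells NumPy that columns (not rows) are the variables
cov_matrix = np.cov(X_iris, rowvar=False)
corr_matrix = np.corrcoef(X_iris, rowvar=False)
print(np.round(corr_matrix, 2))  # petal length and petal width are strongly correlated

# after PCA, the new features should have (near) zero correlation with each other
scores_iris = PCA(n_components=4).fit_transform(X_iris)
corr_scores = np.corrcoef(scores_iris, rowvar=False)
print(np.round(corr_scores, 2))
```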

<a id='section3'></a>

### Pros and Cons of PCA ###

PCA has no major drawbacks as such, but it does rest on assumptions that limit where it works well.

**Pros**: 

1. Reduces model noise
2. Easy to implement with Python packages like pandas and scikit-learn
3. Improves model training time

**Limitations**: 

1. Linearity: PCA assumes the principal components are linear combinations of the original dataset features.
2. Variance measure: PCA uses variance as the measure of a dimension's importance. Axes with high variance are kept as principal components, while those with low variance are cut out as noise, even when they carry useful signal.
3. Orthogonality: PCA assumes the principal components are orthogonal to one another, and won't produce meaningful results otherwise.
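The variance limitation is worth seeing in practice, because it makes PCA sensitive to the units of each feature. In the sketch below we pretend (purely for illustration) that sepal width was recorded in units 1000 times smaller, which inflates its variance. The first component then points almost entirely along that one feature, and standardizing the data first removes the effect.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_raw = datasets.load_iris().data

# hypothetical unit change: blow up the scale of one feature (column 1, sepal width)
X_rescaled = X_raw.copy()
X_rescaled[:, 1] *= 1000

pca_rescaled = PCA().fit(X_rescaled)
first_axis = pca_rescaled.components_[0]
ratio_rescaled = pca_rescaled.explained_variance_ratio_
print(first_axis)      # dominated by the rescaled feature
print(ratio_rescaled)  # nearly all variance is attributed to the first component

# standardizing puts every feature on the same footing before PCA
X_standardized = StandardScaler().fit_transform(X_rescaled)
ratio_standardized = PCA().fit(X_standardized).explained_variance_ratio_
print(ratio_standardized)
```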

<a id='section4'></a>

### When to use Principal Component Analysis ### 

One should consider using PCA when the following conditions are true: 

1. The linearity, variance, and orthogonality limitations specified above are satisfied. 
2. Your dataset contains many features.
3. You are interested in reducing the noise in your dataset or improving model training time.

<a id='section5'></a>

### Key Parameters ###

The number of components to keep after PCA (typically denoted by n_components) is the only major parameter for PCA.
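In scikit-learn, `n_components` can be an integer (keep exactly that many components) or a float between 0 and 1 (keep as many components as are needed to explain that fraction of the variance). The sketch below shows both on standardized iris data; we pass `svd_solver="full"` explicitly because the fractional form relies on the full decomposition.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(datasets.load_iris().data)

# keep a fixed number of components
pca_fixed = PCA(n_components=2).fit(X_scaled)

# keep however many components are needed to explain at least 95% of the variance
pca_auto = PCA(n_components=0.95, svd_solver="full").fit(X_scaled)

print(pca_fixed.n_components_, pca_auto.n_components_)
print(pca_auto.explained_variance_ratio_.sum())
```

The fractional form is handy when you do not know in advance how many components a dataset needs.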

<a id='section6'></a>

### PCA Walkthrough: Data Visualization ### 

We will be modifying scikit-learn's tutorial on fitting PCA for visualization using the iris [dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), which contains measurements for three species of iris flowers.

In [102]:
# import the necessary packages 

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import decomposition
from sklearn import datasets


# load the iris dataset

iris = datasets.load_iris()


# set features and target 

X = iris.data
y = iris.target


# create the chart 

fig = plt.figure(1, figsize=(4, 3))
plt.clf()
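# create the 3D axes through the figure (Axes3D(fig, ...) is no longer attached automatically in recent Matplotlib)
ax = fig.add_subplot(111, projection="3d", elev=48, azim=134)
ax.set_position([0, 0, 0.95, 1])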


# fit our PCA 

plt.cla()
pca = decomposition.PCA(n_components=3)
pca.fit(X)
X = pca.transform(X)


# plot our data

for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
    ax.text3D(X[y == label, 0].mean(),
              X[y == label, 1].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
    
    
# Reorder the labels to have colors matching the cluster results

y = np.choose(y, [1, 2, 0]).astype(float)  # np.float was removed from NumPy; the builtin float is equivalent
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap=plt.cm.nipy_spectral,
           edgecolor='k')

# hide the tick labels (the old w_xaxis/w_yaxis/w_zaxis aliases were removed from Matplotlib)
ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])

plt.show()

# we can clearly see the three species within the iris dataset and how they differ from one another

<Figure size 400x300 with 1 Axes>

<a id='section7'></a>

### Walkthrough: PCA w/ Random Forest ### 

In this tutorial we will walk through the typical workflow for improving model speed with PCA and then fit a random forest. We will be working with the iris dataset again, but this time we will load it into a pandas dataframe.

In [149]:
# import necessary packages 

import numpy as np  
import pandas as pd  
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix  
from sklearn.metrics import accuracy_score


In [150]:
# load the data (the iris dataset ships with scikit-learn)

data = datasets.load_iris()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['target'] = data['target']
df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [164]:
# split the data into features and target 

X = df.drop(columns='target')  # the positional axis argument was removed in pandas 2.0
y = df['target']  

In [165]:
#creating training and test splits 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)  

In [166]:
# scaling the data 
# since pca uses variance as a measure, it is best to scale the data 

sc = StandardScaler()  
X_train = sc.fit_transform(X_train)  
X_test = sc.transform(X_test)  

In [167]:
# apply and fit the pca 
# play around with the n_components value to see how the model does 

pca = PCA(n_components=4)  
X_train = pca.fit_transform(X_train)  
X_test = pca.transform(X_test)

In [168]:
# generate the explained variance ratio, which shows us how much of the total variance each principal component explains
# we can see from the output below that more than 96% of the variance is explained by the first two principal components

explained_variance = pca.explained_variance_ratio_  
explained_variance

array([0.72226528, 0.23974795, 0.03338117, 0.0046056 ])
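A cumulative sum makes it easier to read off how many components are needed to reach a given share of the variance. This small sketch reuses the ratios printed above; the 95% threshold is an arbitrary choice for illustration.

```python
import numpy as np

# explained variance ratios copied from the output above
ratios = np.array([0.72226528, 0.23974795, 0.03338117, 0.0046056])

cumulative = np.cumsum(ratios)
print(cumulative)

# smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.95
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print(n_needed)
```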

In [169]:
# now let's fit a random forest so we can see how the accuracy changes with different numbers of components
# this model uses all four components

classifier = RandomForestClassifier(max_depth=2, random_state=0)  
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)  

In [170]:
# all component model accuracy
# we can see it achieves an accuracy of about 87%

cm = confusion_matrix(y_test, y_pred)  
print(cm)  
print('Accuracy', accuracy_score(y_test, y_pred))  

[[11  0  0]
 [ 0 10  3]
 [ 0  1  5]]
Accuracy 0.8666666666666667


In [175]:
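# Now let's see how the model does with only 2 components
# our accuracy decreases by about 3 percentage points, but we can see how this trade-off might be useful if we had 100s of components
# note: X_train and X_test were already transformed by the 4-component PCA above, so this second fit simply keeps the first two of those components (up to sign)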

pca = PCA(n_components=2)  
X_train = pca.fit_transform(X_train)  
X_test = pca.transform(X_test)  

In [176]:
classifier = RandomForestClassifier(max_depth=2, random_state=0)  
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)  

In [177]:
cm = confusion_matrix(y_test, y_pred)  
print(cm)  
print('Accuracy', accuracy_score(y_test, y_pred))  

[[11  0  0]
 [ 0 10  3]
 [ 0  2  4]]
Accuracy 0.8333333333333334
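The cells above overwrite `X_train` and `X_test` at each step, which makes it awkward to try several values of `n_components`. One alternative, sketched below with variable names of our own, is to wrap the scaler, PCA, and random forest in a scikit-learn pipeline and refit it from the raw features for each setting. The exact accuracies depend on your scikit-learn version, so treat the printed numbers as indicative only.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_all, y_all = datasets.load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=0)

scores_by_k = {}
for k in (1, 2, 3, 4):
    # each pipeline scales, projects onto k components, then classifies
    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=k),
        RandomForestClassifier(max_depth=2, random_state=0),
    )
    model.fit(X_tr, y_tr)                      # scaler and PCA are fit on training data only
    scores_by_k[k] = model.score(X_te, y_te)   # accuracy on the held-out test set

for k, acc in scores_by_k.items():
    print(f"n_components={k}: test accuracy {acc:.3f}")
```

Keeping the preprocessing inside the pipeline also guarantees that the test set is transformed with parameters learned from the training set alone.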


<a id='section8'></a>

### Additional Reading ###

1. Going into much greater depth on [PCA](https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf)
2. Visualizing [PCA](http://setosa.io/ev/principal-component-analysis/)

<a id='section9'></a>

### Conclusion ### 

This guide explained how principal component analysis helps reduce noise in our dataset and improve model speed via a simplified feature space. Next, we looked at the key parameter and limitations of PCA, namely the number of preserved components and the linearity, orthogonality, and variance requirements, respectively. Lastly, we stepped through two examples of how to implement PCA: the first covered visualization, while the second tackled PCA as a preprocessing step with random forests.

<a id='section10'></a>

### Sources ### 

1. https://arxiv.org/pdf/1404.1100.pdf?utm_campaign=buffer&utm_content=bufferb37df&utm_medium=social&utm_source=facebook.com 
2. https://stackabuse.com/implementing-pca-in-python-with-scikit-learn/
3. https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf
4. http://setosa.io/ev/principal-component-analysis/

