![@mikegchambers](../../images/header.png)

# Principal Component Analysis

In this notebook, we explore Principal Component Analysis (PCA) using scikit-learn to carry out dimensionality reduction.

![Camera](camera.png)

In [None]:
from sklearn.decomposition import PCA

from sklearn.datasets import make_blobs

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Data

Here we create our dataset. In this case we generate blobs with `make_blobs`, and we can experiment with different numbers of centers by changing `K`.

In [None]:
K = 3
X, y = make_blobs(n_samples=100, centers=K, n_features=3, cluster_std=1.5, random_state=0)

Let's plot the data. With a 2D graph we only have room for two axes, so we plot just the first two of the three features.

In [None]:
plt.scatter(X[:,0], X[:,1], c=y)

Let's plot the data again. As we have more than two dimensions, let's plot a 3D graph to see as much of the data as we can.

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y)


# Model

https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

Now we create the PCA model. We can select the number of components to keep with `n_components`.

In [None]:
pca = PCA(n_components=2)
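As a side note, `n_components` does not have to be an integer. The sketch below assumes we would rather ask for a share of the variance (here 95%, an arbitrary choice) and let scikit-learn pick the number of components. The names `X_demo` and `pca_var` are made up for this illustration and are not used elsewhere in the notebook.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Same kind of data as in the notebook, regenerated so this cell stands alone.
X_demo, _ = make_blobs(n_samples=100, centers=3, n_features=3,
                       cluster_std=1.5, random_state=0)

# A float between 0 and 1 asks PCA to keep enough components
# to explain at least that fraction of the variance.
pca_var = PCA(n_components=0.95)
pca_var.fit(X_demo)

kept = pca_var.n_components_                       # number of components chosen
total = pca_var.explained_variance_ratio_.sum()    # variance they explain
print(kept, total)
```

With only three features there is little room to choose, but the same call can be handy on wider datasets.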

We fit the model to our X data and transform it in one step. PCA is an unsupervised algorithm, so the labels `y` are not used.

In [None]:
t = pca.fit_transform(X)
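To get a feel for what `fit_transform` is doing, here is a rough sketch that reproduces the projection by hand with NumPy: center the data, take a singular value decomposition, and project onto the first two directions. This is an illustration under the assumption that default PCA settings are used; it is not how you would normally write it, and the names `X_demo`, `manual` and `reference` are invented for the example. The sign of each component is arbitrary, so the comparison ignores signs.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X_demo, _ = make_blobs(n_samples=100, centers=3, n_features=3,
                       cluster_std=1.5, random_state=0)

# Step 1: center each feature on zero.
X_centered = X_demo - X_demo.mean(axis=0)

# Step 2: the rows of Vt are the principal directions, ordered by variance.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Step 3: project the centered data onto the first two directions.
manual = X_centered @ Vt[:2].T

# Compare with scikit-learn, ignoring the sign of each component.
reference = PCA(n_components=2).fit_transform(X_demo)
same_up_to_sign = np.allclose(np.abs(manual), np.abs(reference), atol=1e-6)
print(same_up_to_sign)
```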

# Result

Now we plot the PCA components on a 2D graph (assuming that we chose 2 for `n_components`).

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)

ax.scatter(t[:, 0], t[:, 1], c=y)
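A 2D plot of the components looks tidy, but it is worth checking how much of the original variance it actually keeps. The sketch below reads this from the fitted model's `explained_variance_ratio_` attribute. The data is regenerated so the cell stands alone, and `X_demo`, `pca_demo`, `ratios` and `retained` are names chosen just for this example.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X_demo, _ = make_blobs(n_samples=100, centers=3, n_features=3,
                       cluster_std=1.5, random_state=0)

pca_demo = PCA(n_components=2).fit(X_demo)

# One entry per kept component: the fraction of total variance it explains.
ratios = pca_demo.explained_variance_ratio_

# Fraction of the variance that survives the reduction from 3D to 2D.
retained = ratios.sum()
print(ratios, retained)
```

A value of `retained` close to 1 suggests that little information was lost by dropping the third dimension.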

# Finding the clusters

Let's now use another algorithm, K-means clustering, to find the clusters in the data. This is a simple example, but using two algorithms in conjunction like this, or chaining the models together, is a common use of PCA.

In [None]:
from sklearn.cluster import KMeans

We create the model looking for the same number of clusters as we originally set.

In [None]:
# UPDATE: n_init is set explicitly to prevent a warning in newer versions of scikit-learn.
kma = KMeans(n_clusters=K, n_init=10)

In [None]:
kma.fit(t)

Just as we did in the K-means build lesson, we extract the centers and the labels from the model.

In [None]:
clusters = kma.cluster_centers_
labels = kma.labels_
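Because `make_blobs` also gave us the true labels, we can put a number on how well K-means recovered them. The sketch below assumes the adjusted Rand index from `sklearn.metrics` is a reasonable measure here: it compares two labelings while ignoring the cluster numbering, which matters because K-means assigns its label numbers arbitrarily. A fixed `random_state` is added to K-means only to make this illustration repeatable, and the `_demo` names are invented for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

X_demo, y_demo = make_blobs(n_samples=100, centers=3, n_features=3,
                            cluster_std=1.5, random_state=0)

# Reduce to 2D, then cluster in the reduced space.
t_demo = PCA(n_components=2).fit_transform(X_demo)
labels_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(t_demo)

# 1.0 means a perfect match; values near 0 mean no better than chance.
score = adjusted_rand_score(y_demo, labels_demo)
print(score)
```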

Finally, we plot the '2D view' of the original data next to the PCA view with K-means applied, marking the cluster centers with a cross.

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

# Left: first two features of the original data, colored by the true labels.
ax1.set_title('Original Data')
ax1.scatter(X[:, 0], X[:, 1], c=y)

# Right: PCA projection, colored by the K-means labels, with centers marked.
ax2.set_title('K-means Data')
ax2.scatter(t[:, 0], t[:, 1], c=labels)
ax2.scatter(clusters[:, 0], clusters[:, 1], marker="x", s=400, linewidth=1, c="black", zorder=1000)

plt.tight_layout()
plt.show()
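The chaining of PCA and K-means described earlier can also be written as a single scikit-learn pipeline. This is one possible way to package the two steps, not something the notebook relies on. It assumes the same blob data, and names such as `pipe` and `pipe_labels` are made up for the sketch.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

X_demo, _ = make_blobs(n_samples=100, centers=3, n_features=3,
                       cluster_std=1.5, random_state=0)

# Each step feeds its output to the next: 3D data -> 2D components -> clusters.
pipe = make_pipeline(
    PCA(n_components=2),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)

# Fits PCA, transforms the data, then fits K-means and returns its labels.
pipe_labels = pipe.fit_predict(X_demo)

# make_pipeline names each step after its lowercased class name.
centers = pipe.named_steps["kmeans"].cluster_centers_
print(pipe_labels[:10], centers.shape)
```

The benefit is mostly convenience: new data can be passed through `pipe.predict` and will be projected and assigned to a cluster in one call.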