# Unsupervised Machine Learning - Dimensionality Reduction

In this notebook we'll explore some unsupervised machine learning approaches, where no labels are handed to us on a silver platter. Instead, we will try to exploit the inherent structure of our data in order to compress its feature space (dimensionality reduction).
***

## Principal Component Analysis

The first unsupervised technique we'll look at, and perhaps the most canonical of all feature reduction techniques, is **principal component analysis** (PCA). As usual we'll start with a very visual low-dimensional case, then move on to higher-dimensional cases and discuss how to deal with them:

In [12]:
from sklearn.datasets import make_regression
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

In [16]:
%matplotlib notebook
plt.style.use('seaborn')  # on Matplotlib >= 3.6 this style is named 'seaborn-v0_8'

In [28]:
X,y = make_regression(n_samples = 300, n_features = 2, n_informative = 1, effective_rank = 1,
                     noise = 3, n_targets = 1, bias=5, random_state=123)

df = pd.DataFrame(np.c_[X,y])
colnames = ['x1','x2','x3']
df.columns = colnames
df['x3'] -= df['x3'].mean()
df.head()

|   | x1 | x2 | x3 |
|---|---|---|---|
| 0 | 0.007147 | 0.011417 | 0.445408 |
| 1 | -0.097648 | 0.00426 | -6.244977 |
| 2 | -0.023289 | -0.050225 | -3.03257 |
| 3 | -0.013285 | 0.02359 | 3.505886 |
| 4 | -0.019858 | 0.040092 | -2.88898 |


We have 3 features that we wish to compress into 2. Let's first visualize what our data looks like in 3-dimensional space:

In [31]:
fig = plt.figure()
ax = fig.add_subplot(111,projection='3d')

ax.plot3D(df['x1'],df['x2'],df['x3'],'k.',label='Raw data')
ax.set_xlabel(r'$x_1$')
ax.set_ylabel(r'$x_2$')
ax.set_zlabel(r'$x_3$')
ax.set_title('3D data')
plt.show()


As the plot shows, 2 of our features are highly correlated. This gives us an opportunity to compress our data from $3 \to 2$ dimensions without losing too much information: the more correlated two features are, the less information you lose by removing one of them. We'll perform the compression using **principal component analysis**:

In [62]:
from sklearn.decomposition import PCA
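
The claim that correlated features carry redundant information can be checked numerically. The following is a minimal sketch on hypothetical toy data (the columns `a`, `b`, `c` are made up for illustration and are not the notebook's `x1`, `x2`, `x3`): `b` is a noisy copy of `a`, so once `a` is known, very little of `b`'s variance is left unexplained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: b is a noisy multiple of a, c is independent of both.
a = rng.normal(size=500)
b = 2.0 * a + rng.normal(scale=0.1, size=500)
c = rng.normal(size=500)
data = np.column_stack([a, b, c])

# Pairwise Pearson correlations between the three columns.
corr = np.corrcoef(data, rowvar=False)
print(np.round(corr, 2))

# Predict b from a with a least-squares line. The residual variance measures
# the information in b that is NOT already contained in a.
slope, intercept = np.polyfit(a, b, deg=1)
residual = b - (slope * a + intercept)
unexplained_fraction = residual.var() / b.var()
print(f"fraction of b's variance not explained by a: {unexplained_fraction:.4f}")
```

Dropping `b` here would lose only the small unexplained fraction, whereas dropping `c` would lose everything `c` contains.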

The <code>sklearn.decomposition.PCA</code> class follows the same estimator interface as many of the supervised learning models: we first <code>fit()</code> it to the data and then apply what was learned, here with <code>transform()</code>. The idea is that we need to *learn* the best-fitting "ellipse" of the data, whose axes are the principal components. The math behind this is beyond the scope of this tutorial, but hopefully we've already built an intuition in the theory component:

In [68]:
pc = PCA(n_components=1)
pc.fit(df[['x1','x2','x3']])

PCA(copy=True, iterated_power='auto', n_components=1, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [69]:
pc_data = pc.transform(df[['x1','x2','x3']])
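
For readers who want a rough picture of what `fit` and `transform` compute, here is a minimal sketch using a singular value decomposition in NumPy. It assumes hypothetical toy data rather than the notebook's `df`, and it is one standard way to obtain principal components, not necessarily the exact solver scikit-learn selects internally.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 points in 3-D, stretched mostly along one axis.
X = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])

# Step 1: center each feature at zero.
mean = X.mean(axis=0)
X_centered = X - mean

# Step 2: SVD of the centered data. The rows of Vt are the principal
# directions, ordered from largest to smallest variance.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
first_direction = Vt[0]

# Step 3: project every sample onto the first direction.
scores_manual = X_centered @ first_direction

# Variance of the data along that direction.
variance_manual = S[0] ** 2 / (X.shape[0] - 1)

# Compare with scikit-learn. The sign of a component is arbitrary,
# so the scores are compared in absolute value.
pca = PCA(n_components=1).fit(X)
scores_sklearn = pca.transform(X)[:, 0]
same_up_to_sign = np.allclose(np.abs(scores_manual), np.abs(scores_sklearn), atol=1e-6)
print(same_up_to_sign, variance_manual, pca.explained_variance_[0])
```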

Now we might want to query our fitted PCA object to examine how many components are useful for our compression step. We can do this by examining the variance explained by each principal component:

In [70]:
pc.explained_variance_

array([20.13386222])

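As you can see, much of the original data's variance is explained by a single principal component, so it may make sense to drop 2 components and keep just 1. Note that <code>explained_variance_</code> reports an absolute variance; <code>explained_variance_ratio_</code> expresses the same quantity as a fraction of the total, which is easier to judge.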
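
To put that single number in context, one option is to fit PCA with all three components and look at the cumulative share of variance. The sketch below reuses the notebook's `make_regression` call; the names `data`, `pca_full`, `ratios` and `cumulative` are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA

# Same data-generating call as earlier in the notebook.
X, y = make_regression(n_samples=300, n_features=2, n_informative=1,
                       effective_rank=1, noise=3, n_targets=1, bias=5,
                       random_state=123)
data = np.c_[X, y - y.mean()]  # columns play the role of x1, x2, x3

# Keep every component so that the ratios cover the full variance.
pca_full = PCA(n_components=3).fit(data)
ratios = pca_full.explained_variance_ratio_
cumulative = np.cumsum(ratios)

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"PC {i}: ratio={r:.4f}, cumulative={c:.4f}")
```

A common rule of thumb is to keep the smallest number of components whose cumulative ratio passes a chosen threshold. In the table shown earlier, `x3` takes values on a much larger scale than `x1` and `x2`, so it likely dominates the total variance; whether to standardize the features first is a separate modeling choice that this sketch does not make.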

In [77]:
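# Plot the compressed data: one coordinate (PC 1) per sample
fig, ax = plt.subplots()
# Only one component was kept, so y is fixed at zero purely for display
ax.plot(pc_data[:, 0], np.zeros(len(pc_data)), 'k.', label='compressed data')
ax.set_xlabel('PC 1')
ax.set_yticks([])  # the vertical axis carries no information
plt.show()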


After performing the data compression, we might want to display the data back in its original 3-dimensional space. This is achievable with <code>PCA.inverse_transform()</code>, which maps each compressed point back onto the line spanned by the retained component.

In [78]:
compressed_data = pc.inverse_transform(pc_data)
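
As a rough illustration of what the inverse transform does, the sketch below (on hypothetical toy data, with whitening left at its default of off) reproduces it by hand from the fitted attributes `components_` and `mean_`, and measures how much variance the round trip loses.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Hypothetical toy data: 100 points in 3-D with unequal spread per axis.
X = rng.normal(size=(100, 3)) * np.array([4.0, 1.0, 0.5])

pca = PCA(n_components=1).fit(X)
scores = pca.transform(X)                      # shape (100, 1)
reconstructed = pca.inverse_transform(scores)  # shape (100, 3)

# Without whitening, the inverse transform is the projection scaled back
# along the retained direction, plus the feature means.
manual = scores @ pca.components_ + pca.mean_
matches = np.allclose(reconstructed, manual)

# Variance lost by the round trip equals the variance of the dropped components.
lost_variance = np.sum((X - reconstructed) ** 2) / (X.shape[0] - 1)
total_variance = X.var(axis=0, ddof=1).sum()
print(matches, lost_variance, total_variance - pca.explained_variance_[0])
```

The reconstructed points are not the original data: they all lie on a single line in 3-D, and the distance from each original point to that line is the information discarded by the compression.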

In [79]:
fig = plt.figure()
ax = fig.add_subplot(111,projection='3d')

ax.plot3D(df['x1'],df['x2'],df['x3'],'k.',label='Raw data')
ax.plot3D(compressed_data[:,0],compressed_data[:,1],compressed_data[:,2],'r.',label='Compressed Data')
ax.set_xlabel(r'$x_1$')
ax.set_ylabel(r'$x_2$')
ax.set_zlabel(r'$x_3$')
ax.set_title('3D data')
ax.legend()  # show the labels assigned in the plot3D calls above
plt.show()
