# Introduction
In my previous notebook, `2. Text Processing`, I tokenized and stemmed the bodies and subjects of 6041 emails from the SpamAssassin corpus.

Eventually I will represent each email as a vector built from one-hot word encodings, with every dimension representing a word in the vocabulary. Currently this would lead to a 98401-dimensional space.
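
One possible reading of this representation, sketched on a toy vocabulary. The vocabulary and tokens below are made up for illustration, and I assume each dimension is a simple presence flag (the real vectors might hold counts instead).

```python
import numpy as np

# Hypothetical toy vocabulary and a tokenized, stemmed email (not from the corpus).
vocabulary = ["free", "meet", "money", "offer", "tomorrow"]
email_tokens = ["free", "money", "offer", "free"]

# One dimension per vocabulary word: 1 if the word occurs in the email, else 0.
word_index = {word: i for i, word in enumerate(vocabulary)}
vector = np.zeros(len(vocabulary), dtype=int)
for token in email_tokens:
    if token in word_index:
        vector[word_index[token]] = 1

print(vector)  # [1 0 1 1 0]
```

With the full vocabulary, the same construction would give one dimension per word, which is why the space becomes so large.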

In this notebook I will implement Principal Component Analysis (PCA) and use it for dimensionality reduction on my data.

# Implementing PCA
I will start off by implementing PCA using NumPy.

In [1]:
import pandas as pd
import numpy as np

In [219]:
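# Implementation using NumPy's singular value decomposition
class PCA_svd():
    def transform(self, X, dims):
        # Center the data once; PCA assumes zero-mean features
        X = X - X.mean(axis=0)
        # np.linalg.svd returns V transposed: its rows are the principal directions
        _, _, Vt = np.linalg.svd(X)
        # Keep the first `dims` directions and project the centered data onto them
        V_dims = Vt[:dims]
        return X.dot(V_dims.T)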
        

In [341]:
# Implementation using the eigendecomposition of the covariance matrix
class PCA():
    def fit(self, X):
        # Center the data and remember the mean so transform() can reuse it
        self._mean = X.mean(axis=0)
        X = X - self._mean
        X_cov = np.cov(X.T)
        eig_vals, eig_vecs = np.linalg.eig(X_cov)
        # Sort the eigenpairs by descending eigenvalue (explained variance)
        sort_index = eig_vals.argsort()[::-1]
        self.eig_vecs = eig_vecs[:,sort_index]
        eig_vals = eig_vals[sort_index]
        # Fraction of the total variance explained by each component
        self.information = eig_vals/eig_vals.sum()
        
    def transform(self, X, dim=None):
        X = X - self._mean
        # Default: keep every component
        if not dim:
            dim = len(self.eig_vecs)
        
        # If dim is a ratio in (0, 1), find how many dimensions are needed to keep that share of the information
        if 0 < dim < 1:
            dim = np.argmax(np.cumsum(self.information) > dim)+1
        return X.dot(self.eig_vecs[:,:dim])
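
To make the ratio branch of `transform` concrete, here is a minimal sketch of the same `argmax`-over-`cumsum` idea on its own. The ratios and the target below are made-up illustrative values, not results from the email data.

```python
import numpy as np

# Illustrative explained-variance ratios, sorted in descending order (made-up values).
information = np.array([0.6, 0.25, 0.1, 0.05])
target_ratio = 0.9

cumulative = np.cumsum(information)      # roughly [0.6, 0.85, 0.95, 1.0]
reached = cumulative > target_ratio      # [False, False, True, True]

# argmax returns the index of the first True, so add 1 to get a component count.
n_components = int(np.argmax(reached)) + 1

print(n_components)  # 3
```

Two components only keep about 85% of the variance here, so the third one is needed to pass the 90% target.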
        
        
        

Let's test our implementation by comparing it with the SVD implementation and scikit-learn's PCA implementation.

In [342]:
X = np.array([[1, 2], [1,3], [1,5]])

In [343]:
X

array([[1, 2],
       [1, 3],
       [1, 5]])

In [344]:
X.mean(axis=0)

array([ 1.        ,  3.33333333])

In [345]:
from sklearn.decomposition import PCA as sklearnPCA

In [346]:
pca = PCA()

In [347]:
sk_pca = sklearnPCA()

In [348]:
sk_pca.fit_transform(X)

array([[-1.33333333,  0.        ],
       [-0.33333333, -0.        ],
       [ 1.66666667,  0.        ]])

In [349]:
pca.fit(X)

In [350]:
pca.transform(X, 2)

array([[-1.33333333,  0.        ],
       [-0.33333333,  0.        ],
       [ 1.66666667,  0.        ]])

In [351]:
pca_svd = PCA_svd()

In [352]:
pca_svd.transform(X, 2)

array([[-1.33333333,  0.        ],
       [-0.33333333,  0.        ],
       [ 1.66666667,  0.        ]])

Equivalent results for our toy matrix. 

Let's test on a random matrix.

In [353]:
from numpy.random import rand

In [354]:
Y = rand(5, 5)

In [355]:
pca.fit(Y)

In [356]:
sk_pca.fit(Y)

PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [357]:
np.cumsum(pca.information)

array([ 0.58966147,  0.86115239,  0.96129762,  1.        ,  1.        ])

In [358]:
np.cumsum(sk_pca.explained_variance_ratio_)

array([ 0.58966147,  0.86115239,  0.96129762,  1.        ,  1.        ])
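
The agreement is expected: the eigenvalues of the covariance matrix are proportional to the squared singular values of the centered data, so both routes give the same variance ratios. Below is a NumPy-only sketch of that check. `Y_demo` and the seed are stand-ins I made up, not the `Y` used above, and this is not necessarily how scikit-learn computes its ratios internally.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, arbitrary choice
Y_demo = rng.random((5, 5))              # hypothetical stand-in for Y
Y_centered = Y_demo - Y_demo.mean(axis=0)

# Route 1: eigenvalues of the covariance matrix, largest first.
eig_vals = np.linalg.eigvalsh(np.cov(Y_centered.T))[::-1]
ratio_eig = eig_vals / eig_vals.sum()

# Route 2: squared singular values of the centered data (already sorted descending).
singular_values = np.linalg.svd(Y_centered, compute_uv=False)
ratio_svd = singular_values**2 / (singular_values**2).sum()

print(np.allclose(ratio_eig, ratio_svd))
```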

In [359]:
Y_pca = pca.transform(Y)
Y_pca

array([[ -6.76584967e-01,  -2.49731314e-01,   2.07600396e-01,
         -3.56303579e-02,   3.73283151e-16],
       [  5.09866145e-01,   3.27248632e-01,   2.13156614e-01,
         -8.35043113e-02,  -1.64549378e-17],
       [ -2.62786138e-01,   1.86974737e-01,  -2.92669806e-01,
         -1.19341722e-01,   4.12900884e-16],
       [ -1.04531497e-01,   2.27751301e-01,  -3.64434736e-02,
          2.19627966e-01,   1.39547410e-16],
       [  5.34036456e-01,  -4.92243356e-01,  -9.16437310e-02,
          1.88484256e-02,  -1.88742647e-16]])

In [360]:
Y_sk_pca = sk_pca.transform(Y)
Y_sk_pca

array([[  6.76584967e-01,   2.49731314e-01,  -2.07600396e-01,
         -3.56303579e-02,   2.01316271e-16],
       [ -5.09866145e-01,  -3.27248632e-01,  -2.13156614e-01,
         -8.35043113e-02,   5.17459270e-17],
       [  2.62786138e-01,  -1.86974737e-01,   2.92669806e-01,
         -1.19341722e-01,   1.47615485e-16],
       [  1.04531497e-01,  -2.27751301e-01,   3.64434736e-02,
          2.19627966e-01,   2.97259318e-16],
       [ -5.34036456e-01,   4.92243356e-01,   9.16437310e-02,
          1.88484256e-02,  -2.55663004e-18]])

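Some components have opposite signs, but this does not matter: the sign of an eigenvector is arbitrary, so each principal axis is only determined up to its direction. All I need is for corresponding principal components to be parallel!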
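
A quick way to confirm that two projections differ only by sign is to flip the columns of one to match the other and then compare. The sketch below uses only NumPy and made-up random data (`data` and the seed are assumptions), comparing a covariance-eigenvector projection with an SVD projection.

```python
import numpy as np

rng = np.random.default_rng(1)           # fixed seed, arbitrary choice
data = rng.random((20, 4))               # hypothetical data: 20 samples, 4 features
centered = data - data.mean(axis=0)

# Projection via eigendecomposition of the covariance matrix.
eig_vals, eig_vecs = np.linalg.eigh(np.cov(centered.T))
order = eig_vals.argsort()[::-1]
proj_eig = centered @ eig_vecs[:, order]

# Projection via SVD; the rows of Vt are the principal directions.
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
proj_svd = centered @ Vt.T

# Flip each column of proj_svd so that it points the same way as proj_eig.
signs = np.sign((proj_eig * proj_svd).sum(axis=0))
print(np.allclose(proj_eig, proj_svd * signs))
```

If the check passes, every column pair is parallel and the two results describe the same principal axes.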

Now let's move on with the project: time to do some dimensionality reduction on the email data!

# Dimensionality reduction on the email word vectors