This notebook carries out Principal Component Analysis (PCA) on the entire dataset

**How does PCA work?**
Suppose we have a dataset with n features and we want to reduce it to m dimensions (m < n)


1. Identify the hyperplane (a lower-dimensional representation of the n-feature data) that lies closest to the data

    1. Choose the axis that minimizes the mean squared distance between the original dataset and its projection onto that axis (equivalently, look for the axis that preserves the largest amount of variance)
    
    2. Get a second, third, ..., m-th axis, each orthogonal to all the previous ones, that accounts for the largest amount of remaining variance
    
*Implementation*

Note: X is the dataset

```
X_centered = X - X.mean(axis=0)  ### PCA assumes the data is centered

### Singular Value Decomposition (a matrix factorization) decomposes X_centered
### into the product of three matrices: U, Sigma and V^T.
### Note: np.linalg.svd returns V already transposed, hence the .T below.
U, s, V = np.linalg.svd(X_centered)

c_i = V.T[:, i]  ### i-th principal component (0-indexed)

   
```
2. "Project" the data onto it


*Implementation*

```
W_i = V.T[:, :i]  ### W_i is the matrix whose columns are the first i principal components

X_i = X_centered.dot(W_i)  ### X_i stores the dataset projected onto those i components

```
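
The two steps above can be put together in one short, self-contained sketch. The toy matrix `X`, the random seed and the choice `m = 2` are assumptions made purely for illustration. `full_matrices=False` is used so that `U` stays small, which matters for a dataset with hundreds of thousands of rows.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data (an assumption for illustration): 200 samples, 5 correlated features
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# Centre each feature, since PCA assumes centred data
X_centered = X - X.mean(axis=0)

# Step 1: economy-size SVD; the rows of Vt are the principal axes
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

m = 2                    # target dimensionality (hypothetical choice)
W_m = Vt.T[:, :m]        # first m principal components as columns
X_m = X_centered @ W_m   # Step 2: project the data onto them

# Share of the total variance captured by each component
explained_variance_ratio = s**2 / np.sum(s**2)

print(X_m.shape)
print(explained_variance_ratio.round(3))
```

The ratio computed on the last lines is the same quantity that scikit-learn exposes as `explained_variance_ratio_`.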

In [1]:
##Loading the data to memory

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data_filepath = "../data/UmeaSiblingBundlingMLDistances.csv"

all_data = pd.read_csv(data_filepath)

data = all_data.iloc[:,0:-2]

data.head()

Unnamed: 0,Cosine.FATHER_FORENAME,Cosine.FATHER_SURNAME,Cosine.MOTHER_FORENAME,Cosine.MOTHER_MAIDEN_SURNAME,Cosine.PARENTS_PLACE_OF_MARRIAGE,Cosine.PARENTS_DAY_OF_MARRIAGE,Cosine.PARENTS_MONTH_OF_MARRIAGE,Cosine.PARENTS_YEAR_OF_MARRIAGE,Damerau-Levenshtein.FATHER_FORENAME,Damerau-Levenshtein.FATHER_SURNAME,...,Metaphone-Levenshtein.PARENTS_MONTH_OF_MARRIAGE,Metaphone-Levenshtein.PARENTS_YEAR_OF_MARRIAGE,NYSIIS-Levenshtein.FATHER_FORENAME,NYSIIS-Levenshtein.FATHER_SURNAME,NYSIIS-Levenshtein.MOTHER_FORENAME,NYSIIS-Levenshtein.MOTHER_MAIDEN_SURNAME,NYSIIS-Levenshtein.PARENTS_PLACE_OF_MARRIAGE,NYSIIS-Levenshtein.PARENTS_DAY_OF_MARRIAGE,NYSIIS-Levenshtein.PARENTS_MONTH_OF_MARRIAGE,NYSIIS-Levenshtein.PARENTS_YEAR_OF_MARRIAGE
0,1.0,1.0,0.867,1.0,1.0,1.0,0.784,0.738,0.8,0.875,...,0,0,0.8,0.857,0.833,0.857,0.833,0,0,0
1,0.0,0.0,0.408,0.195,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0.0,0.0,0.0,0.0,0.0,0,0,0
2,0.883,1.0,0.855,1.0,1.0,0.784,0.0,0.738,0.833,0.875,...,0,0,0.8,0.8,0.833,0.857,1.0,0,0,0
3,0.0,0.0,0.0,0.195,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0.0,0.0,0.0,0.0,0.0,0,0,0
4,1.0,0.707,0.893,0.939,1.0,1.0,0.784,0.738,0.8,0.8,...,0,0,0.8,0.8,0.833,0.857,1.0,0,0,0


Using scikit-learn's PCA implementation

In [2]:
from sklearn.decomposition import PCA

In [3]:
pca = PCA(n_components=119)
pca.fit(data)

PCA(copy=True, iterated_power='auto', n_components=119, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)
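
As a sanity check on the manual steps, the sketch below fits `PCA` on a small synthetic matrix (an assumption, not the real dataset) and compares it with a direct SVD. The two should agree up to the sign of each component, since the sign of a singular vector is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))  # toy data

pca = PCA(n_components=3, svd_solver="full").fit(X)

# Manual route: centre the data, then take the SVD
X_centered = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
manual_ratio = (s**2 / np.sum(s**2))[:3]

# Compare absolute values because component signs may be flipped
same_axes = np.allclose(np.abs(pca.components_), np.abs(Vt[:3]))
same_ratio = np.allclose(pca.explained_variance_ratio_, manual_ratio)
print(same_axes, same_ratio)
```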

In [4]:
pca.explained_variance_ratio_

array([6.94459155e-01, 1.13266746e-01, 3.50612099e-02, 3.04503417e-02,
       2.59749479e-02, 2.49579518e-02, 2.36934851e-02, 8.62650485e-03,
       5.11414100e-03, 4.22148287e-03, 3.59656101e-03, 3.16073662e-03,
       2.99086933e-03, 2.06484957e-03, 1.88578453e-03, 1.80835913e-03,
       1.76085724e-03, 1.59860440e-03, 1.53874235e-03, 1.51435521e-03,
       1.34833363e-03, 1.18401783e-03, 1.09398392e-03, 1.07111283e-03,
       8.13290691e-04, 6.06879794e-04, 5.72390950e-04, 5.05024184e-04,
       4.84062125e-04, 4.58415050e-04, 3.42274048e-04, 3.40585766e-04,
       3.30056617e-04, 2.97418598e-04, 2.68311926e-04, 2.46250532e-04,
       2.25669129e-04, 2.03779058e-04, 1.99418665e-04, 1.93354424e-04,
       1.66904401e-04, 1.36454939e-04, 1.24941415e-04, 1.06489246e-04,
       8.71202224e-05, 8.39907462e-05, 7.34609903e-05, 6.10768949e-05,
       5.84413079e-05, 5.03208899e-05, 5.01121414e-05, 4.90715537e-05,
       4.76968832e-05, 4.65119848e-05, 4.11770221e-05, 4.03506803e-05,
      

The explained variance ratio tells us the proportion of the dataset's variance that lies along each principal component
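
One way to turn these ratios into a choice of `n_components` is to look at their running total. The sketch below uses only the first eight ratios copied from the output above, and the 0.90 threshold is an arbitrary value chosen for illustration.

```python
import numpy as np

# First eight explained-variance ratios, copied from the output above
ratios = np.array([6.94459155e-01, 1.13266746e-01, 3.50612099e-02, 3.04503417e-02,
                   2.59749479e-02, 2.49579518e-02, 2.36934851e-02, 8.62650485e-03])

cumulative = np.cumsum(ratios)

threshold = 0.90  # hypothetical target share of variance
# Position of the first running total that reaches the threshold, plus one
n_needed = int(np.argmax(cumulative >= threshold)) + 1

print(cumulative.round(4))
print(n_needed)  # 6: the first six components reach 90%
```

scikit-learn can also make this choice itself: passing a float between 0 and 1 as `n_components` keeps enough components to reach that share of the variance.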

It looks like only a small number of components carry most of the variance (assuming any ratio below 0.001 is negligible, so there is no reason to keep those components around):

In [5]:
pca.components_ ### Returns an array of [n_components, n_features]

array([[ 1.15271974e-01,  1.18101979e-01,  1.00740054e-01, ...,
        -0.00000000e+00, -0.00000000e+00, -0.00000000e+00],
       [-3.20345341e-02, -2.61447220e-02, -2.69279839e-02, ...,
        -0.00000000e+00, -0.00000000e+00, -0.00000000e+00],
       [-1.19473782e-01, -5.14903255e-02, -5.66347580e-02, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00],
       ...,
       [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         0.00000000e+00,  1.00000000e+00,  0.00000000e+00],
       [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00, ...,
         0.00000000e+00,  0.00000000e+00,  1.00000000e+00],
       [ 0.00000000e+00, -3.56318551e-16,  6.61053519e-16, ...,
         0.00000000e+00,  0.00000000e+00,  0.00000000e+00]])

In [6]:
pca = PCA(n_components=23)
pca.fit(data)

PCA(copy=True, iterated_power='auto', n_components=23, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

In [7]:
data_pca = pca.transform(data)

In [8]:
most_important = [np.abs(pca.components_[i]).argmax() for i in range(pca.components_.shape[0])]

In [9]:
most_important

[21,
 20,
 78,
 80,
 83,
 77,
 82,
 23,
 72,
 83,
 81,
 82,
 78,
 77,
 54,
 114,
 107,
 96,
 73,
 53,
 112,
 98,
 105]

In [10]:
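# For each principal component, keep the original column with the largest loading
# (a column picked by several components is only stored once)
data_relevant_features = pd.DataFrame()
for idx in most_important:
    data_relevant_features[data.columns[idx]] = data.iloc[:, idx]


In [11]:
data_relevant_features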

Unnamed: 0,Jaccard.PARENTS_DAY_OF_MARRIAGE,Jaccard.PARENTS_PLACE_OF_MARRIAGE,JaroWinkler.PARENTS_MONTH_OF_MARRIAGE,LongestCommonSubstring.FATHER_FORENAME,LongestCommonSubstring.MOTHER_MAIDEN_SURNAME,JaroWinkler.PARENTS_DAY_OF_MARRIAGE,LongestCommonSubstring.MOTHER_FORENAME,Jaccard.PARENTS_YEAR_OF_MARRIAGE,JaroWinkler.FATHER_FORENAME,LongestCommonSubstring.FATHER_SURNAME,BagDistance.PARENTS_MONTH_OF_MARRIAGE,NYSIIS-Levenshtein.MOTHER_FORENAME,Metaphone-Levenshtein.MOTHER_MAIDEN_SURNAME,SmithWaterman.FATHER_FORENAME,JaroWinkler.FATHER_SURNAME,BagDistance.PARENTS_DAY_OF_MARRIAGE,NYSIIS-Levenshtein.FATHER_FORENAME,SmithWaterman.MOTHER_FORENAME,Metaphone-Levenshtein.FATHER_SURNAME
0,1.0,1.0,0.3,0.875,0.923,1.0,0.958,0.750,0.517,0.923,0.5,0.833,0.667,0.750,0.569,1.0,0.800,0.833,0.80
1,0.0,0.0,0.0,0.000,0.875,0.0,0.800,0.000,0.000,0.000,0.0,0.000,0.000,0.000,0.000,0.0,0.000,0.077,0.00
2,0.8,1.0,0.0,0.833,0.938,0.3,0.955,0.750,1.000,0.933,0.0,0.833,0.800,0.500,0.500,0.5,0.800,0.800,0.80
3,0.0,0.0,0.0,0.000,0.875,0.0,0.000,0.000,0.000,0.000,0.0,0.000,0.000,0.000,0.000,0.0,0.000,0.000,0.00
4,1.0,1.0,0.3,0.875,0.938,1.0,0.950,0.750,0.517,0.889,0.5,0.833,0.750,0.750,0.250,0.5,0.800,0.750,0.75
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
862791,1.0,0.0,0.3,0.917,0.917,1.0,0.947,0.889,0.522,0.933,0.5,0.833,0.800,0.667,0.583,1.0,0.833,0.600,0.75
862792,1.0,0.0,0.0,0.857,0.909,1.0,0.957,0.750,0.489,0.933,0.0,0.857,0.750,0.667,0.583,0.5,0.750,0.800,0.75
862793,0.8,0.0,1.0,0.875,0.923,0.3,0.941,0.571,1.000,0.917,1.0,0.833,0.750,0.750,0.399,0.5,0.750,0.660,0.75
862794,0.8,0.0,1.0,0.923,0.750,0.3,0.941,0.750,1.000,0.923,0.5,0.833,0.000,0.750,0.571,0.5,0.833,0.660,0.80


In [12]:
total_variance = 0

for i in range(len(pca.explained_variance_ratio_)):
    total_variance += pca.explained_variance_ratio_[i]
    
total_variance

0.9913720105795892

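The 23 principal components account for 99.1% of all the variance. data_relevant_features stores, for each of them, the original feature with the largest absolute loading (19 distinct columns, because some features dominate more than one component), so it approximates that variance rather than capturing it exactly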
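
The selection step can be illustrated on a tiny example. The loading matrix and the feature names below are made up. The point is only to show how taking the arg-max of the absolute loadings maps each component to one original column, and why the result can contain fewer columns than components.

```python
import numpy as np

# Hypothetical loadings: 3 components (rows) x 4 features (columns)
components = np.array([
    [0.10, -0.90,  0.30,  0.20],
    [0.70,  0.10,  0.05, -0.60],
    [0.20,  0.80, -0.50,  0.10],
])
feature_names = ["f0", "f1", "f2", "f3"]  # placeholder names

# Feature with the largest absolute loading on each component
most_important = np.abs(components).argmax(axis=1).tolist()

# Drop repeats while keeping first-seen order
unique_idx = list(dict.fromkeys(most_important))
selected = [feature_names[i] for i in unique_idx]

print(most_important)  # [1, 0, 1]
print(selected)        # ['f1', 'f0']
```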

**1.5 hours researching PCA and carrying out the tests**

**Total time : 14 hours**

1. (2 hours) researching feature selection
2. (4 hours) playing around with the top-100 csv file
3. (0.5 hours) researching correlation metrics
4. (1 hour) figuring out Pearson correlation assumption tests
5. (2.5 hours) repeating all the tests from the top-100 csv file on the entire dataset
6. (0.5 hours) researching chi2 feature selection
7. (1 hour) compiling the summary so far
8. (1 hour) visualizing statistics about each column
9. (1.5 hours) researching PCA and carrying out the tests

Of course we can run PCA on all the columns at once, but another approach is to break the 120 features into their eight "octate" groups (the ones we observed in all_correlation_tests.ipynb) and run PCA on each group to extract a single "representative" feature, namely the group's first principal component. We can then build another dataset that collects the representative feature from every group
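
A compact sketch of this grouping idea on toy data follows. It assumes the columns are ordered so that every eighth column belongs to the same group, which is what the `i % 8` indexing in the next cell relies on. The matrix shape, the seed and the helper name `first_component_scores` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_groups = 8
X = rng.normal(size=(50, 24))  # toy data: 24 columns -> 8 groups of 3

def first_component_scores(block):
    """Project a block of columns onto its first principal component."""
    centered = block - block.mean(axis=0)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[0]

# Column i belongs to group i % n_groups, so group g is X[:, g::n_groups]
reduced = np.column_stack(
    [first_component_scores(X[:, g::n_groups]) for g in range(n_groups)]
)

print(reduced.shape)  # (50, 8)
```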

In [13]:
# Split the columns into 8 groups: column i goes to group i % 8 and is
# labelled by its position within that group ("00" to "14")
octate = [pd.DataFrame() for _ in range(8)]

for i in range(len(data.columns)):
    octate[i % 8][f"{i // 8:02d}"] = data.iloc[:, i].values

In [14]:
octate_pca = pd.DataFrame()

for i in range(len(octate)):
    pca = PCA(n_components=1)
    group_pca  = pca.fit_transform(octate[i])
    octate_pca["Octate " + str(i)] = group_pca.T[0]
#     group_pca = pca.transform(octate[0])
#     most_important = [np.abs(pca.components_[i]).argmax() for i in range(pca.components_.shape[0])]
#     print("Octate " + str(i) + " principal component is storing " + str(pca.explained_variance_ratio_[0]) + " of the variance of the group")
#     octate_pca["Octate " + str(i)] = octate[0].iloc[:,most_important[0]]
    


In [15]:
octate_pca

Unnamed: 0,Octate 0,Octate 1,Octate 2,Octate 3,Octate 4,Octate 5,Octate 6,Octate 7
0,-1.086382,-1.347166,-0.974269,-1.198671,2.679217,1.704705,0.645983,0.826867
1,2.022164,1.869678,0.668008,1.176362,-0.804113,-1.575691,-1.367392,-1.182983
2,-1.021147,-1.283944,-0.977266,-1.160833,2.906467,0.471226,-1.367392,0.993613
3,2.022164,1.869678,1.928266,1.176362,-0.804113,-1.575691,-1.367392,-1.182983
4,-1.118184,-0.501730,-1.108156,-1.063629,2.906467,1.282538,0.645983,0.993613
...,...,...,...,...,...,...,...,...
862791,-1.148661,-1.427132,-1.057365,-1.456374,-0.804113,1.704705,0.645983,1.489065
862792,-1.062276,-1.383712,-1.013248,-1.194825,-0.804113,1.282538,-1.367392,0.967725
862793,-1.323312,-0.995814,-0.671961,-1.127002,-0.804113,0.471226,1.906076,0.440951
862794,-1.444701,-1.349518,-0.686701,0.895156,-0.804113,0.471226,1.498702,0.993613


**0.5 hours carrying out PCA tests on octates**

**Total time : 14.5 hours**

1. (2 hours) researching feature selection
2. (4 hours) playing around with the top-100 csv file
3. (0.5 hours) researching correlation metrics
4. (1 hour) figuring out Pearson correlation assumption tests
5. (2.5 hours) repeating all the tests from the top-100 csv file on the entire dataset
6. (0.5 hours) researching chi2 feature selection
7. (1 hour) compiling the summary so far
8. (1 hour) visualizing statistics about each column
9. (1.5 hours) researching PCA and carrying out the tests
10. (0.5 hours) carrying out PCA tests on octates

In [16]:
all_data.iloc[:,-2]

0        -1
1         1
2        -1
3         1
4        -1
         ..
862791   -1
862792   -1
862793   -1
862794   -1
862795   -1
Name: link_non-link, Length: 862796, dtype: int64

Checking how good the PCA-reduced features are for classification

In [36]:
# from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X = octate_pca
y = all_data.iloc[:,-2]
X_train,X_test,y_train, y_test = train_test_split(X,y,random_state=42)
classifier = RandomForestClassifier()
classifier.fit(X_train,y_train)

classifier.score(X_test,y_test)

0.999995363909893

In [37]:
X = data
y = all_data.iloc[:,-2]
X_train,X_test,y_train, y_test = train_test_split(X,y,random_state=42)
classifier = RandomForestClassifier()
classifier.fit(X_train,y_train)

classifier.score(X_test,y_test)

0.9999860917296789

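Not much difference: the eight PCA-derived octate features score 0.999995 on the test split and the full 120-feature set scores 0.999986, so the reduction costs essentially nothing in accuracy here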
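
The two cells above repeat the same split, fit and score steps, so the comparison can be wrapped in a small helper. The sketch below runs on synthetic data from `make_classification` rather than the real dataset. The helper name `holdout_accuracy`, the sample sizes and `n_estimators=20` are all choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def holdout_accuracy(X, y, seed=42):
    """Fit a random forest on a train split and return its test accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    clf = RandomForestClassifier(n_estimators=20, random_state=seed)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)

# Synthetic stand-in for the real features and labels
X_full, y = make_classification(n_samples=400, n_features=20,
                                n_informative=5, random_state=0)
X_reduced = PCA(n_components=5).fit_transform(X_full)

scores = {
    "all features": holdout_accuracy(X_full, y),
    "PCA features": holdout_accuracy(X_reduced, y),
}
print(scores)
```

Fixing `random_state` on the classifier as well as on the split makes the two scores reproducible, so small differences are easier to attribute to the features rather than to run-to-run noise.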