# PCA – Principal Component Analysis

**PCA** is a dimensionality-reduction method. It reduces the dimensionality of a large data set by transforming a large collection of variables into a smaller one that still contains most of the information in the original set, trading a little accuracy for simplicity.
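
As a rough illustration of that trade-off, the sketch below runs PCA by hand with NumPy's SVD on a small synthetic data set (the data, the seed, and the choice of keeping one component are assumptions made for this example, not part of the notebook's data). It shows how much variance a single component keeps and how large the reconstruction error is.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples of 3 strongly correlated variables plus noise
base = rng.normal(size=(200, 1))
X = np.hstack([base, 2 * base, -base]) + 0.1 * rng.normal(size=(200, 3))

# Centre the data, then take the SVD; the rows of Vt are the principal directions
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained_ratio = S**2 / np.sum(S**2)

# Keep only the first component and map the scores back to the original space
k = 1
scores = Xc @ Vt[:k].T
X_approx = scores @ Vt[:k] + X.mean(axis=0)

# Relative reconstruction error: the "little accuracy" that is given up
rel_error = np.linalg.norm(X - X_approx) / np.linalg.norm(Xc)
print(explained_ratio, rel_error)
```

Here three columns collapse to one score per sample, and the squared relative error equals the share of variance left in the discarded components.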


**The libraries used are:**

In [1]:
%matplotlib widget
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn import preprocessing
import matplotlib.pyplot as plt

import common
#Class containing all the data already converted to numeric
from Data import Merge


Before running PCA, select the columns to use, then drop incomplete rows and columns with too many empty values.

In [2]:
data = Merge().data
print("before treating", len(data))

# remove unused columns
data = data.drop(['RID', 'LDELTOTAL', 'DIGITSCOR', 'VISCODE'], axis=1) 

dataLabels = data.columns.values
dataNumpy = data.to_numpy()

rowsRemove = []
labelsRemove = []



# Remove incomplete rows: keep only rows with at least 32 non-empty values
for i in range(len(dataNumpy)):
    count = 0
    for j in dataNumpy[i]:
        if j != '':
            count += 1
    #if count < len(dataLabels) - 2:
    if count < 32:
        rowsRemove.append(i)

data = data.drop(rowsRemove)

# remove columns that still contain 300 or more empty values
for i in dataLabels:
    count = 0
    for j in data[i]:
        if j == '': 
            count += 1
        if count == 300:
            labelsRemove.append(i)
            break
data = data.drop(labelsRemove, axis=1)
print("After treating", len(data))

before treating 15087
After treating 5393
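
The two filtering loops above can also be written without explicit Python loops. The following is a minimal sketch, assuming (as the loops do) that missing values are stored as empty strings; the helper name `drop_incomplete`, its parameters, and the toy table are hypothetical and only serve the illustration.

```python
import pandas as pd

def drop_incomplete(df, min_filled_per_row, max_empty_per_column, missing=''):
    """Drop rows with too few filled cells, then columns with too many empty cells."""
    # Count non-missing cells in every row and keep the sufficiently complete rows
    filled_per_row = (df != missing).sum(axis=1)
    df = df[filled_per_row >= min_filled_per_row]
    # Count missing cells in every remaining column and keep the dense columns
    empty_per_column = (df == missing).sum(axis=0)
    return df.loc[:, empty_per_column < max_empty_per_column]

# Hypothetical toy table, with '' marking a missing value
toy = pd.DataFrame({
    'A': [1, 2, '', 4],
    'B': [5, '', '', 8],
    'C': ['', '', 9, 10],
})
cleaned = drop_incomplete(toy, min_filled_per_row=2, max_empty_per_column=1)
print(cleaned)
```

With the thresholds used in this notebook, the equivalent call would presumably be `drop_incomplete(data, 32, 300)`.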


**Before starting**, the data must be centred and scaled so that each variable has a mean of 0 and a standard deviation of 1. The standard deviation is the square root of the variance, which is given by:

\begin{equation}
    \text{variance} = \frac{ \sum (\text{measurement} - \text{mean})^2 } { \text{number of measurements} }
\end{equation}
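
A small worked example may make the formula concrete. The values below are made up for illustration. The sketch computes the mean, the variance (dividing by the number of measurements, as in the formula), and the standardized values, which is also what `preprocessing.scale` does per column by default.

```python
import numpy as np

# Hypothetical measurements of a single variable
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = x.mean()                                # 40 / 8 = 5.0
variance = np.sum((x - mean) ** 2) / len(x)    # 32 / 8 = 4.0
std = np.sqrt(variance)                        # 2.0

# Centre and scale: the result has mean 0 and standard deviation 1
z = (x - mean) / std
print(mean, variance, std)
print(z)
```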


In [3]:
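# Transpose so that each variable is a row (one point in the PCA plot);
# scale() then standardizes each column of the transposed table
scaledData = preprocessing.scale(data.T)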

In [13]:
plt.figure(figsize=[18, 10])
pca = PCA()
pca.fit(scaledData)
pcaData = pca.transform(scaledData)

per_var = np.round(pca.explained_variance_ratio_ * 100, decimals=1)
labels = ['PC' + str(x) for x in range(1, len(per_var)+1)]

plt.bar(x=range(1, len(per_var)+1), height=per_var, tick_label=labels)
plt.ylabel('Percentage of Explained Variance')
plt.xlabel('Principal Component')
plt.title('Scree Plot - Principal Components')
plt.tight_layout(pad=2)

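
A scree plot is often read together with the cumulative explained variance, to decide how many components are worth keeping. The sketch below shows one possible way to do that; the ratios and the 90% target are hypothetical values chosen for illustration, standing in for `pca.explained_variance_ratio_`.

```python
import numpy as np

# Hypothetical explained-variance ratios, one per principal component
ratios = np.array([0.62, 0.21, 0.09, 0.05, 0.03])

# Running total of the variance explained by the first 1, 2, 3, ... components
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative share reaches the target
threshold = 0.90  # assumed target, not taken from the notebook
n_components = int(np.searchsorted(cumulative, threshold) + 1)
print(cumulative, n_components)
```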

In [5]:
plt.figure(figsize=[18, 10])

# Use the columns that survived the cleaning step as row labels
pca_df = pd.DataFrame(pcaData, index=data.columns, columns=labels)
plt.scatter(pca_df.PC1, pca_df.PC2)
plt.title('PCA Graph')
plt.xlabel('PC1 - {0}%'.format(per_var[0]))
plt.ylabel('PC2 - {0}%'.format(per_var[1]))

for sample in pca_df.index:
    plt.annotate(sample, (pca_df.PC1.loc[sample], pca_df.PC2.loc[sample]))

plt.tight_layout(pad=2)


In [6]:
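# Weights of the first principal component (one value per column of scaledData)
loading_scores = pd.Series(pca.components_[0])
# Rank the weights by absolute value, largest first
sorted_loading_scores = loading_scores.abs().sort_values(ascending=False)

# Positions of the 10 largest weights
top_10 = sorted_loading_scores[0:10].index.values

print(loading_scores[top_10])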

2662    0.014615
5021    0.014611
3280    0.014606
2540    0.014602
4674    0.014601
4863    0.014600
4605    0.014599
3179    0.014599
1304    0.014599
5014    0.014599
dtype: float64
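
The output above identifies the top loadings only by position. A minimal sketch of attaching names to the loading scores follows; the weights and the names `f0` to `f4` are hypothetical. In this notebook the columns of `scaledData` correspond to the rows of `data`, so `data.index` would presumably play the role of `feature_names`.

```python
import numpy as np
import pandas as pd

# Hypothetical first-component weights and the matching feature names
component = np.array([0.10, -0.70, 0.05, 0.65, -0.25])
feature_names = ['f0', 'f1', 'f2', 'f3', 'f4']

# Label each weight so the ranking can be read by name
named_scores = pd.Series(component, index=feature_names)

# Names of the three weights with the largest absolute value
top = named_scores.abs().sort_values(ascending=False).head(3).index
print(named_scores[top])
```

Sorting by absolute value keeps strongly negative weights in the ranking, while the final lookup shows their original sign.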


In [7]:
len(data)

5393

In [8]:
pca_df

Variable,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8,PC9,PC10,...,PC23,PC24,PC25,PC26,PC27,PC28,PC29,PC30,PC31,PC32
AGE,193.891903,-40.639823,-23.7615,-15.738136,-7.038822,3.422943,-2.632141,0.505414,0.407115,-0.106303,...,-0.002746,-0.000592,0.000597,-0.001013,0.000202,0.001923,0.001543,6.655195e-07,1.737244e-06,-6.282475e-14
PTGENDER,-35.885294,2.145832,2.873332,-0.133018,-1.035204,0.272107,-0.426561,-0.694283,0.537996,-0.262369,...,0.127707,-0.016554,-0.057035,0.039062,-0.099905,-0.044049,-0.002693,7.585365e-06,-3.386751e-06,8.170548e-15
PTEDUCAT,15.53189,-10.328651,-4.035142,-1.026432,-0.111511,-1.842114,4.243362,-0.806216,5.706578,3.3499,...,-0.005392,0.004351,0.00462,-0.002496,0.012232,0.001231,-0.001056,-2.421603e-06,8.423572e-08,-1.330533e-15
CDRSB,-34.165529,4.1828,3.175574,-0.877218,0.947517,2.009672,0.472626,0.231863,-0.03318,-0.033004,...,0.126704,0.015613,0.006961,0.003961,-0.053332,0.004897,0.007473,2.087414e-05,-5.328871e-06,-1.283695e-16
ADAS11,-9.819342,6.75762,0.531391,-8.093328,6.768825,-4.28497,-1.719022,2.202688,-0.315081,0.670789,...,0.026662,0.039365,0.026295,-0.02705,0.005351,0.018087,0.006143,1.392382e-05,1.62328e-05,3.137681e-15
ADAS13,4.849367,10.47135,-0.976739,-12.61061,9.798094,-5.041518,-1.344638,1.086866,-0.20537,-0.179787,...,-0.027828,-0.036706,-0.022821,0.025173,-0.002763,-0.01651,-0.005025,-2.712004e-05,-2.760391e-05,1.819074e-15
ADASQ4,-25.387004,5.572876,1.295018,-3.821737,1.72448,0.122413,0.050255,-2.286007,0.352557,-1.36566,...,0.021412,0.046105,0.049588,-0.023693,0.010791,0.015786,0.011024,2.787556e-05,2.84414e-05,2.189221e-15
MMSE,53.563363,-20.645091,-9.058786,-0.734538,-2.448877,-3.750373,3.435376,0.353273,-1.377898,-0.671462,...,0.003891,-0.01044,-0.022985,0.001876,-0.015314,-0.002915,-0.005198,5.026981e-06,1.744032e-06,5.155598e-15
RAVLT_immediate,95.338566,-57.480563,-10.592221,16.704376,11.376254,1.334169,-1.812904,-0.016186,0.526556,-0.471492,...,0.000385,0.002008,0.001818,0.000731,0.001001,-0.000196,-0.00015,-5.49731e-07,-3.762205e-07,3.400492e-15
RAVLT_learning,-21.020523,-5.526592,1.543227,1.946168,-1.550311,-0.290421,-1.799128,-1.9246,-3.505224,5.640761,...,0.005371,0.000519,-0.002774,0.004783,0.000945,0.00039,0.000691,-2.289603e-06,1.333943e-06,1.384309e-15


In [9]:
loading_scores

0       0.014157
1       0.012799
2       0.014498
3       0.013949
4       0.014493
          ...   
5388    0.013717
5389    0.013115
5390    0.012599
5391    0.013308
5392    0.014170
Length: 5393, dtype: float64
