Principal Component Analysis (PCA) with Python

a) Using NumPy

- Import the libraries

In [1]:
from numpy import linalg as LA
import numpy as np

- Create the dataset

In [2]:
data = np.array([[9,39],
                [15, 56], 
                [25, 93], 
                [14, 61], 
                [10, 50], 
                [18, 75],
                [0, 32],
                [16, 85],
                [5, 42],
                [19, 70],
                [16, 66],
                [20, 80]])

- Build the covariance matrix

In [3]:
covMatX = np.cov(data.T)  # np.cov expects one variable per row, hence the transpose

In [7]:
covMatX

array([[ 47.71969697, 122.9469697 ],
       [122.9469697 , 370.08333333]])
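
To make the covariance step concrete, the sketch below rebuilds the same matrix by hand: center each column, multiply the centered matrix by its transpose, and divide by n - 1 (the default normalisation of `np.cov`). The names `centered` and `cov_manual` are illustrative only.

```python
import numpy as np

data = np.array([[9, 39], [15, 56], [25, 93], [14, 61], [10, 50], [18, 75],
                 [0, 32], [16, 85], [5, 42], [19, 70], [16, 66], [20, 80]])

# Center each column (feature) on its mean.
centered = data - data.mean(axis=0)

# Sample covariance: divide by n - 1, as np.cov does by default.
n_samples = data.shape[0]
cov_manual = centered.T @ centered / (n_samples - 1)

print(cov_manual)
print(np.allclose(cov_manual, np.cov(data.T)))
```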

- Compute the eigenvectors and eigenvalues

In [4]:
eigenVal, eigenVec = LA.eig(covMatX)

In [5]:
eigenVal

array([  6.18117609, 411.62185422])

In [6]:
eigenVec

array([[-0.94738969, -0.32008244],
       [ 0.32008244, -0.94738969]])
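
A covariance matrix is symmetric, so `np.linalg.eigh` is a possible alternative to `eig`: it returns real eigenvalues in ascending order. The sketch below, which starts from the rounded covariance matrix printed above, also checks the defining relation C v = λ v for each eigenvector. The signs of the eigenvectors it returns may differ from the output above, because an eigenvector is only defined up to sign.

```python
import numpy as np

cov = np.array([[47.71969697, 122.9469697],
                [122.9469697, 370.08333333]])

# eigh is intended for symmetric matrices; eigenvalues come back in ascending order.
eigen_val, eigen_vec = np.linalg.eigh(cov)

# Each column v of eigen_vec should satisfy cov @ v = lambda * v.
for lam, v in zip(eigen_val, eigen_vec.T):
    print(lam, np.allclose(cov @ v, lam * v))
```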

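- Choose the eigenvectors used to reduce the dimensionality of the data

- Sort the eigenvalues from largest to smallest

- Keep the eigenvector whose eigenvalue is 411.622, because it accounts for 98.52% of the total variance (411.622 / (411.622 + 6.181)). LA.eig does not guarantee any ordering of its output, and here that eigenvector is the second column of eigenVec.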

In [8]:
vector = eigenVec[:,1:2]

In [9]:
vector

array([[-0.32008244],
       [-0.94738969]])
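
The selection above picks the column by hand. A minimal sketch of doing the same thing programmatically, assuming the eigenvalues and eigenvectors printed earlier, sorts with `np.argsort` and computes each component's share of the variance. The names `order`, `ratio` and `k` are illustrative.

```python
import numpy as np

eigen_val = np.array([6.18117609, 411.62185422])
eigen_vec = np.array([[-0.94738969, -0.32008244],
                      [0.32008244, -0.94738969]])

# Indices that order the eigenvalues from largest to smallest.
order = np.argsort(eigen_val)[::-1]
sorted_val = eigen_val[order]
sorted_vec = eigen_vec[:, order]  # reorder the columns to match

# Share of the total variance carried by each component.
ratio = sorted_val / sorted_val.sum()
print(ratio)

# Keep the first k columns as the projection matrix.
k = 1
vector = sorted_vec[:, :k]
print(vector)
```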

- Reduce the dimensionality

In [10]:
dataNew = np.matmul(data, vector)

In [11]:
dataNew

array([[-39.82894   ],
       [-57.85505943],
       [-96.10930248],
       [-62.27192545],
       [-50.57030906],
       [-76.81571091],
       [-30.31647016],
       [-85.64944295],
       [-41.3907793 ],
       [-72.3988449 ],
       [-67.6490388 ],
       [-82.19282426]])
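
The projection above multiplies the raw data by the eigenvector. PCA is normally defined on centered data, and scikit-learn's `PCA` subtracts the column means before projecting, which is why the scores in section b differ from these values by a constant offset. The sketch below centers the data first; `centered` and `data_centered_proj` are illustrative names.

```python
import numpy as np

data = np.array([[9, 39], [15, 56], [25, 93], [14, 61], [10, 50], [18, 75],
                 [0, 32], [16, 85], [5, 42], [19, 70], [16, 66], [20, 80]])
vector = np.array([[-0.32008244],
                   [-0.94738969]])

# Subtract the column means, then project onto the chosen eigenvector.
centered = data - data.mean(axis=0)
data_centered_proj = centered @ vector

print(data_centered_proj.ravel())
```

With this eigenvector the first score comes out near 23.758, in line with the scikit-learn result shown in section b. In general the sign can still flip, since eigenvectors are only defined up to sign.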

b) Using scikit-learn

- Import the libraries

In [12]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

- Create the dataset

In [13]:
data = np.array([[9, 39], 
 [15, 56], 
 [25, 93], 
 [14, 61], 
 [10, 50], 
 [18, 75],
 [0, 32],
 [16, 85],
 [5, 42],
 [19, 70],
 [16, 66],
 [20, 80]])

- Run PCA

In [14]:
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(data)

View the covariance matrix

In [15]:
pca.get_covariance()

array([[ 47.71969697, 122.9469697 ],
       [122.9469697 , 370.08333333]])

View the eigenvalues

In [16]:
pca.explained_variance_

array([411.62185422,   6.18117609])

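View the eigenvectors (one per row, sorted by decreasing eigenvalue; the sign of each eigenvector is arbitrary, so it can differ from the NumPy result)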

In [17]:
pca.components_

array([[-0.32008244, -0.94738969],
       [-0.94738969,  0.32008244]])

View the variance ratio

In [18]:
pca.explained_variance_ratio_

array([0.98520553, 0.01479447])

The variance ratio tells us how many principal components (PCs)
to keep when reducing the dimensionality.

The first variance ratio is 0.98520553, which means that keeping
only the first component preserves about 98.5% of the total
variance of the data.
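
One way to turn this reasoning into code is to accumulate the variance ratios and keep the smallest number of components that reaches a chosen threshold. The 0.95 cut-off below is a hypothetical value picked for illustration, not one used in this tutorial. scikit-learn can also make the choice itself when `n_components` is a float between 0 and 1.

```python
import numpy as np
from sklearn.decomposition import PCA

data = np.array([[9, 39], [15, 56], [25, 93], [14, 61], [10, 50], [18, 75],
                 [0, 32], [16, 85], [5, 42], [19, 70], [16, 66], [20, 80]])

# Fit with all components to inspect the cumulative explained variance.
pca_full = PCA(n_components=2).fit(data)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95  # hypothetical cut-off, chosen only for illustration
k = int(np.searchsorted(cumulative, threshold) + 1)
print(cumulative, k)

# A float n_components asks scikit-learn to select the number of components.
pca_auto = PCA(n_components=threshold, svd_solver='full').fit(data)
print(pca_auto.n_components_)
```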

Run PCA with one component

Based on the previous step we keep only one component, as the following code shows.


In [21]:
pca = PCA(n_components=1)
principalComponents = pca.fit_transform(data)
principalDf = pd.DataFrame(data = principalComponents, columns=['principal component 1'])
display(principalDf)

   principal component 1
0              23.758447
1               5.732328
2             -32.521915
3               1.315462
4              13.017078
5             -13.228324
6              33.270917
7             -22.062056
8              22.196608
9              -8.811458
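
To see what is lost by keeping one component, a small sketch can map the 1-D scores back to the original 2-D space with `inverse_transform` and measure the remaining error. The names `reconstructed` and `lost_fraction` are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

data = np.array([[9, 39], [15, 56], [25, 93], [14, 61], [10, 50], [18, 75],
                 [0, 32], [16, 85], [5, 42], [19, 70], [16, 66], [20, 80]])

pca = PCA(n_components=1)
scores = pca.fit_transform(data)

# Map the 1-D scores back into the original 2-D feature space.
reconstructed = pca.inverse_transform(scores)

# Fraction of the total squared deviation that the single component misses.
squared_error = np.sum((data - reconstructed) ** 2)
total = np.sum((data - data.mean(axis=0)) ** 2)
lost_fraction = squared_error / total
print(lost_fraction)
```

The lost fraction equals the variance ratio of the discarded second component, about 1.5% for this dataset.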


**Exercise**

The breast-cancer dataset, taken from https://www.kaggle.com/uciml/breast-cancer-wisconsin-data, has 2 classes, M = malignant and B = benign, and 30 features.

a. Classify the breast-cancer dataset with an SVM.

b. Evaluate the classifier with cross-validation.

c. Normalize the data.

d. Compute the overall accuracy using all features, without PCA.

e. Compute the overall accuracy after reducing the dimensionality with PCA.

Answer

- Import the libraries

In [22]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn import svm
from sklearn.model_selection import cross_val_score

- Read the data and separate the features (X) from the target (y)

In [26]:
df = pd.read_csv('data.csv')

features = ['radius_mean','texture_mean','perimeter_mean','area_mean','smoothness_mean',
 'compactness_mean','concavity_mean','concave points_mean','symmetry_mean',
 'fractal_dimension_mean','radius_se','texture_se','perimeter_se','area_se',
 'smoothness_se','compactness_se','concavity_se','concave points_se',
 'symmetry_se', 'fractal_dimension_se','radius_worst','texture_worst','perimeter_worst','area_worst',
 'smoothness_worst','compactness_worst','concavity_worst','concave points_worst',
 'symmetry_worst','fractal_dimension_worst']

X = df.loc[:,features].values
y = df['diagnosis'].values  # 1-D label array, the shape scikit-learn estimators expect

- Normalize the data with min-max scaling

In [27]:
sc = MinMaxScaler(feature_range=(0,1))
X = sc.fit_transform(X)
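
Min-max scaling maps each feature to the range [0, 1] with (x - min) / (max - min), computed per column. The sketch below checks that formula against `MinMaxScaler` on a tiny made-up array; `X_demo` and its values are hypothetical and unrelated to the breast-cancer data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up values with very different scales per column.
X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [4.0, 1000.0]])

# Per-column minimum and maximum.
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)

# The min-max formula applied by hand.
X_manual = (X_demo - col_min) / (col_max - col_min)

# The same transformation through scikit-learn.
X_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_demo)

print(X_manual)
print(np.allclose(X_manual, X_scaled))
```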

- Classify with an SVM on the whole dataset using all features, without dimensionality reduction.

In [29]:
clf = svm.SVC(kernel='linear',C=1)
scores = cross_val_score(clf,X,y,cv=5)
print("Accuracy: %0.2f%%" % (scores.mean() * 100))

Accuracy: 97.54%




- Reduce the dimensionality with PCA

In [30]:
pca = PCA(n_components=30)
principalComponents = pca.fit_transform(X)
variance_ratio = pca.explained_variance_ratio_

In [31]:
variance_ratio

array([5.30976894e-01, 1.72834896e-01, 7.11444201e-02, 6.41125883e-02,
       4.08607204e-02, 3.07149442e-02, 1.58083746e-02, 1.19147161e-02,
       9.88429103e-03, 9.45446106e-03, 8.49396551e-03, 7.57976457e-03,
       6.56638137e-03, 4.74811462e-03, 2.69423338e-03, 2.57754484e-03,
       1.83755588e-03, 1.51271660e-03, 1.37718463e-03, 1.05959242e-03,
       9.83061040e-04, 7.84496266e-04, 5.28060046e-04, 5.09986666e-04,
       4.30073326e-04, 3.29617326e-04, 1.90574049e-04, 5.59104265e-05,
       2.88966877e-05, 5.96453235e-06])
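
A cumulative sum of these ratios shows how much variance the first k components retain, which helps when choosing how many PCs to try. The sketch below reuses the values printed above. The candidate values 3, 5 and 10 are the ones tested in the cells that follow.

```python
import numpy as np

variance_ratio = np.array([
    5.30976894e-01, 1.72834896e-01, 7.11444201e-02, 6.41125883e-02,
    4.08607204e-02, 3.07149442e-02, 1.58083746e-02, 1.19147161e-02,
    9.88429103e-03, 9.45446106e-03, 8.49396551e-03, 7.57976457e-03,
    6.56638137e-03, 4.74811462e-03, 2.69423338e-03, 2.57754484e-03,
    1.83755588e-03, 1.51271660e-03, 1.37718463e-03, 1.05959242e-03,
    9.83061040e-04, 7.84496266e-04, 5.28060046e-04, 5.09986666e-04,
    4.30073326e-04, 3.29617326e-04, 1.90574049e-04, 5.59104265e-05,
    2.88966877e-05, 5.96453235e-06])

# Running total of explained variance for the first 1, 2, ..., 30 components.
cumulative = np.cumsum(variance_ratio)

for k in (3, 5, 10):
    print(k, round(float(cumulative[k - 1]), 4))
```

By this count the first 3 components keep roughly 77% of the variance, the first 5 roughly 88%, and the first 10 roughly 96%.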

- With 3 PCs

In [34]:
pca = PCA(n_components=3)
principalComponents = pca.fit_transform(X)
variance_ratio = pca.explained_variance_ratio_

clf = svm.SVC(kernel='linear',C=1)
scores = cross_val_score(clf,principalComponents,y,cv=5)
print("Accuracy: %0.2f%%" % (scores.mean() * 100))

Accuracy: 95.78%




- With 5 PCs

In [35]:
pca = PCA(n_components=5)
principalComponents = pca.fit_transform(X)
variance_ratio = pca.explained_variance_ratio_

clf = svm.SVC(kernel='linear',C=1)
scores = cross_val_score(clf,principalComponents,y,cv=5)
print("Accuracy: %0.2f%%" % (scores.mean() * 100))

Accuracy: 97.37%




- With 10 PCs

In [36]:
pca = PCA(n_components=10)
principalComponents = pca.fit_transform(X)
variance_ratio = pca.explained_variance_ratio_

clf = svm.SVC(kernel='linear',C=1)
scores = cross_val_score(clf,principalComponents,y,cv=5)
print("Accuracy: %0.2f%%" % (scores.mean() * 100))

Accuracy: 97.72%




The experiments above show that with only three principal components the cross-validated accuracy (about 95.8%) is already close to the accuracy obtained with all 30 features (about 97.5%). Five components narrow the gap to about 97.4%, and ten components (about 97.7%) match the accuracy of the full 30-dimensional feature set.
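
The comparison can also be written as one loop. The sketch below is a variation on the cells above, not a reproduction of them, and it rests on two assumptions. First, it loads scikit-learn's bundled copy of the Wisconsin diagnostic data with `load_breast_cancer` in place of the Kaggle `data.csv`, so the labels are encoded as 0/1 and the accuracies may not match the figures above exactly. Second, it wraps the scaler, PCA and SVM in a pipeline so that scaling and PCA are fitted only on the training part of each fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Assumption: the bundled dataset stands in for the Kaggle data.csv.
X, y = load_breast_cancer(return_X_y=True)

results = {}
for k in (3, 5, 10, None):  # None means no PCA, all 30 features
    steps = [MinMaxScaler()]
    if k is not None:
        steps.append(PCA(n_components=k))
    steps.append(SVC(kernel='linear', C=1))
    model = make_pipeline(*steps)

    # Scaler and PCA are re-fitted on the training folds in every split.
    scores = cross_val_score(model, X, y, cv=5)
    results[k] = scores.mean()
    print(k, "%0.2f%%" % (scores.mean() * 100))
```

Fitting the preprocessing steps inside the cross-validation loop keeps information from the held-out fold out of the scaler and the PCA, so the reported accuracy is a slightly more cautious estimate.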