# Predicting Parkinson's

The goal of this project is to apply PCA to the [Parkinson's Disease Classification dataset](http://archive.ics.uci.edu/ml/datasets/Parkinson%27s+Disease+Classification)\[2\] and then perform classification using the k-nearest neighbors algorithm.

#### Conclusion

Classifying the raw (non-reduced) data with KNN yields a CCR of 0.65789, while training the model on the PCA-reduced data yields a slightly higher CCR of 0.67763. A plausible explanation is that PCA discards noisy, low-variance directions that do not help prediction, so the neighbor distances are computed on more informative features. The gap is small and comes from a single unseeded train/test split, so it should be read as suggestive rather than conclusive.
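
To put the gap in perspective, the two reported rates can be converted back to counts on the 152-sample test set. The sketch below is a rough check, assuming a simple binomial model for test accuracy; the variable names are illustrative and not part of the notebook.

```python
import math

n_test = 152                       # test-set size printed by the split cell
ccr_raw, ccr_pca = 0.65789, 0.67763

# Convert the rates back to counts of correctly classified samples.
correct_raw = round(ccr_raw * n_test)
correct_pca = round(ccr_pca * n_test)

# Binomial standard error of an accuracy estimate: sqrt(p * (1 - p) / n).
se_raw = math.sqrt(ccr_raw * (1 - ccr_raw) / n_test)

print(correct_raw, correct_pca)        # 100 103
print(round(ccr_pca - ccr_raw, 4))     # 0.0197
print(round(se_raw, 4))                # 0.0385
```

Under that assumption the improvement amounts to three additional correct predictions, which is smaller than one standard error of the raw-data estimate.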

### 1. Prepare and Pre-process 
Load the data `pd_speech_X.csv` and the corresponding labels `pd_speech_Y.csv`. Normalize the data by subtracting the mean and dividing by the variance of each attribute/feature. Split the data and labels into training and testing sets using a train/test split.

In [2]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.decomposition import PCA

# load data
X_df = pd.read_csv("pd_speech_X.csv")
Y = pd.read_csv("pd_speech_Y.csv")
X_df.head()

| Unnamed: 0 | gender | PPE | DFA | RPDE | numPulses | numPeriodsPulses | meanPeriodPulses | stdDevPeriodPulses | locPctJitter | locAbsJitter | ... | tqwt_kurtosisValue_dec_27 | tqwt_kurtosisValue_dec_28 | tqwt_kurtosisValue_dec_29 | tqwt_kurtosisValue_dec_30 | tqwt_kurtosisValue_dec_31 | tqwt_kurtosisValue_dec_32 | tqwt_kurtosisValue_dec_33 | tqwt_kurtosisValue_dec_34 | tqwt_kurtosisValue_dec_35 | tqwt_kurtosisValue_dec_36 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.85247 | 0.71826 | 0.57227 | 240 | 239 | 0.008064 | 8.7e-05 | 0.00218 | 1.8e-05 | ... | 1.5466 | 1.562 | 2.6445 | 3.8686 | 4.2105 | 5.1221 | 4.4625 | 2.6202 | 3.0004 | 18.9405 |
| 1 | 1 | 0.76686 | 0.69481 | 0.53966 | 234 | 233 | 0.008258 | 7.3e-05 | 0.00195 | 1.6e-05 | ... | 1.553 | 1.5589 | 3.6107 | 23.5155 | 14.1962 | 11.0261 | 9.5082 | 6.5245 | 6.3431 | 45.178 |
| 2 | 1 | 0.85083 | 0.67604 | 0.58982 | 232 | 231 | 0.00834 | 6e-05 | 0.00176 | 1.5e-05 | ... | 1.5399 | 1.5643 | 2.3308 | 9.4959 | 10.7458 | 11.0177 | 4.8066 | 2.9199 | 3.1495 | 4.7666 |
| 3 | 0 | 0.41121 | 0.79672 | 0.59257 | 178 | 177 | 0.010858 | 0.000183 | 0.00419 | 4.6e-05 | ... | 6.9761 | 3.7805 | 3.5664 | 5.2558 | 14.0403 | 4.2235 | 4.6857 | 4.846 | 6.265 | 4.0603 |
| 4 | 0 | 0.3279 | 0.79782 | 0.53028 | 236 | 235 | 0.008162 | 0.002669 | 0.00535 | 4.4e-05 | ... | 7.8832 | 6.1727 | 5.8416 | 6.0805 | 5.7621 | 7.7817 | 11.6891 | 8.2103 | 5.0559 | 6.1164 |


In [None]:
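# center each feature (column) on zero
X_norm = X_df - np.mean(X_df, axis=0)
# divide each feature by its variance, as the task statement specifies
X_norm /= np.var(X_norm, axis=0)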
X_norm.head()
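
The cell above divides by the variance, as the task asks. For comparison, the conventional z-score divides by the standard deviation instead. The sketch below contrasts the two on a small synthetic table; the data and the column names `small` and `large` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the feature table: two columns on very different scales.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "small": rng.normal(0.0, 0.01, size=200),
    "large": rng.normal(50.0, 20.0, size=200),
})

centered = demo - demo.mean(axis=0)

by_var = centered / demo.var(axis=0, ddof=0)   # same idea as the cell above
by_std = centered / demo.std(axis=0, ddof=0)   # conventional z-score

print(by_var.std(axis=0, ddof=0))  # equals 1 / std of each original column
print(by_std.std(axis=0, ddof=0))  # 1.0 for every column
```

Dividing by the variance leaves each column with a spread of 1/std, so features that started on a small scale end up with the largest spread. Since KNN and PCA both depend on feature scale, this choice can affect the results.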

In [16]:
# split data into train and test sets (80/20)
X_train, X_test, Y_train, Y_test = train_test_split(X_norm, Y, test_size=0.2)
print(X_train.shape)
print(Y_train.shape)

print(X_test.shape)

(604, 753)
(604, 1)
(152, 753)
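
Because the split is random, the reported rates change from run to run. A minimal sketch of a reproducible, class-balanced split follows, assuming synthetic data with the same number of samples as the notebook (756); the names `X_demo` and `y_demo` and the label ratio are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 756 samples with imbalanced binary labels (assumed, not the real data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(756, 10))
y_demo = (rng.random(756) < 0.75).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    random_state=0,     # fixed seed so the split can be reproduced
    stratify=y_demo,    # keep the class ratio equal in both parts
)

print(X_tr.shape, X_te.shape)      # (604, 10) (152, 10)
print(y_tr.mean(), y_te.mean())    # positive-class share in each part
```

With `stratify`, the share of positive labels is nearly identical in the training and test parts, which makes accuracy figures easier to compare across runs.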


### 2. KNN with raw data
Using the training data and labels, train a k-nearest neighbors classifier with k=3. Report the correct classification rate (CCR) on the test data.

In [17]:
# apply knn
neigh = KNeighborsClassifier(n_neighbors=3)
# pass the label column as a 1-D Series to avoid a DataConversionWarning
neigh.fit(X_train, Y_train['class'])

# test the model
y_pred_test = neigh.predict(X_test)

ccr_test = (y_pred_test == Y_test['class']).mean()
print('Test ccr = {:.5f}'.format(ccr_test))

Test ccr = 0.65789
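
The correct classification rate is the fraction of test samples whose predicted label matches the true label, which is the same quantity scikit-learn calls accuracy. The sketch below checks this on made-up labels and also computes the rate a majority-class guess would reach, a useful reference point when classes are imbalanced.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 1, 0, 1, 0, 1, 1, 0])   # made-up labels
y_pred = np.array([1, 0, 0, 1, 1, 1, 1, 0])   # made-up predictions

ccr_manual = (y_pred == y_true).mean()
ccr_sklearn = accuracy_score(y_true, y_pred)

# Rate obtained by always predicting the most frequent class.
majority_rate = np.bincount(y_true).max() / y_true.size

print(ccr_manual, ccr_sklearn, majority_rate)   # 0.75 0.75 0.625
```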



### 3. KNN after PCA 
Apply PCA (with number of components = 5) to the training data. Next, transform the training data to its representation in the lower-dimensional space learned by the PCA model. Using this representation, train a k-nearest neighbors classifier with k=3. Report the correct classification rate on the test data, transformed with the same fitted PCA model.

In [18]:
# pca using 5 components
pca = PCA(n_components=5)
pca.fit(X_train)

PCA(copy=True, iterated_power='auto', n_components=5, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)
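
Before relying on five components, it can help to check how much variance they retain. The sketch below reads the `explained_variance_ratio_` attribute of a fitted `PCA` on synthetic data; the data generation is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 3 strong directions hidden in 20 noisy features (assumed).
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3)) * np.array([10.0, 5.0, 2.0])
mixing = rng.normal(size=(3, 20))
X_demo = latent @ mixing + rng.normal(scale=0.1, size=(300, 20))

pca_demo = PCA(n_components=5).fit(X_demo)

# Share of total variance captured by each component, largest first.
ratios = pca_demo.explained_variance_ratio_
print(ratios.round(3))
print(ratios.cumsum().round(3))
```

Applied to the notebook's `pca` object, the same attribute would show how much of the training-set variance the five retained components account for.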

In [19]:
# apply knn on transformed data
neigh_pca = KNeighborsClassifier(n_neighbors=3)
neigh_pca.fit(pca.transform(X_train), Y_train['class'])

# test the model
y_pred_test = neigh_pca.predict(pca.transform(X_test))
ccr_test = (y_pred_test == Y_test['class']).mean()
print('Test ccr = {:.5f}'.format(ccr_test))

Test ccr = 0.67763
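
A single split gives only one estimate of each rate. One way to compare the two setups more robustly is cross-validation with the preprocessing wrapped in a pipeline. The sketch below is not the notebook's method: it uses synthetic data from `make_classification` and swaps in `StandardScaler` (division by standard deviation) for the variance-based normalization used above.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the speech features (assumed shapes, not the real data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=50, n_informative=5, random_state=0
)

knn_raw = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn_pca = make_pipeline(
    StandardScaler(), PCA(n_components=5), KNeighborsClassifier(n_neighbors=3)
)

# Scaling and PCA are refit inside each fold, so no test-fold statistics leak in.
scores_raw = cross_val_score(knn_raw, X_demo, y_demo, cv=5)
scores_pca = cross_val_score(knn_pca, X_demo, y_demo, cv=5)

print("raw: {:.3f} +/- {:.3f}".format(scores_raw.mean(), scores_raw.std()))
print("pca: {:.3f} +/- {:.3f}".format(scores_pca.mean(), scores_pca.std()))
```

Comparing the means together with their spread across folds indicates whether a difference of about 0.02 is larger than the fold-to-fold variation.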


## Citations
1. Sakar, C.O., Serbes, G., Gunduz, A., Tunc, H.C., Nizam, H., Sakar, B.E., Tutuncu, M., Aydin, T., Isenkul, M.E. and Apaydin, H., 2018. A comparative analysis of speech signal processing algorithms for Parkinson's disease classification and the use of the tunable Q-factor wavelet transform. Applied Soft Computing, DOI