# Machine Learning Assignment-2 Parkinson's dataset

# PROBLEM STATEMENT for K-NN:

The given dataset is composed of a range of biomedical voice measurements from 31 people, 23 of whom have Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column, which is set to 0 for healthy and 1 for PD. The dataset can be downloaded from the link below.

https://archive.ics.uci.edu/ml/datasets/parkinsons

Create a classification model using KNN. Identify the optimum number of neighbors and dimensions for your model.

Justify whether a KNN model should be considered for this problem statement.

Evaluation will be based on:

1) Handling of missing values and outliers, if any.

2) Identifying data and model issues, if any.

3) Choice of packages and distance measure used. Justify your answer.

4) Selection of train/test split.

5) Final model creation and accuracy metric selected for the model.

6) Future scope of the work.

For any queries on this question, contact: mrath@wilp.bits-pilani.ac.in

The data is available as a file with the .data extension.

# Attribute Information
Matrix column entries (attributes):<br><br>
name - ASCII subject name and recording number<br><br>
MDVP:Fo(Hz) - Average vocal fundamental frequency<br><br>
MDVP:Fhi(Hz) - Maximum vocal fundamental frequency<br><br>
MDVP:Flo(Hz) - Minimum vocal fundamental frequency<br><br>
MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency<br><br>
MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude<br><br>
NHR,HNR - Two measures of ratio of noise to tonal components in the voice<br><br>
status - Health status of the subject (one) - Parkinson's, (zero) - healthy<br><br>
RPDE,D2 - Two nonlinear dynamical complexity measures<br><br>
DFA - Signal fractal scaling exponent<br><br>
spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation<br><br>

<h3> Import required libraries

In [57]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

<h3> Import the dataset using pandas

In [58]:
df = pd.read_csv('parkinsons.csv')
df.head(5)

Unnamed: 0,name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,phon_R01_S01_1,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,...,0.06545,0.02211,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,phon_R01_S01_2,122.4,148.65,113.819,0.00968,8e-05,0.00465,0.00696,0.01394,0.06134,...,0.09403,0.01929,19.085,1,0.458359,0.819521,-4.075192,0.33559,2.486855,0.368674
2,phon_R01_S01_3,116.682,131.111,111.555,0.0105,9e-05,0.00544,0.00781,0.01633,0.05233,...,0.0827,0.01309,20.651,1,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,phon_R01_S01_4,116.676,137.871,111.366,0.00997,9e-05,0.00502,0.00698,0.01505,0.05492,...,0.08771,0.01353,20.644,1,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,phon_R01_S01_5,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,...,0.1047,0.01767,19.649,1,0.417356,0.823484,-3.747787,0.234513,2.33218,0.410335


In [59]:
# Get the number of rows and columns in the dataset
df.shape

(195, 24)

<h3> Question 1: Handling of missing values, outliers, if any .

In [60]:
# Find the null values
df.isnull().sum()

name                0
MDVP:Fo(Hz)         0
MDVP:Fhi(Hz)        0
MDVP:Flo(Hz)        0
MDVP:Jitter(%)      0
MDVP:Jitter(Abs)    0
MDVP:RAP            0
MDVP:PPQ            0
Jitter:DDP          0
MDVP:Shimmer        0
MDVP:Shimmer(dB)    0
Shimmer:APQ3        0
Shimmer:APQ5        0
MDVP:APQ            0
Shimmer:DDA         0
NHR                 0
HNR                 0
status              0
RPDE                0
DFA                 0
spread1             0
spread2             0
D2                  0
PPE                 0
dtype: int64

In [61]:
# Find NA values (isna() is an alias of isnull(), so this repeats the check above)
df.isna().sum()

name                0
MDVP:Fo(Hz)         0
MDVP:Fhi(Hz)        0
MDVP:Flo(Hz)        0
MDVP:Jitter(%)      0
MDVP:Jitter(Abs)    0
MDVP:RAP            0
MDVP:PPQ            0
Jitter:DDP          0
MDVP:Shimmer        0
MDVP:Shimmer(dB)    0
Shimmer:APQ3        0
Shimmer:APQ5        0
MDVP:APQ            0
Shimmer:DDA         0
NHR                 0
HNR                 0
status              0
RPDE                0
DFA                 0
spread1             0
spread2             0
D2                  0
PPE                 0
dtype: int64

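<h3> Answer: No missing values were found. No explicit outlier check or treatment was performed; the summary statistics below show long right tails in some columns (for example, NHR has a maximum of 0.31482 against a 75th percentile of 0.02564).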

In [62]:
df.describe()

Unnamed: 0,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,MDVP:Shimmer(dB),...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
count,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,...,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0
mean,154.228641,197.104918,116.324631,0.00622,4.4e-05,0.003306,0.003446,0.00992,0.029709,0.282251,...,0.046993,0.024847,21.885974,0.753846,0.498536,0.718099,-5.684397,0.22651,2.381826,0.206552
std,41.390065,91.491548,43.521413,0.004848,3.5e-05,0.002968,0.002759,0.008903,0.018857,0.194877,...,0.030459,0.040418,4.425764,0.431878,0.103942,0.055336,1.090208,0.083406,0.382799,0.090119
min,88.333,102.145,65.476,0.00168,7e-06,0.00068,0.00092,0.00204,0.00954,0.085,...,0.01364,0.00065,8.441,0.0,0.25657,0.574282,-7.964984,0.006274,1.423287,0.044539
25%,117.572,134.8625,84.291,0.00346,2e-05,0.00166,0.00186,0.004985,0.016505,0.1485,...,0.024735,0.005925,19.198,1.0,0.421306,0.674758,-6.450096,0.174351,2.099125,0.137451
50%,148.79,175.829,104.315,0.00494,3e-05,0.0025,0.00269,0.00749,0.02297,0.221,...,0.03836,0.01166,22.085,1.0,0.495954,0.722254,-5.720868,0.218885,2.361532,0.194052
75%,182.769,224.2055,140.0185,0.007365,6e-05,0.003835,0.003955,0.011505,0.037885,0.35,...,0.060795,0.02564,25.0755,1.0,0.587562,0.761881,-5.046192,0.279234,2.636456,0.25298
max,260.105,592.03,239.17,0.03316,0.00026,0.02144,0.01958,0.06433,0.11908,1.302,...,0.16942,0.31482,33.047,1.0,0.685151,0.825288,-2.434031,0.450493,3.671155,0.527367
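The summary above only hints at extreme values. A minimal sketch of one common way to count candidate outliers per column, using the 1.5 × IQR rule, is given here. The helper name `iqr_outlier_counts` and the toy frame are assumptions made for illustration; they are not part of the original notebook, and the rule itself is only one of several reasonable choices.

```python
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the frame's columns.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# Toy frame standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, 11.0, 12.0, 13.0, 14.0],
})
counts = iqr_outlier_counts(toy)
print(counts)
```

On the real data the same helper could be called as `iqr_outlier_counts(df.drop(columns=['name', 'status']))`. Whether flagged rows should be removed, clipped or kept is a separate decision, since extreme voice measures may be genuine signs of PD.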


In [63]:
# Reorder the columns so that the target column 'status' comes last
df=df[['name','MDVP:Fo(Hz)','MDVP:Fhi(Hz)','MDVP:Flo(Hz)','MDVP:Jitter(%)','MDVP:Jitter(Abs)','MDVP:RAP','MDVP:PPQ','Jitter:DDP','MDVP:Shimmer','MDVP:Shimmer(dB)','Shimmer:APQ3','Shimmer:APQ5','MDVP:APQ','Shimmer:DDA','NHR','HNR','RPDE','DFA','spread1','spread2','D2','PPE','status']]

<h3> Question 2: Identifying data and model issues if any.

In [64]:
# Drop 'name': every value is a unique identifier and has no relation to the target
df=df.drop( columns='name')
df.head()

Unnamed: 0,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,MDVP:Shimmer(dB),...,Shimmer:DDA,NHR,HNR,RPDE,DFA,spread1,spread2,D2,PPE,status
0,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,0.426,...,0.06545,0.02211,21.033,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654,1
1,122.4,148.65,113.819,0.00968,8e-05,0.00465,0.00696,0.01394,0.06134,0.626,...,0.09403,0.01929,19.085,0.458359,0.819521,-4.075192,0.33559,2.486855,0.368674,1
2,116.682,131.111,111.555,0.0105,9e-05,0.00544,0.00781,0.01633,0.05233,0.482,...,0.0827,0.01309,20.651,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634,1
3,116.676,137.871,111.366,0.00997,9e-05,0.00502,0.00698,0.01505,0.05492,0.517,...,0.08771,0.01353,20.644,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975,1
4,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,0.584,...,0.1047,0.01767,19.649,0.417356,0.823484,-3.747787,0.234513,2.33218,0.410335,1


<h2> Answer: Dropped the "name" attribute because it is a text identifier rather than a numeric feature. One-hot encoding it is impractical because every value is unique.

In [65]:
# Split the DataFrame into NumPy arrays: X holds the 22 feature columns, y holds 'status'
X = df.iloc[:,:22].values
y = df.iloc[:,-1].values

In [66]:
# Standardize the features (zero mean, unit variance); the scaler is fitted on the full dataset here
sc = StandardScaler()
X = sc.fit_transform(X)
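Fitting the scaler on all rows before splitting lets the test rows influence the mean and variance used to scale the training rows. A minimal sketch of an alternative, assuming a scikit-learn pipeline is acceptable for the assignment, is given here. The synthetic arrays `X_demo` and `y_demo` are stand-ins and not the real voice data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 22 voice features (not the real data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(195, 22))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=195) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0
)

# The scaler lives inside the pipeline, so it is fitted on the training rows only
# and then applied unchanged to the test rows.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_tr, y_tr)
print("Test accuracy:", model.score(X_te, y_te))
```

With a pipeline, the same object can also be passed to cross-validation helpers, which refit the scaler inside each fold.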

<h3> Question 4: Selection of train, test split

In [67]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.1)

<h3>Answer: We use 90% of the dataset for training the model and 10% for testing it (about 175 and 20 of the 195 recordings).
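Because the 195 recordings come from only 31 people, a purely random row split can place recordings of the same person in both the training and the test set. A hedged sketch of a subject-aware split follows. It assumes that the `Sxx` token in names such as `phon_R01_S01_1` identifies the person, which is an inference from the rows shown in `df.head()` and not something the notebook confirms. The toy names and placeholder features are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy names in the same pattern as the "name" column (made up for illustration).
names = pd.Series(
    [f"phon_R01_S{subject:02d}_{rec}" for subject in range(1, 11) for rec in range(1, 7)]
)
# Assumption: the "Sxx" token identifies the person who was recorded.
subjects = names.str.extract(r"_(S\d+)_", expand=False)

X_demo = np.zeros((len(names), 3))  # placeholder features
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X_demo, groups=subjects))

# No subject should appear on both sides of the split.
train_subjects = set(subjects.iloc[train_idx])
test_subjects = set(subjects.iloc[test_idx])
print("Subjects in both sets:", train_subjects & test_subjects)
```

In the notebook this would require extracting the subject IDs before the "name" column is dropped. Setting `random_state` also makes the split repeatable between runs.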

<h3> Question 3: Choice of packages and distance measure used. justify your answer.

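<h3> Answer: The packages are listed in cell [57]. Four distance measures were tried at K=6: Euclidean, Manhattan, Hamming and Minkowski. In the run recorded below, Manhattan scored 0.95, Euclidean and the default Minkowski (p=2, which is identical to Euclidean) scored 0.85, and Hamming scored 0.75. Hamming distance compares values for exact equality, so it is poorly suited to continuous features. The default Minkowski distance was used for the other K values to compare accuracy.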

<h3> KNN measures the distance between points using measures such as Euclidean distance, Hamming distance, Manhattan distance and Minkowski distance
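To make the distance measures concrete, here is a small worked example in NumPy. The helper functions `minkowski` and `hamming` are written for illustration only and are not the scikit-learn implementations, although they follow the same standard definitions (Manhattan is Minkowski with p=1, Euclidean is Minkowski with p=2, and Hamming is the fraction of positions that differ).

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance of order p between two 1-D arrays."""
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

def hamming(a, b):
    """Fraction of positions at which the two arrays differ."""
    return float(np.mean(a != b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

manhattan = minkowski(a, b, p=1)   # |3| + |4| + |0| = 7
euclidean = minkowski(a, b, p=2)   # sqrt(9 + 16 + 0) = 5
ham = hamming(a, b)                # 2 of 3 positions differ
print(manhattan, euclidean, ham)
```

For standardized continuous features, two recordings almost never share an exactly equal value, so the Hamming distance between most pairs is close to 1 and carries little information about similarity.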

<h3>Question 5: Model creation<br>
Answer: A model was fitted for each distance measure and for several values of K, as shown in the cells below

<h3> KNN Model K=8

In [68]:
#Start KNN Classifier
knn = KNeighborsClassifier(n_neighbors=8)

#Train the model using the training dataset
knn.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = knn.predict(X_test)

In [69]:
# Accuracy of model
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.85
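The mean of the "status" column is about 0.75, so roughly three quarters of the recordings are PD and a model that always predicts 1 would already reach about 0.75 accuracy. A confusion matrix and per-class report give a fuller picture than accuracy alone. The sketch below uses made-up labels for a 20-sample test set and is not the output of the model above.

```python
import numpy as np
from sklearn import metrics

# Made-up labels for a 20-sample test set (1 = PD, 0 = healthy).
y_true = np.array([1] * 15 + [0] * 5)
y_hat = np.array([1] * 14 + [0] * 1 + [1] * 2 + [0] * 3)

# Rows are the true class, columns are the predicted class, in label order [0, 1].
cm = metrics.confusion_matrix(y_true, y_hat)
print(cm)
print(metrics.classification_report(y_true, y_hat, target_names=["healthy", "PD"]))
print("Balanced accuracy:", metrics.balanced_accuracy_score(y_true, y_hat))
```

In the notebook the same calls could be applied to `y_test` and `y_pred` from the cells above.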


<h3> KNN Model K=6 default metric='minkowski'

In [70]:
#Create KNN Classifier
knn_min = KNeighborsClassifier(n_neighbors=6)

#Train the model using the training sets
knn_min.fit(X_train, y_train)

#Predict the response for test dataset
y_pred_min = knn_min.predict(X_test)

In [71]:
# Accuracy of Model
print("Accuracy:",metrics.accuracy_score(y_test, y_pred_min))

Accuracy: 0.85


In [72]:
knn_min.get_params()

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'minkowski',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 6,
 'p': 2,
 'weights': 'uniform'}

<h3> KNN Model K=6 metric='euclidean'

In [73]:
#Create KNN Classifier
knn_eu = KNeighborsClassifier(n_neighbors=6,metric='euclidean')

#Train the model using the training dataset
knn_eu.fit(X_train, y_train)

#Predict the response for test dataset
y_pred_eu = knn_eu.predict(X_test)

In [74]:
# Accuracy of model
print("Accuracy:",metrics.accuracy_score(y_test, y_pred_eu))

Accuracy: 0.85


In [75]:
knn_eu.get_params()

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'euclidean',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 6,
 'p': 2,
 'weights': 'uniform'}

<h3> KNN Model K=6 metric='manhattan'

In [76]:
#Create KNN Classifier
knn_man = KNeighborsClassifier(n_neighbors=6,metric='manhattan')

#Train the model using the training sets
knn_man.fit(X_train, y_train)

#Predict the response for test dataset
y_pred_man = knn_man.predict(X_test)

In [77]:
# Model Accuracy
print("Accuracy:",metrics.accuracy_score(y_test, y_pred_man))

Accuracy: 0.95


In [78]:
knn_man.get_params()

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'manhattan',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 6,
 'p': 2,
 'weights': 'uniform'}

<h3> KNN Model K=6 metric='hamming'

In [79]:
#Create KNN Classifier
knn_ham = KNeighborsClassifier(n_neighbors=6,metric='hamming')

#Train the model using the training sets
knn_ham.fit(X_train, y_train)

#Predict the response for test dataset
y_pred_ham = knn_ham.predict(X_test)

In [80]:
# Model Accuracy
print("Accuracy:",metrics.accuracy_score(y_test, y_pred_ham))

Accuracy: 0.75


In [81]:
knn_ham.get_params()

{'algorithm': 'auto',
 'leaf_size': 30,
 'metric': 'hamming',
 'metric_params': None,
 'n_jobs': None,
 'n_neighbors': 6,
 'p': 2,
 'weights': 'uniform'}

<h3> KNN Model K=4

In [82]:
#Create KNN Classifier
knn_4 = KNeighborsClassifier(n_neighbors=4)

#Train the model using the training sets
knn_4.fit(X_train, y_train)

#Predict the response for test dataset
y_pred_4 = knn_4.predict(X_test)

In [83]:
# Model Accuracy
print("Accuracy:",metrics.accuracy_score(y_test, y_pred_4))

Accuracy: 0.85


<h3> KNN Model K=12

In [84]:
#Create KNN Classifier
knn_12 = KNeighborsClassifier(n_neighbors=12)

#Train the model using the training sets
knn_12.fit(X_train, y_train)

#Predict the response for test dataset
y_pred_12 = knn_12.predict(X_test)

In [85]:
# Model Accuracy
print("Accuracy:",metrics.accuracy_score(y_test, y_pred_12))

Accuracy: 0.75


<h3> KNN Model K=3

In [86]:
#Create KNN Classifier
knn_3 = KNeighborsClassifier(n_neighbors=3)

#Train the model using the training sets
knn_3.fit(X_train, y_train)

#Predict the response for test dataset
y_pred_3 = knn_3.predict(X_test)

In [87]:
# Model Accuracy
print("Accuracy:",metrics.accuracy_score(y_test, y_pred_3))

Accuracy: 0.9


<h3> KNN Model K=2

In [88]:
#Create KNN Classifier
knn_2 = KNeighborsClassifier(n_neighbors=2)

#Train the model using the training sets
knn_2.fit(X_train, y_train)

#Predict the response for test dataset
y_pred_2 = knn_2.predict(X_test)

In [89]:
# Model Accuracy
print("Accuracy:",metrics.accuracy_score(y_test, y_pred_2))

Accuracy: 0.85


<h3> KNN Model K=1

In [92]:
#Create KNN Classifier
knn_1 = KNeighborsClassifier(n_neighbors=1)

#Train the model using the training sets
knn_1.fit(X_train, y_train)

#Predict the response for test dataset
y_pred_1 = knn_1.predict(X_test)

In [93]:
# Model Accuracy
print("Accuracy:",metrics.accuracy_score(y_test, y_pred_1))

Accuracy: 1.0


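<h3>Question 6: Future scope of the work. <br>
    Answer: Similar accuracy was observed across many re-runs of the model, but accuracy varies with the value of K. Future work should investigate why accuracy differs between runs and between K values. One likely factor is that the train/test split in cell [67] is random and unseeded, and the test set holds only about 20 recordings, so a single misclassification changes accuracy by 0.05.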
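One possible way to make the comparison between K values less dependent on a single random split is repeated stratified cross-validation. The sketch below is an assumption about how that could look and not the method used in this notebook. It runs on synthetic data from `make_classification`, so the scores and the selected K say nothing about the real dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape and rough class balance as the dataset.
X_demo, y_demo = make_classification(
    n_samples=195, n_features=22, n_informative=8,
    weights=[0.25, 0.75], random_state=0,
)

# 5 folds repeated 3 times gives 15 scores per K instead of one.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
mean_scores = {}
for k in range(1, 16, 2):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="balanced_accuracy")
    mean_scores[k] = scores.mean()
    print(f"K={k:2d}  balanced accuracy = {scores.mean():.3f} +/- {scores.std():.3f}")

best_k = max(mean_scores, key=mean_scores.get)
print("Best K on this synthetic data:", best_k)
```

Reporting the mean and spread over folds shows whether the differences between K values are larger than the run-to-run noise. Odd values of K are used here to avoid tied votes in a two-class problem.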