
#Machine Learning applied to the detection of Parkinson's disease

<p><center>
<img src="https://drive.google.com/uc?id=17WdDhNBDmPy7yDv55M-oWSf_AvmDI7jb" style="width:800px"> <br>
</center>
</p>

##1. Intro


> <div class="alert alert-block alert-warning">
<b>About Parkinson's disease:</b> Control of voluntary movement is one of the main functions of the central nervous system. Damage to movement-related areas can impair that control. Such damage can be caused by injury (e.g., stroke) or by neurodegeneration.
Parkinson's disease is a neurodegenerative movement disorder characterized by progressive loss of midbrain neurons in the substantia nigra.
</div>

In this project we will use a machine learning algorithm to develop a predictive model to assist in the diagnosis of Parkinson's disease.

##2. About the author

Study developed by Jullyo Emmanuel Vieira Silva, a graduating student in Biomedical Engineering at Universidade Federal de Pernambuco.

##3. Objectives

The aim of this study is to use a supervised learning model to assist in the detection of Parkinson's disease.

This dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals ("name" column). The main aim of the data is to discriminate healthy people from those with PD, according to the "status" column which is set to 0 for healthy and 1 for PD.

Matrix column entries (attributes):
* name - ASCII subject name and recording number
* MDVP:Fo(Hz) - Average vocal fundamental frequency
* MDVP:Fhi(Hz) - Maximum vocal fundamental frequency
* MDVP:Flo(Hz) - Minimum vocal fundamental frequency
* MDVP:Jitter(%), MDVP:Jitter(Abs), MDVP:RAP, MDVP:PPQ, Jitter:DDP - Several measures of variation in fundamental frequency
* MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA - Several measures of variation in amplitude
* NHR, HNR - Two measures of the ratio of noise to tonal components in the voice
* status - Health status of the subject: 1 for Parkinson's, 0 for healthy
* RPDE, D2 - Two nonlinear dynamical complexity measures
* DFA - Signal fractal scaling exponent
* spread1, spread2, PPE - Three nonlinear measures of fundamental frequency variation
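
The `name` column combines a subject identifier with a recording number; the rows previewed in this notebook look like `phon_R01_S01_1`. The following is a minimal sketch of how recordings could be grouped by subject, assuming every name follows that `<subject>_<recording>` pattern (an assumption that has not been checked against the full file). `subject_id` is a hypothetical helper, not part of the original analysis.

```python
from collections import Counter


def subject_id(name: str) -> str:
    """Drop the trailing recording number: 'phon_R01_S01_3' -> 'phon_R01_S01'."""
    subject, _recording = name.rsplit("_", 1)
    return subject


# Names copied from the first rows of the dataset preview
names = [
    "phon_R01_S01_1",
    "phon_R01_S01_2",
    "phon_R01_S01_3",
    "phon_R01_S01_4",
    "phon_R01_S01_5",
]

# Count how many recordings belong to each subject
recordings_per_subject = Counter(subject_id(n) for n in names)
print(recordings_per_subject)
```

On the full dataframe the same helper could be applied with `data['name'].map(subject_id)`, again assuming the naming pattern holds throughout.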

##4. Analysis and results


###4.1 Load libraries

In [None]:
# Linear algebra
import numpy as np

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as py
import plotly.figure_factory as ff
import plotly.graph_objs as go

# Data processing
import pandas as pd

# Algorithms
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import svm
from sklearn import metrics
from sklearn.metrics import accuracy_score

###4.2 Read files

In [None]:
data = pd.read_csv('/content/parkinsons.csv')

###4.3 Information about the data

In [None]:
data.head() # first five rows of the dataframe

Unnamed: 0,name,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
0,phon_R01_S01_1,119.992,157.302,74.997,0.00784,7e-05,0.0037,0.00554,0.01109,0.04374,...,0.06545,0.02211,21.033,1,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,phon_R01_S01_2,122.4,148.65,113.819,0.00968,8e-05,0.00465,0.00696,0.01394,0.06134,...,0.09403,0.01929,19.085,1,0.458359,0.819521,-4.075192,0.33559,2.486855,0.368674
2,phon_R01_S01_3,116.682,131.111,111.555,0.0105,9e-05,0.00544,0.00781,0.01633,0.05233,...,0.0827,0.01309,20.651,1,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,phon_R01_S01_4,116.676,137.871,111.366,0.00997,9e-05,0.00502,0.00698,0.01505,0.05492,...,0.08771,0.01353,20.644,1,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,phon_R01_S01_5,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,...,0.1047,0.01767,19.649,1,0.417356,0.823484,-3.747787,0.234513,2.33218,0.410335


In [None]:
# getting additional information
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   name              195 non-null    object 
 1   MDVP:Fo(Hz)       195 non-null    float64
 2   MDVP:Fhi(Hz)      195 non-null    float64
 3   MDVP:Flo(Hz)      195 non-null    float64
 4   MDVP:Jitter(%)    195 non-null    float64
 5   MDVP:Jitter(Abs)  195 non-null    float64
 6   MDVP:RAP          195 non-null    float64
 7   MDVP:PPQ          195 non-null    float64
 8   Jitter:DDP        195 non-null    float64
 9   MDVP:Shimmer      195 non-null    float64
 10  MDVP:Shimmer(dB)  195 non-null    float64
 11  Shimmer:APQ3      195 non-null    float64
 12  Shimmer:APQ5      195 non-null    float64
 13  MDVP:APQ          195 non-null    float64
 14  Shimmer:DDA       195 non-null    float64
 15  NHR               195 non-null    float64
 16  HNR               195 non-null    float64
 1

In [None]:
# number of rows and columns
data.shape

(195, 24)

In [None]:
# statistical measures about the data
data.describe()

Unnamed: 0,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,MDVP:Shimmer(dB),...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
count,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,...,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0
mean,154.228641,197.104918,116.324631,0.00622,4.4e-05,0.003306,0.003446,0.00992,0.029709,0.282251,...,0.046993,0.024847,21.885974,0.753846,0.498536,0.718099,-5.684397,0.22651,2.381826,0.206552
std,41.390065,91.491548,43.521413,0.004848,3.5e-05,0.002968,0.002759,0.008903,0.018857,0.194877,...,0.030459,0.040418,4.425764,0.431878,0.103942,0.055336,1.090208,0.083406,0.382799,0.090119
min,88.333,102.145,65.476,0.00168,7e-06,0.00068,0.00092,0.00204,0.00954,0.085,...,0.01364,0.00065,8.441,0.0,0.25657,0.574282,-7.964984,0.006274,1.423287,0.044539
25%,117.572,134.8625,84.291,0.00346,2e-05,0.00166,0.00186,0.004985,0.016505,0.1485,...,0.024735,0.005925,19.198,1.0,0.421306,0.674758,-6.450096,0.174351,2.099125,0.137451
50%,148.79,175.829,104.315,0.00494,3e-05,0.0025,0.00269,0.00749,0.02297,0.221,...,0.03836,0.01166,22.085,1.0,0.495954,0.722254,-5.720868,0.218885,2.361532,0.194052
75%,182.769,224.2055,140.0185,0.007365,6e-05,0.003835,0.003955,0.011505,0.037885,0.35,...,0.060795,0.02564,25.0755,1.0,0.587562,0.761881,-5.046192,0.279234,2.636456,0.25298
max,260.105,592.03,239.17,0.03316,0.00026,0.02144,0.01958,0.06433,0.11908,1.302,...,0.16942,0.31482,33.047,1.0,0.685151,0.825288,-2.434031,0.450493,3.671155,0.527367


In [None]:
data['status'].value_counts()

1    147
0     48
Name: status, dtype: int64

1 - Parkinson's positive

0 - Healthy
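
The two classes are imbalanced, so it helps to know what accuracy a trivial classifier would reach by always predicting the larger class. A small sketch using only the counts printed above:

```python
# Class counts as printed by data['status'].value_counts()
counts = {1: 147, 0: 48}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the majority class
baseline_accuracy = counts[majority_class] / total

print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.3f}")
```

Any model trained on this data should be compared against this baseline of roughly 75% rather than against 50%.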

In [None]:
P = data[(data['status'] != 0)]
H = data[(data['status'] == 0)]

In [None]:
#------------COUNT-----------------------
def target_count():
    trace = go.Bar( x = data['status'].value_counts().values.tolist(),
                    y = ['Has Parkinson','Healthy'],
                    orientation = 'h',
                    text=data['status'].value_counts().values.tolist(),
                    textfont=dict(size=15),
                    textposition = 'auto',
                    opacity = 0.8,marker=dict(
                    color=['gold','lightskyblue'],
                    line=dict(color='#000000',width=1.5)))

    layout = dict(title =  'Count of status variable')

    fig = dict(data = [trace], layout=layout)
    py.iplot(fig)

#------------PERCENTAGE-------------------
def target_percent():
    trace = go.Pie(labels = ['Has Parkinson','Healthy'], values = data['status'].value_counts(),
                   textfont=dict(size=15), opacity = 0.8,
                   marker=dict(colors=['gold','lightskyblue'],
                               line=dict(color='#000000', width=1.5)))


    layout = dict(title =  'Distribution of status variable')

    fig = dict(data = [trace], layout=layout)
    py.iplot(fig)

In [None]:
target_count()
target_percent()

A correlation matrix is a table showing correlation coefficients between sets of variables. Each random variable ($X_i$) in the table is correlated with each of the other values in the table ($X_j$). This allows you to see which pairs have the highest correlation.
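
To make the definition concrete, here is a plain-Python sketch of the Pearson correlation coefficient, which is the default method of pandas' `DataFrame.corr`. The toy columns `x`, `y` and `z` are made up for illustration and are not taken from the dataset.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Correlate every column with every other column (and with itself)."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical toy data: y rises with x, z falls as x rises
toy = {
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "z": [4.0, 3.0, 2.0, 1.0],
}

matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Values close to 1 or -1 indicate a strong linear relationship, and the diagonal is always 1 because each variable is perfectly correlated with itself.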

In [None]:
#status = ['Has Parkinson','Healthy']
#colors = ['#FFD700', '#7EC0EE']
#plt.style.use('fivethirtyeight')
#p1 = plt.bar(status,data['status'].value_counts(), color = colors,  edgecolor = 'black')
#plt.ylabel('Voice recordings', fontsize = 12)
#plt.title("Ratio between people with parkinson's disease and healthy people in the dataset", fontsize = 15)

#for rect1 in p1:
#    height = rect1.get_height()
#    plt.annotate( "{}%".format(round((height/195)*100,2)),(rect1.get_x() + rect1.get_width()/2, height/2),ha="center",va="bottom",fontsize=13)

#plt.show()

In [None]:
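# correlation matrix (numeric columns only, since 'name' is a string column;
# the numeric_only argument requires pandas >= 1.5)
cor = data.corr(numeric_only=True)

matrix_cols = cor.columns.tolist()
# convert to array
corr_array  = np.array(cor)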
trace = go.Heatmap(z = corr_array,
                       x = matrix_cols,
                       y = matrix_cols,
                       colorscale='Viridis',
                       colorbar   = dict() ,
                      )

layout = go.Layout(dict(title = 'Correlation Matrix for variables',
                            #autosize = False,
                            #height  = 1400,
                            #width   = 1600,
                            margin  = dict(r = 0 ,l = 100,
                                           t = 0,b = 100,
                                         ),
                            yaxis   = dict(tickfont = dict(size = 9)),
                            xaxis   = dict(tickfont = dict(size = 9)),
                           )
                      )
fig = go.Figure(data = [trace],layout = layout)
py.iplot(fig)

In [None]:
# grouping the data based on the target variable

data.groupby('status').mean(numeric_only=True)

status,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,MDVP:Shimmer(dB),...,MDVP:APQ,Shimmer:DDA,NHR,HNR,RPDE,DFA,spread1,spread2,D2,PPE
0,181.937771,223.63675,145.207292,0.003866,2.3e-05,0.001925,0.002056,0.005776,0.017615,0.162958,...,0.013305,0.028511,0.011483,24.67875,0.442552,0.695716,-6.759264,0.160292,2.154491,0.123017
1,145.180762,188.441463,106.893558,0.006989,5.1e-05,0.003757,0.0039,0.011273,0.033658,0.321204,...,0.0276,0.053027,0.029211,20.974048,0.516816,0.725408,-5.33342,0.248133,2.456058,0.233828


In [None]:
def plot_distribution(data_select, size_bin) :
    # 2 datasets
    tmp1 = P[data_select]
    tmp2 = H[data_select]
    hist_data = [tmp1, tmp2]

    group_labels = ['Has Parkinson', 'Healthy']
    colors = ['#FFD700', '#7EC0EE']

    fig = ff.create_distplot(hist_data, group_labels, colors = colors, show_hist = True, bin_size = size_bin, curve_type='kde')

    fig['layout'].update(title = data_select)

    py.iplot(fig, filename = 'Density plot')

In [None]:
# Distribution of the average vocal fundamental frequency

plot_distribution('MDVP:Fo(Hz)', 4)

In [None]:
# Distribution of the maximum vocal fundamental frequency

plot_distribution('MDVP:Fhi(Hz)', 4)

In [None]:
# Distribution of the minimum vocal fundamental frequency

plot_distribution('MDVP:Flo(Hz)', 5)

In [None]:
# Distribution of several measures of variation in fundamental frequency (jitter)

plot_distribution('MDVP:Jitter(%)', 0)
plot_distribution('MDVP:Jitter(Abs)', 0)
plot_distribution('MDVP:RAP', 0)
plot_distribution('MDVP:PPQ', 0)
plot_distribution('Jitter:DDP', 0)

In [None]:
# Distribution of several measures of variation in amplitude (shimmer)

plot_distribution('MDVP:Shimmer', 0)
plot_distribution('MDVP:Shimmer(dB)', 0)
plot_distribution('Shimmer:APQ3', 0)
plot_distribution('Shimmer:APQ5', 0)
plot_distribution('MDVP:APQ', 0)
plot_distribution('Shimmer:DDA', 0)

In [None]:
# Distribution of two measures of the ratio of noise to tonal components in the voice

plot_distribution('NHR', 0)
plot_distribution('HNR', 0)

In [None]:
# Distribution of two nonlinear dynamical complexity measures

plot_distribution('RPDE', 0)
plot_distribution('D2', 0)

In [None]:
# Distribution of the signal fractal scaling exponent

plot_distribution('DFA', 0)

In [None]:
# Distribution of three nonlinear measures of fundamental frequency variation

plot_distribution('spread1', 0)
plot_distribution('spread2', 0)
plot_distribution('PPE', 0)

###4.4 Data pre-processing

In [None]:
# separating features and the target variable
X = data.drop(columns = ['status','name'])
Y = data['status']

In [None]:
X

Unnamed: 0,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,MDVP:Shimmer(dB),...,MDVP:APQ,Shimmer:DDA,NHR,HNR,RPDE,DFA,spread1,spread2,D2,PPE
0,119.992,157.302,74.997,0.00784,0.00007,0.00370,0.00554,0.01109,0.04374,0.426,...,0.02971,0.06545,0.02211,21.033,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654
1,122.400,148.650,113.819,0.00968,0.00008,0.00465,0.00696,0.01394,0.06134,0.626,...,0.04368,0.09403,0.01929,19.085,0.458359,0.819521,-4.075192,0.335590,2.486855,0.368674
2,116.682,131.111,111.555,0.01050,0.00009,0.00544,0.00781,0.01633,0.05233,0.482,...,0.03590,0.08270,0.01309,20.651,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634
3,116.676,137.871,111.366,0.00997,0.00009,0.00502,0.00698,0.01505,0.05492,0.517,...,0.03772,0.08771,0.01353,20.644,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975
4,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,0.584,...,0.04465,0.10470,0.01767,19.649,0.417356,0.823484,-3.747787,0.234513,2.332180,0.410335
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
190,174.188,230.978,94.261,0.00459,0.00003,0.00263,0.00259,0.00790,0.04087,0.405,...,0.02745,0.07008,0.02764,19.517,0.448439,0.657899,-6.538586,0.121952,2.657476,0.133050
191,209.516,253.017,89.488,0.00564,0.00003,0.00331,0.00292,0.00994,0.02751,0.263,...,0.01879,0.04812,0.01810,19.147,0.431674,0.683244,-6.195325,0.129303,2.784312,0.168895
192,174.688,240.005,74.287,0.01360,0.00008,0.00624,0.00564,0.01873,0.02308,0.256,...,0.01667,0.03804,0.10715,17.883,0.407567,0.655683,-6.787197,0.158453,2.679772,0.131728
193,198.764,396.961,74.904,0.00740,0.00004,0.00370,0.00390,0.01109,0.02296,0.241,...,0.01588,0.03794,0.07223,19.020,0.451221,0.643956,-6.744577,0.207454,2.138608,0.123306


In [None]:
Y

0      1
1      1
2      1
3      1
4      1
      ..
190    0
191    0
192    0
193    0
194    0
Name: status, Length: 195, dtype: int64

###4.5 Splitting the data into training data and test data

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 1)

In [None]:
print(X.shape,X_train.shape, X_test.shape)

(195, 22) (156, 22) (39, 22)
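
The split above is purely random. Because the classes are imbalanced, one alternative is a stratified split, which keeps the class proportions the same in both parts; scikit-learn's `train_test_split` offers this through its `stratify` parameter. The plain-Python sketch below illustrates the idea only. It is not what this notebook uses, and `stratified_split` is a hypothetical helper.

```python
import random


def stratified_split(labels, test_fraction, seed):
    """Return (train_indices, test_indices) with class ratios preserved."""
    rng = random.Random(seed)

    # Collect the row indices of each class
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set
        n_test = round(test_fraction * len(indices))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Same class counts as the dataset: 147 PD recordings and 48 healthy recordings
labels = [1] * 147 + [0] * 48

train_idx, test_idx = stratified_split(labels, test_fraction=0.2, seed=1)
n_test_healthy = sum(1 for i in test_idx if labels[i] == 0)

print(len(train_idx), len(test_idx), n_test_healthy)
```

With these counts the sketch yields 156 training rows and 39 test rows, the same sizes as the shapes printed above.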


###4.6 Data standardization

In [None]:
scaler = StandardScaler()

scaler.fit(X_train)

StandardScaler()

In [None]:
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
print(X_train)

[[-0.61443971 -0.61780655  0.04127923 ...  0.5029092  -0.40751375
  -0.68223215]
 [-0.04034851 -0.35415175 -0.98088597 ... -0.28958497 -0.65415395
  -0.42818012]
 [ 1.52915319  0.43526029 -0.54728793 ...  1.93233825  1.357776
   0.13991703]
 ...
 [-0.84047179 -0.61957265 -0.1354869  ...  1.21146574 -0.45451571
  -0.2250194 ]
 [ 0.3875642   0.83251984 -0.8922878  ...  2.03352299  1.45315514
   0.77558086]
 [ 0.52924939 -0.10330959  1.11583374 ... -0.37674221  0.37893111
  -0.39314388]]
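
For reference, `StandardScaler` rescales each feature to zero mean and unit variance using statistics learned from the training set only, and then applies those same statistics to the test set. A plain-Python sketch of that computation for a single feature follows. The numbers are made up for illustration, and `fit_standardizer` and `standardize` are hypothetical helpers.

```python
from statistics import fmean, pstdev


def fit_standardizer(column):
    """Learn the mean and population standard deviation of one feature."""
    return fmean(column), pstdev(column)


def standardize(column, mean, std):
    """Apply z = (x - mean) / std to every value."""
    return [(value - mean) / std for value in column]


# Hypothetical values of one feature
train_column = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test_column = [5.0, 11.0]

# Statistics come from the training data only
mean, std = fit_standardizer(train_column)

train_scaled = standardize(train_column, mean, std)
test_scaled = standardize(test_column, mean, std)

print(mean, std)
print(test_scaled)
```

Fitting on the training data alone, as the notebook does, keeps information about the test set out of the preprocessing step.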


###4.7 Model training

In [None]:
# Support Vector Machine Model

model = svm.SVC(kernel = 'linear')

model.fit(X_train,Y_train)

SVC(kernel='linear')

###4.8 Model evaluation

In [None]:
# accuracy score on training data

X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

print('The accuracy score of the training data is: ',(round(training_data_accuracy,3))*100,'%' )

The accuracy score of the training data is:  90.4 %


In [None]:
# accuracy score on test data

X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

print('The accuracy score of the test data is: ',(round(test_data_accuracy,3))*100,'%' )

The accuracy score of the test data is:  84.6 %


In [None]:
print(metrics.classification_report(Y_test, X_test_prediction))

              precision    recall  f1-score   support

           0       0.70      0.70      0.70        10
           1       0.90      0.90      0.90        29

    accuracy                           0.85        39
   macro avg       0.80      0.80      0.80        39
weighted avg       0.85      0.85      0.85        39
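
The report can be reproduced by hand from confusion-matrix counts. The counts below are not printed by the notebook; they are inferred from the report (support of 10 and 29, recall of 0.70 and 0.90, overall accuracy of 0.85) and should be read as an assumption consistent with it.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Precision, recall and F1 score for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Inferred counts, treating Parkinson's (1) as the positive class
tp = 26  # PD recordings predicted as PD
fn = 3   # PD recordings predicted as healthy
fp = 3   # healthy recordings predicted as PD
tn = 7   # healthy recordings predicted as healthy

pd_scores = precision_recall_f1(tp, fp, fn)
# For the healthy class the roles of the error types are swapped
healthy_scores = precision_recall_f1(tn, fn, fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("PD      :", [round(s, 2) for s in pd_scores])
print("Healthy :", [round(s, 2) for s in healthy_scores])
print("Accuracy:", round(accuracy, 3))
```

Under these assumed counts, 3 of the 10 healthy test recordings are classified as Parkinson's, which the overall accuracy figure alone does not reveal.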



###4.9 Building a predictive system

In [None]:
input_data = (120.26700,137.24400,114.82000,0.00333,0.00003,0.00155,0.00202,0.00466,0.01608,0.14000,0.00779,0.00937,0.01351,0.02337,0.00607,24.88600,0.596040,0.764112,-5.634322,0.257682,1.854785,0.211756)

# changing the input data to a numpy array

input_data_as_numpy_array = np.asarray(input_data)

# reshape the numpy array
input_data_reshaped = input_data_as_numpy_array.reshape(1,-1)

std_data = scaler.transform(input_data_reshaped)

prediction = model.predict(std_data)

if prediction[0] == 1:
  print('The person has Parkinson')
else:
  print('The person does not have Parkinson')

The person has Parkinson



X does not have valid feature names, but StandardScaler was fitted with feature names



##5. Conclusions

A support vector machine (SVM) is a widely used supervised classification method grounded in statistical learning theory. In this study, the SVM was given a set of training examples, each labelled as belonging to one of two categories (people with Parkinson's disease and healthy people), and the training algorithm built a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier.

An SVM maps training examples to points in space and chooses the separating boundary that maximises the width of the gap between the two categories. New inputs are mapped into the same space and assigned a category according to the side of the gap on which they fall, which makes it possible to predict, with a certain accuracy, whether a person has Parkinson's disease.
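
For a linear kernel, this description reduces to a simple rule: compute $w \cdot x + b$ and look at its sign, while the width of the gap equals $2 / \lVert w \rVert$. The sketch below illustrates this with made-up weights. In scikit-learn the fitted values would be read from the model's `coef_` and `intercept_` attributes; the numbers here are hypothetical and are not the ones learned in this notebook.

```python
import math


def decision_value(weights, bias, x):
    """Signed score w . x + b; its sign tells the side of the boundary."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict(weights, bias, x):
    """Class 1 on the non-negative side of the boundary, class 0 otherwise."""
    return 1 if decision_value(weights, bias, x) >= 0 else 0


def margin_width(weights):
    """Width of the gap between the two classes: 2 / ||w||."""
    return 2 / math.sqrt(sum(w * w for w in weights))


# Hypothetical parameters of a two-feature linear SVM
weights = [3.0, 4.0]
bias = -5.0

print(predict(weights, bias, [1.0, 1.0]))  # score 3 + 4 - 5 = 2, so class 1
print(predict(weights, bias, [0.0, 0.0]))  # score -5, so class 0
print(margin_width(weights))               # 2 / 5
```

Maximising the gap therefore amounts to keeping the norm of `weights` small while still separating the training examples.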

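The accuracy was 90.4% on the training data and 84.6% on the test data, which is satisfactory for a first model. For comparison, 29 of the 39 test recordings (74.4%) belong to the Parkinson's class, so always predicting Parkinson's would already reach that accuracy, and recall for the healthy class was 0.70 on only 10 test recordings.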

##6. References

https://www.kaggle.com/datasets/vikasukani/parkinsons-disease-data-set

https://seer.ufrgs.br/index.php/rita/article/view/rita_v14_n2_p43-67/3543

https://www.kaggle.com/code/vincentlugat/pima-indians-diabetes-eda-prediction-0-906

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html