# Table of Contents
<a id="toc"></a>
- [1. Introduction](#1)
- [2. UPDRS](#2)

<a id="1"></a>
# **<center><span style="color:#00BFC1;">Introduction</span></center>**

* Parkinson's disease (PD) is the second most common neurodegenerative disease after Alzheimer's disease.
* PD is primarily a movement disorder, although it also produces non-motor symptoms.
* It develops when nerve cells in the brain do not produce enough dopamine, a brain chemical.
* There is no specific test for PD, so it can be difficult to diagnose.
* Doctors rely on the medical history and a neurological examination to diagnose it.

<a id="2"></a>
# **<center><span style="color:#00BFC1;">UPDRS</span></center>**

* UPDRS stands for the Unified Parkinson’s Disease Rating Scale, a rating tool used to track the course of Parkinson’s disease in patients.
* The scale has been modified over the years by several medical organizations and remains one of the bases of treatment and research in PD clinics.
* The scale includes a series of ratings for typical Parkinson’s symptoms that cover all of the movement impairments of the disease.
* The scale consists mainly of three parts:
    1. Mentation, Behavior and Mood
    2. Activities of Daily Living (ADL)
    3. Motor
        
* Vocal impairment is one of the most important signs of PD, since it is seen in approximately 90% of patients in the earlier stages of the disease.
* PD telemonitoring studies based on speech recordings aim to map vocal features to a clinical rating system that describes how the signs of Parkinson's disease progress.
* Because UPDRS is the most widely used scale, many researchers try to estimate all or part of the UPDRS score from remotely collected data.
* Speech is assessed in two places: primarily in item 5 of part 2 (Activities of Daily Living), which rates whether the patient’s speech is understandable, and secondly in item 18 of part 3 (Motor), which rates whether the patient’s speech is expressive during conversation.
* Telemonitoring of signs can complement traditional clinical examinations and reduce the number of physical visits to clinics.
* The PD patients were monitored for six months and remained unmedicated for the duration of the study.
* Voice recordings were obtained at weekly intervals throughout the six months, whereas motor and total UPDRS were assessed only three times by the medical staff: at baseline (onset of the trial) and after three and six months.
* The weekly UPDRS values missing between those assessments were obtained by linear interpolation.
* During the six-month data collection period, six sustained phonations of the vowel /a/ were recorded in each trial, for a total of 5875 voice recordings.
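
The interpolation step described above can be sketched as follows. The three clinic scores and the weekly recording days are made-up values for illustration, not data from the study, and `np.interp` is only one way to do piecewise-linear interpolation.

```python
import numpy as np

# Hypothetical clinic assessments for one subject (illustrative values only):
# days since baseline and the total UPDRS scored at each visit.
visit_days = np.array([0.0, 90.0, 180.0])
visit_updrs = np.array([30.0, 33.0, 39.0])

# Weekly voice-recording days over the six-month window.
recording_days = np.arange(0.0, 181.0, 7.0)

# Piecewise-linear interpolation between the clinic visits.
weekly_updrs = np.interp(recording_days, visit_days, visit_updrs)

for day, score in zip(recording_days[:4], weekly_updrs[:4]):
    print(f"day {day:5.1f}: interpolated total UPDRS = {score:.2f}")
```

Every weekly recording therefore receives a target value, but only three of those values per subject are actual clinical measurements.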

Erdogdu Sakar B, Serbes G, Sakar CO (2017) Analyzing the effectiveness of vocal features in early telediagnosis of Parkinson’s disease. PLoS ONE 12(8): e0182428. https://doi.org/10.1371/journal.pone.0182428

https://journals.plos.org/plosone/article/figure?id=10.1371/journal.pone.0182428.t001
    

<a id="2"></a>
# **<center><span style="color:#00BFC1;">Dataset Description</span></center>**

Features include subject age, gender, the time interval from the baseline recruitment date, and 16 biomedical voice measurements recorded with a telemonitoring device.

![Dataset](./Dataset.png)

## Importing Modules

In [None]:
import warnings
from time import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
from pandas_profiling import ProfileReport
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, SGDRegressor, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, explained_variance_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

%matplotlib inline
warnings.filterwarnings('ignore')
pd.set_option("display.precision", 8)

FIGURESIZE=(20,15)
FONTSIZE=12
plt.rcParams['figure.figsize'] = FIGURESIZE
plt.rcParams['font.size'] = FONTSIZE
plt.rcParams['xtick.labelsize'] = FONTSIZE
plt.rcParams['ytick.labelsize'] = FONTSIZE
scaler = StandardScaler()

In [None]:
data = pd.read_csv('parkinsons_updrs.csv.data')
data.head()

### Split Data to Train and Test

In [None]:
def split_data(data):
    train, test = train_test_split(data, test_size=0.3, random_state=42)
    return train, test
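
Because each subject contributes many recordings, a purely random row split places recordings from the same person in both sets, which can make test scores look better than they would on unseen patients. Below is a minimal sketch of a subject-level alternative, assuming scikit-learn's `GroupShuffleSplit` and a toy frame that reuses the `subject#` column name. `split_by_subject` is a hypothetical helper and is not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def split_by_subject(data, group_col="subject#", test_size=0.3, random_state=42):
    """Split rows so that no subject appears in both train and test."""
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=random_state)
    train_idx, test_idx = next(splitter.split(data, groups=data[group_col]))
    return data.iloc[train_idx], data.iloc[test_idx]


# Toy data: 10 subjects with 20 recordings each (synthetic values).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "subject#": np.repeat(np.arange(1, 11), 20),
    "test_time": rng.uniform(0, 180, size=200),
    "total_UPDRS": rng.uniform(10, 50, size=200),
})

toy_train, toy_test = split_by_subject(toy)
shared = set(toy_train["subject#"]) & set(toy_test["subject#"])
print("subjects in both sets:", shared)
```

Here `test_size` is a fraction of subjects rather than rows, so the row counts of the two sets can differ from a 70/30 split.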

### Dropping rows where test_time is Negative

In [None]:
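def clean_data(data):
    # Keep only rows with a positive test_time (this drops the negative values).
    # .copy() avoids pandas' SettingWithCopyWarning on the assignment below.
    data = data[data['test_time'] > 0].copy()

    # Convert subject# and sex to categorical type
    data[['subject#', 'sex']] = data[['subject#', 'sex']].astype("category")

    # subject# stays a regular column because later cells drop it by name
    return data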

In [None]:
# Plot pie chart for gender
data.groupby('sex').size().plot(kind='pie', autopct='%.2f',labels=['Male','Female'], label="")

In [None]:
#plot age distribution
data.age.plot.hist(bins=20,legend=True)

In [None]:
data.total_UPDRS.plot.hist(bins=20,legend=True,label='Total UPDRS')
plt.savefig('hist_Total_UPDRS.png')

In [None]:
data.motor_UPDRS.plot.hist(bins=20,legend=True,label='Motor UPDRS')
plt.savefig('hist_motor_UPDRS.png')

### Generate Profile Report with Various Analysis

In [None]:
def save_profile_report(data, path='EDA.html'):
    ProfileReport(data).to_file(path)

In [None]:
data = clean_data(data)

In [None]:
train,test = split_data(data)

In [None]:
train.drop('subject#',axis=1, inplace=True)

In [None]:
total_UPDRS = train['total_UPDRS']
motor_UPDRS = train['motor_UPDRS']
X_train = train.drop(['total_UPDRS','motor_UPDRS'],axis=1)

test_total_UPDRS = test['total_UPDRS']
test_motor_UPDRS = test['motor_UPDRS']
X_test = test.drop(['total_UPDRS','motor_UPDRS'],axis=1)

In [None]:
# subject_test = X_test['subject#']
X_test.drop('subject#',inplace=True, axis=1)

In [None]:
sns.pairplot(X_train)
plt.savefig('Pairplot.png',bbox_inches='tight', dpi=150)

In [None]:
corr = X_train.corr()

In [None]:
plt.figure(figsize=(15,8))
sns.heatmap(corr, cmap="magma_r",annot=True,fmt='.1f')
plt.savefig('Correlation_plot.png',bbox_inches='tight', dpi=150)
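
A heatmap with this many features can be hard to read. One way to list the strongly correlated pairs directly is sketched below on synthetic columns. `top_correlated_pairs`, the column names `f1` to `f3`, and the 0.9 threshold are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame, threshold=0.9):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair is reported once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Synthetic example: f2 is a noisy copy of f1, f3 is independent.
rng = np.random.default_rng(0)
f1 = rng.normal(size=500)
demo = pd.DataFrame({
    "f1": f1,
    "f2": f1 + rng.normal(scale=0.05, size=500),
    "f3": rng.normal(size=500),
})

print(top_correlated_pairs(demo))
```

Calling the same helper on a DataFrame of the training features would return the pairs that the heatmap shows in its darkest cells.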

### Feature Transformation

In [None]:
def scale_data(data,train=False):
    if train:
        return scaler.fit_transform(data)
    else:
        return scaler.transform(data)

In [None]:
cols = X_train.columns
X_train = scale_data(X_train, train=True)
X_test = scale_data(X_test)
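
The `train` flag matters because the scaler should learn its mean and standard deviation from the training rows only and then reuse them on the test rows. The sketch below shows the effect on synthetic arrays; the names `demo_scaler`, `train_demo` and `test_demo` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
test_demo = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train_demo)  # learns mean/std from train only
test_scaled = demo_scaler.transform(test_demo)        # reuses the training statistics

print("train mean:", train_scaled.mean(axis=0).round(6))
print("train std: ", train_scaled.std(axis=0).round(6))
print("test mean: ", test_scaled.mean(axis=0).round(3))
```

The training columns come out with mean 0 and standard deviation 1, while the test columns are only approximately standardized, which is expected.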


In [None]:
pd.set_option('display.float_format', lambda x: '%.2f' % x)
pd.DataFrame(X_train,columns =cols).describe()


# Feature selection -- F test 

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

def select_features_on_f_test(X_train, y_train):

    f_model = SelectKBest(score_func=f_regression, k='all')
   
    f_model.fit(X_train, y_train)
    
    X_train_fs = f_model.transform(X_train)
    
    return X_train_fs, f_model

X_train_fs_total, f_model_total = select_features_on_f_test(X_train,total_UPDRS)
X_train_fs_motor, f_model_motor = select_features_on_f_test(X_train,motor_UPDRS)
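
With `k='all'` the selector only scores the features, and the transformed matrix keeps every column. The following sketch shows what an actual selection could look like on synthetic data from `make_regression`. The choice `k=3` is arbitrary and not taken from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic regression problem: 10 features, only 3 carry signal.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, n_informative=3, noise=5.0, random_state=0
)

selector = SelectKBest(score_func=f_regression, k=3)
X_demo_top = selector.fit_transform(X_demo, y_demo)

print("kept column indices:", selector.get_support(indices=True))
print("shape before/after:", X_demo.shape, X_demo_top.shape)
```

The fitted selector's `transform` can then be applied to the test matrix so that both sets keep the same columns.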

In [None]:
feature_scores_total = sorted(list(zip(cols,f_model_total.scores_)),key=lambda x:x[1], reverse=True)
for feature,score in feature_scores_total:
    print("Importance of feature", '\033[1m','\033[96m',feature,'\033[0m'," w.r.t Total UPDRS is: ",score)

In [None]:
plt.bar(cols,f_model_total.scores_)
plt.xticks(rotation='vertical')
plt.xlabel('Features')
plt.ylabel('F-score w.r.t Total UPDRS')
plt.savefig('F_Test_total.png',bbox_inches='tight',dpi=150)

In [None]:
feature_scores_motor = sorted(list(zip(cols,f_model_motor.scores_)),key=lambda x:x[1], reverse=True)
for feature,score in feature_scores_motor: 
    print("Importance of feature",'\033[1m','\033[96m', feature,'\033[0m'," w.r.t \033[92m Motor UPDRS \033[0m is: ",score)

In [None]:
plt.bar(cols,f_model_motor.scores_)
plt.xticks(rotation='vertical')
plt.xlabel('Features')
plt.ylabel('F-score w.r.t Motor UPDRS')
plt.savefig('F_Test_motor.png',bbox_inches='tight',dpi=150)

# Feature Selection - Mutual Information 

In [None]:
from sklearn.feature_selection import mutual_info_regression

def select_features_on_mir(X_train, y_train):

    mir_model = SelectKBest(score_func=mutual_info_regression, k='all')
   
    mir_model.fit(X_train, y_train)
    
    X_train_mir = mir_model.transform(X_train)
    
    return X_train_mir, mir_model

X_train_mir_total, mir_model_total = select_features_on_mir(X_train,total_UPDRS)
X_train_mir_motor, mir_model_motor = select_features_on_mir(X_train,motor_UPDRS)
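
The F-test only measures linear association, whereas mutual information can also pick up non-linear dependence. The sketch below illustrates the difference on synthetic data. The feature names and the target formula are invented for the example.

```python
import numpy as np
from sklearn.feature_selection import f_regression, mutual_info_regression

rng = np.random.default_rng(0)
x_linear = rng.uniform(-1, 1, size=1000)
x_quadratic = rng.uniform(-1, 1, size=1000)
x_noise = rng.uniform(-1, 1, size=1000)

# Target depends linearly on one feature and quadratically on another.
y_demo = 0.5 * x_linear + x_quadratic ** 2 + rng.normal(scale=0.05, size=1000)
X_demo = np.column_stack([x_linear, x_quadratic, x_noise])

f_scores, _ = f_regression(X_demo, y_demo)
mi_scores = mutual_info_regression(X_demo, y_demo, random_state=0)

for name, f_val, mi_val in zip(["linear", "quadratic", "noise"], f_scores, mi_scores):
    print(f"{name:10s} F = {f_val:10.2f}   MI = {mi_val:.3f}")
```

The quadratic feature should receive a low F-score but a clearly higher mutual information score than the pure-noise feature, which is why the two rankings in this notebook need not agree.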

In [None]:
feature_scores_motor = sorted(list(zip(cols,mir_model_motor.scores_)),key=lambda x:x[1], reverse=True)
print('******* Mutual Information Feature Selection ********\n')
for feature,score in feature_scores_motor:
    print("Importance of feature",'\033[1m','\033[96m', feature,'\033[0m' ," w.r.t \033[95m Motor UPDRS \033[0m is: ",score)

In [None]:
plt.bar(cols,mir_model_motor.scores_)
plt.xticks(rotation='vertical')
plt.xlabel('Features')
plt.ylabel('Mutual Information-score w.r.t Motor UPDRS')
plt.savefig('MIR_Test_motor.png',bbox_inches='tight',dpi=150)

In [None]:
feature_scores_total = sorted(list(zip(cols,mir_model_total.scores_)),key=lambda x:x[1], reverse=True)
for feature,score in feature_scores_total:
    print("Importance of feature", '\033[1m','\033[96m', feature,'\033[0m'," w.r.t \033[95m Total UPDRS \033[0m is: ",score)

In [None]:
plt.bar(cols,mir_model_total.scores_)
plt.xticks(rotation='vertical')
plt.xlabel('Features')
plt.ylabel('MIR score w.r.t Total UPDRS')
plt.savefig('MIR_test_total.png',bbox_inches='tight',dpi=150)

# Recursive Feature Elimination

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge, Lasso, RidgeCV, LassoCV

## RFE - Ridge Regression

In [None]:
model = Ridge()

rfe = RFE(model, n_features_to_select=0.5)

X_rfe = rfe.fit_transform(X_train,total_UPDRS)  

model.fit(X_rfe,total_UPDRS)

result_filter = [i for i, x in enumerate(rfe.support_) if x]
rfe_columns = []
for item in result_filter:
    rfe_columns.append(item)
    
X_train_rfe_ridge = np.take(X_train,rfe_columns,axis=1)
X_test_rfe_ridge = np.take(X_test,rfe_columns,axis=1)

## RFE - Lasso

In [None]:
model = Lasso()

rfe = RFE(model)
X_rfe = rfe.fit_transform(X_train, total_UPDRS)  
model.fit(X_rfe,total_UPDRS)


result_filter = [i for i, x in enumerate(rfe.support_) if x]

rfe_columns = []
for item in result_filter:
    rfe_columns.append(item)
    
X_train_rfe_lasso = np.take(X_train,rfe_columns,axis=1)
X_test_rfe_lasso = np.take(X_test,rfe_columns,axis=1)

## RFE - Extra Trees Regressor

In [None]:
model = ExtraTreesRegressor()

rfe = RFE(model)
X_rfe = rfe.fit_transform(X_train,total_UPDRS)  
model.fit(X_rfe,total_UPDRS)


result_filter = [i for i, x in enumerate(rfe.support_) if x]

rfe_columns = []
for item in result_filter:
    rfe_columns.append(item)
    
X_train_rfe_xtree = np.take(X_train,rfe_columns,axis=1)
X_test_rfe_xtree = np.take(X_test,rfe_columns,axis=1)

In [None]:
rfe.ranking_
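
The RFE cells above rebuild the selected column indices by hand. A fitted `RFE` object also exposes the boolean mask `support_` and a `transform` method, sketched here on synthetic data. The names `demo_rfe` and `feature_names` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import Ridge

X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, n_informative=4, noise=1.0, random_state=0
)
feature_names = [f"feat_{i}" for i in range(X_demo.shape[1])]

# Keep half of the features, given as a fraction like in the Ridge cell above.
demo_rfe = RFE(Ridge(), n_features_to_select=0.5)
X_demo_selected = demo_rfe.fit_transform(X_demo, y_demo)

# support_ is a boolean mask; ranking_ is 1 for kept features and
# larger for features that were eliminated earlier.
kept = [name for name, keep in zip(feature_names, demo_rfe.support_) if keep]
print("kept features:", kept)
print("ranking:", demo_rfe.ranking_)

# transform() applies the same mask to new rows.
X_new_selected = demo_rfe.transform(X_demo[:5])
```

Using `transform` on the test matrix gives the same result as `np.take` with the selected indices, with less bookkeeping.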

### Principal Component Analysis

In [None]:
pca = PCA(n_components=8)
pca_out = pca.fit(X_train)
print(pca_out.explained_variance_ratio_)
pca_x = pca.transform(X_train)
pca_xtest = pca.transform(X_test)

In [None]:
# n_components_ is the number of fitted components (n_features_ counted input columns)
num_pc = pca_out.n_components_
pc_list = ["PC"+str(i) for i in range(1, num_pc+1)]

In [None]:
plt.plot(pc_list, pca_out.explained_variance_ratio_, 'o-', linewidth=2, color='blue')
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
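
A scree plot is often read together with the cumulative explained variance. The sketch below picks the smallest number of components that reaches a target ratio, using synthetic data with three latent factors. The 95% target is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 columns driven by 3 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + rng.normal(scale=0.1, size=(500, 10))

demo_pca = PCA().fit(X_demo)
cumulative = np.cumsum(demo_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 95%.
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print("cumulative variance:", cumulative.round(3))
print("components for 95%:", n_needed)
```

The same check on the scaled training matrix would show whether the eight components used here retain enough variance.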

# Model Fit and Testing

## Fit Multiple Models and Select the Best Model for Hyperparameter Tuning

In [None]:
models = [LinearRegression(),
                 Lasso(),
                 Ridge(), 
                 ElasticNet(), 
                 KNeighborsRegressor(),
                 GradientBoostingRegressor(),
                 ExtraTreesRegressor(),
                 RandomForestRegressor(),
                 DecisionTreeRegressor(),
                 SVR(kernel='rbf',)
                ]

print("\n********* Model Results on PCA Components -- Total UPDRS **********\n")
for model in models:
    
    model.fit(pca_x, total_UPDRS)
    y_pred = model.predict(pca_xtest)
    
    print(model)
    print("\tExplained variance:", explained_variance_score(test_total_UPDRS, y_pred))
    print("\tMean absolute error:", mean_absolute_error(test_total_UPDRS, y_pred))
    print("\tRoot Mean Square Error:", np.sqrt(mean_squared_error(test_total_UPDRS, y_pred)))
    print("\tR2 score:", r2_score(test_total_UPDRS, y_pred))
    print()
    

print("\n********* Model Results on Original Features -- Total UPDRS **********\n")    
for model in models:
    
    model.fit(X_train, total_UPDRS)
    y_pred = model.predict(X_test)
    
    print(model)
    print("\tExplained variance:", explained_variance_score(test_total_UPDRS, y_pred))
    print("\tMean absolute error:", mean_absolute_error(test_total_UPDRS, y_pred))
    print("\tRoot Mean Square Error:", np.sqrt(mean_squared_error(test_total_UPDRS, y_pred)))
    print("\tR2 score:", r2_score(test_total_UPDRS, y_pred))
    print()
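
The same fit, predict and report loop is repeated for every feature set and target. It could be factored into one helper that returns a table, as sketched below. `evaluate_models` is a hypothetical helper, and the data comes from `make_regression` rather than from the Parkinson's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, r2_score)
from sklearn.model_selection import train_test_split


def evaluate_models(models, X_tr, y_tr, X_te, y_te):
    """Fit each model and collect the four metrics used in this notebook."""
    rows = []
    for model in models:
        model.fit(X_tr, y_tr)
        y_pred = model.predict(X_te)
        rows.append({
            "Model": type(model).__name__,
            "Explained Variance": explained_variance_score(y_te, y_pred),
            "Mean Absolute Error": mean_absolute_error(y_te, y_pred),
            "Root Mean Square Error": np.sqrt(mean_squared_error(y_te, y_pred)),
            "R2 Score": r2_score(y_te, y_pred),
        })
    return pd.DataFrame(rows)


# Synthetic stand-in for the scaled feature matrices.
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

summary = evaluate_models([LinearRegression(), Ridge()], X_tr, y_tr, X_te, y_te)
print(summary.sort_values("Root Mean Square Error"))
```

Returning a DataFrame makes it straightforward to compare feature sets side by side instead of scanning printed blocks.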

In [None]:
print("\n********* Model Results on PCA Components -- Motor UPDRS **********\n")
for model in models:
    
    model.fit(pca_x, motor_UPDRS)
    y_pred = model.predict(pca_xtest)
    
    print(model)
    print("\tExplained variance:", explained_variance_score(test_motor_UPDRS, y_pred))
    print("\tMean absolute error:", mean_absolute_error(test_motor_UPDRS, y_pred))
    print("\tRoot Mean Square Error:", np.sqrt(mean_squared_error(test_motor_UPDRS, y_pred)))
    print("\tR2 score:", r2_score(test_motor_UPDRS, y_pred))
    print()
    

print("\n********* Model Results on Original Features -- Motor UPDRS **********\n")    
for model in models:
    
    model.fit(X_train, motor_UPDRS)
    y_pred = model.predict(X_test)
    
    print(model)
    print("\tExplained variance:", explained_variance_score(test_motor_UPDRS, y_pred))
    print("\tMean absolute error:", mean_absolute_error(test_motor_UPDRS, y_pred))
    print("\tRoot Mean Square Error:", np.sqrt(mean_squared_error(test_motor_UPDRS, y_pred)))
    print("\tR2 score:", r2_score(test_motor_UPDRS, y_pred))
    print()

In [None]:
print("\n********* Model Results on Features selected using F-test -- Total UPDRS **********\n")
for model in models:
    
    model.fit(X_train_fs_total, total_UPDRS)
    y_pred = model.predict(f_model_total.transform(X_test))
    
    print(model)
    print("\tExplained variance:", explained_variance_score(test_total_UPDRS, y_pred))
    print("\tMean absolute error:", mean_absolute_error(test_total_UPDRS, y_pred))
    print("\tRoot Mean Square Error:", np.sqrt(mean_squared_error(test_total_UPDRS, y_pred)))
    print("\tR2 score:", r2_score(test_total_UPDRS, y_pred))
    print()
    

print("\n********* Model Results on Features selected using F-test -- Motor UPDRS **********\n")    
for model in models:
    
    model.fit(X_train_fs_motor, motor_UPDRS)
    y_pred = model.predict(f_model_motor.transform(X_test))
    
    print(model)
    print("\tExplained variance:", explained_variance_score(test_motor_UPDRS, y_pred))
    print("\tMean absolute error:", mean_absolute_error(test_motor_UPDRS, y_pred))
    print("\tRoot Mean Square Error:", np.sqrt(mean_squared_error(test_motor_UPDRS, y_pred)))
    print("\tR2 score:", r2_score(test_motor_UPDRS, y_pred))
    print()

In [None]:
print("\n********* Model Results on Feature Selection using Mutual Information -- Total UPDRS **********\n")
for model in models:
    
    model.fit(X_train_mir_total, total_UPDRS)
    y_pred = model.predict(mir_model_total.transform(X_test))
    
    print(model)
    print("\tExplained variance:", explained_variance_score(test_total_UPDRS, y_pred))
    print("\tMean absolute error:", mean_absolute_error(test_total_UPDRS, y_pred))
    print("\tRoot Mean Square Error:", np.sqrt(mean_squared_error(test_total_UPDRS, y_pred)))
    print("\tR2 score:", r2_score(test_total_UPDRS, y_pred))
    print()
    

print("\n********* Model Results on Features selected using Mutual Information -- Motor UPDRS **********\n")    
for model in models:
    
    model.fit(X_train_mir_motor, motor_UPDRS)
    y_pred = model.predict(mir_model_motor.transform(X_test))
    
    print(model)
    print("\tExplained variance:", explained_variance_score(test_motor_UPDRS, y_pred))
    print("\tMean absolute error:", mean_absolute_error(test_motor_UPDRS, y_pred))
    print("\tRoot Mean Square Error:", np.sqrt(mean_squared_error(test_motor_UPDRS, y_pred)))
    print("\tR2 score:", r2_score(test_motor_UPDRS, y_pred))
    print()

In [None]:
print("\n********* Model Results on Feature Selection using RFE Extra Trees Regressor-- Total UPDRS **********\n")
for model in models:
    
    model.fit(X_train_rfe_xtree, total_UPDRS)
    y_pred = model.predict(X_test_rfe_xtree)
    
    print(model)
    print("\tExplained variance:", explained_variance_score(test_total_UPDRS, y_pred))
    print("\tMean absolute error:", mean_absolute_error(test_total_UPDRS, y_pred))
    print("\tRoot Mean Square Error:", np.sqrt(mean_squared_error(test_total_UPDRS, y_pred)))
    print("\tR2 score:", r2_score(test_total_UPDRS, y_pred))
    print()
    

print("\n********* Model Results on Features selected using RFE  Extra Trees Regressor -- Motor UPDRS **********\n")    
for model in models:
    
    model.fit(X_train_rfe_xtree, motor_UPDRS)
    y_pred = model.predict(X_test_rfe_xtree)
    
    print(model)
    print("\tExplained variance:", explained_variance_score(test_motor_UPDRS, y_pred))
    print("\tMean absolute error:", mean_absolute_error(test_motor_UPDRS, y_pred))
    print("\tRoot Mean Square Error:", np.sqrt(mean_squared_error(test_motor_UPDRS, y_pred)))
    print("\tR2 score:", r2_score(test_motor_UPDRS, y_pred))
    print()

In [None]:
print("Selected Features using Recursive Feature Elimination:\n")
for i in rfe_columns:
    print(cols[i])

### The cell below plots metrics for the different models. Replace X_train and X_test with the matrices of the feature set being evaluated

In [None]:
# Create a table of the results
rows = []

for model in models:
    
    model.fit(X_train, total_UPDRS)
    y_pred = model.predict(X_test)
    
    rows.append({'Model': model.__class__.__name__,
                 'Explained Variance': explained_variance_score(test_total_UPDRS, y_pred),
                 'Mean Absolute Error': mean_absolute_error(test_total_UPDRS, y_pred),
                 'Root Mean Square Error': np.sqrt(mean_squared_error(test_total_UPDRS, y_pred)),
                 'R2 Score': r2_score(test_total_UPDRS, y_pred)})

# DataFrame.append was removed in pandas 2.0, so build the table from a list of dicts
results = pd.DataFrame(rows, columns=['Model', 'Explained Variance', 'Mean Absolute Error', 'Root Mean Square Error', 'R2 Score'])
    
def highlight_min(s):
    is_min = s == s.min()
    return ['background-color: yellow' if v else '' for v in is_min]


# Plot the results
fig, ax = plt.subplots(figsize=(10,6))

# Plot the bars
results.plot(x='Model', y=['Explained Variance', 'Mean Absolute Error', 'Root Mean Square Error', 'R2 Score'], kind='bar', ax=ax)

# Set the title
ax.set_title('Model Results on Original Features -- Total UPDRS')

# Set the x-axis label
ax.set_xlabel('Model')

# Set the y-axis label
ax.set_ylabel('Score')

# Show the plot
plt.show()

# Print the results
print(results)

**From the above, RandomForestRegressor and ExtraTreesRegressor perform best, so hyperparameter tuning is carried out on these two models.**

#### Run the cells below only if required: hyperparameter tuning takes a long time

### ExtraTreesRegressor -- Total UPDRS

In [None]:
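param_grid = {
    'n_estimators': [10,50,100],
    # 'mse'/'mae' were renamed in scikit-learn 1.0 and removed in 1.2
    'criterion': ['squared_error', 'absolute_error'],
    'max_depth': [2,4,6,None],
    'min_samples_split': [2,4,6],
    'min_samples_leaf': [1,2],
    # 1.0 (all features) replaces 'auto', which was removed in scikit-learn 1.3
    'max_features': [1.0,'sqrt','log2'],
#     'warm_start':[True, False],
    'bootstrap': [True, False]
}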

grid = GridSearchCV(ExtraTreesRegressor(),param_grid)
model = grid.fit(X_train,total_UPDRS)
print(model.best_params_,'\n')
print(model.best_estimator_,'\n')
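
To see why the search is slow, the number of candidate settings can be counted with scikit-learn's `ParameterGrid`. The grid below has the same number of options per parameter as the ExtraTreesRegressor grid above. `demo_grid` is an illustrative name, and the fold count assumes the `GridSearchCV` default of five folds.

```python
from sklearn.model_selection import ParameterGrid

# Same number of options per parameter as the grid above.
demo_grid = {
    "n_estimators": [10, 50, 100],
    "criterion": ["squared_error", "absolute_error"],
    "max_depth": [2, 4, 6, None],
    "min_samples_split": [2, 4, 6],
    "min_samples_leaf": [1, 2],
    "max_features": [1.0, "sqrt", "log2"],
    "bootstrap": [True, False],
}

n_candidates = len(ParameterGrid(demo_grid))
n_folds = 5  # GridSearchCV's default cv when none is given
print("candidates:", n_candidates)
print("model fits:", n_candidates * n_folds)
```

Passing `n_jobs=-1` to `GridSearchCV`, or sampling the grid with `RandomizedSearchCV`, are two common ways to shorten the run. Neither is used in this notebook.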

In [None]:
y_pred = model.best_estimator_.predict(X_test)

print(model)
print("\tExplained variance:", explained_variance_score(test_total_UPDRS, y_pred))
print("\tMean absolute error:", mean_absolute_error(test_total_UPDRS, y_pred))
print("\tRoot Mean Square Error:", np.sqrt(mean_squared_error(test_total_UPDRS, y_pred)))
print("\tR2 score:", r2_score(test_total_UPDRS, y_pred))
print()

### Random Forest Regressor -- Total UPDRS

In [None]:
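param_grid = {
        'bootstrap': [True, False],
        'max_depth': [2,4,6,None],
        # 1.0 (all features) replaces 'auto', which was removed in scikit-learn 1.3
        'max_features': [1.0, 'sqrt'],
        'min_samples_leaf': [1, 2],
        'min_samples_split': [2,4,6],
        'n_estimators': [10,50,100],
        # 'mse'/'mae' were renamed in scikit-learn 1.0 and removed in 1.2
        'criterion':['squared_error','absolute_error']
}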
grid = GridSearchCV(RandomForestRegressor(),param_grid)
model = grid.fit(X_train,total_UPDRS)
print(model.best_params_,'\n')
print(model.best_estimator_,'\n')

## Extra Trees Regressor - Motor UPDRS

In [None]:
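param_grid = {
    'n_estimators': [10,50,100],
    # 'mse'/'mae' were renamed in scikit-learn 1.0 and removed in 1.2
    'criterion': ['squared_error', 'absolute_error'],
    'max_depth': [2,8,16,32,None],
    'min_samples_split': [2,4,6],
    'min_samples_leaf': [1,2],
    # 1.0 (all features) replaces 'auto', which was removed in scikit-learn 1.3
    'max_features': [1.0,'sqrt','log2'],
    'warm_start':[True, False],
}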

grid = GridSearchCV(ExtraTreesRegressor(),param_grid)
model = grid.fit(X_train,motor_UPDRS)
print(model.best_params_,'\n')
print(model.best_estimator_,'\n')

## Random Forest Regressor - Motor UPDRS

In [None]:
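param_grid = {
        'bootstrap': [True, False],
        'max_depth': [2,8,16,32,None],
        # 1.0 (all features) replaces 'auto', which was removed in scikit-learn 1.3
        'max_features': [1.0, 'sqrt'],
        'min_samples_leaf': [1, 2],
        'min_samples_split': [2,4,6],
        'n_estimators': [10,50,100],
        # 'mse'/'mae' were renamed in scikit-learn 1.0 and removed in 1.2
        'criterion':['squared_error','absolute_error']
}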
grid = GridSearchCV(RandomForestRegressor(),param_grid)
model = grid.fit(X_train,motor_UPDRS)
print(model.best_params_,'\n')
print(model.best_estimator_,'\n')