# Parkinson's Disease Detector with Apache Cassandra and PySpark Machine Learning

#### Jupyter notebook inspired by the template at https://github.com/datastaxdevs/workshop-machine-learning/blob/master/jupyter/Random%20Forest.ipynb

In [40]:
!pip3 install matplotlib --quiet
!pip3 install ipykernel --quiet

In [41]:
!pip install cassandra-driver --quiet
!pip install pyspark==3.4.1 --quiet

In [42]:
!python3 -m ipykernel install --user --name=vs-l-pd-detector

Installed kernelspec vs-l-pd-detector in /Users/mariannelynemanaog/Library/Jupyter/kernels/vs-l-pd-detector


In [43]:
%env PYDEVD_DISABLE_FILE_VALIDATION=1

In [44]:
import os
import random
import re
import warnings

import cassandra
import matplotlib.pyplot as plt
import pandas as pd
import pyspark

from IPython.display import display, Markdown
from random import randint, randrange

from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy

from pyspark.sql import SparkSession

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier, GBTClassifier, RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import IndexToString, StringIndexer, VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

In [45]:
warnings.filterwarnings('ignore')
%matplotlib inline

## Set up Apache Cassandra

In [46]:
# Install the latest version of Cassandra (4.1.3) from https://www.apache.org/dyn/closer.lua/cassandra/4.1.3/apache-cassandra-4.1.3-bin.tar.gz

In [47]:
# # Install GPG to verify the hash of the downloaded tarball
# !arch -arm64 brew install gnupg gnupg2

# # Link GPG
# !brew link gnupg

# !gpg --print-md SHA256 apache-cassandra-4.1.3-bin.tar.gz

In [48]:
# Compare the computed checksum with the SHA256 file published on the Downloads site
!curl -L https://downloads.apache.org/cassandra/4.1.3/apache-cassandra-4.1.3-bin.tar.gz.sha256

da014999723f4e1e2c15775dac6aaa9ff69a48f6df6465740fcd52ca9d19ea88
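
The same comparison can be done without GPG. The following is a minimal sketch using Python's standard `hashlib`; the helper name `sha256_of_file` is hypothetical, and the tarball path in the usage comment is an assumption about where the download was saved.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Digest published on the Apache downloads site (copied from the cell output above).
EXPECTED = "da014999723f4e1e2c15775dac6aaa9ff69a48f6df6465740fcd52ca9d19ea88"

# Example usage (the path is an assumption; adjust it to the actual download location):
# assert sha256_of_file("apache-cassandra-4.1.3-bin.tar.gz") == EXPECTED
```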


In [49]:
# # Unpack the tarball
# !tar xzvf apache-cassandra-4.1.3-bin.tar.gz

In [50]:
# os.chdir('apache-cassandra-4.1.3')

In [51]:
!pwd

/Users/mariannelynemanaog/PycharmProjects/vs-ml-pd-detector/notebooks


In [52]:
!bin/cassandra

zsh:1: no such file or directory: bin/cassandra


In [53]:
# Verify cassandra installation by checking its version number
!cassandra -v

4.1.3


In [54]:
# Start the cassandra server on the terminal
# !cassandra -f

## Creating and loading tables

### Connect to Cassandra

In [55]:
# Get the IP address by running 'cqlsh' on the terminal
cluster = Cluster(['127.0.0.1'], protocol_version=5, load_balancing_policy=DCAwareRoundRobinPolicy())
session = cluster.connect()

### Create keyspace 

In [56]:
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS parkinson 
    WITH REPLICATION = 
    { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }"""
)

<cassandra.cluster.ResultSet at 0x7fc000c5db50>

### Set keyspace 

In [57]:
session.set_keyspace('parkinson')

### Create two tables called `speech_data_train` and `speech_data_test` containing the train and test sets respectively. The PRIMARY KEY of each table is `subject_id`, which uniquely identifies each row.

In [58]:
query = "CREATE TABLE IF NOT EXISTS speech_data_train \
                                   (subject_id text, jitter_percent float, jitter_abs float, rap float, ppq float, \
                                   apq_3 float, apq_5 float, apq_11 float, status int, \
                                   PRIMARY KEY (subject_id))"
session.execute(query)

<cassandra.cluster.ResultSet at 0x7fc000caa550>

In [59]:
query = "CREATE TABLE IF NOT EXISTS speech_data_test \
                                   (subject_id text, jitter_percent float, jitter_abs float, rap float, ppq float, \
                                   apq_3 float, apq_5 float, apq_11 float, status int, \
                                   PRIMARY KEY (subject_id))"
session.execute(query)

<cassandra.cluster.ResultSet at 0x7fbff34118e0>

### Load the train and test datasets from csv files

#### Insert train and test speech data into the tables `speech_data_train` and `speech_data_test` respectively

In [60]:
import csv

data_dir = '/Users/mariannelynemanaog/PycharmProjects/vs-ml-pd-detector/src/data/train_and_test_sets'
columns = "subject_id, jitter_percent, jitter_abs, rap, ppq, apq_3, apq_5, apq_11, status"

def load_csv_into_table(file_name, table_name):
    # Insert every data row of the CSV file into the given Cassandra table
    query = f"INSERT INTO {table_name} ({columns}) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)"
    # The 'with' block closes the file even if an insert fails
    with open(file_name, 'r', newline='') as input_file:
        reader = csv.reader(input_file)
        next(reader)  # Skip the first line, as it has the header with the column names
        for row in reader:
            # subject_id is text, the seven speech features are floats, status is an int
            session.execute(query, (str(row[0]), *(float(value) for value in row[1:8]), int(row[8])))

load_csv_into_table(f'{data_dir}/train_data.csv', 'speech_data_train')
load_csv_into_table(f'{data_dir}/test_data.csv', 'speech_data_test')
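
The loader reads values by position (`row[0]` to `row[8]`), so it silently depends on the column order of the CSV files. A minimal sketch of a name-based alternative follows. It assumes the CSV header uses the same column names as the table, which the source does not confirm; `COLUMNS`, `parse_row`, and the sample data are hypothetical.

```python
import csv
import io

# Column order of the INSERT statement; assumed to match the CSV header names.
COLUMNS = ["subject_id", "jitter_percent", "jitter_abs", "rap", "ppq",
           "apq_3", "apq_5", "apq_11", "status"]

def parse_row(record):
    """Convert one csv.DictReader record into a typed tuple for the INSERT."""
    values = [record[name] for name in COLUMNS]
    # subject_id stays text, status is an int, everything in between is a float.
    return (str(values[0]), *(float(v) for v in values[1:-1]), int(values[-1]))

# Made-up sample whose columns are deliberately in a different order.
sample = io.StringIO(
    "status,subject_id,jitter_percent,jitter_abs,rap,ppq,apq_3,apq_5,apq_11\n"
    "1,S01,0.0027,0.00002,0.00116,0.00135,0.00476,0.00588,0.00903\n"
)
rows = [parse_row(rec) for rec in csv.DictReader(sample)]
print(rows[0])
```

Looking values up by header name means a reordered or differently exported CSV raises a `KeyError` or lands in the right column, rather than writing a feature value into `subject_id`.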
    

## Machine Learning with Apache Cassandra and Apache Spark

#### Create a Spark session. Then read each Cassandra table through the driver session into a pandas DataFrame, count the rows in each, and convert both into Spark DataFrames.

In [61]:
spark = SparkSession.builder.appName('demo').master("local").getOrCreate()

In [62]:
spark

In [63]:
rows_train = session.execute('select * from speech_data_train;')
df_train = pd.DataFrame(list(rows_train))

rows_test = session.execute('select * from speech_data_test;')
df_test = pd.DataFrame(list(rows_test))

In [64]:
df_train.head()

Unnamed: 0,subject_id,apq_11,apq_3,apq_5,jitter_abs,jitter_percent,ppq,rap,status
0,phon_R01_S10_3,0.01033,0.00777,0.00898,9e-06,0.0021,0.00137,0.00109,0
1,phon_R01_S32_2,0.00903,0.00476,0.00588,2e-05,0.0027,0.00135,0.00116,1
2,CONT-11,0.039913,0.030384,0.035978,4.3e-05,0.53133,0.00332,0.002693,0
3,0.000157842,0.819181,18.808001,19.973,0.779,0.583,13.002,1.75,1
4,9.8239e-05,0.887069,11.811,12.712,0.768,0.742,11.455,2.226,1


In [65]:
df_test.head()

Unnamed: 0,subject_id,apq_11,apq_3,apq_5,jitter_abs,jitter_percent,ppq,rap,status
0,151,0.08309,0.04866,0.05779,9.5e-05,0.0087,0.00533,0.00329,1
1,6,0.02123,0.01087,0.0125,6e-06,0.00084,0.00041,0.00018,1
2,191,0.03954,0.0214,0.02522,9e-06,0.00121,0.00056,0.00023,1
3,210,0.06352,0.02399,0.03212,9e-06,0.00097,0.00059,0.00029,0
4,90,0.10453,0.07051,0.08295,6.2e-05,0.00576,0.00362,0.00189,1


In [66]:
print("Train Table Speech Data Row Count: ")
print(len(df_train))

Train Table Speech Data Row Count: 
1503


In [67]:
print("Test Table Speech Data Row Count: ")
print(len(df_test))

Test Table Speech Data Row Count: 
252


In [68]:
# Create PySpark DataFrames from Pandas

print('The PySpark train df is: ')
sparkDF_train=spark.createDataFrame(df_train) 
sparkDF_train.printSchema()
sparkDF_train.show()

print('The PySpark test df is: ')
sparkDF_test=spark.createDataFrame(df_test) 
sparkDF_test.printSchema()
sparkDF_test.show()

The PySpark train df is: 
root
 |-- subject_id: string (nullable = true)
 |-- apq_11: double (nullable = true)
 |-- apq_3: double (nullable = true)
 |-- apq_5: double (nullable = true)
 |-- jitter_abs: double (nullable = true)
 |-- jitter_percent: double (nullable = true)
 |-- ppq: double (nullable = true)
 |-- rap: double (nullable = true)
 |-- status: long (nullable = true)

+--------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+------+
|    subject_id|              apq_11|               apq_3|               apq_5|          jitter_abs|      jitter_percent|                 ppq|                 rap|status|
+--------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+------+
|phon_R01_S10_3|0.010329999960958958|0.007770000025629997|0.008980000391602516|9.000000318337698E-6|0.00209

#### Helper function to have nicer formatting of Spark DataFrames

In [69]:
# Helper for pretty formatting for Spark DataFrames
def showDF(df, limitRows =  5, truncate = True):
    if truncate:
        pd.set_option('display.max_colwidth', 50)
    else:
        # None disables truncation; -1 is no longer accepted by recent pandas versions
        pd.set_option('display.max_colwidth', None)
    pd.set_option('display.max_rows', limitRows)
    display(df.limit(limitRows).toPandas())
    pd.reset_option('display.max_rows')

#### Create Vectors with all elements of the speech datasets

In [70]:
assembler = VectorAssembler(
    inputCols=['jitter_percent', 'jitter_abs', 'rap', 'ppq', 'apq_3', 'apq_5', 'apq_11'],
    outputCol='features')

trainingData = assembler.transform(sparkDF_train)

# Fit the label indexer on the training data only, then reuse the fitted model
# for the test data so that both sets share the same status-to-label mapping
labelIndexer = StringIndexer(inputCol="status", outputCol="label", handleInvalid='keep')
labelIndexerModel = labelIndexer.fit(trainingData)
train = labelIndexerModel.transform(trainingData)

showDF(train)
print(train.count())

testingData = assembler.transform(sparkDF_test)

test = labelIndexerModel.transform(testingData)

showDF(test)
print(test.count())

Unnamed: 0,subject_id,apq_11,apq_3,apq_5,jitter_abs,jitter_percent,ppq,rap,status,features,label
0,phon_R01_S10_3,0.01033,0.00777,0.00898,9e-06,0.0021,0.00137,0.00109,0,"[0.002099999925121665, 9.000000318337698e-06, ...",1.0
1,phon_R01_S32_2,0.00903,0.00476,0.00588,2e-05,0.0027,0.00135,0.00116,1,"[0.0027000000700354576, 1.9999999494757503e-05...",0.0
2,CONT-11,0.039913,0.030384,0.035978,4.3e-05,0.53133,0.00332,0.002693,0,"[0.5313299894332886, 4.3263000407023355e-05, 0...",1.0
3,0.000157842,0.819181,18.808001,19.973,0.779,0.583,13.002,1.75,1,"[0.5830000042915344, 0.7789999842643738, 1.75,...",0.0
4,9.8239e-05,0.887069,11.811,12.712,0.768,0.742,11.455,2.226,1,"[0.7419999837875366, 0.7680000066757202, 2.226...",0.0


1503


Unnamed: 0,subject_id,apq_11,apq_3,apq_5,jitter_abs,jitter_percent,ppq,rap,status,features,label
0,151,0.08309,0.04866,0.05779,9.5e-05,0.0087,0.00533,0.00329,1,"[0.008700000122189522, 9.549999958835542e-05, ...",0.0
1,6,0.02123,0.01087,0.0125,6e-06,0.00084,0.00041,0.00018,1,"[0.0008399999933317304, 5.580000106419902e-06,...",0.0
2,191,0.03954,0.0214,0.02522,9e-06,0.00121,0.00056,0.00023,1,"[0.0012100000167265534, 9.420000424142927e-06,...",0.0
3,210,0.06352,0.02399,0.03212,9e-06,0.00097,0.00059,0.00029,0,"[0.0009699999936856329, 8.809999599179719e-06,...",1.0
4,90,0.10453,0.07051,0.08295,6.2e-05,0.00576,0.00362,0.00189,1,"[0.005760000087320805, 6.22000006842427e-05, 0...",0.0


252
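
In the output above, `status` 1 maps to `label` 0.0 and `status` 0 maps to `label` 1.0. By default, `StringIndexer` orders labels by descending frequency, so the most common value receives index 0. The plain-Python sketch below only imitates that ordering for illustration; it is not Spark's implementation, and `frequency_desc_index` and the sample values are made up.

```python
from collections import Counter

def frequency_desc_index(values):
    """Map each distinct value to an index, most frequent value first."""
    counts = Counter(str(v) for v in values)
    # Sort by descending count; break ties alphabetically for determinism.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}

# Made-up class balance: more subjects with status 1 than with status 0.
statuses = [1, 1, 1, 0, 1, 0]
mapping = frequency_desc_index(statuses)
print(mapping)  # {'1': 0.0, '0': 1.0}
```

With this mapping, index 0 of a model's `probability` vector refers to status 1. The third vector entry seen in later outputs is likely the extra index that `handleInvalid='keep'` reserves for unseen labels.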


### Train a Random Forest model on the train and test sets generated above. These splits prevent time-related data leakage and share no subjects between the train and test sets.

### The optimal hyperparameters are obtained via grid search optimisation when cross-validating the model during training.

In [71]:
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

numFolds = 5

weighted_recall_evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="weightedRecall")

pipeline = Pipeline(stages=[rf])
paramGrid = (ParamGridBuilder().addGrid(param=rf.numTrees, values=[8, 10, 12]).addGrid(param=rf.seed, values=[13, 17, 42]).addGrid(param=rf.maxDepth, values=[5, 8, 10]).build())

crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=weighted_recall_evaluator,
    numFolds=numFolds)

model = crossval.fit(train)

predictions_from_Random_Forest = model.transform(test)

print(predictions_from_Random_Forest.count())
showDF(predictions_from_Random_Forest)

252


Unnamed: 0,subject_id,apq_11,apq_3,apq_5,jitter_abs,jitter_percent,ppq,rap,status,features,label,rawPrediction,probability,prediction
0,151,0.08309,0.04866,0.05779,9.5e-05,0.0087,0.00533,0.00329,1,"[0.008700000122189522, 9.549999958835542e-05, ...",0.0,"[9.985806957203188, 0.014193042796813094, 0.0]","[0.9985806957203186, 0.001419304279681309, 0.0]",0.0
1,6,0.02123,0.01087,0.0125,6e-06,0.00084,0.00041,0.00018,1,"[0.0008399999933317304, 5.580000106419902e-06,...",0.0,"[5.819444444444445, 4.180555555555555, 0.0]","[0.5819444444444445, 0.4180555555555555, 0.0]",0.0
2,191,0.03954,0.0214,0.02522,9e-06,0.00121,0.00056,0.00023,1,"[0.0012100000167265534, 9.420000424142927e-06,...",0.0,"[9.708333333333332, 0.2916666666666667, 0.0]","[0.9708333333333334, 0.029166666666666674, 0.0]",0.0
3,210,0.06352,0.02399,0.03212,9e-06,0.00097,0.00059,0.00029,0,"[0.0009699999936856329, 8.809999599179719e-06,...",1.0,"[8.708333333333332, 1.2916666666666667, 0.0]","[0.8708333333333333, 0.1291666666666667, 0.0]",0.0
4,90,0.10453,0.07051,0.08295,6.2e-05,0.00576,0.00362,0.00189,1,"[0.005760000087320805, 6.22000006842427e-05, 0...",0.0,"[9.985806957203188, 0.014193042796813094, 0.0]","[0.9985806957203186, 0.001419304279681309, 0.0]",0.0
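
The grid above combines three values each for `numTrees`, `seed`, and `maxDepth`. The sketch below counts how many model fits that implies, assuming `CrossValidator` fits every combination once per fold and then refits the best combination on the full training set. The variable names are illustrative.

```python
from itertools import product

# Values copied from the ParamGridBuilder call above.
grid = {
    "numTrees": [8, 10, 12],
    "seed": [13, 17, 42],
    "maxDepth": [5, 8, 10],
}
num_folds = 5

# Every combination of the three lists is one candidate configuration.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
fits_during_cv = len(combinations) * num_folds

print(len(combinations))   # 27
print(fits_during_cv)      # 135
print(fits_during_cv + 1)  # 136, including the final refit of the best configuration
```

Treating `seed` as a grid dimension triples the cost of the search. An alternative is to fix one seed and spend the budget on more values of `numTrees` or `maxDepth`.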


In [72]:
# Visualising the optimal hyperparameters obtained via grid search optimisation
# when cross-validating the Random Forest model during training
bestRandomForestModel = model.bestModel
print('Best param for numTrees is: ', bestRandomForestModel.stages[-1]._java_obj.parent().getNumTrees())
print('Best param for seed is: ', bestRandomForestModel.stages[-1]._java_obj.parent().getSeed())
print('Best param for maxDepth is: ', bestRandomForestModel.stages[-1]._java_obj.parent().getMaxDepth())

Best param for numTrees is:  10
Best param for seed is:  42
Best param for maxDepth is:  10


In [73]:
showDF(predictions_from_Random_Forest.select("status", "label", "prediction", "probability"))

Unnamed: 0,status,label,prediction,probability
0,1,0.0,0.0,"[0.9985806957203186, 0.001419304279681309, 0.0]"
1,1,0.0,0.0,"[0.5819444444444445, 0.4180555555555555, 0.0]"
2,1,0.0,0.0,"[0.9708333333333334, 0.029166666666666674, 0.0]"
3,0,1.0,0.0,"[0.8708333333333333, 0.1291666666666667, 0.0]"
4,1,0.0,0.0,"[0.9985806957203186, 0.001419304279681309, 0.0]"
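
The `prediction` column is the index of the largest entry in `probability`. The small sketch below uses values copied from the table above; `argmax` is a hypothetical helper.

```python
def argmax(values):
    """Return the index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])

# Probability vectors copied from the first, second and fourth rows above.
probabilities = [
    [0.9985806957203186, 0.001419304279681309, 0.0],
    [0.5819444444444445, 0.4180555555555555, 0.0],
    [0.8708333333333333, 0.1291666666666667, 0.0],
]
predictions = [float(argmax(p)) for p in probabilities]
print(predictions)  # [0.0, 0.0, 0.0]
```

The last of these rows has `label` 1.0 but `prediction` 0.0, so it is a misclassification even though the model assigns about 87% probability to index 0.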


### Train, Cross-Validate, and Optimise Decision Tree and Gradient-Boosted Decision Tree models to compare their predictive performance against the optimal Random Forest model trained above, and select the best-performing model among these three decision tree-based approaches.

In [83]:
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")

pipeline = Pipeline(stages=[dt])
paramGrid = (ParamGridBuilder().addGrid(param=dt.seed, values=[13, 17, 42]).addGrid(param=dt.maxDepth, values=[5, 8, 10]).build())

dt_crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=weighted_recall_evaluator,
    numFolds=numFolds)

dt_model = dt_crossval.fit(train)

predictions_from_DT = dt_model.transform(test)

print(predictions_from_DT.count())
showDF(predictions_from_DT)

252


Unnamed: 0,subject_id,apq_11,apq_3,apq_5,jitter_abs,jitter_percent,ppq,rap,status,features,label,rawPrediction,probability,prediction
0,151,0.08309,0.04866,0.05779,9.5e-05,0.0087,0.00533,0.00329,1,"[0.008700000122189522, 9.549999958835542e-05, ...",0.0,"[245.0, 1.0, 0.0]","[0.9959349593495935, 0.0040650406504065045, 0.0]",0.0
1,6,0.02123,0.01087,0.0125,6e-06,0.00084,0.00041,0.00018,1,"[0.0008399999933317304, 5.580000106419902e-06,...",0.0,"[1.0, 0.0, 0.0]","[1.0, 0.0, 0.0]",0.0
2,191,0.03954,0.0214,0.02522,9e-06,0.00121,0.00056,0.00023,1,"[0.0012100000167265534, 9.420000424142927e-06,...",0.0,"[245.0, 1.0, 0.0]","[0.9959349593495935, 0.0040650406504065045, 0.0]",0.0
3,210,0.06352,0.02399,0.03212,9e-06,0.00097,0.00059,0.00029,0,"[0.0009699999936856329, 8.809999599179719e-06,...",1.0,"[245.0, 1.0, 0.0]","[0.9959349593495935, 0.0040650406504065045, 0.0]",0.0
4,90,0.10453,0.07051,0.08295,6.2e-05,0.00576,0.00362,0.00189,1,"[0.005760000087320805, 6.22000006842427e-05, 0...",0.0,"[245.0, 1.0, 0.0]","[0.9959349593495935, 0.0040650406504065045, 0.0]",0.0


In [84]:
# Visualising the optimal hyperparameters obtained via grid search optimisation
# when cross-validating the Decision Tree model during training
bestDTModel = dt_model.bestModel
print('Best param for seed is: ', bestDTModel.stages[-1]._java_obj.parent().getSeed())
print('Best param for maxDepth is: ', bestDTModel.stages[-1]._java_obj.parent().getMaxDepth())

Best param for seed is:  13
Best param for maxDepth is:  8


In [85]:
gbt = GBTClassifier(labelCol="label", featuresCol="features")

pipeline = Pipeline(stages=[gbt])
paramGrid = (ParamGridBuilder().addGrid(param=gbt.seed, values=[13, 17, 42]).addGrid(param=gbt.maxDepth, values=[5, 8, 10]).build())

gbt_crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=weighted_recall_evaluator,
    numFolds=numFolds)

gbt_model = gbt_crossval.fit(train)

predictions_from_GBT = gbt_model.transform(test)

print(predictions_from_GBT.count())
showDF(predictions_from_GBT)

252


Unnamed: 0,subject_id,apq_11,apq_3,apq_5,jitter_abs,jitter_percent,ppq,rap,status,features,label,rawPrediction,probability,prediction
0,151,0.08309,0.04866,0.05779,9.5e-05,0.0087,0.00533,0.00329,1,"[0.008700000122189522, 9.549999958835542e-05, ...",0.0,"[1.5231557166980803, -1.5231557166980803]","[0.9546230131687325, 0.04537698683126745]",0.0
1,6,0.02123,0.01087,0.0125,6e-06,0.00084,0.00041,0.00018,1,"[0.0008399999933317304, 5.580000106419902e-06,...",0.0,"[1.1798928565944504, -1.1798928565944504]","[0.913708911571824, 0.08629108842817601]",0.0
2,191,0.03954,0.0214,0.02522,9e-06,0.00121,0.00056,0.00023,1,"[0.0012100000167265534, 9.420000424142927e-06,...",0.0,"[1.5686453937407265, -1.5686453937407265]","[0.9584050123916438, 0.041594987608356226]",0.0
3,210,0.06352,0.02399,0.03212,9e-06,0.00097,0.00059,0.00029,0,"[0.0009699999936856329, 8.809999599179719e-06,...",1.0,"[1.3846903786537041, -1.3846903786537041]","[0.9409986154312837, 0.059001384568716286]",0.0
4,90,0.10453,0.07051,0.08295,6.2e-05,0.00576,0.00362,0.00189,1,"[0.005760000087320805, 6.22000006842427e-05, 0...",0.0,"[1.5307898458492857, -1.5307898458492857]","[0.9552798303717731, 0.04472016962822689]",0.0


In [86]:
# Visualising the optimal hyperparameters obtained via grid search optimisation
# when cross-validating the Gradient-Boosted Tree model during training
bestGBTModel = gbt_model.bestModel
print('Best param for seed is: ', bestGBTModel.stages[-1]._java_obj.parent().getSeed())
print('Best param for maxDepth is: ', bestGBTModel.stages[-1]._java_obj.parent().getMaxDepth())

Best param for seed is:  13
Best param for maxDepth is:  5


### Leverage the MulticlassClassificationEvaluator to evaluate the accuracy and reliability of the predictions among the three models evaluated to inform the selection of the optimal decision tree-based approach. 

In [87]:
# compute key evaluation metrics on the test set, i.e., accuracy, 
# weightedPrecision, weightedRecall, weightedFMeasure

precision_vals = 3

accuracy_evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy_Random_Forest = accuracy_evaluator.evaluate(predictions_from_Random_Forest)
print("Test set accuracy for the Random Forest = " + str(round(accuracy_Random_Forest, precision_vals)))
accuracy_DT = accuracy_evaluator.evaluate(predictions_from_DT)
print("Test set accuracy for the Decision Tree = " + str(round(accuracy_DT, precision_vals)))
accuracy_GBT = accuracy_evaluator.evaluate(predictions_from_GBT)
print("Test set accuracy for the Gradient-Boosted Tree = " + str(round(accuracy_GBT, precision_vals)))

weighted_precision_evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="weightedPrecision")
weighted_precision_Random_Forest = weighted_precision_evaluator.evaluate(predictions_from_Random_Forest)
print("Test set weighted precision for the Random Forest = " + str(round(weighted_precision_Random_Forest, precision_vals)))
weighted_precision_DT = weighted_precision_evaluator.evaluate(predictions_from_DT)
print("Test set weighted precision for the Decision Tree = " + str(round(weighted_precision_DT, precision_vals)))
weighted_precision_GBT = weighted_precision_evaluator.evaluate(predictions_from_GBT)
print("Test set weighted precision for the Gradient-Boosted Tree = " + str(round(weighted_precision_GBT, precision_vals)))


# Recall is also known as 'sensitivity' or 'true positive rate', and is the key metric to maximise in this project.
weighted_recall_Random_Forest = weighted_recall_evaluator.evaluate(predictions_from_Random_Forest)
print("Test set weighted recall for the Random Forest = " + str(round(weighted_recall_Random_Forest, precision_vals)))
weighted_recall_DT = weighted_recall_evaluator.evaluate(predictions_from_DT)
print("Test set weighted recall for the Decision Tree = " + str(round(weighted_recall_DT, precision_vals)))
weighted_recall_GBT = weighted_recall_evaluator.evaluate(predictions_from_GBT)
print("Test set weighted recall for the Gradient-Boosted Tree = " + str(round(weighted_recall_GBT, precision_vals)))

# F-measure is also known as 'F-score' or 'F1-score', and is the harmonic mean of precision and recall.
weighted_f_measure_evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="weightedFMeasure")
weighted_f_measure_Random_Forest = weighted_f_measure_evaluator.evaluate(predictions_from_Random_Forest)
print("Test set weighted F-measure for the Random Forest = " + str(round(weighted_f_measure_Random_Forest, precision_vals)))
weighted_f_measure_DT = weighted_f_measure_evaluator.evaluate(predictions_from_DT)
print("Test set weighted F-measure for the Decision Tree = " + str(round(weighted_f_measure_DT, precision_vals)))
weighted_f_measure_GBT = weighted_f_measure_evaluator.evaluate(predictions_from_GBT)
print("Test set weighted F-measure for the Gradient-Boosted Tree = " + str(round(weighted_f_measure_GBT, precision_vals)))

Test set accuracy for the Random Forest = 0.738
Test set accuracy for the Decision Tree = 0.75
Test set accuracy for the Gradient-Boosted Tree = 0.762
Test set weighted precision for the Random Forest = 0.668
Test set weighted precision for the Decision Tree = 0.813
Test set weighted precision for the Gradient-Boosted Tree = 0.737
Test set weighted recall for the Random Forest = 0.738
Test set weighted recall for the Decision Tree = 0.75
Test set weighted recall for the Gradient-Boosted Tree = 0.762
Test set weighted F-measure for the Random Forest = 0.664
Test set weighted F-measure for the Decision Tree = 0.647
Test set weighted F-measure for the Gradient-Boosted Tree = 0.7
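
In the output above, weighted recall equals accuracy for all three models. This is expected: weighting each class's recall by its share of the samples adds up the correctly classified samples and divides by the total, which is the definition of accuracy. The sketch below shows this with a made-up confusion matrix. `weighted_metrics` is a hypothetical helper written for illustration and is not Spark's implementation.

```python
def weighted_metrics(confusion):
    """Compute accuracy and support-weighted precision, recall and F1.

    confusion[i][j] = number of samples with true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[c][c] for c in range(n_classes)) / total
    w_precision = w_recall = w_f1 = 0.0
    for c in range(n_classes):
        tp = confusion[c][c]
        support = sum(confusion[c])                   # true members of class c
        predicted = sum(row[c] for row in confusion)  # samples predicted as class c
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        weight = support / total
        w_precision += weight * precision
        w_recall += weight * recall
        w_f1 += weight * f1
    return accuracy, w_precision, w_recall, w_f1

# Made-up 2x2 confusion matrix, not taken from the notebook's results.
acc, wp, wr, wf = weighted_metrics([[80, 20], [10, 40]])
print(abs(acc - wr) < 1e-12)  # True
```

Because the two values always coincide, weighted recall cannot separate the models any better than accuracy does. If sensitivity to Parkinson's cases is the real target, the recall of that single class may be the more informative number.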


### Insights on predictive performance

The initial set of metrics (from [PR no. 2](https://github.com/marianne-manaog/vs-ml-pd-detector/pull/2)) came from a Random Forest model trained without cross-validation on a random 80%/20% split of the data between the train and test sets:
- Test set accuracy = 0.709
- Test set weighted precision = 0.706
- Test set weighted recall = 0.709
- Test set weighted F-measure = 0.707

The second set of metrics (from [PR no. 3](https://github.com/marianne-manaog/vs-ml-pd-detector/pull/3)) came from a Random Forest model trained without cross-validation on the previously generated data splits, which avoid time- and subject-related data leakage (2008-2016 data for training, 2018 data on different subjects for testing):
- Test set accuracy = 0.75
- Test set weighted precision = 0.706
- Test set weighted recall = 0.75
- Test set weighted F-measure = 0.691

The third set of metrics (from [PR no. 4](https://github.com/marianne-manaog/vs-ml-pd-detector/pull/4)) came from a Random Forest model trained with cross-validation on the previously generated data splits, which avoid time- and subject-related data leakage (2008-2016 data for training, 2018 data on different subjects for testing):
- Test set accuracy = 0.75
- Test set weighted precision = 0.813
- Test set weighted recall = 0.75
- Test set weighted F-measure = 0.647

The fourth set of metrics (from [PR no. 5](https://github.com/marianne-manaog/vs-ml-pd-detector/pull/5)) came from a Gradient-Boosted Tree model trained with cross-validation on the previously generated data splits, which avoid time- and subject-related data leakage (2008-2016 data for training, 2018 data on different subjects for testing):
- Test set accuracy = 0.762
- Test set weighted precision = 0.737
- Test set weighted recall = 0.762
- Test set weighted F-measure = 0.7


Between the third (Random Forest) and fourth (Gradient-Boosted Tree) sets of metrics, predictive performance changed as follows:
- Accuracy increased by 1.2 percentage points
- Weighted precision decreased by 7.6 percentage points
- Weighted recall increased by 1.2 percentage points
- Weighted F-measure increased by 5.3 percentage points
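
The changes listed above are absolute differences between the metric values, not relative changes. The short sketch below computes both from the third and fourth sets of metrics; the dictionary and variable names are illustrative.

```python
# Metric values copied from the third and fourth sets listed above.
third = {"accuracy": 0.75, "precision": 0.813, "recall": 0.75, "f_measure": 0.647}
fourth = {"accuracy": 0.762, "precision": 0.737, "recall": 0.762, "f_measure": 0.7}

changes = {}
for name in third:
    difference = fourth[name] - third[name]
    changes[name] = {
        "points": round(difference * 100, 1),                  # percentage points
        "relative": round(difference / third[name] * 100, 1),  # percent of the old value
    }
    print(f"{name}: {changes[name]['points']:+.1f} points, "
          f"{changes[name]['relative']:+.1f}% relative")
```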

Overall, and given that weighted recall is the key metric to maximise in this project, the Gradient-Boosted Tree model performed slightly better than the Random Forest and Decision Tree models. The Gradient-Boosted Tree model was therefore selected as the optimal one for this project.

In [None]:
# session.execute("DROP TABLE IF EXISTS speech_data_train")
# session.execute("DROP TABLE IF EXISTS speech_data_test")