# Modeling 

In this notebook, we will be predicting the different UPDRS scores for the patients at each time point. We will be using the selected features from Boruta for the predictions. We will be using three different models light gbm, SVM and logistic regression and testing which one works best and gives the most optimal results. The reason that we chose light gbm rather than traditional gradient boosting models or random forest is its faster training time and higher accuracy. 

Load the libraries 

In [46]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
from numpy import arange
import pandas as pd
import lightgbm
from lightgbm import LGBMClassifier
from bayes_opt import BayesianOptimization
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.metrics import accuracy_score, f1_score
from sklearn.svm import SVC
from tabulate import tabulate

Load the training and test datasets 

In [3]:
X_train=pd.read_csv("X_train.csv",index_col=0)
y_train=pd.read_csv('y_train.csv',index_col=0)
X_test=pd.read_csv("X_test.csv")
y_test=pd.read_csv('y_test.csv')


Load the selected protein and peptide abundances which are important for each of the UPDRS scores based on the boruta algorithm.

In [4]:
features_UPDRS1=pd.read_csv("features_UPDRS1",header=None)
features_UPDRS2=pd.read_csv("features_UPDRS2",header=None)
features_UPDRS3=pd.read_csv("features_UPDRS3",header=None)
features_UPDRS4=pd.read_csv("features_UPDRS4",header=None)

We will be modelling each of the developed UPDRS scores seperately with a light gbm model and doing hyper parameter tuning to get the parameters with the best cross validation. We will be using bayesian optimization for hyperparameter tuning when using lightgbm with five fold cross validation. Bayesian Optimization is performed over the specified search space (params_bounds) for a number of initial points (init_points) and iterations (n_iter).Apply the Bayesian optimizer to the function we created in the previous step to identify the best hyperparameters. We will run 10 iterations and set init_points = 2.

We will first be selecting the UPDRS 1 features only for X_train and then predicting the UPDRS 1 scores

In [5]:
features_UPDRS1.loc[:,0]

0                                              2
1                                             25
2                                             89
3                                            229
4                                            839
5    upd23b_clinical_state_on_medication_Unknown
Name: 0, dtype: object

In [6]:
X_train_UPDRS1=X_train[features_UPDRS1.loc[:,0].tolist()]

**We will now be defining a function for five fold cross-validation with light gbm and apply it to each of the scores. The accuracy metric that we will be using is  RMSE (root mean square error), which measures the average difference between the values predicted by the model compared to the actual values. The RMSE score reported by scikit-learn's scoring mechanism is negative to ensure higher values still indicate better models. R2 values are another metric that can be used which is a scaled version of RMSE from 0 to 1. However, lightgbm regression did not have R2 as a metric so we used RMSE 

In [7]:
def lgb_bayes_optimize(X_train, y_train):
    # Define the evaluation function for Bayesian Optimization
    def lgb_eval(num_leaves, max_depth, lambda_l2, lambda_l1, min_child_samples, min_data_in_leaf):
        params = {"objective" : "regression","metric" : "RMSE",'is_unbalance': True,"num_leaves" : int(num_leaves), "max_depth" : int(max_depth),
 "lambda_l2" : lambda_l2,"lambda_l1" : lambda_l1,"num_threads" : 20, "min_child_samples" : int(min_child_samples), 'min_data_in_leaf': int(min_data_in_leaf),
"learning_rate" : 0.03, "subsample_freq" : 5,"verbosity" : -1}
 # Create LightGBM datasets
        lgtrain = lightgbm.Dataset(X_train, y_train)
# Perform cross-validation with early stopping
        cv_result = lightgbm.cv(params,
                       lgtrain,
                        num_boost_round=100,
                       stratified=False, callbacks=[ lightgbm.early_stopping(stopping_rounds=1000),], nfold=3)
        
        # Return the negative RMSE to be maximized by Bayesian Optimization
        return -1.0 * cv_result['valid rmse-mean'][-1]

    # Define the search space for Bayesian Optimization
    params_bounds = {
        'num_leaves': (25, 4000),
        'max_depth': (5, 63),
        'lambda_l2': (0.0, 0.05),
        'lambda_l1': (0.0, 0.05),
        'min_child_samples': (50, 10000),
        'min_data_in_leaf': (100, 2000)
    }
    
    # Initialize Bayesian Optimization
    lgbBO = BayesianOptimization(lgb_eval, params_bounds, random_state=42)

    # Perform Bayesian Optimization
    lgbBO.maximize(init_points=2, n_iter=10)

    # Get the best parameters
    best_params = lgbBO.max['params']
    #Get rmse
    rmse= lgbBO.max['target']
       
    return  best_params, rmse

Train light gbm wth cross validation and hyperparameter tuning for UPDRS_1

In [8]:
best_params, rmse = lgb_bayes_optimize(X_train_UPDRS1,y_train.updrs_1)

print("Best parameters found for UPDRS1:", best_params)
print("RMSE for UPDRS1", rmse)


|   iter    |  target   | lambda_l1 | lambda_l2 | max_depth | min_ch... | min_da... | num_le... |
-------------------------------------------------------------------------------------------------
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[1]	cv_agg's valid rmse: 5.27547 + 0.114277
| [0m1        [0m | [0m-5.275   [0m | [0m0.01873  [0m | [0m0.04754  [0m | [0m47.46    [0m | [0m6.007e+03[0m | [0m396.4    [0m | [0m645.1    [0m |
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[98]	cv_agg's valid rmse: 4.75044 + 0.16736
| [95m2        [0m | [95m-4.75    [0m | [95m0.002904 [0m | [95m0.04331  [0m | [95m39.86    [0m | [95m7.095e+03[0m | [95m139.1    [0m | [95m3.88e+03 [0m |
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[1]	cv_agg's valid rmse: 5.27547 + 0.114277
| [0m3

We would like to see the best parameters and the mean RMSE for the training dataset prediction of UPDRS_1

After cross validation the RMSE  4.74.

Let me see if the same model with UPDRS associated boruta associated features  works well for cross validation for the other UPDRS scores.

In [9]:
X_train_UPDRS2=X_train[features_UPDRS2.loc[:,0].tolist()]

In [10]:

best_params2, rmse2 = lgb_bayes_optimize(X_train_UPDRS2,y_train.updrs_2)

print("Best parameters found for UPDRS1:", best_params2)
print("RMSE for UPDRS1", rmse2)
RMSE_lightgbm={"UPDRS_2": rmse2}

|   iter    |  target   | lambda_l1 | lambda_l2 | max_depth | min_ch... | min_da... | num_le... |
-------------------------------------------------------------------------------------------------
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[1]	cv_agg's valid rmse: 5.91598 + 0.097993
| [0m1        [0m | [0m-5.916   [0m | [0m0.01873  [0m | [0m0.04754  [0m | [0m47.46    [0m | [0m6.007e+03[0m | [0m396.4    [0m | [0m645.1    [0m |
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[100]	cv_agg's valid rmse: 4.64158 + 0.128303
| [95m2        [0m | [95m-4.642   [0m | [95m0.002904 [0m | [95m0.04331  [0m | [95m39.86    [0m | [95m7.095e+03[0m | [95m139.1    [0m | [95m3.88e+03 [0m |
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[1]	cv_agg's valid rmse: 5.91598 + 0.097993
| [0

The best RMSE for lightgbm is 4.57 for predicting UPDRS2

Let us see the performance of light gbm with UPDRS3. 

In [11]:
X_train_UPDRS3=X_train[features_UPDRS3.loc[:,0].tolist()]

In [12]:

best_params3, rmse3 = lgb_bayes_optimize(X_train_UPDRS3,y_train.updrs_3)

print("Best parameters found for UPDRS1:", best_params2)
print("RMSE for UPDRS3", rmse3)


|   iter    |  target   | lambda_l1 | lambda_l2 | max_depth | min_ch... | min_da... | num_le... |
-------------------------------------------------------------------------------------------------
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[1]	cv_agg's valid rmse: 15.2305 + 0.122351
| [0m1        [0m | [0m-15.23   [0m | [0m0.01873  [0m | [0m0.04754  [0m | [0m47.46    [0m | [0m6.007e+03[0m | [0m396.4    [0m | [0m645.1    [0m |
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[100]	cv_agg's valid rmse: 10.5285 + 0.255624
| [95m2        [0m | [95m-10.53   [0m | [95m0.002904 [0m | [95m0.04331  [0m | [95m39.86    [0m | [95m7.095e+03[0m | [95m139.1    [0m | [95m3.88e+03 [0m |
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[1]	cv_agg's valid rmse: 15.2305 + 0.122351
| [0


 The best RMSE for UPDRS3 is 10.16

In [14]:
X_train_UPDRS4=X_train[features_UPDRS4.loc[:,0].tolist()]


best_params4, rmse4 = lgb_bayes_optimize(X_train_UPDRS4,y_train.updrs_4)

print("Best parameters found for UPDRS4:", best_params4)
print("RMSE for UPDRS4", rmse4)
RMSE_lightgbm={"UPDRS_4": rmse4}

|   iter    |  target   | lambda_l1 | lambda_l2 | max_depth | min_ch... | min_da... | num_le... |
-------------------------------------------------------------------------------------------------
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[1]	cv_agg's valid rmse: 2.31581 + 0.147733
| [0m1        [0m | [0m-2.316   [0m | [0m0.01873  [0m | [0m0.04754  [0m | [0m47.46    [0m | [0m6.007e+03[0m | [0m396.4    [0m | [0m645.1    [0m |
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[100]	cv_agg's valid rmse: 2.03066 + 0.12377
| [95m2        [0m | [95m-2.031   [0m | [95m0.002904 [0m | [95m0.04331  [0m | [95m39.86    [0m | [95m7.095e+03[0m | [95m139.1    [0m | [95m3.88e+03 [0m |
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[1]	cv_agg's valid rmse: 2.31581 + 0.147733
| [0m

 The best RMSE for UPDRS4 is 2.03

The RMSE values are very high and show that the target variables are not being predicted by the selected features. In general the RMSE should be within 10% of the mean. When we look at the RMSE we can see that our RMSE are very high meaning that the features show little trend in predicting the UPDRS scores

In [15]:
Mean_UPDRSscores={"UPDRS1":y_train.updrs_1.mean(),"UPDRS2":y_train.updrs_2.mean(),"UPDRS3":y_train.updrs_3.mean(),"UPDRS4":y_train.updrs_4.mean()}
RMSE_lightgbm={"UPDRS_1": rmse,"UPDRS_2": rmse2,"UPDRS_3": rmse3,"UPDRS_4": rmse4}
print(Mean_UPDRSscores)
print(RMSE_lightgbm)

{'UPDRS1': 6.478922716627634, 'UPDRS2': 5.740046838407494, 'UPDRS3': 17.50936768149883, 'UPDRS4': 0.9637002341920374}
{'UPDRS_1': -4.744981175550429, 'UPDRS_2': -4.573803939357066, 'UPDRS_3': -10.169609133540808, 'UPDRS_4': -2.0306641374085834}


**Let me see how elastic net regression performs in cross validation in RMSE using the same features that we used for prediction for each of the UPDRS scores. We will be doing five fold cross validation using grid search for hyper parameter tuning looking at different kernels and regularization parameter C. We can do an exhaustive search using grid search looking at all possible combinations rather than a smart bayesian based approach as we have fewer parameters to tune.

In [17]:
def pred(x,y):
    param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf', 'poly']}
    grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 3,scoring="neg_root_mean_squared_error") 
  # fitting the models for grid search 
    grid.fit(x, y) 
    best_params = grid.best_params_
    best_score = grid.best_score_
    return(best_params,best_score)

In [18]:
SVM_updrs1=pred(X_train_UPDRS1, y_train.updrs_1)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV 1/5] END .............C=0.1, kernel=linear;, score=-6.350 total time=   0.1s
[CV 2/5] END .............C=0.1, kernel=linear;, score=-6.039 total time=   0.0s
[CV 3/5] END .............C=0.1, kernel=linear;, score=-5.566 total time=   0.0s
[CV 4/5] END .............C=0.1, kernel=linear;, score=-6.369 total time=   0.0s
[CV 5/5] END .............C=0.1, kernel=linear;, score=-5.612 total time=   0.0s
[CV 1/5] END ................C=0.1, kernel=rbf;, score=-7.698 total time=   0.0s
[CV 2/5] END ................C=0.1, kernel=rbf;, score=-7.790 total time=   0.0s
[CV 3/5] END ................C=0.1, kernel=rbf;, score=-7.579 total time=   0.0s
[CV 4/5] END ................C=0.1, kernel=rbf;, score=-7.612 total time=   0.0s
[CV 5/5] END ................C=0.1, kernel=rbf;, score=-7.328 total time=   0.0s
[CV 1/5] END ...............C=0.1, kernel=poly;, score=-7.651 total time=   0.0s
[CV 2/5] END ...............C=0.1, kernel=poly;, 

In [19]:
SVM_updrs2=pred(X_train_UPDRS2, y_train.updrs_2)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV 1/5] END .............C=0.1, kernel=linear;, score=-6.903 total time=   0.0s
[CV 2/5] END .............C=0.1, kernel=linear;, score=-6.856 total time=   0.0s
[CV 3/5] END .............C=0.1, kernel=linear;, score=-7.205 total time=   0.0s
[CV 4/5] END .............C=0.1, kernel=linear;, score=-7.294 total time=   0.0s
[CV 5/5] END .............C=0.1, kernel=linear;, score=-6.200 total time=   0.0s
[CV 1/5] END ................C=0.1, kernel=rbf;, score=-8.104 total time=   0.0s
[CV 2/5] END ................C=0.1, kernel=rbf;, score=-8.180 total time=   0.0s
[CV 3/5] END ................C=0.1, kernel=rbf;, score=-8.324 total time=   0.0s
[CV 4/5] END ................C=0.1, kernel=rbf;, score=-8.479 total time=   0.0s
[CV 5/5] END ................C=0.1, kernel=rbf;, score=-8.038 total time=   0.0s
[CV 1/5] END ...............C=0.1, kernel=poly;, score=-7.919 total time=   0.0s
[CV 2/5] END ...............C=0.1, kernel=poly;, 

In [20]:
SVM_updrs3=pred(X_train_UPDRS3, y_train.updrs_3)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV 1/5] END ............C=0.1, kernel=linear;, score=-15.334 total time=   0.0s
[CV 2/5] END ............C=0.1, kernel=linear;, score=-16.758 total time=   0.0s
[CV 3/5] END ............C=0.1, kernel=linear;, score=-17.904 total time=   0.0s
[CV 4/5] END ............C=0.1, kernel=linear;, score=-15.932 total time=   0.0s
[CV 5/5] END ............C=0.1, kernel=linear;, score=-15.285 total time=   0.0s
[CV 1/5] END ...............C=0.1, kernel=rbf;, score=-22.246 total time=   0.1s
[CV 2/5] END ...............C=0.1, kernel=rbf;, score=-22.871 total time=   0.1s
[CV 3/5] END ...............C=0.1, kernel=rbf;, score=-23.719 total time=   0.1s
[CV 4/5] END ...............C=0.1, kernel=rbf;, score=-23.460 total time=   0.1s
[CV 5/5] END ...............C=0.1, kernel=rbf;, score=-23.544 total time=   0.1s
[CV 1/5] END ..............C=0.1, kernel=poly;, score=-22.235 total time=   0.0s
[CV 2/5] END ..............C=0.1, kernel=poly;, s

In [21]:
SVM_updrs4=pred(X_train_UPDRS4, y_train.updrs_4)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV 1/5] END .............C=0.1, kernel=linear;, score=-2.320 total time=   0.0s
[CV 2/5] END .............C=0.1, kernel=linear;, score=-2.253 total time=   0.0s
[CV 3/5] END .............C=0.1, kernel=linear;, score=-2.682 total time=   0.0s
[CV 4/5] END .............C=0.1, kernel=linear;, score=-2.900 total time=   0.0s
[CV 5/5] END .............C=0.1, kernel=linear;, score=-2.405 total time=   0.0s
[CV 1/5] END ................C=0.1, kernel=rbf;, score=-2.320 total time=   0.0s
[CV 2/5] END ................C=0.1, kernel=rbf;, score=-2.253 total time=   0.0s
[CV 3/5] END ................C=0.1, kernel=rbf;, score=-2.682 total time=   0.0s
[CV 4/5] END ................C=0.1, kernel=rbf;, score=-2.900 total time=   0.0s
[CV 5/5] END ................C=0.1, kernel=rbf;, score=-2.405 total time=   0.0s
[CV 1/5] END ...............C=0.1, kernel=poly;, score=-2.320 total time=   0.0s
[CV 2/5] END ...............C=0.1, kernel=poly;, 

In [22]:
RMSE_SVM={"UPDRS_1":SVM_updrs1[1],"UPDRS_2":SVM_updrs2[1],"UPDRS_3":SVM_updrs3[1],"UPDRS_4":SVM_updrs4[1]}
print(RMSE_SVM)

{'UPDRS_1': -5.616390169380195, 'UPDRS_2': -5.742620682542703, 'UPDRS_3': -12.568384735827516, 'UPDRS_4': -2.511805232218421}


Compare RMSE of light GBM with SVM

In [23]:
print(RMSE_lightgbm)

{'UPDRS_1': -4.744981175550429, 'UPDRS_2': -4.573803939357066, 'UPDRS_3': -10.169609133540808, 'UPDRS_4': -2.0306641374085834}


The RMSE for light GBM is much smaller than SVM, which means it works better.

We will now look into logistic regression and see whether that could improve the RMSE. Logistic regression involves categorical response variables. We will consider each score to be high or low based on the median cut-off and use features to predict the categories

In [31]:
y_train["updrs1_category"]=y_train["updrs_1"].apply(lambda x: 0 if x>y_train.updrs_1.mean() else 1)
y_train["updrs2_category"]=y_train["updrs_2"].apply(lambda x: 0 if x>y_train.updrs_2.mean() else 1)
y_train["updrs3_category"]=y_train["updrs_3"].apply(lambda x: 0 if x>y_train.updrs_3.mean() else 1)
y_train["updrs4_category"]=y_train["updrs_4"].apply(lambda x: 0 if x>y_train.updrs_4.mean() else 1)


We will now use grid search using C which is strength of regularization 

In [32]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
grid={"C":np.logspace(-3,3,7)}
logreg=LogisticRegression()
logreg_cv=GridSearchCV(logreg,grid,cv=10)
logreg_cv.fit(X_train_UPDRS1,y_train.updrs1_category)
logreg_cv.best_params_
print("tuned hyperparameters :(best parameters) ",logreg_cv.best_params_)
print("accuracy for UPDRS1 :",logreg_cv.best_score_)

tuned hyperparameters :(best parameters)  {'C': 0.1}
accuracy for UPDRS1 : 0.6919835841313269


In [33]:
logreg_cv.fit(X_train_UPDRS2,y_train.updrs2_category)
print("tuned hyperparameters :(best parameters) ",logreg_cv.best_params_)
print("accuracy for UPDRS1 :",logreg_cv.best_score_)

tuned hyperparameters :(best parameters)  {'C': 0.1}
accuracy for UPDRS1 : 0.7129958960328318


In [34]:
logreg_cv.fit(X_train_UPDRS3,y_train.updrs3_category)
print("tuned hyperparameters :(best parameters) ",logreg_cv.best_params_)
print("accuracy for UPDRS1 :",logreg_cv.best_score_)

tuned hyperparameters :(best parameters)  {'C': 10.0}
accuracy for UPDRS1 : 0.7645554035567715


In [35]:
logreg_cv.fit(X_train_UPDRS4,y_train.updrs4_category)
print("tuned hyperparameters :(best parameters) ",logreg_cv.best_params_)
print("accuracy for UPDRS1 :",logreg_cv.best_score_)

tuned hyperparameters :(best parameters)  {'C': 100.0}
accuracy for UPDRS1 : 0.8044733242134063


**When we look at accuracy based on using categories rather than scores, the accuracy is better than a random 50%. For UPDRS2, UPDRS3 and UPDRS4 scores, the accuracy is above 70%. While for UPDRS1  it is 69%. Predicting categories rather than scores is simpler and our features for each score category do better. 

Let us use the logistic regression model to predict categories for UPDRS scores in our test dataset.

First we will convert test dataset categories 

In [36]:
y_test["updrs1_category"]=y_test["updrs_1"].apply(lambda x: 0 if x>y_test.updrs_1.mean() else 1)
y_test["updrs2_category"]=y_test["updrs_2"].apply(lambda x: 0 if x>y_test.updrs_2.mean() else 1)
y_test["updrs3_category"]=y_test["updrs_3"].apply(lambda x: 0 if x>y_test.updrs_3.mean() else 1)
y_test["updrs4_category"]=y_test["updrs_4"].apply(lambda x: 0 if x>y_test.updrs_4.mean() else 1)

Now we will use selected features for the test dataset for predicting each of the categories. We will also use the regularisation parameters selected for each of the UPDRS scores. We will evaluate the testing dataset by looking at accuracy and F1 scores.

In [39]:
logreg=LogisticRegression(C=0.1)
logreg.fit(X_train_UPDRS1,y_train.updrs1_category)
X_test_UPDRS1=X_test[features_UPDRS1.loc[:,0].tolist()]
y_pred_updrs1 = logreg.predict(X_test_UPDRS1)

# Evaluate model performance on the test data
accuracy_updrs1 = accuracy_score(y_test.updrs1_category, y_pred_updrs1)
print(f'Test Accuracy of UPDRS1: {accuracy_updrs1:.2f}')

F1_score_updrs1= f1_score(y_test.updrs1_category, y_pred_updrs1)
print(f'Test F1 score of UPDRS1: {F1_score_updrs1:.2f}')

Test Accuracy of UPDRS1: 0.76
Test F1 score of UPDRS1: 0.78


In [42]:
logreg=LogisticRegression(C=0.1)
logreg.fit(X_train_UPDRS2,y_train.updrs2_category)
X_test_UPDRS2=X_test[features_UPDRS2.loc[:,0].tolist()]
y_pred_updrs2 = logreg.predict(X_test_UPDRS2)

# Evaluate model performance on the test data
accuracy_updrs2 = accuracy_score(y_test.updrs2_category, y_pred_updrs2)
print(f'Test Accuracy of UPDRS2: {accuracy_updrs2:.2f}')

F1_score_updrs2= f1_score(y_test.updrs2_category,  y_pred_updrs2)
print(f'Test F1 score of UPDRS2: {F1_score_updrs2:.2f}')

Test Accuracy of UPDRS2: 0.72
Test F1 score of UPDRS2: 0.77


In [44]:
logreg=LogisticRegression(C=10)
logreg.fit(X_train_UPDRS3,y_train.updrs3_category)
X_test_UPDRS3=X_test[features_UPDRS3.loc[:,0].tolist()]
y_pred_updrs3 = logreg.predict(X_test_UPDRS3)


accuracy_updrs3 = accuracy_score(y_test.updrs3_category, y_pred_updrs3)
print(f'Test Accuracy of UPDRS3: {accuracy_updrs3:.2f}')

F1_score_updrs3= f1_score(y_test.updrs3_category,  y_pred_updrs3)
print(f'Test F1 score of UPDRS3: {F1_score_updrs3:.2f}')

Test Accuracy of UPDRS3: 0.74
Test F1 score of UPDRS3: 0.73


In [45]:
logreg=LogisticRegression(C=100)
logreg.fit(X_train_UPDRS4,y_train.updrs3_category)
X_test_UPDRS4=X_test[features_UPDRS4.loc[:,0].tolist()]
y_pred_updrs4 = logreg.predict(X_test_UPDRS4)

# Evaluate model performance on the test data
accuracy_updrs4 = accuracy_score(y_test.updrs4_category, y_pred_updrs4)
print(f'Test Accuracy of UPDRS4: {accuracy_updrs4 :.2f}')

F1_score_updrs4= f1_score(y_test.updrs4_category,  y_pred_updrs4)
print(f'Test F1 score of UPDRS4: {F1_score_updrs4:.2f}')

Test Accuracy of UPDRS4: 0.73
Test F1 score of UPDRS4: 0.80


tabulate the results below with F1 scores and Accuracy scores

In [37]:
Scores = [
    ["UPDRS1", round(accuracy_updrs1 ,2),round(F1_score_updrs1, 2)], 
    ["UPDRS2", round(accuracy_updrs2 ,2),round(F1_score_updrs2, 2)], 
    ["UPDRS3", round(accuracy_updrs3, 2),round(F1_score_updrs3, 2)], 
      ["UPDRS4",round(accuracy_updrs4, 2),round(F1_score_updrs4, 2)]
]
 
# create header
head = ["UPDRS", "Test_Accuracy","Test_F1_score"]
 
# display table
print(tabulate(Scores, headers=head, tablefmt="grid"))

+---------+-----------------+-----------------+
| UPDRS   |   Test_Accuracy |   Test_F1_score |
| UPDRS1  |            0.76 |            0.78 |
+---------+-----------------+-----------------+
| UPDRS2  |            0.72 |            0.77 |
+---------+-----------------+-----------------+
| UPDRS3  |            0.74 |            0.73 |
+---------+-----------------+-----------------+
| UPDRS4  |            0.73 |            0.8  |
+---------+-----------------+-----------------+


### **The test value accuracy and F1 scores show that the models perform well when used to predict UPDRS score categories. However, our models which were used to predict the UPDRS scores had poor RMSE values, this indicates that either training dataset was too small or despite using other features along with protein and peptide abundance were insufficient for predicting UPDRS scores 