# Modeling 

In this notebook, we predict the four UPDRS scores for each patient at each time point, using the selected protein and peptide abundances as features. We compare three models (LightGBM, SVM and logistic regression) to see which one performs best. We chose LightGBM over traditional gradient boosting models or random forest because it typically trains faster and often reaches comparable or better accuracy.

Load the libraries 

In [1]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
from numpy import arange
import pandas as pd
import lightgbm
from lightgbm import LGBMClassifier
from bayes_opt import BayesianOptimization
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV, RepeatedKFold
from sklearn.svm import SVC


Load the training and test datasets 

In [2]:
X_train=pd.read_csv("X_train.csv",index_col=0)
y_train=pd.read_csv('y_train.csv',index_col=0)
X_test=pd.read_csv("X_test.csv")
y_test=pd.read_csv('y_test.csv')


Load the selected protein and peptide abundances which are important for each of the UPDRS scores based on the boruta algorithm.

In [3]:
features_UPDRS1=pd.read_csv("features_UPDRS1",header=None)
features_UPDRS2=pd.read_csv("features_UPDRS2",header=None)
features_UPDRS3=pd.read_csv("features_UPDRS3",header=None)
features_UPDRS4=pd.read_csv("features_UPDRS4",header=None)

We will model each of the UPDRS scores separately with a LightGBM model and tune the hyperparameters to find the setting with the best cross-validation score. For LightGBM we use Bayesian optimization with three-fold cross-validation (nfold=3 in the function below). Bayesian optimization searches the specified search space (params_bounds): it first evaluates a number of random initial points (init_points) and then runs a number of guided iterations (n_iter). We set init_points = 2 and run 10 iterations.

We will first be selecting the UPDRS 1 features only for X_train and then predicting the UPDRS 1 scores

In [4]:
X_train_UPDRS1=X_train[features_UPDRS1.iloc[:,0].tolist()]

In [5]:
scaler = StandardScaler()
X_tr_scaled_UPDRS1 = scaler.fit_transform(X_train_UPDRS1)

We now define a function that runs cross-validation with LightGBM (three folds, as set by nfold=3) and apply it to each of the scores. The evaluation metric is RMSE (root mean squared error), which measures the typical difference between the values predicted by the model and the actual values. The function returns the negative RMSE because the Bayesian optimizer maximizes its target; scikit-learn's neg_root_mean_squared_error scorer follows the same convention so that higher values still indicate better models. R2 is another common metric: it is unitless, with 1 indicating a perfect fit and values at or below 0 indicating a model no better than always predicting the mean. LightGBM does not offer R2 as a built-in regression metric, so we use RMSE.
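To make the two metrics concrete, here is a minimal sketch that computes RMSE and R2 by hand. The helper names `rmse_score` and `r2_manual` and the four example values are hypothetical and are not taken from the dataset.

```python
import math

def rmse_score(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical scores, for illustration only.
y_true_demo = [2.0, 4.0, 6.0, 8.0]
y_pred_demo = [3.0, 4.0, 5.0, 8.0]

print("RMSE:", rmse_score(y_true_demo, y_pred_demo))
print("R2:", r2_manual(y_true_demo, y_pred_demo))
# A maximizer (such as the Bayesian optimizer) would be given the negated RMSE.
print("negative RMSE:", -rmse_score(y_true_demo, y_pred_demo))
```

RMSE is expressed in the units of the target, so it is only meaningful relative to the scale of each UPDRS score, whereas R2 can be compared across scores.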

In [6]:
def lgb_bayes_optimize(X_train, y_train):
    # Define the evaluation function for Bayesian Optimization
    def lgb_eval(num_leaves, max_depth, lambda_l2, lambda_l1, min_child_samples, min_data_in_leaf):
        params = {
            "objective": "regression", "metric": "RMSE",
            "is_unbalance": True,  # classification-only option; has no effect on regression
            "num_leaves": int(num_leaves), "max_depth": int(max_depth),
            "lambda_l2": lambda_l2, "lambda_l1": lambda_l1, "num_threads": 20,
            # min_child_samples is a LightGBM alias of min_data_in_leaf
            "min_child_samples": int(min_child_samples),
            "min_data_in_leaf": int(min_data_in_leaf),
            "learning_rate": 0.03, "subsample_freq": 5, "verbosity": -1,
        }
        # Create the LightGBM dataset
        lgtrain = lightgbm.Dataset(X_train, y_train)
        # Three-fold cross-validation; with num_boost_round=100 the 1000-round
        # early-stopping patience can never trigger
        cv_result = lightgbm.cv(params, lgtrain, num_boost_round=100, stratified=False,
                                callbacks=[lightgbm.early_stopping(stopping_rounds=1000)],
                                nfold=3)
        
        # Return the negative RMSE to be maximized by Bayesian Optimization
        return -1.0 * cv_result['valid rmse-mean'][-1]

    # Define the search space for Bayesian Optimization
    params_bounds = {
        'num_leaves': (25, 4000),
        'max_depth': (5, 63),
        'lambda_l2': (0.0, 0.05),
        'lambda_l1': (0.0, 0.05),
        'min_child_samples': (50, 10000),
        'min_data_in_leaf': (100, 2000)
    }
    
    # Initialize Bayesian Optimization
    lgbBO = BayesianOptimization(lgb_eval, params_bounds, random_state=42)

    # Perform Bayesian Optimization
    lgbBO.maximize(init_points=2, n_iter=10)

    # Get the best parameters
    best_params = lgbBO.max['params']
    # Get the best target value (the negative RMSE)
    rmse= lgbBO.max['target']
       
    return  best_params, rmse

Train LightGBM with cross-validation and hyperparameter tuning for UPDRS_1

In [7]:
best_params, rmse = lgb_bayes_optimize(X_tr_scaled_UPDRS1,y_train.updrs_1)

print("Best parameters found for UPDRS1:", best_params)
print("RMSE for UPDRS1", rmse)


|   iter    |  target   | lambda_l1 | lambda_l2 | max_depth | min_ch... | min_da... | num_le... |
-------------------------------------------------------------------------------------------------
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[1]	cv_agg's valid rmse: 5.32832 + 0.0533255
| 1         | -5.328    | 0.01873   | 0.04754   | 47.46     | 6.007e+03 | 396.4     | 645.1     |
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[100]	cv_agg's valid rmse: 5.13183 + 0.0427376
| 2         | -5.132    | 0.002904  | 0.04331   | 39.86     | 7.095e+03 | 139.1     | 3.88e+03  |
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[1]	cv_agg's valid rmse: 5.32832 + 0.0533255
| 

We would like to see the best parameters and the mean RMSE for the training dataset prediction of UPDRS_1
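The optimizer searches a continuous space, so every entry in the best-parameter dictionary is a float, including integer-valued parameters such as num_leaves. A minimal sketch of converting that dictionary before reusing it to train a final model follows; the helper name `to_lgb_params` is hypothetical, and the example values are the rounded ones displayed for iteration 2 in the log above.

```python
def to_lgb_params(best_params,
                  int_keys=("num_leaves", "max_depth",
                            "min_child_samples", "min_data_in_leaf")):
    """Return a copy of best_params with integer-valued entries cast to int."""
    params = dict(best_params)
    for key in int_keys:
        if key in params:
            # int() truncates, matching the int(...) casts inside lgb_eval
            params[key] = int(params[key])
    return params

# Rounded values as displayed for iteration 2 in the optimizer log.
example_best = {
    "lambda_l1": 0.002904, "lambda_l2": 0.04331, "max_depth": 39.86,
    "min_child_samples": 7095.0, "min_data_in_leaf": 139.1, "num_leaves": 3880.0,
}

print(to_lgb_params(example_best))
```

Casting with `int()` rather than rounding keeps the final model consistent with what was actually evaluated during the search.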

After cross-validation, the best RMSE is 5.11. We can check whether adding visit_month_diff_min (the minimum visit month difference) and visit_month improves the RMSE.


visit_month_diff_min is NA at visit month 0, so we replace these missing values with 0 before using it as a feature.


In [8]:
X_train["visit_month_diff_min"] = X_train["visit_month_diff_min"].fillna(0)

In [32]:
X_train_UPDRS1=X_train[features_UPDRS1.iloc[:,0].tolist()+["visit_month_diff_min","visit_month"]]
X_tr_scaled_UPDRS1 = scaler.fit_transform(X_train_UPDRS1)

best_params, rmse = lgb_bayes_optimize(X_tr_scaled_UPDRS1,y_train.updrs_1)
RMSE_lightgbm={"UPDRS_1": rmse}
print("Best parameters found for UPDRS1:", best_params)
print("RMSE for UPDRS1", rmse)

|   iter    |  target   | lambda_l1 | lambda_l2 | max_depth | min_ch... | min_da... | num_le... |
-------------------------------------------------------------------------------------------------
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[1]	cv_agg's valid rmse: 5.32832 + 0.0533255
| 1         | -5.328    | 0.01873   | 0.04754   | 47.46     | 6.007e+03 | 396.4     | 645.1     |
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[100]	cv_agg's valid rmse: 4.92518 + 0.0778103
| 2         | -4.925    | 0.002904  | 0.04331   | 39.86     | 7.095e+03 | 139.1     | 3.88e+03  |
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[1]	cv_agg's valid rmse: 5.32832 + 0.0533255
| 

The RMSE improves to 4.86 after adding visit_month_diff_min and visit_month.

Let us see whether the same approach, using the Boruta-selected features for each score together with visit_month_diff_min and visit_month, works well in cross-validation for the other UPDRS scores.

In [10]:
X_train_UPDRS2=X_train[features_UPDRS2.iloc[:,0].tolist()+["visit_month_diff_min","visit_month"]]

In [11]:
X_tr_scaled_UPDRS2 = scaler.fit_transform(X_train_UPDRS2)

In [12]:

best_params2, rmse2 = lgb_bayes_optimize(X_tr_scaled_UPDRS2,y_train.updrs_2)

print("Best parameters found for UPDRS2:", best_params2)
print("RMSE for UPDRS2", rmse2)
RMSE_lightgbm={"UPDRS_2": rmse2}

|   iter    |  target   | lambda_l1 | lambda_l2 | max_depth | min_ch... | min_da... | num_le... |
-------------------------------------------------------------------------------------------------
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[1]	cv_agg's valid rmse: 5.92679 + 0.225728
| 1         | -5.927    | 0.01873   | 0.04754   | 47.46     | 6.007e+03 | 396.4     | 645.1     |
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[100]	cv_agg's valid rmse: 5.04551 + 0.203827
| 2         | -5.046    | 0.002904  | 0.04331   | 39.86     | 7.095e+03 | 139.1     | 3.88e+03  |
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[1]	cv_agg's valid rmse: 5.92679 + 0.225728
| [0

The best RMSE for lightgbm is 4.90 for predicting UPDRS2

Let us see the performance of LightGBM on UPDRS3. In addition to the important protein/peptide abundances, visit_month and visit_month_diff_min, we use upd23b_clinical_state_on_medication as a feature, since according to the initially provided information it is expected to affect the UPDRS3 scores.
Our previous notebooks showed that upd23b_clinical_state_on_medication has many missing values, and the missingness itself may be informative, so we replace missing values with the label "Unknown" using the fillna() method.

In [13]:
X_train["upd23b_clinical_state_on_medication"] = X_train.upd23b_clinical_state_on_medication.fillna("Unknown")

In [14]:
X_train_UPDRS3=X_train[features_UPDRS3.iloc[:,0].tolist()+["visit_month_diff_min","visit_month","upd23b_clinical_state_on_medication"]]

We need to transform the UPDRS3 features because they now include both numerical and categorical variables: numerical columns are standardized and categorical columns are one-hot encoded.

In [15]:
numeric_columns = X_train_UPDRS3.select_dtypes(include=['int64', 'float64']).columns
categorical_columns =X_train_UPDRS3.select_dtypes(include=['object', 'bool']).columns

pipeline=ColumnTransformer([
    ('num',StandardScaler(),numeric_columns),
    ('cat',OneHotEncoder(),categorical_columns),
])

X_tr_scaled_UPDRS3=pipeline.fit_transform(X_train_UPDRS3)

In [16]:

best_params3, rmse3 = lgb_bayes_optimize(X_tr_scaled_UPDRS3,y_train.updrs_3)

print("Best parameters found for UPDRS3:", best_params3)
print("RMSE for UPDRS3", rmse3)


|   iter    |  target   | lambda_l1 | lambda_l2 | max_depth | min_ch... | min_da... | num_le... |
-------------------------------------------------------------------------------------------------
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[1]	cv_agg's valid rmse: 14.9414 + 0.321264
| 1         | -14.94    | 0.01873   | 0.04754   | 47.46     | 6.007e+03 | 396.4     | 645.1     |
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[100]	cv_agg's valid rmse: 10.1492 + 0.490032
| 2         | -10.15    | 0.002904  | 0.04331   | 39.86     | 7.095e+03 | 139.1     | 3.88e+03  |
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[1]	cv_agg's valid rmse: 14.9414 + 0.321264
| [0


 The best RMSE for UPDRS3 is 9.999

In [18]:
X_train_UPDRS4=X_train[features_UPDRS4.iloc[:,0].tolist()+["visit_month_diff_min","visit_month"]]

X_tr_scaled_UPDRS4 = scaler.fit_transform(X_train_UPDRS4)


best_params4, rmse4 = lgb_bayes_optimize(X_tr_scaled_UPDRS4,y_train.updrs_4)

print("Best parameters found for UPDRS4:", best_params4)
print("RMSE for UPDRS4", rmse4)
RMSE_lightgbm={"UPDRS_4": rmse4}

|   iter    |  target   | lambda_l1 | lambda_l2 | max_depth | min_ch... | min_da... | num_le... |
-------------------------------------------------------------------------------------------------
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[1]	cv_agg's valid rmse: 2.4313 + 0.216878
| 1         | -2.431    | 0.01873   | 0.04754   | 47.46     | 6.007e+03 | 396.4     | 645.1     |
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[100]	cv_agg's valid rmse: 2.25034 + 0.164896
| 2         | -2.25     | 0.002904  | 0.04331   | 39.86     | 7.095e+03 | 139.1     | 3.88e+03  |
Training until validation scores don't improve for 1000 rounds
Did not meet early stopping. Best iteration is:
[1]	cv_agg's valid rmse: 2.4313 + 0.216878
| 3

The best RMSE for UPDRS4 is 2.21

The RMSE values are high relative to the scale of the targets. As a rough rule of thumb, a useful model would have an RMSE that is a small fraction of the mean score (for example, within about 10%). Here the RMSE is roughly 74%, 84%, 58% and 217% of the mean for UPDRS1 to UPDRS4 respectively (see the values printed below), so the selected features explain little of the variation in the UPDRS scores.

In [19]:
Mean_UPDRSscores={"UPDRS1":y_train.updrs_1.mean(),"UPDRS2":y_train.updrs_2.mean(),"UPDRS3":y_train.updrs_3.mean(),"UPDRS4":y_train.updrs_4.mean()}
RMSE_lightgbm={"UPDRS_1": rmse,"UPDRS_2": rmse2,"UPDRS_3": rmse3,"UPDRS_4": rmse4}
print(Mean_UPDRSscores)
print(RMSE_lightgbm)

{'UPDRS1': 6.5664794007490634, 'UPDRS2': 5.821161048689139, 'UPDRS3': 17.316479400749063, 'UPDRS4': 1.0168539325842696}
{'UPDRS_1': -4.863244009191125, 'UPDRS_2': -4.90309557004587, 'UPDRS_3': -9.999972075310824, 'UPDRS_4': -2.2070769267466606}
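A short sketch of the comparison between RMSE and mean score, using the values printed above. The dictionary names are illustrative, and the keys are unified to the `UPDRS_n` form for convenience.

```python
# Mean scores and negative RMSE values copied from the cell output above.
mean_scores = {
    "UPDRS_1": 6.5664794007490634,
    "UPDRS_2": 5.821161048689139,
    "UPDRS_3": 17.316479400749063,
    "UPDRS_4": 1.0168539325842696,
}
neg_rmse = {
    "UPDRS_1": -4.863244009191125,
    "UPDRS_2": -4.90309557004587,
    "UPDRS_3": -9.999972075310824,
    "UPDRS_4": -2.2070769267466606,
}

# The optimizer reports negative RMSE, so take the absolute value first.
relative_rmse = {k: abs(neg_rmse[k]) / mean_scores[k] for k in mean_scores}

for name, ratio in relative_rmse.items():
    print(f"{name}: RMSE is {ratio:.0%} of the mean score")
```

For UPDRS4 the RMSE exceeds the mean score itself, which is consistent with a target that is mostly small values with a few large ones.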


Let us see how a support vector machine (SVM) performs in cross-validation, measured by RMSE, using the same features that we used for each of the UPDRS scores. We run five-fold cross-validation with grid search for hyperparameter tuning, trying different kernels and values of the regularization parameter C. Because there are only a few parameters to tune, an exhaustive grid search over all combinations is feasible and we do not need a Bayesian approach.

In [20]:
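def pred(x,y):
    # Grid over the regularization parameter C and the kernel type
    param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf', 'poly']}
    # Note: SVC is a classifier, so each distinct score value is treated as a class label
    # and the RMSE is computed on the predicted labels. GridSearchCV uses 5-fold CV by default.
    grid = GridSearchCV(SVC(), param_grid, refit = True, verbose = 3,scoring="neg_root_mean_squared_error") 
    # Fit the models for grid search
    grid.fit(x, y) 
    best_params = grid.best_params_
    best_score = grid.best_score_
    return(best_params,best_score)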
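Since the UPDRS scores are numeric, the regression counterpart of SVC, scikit-learn's SVR, is a natural alternative. The following is only a sketch under that assumption, not the method used for the results in this notebook; the function name `svr_grid_search` and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

def svr_grid_search(x, y):
    """Grid-search an SVR over the same C and kernel grid used for SVC above."""
    param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf", "poly"]}
    grid = GridSearchCV(SVR(), param_grid, cv=5,
                        scoring="neg_root_mean_squared_error")
    grid.fit(x, y)
    return grid.best_params_, grid.best_score_

# Synthetic data for illustration only (not the notebook's features).
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(60, 3))
y_demo = 2.0 * x_demo[:, 0] + rng.normal(scale=0.1, size=60)

demo_params, demo_score = svr_grid_search(x_demo, y_demo)
print(demo_params, demo_score)
```

In the notebook this could be called in the same way as `pred`, for example `svr_grid_search(X_tr_scaled_UPDRS1, y_train.updrs_1)`; the resulting scores would differ from the SVC scores reported here.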

In [33]:
SVM_updrs1=pred(X_tr_scaled_UPDRS1, y_train.updrs_1)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV 1/5] END .............C=0.1, kernel=linear;, score=-6.610 total time=   0.1s
[CV 2/5] END .............C=0.1, kernel=linear;, score=-6.072 total time=   0.0s
[CV 3/5] END .............C=0.1, kernel=linear;, score=-6.635 total time=   0.0s
[CV 4/5] END .............C=0.1, kernel=linear;, score=-6.508 total time=   0.0s
[CV 5/5] END .............C=0.1, kernel=linear;, score=-5.963 total time=   0.0s
[CV 1/5] END ................C=0.1, kernel=rbf;, score=-7.741 total time=   0.1s
[CV 2/5] END ................C=0.1, kernel=rbf;, score=-7.751 total time=   0.1s
[CV 3/5] END ................C=0.1, kernel=rbf;, score=-7.929 total time=   0.1s
[CV 4/5] END ................C=0.1, kernel=rbf;, score=-7.560 total time=   0.1s
[CV 5/5] END ................C=0.1, kernel=rbf;, score=-7.540 total time=   0.1s
[CV 1/5] END ...............C=0.1, kernel=poly;, score=-7.296 total time=   0.0s
[CV 2/5] END ...............C=0.1, kernel=poly;, 

In [23]:
SVM_updrs2=pred(X_tr_scaled_UPDRS2, y_train.updrs_2)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV 1/5] END .............C=0.1, kernel=linear;, score=-6.657 total time=   0.1s
[CV 2/5] END .............C=0.1, kernel=linear;, score=-6.594 total time=   0.0s
[CV 3/5] END .............C=0.1, kernel=linear;, score=-6.570 total time=   0.0s
[CV 4/5] END .............C=0.1, kernel=linear;, score=-5.945 total time=   0.0s
[CV 5/5] END .............C=0.1, kernel=linear;, score=-6.342 total time=   0.0s
[CV 1/5] END ................C=0.1, kernel=rbf;, score=-8.194 total time=   0.1s
[CV 2/5] END ................C=0.1, kernel=rbf;, score=-8.314 total time=   0.1s
[CV 3/5] END ................C=0.1, kernel=rbf;, score=-8.503 total time=   0.1s
[CV 4/5] END ................C=0.1, kernel=rbf;, score=-8.356 total time=   0.1s
[CV 5/5] END ................C=0.1, kernel=rbf;, score=-8.172 total time=   0.1s
[CV 1/5] END ...............C=0.1, kernel=poly;, score=-8.088 total time=   0.1s
[CV 2/5] END ...............C=0.1, kernel=poly;, 

In [24]:
SVM_updrs3=pred(X_tr_scaled_UPDRS3, y_train.updrs_3)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV 1/5] END ............C=0.1, kernel=linear;, score=-13.557 total time=   0.1s
[CV 2/5] END ............C=0.1, kernel=linear;, score=-14.120 total time=   0.1s
[CV 3/5] END ............C=0.1, kernel=linear;, score=-14.274 total time=   0.1s
[CV 4/5] END ............C=0.1, kernel=linear;, score=-12.144 total time=   0.1s
[CV 5/5] END ............C=0.1, kernel=linear;, score=-14.439 total time=   0.1s
[CV 1/5] END ...............C=0.1, kernel=rbf;, score=-22.653 total time=   0.1s
[CV 2/5] END ...............C=0.1, kernel=rbf;, score=-23.443 total time=   0.1s
[CV 3/5] END ...............C=0.1, kernel=rbf;, score=-23.063 total time=   0.1s
[CV 4/5] END ...............C=0.1, kernel=rbf;, score=-22.485 total time=   0.1s
[CV 5/5] END ...............C=0.1, kernel=rbf;, score=-22.693 total time=   0.1s
[CV 1/5] END ..............C=0.1, kernel=poly;, score=-22.284 total time=   0.1s
[CV 2/5] END ..............C=0.1, kernel=poly;, s

In [26]:
SVM_updrs4=pred(X_tr_scaled_UPDRS4, y_train.updrs_4)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV 1/5] END .............C=0.1, kernel=linear;, score=-2.676 total time=   0.0s
[CV 2/5] END .............C=0.1, kernel=linear;, score=-2.880 total time=   0.0s
[CV 3/5] END .............C=0.1, kernel=linear;, score=-2.630 total time=   0.0s
[CV 4/5] END .............C=0.1, kernel=linear;, score=-2.441 total time=   0.0s
[CV 5/5] END .............C=0.1, kernel=linear;, score=-2.560 total time=   0.0s
[CV 1/5] END ................C=0.1, kernel=rbf;, score=-2.676 total time=   0.0s
[CV 2/5] END ................C=0.1, kernel=rbf;, score=-2.880 total time=   0.0s
[CV 3/5] END ................C=0.1, kernel=rbf;, score=-2.630 total time=   0.0s
[CV 4/5] END ................C=0.1, kernel=rbf;, score=-2.441 total time=   0.0s
[CV 5/5] END ................C=0.1, kernel=rbf;, score=-2.560 total time=   0.0s
[CV 1/5] END ...............C=0.1, kernel=poly;, score=-2.676 total time=   0.0s
[CV 2/5] END ...............C=0.1, kernel=poly;, 

In [27]:
RMSE_SVM={"UPDRS_1":SVM_updrs1[1],"UPDRS_2":SVM_updrs2[1],"UPDRS_3":SVM_updrs3[1],"UPDRS_4":SVM_updrs4[1]}
print(RMSE_SVM)

{'UPDRS_1': -6.866021647567303, 'UPDRS_2': -6.4134475293561195, 'UPDRS_3': -13.706821256295111, 'UPDRS_4': -2.6147939983340622}


Compare RMSE of light GBM with SVM

In [28]:
print(RMSE_lightgbm)

{'UPDRS_1': -4.863244009191125, 'UPDRS_2': -4.90309557004587, 'UPDRS_3': -9.999972075310824, 'UPDRS_4': -2.2070769267466606}


LightGBM achieves a lower RMSE than the SVM for all four UPDRS scores.
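The size of the gap can be quantified with a few lines, using the values printed above. This is a rough comparison only, because the LightGBM scores come from three-fold cross-validation and the SVM scores from five-fold; the variable names are illustrative.

```python
# Negative RMSE values copied from the two printed dictionaries above.
neg_rmse_lightgbm = {
    "UPDRS_1": -4.863244009191125, "UPDRS_2": -4.90309557004587,
    "UPDRS_3": -9.999972075310824, "UPDRS_4": -2.2070769267466606,
}
neg_rmse_svm = {
    "UPDRS_1": -6.866021647567303, "UPDRS_2": -6.4134475293561195,
    "UPDRS_3": -13.706821256295111, "UPDRS_4": -2.6147939983340622,
}

# Relative reduction in RMSE when moving from the SVM to LightGBM.
rmse_reduction = {
    k: 1.0 - abs(neg_rmse_lightgbm[k]) / abs(neg_rmse_svm[k])
    for k in neg_rmse_svm
}

for name, reduction in rmse_reduction.items():
    print(f"{name}: LightGBM RMSE is {reduction:.1%} lower than SVM")
```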

In [74]:
X_train["patient_status"]=X_train["visit_month_diff_min"].apply(lambda x: "Less_severe" if x==12 else "Severe")

We will now look at logistic regression. Logistic regression requires a categorical response variable, so it is evaluated by accuracy rather than RMSE. We label each score as high (0) or low (1) using the mean score as the cut-off, and use the features to predict the category.

In [31]:
y_train["updrs1_category"]=y_train["updrs_1"].apply(lambda x: 0 if x>y_train.updrs_1.mean() else 1)
y_train["updrs2_category"]=y_train["updrs_2"].apply(lambda x: 0 if x>y_train.updrs_2.mean() else 1)
y_train["updrs3_category"]=y_train["updrs_3"].apply(lambda x: 0 if x>y_train.updrs_3.mean() else 1)
y_train["updrs4_category"]=y_train["updrs_4"].apply(lambda x: 0 if x>y_train.updrs_4.mean() else 1)


We will now run a grid search over C, the inverse of the regularization strength (smaller values mean stronger regularization), with 10-fold cross-validation.

In [36]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
grid={"C":np.logspace(-3,3,7)}
logreg=LogisticRegression()
logreg_cv=GridSearchCV(logreg,grid,cv=10)
logreg_cv.fit(X_tr_scaled_UPDRS1,y_train.updrs1_category)

print("tuned hyperparameters :(best parameters) ",logreg_cv.best_params_)
print("accuracy for UPDRS1 :",logreg_cv.best_score_)

tuned hyperparameters :(best parameters)  {'C': 0.01}
accuracy for UPDRS1 : 0.6497795803209312


In [37]:
logreg_cv.fit(X_tr_scaled_UPDRS2,y_train.updrs2_category)
print("tuned hyperparameters :(best parameters) ",logreg_cv.best_params_)
print("accuracy for UPDRS2 :",logreg_cv.best_score_)

tuned hyperparameters :(best parameters)  {'C': 0.01}
accuracy for UPDRS2 : 0.680700052900723


In [38]:
logreg_cv.fit(X_tr_scaled_UPDRS3,y_train.updrs3_category)
print("tuned hyperparameters :(best parameters) ",logreg_cv.best_params_)
print("accuracy for UPDRS3 :",logreg_cv.best_score_)

tuned hyperparameters :(best parameters)  {'C': 0.1}
accuracy for UPDRS3 : 0.72093986951155


In [39]:
logreg_cv.fit(X_tr_scaled_UPDRS4,y_train.updrs4_category)
print("tuned hyperparameters :(best parameters) ",logreg_cv.best_params_)
print("accuracy for UPDRS4 :",logreg_cv.best_score_)

tuned hyperparameters :(best parameters)  {'C': 10.0}
accuracy for UPDRS4 : 0.8136836536766003


When we predict categories rather than scores, the cross-validated accuracy is 65% for UPDRS1, 68% for UPDRS2, 72% for UPDRS3 and 81% for UPDRS4. All four are above 50%, but because the classes are split at the mean they are not necessarily balanced, so the majority-class rate is a fairer baseline than 50%. Predicting categories is a simpler task than predicting exact scores, and the selected features appear to be more useful for it.
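To judge these accuracies, it helps to know what a classifier that always predicts the most frequent category would achieve. A minimal sketch follows; the helper name `majority_baseline_accuracy` and the example labels are hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Hypothetical labels, for illustration only: seven of class 1, three of class 0.
demo_labels = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(majority_baseline_accuracy(demo_labels))
```

In the notebook, this could be applied to each category column, for example `majority_baseline_accuracy(y_train.updrs4_category.tolist())`, and compared with the cross-validated accuracy for the same score.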