# Introduction

The goal of this competition is to **predict MDS-UPDR scores**, which measure progression **in patients with Parkinson's disease**. The Movement Disorder Society-Sponsored Revision of the Unified Parkinson's Disease Rating Scale (MDS-UPDRS) is a comprehensive assessment of both motor and non-motor symptoms associated with Parkinson's.

We have to **develop a model trained on data of protein and peptide levels over time** in subjects with Parkinson’s disease versus normal age-matched control subjects.

However, there are **a huge number of peptides** in the dataset. **If all the peptides are used as independent variables, the model will not be able to predict the target variables effectively.** The model will be **too complicate**. Moreover, **strong correlation between independent variables causes multicollinearity**. Thus, we have to **select independent variables**. Here, we focus on features selection **to improve the model fitness**.

# Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Define Metric SMAPE

In [2]:
def smape(y_true, y_pred):
    smap = np.zeros(len(y_true))
    
    num = np.abs(y_true - y_pred)
    dem = ((np.abs(y_true) + np.abs(y_pred)) / 2)
    
    pos_ind = (y_true != 0)|(y_pred != 0)
    smap[pos_ind] = num[pos_ind] / dem[pos_ind]
    
    return 100 * np.mean(smap)

# Create Dataset (Peptides)

We read the train CSV files and integrate them into one dataset. The process is explained at my another notebook: [Parkinson's Disease MDS-UPDRS End-to-End Baseline](https://www.kaggle.com/code/gokifujiya/parkinson-s-disease-mds-updrs-end-to-end-baseline#Import-Libraries).

## Proteins Data

In [3]:
proteins = pd.read_csv('/kaggle/input/amp-parkinsons-disease-progression-prediction/train_proteins.csv')
print('Proteins shape:',proteins.shape)
proteins.head()

Proteins shape: (232741, 5)


Unnamed: 0,visit_id,visit_month,patient_id,UniProt,NPX
0,55_0,0,55,O00391,11254.3
1,55_0,0,55,O00533,732430.0
2,55_0,0,55,O00584,39585.8
3,55_0,0,55,O14498,41526.9
4,55_0,0,55,O14773,31238.0


## Peptides Data

In [4]:
peptides = pd.read_csv('/kaggle/input/amp-parkinsons-disease-progression-prediction/train_peptides.csv')
print('Peptides shape:', peptides.shape)
peptides.head()

Peptides shape: (981834, 6)


Unnamed: 0,visit_id,visit_month,patient_id,UniProt,Peptide,PeptideAbundance
0,55_0,0,55,O00391,NEQEQPLGQWHLS,11254.3
1,55_0,0,55,O00533,GNPEPTFSWTK,102060.0
2,55_0,0,55,O00533,IEIPSSVQQVPTIIK,174185.0
3,55_0,0,55,O00533,KPQSAVYSTGSNGILLC(UniMod_4)EAEGEPQPTIK,27278.9
4,55_0,0,55,O00533,SMEQNGPGLEYR,30838.7


## Clinical Data

In [5]:
clinical = pd.read_csv('/kaggle/input/amp-parkinsons-disease-progression-prediction/train_clinical_data.csv')
print('Clinical shape:', clinical.shape)
clinical.head()

Clinical shape: (2615, 8)


Unnamed: 0,visit_id,patient_id,visit_month,updrs_1,updrs_2,updrs_3,updrs_4,upd23b_clinical_state_on_medication
0,55_0,55,0,10.0,6.0,15.0,,
1,55_3,55,3,10.0,7.0,25.0,,
2,55_6,55,6,8.0,10.0,34.0,,
3,55_9,55,9,8.0,9.0,30.0,0.0,On
4,55_12,55,12,10.0,10.0,41.0,0.0,On


## Merge the Train Data

In [6]:
# Merge the proteins data and peptides data on the common columns.
merged_proteins_peptides = pd.merge(proteins, peptides, on = ['visit_id', 'visit_month', 'patient_id', 'UniProt'])

# Merge the merged protein-peptides data with the clinical data on the common columns.
merged = pd.merge(merged_proteins_peptides, clinical, on = ['visit_id', 'visit_month', 'patient_id'])

# Show the merged data.
merged

Unnamed: 0,visit_id,visit_month,patient_id,UniProt,NPX,Peptide,PeptideAbundance,updrs_1,updrs_2,updrs_3,updrs_4,upd23b_clinical_state_on_medication
0,55_0,0,55,O00391,11254.3,NEQEQPLGQWHLS,11254.30,10.0,6.0,15.0,,
1,55_0,0,55,O00533,732430.0,GNPEPTFSWTK,102060.00,10.0,6.0,15.0,,
2,55_0,0,55,O00533,732430.0,IEIPSSVQQVPTIIK,174185.00,10.0,6.0,15.0,,
3,55_0,0,55,O00533,732430.0,KPQSAVYSTGSNGILLC(UniMod_4)EAEGEPQPTIK,27278.90,10.0,6.0,15.0,,
4,55_0,0,55,O00533,732430.0,SMEQNGPGLEYR,30838.70,10.0,6.0,15.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...
941739,58648_108,108,58648,Q9UHG2,369437.0,ILAGSADSEGVAAPR,202820.00,6.0,0.0,0.0,,
941740,58648_108,108,58648,Q9UKV8,105830.0,SGNIPAGTTVDTK,105830.00,6.0,0.0,0.0,,
941741,58648_108,108,58648,Q9Y646,21257.6,LALLVDTVGPR,21257.60,6.0,0.0,0.0,,
941742,58648_108,108,58648,Q9Y6R7,17953.1,AGC(UniMod_4)VAESTAVC(UniMod_4)R,5127.26,6.0,0.0,0.0,,


In [7]:
# Pivot the data.
pivoted = merged.pivot(index = 'visit_id', columns = ['Peptide'], values = 'PeptideAbundance')

# See the pivoted data.
pivoted

Peptide,AADDTWEPFASGK,AAFGQGSGPIMLDEVQC(UniMod_4)TGTEASLADC(UniMod_4)K,AAFTEC(UniMod_4)C(UniMod_4)QAADK,AANEVSSADVK,AATGEC(UniMod_4)TATVGKR,AATVGSLAGQPLQER,AAVYHHFISDGVR,ADDKETC(UniMod_4)FAEEGK,ADDKETC(UniMod_4)FAEEGKK,ADDLGKGGNEESTKTGNAGSR,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
visit_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10053_0,6580710.0,31204.4,7735070.0,,,,46620.3,236144.0,,,...,202274.0,,4401830.0,77482.6,583075.0,76705.7,104260.0,530223.0,,7207.30
10053_12,6333510.0,52277.6,5394390.0,,,,57554.5,108298.0,45885.4,,...,201009.0,,5001750.0,36745.3,355643.0,92078.1,123254.0,453883.0,49281.9,25332.80
10053_18,7129640.0,61522.0,7011920.0,35984.7,17188.00,19787.3,36029.4,708729.0,5067790.0,30838.2,...,220728.0,,5424380.0,39016.0,496021.0,63203.6,128336.0,447505.0,52389.1,21235.70
10138_12,7404780.0,46107.2,10610900.0,,20910.20,66662.3,55253.9,79575.5,6201210.0,26720.0,...,188362.0,9433.71,3900280.0,48210.3,328482.0,89822.1,129964.0,552232.0,65657.8,9876.98
10138_24,13788300.0,56910.3,6906160.0,13785.5,11004.20,63672.7,36819.8,34160.9,2117430.0,15645.2,...,206187.0,6365.15,3521800.0,69984.6,496737.0,80919.3,111799.0,,56977.6,4903.09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8699_24,6312970.0,44462.7,12455000.0,11051.3,1163.18,43279.8,67743.5,325328.0,4666550.0,11038.5,...,289888.0,8615.27,8770410.0,33599.1,926094.0,118897.0,133682.0,571879.0,80268.3,54889.70
942_12,11289900.0,46111.7,11297300.0,,13894.10,53755.0,40289.3,565112.0,,26495.8,...,173259.0,4767.63,374307.0,35767.3,250397.0,65966.9,77976.8,486239.0,45032.7,
942_24,10161900.0,32145.0,12388000.0,25869.2,17341.80,48625.5,45223.9,84448.0,4684800.0,23150.2,...,185428.0,5554.53,,64049.8,479473.0,68505.7,74483.1,561398.0,52916.4,21847.60
942_48,8248490.0,30563.4,11882600.0,,19114.90,60221.4,46685.9,81282.9,5542110.0,21804.0,...,137611.0,6310.09,,28008.8,231359.0,63265.8,64601.8,632782.0,51123.7,20700.30


In [8]:
# Add visit_month, the 4 scores, and medication status.
df = pd.merge(clinical, pivoted, on = 'visit_id', how = 'right').set_index('visit_id')
df

Unnamed: 0_level_0,patient_id,visit_month,updrs_1,updrs_2,updrs_3,updrs_4,upd23b_clinical_state_on_medication,AADDTWEPFASGK,AAFGQGSGPIMLDEVQC(UniMod_4)TGTEASLADC(UniMod_4)K,AAFTEC(UniMod_4)C(UniMod_4)QAADK,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
visit_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10053_0,10053,0,3.0,0.0,13.0,0.0,,6580710.0,31204.4,7735070.0,...,202274.0,,4401830.0,77482.6,583075.0,76705.7,104260.0,530223.0,,7207.30
10053_12,10053,12,4.0,2.0,8.0,0.0,,6333510.0,52277.6,5394390.0,...,201009.0,,5001750.0,36745.3,355643.0,92078.1,123254.0,453883.0,49281.9,25332.80
10053_18,10053,18,2.0,2.0,0.0,0.0,,7129640.0,61522.0,7011920.0,...,220728.0,,5424380.0,39016.0,496021.0,63203.6,128336.0,447505.0,52389.1,21235.70
10138_12,10138,12,3.0,6.0,31.0,0.0,On,7404780.0,46107.2,10610900.0,...,188362.0,9433.71,3900280.0,48210.3,328482.0,89822.1,129964.0,552232.0,65657.8,9876.98
10138_24,10138,24,4.0,7.0,19.0,10.0,On,13788300.0,56910.3,6906160.0,...,206187.0,6365.15,3521800.0,69984.6,496737.0,80919.3,111799.0,,56977.6,4903.09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8699_24,8699,24,11.0,10.0,13.0,2.0,On,6312970.0,44462.7,12455000.0,...,289888.0,8615.27,8770410.0,33599.1,926094.0,118897.0,133682.0,571879.0,80268.3,54889.70
942_12,942,12,5.0,2.0,25.0,0.0,,11289900.0,46111.7,11297300.0,...,173259.0,4767.63,374307.0,35767.3,250397.0,65966.9,77976.8,486239.0,45032.7,
942_24,942,24,2.0,3.0,23.0,,,10161900.0,32145.0,12388000.0,...,185428.0,5554.53,,64049.8,479473.0,68505.7,74483.1,561398.0,52916.4,21847.60
942_48,942,48,2.0,6.0,35.0,0.0,,8248490.0,30563.4,11882600.0,...,137611.0,6310.09,,28008.8,231359.0,63265.8,64601.8,632782.0,51123.7,20700.30


In [9]:
# Insert the visit_month column to the desired position.
df.insert(6, 'visit_month', df.pop('visit_month'))
df

Unnamed: 0_level_0,patient_id,updrs_1,updrs_2,updrs_3,updrs_4,upd23b_clinical_state_on_medication,visit_month,AADDTWEPFASGK,AAFGQGSGPIMLDEVQC(UniMod_4)TGTEASLADC(UniMod_4)K,AAFTEC(UniMod_4)C(UniMod_4)QAADK,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
visit_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10053_0,10053,3.0,0.0,13.0,0.0,,0,6580710.0,31204.4,7735070.0,...,202274.0,,4401830.0,77482.6,583075.0,76705.7,104260.0,530223.0,,7207.30
10053_12,10053,4.0,2.0,8.0,0.0,,12,6333510.0,52277.6,5394390.0,...,201009.0,,5001750.0,36745.3,355643.0,92078.1,123254.0,453883.0,49281.9,25332.80
10053_18,10053,2.0,2.0,0.0,0.0,,18,7129640.0,61522.0,7011920.0,...,220728.0,,5424380.0,39016.0,496021.0,63203.6,128336.0,447505.0,52389.1,21235.70
10138_12,10138,3.0,6.0,31.0,0.0,On,12,7404780.0,46107.2,10610900.0,...,188362.0,9433.71,3900280.0,48210.3,328482.0,89822.1,129964.0,552232.0,65657.8,9876.98
10138_24,10138,4.0,7.0,19.0,10.0,On,24,13788300.0,56910.3,6906160.0,...,206187.0,6365.15,3521800.0,69984.6,496737.0,80919.3,111799.0,,56977.6,4903.09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8699_24,8699,11.0,10.0,13.0,2.0,On,24,6312970.0,44462.7,12455000.0,...,289888.0,8615.27,8770410.0,33599.1,926094.0,118897.0,133682.0,571879.0,80268.3,54889.70
942_12,942,5.0,2.0,25.0,0.0,,12,11289900.0,46111.7,11297300.0,...,173259.0,4767.63,374307.0,35767.3,250397.0,65966.9,77976.8,486239.0,45032.7,
942_24,942,2.0,3.0,23.0,,,24,10161900.0,32145.0,12388000.0,...,185428.0,5554.53,,64049.8,479473.0,68505.7,74483.1,561398.0,52916.4,21847.60
942_48,942,2.0,6.0,35.0,0.0,,48,8248490.0,30563.4,11882600.0,...,137611.0,6310.09,,28008.8,231359.0,63265.8,64601.8,632782.0,51123.7,20700.30


In [10]:
df = df.drop('patient_id', axis = 1)
df

Unnamed: 0_level_0,updrs_1,updrs_2,updrs_3,updrs_4,upd23b_clinical_state_on_medication,visit_month,AADDTWEPFASGK,AAFGQGSGPIMLDEVQC(UniMod_4)TGTEASLADC(UniMod_4)K,AAFTEC(UniMod_4)C(UniMod_4)QAADK,AANEVSSADVK,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
visit_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10053_0,3.0,0.0,13.0,0.0,,0,6580710.0,31204.4,7735070.0,,...,202274.0,,4401830.0,77482.6,583075.0,76705.7,104260.0,530223.0,,7207.30
10053_12,4.0,2.0,8.0,0.0,,12,6333510.0,52277.6,5394390.0,,...,201009.0,,5001750.0,36745.3,355643.0,92078.1,123254.0,453883.0,49281.9,25332.80
10053_18,2.0,2.0,0.0,0.0,,18,7129640.0,61522.0,7011920.0,35984.7,...,220728.0,,5424380.0,39016.0,496021.0,63203.6,128336.0,447505.0,52389.1,21235.70
10138_12,3.0,6.0,31.0,0.0,On,12,7404780.0,46107.2,10610900.0,,...,188362.0,9433.71,3900280.0,48210.3,328482.0,89822.1,129964.0,552232.0,65657.8,9876.98
10138_24,4.0,7.0,19.0,10.0,On,24,13788300.0,56910.3,6906160.0,13785.5,...,206187.0,6365.15,3521800.0,69984.6,496737.0,80919.3,111799.0,,56977.6,4903.09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8699_24,11.0,10.0,13.0,2.0,On,24,6312970.0,44462.7,12455000.0,11051.3,...,289888.0,8615.27,8770410.0,33599.1,926094.0,118897.0,133682.0,571879.0,80268.3,54889.70
942_12,5.0,2.0,25.0,0.0,,12,11289900.0,46111.7,11297300.0,,...,173259.0,4767.63,374307.0,35767.3,250397.0,65966.9,77976.8,486239.0,45032.7,
942_24,2.0,3.0,23.0,,,24,10161900.0,32145.0,12388000.0,25869.2,...,185428.0,5554.53,,64049.8,479473.0,68505.7,74483.1,561398.0,52916.4,21847.60
942_48,2.0,6.0,35.0,0.0,,48,8248490.0,30563.4,11882600.0,,...,137611.0,6310.09,,28008.8,231359.0,63265.8,64601.8,632782.0,51123.7,20700.30


In [11]:
# Replace NaN with 0 in the Peptides columns.
df.loc[:, 'AADDTWEPFASGK':] = df.loc[:, 'AADDTWEPFASGK':].fillna(0)
df

Unnamed: 0_level_0,updrs_1,updrs_2,updrs_3,updrs_4,upd23b_clinical_state_on_medication,visit_month,AADDTWEPFASGK,AAFGQGSGPIMLDEVQC(UniMod_4)TGTEASLADC(UniMod_4)K,AAFTEC(UniMod_4)C(UniMod_4)QAADK,AANEVSSADVK,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
visit_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10053_0,3.0,0.0,13.0,0.0,,0,6580710.0,31204.4,7735070.0,0.0,...,202274.0,0.00,4401830.0,77482.6,583075.0,76705.7,104260.0,530223.0,0.0,7207.30
10053_12,4.0,2.0,8.0,0.0,,12,6333510.0,52277.6,5394390.0,0.0,...,201009.0,0.00,5001750.0,36745.3,355643.0,92078.1,123254.0,453883.0,49281.9,25332.80
10053_18,2.0,2.0,0.0,0.0,,18,7129640.0,61522.0,7011920.0,35984.7,...,220728.0,0.00,5424380.0,39016.0,496021.0,63203.6,128336.0,447505.0,52389.1,21235.70
10138_12,3.0,6.0,31.0,0.0,On,12,7404780.0,46107.2,10610900.0,0.0,...,188362.0,9433.71,3900280.0,48210.3,328482.0,89822.1,129964.0,552232.0,65657.8,9876.98
10138_24,4.0,7.0,19.0,10.0,On,24,13788300.0,56910.3,6906160.0,13785.5,...,206187.0,6365.15,3521800.0,69984.6,496737.0,80919.3,111799.0,0.0,56977.6,4903.09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8699_24,11.0,10.0,13.0,2.0,On,24,6312970.0,44462.7,12455000.0,11051.3,...,289888.0,8615.27,8770410.0,33599.1,926094.0,118897.0,133682.0,571879.0,80268.3,54889.70
942_12,5.0,2.0,25.0,0.0,,12,11289900.0,46111.7,11297300.0,0.0,...,173259.0,4767.63,374307.0,35767.3,250397.0,65966.9,77976.8,486239.0,45032.7,0.00
942_24,2.0,3.0,23.0,,,24,10161900.0,32145.0,12388000.0,25869.2,...,185428.0,5554.53,0.0,64049.8,479473.0,68505.7,74483.1,561398.0,52916.4,21847.60
942_48,2.0,6.0,35.0,0.0,,48,8248490.0,30563.4,11882600.0,0.0,...,137611.0,6310.09,0.0,28008.8,231359.0,63265.8,64601.8,632782.0,51123.7,20700.30


In [12]:
# Drop upd23b_clinical_state_on_medication column from the previous merged train dataset df.
df = df.drop('upd23b_clinical_state_on_medication', axis = 1)
df

Unnamed: 0_level_0,updrs_1,updrs_2,updrs_3,updrs_4,visit_month,AADDTWEPFASGK,AAFGQGSGPIMLDEVQC(UniMod_4)TGTEASLADC(UniMod_4)K,AAFTEC(UniMod_4)C(UniMod_4)QAADK,AANEVSSADVK,AATGEC(UniMod_4)TATVGKR,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
visit_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10053_0,3.0,0.0,13.0,0.0,0,6580710.0,31204.4,7735070.0,0.0,0.00,...,202274.0,0.00,4401830.0,77482.6,583075.0,76705.7,104260.0,530223.0,0.0,7207.30
10053_12,4.0,2.0,8.0,0.0,12,6333510.0,52277.6,5394390.0,0.0,0.00,...,201009.0,0.00,5001750.0,36745.3,355643.0,92078.1,123254.0,453883.0,49281.9,25332.80
10053_18,2.0,2.0,0.0,0.0,18,7129640.0,61522.0,7011920.0,35984.7,17188.00,...,220728.0,0.00,5424380.0,39016.0,496021.0,63203.6,128336.0,447505.0,52389.1,21235.70
10138_12,3.0,6.0,31.0,0.0,12,7404780.0,46107.2,10610900.0,0.0,20910.20,...,188362.0,9433.71,3900280.0,48210.3,328482.0,89822.1,129964.0,552232.0,65657.8,9876.98
10138_24,4.0,7.0,19.0,10.0,24,13788300.0,56910.3,6906160.0,13785.5,11004.20,...,206187.0,6365.15,3521800.0,69984.6,496737.0,80919.3,111799.0,0.0,56977.6,4903.09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8699_24,11.0,10.0,13.0,2.0,24,6312970.0,44462.7,12455000.0,11051.3,1163.18,...,289888.0,8615.27,8770410.0,33599.1,926094.0,118897.0,133682.0,571879.0,80268.3,54889.70
942_12,5.0,2.0,25.0,0.0,12,11289900.0,46111.7,11297300.0,0.0,13894.10,...,173259.0,4767.63,374307.0,35767.3,250397.0,65966.9,77976.8,486239.0,45032.7,0.00
942_24,2.0,3.0,23.0,,24,10161900.0,32145.0,12388000.0,25869.2,17341.80,...,185428.0,5554.53,0.0,64049.8,479473.0,68505.7,74483.1,561398.0,52916.4,21847.60
942_48,2.0,6.0,35.0,0.0,48,8248490.0,30563.4,11882600.0,0.0,19114.90,...,137611.0,6310.09,0.0,28008.8,231359.0,63265.8,64601.8,632782.0,51123.7,20700.30


# Linear Regression Model without Features Selection

At first, we try to **create, train, and evaluate linear regression models with all the peptides as independent variables**.

This time, we will **predict updrs_1 (y_1), updrs_2 (y_2), updrs_3 (y_3), and updrs_4 (y_4) separately from the other columns** as independent variables. 

## updrs_1

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler

# Separate the dataset for updrs_1.
df_updrs1 = df[['updrs_1'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs1 = df_updrs1.iloc[:, 1:]
y_updrs1 = df_updrs1.iloc[:, 0]

# Split the dataset into training and testing sets.
X_train_updrs1, X_test_updrs1, y_train_updrs1, y_test_updrs1 = train_test_split(X_updrs1, y_updrs1, test_size = 0.2, random_state = 42)

# Standardize the independent variables.
scaler1 = StandardScaler()
X_train_updrs1 = scaler1.fit_transform(X_train_updrs1)
X_test_updrs1 = scaler1.transform(X_test_updrs1)

# Fit a linear regression model on the training set.
model_updrs1 = LinearRegression()
model_updrs1.fit(X_train_updrs1, y_train_updrs1)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs1 = model_updrs1.predict(X_test_updrs1)
y_pred_updrs1 = np.where(y_pred_updrs1 < 0, 0, y_pred_updrs1)

# Evaluate the performance of the model.
mse_updrs1 = mean_squared_error(y_test_updrs1, y_pred_updrs1)
mae_updrs1 = mean_absolute_error(y_test_updrs1, y_pred_updrs1)
r2_updrs1 = r2_score(y_test_updrs1, y_pred_updrs1)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_1
print("mse_updrs1:", mse_updrs1)
print("mae_updrs1:", mae_updrs1)
print("r2_updrs1:", r2_updrs1)
print("SMAPE_updrs1:", smape(y_test_updrs1, y_pred_updrs1))

mse_updrs1: 181.36256723384935
mae_updrs1: 8.976865499013966
r2_updrs1: -7.243777603020284
SMAPE_updrs1: 123.65426134982795


## updrs_2

In [14]:
# Separate the dataset for updrs_2.
df_updrs2 = df[['updrs_2'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs2 = df_updrs2.iloc[:, 1:]
y_updrs2 = df_updrs2.iloc[:, 0]

# Split the dataset into training and testing sets.
X_train_updrs2, X_test_updrs2, y_train_updrs2, y_test_updrs2 = train_test_split(X_updrs2, y_updrs2, test_size = 0.2, random_state = 42)

# Standardize the independent variables.
scaler2 = StandardScaler()
X_train_updrs2 = scaler2.fit_transform(X_train_updrs2)
X_test_updrs2 = scaler2.transform(X_test_updrs2)

# Fit a linear regression model on the training set.
model_updrs2 = LinearRegression()
model_updrs2.fit(X_train_updrs2, y_train_updrs2)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs2 = model_updrs2.predict(X_test_updrs2)
y_pred_updrs2 = np.where(y_pred_updrs2 < 0, 0, y_pred_updrs2)

# Evaluate the performance of the model.
mse_updrs2 = mean_squared_error(y_test_updrs2, y_pred_updrs2)
mae_updrs2 = mean_absolute_error(y_test_updrs2, y_pred_updrs2)
r2_updrs2 = r2_score(y_test_updrs2, y_pred_updrs2)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_2
print("mse_updrs2:", mse_updrs2)
print("mae_updrs2:", mae_updrs2)
print("r2_updrs2:", r2_updrs2)
print("SMAPE_updrs2:", smape(y_test_updrs2, y_pred_updrs2))

mse_updrs2: 141.06138680197282
mae_updrs2: 7.715336937771148
r2_updrs2: -2.962616282911034
SMAPE_updrs2: 119.04055941799506


## updrs_3

In [15]:
# Separate the dataset for updrs_3.
df_updrs3 = df[['updrs_3'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs3 = df_updrs3.iloc[:, 1:]
y_updrs3 = df_updrs3.iloc[:, 0]

# Split the dataset into training and testing sets.
X_train_updrs3, X_test_updrs3, y_train_updrs3, y_test_updrs3 = train_test_split(X_updrs3, y_updrs3, test_size = 0.2, random_state = 42)

# Standardize the independent variables.
scaler3 = StandardScaler()
X_train_updrs3 = scaler3.fit_transform(X_train_updrs3)
X_test_updrs3 = scaler3.transform(X_test_updrs3)

# Fit a linear regression model on the training set.
model_updrs3 = LinearRegression()
model_updrs3.fit(X_train_updrs3, y_train_updrs3)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs3 = model_updrs3.predict(X_test_updrs3)
y_pred_updrs3 = np.where(y_pred_updrs3 < 0, 0, y_pred_updrs3)

# Evaluate the performance of the model.
mse_updrs3 = mean_squared_error(y_test_updrs3, y_pred_updrs3)
mae_updrs3 = mean_absolute_error(y_test_updrs3, y_pred_updrs3)
r2_updrs3 = r2_score(y_test_updrs3, y_pred_updrs3)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_3
print("mse_updrs3:", mse_updrs3)
print("mae_updrs3:", mae_updrs3)
print("r2_updrs3:", r2_updrs3)
print("SMAPE_updrs3:", smape(y_test_updrs3, y_pred_updrs3))

mse_updrs3: 886.6379576884674
mae_updrs3: 21.082451759310874
r2_updrs3: -2.56851426385946
SMAPE_updrs3: 115.79585559191148


## updrs_4

In [16]:
# Separate the dataset for updrs_4.
df_updrs4 = df[['updrs_4'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs4 = df_updrs4.iloc[:, 1:]
y_updrs4 = df_updrs4.iloc[:, 0]

# Split the dataset into training and testing sets.
X_train_updrs4, X_test_updrs4, y_train_updrs4, y_test_updrs4 = train_test_split(X_updrs4, y_updrs4, test_size = 0.2, random_state = 42)

# Standardize the independent variables.
scaler4 = StandardScaler()
X_train_updrs4 = scaler4.fit_transform(X_train_updrs4)
X_test_updrs4 = scaler4.transform(X_test_updrs4)

# Fit a linear regression model on the training set.
model_updrs4 = LinearRegression()
model_updrs4.fit(X_train_updrs4, y_train_updrs4)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs4 = model_updrs4.predict(X_test_updrs4)
y_pred_updrs4 = np.where(y_pred_updrs4 < 0, 0, y_pred_updrs4)

# Evaluate the performance of the model.
mse_updrs4 = mean_squared_error(y_test_updrs4, y_pred_updrs4)
mae_updrs4 = mean_absolute_error(y_test_updrs4, y_pred_updrs4)
r2_updrs4 = r2_score(y_test_updrs4, y_pred_updrs4)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_4
print("mse_updrs4:", mse_updrs4)
print("mae_updrs4:", mae_updrs4)
print("r2_updrs4:", r2_updrs4)
print("SMAPE_updrs4:", smape(y_test_updrs4, y_pred_updrs4))

mse_updrs4: 11.084991005755768
mae_updrs4: 2.387561374234199
r2_updrs4: -0.5014282911838785
SMAPE_updrs4: 122.63621479648548


## Results

In [17]:
# Add a title to the DataFrame.
print("The Results without Features Selection")

# Create a dictionary with the metrics for each target.
metrics_dict_all = {
    'Target': ['UPDRS 1', 'UPDRS 2', 'UPDRS 3', 'UPDRS 4'],
    'MSE': [mse_updrs1, mse_updrs2, mse_updrs3, mse_updrs4],
    'MAE': [mae_updrs1, mae_updrs2, mae_updrs3, mae_updrs4],
    'R2': [r2_updrs1, r2_updrs2, r2_updrs3, r2_updrs4],
    'SMAPE': [smape(y_test_updrs1, y_pred_updrs1), smape(y_test_updrs2, y_pred_updrs2), 
              smape(y_test_updrs3, y_pred_updrs3), smape(y_test_updrs4, y_pred_updrs4)]
}

# Create a Pandas DataFrame from the dictionary.
metrics_df_all = pd.DataFrame(metrics_dict_all)

# Set the 'Target' column as the index.
metrics_df_all.set_index('Target', inplace = True)

# Display the DataFrame.
metrics_df_all

The Results without Features Selection


Unnamed: 0_level_0,MSE,MAE,R2,SMAPE
Target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
UPDRS 1,181.362567,8.976865,-7.243778,123.654261
UPDRS 2,141.061387,7.715337,-2.962616,119.040559
UPDRS 3,886.637958,21.082452,-2.568514,115.795856
UPDRS 4,11.084991,2.387561,-0.501428,122.636215


To evaluate the results of the linear regression model, we can look at **the mean squared error (MSE), mean absolute error (MAE), R-squared (R2), and symmetric mean absolute percentage error (SMAPE)** for each of the four UPDRS scores (UPDRS 1-4).

**Generally, a MSE, MAE, or SMAPE value of 0 indicates a perfect performance of the model, while higher values indicate a worse fit. A R2 value of 1 indicates a perfect fit, while lower values indicate a worse fit.**

It seems that the metrics, such as the SMAPE values are considerably high. This could indicate that there are **large differences between the predicted values and the true values**.

# Select Features

There are several techniques we can use to select features from a large number of independent variables:

   1. **Univariate Feature Selection**: This method selects the features with **the highest correlation with the target variable** using statistical tests like chi-squared test, ANOVA F-test, mutual information, etc.

   2. **Recursive Feature Elimination**: This method **recursively removes features from the dataset and selects the features that contribute the most to the model's accuracy**.

   3. **Principal Component Analysis (PCA)**: PCA is a **dimensionality reduction** technique that transforms the original features **into a new set of uncorrelated features** called principal components. We can select the top principal components that explain the majority of the variance in the data.

   4. **Regularization Methods**: **Lasso and Ridge regression** are two popular regularization methods that **shrink the coefficients of the less important features to zero**, leaving only the most important features in the model.

   5. **Tree-Based Methods**: Tree-based models **like Random Forest and XGBoost** can be used **to rank the importance of the features** based on their contribution to the model's accuracy.

We can also combine multiple feature selection techniques to get a more accurate and robust feature set.

# Linear Regression Model with Univariate Feature Selection

To perform Univariate Feature Selection, we can use **the SelectKBest class from the scikit-learn library**. we are using **the F-test score (f_regression)** as the scoring function **to rank the features**. We are **selecting the top 10 features** based on this score (k=10). Once we fit the selector on the independent variables and target variable, we can get the indices and names of the selected features using the get_support and columns methods, respectively.

## updrs_1

In [18]:
from sklearn.feature_selection import SelectKBest, f_regression

# Select the top 10 features based on the F-test score.
selector1 = SelectKBest(score_func = f_regression, k = 10)

# Fit the selector on the independent variables and target variable.
X_new1 = selector1.fit_transform(X_updrs1, y_updrs1)

# Get the indices of the selected features.
selected_indices1 = selector1.get_support(indices = True)

# Get the names of the selected features.
selected_X_updrs1 = X_updrs1.columns[selected_indices1]

In [19]:
# the selected features
X_new1

array([[    0.  , 60980.2 ,     0.  , ...,     0.  ,     0.  ,     0.  ],
       [    0.  , 52614.2 ,     0.  , ...,     0.  ,     0.  , 16311.6 ],
       [    0.  , 67865.  ,  6847.04, ...,     0.  ,     0.  , 26687.2 ],
       ...,
       [ 6778.22, 58520.2 , 18228.8 , ..., 10682.6 ,     0.  , 18745.8 ],
       [ 6251.34, 64102.3 , 16847.7 , ...,  9697.91,     0.  , 24418.9 ],
       [ 5988.12, 47542.1 , 17359.7 , ..., 15239.1 , 13383.  , 24243.2 ]])

In [20]:
# The selected features does not include visit_month.
selected_X_updrs1

Index(['EAEEETTNDNGVLVLEPARK', 'FFLC(UniMod_4)QVAGDAK', 'FIYGGC(UniMod_4)GGNR',
       'GATLALTQVTPQDER', 'GEAGAPGEEDIQGPTK', 'LDEVKEQVAEVR', 'QQETAAAETETR',
       'TLKIENVSYQDKGNYR', 'VGGVQSLGGTGALR',
       'VHKEDDGVPVIC(UniMod_4)QVEHPAVTGNLQTQR'],
      dtype='object')

In [21]:
from sklearn.feature_selection import SelectKBest, f_regression

# Select the top 10 features with the highest F-values.
selector1 = SelectKBest(f_regression, k = 10)
X_new1 = selector1.fit_transform(X_updrs1, y_updrs1)

# Add visit_month column to X_new.
X_new1 = np.column_stack((df_updrs1.iloc[:, 1].values, X_new1))

# Split the dataset into training and testing sets.
X_train_updrs1, X_test_updrs1, y_train_updrs1, y_test_updrs1 = train_test_split(X_new1, y_updrs1, test_size = 0.2, random_state = 42)

# Standardize the independent variables.
scaler1 = StandardScaler()
X_train_updrs1 = scaler1.fit_transform(X_train_updrs1)
X_test_updrs1 = scaler1.transform(X_test_updrs1)

# Fit a linear regression model on the training set.
model_updrs1 = LinearRegression()
model_updrs1.fit(X_train_updrs1, y_train_updrs1)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs1 = model_updrs1.predict(X_test_updrs1)
y_pred_updrs1 = np.where(y_pred_updrs1 < 0, 0, y_pred_updrs1)

# Evaluate the performance of the model.
mse_updrs1 = mean_squared_error(y_test_updrs1, y_pred_updrs1)
mae_updrs1 = mean_absolute_error(y_test_updrs1, y_pred_updrs1)
r2_updrs1 = r2_score(y_test_updrs1, y_pred_updrs1)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_1
print("mse_updrs1:", mse_updrs1)
print("mae_updrs1:", mae_updrs1)
print("r2_updrs1:", r2_updrs1)
print("SMAPE_updrs1:", smape(y_test_updrs1, y_pred_updrs1))

mse_updrs1: 22.05163758035551
mae_updrs1: 3.902137973228242
r2_updrs1: -0.002350147373334721
SMAPE_updrs1: 74.35286062767517


## updrs_2

In [22]:
# Select the top 10 features based on the F-test score.
selector2 = SelectKBest(score_func = f_regression, k = 10)

# Fit the selector on the independent variables and target variable.
X_new2 = selector2.fit_transform(X_updrs2, y_updrs2)

# Get the indices of the selected features.
selected_indices2 = selector2.get_support(indices = True)

# Get the names of the selected features.
selected_X_updrs2 = X_updrs2.columns[selected_indices2]

In [23]:
# The selected features does not include visit_month.
selected_X_updrs2

Index(['AYQGVAAPFPK', 'DRLDEVKEQVAEVR', 'EAEEETTNDNGVLVLEPARK',
       'FIYGGC(UniMod_4)GGNR', 'GATLALTQVTPQDER', 'LDEVKEQVAEVR', 'LEEQAQQIR',
       'LQAEAFQAR', 'QQETAAAETETR', 'TLKIENVSYQDKGNYR'],
      dtype='object')

In [24]:
# Select the top 10 features with the highest F-values.
selector2 = SelectKBest(f_regression, k = 10)
X_new2 = selector2.fit_transform(X_updrs2, y_updrs2)

# Add visit_month column to X_new.
X_new2 = np.column_stack((df_updrs2.iloc[:, 1].values, X_new2))

# Split the dataset into training and testing sets.
X_train_updrs2, X_test_updrs2, y_train_updrs2, y_test_updrs2 = train_test_split(X_new2, y_updrs2, test_size = 0.2, random_state = 42)

# Standardize the independent variables.
scaler2 = StandardScaler()
X_train_updrs2 = scaler2.fit_transform(X_train_updrs2)
X_test_updrs2 = scaler2.transform(X_test_updrs2)

# Fit a linear regression model on the training set.
model_updrs2 = LinearRegression()
model_updrs2.fit(X_train_updrs2, y_train_updrs2)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs2 = model_updrs2.predict(X_test_updrs2)
y_pred_updrs2 = np.where(y_pred_updrs2 < 0, 0, y_pred_updrs2)

# Evaluate the performance of the model.
mse_updrs2 = mean_squared_error(y_test_updrs2, y_pred_updrs2)
mae_updrs2 = mean_absolute_error(y_test_updrs2, y_pred_updrs2)
r2_updrs2 = r2_score(y_test_updrs2, y_pred_updrs2)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_2
print("mse_updrs2:", mse_updrs2)
print("mae_updrs2:", mae_updrs2)
print("r2_updrs2:", r2_updrs2)
print("SMAPE_updrs2:", smape(y_test_updrs2, y_pred_updrs2))

mse_updrs2: 34.64765343973025
mae_updrs2: 4.738632282627398
r2_updrs2: 0.02669781718739339
SMAPE_updrs2: 101.98658497016912


## updrs_3

In [25]:
# Select the top 10 features based on the F-test score.
selector3 = SelectKBest(score_func = f_regression, k = 10)

# Fit the selector on the independent variables and target variable.
X_new3 = selector3.fit_transform(X_updrs3, y_updrs3)

# Get the indices of the selected features.
selected_indices3 = selector3.get_support(indices = True)

# Get the names of the selected features.
selected_X_updrs3 = X_updrs3.columns[selected_indices3]

In [26]:
# The selected features does not include visit_month.
selected_X_updrs3

Index(['ALEYIENLR', 'AYQGVAAPFPK', 'FVEGLPINDFSR', 'IEIPSSVQQVPTIIK',
       'KPQSAVYSTGSNGILLC(UniMod_4)EAEGEPQPTIK', 'LVFFAEDVGSNK',
       'QQETAAAETETR', 'TLKIENVSYQDKGNYR', 'VNGSPVDNHPFAGDVVFPR',
       'VRQGQGQSEPGEYEQR'],
      dtype='object')

In [27]:
# Select the top 10 features with the highest F-values.
selector3 = SelectKBest(f_regression, k = 10)
X_new3 = selector3.fit_transform(X_updrs3, y_updrs3)

# Add visit_month column to X_new.
X_new3 = np.column_stack((df_updrs3.iloc[:, 1].values, X_new3))

# Split the dataset into training and testing sets.
X_train_updrs3, X_test_updrs3, y_train_updrs3, y_test_updrs3 = train_test_split(X_new3, y_updrs3, test_size = 0.2, random_state = 42)

# Standardize the independent variables.
scaler3 = StandardScaler()
X_train_updrs3 = scaler3.fit_transform(X_train_updrs3)
X_test_updrs3 = scaler3.transform(X_test_updrs3)

# Fit a linear regression model on the training set.
model_updrs3 = LinearRegression()
model_updrs3.fit(X_train_updrs3, y_train_updrs3)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs3 = model_updrs3.predict(X_test_updrs3)
y_pred_updrs3 = np.where(y_pred_updrs3 < 0, 0, y_pred_updrs3)

# Evaluate the performance of the model.
mse_updrs3 = mean_squared_error(y_test_updrs3, y_pred_updrs3)
mae_updrs3 = mean_absolute_error(y_test_updrs3, y_pred_updrs3)
r2_updrs3 = r2_score(y_test_updrs3, y_pred_updrs3)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_3
print("mse_updrs3:", mse_updrs3)
print("mae_updrs3:", mae_updrs3)
print("r2_updrs3:", r2_updrs3)
print("SMAPE_updrs3:", smape(y_test_updrs3, y_pred_updrs3))

mse_updrs3: 237.37987149995837
mae_updrs3: 12.838956355508547
r2_updrs3: 0.04460050457442155
SMAPE_updrs3: 96.61876410425539


## updrs_4

In [28]:
# Select the top 10 features based on the F-test score.
selector4 = SelectKBest(score_func = f_regression, k = 10)

# Fit the selector on the independent variables and target variable.
X_new4 = selector4.fit_transform(X_updrs4, y_updrs4)

# Get the indices of the selected features.
selected_indices4 = selector4.get_support(indices = True)

# Get the names of the selected features.
selected_X_updrs4 = X_updrs4.columns[selected_indices4]

In [29]:
# The selected features does not include visit_month.
selected_X_updrs4

Index(['APLIPMEHC(UniMod_4)TTR', 'C(UniMod_4)AEENC(UniMod_4)FIQK',
       'C(UniMod_4)PFPSRPDNGFVNYPAKPTLYYK', 'EDC(UniMod_4)NELPPRR',
       'FSGSLLGGK', 'LDEVKEQVAEVR', 'LEPGQQEEYYR', 'LLELTGPK', 'SILENLR',
       'VLEPTLK'],
      dtype='object')

In [30]:
# Select the top 10 features with the highest F-values.
selector4 = SelectKBest(f_regression, k = 10)
X_new4 = selector4.fit_transform(X_updrs4, y_updrs4)

# Add visit_month column to X_new.
X_new4 = np.column_stack((df_updrs4.iloc[:, 1].values, X_new4))

# Split the dataset into training and testing sets.
X_train_updrs4, X_test_updrs4, y_train_updrs4, y_test_updrs4 = train_test_split(X_new4, y_updrs4, test_size = 0.2, random_state = 42)

# Standardize the independent variables.
scaler4 = StandardScaler()
X_train_updrs4 = scaler4.fit_transform(X_train_updrs4)
X_test_updrs4 = scaler4.transform(X_test_updrs4)

# Fit a linear regression model on the training set.
model_updrs4 = LinearRegression()
model_updrs4.fit(X_train_updrs4, y_train_updrs4)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs4 = model_updrs4.predict(X_test_updrs4)
y_pred_updrs4 = np.where(y_pred_updrs4 < 0, 0, y_pred_updrs4)

# Evaluate the performance of the model.
mse_updrs4 = mean_squared_error(y_test_updrs4, y_pred_updrs4)
mae_updrs4 = mean_absolute_error(y_test_updrs4, y_pred_updrs4)
r2_updrs4 = r2_score(y_test_updrs4, y_pred_updrs4)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_4
print("mse_updrs4:", mse_updrs4)
print("mae_updrs4:", mae_updrs4)
print("r2_updrs4:", r2_updrs4)
print("SMAPE_updrs4:", smape(y_test_updrs4, y_pred_updrs4))

mse_updrs4: 7.907121166475644
mae_updrs4: 2.1610961887250193
r2_updrs4: -0.07099549426797025
SMAPE_updrs4: 148.0046268161352


## Results

In [31]:
# Add a title to the DataFrame.
print("The Results with Univariate Feature Selection")

# Create a dictionary with the metrics for each target.
metrics_dict_KBest = {
    'Target': ['UPDRS 1', 'UPDRS 2', 'UPDRS 3', 'UPDRS 4'],
    'MSE': [mse_updrs1, mse_updrs2, mse_updrs3, mse_updrs4],
    'MAE': [mae_updrs1, mae_updrs2, mae_updrs3, mae_updrs4],
    'R2': [r2_updrs1, r2_updrs2, r2_updrs3, r2_updrs4],
    'SMAPE': [smape(y_test_updrs1, y_pred_updrs1), smape(y_test_updrs2, y_pred_updrs2), 
              smape(y_test_updrs3, y_pred_updrs3), smape(y_test_updrs4, y_pred_updrs4)]
}

# Create a Pandas DataFrame from the dictionary.
metrics_df_KBest = pd.DataFrame(metrics_dict_KBest)

# Set the 'Target' column as the index.
metrics_df_KBest.set_index('Target', inplace = True)

# Display the DataFrame.
metrics_df_KBest

The Results with Univariate Feature Selection


Unnamed: 0_level_0,MSE,MAE,R2,SMAPE
Target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
UPDRS 1,22.051638,3.902138,-0.00235,74.352861
UPDRS 2,34.647653,4.738632,0.026698,101.986585
UPDRS 3,237.379871,12.838956,0.044601,96.618764
UPDRS 4,7.907121,2.161096,-0.070995,148.004627


In [32]:
# Add a title to the DataFrame.
print("The Results without Features Selection")

# comparison with the results without features selection
metrics_df_all

The Results without Features Selection


Unnamed: 0_level_0,MSE,MAE,R2,SMAPE
Target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
UPDRS 1,181.362567,8.976865,-7.243778,123.654261
UPDRS 2,141.061387,7.715337,-2.962616,119.040559
UPDRS 3,886.637958,21.082452,-2.568514,115.795856
UPDRS 4,11.084991,2.387561,-0.501428,122.636215


**We could see better results except for SMAPE for updrs_4 (y_4) by Univariate Feature Selection than those without features selection.**

# Linear Regression Model with Recursive Feature Elimination (RFE)

Recursive Feature Elimination (RFE) is a method to **select the best features by recursively considering smaller and smaller subsets of features**. In each iteration, the model is trained on the remaining features and the feature **with the lowest importance is removed**.

To add RFE to the linear regression model, we can use **the RFE class from scikit-learn**.

Here, **n_features_to_select is the number of features to select** and **step is the number of features to remove at each iteration**. The selector.transform method selects only the selected features from the training and testing data, and the linear regression model is fit on the selected features. Finally, the performance of the model is evaluated on the selected features.

## updrs_1

In [33]:
from sklearn.feature_selection import RFE

# Create an instance of the linear regression model.
model_updrs1 = LinearRegression()

# Create an instance of the RFE class and set the number of features to select.
selector1 = RFE(model_updrs1, n_features_to_select = 5, step = 1)

# Fit the selector on the training data.
selector1.fit(X_train_updrs1, y_train_updrs1)

# Transform the training and testing data to include only the selected features.
X_train_selected1 = selector1.transform(X_train_updrs1)
X_test_selected1 = selector1.transform(X_test_updrs1)

# Fit the linear regression model on the selected features.
model_updrs1.fit(X_train_selected1, y_train_updrs1)

# Predict the values of the dependent variable (target) on the testing set using the selected features.
y_pred_updrs1 = model_updrs1.predict(X_test_selected1)
y_pred_updrs1 = np.where(y_pred_updrs1 < 0, 0, y_pred_updrs1)

# Evaluate the performance of the model on the selected features.
mse_updrs1 = mean_squared_error(y_test_updrs1, y_pred_updrs1)
mae_updrs1 = mean_absolute_error(y_test_updrs1, y_pred_updrs1)
r2_updrs1 = r2_score(y_test_updrs1, y_pred_updrs1)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_1
print("mse_updrs1:", mse_updrs1)
print("mae_updrs1:", mae_updrs1)
print("r2_updrs1:", r2_updrs1)
print("SMAPE_updrs1:", smape(y_test_updrs1, y_pred_updrs1))

mse_updrs1: 21.97704537475717
mae_updrs1: 3.894908743202316
r2_updrs1: 0.0010404175224445478
SMAPE_updrs1: 74.14369552761725


## updrs_2

In [34]:
# Create an instance of the linear regression model.
model_updrs2 = LinearRegression()

# Create an instance of the RFE class and set the number of features to select.
selector2 = RFE(model_updrs2, n_features_to_select = 5, step = 1)

# Fit the selector on the training data.
selector2.fit(X_train_updrs2, y_train_updrs2)

# Transform the training and testing data to include only the selected features.
X_train_selected2 = selector2.transform(X_train_updrs2)
X_test_selected2 = selector2.transform(X_test_updrs2)

# Fit the linear regression model on the selected features.
model_updrs2.fit(X_train_selected2, y_train_updrs2)

# Predict the values of the dependent variable (target) on the testing set using the selected features.
y_pred_updrs2 = model_updrs2.predict(X_test_selected2)
y_pred_updrs2 = np.where(y_pred_updrs2 < 0, 0, y_pred_updrs2)

# Evaluate the performance of the model on the selected features.
mse_updrs2 = mean_squared_error(y_test_updrs2, y_pred_updrs2)
mae_updrs2 = mean_absolute_error(y_test_updrs2, y_pred_updrs2)
r2_updrs2 = r2_score(y_test_updrs2, y_pred_updrs2)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_2
print("mse_updrs2:", mse_updrs2)
print("mae_updrs2:", mae_updrs2)
print("r2_updrs2:", r2_updrs2)
print("SMAPE_updrs2:", smape(y_test_updrs2, y_pred_updrs2))

mse_updrs2: 35.088510417511216
mae_updrs2: 4.768767514592333
r2_updrs2: 0.014313513600174077
SMAPE_updrs2: 102.20941995373704


## updrs_3

In [35]:
# Create an instance of the linear regression model.
model_updrs3 = LinearRegression()

# Create an instance of the RFE class and set the number of features to select.
selector3 = RFE(model_updrs3, n_features_to_select = 5, step = 1)

# Fit the selector on the training data.
selector3.fit(X_train_updrs3, y_train_updrs3)

# Transform the training and testing data to include only the selected features.
X_train_selected3 = selector3.transform(X_train_updrs3)
X_test_selected3 = selector3.transform(X_test_updrs3)

# Fit the linear regression model on the selected features.
model_updrs3.fit(X_train_selected3, y_train_updrs3)

# Predict the values of the dependent variable (target) on the testing set using the selected features.
y_pred_updrs3 = model_updrs3.predict(X_test_selected3)
y_pred_updrs3 = np.where(y_pred_updrs3 < 0, 0, y_pred_updrs3)

# Evaluate the performance of the model on the selected features.
mse_updrs3 = mean_squared_error(y_test_updrs3, y_pred_updrs3)
mae_updrs3 = mean_absolute_error(y_test_updrs3, y_pred_updrs3)
r2_updrs3 = r2_score(y_test_updrs3, y_pred_updrs3)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_3
print("mse_updrs3:", mse_updrs3)
print("mae_updrs3:", mae_updrs3)
print("r2_updrs3:", r2_updrs3)
print("SMAPE_updrs3:", smape(y_test_updrs3, y_pred_updrs3))

mse_updrs3: 235.83013851462368
mae_updrs3: 12.762254866251473
r2_updrs3: 0.050837824119998154
SMAPE_updrs3: 96.05947458921348


## updrs_4

In [36]:
# Create an instance of the linear regression model.
model_updrs4 = LinearRegression()

# Create an instance of the RFE class and set the number of features to select.
selector4 = RFE(model_updrs4, n_features_to_select = 5, step = 1)

# Fit the selector on the training data.
selector4.fit(X_train_updrs4, y_train_updrs4)

# Transform the training and testing data to include only the selected features.
X_train_selected4 = selector4.transform(X_train_updrs4)
X_test_selected4 = selector4.transform(X_test_updrs4)

# Fit the linear regression model on the selected features.
model_updrs4.fit(X_train_selected4, y_train_updrs4)

# Predict the values of the dependent variable (target) on the testing set using the selected features.
y_pred_updrs4 = model_updrs4.predict(X_test_selected4)
y_pred_updrs4 = np.where(y_pred_updrs4 < 0, 0, y_pred_updrs4)

# Evaluate the performance of the model on the selected features.
mse_updrs4 = mean_squared_error(y_test_updrs4, y_pred_updrs4)
mae_updrs4 = mean_absolute_error(y_test_updrs4, y_pred_updrs4)
r2_updrs4 = r2_score(y_test_updrs4, y_pred_updrs4)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_4
print("mse_updrs4:", mse_updrs4)
print("mae_updrs4:", mae_updrs4)
print("r2_updrs4:", r2_updrs4)
print("SMAPE_updrs4:", smape(y_test_updrs4, y_pred_updrs4))

mse_updrs4: 7.948160754716737
mae_updrs4: 2.158215094416796
r2_updrs4: -0.07655418157874205
SMAPE_updrs4: 148.73464823635248


## Results

In [37]:
# Add a title to the DataFrame.
print("The Results with Recursive Feature Elimination")

# Create a dictionary with the metrics for each target.
metrics_dict_RFE = {
    'Target': ['UPDRS 1', 'UPDRS 2', 'UPDRS 3', 'UPDRS 4'],
    'MSE': [mse_updrs1, mse_updrs2, mse_updrs3, mse_updrs4],
    'MAE': [mae_updrs1, mae_updrs2, mae_updrs3, mae_updrs4],
    'R2': [r2_updrs1, r2_updrs2, r2_updrs3, r2_updrs4],
    'SMAPE': [smape(y_test_updrs1, y_pred_updrs1), smape(y_test_updrs2, y_pred_updrs2), 
              smape(y_test_updrs3, y_pred_updrs3), smape(y_test_updrs4, y_pred_updrs4)]
}

# Create a Pandas DataFrame from the dictionary.
metrics_df_RFE = pd.DataFrame(metrics_dict_RFE)

# Set the 'Target' column as the index.
metrics_df_RFE.set_index('Target', inplace = True)

# Display the DataFrame.
metrics_df_RFE

The Results with Recursive Feature Elimination


Unnamed: 0_level_0,MSE,MAE,R2,SMAPE
Target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
UPDRS 1,21.977045,3.894909,0.00104,74.143696
UPDRS 2,35.08851,4.768768,0.014314,102.20942
UPDRS 3,235.830139,12.762255,0.050838,96.059475
UPDRS 4,7.948161,2.158215,-0.076554,148.734648


In [38]:
# Add a title to the DataFrame.
print("The Results without Features Selection")

# comparison with the results without features selection
metrics_df_all

The Results without Features Selection


Unnamed: 0_level_0,MSE,MAE,R2,SMAPE
Target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
UPDRS 1,181.362567,8.976865,-7.243778,123.654261
UPDRS 2,141.061387,7.715337,-2.962616,119.040559
UPDRS 3,886.637958,21.082452,-2.568514,115.795856
UPDRS 4,11.084991,2.387561,-0.501428,122.636215


**We could see better results except for SMAPE for updrs_4 (y_4) by Univariate Feature Selection than those without features selection.** In addition, the results are **similar to those of Univariate Feature selection**.

Contrary to Univariate Feature Selection, **this RFE method does not guarantee to keep a specific variable, such as visit_month column**. If this variable is eliminated, **the prediction will be the same regardless of visit_month.**

# Linear Regression Model with Principal Component Analysis (PCA)

To add Principal Component Analysis (PCA), we can use **the PCA class from the sklearn.decomposition module**. Here, we **add PCA to Univariate Feature Selection**.

We first **apply PCA to reduce the dimensionality of the data to 50 components**, and then **select the top 10 features with the highest F-values** from the PCA-transformed data. **The rest of the code remains the same.** Note that we may need to **experiment with different values of n_components to find the optimal number of components** to use.

## updrs_1

In [39]:
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

# Separate the dataset for updrs_1.
df_updrs1 = df[['updrs_1'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs1 = df_updrs1.iloc[:, 1:]
y_updrs1 = df_updrs1.iloc[:, 0]

# Apply PCA to reduce the dimensionality of the data.
pca = PCA(n_components = 50)
X_updrs1 = pca.fit_transform(X_updrs1)

# Select the top 10 features based on the F-test score.
selector1 = SelectKBest(score_func = f_regression, k = 10)

# Fit the selector on the independent variables and target variable.
X_new1 = selector1.fit_transform(X_updrs1, y_updrs1)

# Get the indices of the selected features.
selected_indices1 = selector1.get_support(indices = True)

# Get the names of the selected features.
selected_X_updrs1 = df.columns[4:][selected_indices1]

In [40]:
# the selected features
X_new1

array([[-2.98551712e+06,  1.37471114e+07, -5.59155966e+05, ...,
        -8.12973474e+05, -3.76999095e+05, -3.55283655e+05],
       [-7.90717211e+06,  6.40671120e+06,  1.99162746e+06, ...,
        -2.57782203e+06, -7.96357004e+04, -2.72181128e+05],
       [-2.39707867e+06,  1.08494888e+05, -1.66312245e+06, ...,
        -1.49739518e+06, -2.82346311e+05,  6.93265029e+05],
       ...,
       [-3.44761598e+06, -2.81951556e+06,  7.47572913e+05, ...,
        -8.11629753e+05,  6.90786952e+05, -7.22631582e+05],
       [-3.50849222e+06, -4.61604150e+06,  8.78829064e+05, ...,
        -4.34797219e+05,  1.59489259e+05, -1.03602118e+06],
       [-4.48664213e+06,  1.17638299e+04, -1.46451375e+05, ...,
         1.40331408e+06,  4.41626908e+05, -7.84089475e+05]])

In [41]:
# The selected features does not include visit_month.
selected_X_updrs1

Index(['AADDTWEPFASGK', 'ADDKETC(UniMod_4)FAEEGK', 'ADDKETC(UniMod_4)FAEEGKK',
       'AEFAEVSK', 'AESPEVC(UniMod_4)FNEESPK', 'AGALNSNDAFVLK',
       'AGDFLEANYMNLQR', 'AGLAASLAGPHSIVGR', 'AIPVTQYLK', 'ALTDMPQMR'],
      dtype='object')

In [42]:
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.decomposition import PCA

# Separate the dataset for updrs_1.
df_updrs1 = df[['updrs_1'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs1 = df_updrs1.iloc[:, 1:]
y_updrs1 = df_updrs1.iloc[:, 0]

# Apply PCA to reduce the dimensionality of the data.
pca = PCA(n_components = 50)
X_updrs1 = pca.fit_transform(X_updrs1)

# Select the top 10 features with the highest F-values.
selector1 = SelectKBest(f_regression, k = 10)
X_new1 = selector1.fit_transform(X_updrs1, y_updrs1)

# Add visit_month column to X_new.
X_new1 = np.column_stack((df_updrs1.iloc[:, 1].values, X_new1))

# Split the dataset into training and testing sets.
X_train_updrs1, X_test_updrs1, y_train_updrs1, y_test_updrs1 = train_test_split(X_new1, y_updrs1, test_size = 0.2, random_state = 42)

# Standardize the independent variables.
scaler1 = StandardScaler()
X_train_updrs1 = scaler1.fit_transform(X_train_updrs1)
X_test_updrs1 = scaler1.transform(X_test_updrs1)

# Fit a linear regression model on the training set.
model_updrs1 = LinearRegression()
model_updrs1.fit(X_train_updrs1, y_train_updrs1)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs1 = model_updrs1.predict(X_test_updrs1)
y_pred_updrs1 = np.where(y_pred_updrs1 < 0, 0, y_pred_updrs1)

# Evaluate the performance of the model.
mse_updrs1 = mean_squared_error(y_test_updrs1, y_pred_updrs1)
mae_updrs1 = mean_absolute_error(y_test_updrs1, y_pred_updrs1)
r2_updrs1 = r2_score(y_test_updrs1, y_pred_updrs1)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_1
print("mse_updrs1:", mse_updrs1)
print("mae_updrs1:", mae_updrs1)
print("r2_updrs1:", r2_updrs1)
print("SMAPE_updrs1:", smape(y_test_updrs1, y_pred_updrs1))

mse_updrs1: 22.678155388785267
mae_updrs1: 3.9746039032854537
r2_updrs1: -0.030828314372189247
SMAPE_updrs1: 75.1047625248361


## updrs_2

In [43]:
# Separate the dataset for updrs_2.
df_updrs2 = df[['updrs_2'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs2 = df_updrs2.iloc[:, 1:]
y_updrs2 = df_updrs2.iloc[:, 0]

# Apply PCA to reduce the dimensionality of the data.
pca = PCA(n_components = 50)
X_updrs2 = pca.fit_transform(X_updrs2)

# Select the top 10 features based on the F-test score.
selector2 = SelectKBest(score_func = f_regression, k = 10)

# Fit the selector on the independent variables and target variable.
X_new2 = selector2.fit_transform(X_updrs2, y_updrs2)

# Get the indices of the selected features.
selected_indices2 = selector2.get_support(indices = True)

# Get the names of the selected features.
selected_X_updrs2 = df.columns[4:][selected_indices2]

In [44]:
# The selected features does not include visit_month.
selected_X_updrs2

Index(['AADDTWEPFASGK', 'AAFGQGSGPIMLDEVQC(UniMod_4)TGTEASLADC(UniMod_4)K',
       'AANEVSSADVK', 'ADDKETC(UniMod_4)FAEEGK', 'AEFAEVSK',
       'AESPEVC(UniMod_4)FNEESPK', 'AGDFLEANYMNLQR', 'AGLQVYNK',
       'AIQLTYNPDESSKPNMIDAATLK', 'AKAYLEEEC(UniMod_4)PATLRK'],
      dtype='object')

In [45]:
# Separate the dataset for updrs_2.
df_updrs2 = df[['updrs_2'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs2 = df_updrs2.iloc[:, 1:]
y_updrs2 = df_updrs2.iloc[:, 0]

# Apply PCA to reduce the dimensionality of the data.
pca = PCA(n_components = 50)
X_updrs2 = pca.fit_transform(X_updrs2)

# Select the top 10 features with the highest F-values.
selector2 = SelectKBest(f_regression, k = 10)
X_new2 = selector2.fit_transform(X_updrs2, y_updrs2)

# Add visit_month column to X_new.
X_new2 = np.column_stack((df_updrs2.iloc[:, 1].values, X_new2))

# Split the dataset into training and testing sets.
X_train_updrs2, X_test_updrs2, y_train_updrs2, y_test_updrs2 = train_test_split(X_new2, y_updrs2, test_size = 0.2, random_state = 42)

# Standardize the independent variables.
scaler2 = StandardScaler()
X_train_updrs2 = scaler2.fit_transform(X_train_updrs2)
X_test_updrs2 = scaler2.transform(X_test_updrs2)

# Fit a linear regression model on the training set.
model_updrs2 = LinearRegression()
model_updrs2.fit(X_train_updrs2, y_train_updrs2)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs2 = model_updrs2.predict(X_test_updrs2)
y_pred_updrs2 = np.where(y_pred_updrs2 < 0, 0, y_pred_updrs2)

# Evaluate the performance of the model.
mse_updrs2 = mean_squared_error(y_test_updrs2, y_pred_updrs2)
mae_updrs2 = mean_absolute_error(y_test_updrs2, y_pred_updrs2)
r2_updrs2 = r2_score(y_test_updrs2, y_pred_updrs2)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_2
print("mse_updrs2:", mse_updrs2)
print("mae_updrs2:", mae_updrs2)
print("r2_updrs2:", r2_updrs2)
print("SMAPE_updrs2:", smape(y_test_updrs2, y_pred_updrs2))

mse_updrs2: 34.69991860609635
mae_updrs2: 4.727209100169103
r2_updrs2: 0.02522961384722555
SMAPE_updrs2: 103.52407343203258


## updrs_3

In [46]:
# Separate the dataset for updrs_3.
df_updrs3 = df[['updrs_3'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs3 = df_updrs3.iloc[:, 1:]
y_updrs3 = df_updrs3.iloc[:, 0]

# Apply PCA to reduce the dimensionality of the data.
pca = PCA(n_components = 50)
X_updrs3 = pca.fit_transform(X_updrs3)

# Select the top 10 features based on the F-test score.
selector3 = SelectKBest(score_func = f_regression, k = 10)

# Fit the selector on the independent variables and target variable.
X_new3 = selector3.fit_transform(X_updrs3, y_updrs3)

# Get the indices of the selected features.
selected_indices3 = selector3.get_support(indices = True)

# Get the names of the selected features.
selected_X_updrs3 = df.columns[4:][selected_indices3]

In [47]:
# The selected features does not include visit_month.
selected_X_updrs3

Index(['AAFGQGSGPIMLDEVQC(UniMod_4)TGTEASLADC(UniMod_4)K',
       'ADDKETC(UniMod_4)FAEEGK', 'ADRDQYELLC(UniMod_4)LDNTR', 'AEFAEVSK',
       'AGC(UniMod_4)VAESTAVC(UniMod_4)R', 'AGDFLEANYMNLQR',
       'AIGAVPLIQGEYMIPC(UniMod_4)EK', 'AKLEEQAQQIR', 'ALFLETEQLK',
       'ALMSPAGMLR'],
      dtype='object')

In [48]:
# Separate the dataset for updrs_3.
df_updrs3 = df[['updrs_3'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs3 = df_updrs3.iloc[:, 1:]
y_updrs3 = df_updrs3.iloc[:, 0]

# Apply PCA to reduce the dimensionality of the data.
pca = PCA(n_components = 50)
X_updrs3 = pca.fit_transform(X_updrs3)

# Select the top 10 features with the highest F-values.
selector3 = SelectKBest(f_regression, k = 10)
X_new3 = selector3.fit_transform(X_updrs3, y_updrs3)

# Add visit_month column to X_new.
X_new3 = np.column_stack((df_updrs3.iloc[:, 1].values, X_new3))

# Split the dataset into training and testing sets.
X_train_updrs3, X_test_updrs3, y_train_updrs3, y_test_updrs3 = train_test_split(X_new3, y_updrs3, test_size = 0.2, random_state = 42)

# Standardize the independent variables.
scaler3 = StandardScaler()
X_train_updrs3 = scaler3.fit_transform(X_train_updrs3)
X_test_updrs3 = scaler3.transform(X_test_updrs3)

# Fit a linear regression model on the training set.
model_updrs3 = LinearRegression()
model_updrs3.fit(X_train_updrs3, y_train_updrs3)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs3 = model_updrs3.predict(X_test_updrs3)
y_pred_updrs3 = np.where(y_pred_updrs3 < 0, 0, y_pred_updrs3)

# Evaluate the performance of the model.
mse_updrs3 = mean_squared_error(y_test_updrs3, y_pred_updrs3)
mae_updrs3 = mean_absolute_error(y_test_updrs3, y_pred_updrs3)
r2_updrs3 = r2_score(y_test_updrs3, y_pred_updrs3)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_3
print("mse_updrs3:", mse_updrs3)
print("mae_updrs3:", mae_updrs3)
print("r2_updrs3:", r2_updrs3)
print("SMAPE_updrs3:", smape(y_test_updrs3, y_pred_updrs3))

mse_updrs3: 244.65974825216574
mae_updrs3: 13.018245367937
r2_updrs3: 0.015300671644734698
SMAPE_updrs3: 96.21592877581251


## updrs_4

In [49]:
# Separate the dataset for updrs_4.
df_updrs4 = df[['updrs_4'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs4 = df_updrs4.iloc[:, 1:]
y_updrs4 = df_updrs4.iloc[:, 0]

# Apply PCA to reduce the dimensionality of the data.
pca = PCA(n_components = 50)
X_updrs4 = pca.fit_transform(X_updrs4)

# Select the top 10 features based on the F-test score.
selector4 = SelectKBest(score_func = f_regression, k = 10)

# Fit the selector on the independent variables and target variable.
X_new4 = selector4.fit_transform(X_updrs4, y_updrs4)

# Get the indices of the selected features.
selected_indices4 = selector4.get_support(indices = True)

# Get the names of the selected features.
selected_X_updrs4 = df.columns[4:][selected_indices4]

In [50]:
# The selected features does not include visit_month.
selected_X_updrs4

Index(['AAFTEC(UniMod_4)C(UniMod_4)QAADK', 'AATGEC(UniMod_4)TATVGKR',
       'ADSGEGDFLAEGGGVR', 'AELQC(UniMod_4)PQPAA', 'AFPALTSLDLSDNPGLGER',
       'AGDFLEANYMNLQR', 'AIGYLNTGYQR', 'AKPALEDLR', 'ALFLETEQLK',
       'ALQDQLVLVAAK'],
      dtype='object')

In [51]:
# Separate the dataset for updrs_4.
df_updrs4 = df[['updrs_4'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs4 = df_updrs4.iloc[:, 1:]
y_updrs4 = df_updrs4.iloc[:, 0]

# Apply PCA to reduce the dimensionality of the data.
pca = PCA(n_components = 50)
X_updrs4 = pca.fit_transform(X_updrs4)

# Select the top 10 features with the highest F-values.
selector4 = SelectKBest(f_regression, k = 10)
X_new4 = selector4.fit_transform(X_updrs4, y_updrs4)

# Add visit_month column to X_new.
X_new4 = np.column_stack((df_updrs4.iloc[:, 1].values, X_new4))

# Split the dataset into training and testing sets.
X_train_updrs4, X_test_updrs4, y_train_updrs4, y_test_updrs4 = train_test_split(X_new4, y_updrs4, test_size = 0.2, random_state = 42)

# Standardize the independent variables.
scaler4 = StandardScaler()
X_train_updrs4 = scaler4.fit_transform(X_train_updrs4)
X_test_updrs4 = scaler4.transform(X_test_updrs4)

# Fit a linear regression model on the training set.
model_updrs4 = LinearRegression()
model_updrs4.fit(X_train_updrs4, y_train_updrs4)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs4 = model_updrs4.predict(X_test_updrs4)
y_pred_updrs4 = np.where(y_pred_updrs4 < 0, 0, y_pred_updrs4)

# Evaluate the performance of the model.
mse_updrs4 = mean_squared_error(y_test_updrs4, y_pred_updrs4)
mae_updrs4 = mean_absolute_error(y_test_updrs4, y_pred_updrs4)
r2_updrs4 = r2_score(y_test_updrs4, y_pred_updrs4)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_4
print("mse_updrs4:", mse_updrs4)
print("mae_updrs4:", mae_updrs4)
print("r2_updrs4:", r2_updrs4)
print("SMAPE_updrs4:", smape(y_test_updrs4, y_pred_updrs4))

mse_updrs4: 6.716864582898038
mae_updrs4: 2.092564309070689
r2_updrs4: 0.09022113706924617
SMAPE_updrs4: 148.47692522375579


## Results

In [52]:
# Add a title to the DataFrame.
print("The Results with Univariate Feature Selection and PCA")

# Create a dictionary with the metrics for each target.
metrics_dict_PCA = {
    'Target': ['UPDRS 1', 'UPDRS 2', 'UPDRS 3', 'UPDRS 4'],
    'MSE': [mse_updrs1, mse_updrs2, mse_updrs3, mse_updrs4],
    'MAE': [mae_updrs1, mae_updrs2, mae_updrs3, mae_updrs4],
    'R2': [r2_updrs1, r2_updrs2, r2_updrs3, r2_updrs4],
    'SMAPE': [smape(y_test_updrs1, y_pred_updrs1), smape(y_test_updrs2, y_pred_updrs2), 
              smape(y_test_updrs3, y_pred_updrs3), smape(y_test_updrs4, y_pred_updrs4)]
}

# Create a Pandas DataFrame from the dictionary.
metrics_df_PCA = pd.DataFrame(metrics_dict_PCA)

# Set the 'Target' column as the index.
metrics_df_PCA.set_index('Target', inplace = True)

# Display the DataFrame.
metrics_df_PCA

The Results with Univariate Feature Selection and PCA


Unnamed: 0_level_0,MSE,MAE,R2,SMAPE
Target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
UPDRS 1,22.678155,3.974604,-0.030828,75.104763
UPDRS 2,34.699919,4.727209,0.02523,103.524073
UPDRS 3,244.659748,13.018245,0.015301,96.215929
UPDRS 4,6.716865,2.092564,0.090221,148.476925


In [53]:
# Add a title to the DataFrame.
print("The Results without Features Selection")

# comparison with the results without features selection
metrics_df_all

The Results without Features Selection


Unnamed: 0_level_0,MSE,MAE,R2,SMAPE
Target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
UPDRS 1,181.362567,8.976865,-7.243778,123.654261
UPDRS 2,141.061387,7.715337,-2.962616,119.040559
UPDRS 3,886.637958,21.082452,-2.568514,115.795856
UPDRS 4,11.084991,2.387561,-0.501428,122.636215


**We could see better results except for SMAPE for updrs_4 (y_4) by PCA and Univariate Feature Selection than those without features selection.** In addition, the results are **similar to those of Univariate Feature selection**.

# Lasso and Ridge Regression Model with Univariate Feature Selection and PCA

Further, we **add Lasso and Ridge regression to the previous model with Univariate Feature Selection and PCA**.

## updrs_1

In [54]:
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Lasso, Ridge

# Separate the dataset for updrs_1.
df_updrs1 = df[['updrs_1'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs1 = df_updrs1.iloc[:, 1:]
y_updrs1 = df_updrs1.iloc[:, 0]

# Apply PCA to reduce the dimensionality of the data.
pca = PCA(n_components = 50)
X_updrs1 = pca.fit_transform(X_updrs1)

# Select the top 10 features based on the F-test score.
selector1 = SelectKBest(score_func = f_regression, k = 10)

# Fit the selector on the independent variables and target variable.
X_new1 = selector1.fit_transform(X_updrs1, y_updrs1)

# Get the indices of the selected features.
selected_indices1 = selector1.get_support(indices = True)

# Get the names of the selected features.
selected_X_updrs1 = df.columns[4:][selected_indices1]

# Split the dataset into training and testing sets.
X_train_updrs1, X_test_updrs1, y_train_updrs1, y_test_updrs1 = train_test_split(X_new1, y_updrs1, test_size = 0.2, random_state = 42)

# Standardize the independent variables.
scaler1 = StandardScaler()
X_train_updrs1 = scaler1.fit_transform(X_train_updrs1)
X_test_updrs1 = scaler1.transform(X_test_updrs1)

# Fit a Lasso regression model on the training set.
lasso = Lasso(alpha = 0.1)
lasso.fit(X_train_updrs1, y_train_updrs1)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs1_lasso = lasso.predict(X_test_updrs1)
y_pred_updrs1_lasso = np.where(y_pred_updrs1_lasso < 0, 0, y_pred_updrs1_lasso)

# Evaluate the performance of the Lasso model.
mse_updrs1_lasso = mean_squared_error(y_test_updrs1, y_pred_updrs1_lasso)
mae_updrs1_lasso = mean_absolute_error(y_test_updrs1, y_pred_updrs1_lasso)
r2_updrs1_lasso = r2_score(y_test_updrs1, y_pred_updrs1_lasso)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_1
print("mse_updrs1:", mse_updrs1_lasso)
print("mae_updrs1:", mae_updrs1_lasso)
print("r2_updrs1:", r2_updrs1_lasso)
print("SMAPE_updrs1:", smape(y_test_updrs1, y_pred_updrs1_lasso))

# Fit a Ridge regression model on the training set.
ridge = Ridge(alpha = 0.1)
ridge.fit(X_train_updrs1, y_train_updrs1)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs1_ridge = ridge.predict(X_test_updrs1)
y_pred_updrs1_ridge = np.where(y_pred_updrs1_ridge < 0, 0, y_pred_updrs1_ridge)

# Evaluate the performance of the Ridge model.
mse_updrs1_ridge = mean_squared_error(y_test_updrs1, y_pred_updrs1_ridge)
mae_updrs1_ridge = mean_absolute_error(y_test_updrs1, y_pred_updrs1_ridge)
r2_updrs1_ridge = r2_score(y_test_updrs1, y_pred_updrs1)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_1
print("mse_updrs1:", mse_updrs1_ridge)
print("mae_updrs1:", mae_updrs1_ridge)
print("r2_updrs1:", r2_updrs1_ridge)
print("SMAPE_updrs1:", smape(y_test_updrs1, y_pred_updrs1_ridge))                           

mse_updrs1: 23.24928799252412
mae_updrs1: 4.016411357537079
r2_updrs1: -0.056788964570673395
SMAPE_updrs1: 75.60866641502766
mse_updrs1: 23.354055107777533
mae_updrs1: 4.016330606887107
r2_updrs1: -0.030828314372189247
SMAPE_updrs1: 75.59927970888366


## updrs_2

In [55]:
# Separate the dataset for updrs_2.
df_updrs2 = df[['updrs_2'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs2 = df_updrs2.iloc[:, 1:]
y_updrs2 = df_updrs2.iloc[:, 0]

# Apply PCA to reduce the dimensionality of the data.
pca = PCA(n_components = 50)
X_updrs2 = pca.fit_transform(X_updrs2)

# Select the top 10 features based on the F-test score.
selector2 = SelectKBest(score_func = f_regression, k = 10)

# Fit the selector on the independent variables and target variable.
X_new2 = selector2.fit_transform(X_updrs2, y_updrs2)

# Get the indices of the selected features.
selected_indices2 = selector2.get_support(indices = True)

# Get the names of the selected features.
selected_X_updrs2 = df.columns[4:][selected_indices2]

# Split the dataset into training and testing sets.
X_train_updrs2, X_test_updrs2, y_train_updrs2, y_test_updrs2 = train_test_split(X_new2, y_updrs2, test_size = 0.2, random_state = 42)

# Standardize the independent variables.
scaler2 = StandardScaler()
X_train_updrs2 = scaler2.fit_transform(X_train_updrs2)
X_test_updrs2 = scaler2.transform(X_test_updrs2)

# Fit a Lasso regression model on the training set.
lasso = Lasso(alpha = 0.1)
lasso.fit(X_train_updrs2, y_train_updrs2)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs2_lasso = lasso.predict(X_test_updrs2)
y_pred_updrs2_lasso = np.where(y_pred_updrs2_lasso < 0, 0, y_pred_updrs2_lasso)

# Evaluate the performance of the Lasso model.
mse_updrs2_lasso = mean_squared_error(y_test_updrs2, y_pred_updrs2_lasso)
mae_updrs2_lasso = mean_absolute_error(y_test_updrs2, y_pred_updrs2_lasso)
r2_updrs2_lasso = r2_score(y_test_updrs2, y_pred_updrs2_lasso)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_2
print("mse_updrs2:", mse_updrs2_lasso)
print("mae_updrs2:", mae_updrs2_lasso)
print("r2_updrs2:", r2_updrs2_lasso)
print("SMAPE_updrs2:", smape(y_test_updrs2, y_pred_updrs2_lasso))  

# Fit a Ridge regression model on the training set.
ridge = Ridge(alpha = 0.1)
ridge.fit(X_train_updrs2, y_train_updrs2)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs2_ridge = ridge.predict(X_test_updrs2)
y_pred_updrs2_ridge = np.where(y_pred_updrs2_ridge < 0, 0, y_pred_updrs2_ridge)

# Evaluate the performance of the Ridge model.
mse_updrs2_ridge = mean_squared_error(y_test_updrs2, y_pred_updrs2_ridge)
mae_updrs2_ridge = mean_absolute_error(y_test_updrs2, y_pred_updrs2_ridge)
r2_updrs2_ridge = r2_score(y_test_updrs2, y_pred_updrs2_ridge)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_2
print("mse_updrs2:", mse_updrs2_ridge)
print("mae_updrs2:", mae_updrs2_ridge)
print("r2_updrs2:", r2_updrs2_ridge)
print("SMAPE_updrs2:", smape(y_test_updrs2, y_pred_updrs2_ridge))  

mse_updrs2: 35.15637331849116
mae_updrs2: 4.766655577303434
r2_updrs2: 0.012407147566737664
SMAPE_updrs2: 103.80544153636104
mse_updrs2: 35.17775678939786
mae_updrs2: 4.756698447714232
r2_updrs2: 0.011806455258792425
SMAPE_updrs2: 104.20959918539184


## updrs_3

In [56]:
# Separate the dataset for updrs_3.
df_updrs3 = df[['updrs_3'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs3 = df_updrs3.iloc[:, 1:]
y_updrs3 = df_updrs3.iloc[:, 0]

# Apply PCA to reduce the dimensionality of the data.
pca = PCA(n_components = 50)
X_updrs3 = pca.fit_transform(X_updrs3)

# Select the top 10 features based on the F-test score.
selector3 = SelectKBest(score_func = f_regression, k = 10)

# Fit the selector on the independent variables and target variable.
X_new3 = selector3.fit_transform(X_updrs3, y_updrs3)

# Get the indices of the selected features.
selected_indices3 = selector3.get_support(indices = True)

# Get the names of the selected features.
selected_X_updrs3 = df.columns[4:][selected_indices3]

# Split the dataset into training and testing sets.
X_train_updrs3, X_test_updrs3, y_train_updrs3, y_test_updrs3 = train_test_split(X_new3, y_updrs3, test_size = 0.2, random_state = 42)

# Standardize the independent variables.
scaler3 = StandardScaler()
X_train_updrs3 = scaler3.fit_transform(X_train_updrs3)
X_test_updrs3 = scaler3.transform(X_test_updrs3)

# Fit a Lasso regression model on the training set.
lasso = Lasso(alpha = 0.1)
lasso.fit(X_train_updrs3, y_train_updrs3)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs3_lasso = lasso.predict(X_test_updrs3)
y_pred_updrs3_lasso = np.where(y_pred_updrs3_lasso < 0, 0, y_pred_updrs3_lasso)

# Evaluate the performance of the Lasso model.
mse_updrs3_lasso = mean_squared_error(y_test_updrs3, y_pred_updrs3_lasso)
mae_updrs3_lasso = mean_absolute_error(y_test_updrs3, y_pred_updrs3_lasso)
r2_updrs3_lasso = r2_score(y_test_updrs3, y_pred_updrs3_lasso)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_3
print("mse_updrs3:", mse_updrs3_lasso)
print("mae_updrs3:", mae_updrs3_lasso)
print("r2_updrs3:", r2_updrs3_lasso)
print("SMAPE_updrs3:", smape(y_test_updrs3, y_pred_updrs3_lasso))

# Fit a Ridge regression model on the training set.
ridge = Ridge(alpha = 0.1)
ridge.fit(X_train_updrs3, y_train_updrs3)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs3_ridge = ridge.predict(X_test_updrs3)
y_pred_updrs3_ridge = np.where(y_pred_updrs3_ridge < 0, 0, y_pred_updrs3_ridge)

# Evaluate the performance of the Ridge model.
mse_updrs3_ridge = mean_squared_error(y_test_updrs3, y_pred_updrs3_ridge)
mae_updrs3_ridge = mean_absolute_error(y_test_updrs3, y_pred_updrs3_ridge)
r2_updrs3_ridge = r2_score(y_test_updrs3, y_pred_updrs3_ridge)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_3
print("mse_updrs3:", mse_updrs3_ridge)
print("mae_updrs3:", mae_updrs3_ridge)
print("r2_updrs3:", r2_updrs3_ridge)
print("SMAPE_updrs3:", smape(y_test_updrs3, y_pred_updrs3_ridge))

mse_updrs3: 243.84529565483177
mae_updrs3: 13.017686633474113
r2_updrs3: 0.01857865640234735
SMAPE_updrs3: 96.44948293851392
mse_updrs3: 244.08832680926974
mae_updrs3: 13.014122339503498
r2_updrs3: 0.017600511790272333
SMAPE_updrs3: 96.53464273206602


## updrs_4

In [57]:
# Separate the dataset for updrs_4.
df_updrs4 = df[['updrs_4'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs4 = df_updrs4.iloc[:, 1:]
y_updrs4 = df_updrs4.iloc[:, 0]

# Apply PCA to reduce the dimensionality of the data.
pca = PCA(n_components = 50)
X_updrs4 = pca.fit_transform(X_updrs4)

# Select the top 10 features based on the F-test score.
selector4 = SelectKBest(score_func = f_regression, k = 10)

# Fit the selector on the independent variables and target variable.
X_new4 = selector4.fit_transform(X_updrs4, y_updrs4)

# Get the indices of the selected features.
selected_indices4 = selector4.get_support(indices = True)

# Get the names of the selected features.
selected_X_updrs4 = df.columns[4:][selected_indices4]

# Split the dataset into training and testing sets.
X_train_updrs4, X_test_updrs4, y_train_updrs4, y_test_updrs4 = train_test_split(X_new4, y_updrs4, test_size = 0.2, random_state = 42)

# Standardize the independent variables.
scaler4 = StandardScaler()
X_train_updrs4 = scaler4.fit_transform(X_train_updrs4)
X_test_updrs4 = scaler4.transform(X_test_updrs4)

# Fit a Lasso regression model on the training set.
lasso = Lasso(alpha = 0.1)
lasso.fit(X_train_updrs4, y_train_updrs4)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs4_lasso = lasso.predict(X_test_updrs4)
y_pred_updrs4_lasso = np.where(y_pred_updrs4_lasso < 0, 0, y_pred_updrs4_lasso)

# Evaluate the performance of the Lasso model.
mse_updrs4_lasso = mean_squared_error(y_test_updrs4, y_pred_updrs4_lasso)
mae_updrs4_lasso = mean_absolute_error(y_test_updrs4, y_pred_updrs4_lasso)
r2_updrs4_lasso = r2_score(y_test_updrs4, y_pred_updrs4_lasso)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_4
print("mse_updrs4:", mse_updrs4_lasso)
print("mae_updrs4:", mae_updrs4_lasso)
print("r2_updrs4:", r2_updrs4_lasso)
print("SMAPE_updrs4:", smape(y_test_updrs4, y_pred_updrs4_lasso))

# Fit a Ridge regression model on the training set.
ridge = Ridge(alpha = 0.1)
ridge.fit(X_train_updrs4, y_train_updrs4)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs4_ridge = ridge.predict(X_test_updrs4)
y_pred_updrs4_ridge = np.where(y_pred_updrs4_ridge < 0, 0, y_pred_updrs4_ridge)

# Evaluate the performance of the Ridge model.
mse_updrs4_ridge = mean_squared_error(y_test_updrs4, y_pred_updrs4_ridge)
mae_updrs4_ridge = mean_absolute_error(y_test_updrs4, y_pred_updrs4_ridge)
r2_updrs4_ridge = r2_score(y_test_updrs4, y_pred_updrs4_ridge)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_4
print("mse_updrs4:", mse_updrs4_ridge)
print("mae_updrs4:", mae_updrs4_ridge)
print("r2_updrs4:", r2_updrs4_ridge)
print("SMAPE_updrs4:", smape(y_test_updrs4, y_pred_updrs4_ridge))

mse_updrs4: 7.223798871369722
mae_updrs4: 2.2338706225100564
r2_updrs4: 0.0215584307046357
SMAPE_updrs4: 152.4599661052674
mse_updrs4: 7.121404821443171
mae_updrs4: 2.202910431022824
r2_updrs4: 0.03542739309971488
SMAPE_updrs4: 152.73565228619927


## Results

In [58]:
# Add a title to the DataFrame.
print("The Results with Univariate Feature Selection, PCA, and Lasso")

# Create a dictionary with the metrics for each target.
metrics_dict_lasso = {
    'Target': ['UPDRS 1', 'UPDRS 2', 'UPDRS 3', 'UPDRS 4'],
    'MSE': [mse_updrs1_lasso, mse_updrs2_lasso, mse_updrs3_lasso, mse_updrs4_lasso],
    'MAE': [mae_updrs1_lasso, mae_updrs2_lasso, mae_updrs3_lasso, mae_updrs4_lasso],
    'R2': [r2_updrs1_lasso, r2_updrs2_lasso, r2_updrs3_lasso, r2_updrs4_lasso],
    'SMAPE': [smape(y_test_updrs1, y_pred_updrs1_lasso), smape(y_test_updrs2, y_pred_updrs2_lasso), 
              smape(y_test_updrs3, y_pred_updrs3_lasso), smape(y_test_updrs4, y_pred_updrs4_lasso)]
}

# Create a Pandas DataFrame from the dictionary.
metrics_df_lasso = pd.DataFrame(metrics_dict_lasso)

# Set the 'Target' column as the index.
metrics_df_lasso.set_index('Target', inplace = True)

# Display the DataFrame.
metrics_df_lasso

The Results with Univariate Feature Selection, PCA, and Lasso


Unnamed: 0_level_0,MSE,MAE,R2,SMAPE
Target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
UPDRS 1,23.249288,4.016411,-0.056789,75.608666
UPDRS 2,35.156373,4.766656,0.012407,103.805442
UPDRS 3,243.845296,13.017687,0.018579,96.449483
UPDRS 4,7.223799,2.233871,0.021558,152.459966


In [59]:
# Add a title to the DataFrame.
print("The Results with Univariate Feature Selection, PCA, and Ridge")

# Create a dictionary with the metrics for each target.
metrics_dict_ridge = {
    'Target': ['UPDRS 1', 'UPDRS 2', 'UPDRS 3', 'UPDRS 4'],
    'MSE': [mse_updrs1_ridge, mse_updrs2_ridge, mse_updrs3_ridge, mse_updrs4_ridge],
    'MAE': [mae_updrs1_ridge, mae_updrs2_ridge, mae_updrs3_ridge, mae_updrs4_ridge],
    'R2': [r2_updrs1_ridge, r2_updrs2_ridge, r2_updrs3_ridge, r2_updrs4_ridge],
    'SMAPE': [smape(y_test_updrs1, y_pred_updrs1_ridge), smape(y_test_updrs2, y_pred_updrs2_ridge), 
              smape(y_test_updrs3, y_pred_updrs3_ridge), smape(y_test_updrs4, y_pred_updrs4_ridge)]
}

# Create a Pandas DataFrame from the dictionary.
metrics_df_ridge = pd.DataFrame(metrics_dict_ridge)

# Set the 'Target' column as the index.
metrics_df_ridge.set_index('Target', inplace = True)

# Display the DataFrame.
metrics_df_ridge

The Results with Univariate Feature Selection, PCA, and Ridge


Unnamed: 0_level_0,MSE,MAE,R2,SMAPE
Target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
UPDRS 1,23.354055,4.016331,-0.030828,75.59928
UPDRS 2,35.177757,4.756698,0.011806,104.209599
UPDRS 3,244.088327,13.014122,0.017601,96.534643
UPDRS 4,7.121405,2.20291,0.035427,152.735652


In [60]:
# Add a title to the DataFrame.
print("The Results without Features Selection")

# comparison with the results without features selection
metrics_df_all

The Results without Features Selection


Unnamed: 0_level_0,MSE,MAE,R2,SMAPE
Target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
UPDRS 1,181.362567,8.976865,-7.243778,123.654261
UPDRS 2,141.061387,7.715337,-2.962616,119.040559
UPDRS 3,886.637958,21.082452,-2.568514,115.795856
UPDRS 4,11.084991,2.387561,-0.501428,122.636215


In [61]:
# Add a title to the DataFrame.
print("The Results with Univariate Feature Selection")

# comparison with the results with Univariate Feature Selection
metrics_df_KBest

The Results with Univariate Feature Selection


Unnamed: 0_level_0,MSE,MAE,R2,SMAPE
Target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
UPDRS 1,22.051638,3.902138,-0.00235,74.352861
UPDRS 2,34.647653,4.738632,0.026698,101.986585
UPDRS 3,237.379871,12.838956,0.044601,96.618764
UPDRS 4,7.907121,2.161096,-0.070995,148.004627


Based on the results, we can see that using Univariate Feature Selection, PCA, and Lasso and Ridge models **did not improve the performance of the models significantly compared to using only Univariate Feature Selection**. In fact, the R2 score and SMAPE values got worse when using PCA, and Lasso and Ridge models. Therefore, **it might be better to stick with the Univariate Feature Selection method** for this specific dataset and target. However, it's always good to experiment with different methods to see if they can improve the performance of the models.

# Random Forest Regressor Model with Univariate Feature Selection and PCA

To add Tree-Based methods, we can import the necessary classes and functions from **scikit-learn's tree module**. Next, we use the RandomForestRegressor class to **fit a Random Forest model** on the training set. The **n_estimators** parameter specifies the number of trees in the forest, and **random_state** is set for reproducibility. We then predict the values of the dependent variable on the testing set and evaluate the performance of the model using mean squared error, mean absolute error, r2 score, and SMAPE.

## updrs_1

In [62]:
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

# Separate the dataset for updrs_1.
df_updrs1 = df[['updrs_1'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs1 = df_updrs1.iloc[:, 1:]
y_updrs1 = df_updrs1.iloc[:, 0]

# Apply PCA to reduce the dimensionality of the data.
pca = PCA(n_components = 50)
X_updrs1 = pca.fit_transform(X_updrs1)

# Select the top 10 features based on the F-test score.
selector1 = SelectKBest(score_func = f_regression, k = 10)

# Fit the selector on the independent variables and target variable.
X_new1 = selector1.fit_transform(X_updrs1, y_updrs1)

# Get the indices of the selected features.
selected_indices1 = selector1.get_support(indices = True)

# Get the names of the selected features.
selected_X_updrs1 = df.columns[4:][selected_indices1]

In [63]:
# the selected features
X_new1

array([[-2.98551712e+06,  1.37471114e+07, -5.59155966e+05, ...,
        -8.12973464e+05, -3.76998956e+05, -3.55449277e+05],
       [-7.90717211e+06,  6.40671120e+06,  1.99162746e+06, ...,
        -2.57782203e+06, -7.96356032e+04, -2.73623494e+05],
       [-2.39707867e+06,  1.08494888e+05, -1.66312245e+06, ...,
        -1.49739517e+06, -2.82346017e+05,  6.93387095e+05],
       ...,
       [-3.44761598e+06, -2.81951556e+06,  7.47572913e+05, ...,
        -8.11629750e+05,  6.90787035e+05, -7.23149502e+05],
       [-3.50849222e+06, -4.61604150e+06,  8.78829064e+05, ...,
        -4.34797227e+05,  1.59489248e+05, -1.03625704e+06],
       [-4.48664213e+06,  1.17638299e+04, -1.46451375e+05, ...,
         1.40331410e+06,  4.41626854e+05, -7.82908290e+05]])

In [64]:
# The selected features does not include visit_month.
selected_X_updrs1

Index(['AADDTWEPFASGK', 'ADDKETC(UniMod_4)FAEEGK', 'ADDKETC(UniMod_4)FAEEGKK',
       'AEFAEVSK', 'AESPEVC(UniMod_4)FNEESPK', 'AGALNSNDAFVLK',
       'AGDFLEANYMNLQR', 'AGLAASLAGPHSIVGR', 'AIPVTQYLK', 'ALTDMPQMR'],
      dtype='object')

In [65]:
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor

# Separate the dataset for updrs_1.
df_updrs1 = df[['updrs_1'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs1 = df_updrs1.iloc[:, 1:]
y_updrs1 = df_updrs1.iloc[:, 0]

# Apply PCA to reduce the dimensionality of the data.
pca = PCA(n_components = 50)
X_updrs1 = pca.fit_transform(X_updrs1)

# Select the top 10 features with the highest F-values.
selector1 = SelectKBest(f_regression, k = 10)
X_new1 = selector1.fit_transform(X_updrs1, y_updrs1)

# Add visit_month column to X_new.
X_new1 = np.column_stack((df_updrs1.iloc[:, 1].values, X_new1))

# Split the dataset into training and testing sets.
X_train_updrs1, X_test_updrs1, y_train_updrs1, y_test_updrs1 = train_test_split(X_new1, y_updrs1, test_size = 0.2, random_state = 42)

# Standardize the independent variables.
scaler1 = StandardScaler()
X_train_updrs1 = scaler1.fit_transform(X_train_updrs1)
X_test_updrs1 = scaler1.transform(X_test_updrs1)

# Fit a Random Forest regression model on the training set.
model_updrs1 = RandomForestRegressor(n_estimators = 100, random_state = 42)
model_updrs1.fit(X_train_updrs1, y_train_updrs1)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs1 = model_updrs1.predict(X_test_updrs1)
y_pred_updrs1 = np.where(y_pred_updrs1 < 0, 0, y_pred_updrs1)

# Evaluate the performance of the model.
mse_updrs1 = mean_squared_error(y_test_updrs1, y_pred_updrs1)
mae_updrs1 = mean_absolute_error(y_test_updrs1, y_pred_updrs1)
r2_updrs1 = r2_score(y_test_updrs1, y_pred_updrs1)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_1
print("mse_updrs1:", mse_updrs1)
print("mae_updrs1:", mae_updrs1)
print("r2_updrs1:", r2_updrs1)
print("SMAPE_updrs1:", smape(y_test_updrs1, y_pred_updrs1))

mse_updrs1: 23.326457476635515
mae_updrs1: 4.015654205607476
r2_updrs1: -0.060296678838601014
SMAPE_updrs1: 75.87679659726628


## updrs_2

In [66]:
# Separate the dataset for updrs_2.
df_updrs2 = df[['updrs_2'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs2 = df_updrs2.iloc[:, 1:]
y_updrs2 = df_updrs2.iloc[:, 0]

# Apply PCA to reduce the dimensionality of the data.
pca = PCA(n_components = 50)
X_updrs2 = pca.fit_transform(X_updrs2)

# Select the top 10 features based on the F-test score.
selector2 = SelectKBest(score_func = f_regression, k = 10)

# Fit the selector on the independent variables and target variable.
X_new2 = selector2.fit_transform(X_updrs2, y_updrs2)

# Get the indices of the selected features.
selected_indices2 = selector2.get_support(indices = True)

# Get the names of the selected features.
selected_X_updrs2 = df.columns[4:][selected_indices2]

In [67]:
# The selected features does not include visit_month.
selected_X_updrs2

Index(['AADDTWEPFASGK', 'AAFGQGSGPIMLDEVQC(UniMod_4)TGTEASLADC(UniMod_4)K',
       'AANEVSSADVK', 'ADDKETC(UniMod_4)FAEEGK', 'AEFAEVSK',
       'AESPEVC(UniMod_4)FNEESPK', 'AGDFLEANYMNLQR', 'AGLQVYNK',
       'AIQLTYNPDESSKPNMIDAATLK', 'AKAYLEEEC(UniMod_4)PATLRK'],
      dtype='object')

In [68]:
# Separate the dataset for updrs_2.
df_updrs2 = df[['updrs_2'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs2 = df_updrs2.iloc[:, 1:]
y_updrs2 = df_updrs2.iloc[:, 0]

# Apply PCA to reduce the dimensionality of the data.
pca = PCA(n_components = 50)
X_updrs2 = pca.fit_transform(X_updrs2)

# Select the top 10 features with the highest F-values.
selector2 = SelectKBest(f_regression, k = 10)
X_new2 = selector2.fit_transform(X_updrs2, y_updrs2)

# Add visit_month column to X_new.
X_new2 = np.column_stack((df_updrs2.iloc[:, 1].values, X_new2))

# Split the dataset into training and testing sets.
X_train_updrs2, X_test_updrs2, y_train_updrs2, y_test_updrs2 = train_test_split(X_new2, y_updrs2, test_size = 0.2, random_state = 42)

# Standardize the independent variables.
scaler2 = StandardScaler()
X_train_updrs2 = scaler2.fit_transform(X_train_updrs2)
X_test_updrs2 = scaler2.transform(X_test_updrs2)

# Fit a Random Forest regression model on the training set.
model_updrs2 = RandomForestRegressor(n_estimators = 100, random_state = 42)
model_updrs2.fit(X_train_updrs2, y_train_updrs2)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs2 = model_updrs2.predict(X_test_updrs2)
y_pred_updrs2 = np.where(y_pred_updrs2 < 0, 0, y_pred_updrs2)

# Evaluate the performance of the model.
mse_updrs2 = mean_squared_error(y_test_updrs2, y_pred_updrs2)
mae_updrs2 = mean_absolute_error(y_test_updrs2, y_pred_updrs2)
r2_updrs2 = r2_score(y_test_updrs2, y_pred_updrs2)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_2
print("mse_updrs2:", mse_updrs2)
print("mae_updrs2:", mae_updrs2)
print("r2_updrs2:", r2_updrs2)
print("SMAPE_updrs2:", smape(y_test_updrs2, y_pred_updrs2))

mse_updrs2: 34.0467163551402
mae_updrs2: 4.657710280373832
r2_updrs2: 0.04357900012758775
SMAPE_updrs2: 102.22418920478191


## updrs_3

In [69]:
# Separate the dataset for updrs_3.
df_updrs3 = df[['updrs_3'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs3 = df_updrs3.iloc[:, 1:]
y_updrs3 = df_updrs3.iloc[:, 0]

# Apply PCA to reduce the dimensionality of the data.
pca = PCA(n_components = 50)
X_updrs3 = pca.fit_transform(X_updrs3)

# Select the top 10 features based on the F-test score.
selector3 = SelectKBest(score_func = f_regression, k = 10)

# Fit the selector on the independent variables and target variable.
X_new3 = selector3.fit_transform(X_updrs3, y_updrs3)

# Get the indices of the selected features.
selected_indices3 = selector3.get_support(indices = True)

# Get the names of the selected features.
selected_X_updrs3 = df.columns[4:][selected_indices3]

In [70]:
# The selected features does not include visit_month.
selected_X_updrs3

Index(['AAFGQGSGPIMLDEVQC(UniMod_4)TGTEASLADC(UniMod_4)K',
       'ADDKETC(UniMod_4)FAEEGK', 'ADRDQYELLC(UniMod_4)LDNTR', 'AEFAEVSK',
       'AGC(UniMod_4)VAESTAVC(UniMod_4)R', 'AGDFLEANYMNLQR',
       'AIGAVPLIQGEYMIPC(UniMod_4)EK', 'AKLEEQAQQIR', 'ALFLETEQLK',
       'ALMSPAGMLR'],
      dtype='object')

In [71]:
# Separate the dataset for updrs_3.
df_updrs3 = df[['updrs_3'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs3 = df_updrs3.iloc[:, 1:]
y_updrs3 = df_updrs3.iloc[:, 0]

# Apply PCA to reduce the dimensionality of the data.
pca = PCA(n_components = 50)
X_updrs3 = pca.fit_transform(X_updrs3)

# Select the top 10 features with the highest F-values.
selector3 = SelectKBest(f_regression, k = 10)
X_new3 = selector3.fit_transform(X_updrs3, y_updrs3)

# Add visit_month column to X_new.
X_new3 = np.column_stack((df_updrs3.iloc[:, 1].values, X_new3))

# Split the dataset into training and testing sets.
X_train_updrs3, X_test_updrs3, y_train_updrs3, y_test_updrs3 = train_test_split(X_new3, y_updrs3, test_size = 0.2, random_state = 42)

# Standardize the independent variables.
scaler3 = StandardScaler()
X_train_updrs3 = scaler3.fit_transform(X_train_updrs3)
X_test_updrs3 = scaler3.transform(X_test_updrs3)

# Fit a Random Forest regression model on the training set.
model_updrs3 = RandomForestRegressor(n_estimators = 100, random_state = 42)
model_updrs3.fit(X_train_updrs3, y_train_updrs3)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs3 = model_updrs3.predict(X_test_updrs3)
y_pred_updrs3 = np.where(y_pred_updrs3 < 0, 0, y_pred_updrs3)

# Evaluate the performance of the model.
mse_updrs3 = mean_squared_error(y_test_updrs3, y_pred_updrs3)
mae_updrs3 = mean_absolute_error(y_test_updrs3, y_pred_updrs3)
r2_updrs3 = r2_score(y_test_updrs3, y_pred_updrs3)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_3
print("mse_updrs3:", mse_updrs3)
print("mae_updrs3:", mae_updrs3)
print("r2_updrs3:", r2_updrs3)
print("SMAPE_updrs3:", smape(y_test_updrs3, y_pred_updrs3))

mse_updrs3: 233.6827136792453
mae_updrs3: 12.71816037735849
r2_updrs3: 0.059480716169862724
SMAPE_updrs3: 95.02040491570445


## updrs_4

In [72]:
# Separate the dataset for updrs_4.
df_updrs4 = df[['updrs_4'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs4 = df_updrs4.iloc[:, 1:]
y_updrs4 = df_updrs4.iloc[:, 0]

# Apply PCA to reduce the dimensionality of the data.
pca = PCA(n_components = 50)
X_updrs4 = pca.fit_transform(X_updrs4)

# Select the top 10 features based on the F-test score.
selector4 = SelectKBest(score_func = f_regression, k = 10)

# Fit the selector on the independent variables and target variable.
X_new4 = selector4.fit_transform(X_updrs4, y_updrs4)

# Get the indices of the selected features.
selected_indices4 = selector4.get_support(indices = True)

# Get the names of the selected features.
selected_X_updrs4 = df.columns[4:][selected_indices4]

In [73]:
# The selected features does not include visit_month.
selected_X_updrs4

Index(['AAFTEC(UniMod_4)C(UniMod_4)QAADK', 'AATGEC(UniMod_4)TATVGKR',
       'ADSGEGDFLAEGGGVR', 'AELQC(UniMod_4)PQPAA', 'AFPALTSLDLSDNPGLGER',
       'AGDFLEANYMNLQR', 'AIGYLNTGYQR', 'AKPALEDLR', 'ALFLETEQLK',
       'ALQDQLVLVAAK'],
      dtype='object')

In [74]:
# Separate the dataset for updrs_4.
df_updrs4 = df[['updrs_4'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs4 = df_updrs4.iloc[:, 1:]
y_updrs4 = df_updrs4.iloc[:, 0]

# Apply PCA to reduce the dimensionality of the data.
pca = PCA(n_components = 50)
X_updrs4 = pca.fit_transform(X_updrs4)

# Select the top 10 features with the highest F-values.
selector4 = SelectKBest(f_regression, k = 10)
X_new4 = selector4.fit_transform(X_updrs4, y_updrs4)

# Add visit_month column to X_new.
X_new4 = np.column_stack((df_updrs4.iloc[:, 1].values, X_new4))

# Split the dataset into training and testing sets.
X_train_updrs4, X_test_updrs4, y_train_updrs4, y_test_updrs4 = train_test_split(X_new4, y_updrs4, test_size = 0.2, random_state = 42)

# Standardize the independent variables.
scaler4 = StandardScaler()
X_train_updrs4 = scaler4.fit_transform(X_train_updrs4)
X_test_updrs4 = scaler4.transform(X_test_updrs4)

# Fit a Random Forest regression model on the training set.
model_updrs4 = RandomForestRegressor(n_estimators = 100, random_state = 42)
model_updrs4.fit(X_train_updrs4, y_train_updrs4)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs4 = model_updrs4.predict(X_test_updrs4)
y_pred_updrs4 = np.where(y_pred_updrs4 < 0, 0, y_pred_updrs4)

# Evaluate the performance of the model.
mse_updrs4 = mean_squared_error(y_test_updrs4, y_pred_updrs4)
mae_updrs4 = mean_absolute_error(y_test_updrs4, y_pred_updrs4)
r2_updrs4 = r2_score(y_test_updrs4, y_pred_updrs4)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_4
print("mse_updrs4:", mse_updrs4)
print("mae_updrs4:", mae_updrs4)
print("r2_updrs4:", r2_updrs4)
print("SMAPE_updrs4:", smape(y_test_updrs4, y_pred_updrs4))

mse_updrs4: 8.716224561403509
mae_updrs4: 2.3578947368421055
r2_updrs4: -0.18058608635837792
SMAPE_updrs4: 155.06059219742357


## Results

In [75]:
# Add a title to the DataFrame.
print("The Results with Random Forest Regressor, Univariate Feature Selection, and PCA")

# Create a dictionary with the metrics for each target.
metrics_dict_RFR = {
    'Target': ['UPDRS 1', 'UPDRS 2', 'UPDRS 3', 'UPDRS 4'],
    'MSE': [mse_updrs1, mse_updrs2, mse_updrs3, mse_updrs4],
    'MAE': [mae_updrs1, mae_updrs2, mae_updrs3, mae_updrs4],
    'R2': [r2_updrs1, r2_updrs2, r2_updrs3, r2_updrs4],
    'SMAPE': [smape(y_test_updrs1, y_pred_updrs1), smape(y_test_updrs2, y_pred_updrs2), 
              smape(y_test_updrs3, y_pred_updrs3), smape(y_test_updrs4, y_pred_updrs4)]
}

# Create a Pandas DataFrame from the dictionary.
metrics_df_RFR = pd.DataFrame(metrics_dict_RFR)

# Set the 'Target' column as the index.
metrics_df_RFR.set_index('Target', inplace = True)

# Display the DataFrame.
metrics_df_RFR

The Results with Random Forest Regressor, Univariate Feature Selection, and PCA


Unnamed: 0_level_0,MSE,MAE,R2,SMAPE
Target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
UPDRS 1,23.326457,4.015654,-0.060297,75.876797
UPDRS 2,34.046716,4.65771,0.043579,102.224189
UPDRS 3,233.682714,12.71816,0.059481,95.020405
UPDRS 4,8.716225,2.357895,-0.180586,155.060592


In [76]:
# Add a title to the DataFrame.
print("The Results without Features Selection")

# comparison with the results without features selection
metrics_df_all

The Results without Features Selection


Unnamed: 0_level_0,MSE,MAE,R2,SMAPE
Target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
UPDRS 1,181.362567,8.976865,-7.243778,123.654261
UPDRS 2,141.061387,7.715337,-2.962616,119.040559
UPDRS 3,886.637958,21.082452,-2.568514,115.795856
UPDRS 4,11.084991,2.387561,-0.501428,122.636215


In [77]:
# Add a title to the DataFrame.
print("The Results with Univariate Feature Selection")

# comparison with the results with Univariate Feature Selection
metrics_df_KBest

The Results with Univariate Feature Selection


Unnamed: 0_level_0,MSE,MAE,R2,SMAPE
Target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
UPDRS 1,22.051638,3.902138,-0.00235,74.352861
UPDRS 2,34.647653,4.738632,0.026698,101.986585
UPDRS 3,237.379871,12.838956,0.044601,96.618764
UPDRS 4,7.907121,2.161096,-0.070995,148.004627


Based on the results, we can see that using Univariate Feature Selection, PCA, and Random Forest regression models **did not improve the performance of the models significantly compared to using only Univariate Feature Selection**. In fact, some values got worse when using PCA and Random Forest regression models. Therefore, **it might be better to stick with the Univariate Feature Selection method** for this specific dataset and target. However, it's always good to experiment with different methods to see if they can improve the performance of the models.

# Create Dataset (Proteins)

Next, we will attempt to **select features for proteins instead of peptides.**

In [78]:
# Read the CSV files.
proteins = pd.read_csv('/kaggle/input/amp-parkinsons-disease-progression-prediction/train_proteins.csv')
peptides = pd.read_csv('/kaggle/input/amp-parkinsons-disease-progression-prediction/train_peptides.csv')
clinical = pd.read_csv('/kaggle/input/amp-parkinsons-disease-progression-prediction/train_clinical_data.csv')

# Merge the proteins data and peptides data on the common columns.
merged_proteins_peptides = pd.merge(proteins, peptides, on = ['visit_id', 'visit_month', 'patient_id', 'UniProt'])

# Merge the merged protein-peptides data with the clinical data on the common columns.
merged = pd.merge(merged_proteins_peptides, clinical, on = ['visit_id', 'visit_month', 'patient_id'])

merged['Peptide / Protein'] = merged['PeptideAbundance'] / merged['NPX']
merged

merged_dd = merged.iloc[:, :5].drop_duplicates(subset = ['visit_id', 'visit_month', 'patient_id', 'UniProt'])

# Pivot the data.
pivoted = merged_dd.pivot(index = 'visit_id', columns = ['UniProt'], values = 'NPX')

# Add visit_month, the 4 scores, and medication status.
df = pd.merge(clinical, pivoted, on = 'visit_id', how = 'right').set_index('visit_id')

# Insert the visit_month column to the desired position.
df.insert(6, 'visit_month', df.pop('visit_month'))

df = df.drop('patient_id', axis = 1)

# Replace NaN with 0 in the Proteins columns.
df.loc[:, 'O00391':] = df.loc[:, 'O00391':].fillna(0)

# Drop upd23b_clinical_state_on_medication column from the previous merged train dataset df.
df = df.drop('upd23b_clinical_state_on_medication', axis = 1)

# Show the table.
df

Unnamed: 0_level_0,updrs_1,updrs_2,updrs_3,updrs_4,visit_month,O00391,O00533,O00584,O14498,O14773,...,Q9HDC9,Q9NQ79,Q9NYU2,Q9UBR2,Q9UBX5,Q9UHG2,Q9UKV8,Q9UNU6,Q9Y646,Q9Y6R7
visit_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
10053_0,3.0,0.0,13.0,0.0,0,9104.27,402321.0,0.00,0.0,7150.57,...,0.0,9469.45,94237.6,0.00,23016.0,177983.0,65900.0,15382.0,0.00,19017.40
10053_12,4.0,2.0,8.0,0.0,12,10464.20,435586.0,0.00,0.0,0.00,...,0.0,14408.40,0.0,0.00,28537.0,171733.0,65668.1,0.0,9295.65,25697.80
10053_18,2.0,2.0,0.0,0.0,18,13235.70,507386.0,7126.96,24525.7,0.00,...,317477.0,38667.20,111107.0,0.00,37932.6,245188.0,59986.1,10813.3,0.00,29102.70
10138_12,3.0,6.0,31.0,0.0,12,12600.20,494581.0,9165.06,27193.5,22506.10,...,557904.0,44556.90,155619.0,14647.90,36927.7,229232.0,106564.0,26077.7,21441.80,7642.42
10138_24,4.0,7.0,19.0,10.0,24,12003.20,522138.0,4498.51,17189.8,29112.40,...,0.0,47836.70,177619.0,17061.10,25510.4,176722.0,59471.4,12639.2,15091.40,6168.55
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8699_24,11.0,10.0,13.0,2.0,24,9983.00,400290.0,24240.10,0.0,16943.50,...,0.0,25690.60,0.0,6859.82,19106.7,121161.0,113872.0,14413.9,28225.50,8062.07
942_12,5.0,2.0,25.0,0.0,12,6757.32,360858.0,18367.60,14760.7,18603.40,...,45742.3,33518.60,94049.7,13415.70,21324.7,234094.0,82410.4,19183.7,17804.10,12277.00
942_24,2.0,3.0,23.0,,24,0.00,352722.0,22834.90,23393.1,16693.50,...,180475.0,29770.60,95949.9,11344.40,23637.6,256654.0,76931.9,19168.2,19215.90,14625.60
942_48,2.0,6.0,35.0,0.0,48,11627.80,251820.0,22046.50,26360.5,22440.20,...,197987.0,29283.80,121696.0,19169.80,16724.9,232301.0,96905.9,21120.9,14089.80,16418.50


# Linear Regression Model with Univariate Feature Selection (Proteins)

## updrs_1

In [79]:
from sklearn.feature_selection import SelectKBest, f_regression

# Separate the dataset for updrs_1.
df_updrs1 = df[['updrs_1'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs1 = df_updrs1.iloc[:, 1:]
y_updrs1 = df_updrs1.iloc[:, 0]

# Select the top 10 features based on the F-test score.
selector1 = SelectKBest(score_func = f_regression, k = 10)

# Fit the selector on the independent variables and target variable.
X_new1 = selector1.fit_transform(X_updrs1, y_updrs1)

# Get the indices of the selected features.
selected_indices1 = selector1.get_support(indices = True)

# Get the names of the selected features.
selected_X_updrs1 = X_updrs1.columns[selected_indices1]

In [80]:
# the selected features
X_new1

array([[ 56798.8 , 117800.  ,      0.  , ...,      0.  ,      0.  ,
             0.  ],
       [ 50378.9 ,  82773.8 ,      0.  , ...,      0.  ,      0.  ,
         16311.6 ],
       [ 55975.3 , 224830.  ,      0.  , ...,  14336.  ,  19092.7 ,
         26687.2 ],
       ...,
       [ 63576.2 , 261018.  ,   4758.23, ...,  33745.2 ,      0.  ,
         18745.8 ],
       [ 67315.  , 218842.  ,   4936.86, ...,  28882.6 ,      0.  ,
         24418.9 ],
       [ 78920.8 , 259747.  ,   6398.05, ...,  31363.6 ,  12277.  ,
         24243.2 ]])

In [81]:
# The selected features does not include visit_month.
selected_X_updrs1

Index(['P04180', 'P05060', 'P05408', 'P13591', 'P14618', 'P17174', 'P43121',
       'Q06481', 'Q99829', 'Q9BY67'],
      dtype='object')

In [82]:
from sklearn.feature_selection import SelectKBest, f_regression

# Separate the dataset for updrs_1.
df_updrs1 = df[['updrs_1'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs1 = df_updrs1.iloc[:, 1:]
y_updrs1 = df_updrs1.iloc[:, 0]

# Select the top 10 features with the highest F-values.
selector1 = SelectKBest(f_regression, k = 10)
X_new1 = selector1.fit_transform(X_updrs1, y_updrs1)

# Add visit_month column to X_new.
X_new1 = np.column_stack((df_updrs1.iloc[:, 1].values, X_new1))

# Split the dataset into training and testing sets.
X_train_updrs1, X_test_updrs1, y_train_updrs1, y_test_updrs1 = train_test_split(X_new1, y_updrs1, test_size = 0.2, random_state = 42)

# Standardize the independent variables.
scaler1 = StandardScaler()
X_train_updrs1 = scaler1.fit_transform(X_train_updrs1)
X_test_updrs1 = scaler1.transform(X_test_updrs1)

# Fit a linear regression model on the training set.
model_updrs1 = LinearRegression()
model_updrs1.fit(X_train_updrs1, y_train_updrs1)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs1 = model_updrs1.predict(X_test_updrs1)
y_pred_updrs1 = np.where(y_pred_updrs1 < 0, 0, y_pred_updrs1)

# Evaluate the performance of the model.
mse_updrs1 = mean_squared_error(y_test_updrs1, y_pred_updrs1)
mae_updrs1 = mean_absolute_error(y_test_updrs1, y_pred_updrs1)
r2_updrs1 = r2_score(y_test_updrs1, y_pred_updrs1)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_1
print("mse_updrs1:", mse_updrs1)
print("mae_updrs1:", mae_updrs1)
print("r2_updrs1:", r2_updrs1)
print("SMAPE_updrs1:", smape(y_test_updrs1, y_pred_updrs1))

mse_updrs1: 21.60860577275993
mae_updrs1: 3.8564748482972804
r2_updrs1: 0.017787722025993102
SMAPE_updrs1: 74.18462172397935


## updrs_2

In [83]:
# Separate the dataset for updrs_2.
df_updrs2 = df[['updrs_2'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs2 = df_updrs2.iloc[:, 1:]
y_updrs2 = df_updrs2.iloc[:, 0]

# Select the top 10 features based on the F-test score.
selector2 = SelectKBest(score_func = f_regression, k = 10)

# Fit the selector on the independent variables and target variable.
X_new2 = selector2.fit_transform(X_updrs2, y_updrs2)

# Get the indices of the selected features.
selected_indices2 = selector2.get_support(indices = True)

# Get the names of the selected features.
selected_X_updrs2 = X_updrs2.columns[selected_indices2]

In [84]:
# The selected features does not include visit_month.
selected_X_updrs2

Index(['O00533', 'O15240', 'P02787', 'P04180', 'P05060', 'P13521', 'P17174',
       'P40925', 'P43121', 'Q06481'],
      dtype='object')

In [85]:
# Separate the dataset for updrs_2.
df_updrs2 = df[['updrs_2'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs2 = df_updrs2.iloc[:, 1:]
y_updrs2 = df_updrs2.iloc[:, 0]

# Select the top 10 features with the highest F-values.
selector2 = SelectKBest(f_regression, k = 10)
X_new2 = selector2.fit_transform(X_updrs2, y_updrs2)

# Add visit_month column to X_new.
X_new2 = np.column_stack((df_updrs2.iloc[:, 1].values, X_new2))

# Split the dataset into training and testing sets.
X_train_updrs2, X_test_updrs2, y_train_updrs2, y_test_updrs2 = train_test_split(X_new2, y_updrs2, test_size = 0.2, random_state = 42)

# Standardize the independent variables.
scaler2 = StandardScaler()
X_train_updrs2 = scaler2.fit_transform(X_train_updrs2)
X_test_updrs2 = scaler2.transform(X_test_updrs2)

# Fit a linear regression model on the training set.
model_updrs2 = LinearRegression()
model_updrs2.fit(X_train_updrs2, y_train_updrs2)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs2 = model_updrs2.predict(X_test_updrs2)
y_pred_updrs2 = np.where(y_pred_updrs2 < 0, 0, y_pred_updrs2)

# Evaluate the performance of the model.
mse_updrs2 = mean_squared_error(y_test_updrs2, y_pred_updrs2)
mae_updrs2 = mean_absolute_error(y_test_updrs2, y_pred_updrs2)
r2_updrs2 = r2_score(y_test_updrs2, y_pred_updrs2)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_2
print("mse_updrs2:", mse_updrs2)
print("mae_updrs2:", mae_updrs2)
print("r2_updrs2:", r2_updrs2)
print("SMAPE_updrs2:", smape(y_test_updrs2, y_pred_updrs2))

mse_updrs2: 34.01395764000763
mae_updrs2: 4.659282689680993
r2_updrs2: 0.04449923932936006
SMAPE_updrs2: 102.9078695422691


## updrs_3

In [86]:
# Separate the dataset for updrs_3.
df_updrs3 = df[['updrs_3'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs3 = df_updrs3.iloc[:, 1:]
y_updrs3 = df_updrs3.iloc[:, 0]

# Select the top 10 features based on the F-test score.
selector3 = SelectKBest(score_func = f_regression, k = 10)

# Fit the selector on the independent variables and target variable.
X_new3 = selector3.fit_transform(X_updrs3, y_updrs3)

# Get the indices of the selected features.
selected_indices3 = selector3.get_support(indices = True)

# Get the names of the selected features.
selected_X_updrs3 = X_updrs3.columns[selected_indices3]

In [87]:
# The selected features does not include visit_month.
selected_X_updrs3

Index(['O00533', 'O15240', 'P05060', 'P05067', 'P13521', 'P17174', 'P40925',
       'P43121', 'Q06481', 'Q6UXD5'],
      dtype='object')

In [88]:
# Separate the dataset for updrs_3.
df_updrs3 = df[['updrs_3'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs3 = df_updrs3.iloc[:, 1:]
y_updrs3 = df_updrs3.iloc[:, 0]

# Select the top 10 features with the highest F-values.
selector3 = SelectKBest(f_regression, k = 10)
X_new3 = selector3.fit_transform(X_updrs3, y_updrs3)

# Add visit_month column to X_new.
X_new3 = np.column_stack((df_updrs3.iloc[:, 1].values, X_new3))

# Split the dataset into training and testing sets.
X_train_updrs3, X_test_updrs3, y_train_updrs3, y_test_updrs3 = train_test_split(X_new3, y_updrs3, test_size = 0.2, random_state = 42)

# Standardize the independent variables.
scaler3 = StandardScaler()
X_train_updrs3 = scaler3.fit_transform(X_train_updrs3)
X_test_updrs3 = scaler3.transform(X_test_updrs3)

# Fit a linear regression model on the training set.
model_updrs3 = LinearRegression()
model_updrs3.fit(X_train_updrs3, y_train_updrs3)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs3 = model_updrs3.predict(X_test_updrs3)
y_pred_updrs3 = np.where(y_pred_updrs3 < 0, 0, y_pred_updrs3)

# Evaluate the performance of the model.
mse_updrs3 = mean_squared_error(y_test_updrs3, y_pred_updrs3)
mae_updrs3 = mean_absolute_error(y_test_updrs3, y_pred_updrs3)
r2_updrs3 = r2_score(y_test_updrs3, y_pred_updrs3)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_3
print("mse_updrs3:", mse_updrs3)
print("mae_updrs3:", mae_updrs3)
print("r2_updrs3:", r2_updrs3)
print("SMAPE_updrs3:", smape(y_test_updrs3, y_pred_updrs3))

mse_updrs3: 245.92944157101635
mae_updrs3: 13.095340186668206
r2_updrs3: 0.010190447477411713
SMAPE_updrs3: 97.49542886717269


## updrs_4

In [89]:
# Separate the dataset for updrs_4.
df_updrs4 = df[['updrs_4'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs4 = df_updrs4.iloc[:, 1:]
y_updrs4 = df_updrs4.iloc[:, 0]

# Select the top 10 features based on the F-test score.
selector4 = SelectKBest(score_func = f_regression, k = 10)

# Fit the selector on the independent variables and target variable.
X_new4 = selector4.fit_transform(X_updrs4, y_updrs4)

# Get the indices of the selected features.
selected_indices4 = selector4.get_support(indices = True)

# Get the names of the selected features.
selected_X_updrs4 = X_updrs4.columns[selected_indices4]

In [90]:
# The selected features does not include visit_month.
selected_X_updrs4

Index(['P02747', 'P02774', 'P04211', 'P04217', 'P05155', 'P06454', 'P06681',
       'P23083', 'P39060', 'Q12841'],
      dtype='object')

In [91]:
# Separate the dataset for updrs_4.
df_updrs4 = df[['updrs_4'] + list(df.columns[4:])].dropna()

# Separate the independent variables (predictors) and the dependent variable (target).
X_updrs4 = df_updrs4.iloc[:, 1:]
y_updrs4 = df_updrs4.iloc[:, 0]

# Select the top 10 features with the highest F-values.
selector4 = SelectKBest(f_regression, k = 10)
X_new4 = selector4.fit_transform(X_updrs4, y_updrs4)

# Add visit_month column to X_new.
X_new4 = np.column_stack((df_updrs4.iloc[:, 1].values, X_new4))

# Split the dataset into training and testing sets.
X_train_updrs4, X_test_updrs4, y_train_updrs4, y_test_updrs4 = train_test_split(X_new4, y_updrs4, test_size = 0.2, random_state = 42)

# Standardize the independent variables.
scaler4 = StandardScaler()
X_train_updrs4 = scaler4.fit_transform(X_train_updrs4)
X_test_updrs4 = scaler4.transform(X_test_updrs4)

# Fit a linear regression model on the training set.
model_updrs4 = LinearRegression()
model_updrs4.fit(X_train_updrs4, y_train_updrs4)

# Predict the values of the dependent variable (target) on the testing set.
y_pred_updrs4 = model_updrs4.predict(X_test_updrs4)
y_pred_updrs4 = np.where(y_pred_updrs4 < 0, 0, y_pred_updrs4)

# Evaluate the performance of the model.
mse_updrs4 = mean_squared_error(y_test_updrs4, y_pred_updrs4)
mae_updrs4 = mean_absolute_error(y_test_updrs4, y_pred_updrs4)
r2_updrs4 = r2_score(y_test_updrs4, y_pred_updrs4)

# mean squared error, mean absolute error, r2 score, and SMAPE for updrs_4
print("mse_updrs4:", mse_updrs4)
print("mae_updrs4:", mae_updrs4)
print("r2_updrs4:", r2_updrs4)
print("SMAPE_updrs4:", smape(y_test_updrs4, y_pred_updrs4))

mse_updrs4: 6.9497168882682
mae_updrs4: 2.111500219326687
r2_updrs4: 0.05868200106375754
SMAPE_updrs4: 152.91243826705175


## Results

In [92]:
# Add a title to the DataFrame.
print("The Results with Univariate Feature Selection (Proteins)")

# Create a dictionary with the metrics for each target.
metrics_dict_KBest_pro = {
    'Target': ['UPDRS 1', 'UPDRS 2', 'UPDRS 3', 'UPDRS 4'],
    'MSE': [mse_updrs1, mse_updrs2, mse_updrs3, mse_updrs4],
    'MAE': [mae_updrs1, mae_updrs2, mae_updrs3, mae_updrs4],
    'R2': [r2_updrs1, r2_updrs2, r2_updrs3, r2_updrs4],
    'SMAPE': [smape(y_test_updrs1, y_pred_updrs1), smape(y_test_updrs2, y_pred_updrs2), 
              smape(y_test_updrs3, y_pred_updrs3), smape(y_test_updrs4, y_pred_updrs4)]
}

# Create a Pandas DataFrame from the dictionary.
metrics_df_KBest_pro = pd.DataFrame(metrics_dict_KBest_pro)

# Set the 'Target' column as the index.
metrics_df_KBest_pro.set_index('Target', inplace = True)

# Display the DataFrame.
metrics_df_KBest_pro

The Results with Univariate Feature Selection (Proteins)


Unnamed: 0_level_0,MSE,MAE,R2,SMAPE
Target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
UPDRS 1,21.608606,3.856475,0.017788,74.184622
UPDRS 2,34.013958,4.659283,0.044499,102.90787
UPDRS 3,245.929442,13.09534,0.01019,97.495429
UPDRS 4,6.949717,2.1115,0.058682,152.912438


In [93]:
# Add a title to the DataFrame.
print("The Results with Univariate Feature Selection (Peptides)")

# comparison with the results with Univariate Feature Selection (Peptides)
metrics_df_KBest

The Results with Univariate Feature Selection (Peptides)


Unnamed: 0_level_0,MSE,MAE,R2,SMAPE
Target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
UPDRS 1,22.051638,3.902138,-0.00235,74.352861
UPDRS 2,34.647653,4.738632,0.026698,101.986585
UPDRS 3,237.379871,12.838956,0.044601,96.618764
UPDRS 4,7.907121,2.161096,-0.070995,148.004627


Surprisingly, **both results show similar performances whether we select peptides or proteins** as independent variables and perform their features selection.

Finally, we will try to build a model **only with 'visit_month'** to see better results.  

# Only Visit Month

In [94]:
# Read train CSV files.
train = pd.read_csv("/kaggle/input/amp-parkinsons-disease-progression-prediction/train_clinical_data.csv")
sup = pd.read_csv('/kaggle/input/amp-parkinsons-disease-progression-prediction/supplemental_clinical_data.csv')
train = train.append(sup, ignore_index = True)

# Create datasets.
train1 = train[['visit_month'] + ['updrs_1']].dropna()
train2 = train[['visit_month'] + ['updrs_2']].dropna()
train3 = train[['visit_month'] + ['updrs_3']].dropna()
train4 = train[['visit_month'] + ['updrs_4']].dropna()

x_data1 = train1['visit_month'].values.reshape(-1, 1)
x_data2 = train2['visit_month'].values.reshape(-1, 1)
x_data3 = train3['visit_month'].values.reshape(-1, 1)
x_data4 = train4['visit_month'].values.reshape(-1, 1)
y_data1 = train1['updrs_1']
y_data2 = train2['updrs_2']
y_data3 = train3['updrs_3']
y_data4 = train4['updrs_4']

# Split the dataset into training and testing sets.
x_train1, x_test1, y_train1, y_test1 = train_test_split(x_data1, y_data1, test_size = 0.2, random_state = 42)
x_train2, x_test2, y_train2, y_test2 = train_test_split(x_data2, y_data2, test_size = 0.2, random_state = 42)
x_train3, x_test3, y_train3, y_test3 = train_test_split(x_data3, y_data3, test_size = 0.2, random_state = 42)
x_train4, x_test4, y_train4, y_test4 = train_test_split(x_data4, y_data4, test_size = 0.2, random_state = 42)

In [95]:
# Fit a linear regression model on the training set.
model_updrs1 = LinearRegression()
model_updrs2 = LinearRegression()
model_updrs3 = LinearRegression()
model_updrs4 = LinearRegression()

model_updrs1.fit(x_train1, y_train1)
model_updrs2.fit(x_train2, y_train2)
model_updrs3.fit(x_train3, y_train3)
model_updrs4.fit(x_train4, y_train4)

# Predict the values of the dependent variable (target) on the testing set.
y_pred1 = model_updrs1.predict(x_test1)
y_pred1 = np.where(y_pred1 < 0, 0, y_pred1)
y_pred2 = model_updrs2.predict(x_test2)
y_pred2 = np.where(y_pred2 < 0, 0, y_pred2)
y_pred3 = model_updrs3.predict(x_test3)
y_pred3 = np.where(y_pred3 < 0, 0, y_pred3)
y_pred4 = model_updrs4.predict(x_test4)
y_pred4 = np.where(y_pred4 < 0, 0, y_pred4)

# Evaluate the performance of the model.
mse_updrs1 = mean_squared_error(y_test1, y_pred1)
mae_updrs1 = mean_absolute_error(y_test1, y_pred1)
r2_updrs1 = r2_score(y_test1, y_pred1)

mse_updrs2 = mean_squared_error(y_test2, y_pred2)
mae_updrs2 = mean_absolute_error(y_test2, y_pred2)
r2_updrs2 = r2_score(y_test2, y_pred2)

mse_updrs3 = mean_squared_error(y_test3, y_pred3)
mae_updrs3 = mean_absolute_error(y_test3, y_pred3)
r2_updrs3 = r2_score(y_test3, y_pred3)

mse_updrs4 = mean_squared_error(y_test4, y_pred4)
mae_updrs4 = mean_absolute_error(y_test4, y_pred4)
r2_updrs4 = r2_score(y_test4, y_pred4)

In [96]:
# Add a title to the DataFrame.
print("The Results with 'visit_month' Only")

# Create a dictionary with the metrics for each target.
metrics_dict = {
    'Target': ['UPDRS 1', 'UPDRS 2', 'UPDRS 3', 'UPDRS 4'],
    'MSE': [mse_updrs1, mse_updrs2, mse_updrs3, mse_updrs4],
    'MAE': [mae_updrs1, mae_updrs2, mae_updrs3, mae_updrs4],
    'R2': [r2_updrs1, r2_updrs2, r2_updrs3, r2_updrs4],
    'SMAPE': [smape(y_test1, y_pred1), smape(y_test2, y_pred2), 
              smape(y_test3, y_pred3), smape(y_test4, y_pred4)]
}

# Create a Pandas DataFrame from the dictionary.
metrics = pd.DataFrame(metrics_dict)

# Set the 'Target' column as the index.
metrics.set_index('Target', inplace = True)

# Display the DataFrame.
metrics

The Results with 'visit_month' Only


Unnamed: 0_level_0,MSE,MAE,R2,SMAPE
Target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
UPDRS 1,27.457062,3.917767,0.026106,64.522139
UPDRS 2,33.301509,4.649056,0.015607,79.930793
UPDRS 3,209.25881,11.545114,-0.000532,63.970607
UPDRS 4,6.838405,1.863873,0.044441,162.859787


In [97]:
# Add a title to the DataFrame.
print("The Results with Univariate Feature Selection")

# comparison with the results in the previous time
metrics_df_KBest

The Results with Univariate Feature Selection


Unnamed: 0_level_0,MSE,MAE,R2,SMAPE
Target,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
UPDRS 1,22.051638,3.902138,-0.00235,74.352861
UPDRS 2,34.647653,4.738632,0.026698,101.986585
UPDRS 3,237.379871,12.838956,0.044601,96.618764
UPDRS 4,7.907121,2.161096,-0.070995,148.004627


# Conclusion

So far, **we could not actually discover peptides or proteins that are truly relevant to the clinical symptoms**. In addition, **we can collect information as to visit_month from a larger number of patients**. As a result, **we had better use visit_month only as the independent variable to predict the updrs scores**. The number of subjects is more important than the information as to peptides or proteins.

**The subtitle of this competition is 'Use protein and peptide data measurements from Parkinson's Disease patients to predict progression of the disease.'** Thus, **it seems strange to get the best score only with visit_month**. We should attempt to **combine** the results from **the visit_month models and** the results from **the peptides models**.

I am a medical doctor working on **artificial intelligence (AI) for medicine**. At present AI is also widely used in the medical field. Particularly, AI performs in the healthcare sector following tasks: **image classification, object detection, semantic segmentation, GANs, text classification, etc**. **If you are interested in AI for medicine, please see my other notebooks.**