## AMP®-Parkinson's Disease Progression Prediction
### Use protein and peptide data measurements from Parkinson's Disease patients to predict progression of the disease

## Tasks performed
**Step-1 Understanding the business problem/problem statement**

**Step-2 Getting data (Importing by Pandas)**

**Step-3 Understanding about the data**

**Step-4 Data cleaning**

**Step-5 Data visualization**

**Step-6 EDA Exploratory data analysis**

**Step-7 Feature Engineering**

**Step-8 Feature selection**

**Step-9 Splitting the data**

**Step-10 Model building** 

**Step-11 Prediction and accuracy**

**Step-12 Tuning and improving accuracy**

### Step-1 Understanding the business problem/problem statement

### Problem Statement 
#### The goal of this competition is to predict MDS-UPDRS scores, which measure progression in patients with Parkinson's disease.

#### The Movement Disorder Society-Sponsored Revision of the Unified Parkinson's Disease Rating Scale (MDS-UPDRS) is a comprehensive assessment of both motor and non-motor symptoms associated with Parkinson's. 

#### You will develop a model trained on data of protein and peptide levels over time in subjects with Parkinson’s disease versus normal age-matched control subjects.

#### Your work could help provide important breakthrough information about which molecules change as Parkinson’s disease progresses.

In [1]:
# Import the required libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
# Change the pandas display options so that all columns and all rows
# of a DataFrame are shown.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

### Step-2 Getting data (Importing by Pandas)

In [5]:
# Importing Datasets

In [6]:
# Import train_peptides.csv
train_peptides_df = pd.read_csv('train_peptides.csv')

# Import train_proteins.csv
train_proteins_df = pd.read_csv('train_proteins.csv')

# Import train_clinical_data.csv
train_clinical_data_df = pd.read_csv('train_clinical_data.csv')

# Import supplemental_clinical_data.csv
supplemental_clinical_data_df = pd.read_csv('supplemental_clinical_data.csv')


### Step-3 Understanding about the data

In [7]:
train_peptides_df.head()

Unnamed: 0,visit_id,visit_month,patient_id,UniProt,Peptide,PeptideAbundance
0,55_0,0,55,O00391,NEQEQPLGQWHLS,11254.3
1,55_0,0,55,O00533,GNPEPTFSWTK,102060.0
2,55_0,0,55,O00533,IEIPSSVQQVPTIIK,174185.0
3,55_0,0,55,O00533,KPQSAVYSTGSNGILLC(UniMod_4)EAEGEPQPTIK,27278.9
4,55_0,0,55,O00533,SMEQNGPGLEYR,30838.7


In [8]:
train_proteins_df.head()

Unnamed: 0,visit_id,visit_month,patient_id,UniProt,NPX
0,55_0,0,55,O00391,11254.3
1,55_0,0,55,O00533,732430.0
2,55_0,0,55,O00584,39585.8
3,55_0,0,55,O14498,41526.9
4,55_0,0,55,O14773,31238.0


In [9]:
train_clinical_data_df.head()

Unnamed: 0,visit_id,patient_id,visit_month,updrs_1,updrs_2,updrs_3,updrs_4,upd23b_clinical_state_on_medication
0,55_0,55,0,10.0,6.0,15.0,,
1,55_3,55,3,10.0,7.0,25.0,,
2,55_6,55,6,8.0,10.0,34.0,,
3,55_9,55,9,8.0,9.0,30.0,0.0,On
4,55_12,55,12,10.0,10.0,41.0,0.0,On


In [10]:
supplemental_clinical_data_df.head(3)

Unnamed: 0,visit_id,patient_id,visit_month,updrs_1,updrs_2,updrs_3,updrs_4,upd23b_clinical_state_on_medication
0,35_0,35,0,5.0,3.0,16.0,0.0,
1,35_36,35,36,6.0,4.0,20.0,0.0,
2,75_0,75,0,4.0,6.0,26.0,0.0,


### Step-4 Data cleaning

In [11]:
# Rename columns for train_peptides_df
train_peptides_df.rename(columns={'Peptide': 'Peptide_Seq',
                                  'PeptideAbundance': 'Peptide_Abundance'}, inplace=True)

# Rename columns for train_proteins_df
train_proteins_df.rename(columns={'NPX': 'NPX_Level'}, inplace=True)

# Rename columns for train_clinical_data_df
train_clinical_data_df.rename(columns={'upd23b_clinical_state_on_medication': 'On_Medication'}, inplace=True)

# Rename columns for supplemental_clinical_data_df
supplemental_clinical_data_df.rename(columns={'upd23b_clinical_state_on_medication': 'On_Medication'}, inplace=True)


In [12]:
train_peptides_df.head(2)

Unnamed: 0,visit_id,visit_month,patient_id,UniProt,Peptide_Seq,Peptide_Abundance
0,55_0,0,55,O00391,NEQEQPLGQWHLS,11254.3
1,55_0,0,55,O00533,GNPEPTFSWTK,102060.0


In [13]:
train_peptides_df['patient_id'].nunique()

248

In [14]:
train_proteins_df.head(2)

Unnamed: 0,visit_id,visit_month,patient_id,UniProt,NPX_Level
0,55_0,0,55,O00391,11254.3
1,55_0,0,55,O00533,732430.0


In [15]:
train_proteins_df['patient_id'].nunique()

248

In [16]:
# Merge train_peptides_df and train_proteins_df on their shared key columns
merged_df = pd.merge(train_peptides_df, train_proteins_df, on=['visit_id', 'visit_month', 'patient_id', 'UniProt'])


In [17]:
merged_df['patient_id'].nunique()

248

In [18]:
merged_df.head(5)

Unnamed: 0,visit_id,visit_month,patient_id,UniProt,Peptide_Seq,Peptide_Abundance,NPX_Level
0,55_0,0,55,O00391,NEQEQPLGQWHLS,11254.3,11254.3
1,55_0,0,55,O00533,GNPEPTFSWTK,102060.0,732430.0
2,55_0,0,55,O00533,IEIPSSVQQVPTIIK,174185.0,732430.0
3,55_0,0,55,O00533,KPQSAVYSTGSNGILLC(UniMod_4)EAEGEPQPTIK,27278.9,732430.0
4,55_0,0,55,O00533,SMEQNGPGLEYR,30838.7,732430.0
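
Each peptide row should pick up exactly one protein row in this merge, so it is worth checking that the join does not silently duplicate rows. The sketch below uses toy frames (the values are illustrative) and assumes the protein table holds one row per visit and UniProt ID; `validate="many_to_one"` makes pandas raise a `MergeError` if that assumption does not hold.

```python
import pandas as pd

# Toy stand-ins for train_peptides_df and train_proteins_df (illustrative values).
peptides = pd.DataFrame({
    "visit_id": ["55_0", "55_0", "55_0"],
    "visit_month": [0, 0, 0],
    "patient_id": [55, 55, 55],
    "UniProt": ["O00391", "O00533", "O00533"],
    "Peptide_Abundance": [11254.3, 102060.0, 174185.0],
})
proteins = pd.DataFrame({
    "visit_id": ["55_0", "55_0"],
    "visit_month": [0, 0],
    "patient_id": [55, 55],
    "UniProt": ["O00391", "O00533"],
    "NPX_Level": [11254.3, 732430.0],
})

keys = ["visit_id", "visit_month", "patient_id", "UniProt"]

# validate="many_to_one": many peptide rows may map to one protein row,
# but the key must be unique on the protein side.
checked = pd.merge(peptides, proteins, on=keys, validate="many_to_one")

print(len(peptides), "peptide rows ->", len(checked), "merged rows")
```

If the merged row count matches the peptide row count, no peptide row was dropped or duplicated.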


In [19]:
train_clinical_data_df.tail(2)

Unnamed: 0,visit_id,patient_id,visit_month,updrs_1,updrs_2,updrs_3,updrs_4,On_Medication
2613,65043_72,65043,72,3.0,9.0,14.0,1.0,Off
2614,65043_84,65043,84,7.0,9.0,20.0,3.0,Off


In [20]:
train_clinical_data_df['patient_id'].nunique()

248

In [21]:
supplemental_clinical_data_df.head(2)

Unnamed: 0,visit_id,patient_id,visit_month,updrs_1,updrs_2,updrs_3,updrs_4,On_Medication
0,35_0,35,0,5.0,3.0,16.0,0.0,
1,35_36,35,36,6.0,4.0,20.0,0.0,


In [22]:
supplemental_clinical_data_df['patient_id'].nunique()

771

In [23]:
# Concatenate train_clinical_data_df and supplemental_clinical_data_df

merged_clinical_data = pd.concat([train_clinical_data_df, supplemental_clinical_data_df], ignore_index=True).drop_duplicates()

In [24]:
merged_clinical_data['patient_id'].nunique()

1019

In [25]:
merged_clinical_data.head(2)

Unnamed: 0,visit_id,patient_id,visit_month,updrs_1,updrs_2,updrs_3,updrs_4,On_Medication
0,55_0,55,0,10.0,6.0,15.0,,
1,55_3,55,3,10.0,7.0,25.0,,


In [26]:
merged_clinical_data.shape

(4838, 8)

In [27]:
merged_df.shape

(981834, 7)

In [28]:
4838+981834

986672

In [29]:
# Two DataFrames remain: merged_df and merged_clinical_data.
# Merge them into a single DataFrame.

merged_data = pd.merge(merged_df, merged_clinical_data, on=['patient_id', 'visit_id', 'visit_month'])

In [30]:
merged_data.head(2)

Unnamed: 0,visit_id,visit_month,patient_id,UniProt,Peptide_Seq,Peptide_Abundance,NPX_Level,updrs_1,updrs_2,updrs_3,updrs_4,On_Medication
0,55_0,0,55,O00391,NEQEQPLGQWHLS,11254.3,11254.3,10.0,6.0,15.0,,
1,55_0,0,55,O00533,GNPEPTFSWTK,102060.0,732430.0,10.0,6.0,15.0,,


In [31]:
merged_data.shape

(941744, 12)
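
The default inner merge keeps only rows whose visit appears in both frames, which is why the result (941,744 rows) is smaller than merged_df (981,834 rows). A sketch, on made-up toy data, of how `indicator=True` can show which visits have no clinical record:

```python
import pandas as pd

# Toy stand-ins for merged_df and merged_clinical_data (made-up values).
abundance = pd.DataFrame({
    "visit_id": ["55_0", "55_0", "55_3", "99_0"],
    "Peptide_Abundance": [11254.3, 102060.0, 9800.0, 5000.0],
})
clinical = pd.DataFrame({
    "visit_id": ["55_0", "55_3"],
    "updrs_1": [10.0, 10.0],
})

# A left merge keeps every abundance row; the _merge column records whether
# a clinical record was found ("both") or not ("left_only").
audit = pd.merge(abundance, clinical, on="visit_id", how="left", indicator=True)
missing_clinical = (
    audit.loc[audit["_merge"] == "left_only", "visit_id"].unique().tolist()
)

print(audit["_merge"].value_counts())
print("Visits without clinical data:", missing_clinical)
```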

In [32]:
merged_data.drop_duplicates(inplace=True)

In [33]:
merged_data.shape

(941744, 12)

In [34]:
merged_data = merged_data.rename(columns={'visit_id': 'Visit ID',
                                          'visit_month': 'Visit Month',
                                          'patient_id': 'Patient ID',
                                          'UniProt': 'Protein ID',
                                          'Peptide_Seq': 'Peptide Sequence',
                                          'Peptide_Abundance': 'Peptide Abundance',
                                          'NPX_Level': 'Protein Abundance',
                                          'updrs_1': 'UPDRS Part I',
                                          'updrs_2': 'UPDRS Part II',
                                          'updrs_3': 'UPDRS Part III',
                                          'updrs_4': 'UPDRS Part IV',
                                          'On_Medication': 'On Medication'})


In [35]:
merged_data.head(2)

Unnamed: 0,Visit ID,Visit Month,Patient ID,Protein ID,Peptide Sequence,Peptide Abundance,Protein Abundance,UPDRS Part I,UPDRS Part II,UPDRS Part III,UPDRS Part IV,On Medication
0,55_0,0,55,O00391,NEQEQPLGQWHLS,11254.3,11254.3,10.0,6.0,15.0,,
1,55_0,0,55,O00533,GNPEPTFSWTK,102060.0,732430.0,10.0,6.0,15.0,,


### Step-4 Data cleaning (continued)

In [36]:
merged_data.isna().sum()

Visit ID                  0
Visit Month               0
Patient ID                0
Protein ID                0
Peptide Sequence          0
Peptide Abundance         0
Protein Abundance         0
UPDRS Part I              0
UPDRS Part II             0
UPDRS Part III         9120
UPDRS Part IV        446214
On Medication        550019
dtype: int64

In [37]:
merged_data.dropna(inplace=True)

In [38]:
merged_data.shape

(381841, 12)
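
Calling `dropna()` with no arguments removes a row if any column is missing, which is why the frame shrinks from 941,744 to 381,841 rows; most of the loss comes from UPDRS Part IV and On Medication. If those columns are not needed, one option is to drop rows only where the required columns are missing. A minimal sketch on made-up data; which columns count as required is an assumption:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "UPDRS Part I": [10.0, 8.0, 11.0, 5.0],
    "UPDRS Part III": [15.0, np.nan, 25.0, 46.0],
    "UPDRS Part IV": [np.nan, 0.0, 5.0, np.nan],
    "On Medication": [np.nan, "On", "On", np.nan],
})

# Share of missing values per column, to see what drives the row loss.
missing_share = toy.isna().mean()

# Default behaviour: drop a row if any column is missing.
strict = toy.dropna()

# Alternative: drop a row only if an assumed required column is missing.
required = ["UPDRS Part I", "UPDRS Part III"]
lenient = toy.dropna(subset=required)

print(missing_share)
print(len(strict), "rows kept by dropna(),", len(lenient), "rows kept with subset")
```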

In [39]:
merged_data.head(2)

Unnamed: 0,Visit ID,Visit Month,Patient ID,Protein ID,Peptide Sequence,Peptide Abundance,Protein Abundance,UPDRS Part I,UPDRS Part II,UPDRS Part III,UPDRS Part IV,On Medication
931,1517_0,0,1517,O00391,NEQEQPLGQWHLS,11648.9,11648.9,11.0,6.0,25.0,5.0,On
932,1517_0,0,1517,O00533,GNPEPTFSWTK,63593.4,419015.0,11.0,6.0,25.0,5.0,On


The total MDS-UPDRS score (the sum of Parts I to IV) ranges from 0 to 260. This notebook groups it into five classes:

- 0 to 8: Normal
- 9 to 16: Slight
- 17 to 32: Mild
- 33 to 68: Moderate
- 69 and above: Severe

The class names follow the severity labels of the MDS-UPDRS rating scale. The cut-offs may vary depending on the specific study or research context.

In [40]:
merged_data['Total UPDRS Score'] = merged_data['UPDRS Part I'] + merged_data['UPDRS Part II'] + merged_data['UPDRS Part III'] + merged_data['UPDRS Part IV']


In [41]:
merged_data.head(2)

Unnamed: 0,Visit ID,Visit Month,Patient ID,Protein ID,Peptide Sequence,Peptide Abundance,Protein Abundance,UPDRS Part I,UPDRS Part II,UPDRS Part III,UPDRS Part IV,On Medication,Total UPDRS Score
931,1517_0,0,1517,O00391,NEQEQPLGQWHLS,11648.9,11648.9,11.0,6.0,25.0,5.0,On,47.0
932,1517_0,0,1517,O00533,GNPEPTFSWTK,63593.4,419015.0,11.0,6.0,25.0,5.0,On,47.0


In [42]:
merged_data['Total UPDRS Score'].value_counts()

40.0     14790
39.0     14350
54.0      9546
33.0      9523
28.0      9059
61.0      8949
41.0      8814
30.0      8785
43.0      7699
34.0      7420
37.0      7066
38.0      7064
52.0      7038
44.0      6990
36.0      6967
50.0      6888
31.0      6849
27.0      6652
25.0      6523
57.0      6337
69.0      6290
22.0      6229
26.0      6129
65.0      6120
46.0      6107
47.0      6061
19.0      6053
48.0      6038
75.0      5376
62.0      5325
20.0      5256
58.0      5168
35.0      5117
64.0      5034
51.0      5014
59.0      4857
32.0      4494
63.0      4488
16.0      4476
55.0      4453
66.0      4443
49.0      4396
17.0      4361
67.0      4321
29.0      4017
68.0      3854
86.0      3676
45.0      3612
23.0      3603
18.0      3501
73.0      3486
24.0      3473
42.0      3243
76.0      2728
74.0      2694
77.0      2593
53.0      2589
21.0      2564
91.0      2430
60.0      2401
83.0      1824
8.0       1814
70.0      1810
71.0      1798
14.0      1782
85.0      1778
82.0      

In [43]:
def calc_rating(score):
    if score >= 0 and score <= 8:
        return 0
    elif score >= 9 and score <= 16:
        return 1
    elif score >= 17 and score <= 32:
        return 2
    elif score >= 33 and score <= 68:
        return 3
    else:
        return 4

merged_data['UPDRS Rating'] = merged_data['Total UPDRS Score'].apply(calc_rating)
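
The same binning can be written without a row-by-row `apply`. A sketch using `pd.cut`, assuming the scores are non-negative whole numbers with no missing values (which holds after the `dropna` above); the sample scores are illustrative:

```python
import numpy as np
import pandas as pd

scores = pd.Series([0.0, 8.0, 9.0, 16.0, 17.0, 32.0, 33.0, 47.0, 68.0, 69.0])

# Bin edges are right-inclusive: [0, 8], (8, 16], (16, 32], (32, 68], (68, inf).
ratings = pd.cut(
    scores,
    bins=[0, 8, 16, 32, 68, np.inf],
    labels=[0, 1, 2, 3, 4],
    include_lowest=True,
).astype(int)

print(ratings.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3, 3, 4]
```

For whole-number scores this matches `calc_rating`. The two differ on edge cases: `calc_rating` sends a fractional score such as 8.5 to class 4 because it falls between the listed ranges, while `pd.cut` places it in class 1.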


In [44]:
# Rating created from Total UPDRS Score

# 0 to 8: Normal as 0
# 9 to 16: Slight as 1
# 17 to 32: Mild as 2
# 33 to 68: Moderate as 3 
# 69 and above: Severe as 4

merged_data.head(5)

Unnamed: 0,Visit ID,Visit Month,Patient ID,Protein ID,Peptide Sequence,Peptide Abundance,Protein Abundance,UPDRS Part I,UPDRS Part II,UPDRS Part III,UPDRS Part IV,On Medication,Total UPDRS Score,UPDRS Rating
931,1517_0,0,1517,O00391,NEQEQPLGQWHLS,11648.9,11648.9,11.0,6.0,25.0,5.0,On,47.0,3
932,1517_0,0,1517,O00533,GNPEPTFSWTK,63593.4,419015.0,11.0,6.0,25.0,5.0,On,47.0,3
933,1517_0,0,1517,O00533,IEIPSSVQQVPTIIK,99566.6,419015.0,11.0,6.0,25.0,5.0,On,47.0,3
934,1517_0,0,1517,O00533,KPQSAVYSTGSNGILLC(UniMod_4)EAEGEPQPTIK,16351.0,419015.0,11.0,6.0,25.0,5.0,On,47.0,3
935,1517_0,0,1517,O00533,SMEQNGPGLEYR,15566.0,419015.0,11.0,6.0,25.0,5.0,On,47.0,3


In [45]:
merged_data['On Medication'] = merged_data['On Medication'].replace({'On': 1, 'Off': 0})

In [46]:
merged_data.head(3)

Unnamed: 0,Visit ID,Visit Month,Patient ID,Protein ID,Peptide Sequence,Peptide Abundance,Protein Abundance,UPDRS Part I,UPDRS Part II,UPDRS Part III,UPDRS Part IV,On Medication,Total UPDRS Score,UPDRS Rating
931,1517_0,0,1517,O00391,NEQEQPLGQWHLS,11648.9,11648.9,11.0,6.0,25.0,5.0,1,47.0,3
932,1517_0,0,1517,O00533,GNPEPTFSWTK,63593.4,419015.0,11.0,6.0,25.0,5.0,1,47.0,3
933,1517_0,0,1517,O00533,IEIPSSVQQVPTIIK,99566.6,419015.0,11.0,6.0,25.0,5.0,1,47.0,3


In [47]:
merged_data.drop(['Peptide Sequence', 'Protein ID'], axis=1, inplace=True)
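
Dropping Protein ID and Peptide Sequence leaves abundance values that can no longer be tied to a specific protein or peptide. An alternative, sketched here on made-up data and not part of the original notebook, is to pivot to one row per visit with one column per protein:

```python
import pandas as pd

# Made-up long-format rows: one row per peptide, protein value repeated.
long_df = pd.DataFrame({
    "Visit ID": ["55_0", "55_0", "55_0", "55_6", "55_6"],
    "Protein ID": ["O00391", "O00533", "O00533", "O00391", "O00533"],
    "Protein Abundance": [11254.3, 732430.0, 732430.0, 13163.6, 630465.0],
})

# Protein abundance repeats on every peptide row of the same protein,
# so "first" simply picks that single value per (visit, protein) pair.
wide_df = long_df.pivot_table(
    index="Visit ID",
    columns="Protein ID",
    values="Protein Abundance",
    aggfunc="first",
)

print(wide_df)
```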

In [54]:
merged_data.tail(5)

Unnamed: 0,Visit ID,Visit Month,Patient ID,Peptide Abundance,Protein Abundance,UPDRS Part I,UPDRS Part II,UPDRS Part III,UPDRS Part IV,On Medication,Total UPDRS Score,UPDRS Rating
940831,55096_108,108,55096,33929.8,123192.0,5.0,6.0,46.0,0.0,0,57.0,3
940832,55096_108,108,55096,68513.3,123192.0,5.0,6.0,46.0,0.0,0,57.0,3
940833,55096_108,108,55096,101061.0,101061.0,5.0,6.0,46.0,0.0,0,57.0,3
940834,55096_108,108,55096,18108.9,18108.9,5.0,6.0,46.0,0.0,0,57.0,3
940835,55096_108,108,55096,14165.3,14165.3,5.0,6.0,46.0,0.0,0,57.0,3


In [None]:
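# This cell was not run (it has no execution count); 'Visit Month' is still
# used when the submission file is built further down.
merged_data.drop(['Visit Month', 'Patient ID'], axis=1, inplace=True)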

In [55]:
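# Splitting the dataset into X and y
# Note: X still contains 'Visit ID', the four UPDRS part scores and
# 'Total UPDRS Score'. 'UPDRS Rating' is computed directly from
# 'Total UPDRS Score', so the target leaks into the features.
X = merged_data.drop(['UPDRS Rating'], axis=1)
y = merged_data['UPDRS Rating']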

In [56]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# split the data into training and testing sets
# (rows are split at random, so rows from the same visit can fall in both sets)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# create a random forest regressor (the 0-4 rating is treated as a continuous target)
rf = RandomForestRegressor(n_estimators=100, random_state=42)


In [57]:
# fit the model to the training data
rf.fit(X_train, y_train)

# make predictions on the testing data
y_pred = rf.predict(X_test)

In [58]:
# evaluate the model's performance
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
print('MSE:', mse)


MSE: 0.0


In [60]:
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print("R-squared value:", r2)


R-squared value: 1.0
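
A score of exactly 1.0 usually points to the target leaking into the features. Here UPDRS Rating is a function of Total UPDRS Score, which is still a column of X, and rows from the same visit land on both sides of the split. The sketch below, on synthetic data, shows one way to guard against both: drop the columns the target is derived from, and split by patient with scikit-learn's `GroupShuffleSplit`. The toy frame and the choice of excluded columns are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
n_rows = 200
toy = pd.DataFrame({
    "Patient ID": rng.integers(1, 21, size=n_rows),
    "Peptide Abundance": rng.lognormal(mean=10, sigma=1, size=n_rows),
    "Protein Abundance": rng.lognormal(mean=12, sigma=1, size=n_rows),
    "Total UPDRS Score": rng.integers(0, 100, size=n_rows).astype(float),
})
# Synthetic target derived from the total score, mimicking the leakage.
toy["UPDRS Rating"] = (toy["Total UPDRS Score"] >= 33).astype(int)

# Exclude the target, anything it is computed from, and the identifier.
excluded = ["UPDRS Rating", "Total UPDRS Score", "Patient ID"]
X = toy.drop(columns=excluded)
y = toy["UPDRS Rating"]
groups = toy["Patient ID"]

# Split by patient so no patient contributes rows to both sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

shared = set(groups.iloc[train_idx]) & set(groups.iloc[test_idx])
print("Patients in both sets:", len(shared))
```

With a split like this, the test score reflects how well abundance features generalise to unseen patients rather than how well the model reproduces the score it was given.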


In [88]:
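# Create submission dataframe
# Note: 'rating' is the UPDRS Rating computed from the recorded scores,
# not the output of rf.predict.
# 'Visit ID' already ends with the visit month, so the month appears twice
# in prediction_id (e.g. 10138_12_12_months).
submission_df = pd.DataFrame({
    'prediction_id': merged_data['Visit ID'].astype(str) + '_' + merged_data['Visit Month'].astype(str) + '_months',
    'rating': merged_data['UPDRS Rating'].astype(int)
})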

# Group by prediction_id and take the mean rating
submission_df = submission_df.groupby(['prediction_id'])['rating'].mean().reset_index()

# Set prediction_id as index
submission_df = submission_df.set_index('prediction_id')

# Save submission CSV file
submission_df.to_csv('submission4.csv')
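
The cell above writes the rating computed from the recorded scores. If model output is wanted instead, row-level predictions can be averaged per visit in the same way. A sketch on made-up data, where `row_preds` is a hypothetical stand-in for the output of `rf.predict(...)`:

```python
import numpy as np
import pandas as pd

# Made-up visit IDs, one entry per peptide row.
visit_ids = pd.Series(["10138_12", "10138_12", "10138_24", "10138_24", "10138_24"])
# Hypothetical row-level predictions, standing in for rf.predict(...).
row_preds = np.array([2.5, 3.5, 3.0, 2.0, 4.0])

# Average the row-level predictions to get one value per visit.
per_visit = (
    pd.DataFrame({"prediction_id": visit_ids, "rating": row_preds})
    .groupby("prediction_id", as_index=False)["rating"]
    .mean()
)

print(per_visit)
```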


In [89]:
sub = pd.read_csv('submission4.csv')

In [90]:
sub.head(3)

Unnamed: 0,prediction_id,rating
0,10138_12_12_months,3
1,10138_24_24_months,3
2,10138_36_36_months,2


### Conclusion

The Random Forest model reached an MSE of 0.0 and an R-squared value of 1.0 on the held-out rows. This perfect fit is an artifact of target leakage rather than evidence of predictive power: the feature matrix still contained the four UPDRS part scores and Total UPDRS Score, and UPDRS Rating is computed directly from Total UPDRS Score. The split was also made row by row, so peptide rows from the same visit, which share the same rating, appear in both the training and test sets.

For the same reason, this notebook does not establish that protein abundance and peptide abundance are correlated with UPDRS rating, nor that the UPDRS Part III score is the strongest predictor. No correlation or feature-importance analysis was run.

Finally, the submission file contains the rating derived from the recorded UPDRS scores for each patient visit, not predictions from the trained model. Before this approach can assist medical professionals in predicting disease progression or monitoring treatments for Parkinson's disease patients, the model needs to be retrained on protein and peptide features only, with the data split by patient.