# **Pre-processing**

In this notebook, I pre-process the data for the project. I first scale the features and then use the Boruta feature selection method to select the proteins and peptides that are most important for predicting the UPDRS scores. Feature selection is done on the training dataset only. There are 234 patients, with protein/peptide expression data at different months. There are protein abundance values for 227 proteins and peptide abundance measurements for 671 peptides; these were merged in the previous notebook.
I will be scaling the testing dataset using values from the training dataset.

In [34]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import plotly.express as px
from boruta import BorutaPy
from sklearn.ensemble import RandomForestRegressor

Let us import the datasets we will be working on: the cleaned and wrangled protein/peptide dataset and the clinical dataset.

In [35]:
Training_protein_peptide_final=pd.read_csv("Protein_peptide_training_final.csv")
Training_clinical=pd.read_csv('Training_clinical_cleaned.csv')

Visualising the loaded datasets

In [36]:
Training_protein_peptide_final.head()

Unnamed: 0,visit_id,O00391,O00533,O00584,O14498,O14773,O14791,O15240,O15394,O43505,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
0,10053_0,13.152328,18.617988,0.0,0.0,12.803843,11.286465,16.340874,13.88356,17.352311,...,17.625951,0.0,22.069672,16.241585,19.153322,16.227046,16.669826,19.01624,0.0,12.815243
1,10053_12,13.353174,18.732598,0.0,0.0,0.0,0.0,17.588693,13.882175,17.325692,...,17.616901,0.0,22.254002,15.165272,18.44007,16.49057,16.911275,18.791961,15.58877,14.628719
2,10053_18,13.692147,18.952724,12.799071,14.582007,0.0,11.21232,16.948846,13.991664,17.35902,...,17.75191,0.0,22.371027,15.251778,18.920042,15.947719,16.969566,18.771544,15.676979,14.374204
3,10138_12,13.621159,18.915847,13.161929,14.730974,14.458028,12.554565,17.254078,15.735196,17.638302,...,17.523148,13.20361,21.895146,15.557054,18.325455,16.454783,16.987753,19.074915,16.002679,13.269854
4,10138_24,13.551131,18.994072,12.135232,14.069265,14.829346,11.380001,17.205803,15.675574,17.878027,...,17.653594,12.635979,21.747882,16.09475,18.922123,16.304196,16.770548,0.0,15.798107,12.259476


In [37]:
Training_clinical.head()

Unnamed: 0,visit_id,patient_id,visit_month,updrs_1,updrs_2,updrs_3,updrs_4,upd23b_clinical_state_on_medication,visit_month_diff,visit_month_diff_min,median_updrs_1,median_updrs_2,median_updrs_3,median_updrs_4
0,55_0,55,0,10.0,6.0,15.0,0.0,,,3.0,7.0,7.0,24.0,0.0
1,55_3,55,3,10.0,7.0,25.0,0.0,,3.0,3.0,7.0,7.0,24.0,0.0
2,55_6,55,6,8.0,10.0,34.0,0.0,,3.0,3.0,7.0,7.0,24.0,0.0
3,55_9,55,9,8.0,9.0,30.0,0.0,On,3.0,3.0,7.0,7.0,24.0,0.0
4,55_12,55,12,10.0,10.0,41.0,0.0,On,3.0,3.0,7.0,7.0,24.0,0.0


Drop the median UPDRS score columns from the training clinical dataset

In [38]:
Training_clinical=Training_clinical.drop(columns=["median_updrs_1","median_updrs_2","median_updrs_3","median_updrs_4"])

Combine the clinical and the protein/peptide abundance datasets

In [39]:
training_protein_peptide_clinical = Training_clinical.merge(Training_protein_peptide_final, on="visit_id",how="inner")
training_protein_peptide_clinical.head()

Unnamed: 0,visit_id,patient_id,visit_month,updrs_1,updrs_2,updrs_3,updrs_4,upd23b_clinical_state_on_medication,visit_month_diff,visit_month_diff_min,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
0,55_0,55,0,10.0,6.0,15.0,0.0,,,3.0,...,17.61797,14.009505,21.861462,16.705821,19.147352,17.000913,17.339528,18.73828,15.498388,13.86287
1,55_6,55,6,8.0,10.0,34.0,0.0,,3.0,3.0,...,17.384303,13.688119,21.974045,16.79087,18.973823,16.659439,17.141778,18.804645,15.289432,14.337615
2,55_12,55,12,10.0,10.0,41.0,0.0,On,3.0,3.0,...,17.822347,14.125559,22.384201,16.827318,19.441143,17.063216,17.471699,18.786771,15.739915,14.414758
3,55_36,55,36,17.0,18.0,51.0,0.0,On,6.0,3.0,...,17.499425,14.181502,21.34281,16.472578,19.373398,16.972453,17.635945,18.927584,15.688051,13.770426
4,942_6,942,6,8.0,2.0,21.0,0.0,,3.0,3.0,...,17.787966,12.643811,0.0,15.813065,18.87553,16.287734,16.281602,19.128931,15.550921,13.936095
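
An inner merge keeps only the visits that appear in both tables, so clinical visits without abundance measurements drop out. The sketch below illustrates this on toy data; the frames and values are hypothetical stand-ins, not the project data. The `indicator=True` outer merge is just one possible way to count how many rows fall on each side.

```python
import pandas as pd

# Hypothetical toy frames standing in for the clinical and abundance tables
toy_clinical = pd.DataFrame(
    {"visit_id": ["55_0", "55_3", "55_6"], "updrs_1": [10.0, 10.0, 8.0]}
)
toy_abundance = pd.DataFrame(
    {"visit_id": ["55_0", "55_6", "942_6"], "O00391": [13.4, 13.6, 13.5]}
)

# Inner merge: only visit_ids present in both frames survive
toy_merged = toy_clinical.merge(toy_abundance, on="visit_id", how="inner")

# Outer merge with an indicator column shows where each row came from
toy_check = toy_clinical.merge(
    toy_abundance, on="visit_id", how="outer", indicator=True
)

print(toy_merged)
print(toy_check["_merge"].value_counts())
```

Checking the `_merge` counts before committing to `how="inner"` makes it easy to see how many visits are being discarded.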


In [43]:
X_train=training_protein_peptide_clinical.drop(columns=['updrs_1',"updrs_2","updrs_3","updrs_4"])

In [41]:
y_train=training_protein_peptide_clinical[['updrs_1',"updrs_2","updrs_3","updrs_4"]]

We will also engineer our test dataset to make it similar to our training dataset and impute test clinical data values based on training dataset values.

In [9]:
Testing_protein=pd.read_csv('testing_data/example_test_files/test_proteins.csv')
Testing_peptide=pd.read_csv('testing_data/example_test_files/test_peptides.csv')
Testing_clinical=pd.read_csv('testing_data/example_test_files/test.csv')


Log2-normalise the testing data values, as was done for the training dataset in the EDA notebook.

In [10]:
Testing_protein_peptide = Testing_peptide.merge(Testing_protein, on=['visit_id','visit_month','patient_id','UniProt'])
Testing_protein_peptide["Normalised_PeptideAbundance"]=np.log2(Testing_protein_peptide.loc[:,"PeptideAbundance"])
Testing_protein_peptide["Normalised_NPX"]=np.log2(Testing_protein_peptide.NPX)
Testing_protein_peptide.head()

Unnamed: 0,visit_id,visit_month,patient_id,UniProt,Peptide,PeptideAbundance,group_key_x,NPX,group_key_y,Normalised_PeptideAbundance,Normalised_NPX
0,50423_0,0,50423,O00391,AHFSPSNIILDFPAAGSAAR,22226.3,0,33127.9,0,14.43998,15.015759
1,50423_0,0,50423,O00391,NEQEQPLGQWHLS,10901.6,0,33127.9,0,13.412252,15.015759
2,50423_0,0,50423,O00533,GNPEPTFSWTK,51499.4,0,490742.0,0,15.652268,18.904605
3,50423_0,0,50423,O00533,IEIPSSVQQVPTIIK,125492.0,0,490742.0,0,16.937236,18.904605
4,50423_0,0,50423,O00533,KPQSAVYSTGSNGILLC(UniMod_4)EAEGEPQPTIK,23174.2,0,490742.0,0,14.500232,18.904605
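
As a quick illustration of the log2 step, here is a minimal sketch on a toy frame that reuses the first three `PeptideAbundance` values shown above. The frame name `toy` and the non-positive check are my own additions for illustration; `np.log2` returns `-inf` for zeros and `NaN` for negative values, so counting such entries first can be a useful sanity check.

```python
import numpy as np
import pandas as pd

# Toy frame with three abundance values copied from the output above
toy = pd.DataFrame({"PeptideAbundance": [22226.3, 10901.6, 51499.4]})

# Count values for which log2 is not finite (illustrative sanity check)
n_nonpositive = int((toy["PeptideAbundance"] <= 0).sum())

# Element-wise log2 transform, as in the cell above
toy["Normalised_PeptideAbundance"] = np.log2(toy["PeptideAbundance"])

print(n_nonpositive)
print(toy)
```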


In [11]:
Testing_protein_peptide_select_drop_protein=Testing_protein_peptide.loc[:,["visit_id","UniProt","Normalised_NPX","visit_month"]]
Testing_protein_peptide_select_drop_protein.head()

Unnamed: 0,visit_id,UniProt,Normalised_NPX,visit_month
0,50423_0,O00391,15.015759,0
1,50423_0,O00391,15.015759,0
2,50423_0,O00533,18.904605,0
3,50423_0,O00533,18.904605,0
4,50423_0,O00533,18.904605,0


In [12]:
protein=Testing_protein_peptide_select_drop_protein.pivot_table(index=["visit_id"],columns="UniProt",values="Normalised_NPX")
protein=protein.fillna(0)
columns=protein.columns
protein.head()

visit_id,O00391,O00533,O00584,O14498,O14773,O14791,O15031,O15240,O15394,O43505,...,Q9HDC9,Q9NQ79,Q9NYU2,Q9UBR2,Q9UBX5,Q9UHG2,Q9UKV8,Q9UNU6,Q9Y646,Q9Y6R7
3342_6,13.830119,19.234453,15.906376,14.764358,13.868745,11.478325,0.0,17.769095,15.845073,17.662731,...,17.402121,15.502204,16.772869,14.039262,15.294165,18.311083,15.975031,14.616405,15.651537,13.8361
50423_0,15.015759,18.904605,15.412547,0.0,14.009006,11.493065,13.546376,16.923977,15.815986,17.612264,...,17.504966,15.115113,16.591958,14.481049,15.615633,17.914041,16.147583,14.352954,15.246195,15.389376
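
The reshaping above goes from a long table (one row per visit and peptide) to a wide one (one row per visit, one column per protein). A small sketch on hypothetical toy values shows what `pivot_table` and `fillna(0)` do here: repeated protein values within a visit are averaged (the default `aggfunc` is the mean), and proteins that were not measured for a visit become 0.

```python
import pandas as pd

# Hypothetical long-format data: the protein value repeats once per peptide
toy_long = pd.DataFrame(
    {
        "visit_id": ["50423_0", "50423_0", "50423_0", "3342_6"],
        "UniProt": ["O00391", "O00391", "O00533", "O00533"],
        "Normalised_NPX": [15.0, 15.0, 18.9, 19.2],
    }
)

# Long to wide: duplicates are averaged, missing combinations become NaN
toy_wide = toy_long.pivot_table(
    index="visit_id", columns="UniProt", values="Normalised_NPX"
)

# Replace the NaN for unmeasured proteins with 0, mirroring the cell above
toy_wide = toy_wide.fillna(0)

print(toy_wide)
```

Note that 0 on a log2 scale is a sentinel for "not measured" rather than a real abundance, which is worth keeping in mind when scaling.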


In [13]:
Testing_protein_peptide_select_peptide=Testing_protein_peptide.loc[:,["visit_id","Peptide","Normalised_PeptideAbundance","visit_month"]]
peptide=Testing_protein_peptide_select_peptide.pivot(index="visit_id",columns="Peptide",values="Normalised_PeptideAbundance")

testing_protein_peptide_final = protein.merge(peptide, on="visit_id",how="left")
testing_protein_peptide_final.head()

visit_id,O00391,O00533,O00584,O14498,O14773,O14791,O15031,O15240,O15394,O43505,...,YSSDYFQAPSDYR,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
3342_6,13.830119,19.234453,15.906376,14.764358,13.868745,11.478325,0.0,17.769095,15.845073,17.662731,...,14.926745,13.338709,21.802359,15.523455,19.941847,16.186861,16.9559,17.685262,15.49583,14.427287
50423_0,15.015759,18.904605,15.412547,0.0,14.009006,11.493065,13.546376,16.923977,15.815986,17.612264,...,15.273959,13.90783,21.980359,13.893396,16.34523,16.759433,16.776369,17.756113,15.506723,13.057861


In [14]:
Testing_protein_peptide_clinical = Testing_clinical.merge(testing_protein_peptide_final, on="visit_id",how="inner")
Testing_protein_peptide_clinical=Testing_protein_peptide_clinical.set_index("visit_id")
Testing_protein_peptide_clinical.head()

visit_id,visit_month,patient_id,updrs_test,row_id,group_key,O00391,O00533,O00584,O14498,O14773,...,YSSDYFQAPSDYR,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
50423_0,0,50423,updrs_1,50423_0_updrs_1,0,15.015759,18.904605,15.412547,0.0,14.009006,...,15.273959,13.90783,21.980359,13.893396,16.34523,16.759433,16.776369,17.756113,15.506723,13.057861
50423_0,0,50423,updrs_2,50423_0_updrs_2,0,15.015759,18.904605,15.412547,0.0,14.009006,...,15.273959,13.90783,21.980359,13.893396,16.34523,16.759433,16.776369,17.756113,15.506723,13.057861
50423_0,0,50423,updrs_3,50423_0_updrs_3,0,15.015759,18.904605,15.412547,0.0,14.009006,...,15.273959,13.90783,21.980359,13.893396,16.34523,16.759433,16.776369,17.756113,15.506723,13.057861
50423_0,0,50423,updrs_4,50423_0_updrs_4,0,15.015759,18.904605,15.412547,0.0,14.009006,...,15.273959,13.90783,21.980359,13.893396,16.34523,16.759433,16.776369,17.756113,15.506723,13.057861
3342_6,6,3342,updrs_1,3342_6_updrs_1,6,13.830119,19.234453,15.906376,14.764358,13.868745,...,14.926745,13.338709,21.802359,15.523455,19.941847,16.186861,16.9559,17.685262,15.49583,14.427287


In [15]:
X_test=Testing_protein_peptide_clinical.drop(columns=['updrs_test',"row_id","patient_id","visit_month","group_key"])
X_test = X_test[X_test.columns.intersection(X_train.columns)]
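
Restricting both datasets to their shared columns can be sketched as follows. The toy frames are hypothetical. Indexing both frames with one shared `Index` is a small variation on the cells in this notebook: it guarantees that the two frames end up with the same columns in the same order, which matters because the scaler is applied column by column.

```python
import pandas as pd

# Hypothetical toy frames with partially overlapping columns
toy_train = pd.DataFrame({"visit_id": ["a"], "O00391": [13.1], "O00533": [18.6]})
toy_test = pd.DataFrame({"O00533": [18.9], "O15031": [13.5], "O00391": [15.0]})

# Columns present in both frames
shared = toy_train.columns.intersection(toy_test.columns)

# Index both frames with the same Index so the column order is identical
toy_train_aligned = toy_train[shared]
toy_test_aligned = toy_test[shared]

print(list(toy_train_aligned.columns))
print(list(toy_test_aligned.columns))
```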

Let me use the standard scaler to scale the features.

In [16]:
X_train=X_train[X_train.columns.intersection(X_test.columns)]
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_train)

In [17]:
# Reuse the mean and standard deviation learned from X_train; do not refit on the test set
X_test_scaled = scaler.transform(X_test)
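
To make the idea of scaling the test data with training statistics concrete, here is a minimal sketch on toy arrays (hypothetical values, not the project data). The scaler learns the mean and standard deviation from the training array only, and `transform` then applies those same statistics to the test array.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: three training rows, one test row, two features
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[2.0, 40.0]])

toy_scaler = StandardScaler()

# fit_transform learns the per-column mean and std from the training data
toy_train_scaled = toy_scaler.fit_transform(toy_train)

# transform reuses the training mean and std; nothing is learned from the test data
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_scaler.mean_)
print(toy_test_scaled)
```

Calling `fit_transform` on the test data instead would centre the test set on its own mean, so the train and test features would no longer be on a common scale.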

In [18]:
y_test=Testing_protein_peptide_clinical[['updrs_test']]

We have to apply the Boruta feature selection method separately to each of the four UPDRS scores.

In [64]:

model = RandomForestRegressor(n_estimators=1000, max_depth=5, random_state=42)

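# Older BorutaPy releases reference the np.int, np.float and np.bool aliases,
# which were removed in NumPy 1.24, so restore them before fitting
np.int = np.int32
np.float = np.float64
np.bool = np.bool_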

# let's initialize Boruta
feat_selector = BorutaPy(
    verbose=2,
    estimator=model,
    n_estimators='auto'  
)

# train Boruta
# N.B.: X and y must be numpy arrays
feat_selector.fit(np.array(X_tr_scaled), np.array(y_train.updrs_1))

X_train.columns[feat_selector.support_]

Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	0
Tentative: 	7
Rejected: 	1159
Iteration: 	9 / 100
Confirmed: 	1
Tentative: 	6
Rejected: 	1159
Iteration: 	10 / 100
Confirmed: 	1
Tentative: 	6
Rejected: 	1159
Iteration: 	11 / 100
Confirmed: 	1
Tentative: 	6
Rejected: 	1159
Iteration: 	12 / 100
Confirmed: 	1
Tentative: 	5
Rejected: 	1160
Iteration: 	13 / 100
Confirmed: 	1
Tentative: 	5
Rejected: 	1160
Iteration: 	14 / 100
Confirmed: 	1
Tentative: 	5
Rejected: 	1160
Iteration: 	15 / 100
Confirmed: 	1
Tentative: 	5
Rejected: 	1160
Iteration: 	16 / 100
Confirmed: 	2

Index(['P04180', 'P43121', 'FIYGGC(UniMod_4)GGNR', 'GEAGAPGEEDIQGPTK',
       'LDEVKEQVAEVR'],
      dtype='object')
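
For intuition about what Boruta is doing, here is a simplified sketch of a single round of its core idea using scikit-learn only. This is not BorutaPy's implementation: BorutaPy repeats this comparison over many iterations and applies a statistical test before confirming or rejecting a feature. The data are synthetic and all names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_samples, n_features = 200, 5

# Synthetic data in which only feature 0 carries signal
X_toy = rng.normal(size=(n_samples, n_features))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=n_samples)

# Shadow features: each real column shuffled independently, which destroys
# any relationship with the target while keeping the column's distribution
X_shadow = np.column_stack(
    [rng.permutation(X_toy[:, j]) for j in range(n_features)]
)
X_aug = np.hstack([X_toy, X_shadow])

forest = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)
forest.fit(X_aug, y_toy)

# A real feature scores a "hit" if it beats the best shadow feature
importances = forest.feature_importances_
real_importance = importances[:n_features]
shadow_max = importances[n_features:].max()
hits = real_importance > shadow_max

print(hits)
```

Features that beat the shadows consistently across rounds end up confirmed, the rest are rejected or stay tentative.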

In [19]:
features_UPDRS1=X_train.columns[feat_selector.support_]
pd.Series(features_UPDRS1).to_csv('features_UPDRS1', header=False, index=False)

NameError: name 'feat_selector' is not defined

In [19]:

np.int = np.int32
np.float = np.float64
np.bool = np.bool_
feat_selector_1 = BorutaPy(
    verbose=2,
    estimator=model,
    n_estimators='auto' # let BorutaPy choose the number of trees automatically
)

# train Boruta
# N.B.: X and y must be numpy arrays
feat_selector_1.fit(np.array(X_tr_scaled), np.array(y_train.updrs_2))

X_train.columns[feat_selector_1.support_]
features_UPDRS2=X_train.columns[feat_selector_1.support_]
pd.Series(features_UPDRS2).to_csv('features_UPDRS2', header=False, index=False)

Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	0
Tentative: 	23
Rejected: 	1143
Iteration: 	9 / 100
Confirmed: 	3
Tentative: 	20
Rejected: 	1143
Iteration: 	10 / 100
Confirmed: 	3
Tentative: 	20
Rejected: 	1143
Iteration: 	11 / 100
Confirmed: 	3
Tentative: 	20
Rejected: 	1143
Iteration: 	12 / 100
Confirmed: 	4
Tentative: 	17
Rejected: 	1145
Iteration: 	13 / 100
Confirmed: 	4
Tentative: 	17
Rejected: 	1145
Iteration: 	14 / 100
Confirmed: 	4
Tentative: 	17
Rejected: 	1145
Iteration: 	15 / 100
Confirmed: 	4
Tentative: 	17
Rejected: 	1145
Iteration: 	16 / 100
Confi

In [21]:

np.int = np.int32
np.float = np.float64
np.bool = np.bool_
feat_selector_3 = BorutaPy(
    verbose=2,
    estimator=model,
    n_estimators='auto' # let BorutaPy choose the number of trees automatically
)

# train Boruta
# N.B.: X and y must be numpy arrays
feat_selector_3.fit(np.array(X_tr_scaled), np.array(y_train.updrs_3))

X_train.columns[feat_selector_3.support_]
features_UPDRS3=X_train.columns[feat_selector_3.support_]
pd.Series(features_UPDRS3).to_csv('features_UPDRS3', header=False, index=False)

Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	0
Tentative: 	33
Rejected: 	1133
Iteration: 	9 / 100
Confirmed: 	3
Tentative: 	30
Rejected: 	1133
Iteration: 	10 / 100
Confirmed: 	3
Tentative: 	30
Rejected: 	1133
Iteration: 	11 / 100
Confirmed: 	3
Tentative: 	30
Rejected: 	1133
Iteration: 	12 / 100
Confirmed: 	4
Tentative: 	29
Rejected: 	1133
Iteration: 	13 / 100
Confirmed: 	4
Tentative: 	29
Rejected: 	1133
Iteration: 	14 / 100
Confirmed: 	4
Tentative: 	29
Rejected: 	1133
Iteration: 	15 / 100
Confirmed: 	4
Tentative: 	29
Rejected: 	1133
Iteration: 	16 / 100
Confi

In [22]:
model = RandomForestRegressor(n_estimators=1000, max_depth=5, random_state=42)
feat_selector_4 = BorutaPy(
    verbose=2,
    estimator=model,
    n_estimators='auto' # let BorutaPy choose the number of trees automatically
)

# train Boruta
# N.B.: X and y must be numpy arrays
feat_selector_4.fit(np.array(X_tr_scaled), np.array(y_train.updrs_4))

X_train.columns[feat_selector_4.support_]
features_UPDRS4=X_train.columns[feat_selector_4.support_]
pd.Series(features_UPDRS4).to_csv('features_UPDRS4', header=False, index=False)

Iteration: 	1 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	2 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	3 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	4 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	5 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	6 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	7 / 100
Confirmed: 	0
Tentative: 	1166
Rejected: 	0
Iteration: 	8 / 100
Confirmed: 	0
Tentative: 	2
Rejected: 	1164
Iteration: 	9 / 100
Confirmed: 	0
Tentative: 	2
Rejected: 	1164
Iteration: 	10 / 100
Confirmed: 	0
Tentative: 	2
Rejected: 	1164
Iteration: 	11 / 100
Confirmed: 	0
Tentative: 	2
Rejected: 	1164
Iteration: 	12 / 100
Confirmed: 	1
Tentative: 	1
Rejected: 	1164
Iteration: 	13 / 100
Confirmed: 	1
Tentative: 	1
Rejected: 	1164
Iteration: 	14 / 100
Confirmed: 	1
Tentative: 	1
Rejected: 	1164
Iteration: 	15 / 100
Confirmed: 	1
Tentative: 	1
Rejected: 	1164
Iteration: 	16 / 100
Confirmed: 	1

The important features selected by BorutaPy will now be used to predict the UPDRS scores.
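
As a sketch of how a saved feature list might be used later, the example below reads a one-column, header-less list of feature names and subsets a feature table with it. An in-memory buffer stands in for a file such as `features_UPDRS1`, and the toy frame and its values are hypothetical.

```python
import io

import pandas as pd

# In-memory stand-in for a saved feature file: one name per line, no header
saved_features = io.StringIO("P04180\nP43121\n")

# Read the names back as a plain Python list
selected = pd.read_csv(saved_features, header=None)[0].tolist()

# Hypothetical feature table containing selected and unselected columns
toy_X = pd.DataFrame({"P04180": [1.0], "P43121": [2.0], "O00391": [3.0]})

# Keep only the selected columns
toy_X_selected = toy_X[selected]

print(toy_X_selected)
```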

Save the training and test dataset files

In [44]:
X_train.to_csv("X_train.csv",index=True)
X_test.to_csv("X_test.csv",index=True)
y_train.to_csv("y_train.csv",index=True)
y_test.to_csv("y_test.csv",index=True)
np.savetxt('X_test_scaled.txt', X_test_scaled)
# Save the scaled training array (previously the test array was written here by mistake)
np.savetxt('X_tr_scaled.txt', X_tr_scaled)



Unnamed: 0,O00391,O00533,O00584,O14498,O14773,O14791,O15240,O15394,O43505,O60888,...,YSLTYIYTGLSK,YTTEIIK,YVGGQEHFAHLLILR,YVM(UniMod_35)LPVADQDQC(UniMod_4)IR,YVMLPVADQDQC(UniMod_4)IR,YVNKEIQNAVNGVK,YWGVASFLQK,YYC(UniMod_4)FQGNQFLR,YYTYLIMNK,YYWGGQYTWDMAK
0,13.458189,19.482331,15.272695,15.341759,14.931014,12.037104,17.439693,15.940731,18.346791,17.348192,...,17.61797,14.009505,21.861462,16.705821,19.147352,17.000913,17.339528,18.73828,15.498388,13.86287
1,13.684266,19.266057,15.10414,15.333679,14.678375,12.108662,17.337674,15.933126,18.083858,17.3781,...,17.384303,13.688119,21.974045,16.79087,18.973823,16.659439,17.141778,18.804645,15.289432,14.337615
2,13.89724,19.636587,15.34606,15.27915,14.90612,12.084676,17.204886,16.031079,18.342565,17.206041,...,17.822347,14.125559,22.384201,16.827318,19.441143,17.063216,17.471699,18.786771,15.739915,14.414758
3,13.72396,19.523884,15.393689,15.408847,15.035211,12.389916,16.624795,16.044492,18.276352,16.907149,...,17.499425,14.181502,21.34281,16.472578,19.373398,16.972453,17.635945,18.927584,15.688051,13.770426
4,13.453618,18.607901,14.329025,14.933456,12.591892,11.324389,17.292447,15.406175,17.336681,16.379796,...,17.787966,12.643811,0.0,15.813065,18.87553,16.287734,16.281602,19.128931,15.550921,13.936095
