# Feature generation for the human proteome

This notebook includes:
- Feature generation
- Data normalization (amino acid count values to proportions)
- Log transformation of features
- Removal of non-MS detectable proteins from the dataset
       
Output dataset:
- features_human_proteome.csv
- features_human_proteome_MS_filter.csv

## Import libraries

In [1]:
import json
import numpy as np
import os
import pandas as pd

from Bio import SeqIO
from Bio.SeqUtils.ProtParam import ProteinAnalysis

## Define paths

In [2]:
Data_path = os.path.dirname(os.getcwd()) + '/Data'

## Feature generation

### Basic & structural property predictions (NetSurfP-2.0 predictions)

Dataset already contains:
- Sequence length
- Molecular weight
- Amino acid count
- NetSurfP-2.0 predictions

In [3]:
# import NSP2 predictions for human proteome (does not include TITIN as NSP2 cannot run such a big protein)
df_features = pd.read_csv(Data_path + '/Features/NSP2_complete.tab', sep='\t', engine='python')  
df_features = df_features.drop(columns=['PDB', 'q3_H', 'q3_E', 'q3_C', 'on_surface']) # drop unused columns

### Exposed amino acid count

In [4]:
# a residue counts as exposed if its relative surface accessibility (RSA) is > 0.4
nsp_aa_exposed = pd.read_csv(Data_path + '/Features/NSP2_exposed.csv', sep=',', engine='python') 
df_features = pd.merge(df_features, nsp_aa_exposed, on=['id'], how="left") 
# nsp_aa_exposed contains only 19873 proteins; an inner merge would drop the others, so a left merge is used and their exposed counts are NA
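
To check how many proteins lack exposed-residue counts after a left merge, the `indicator=True` option of `pd.merge` can be used. The following is a minimal sketch on toy data; the IDs and counts are made up and only stand in for `df_features` and `nsp_aa_exposed`.

```python
import pandas as pd

# Toy stand-ins for df_features and nsp_aa_exposed (values are made up)
left = pd.DataFrame({'id': ['P1', 'P2', 'P3'], 'length': [100, 250, 80]})
right = pd.DataFrame({'id': ['P1', 'P3'], 'A_exposed': [4, 2]})

# A left merge keeps every row of `left`; unmatched rows get NaN
merged = pd.merge(left, right, on='id', how='left', indicator=True)

# '_merge' is 'both' for matched rows and 'left_only' for unmatched ones
unmatched = merged['_merge'] == 'left_only'
n_missing = int(unmatched.sum())
missing_ids = merged.loc[unmatched, 'id'].tolist()
print(n_missing, missing_ids)
```

Dropping the `_merge` column afterwards keeps the feature table unchanged.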

### Solubility-weighted index (SWI)

Bhandari BK, Gardner PP, Lim CS. Solubility-Weighted Index: fast and accurate prediction of protein solubility. Bioinformatics. 2020 Sep 15; 36(18):4691-4698. doi: https://doi.org/10.1093/bioinformatics/btaa578

In [5]:
weights = {'A': 0.8356471476582918,
           'C': 0.5208088354857734,
           'U': 0.5208088354857734, # selenocysteine (U) occurs in some FASTA sequences; it is given the cysteine weight
           'E': 0.9876987431418378,
           'D': 0.9079044671339564,
           'G': 0.7997168496420723,
           'F': 0.5849790194237692,
           'I': 0.6784124413866582,
           'H': 0.8947913996466419,
           'K': 0.9267104557513497,
           'M': 0.6296623675420369,
           'L': 0.6554221515081433,
           'N': 0.8597433107431216,
           'Q': 0.789434648348208,
           'P': 0.8235328714705341,
           'S': 0.7440908318492778,
           'R': 0.7712466317693457,
           'T': 0.8096922697856334,
           'W': 0.6374678690957594,
           'V': 0.7357837119163659,
           'Y': 0.6112801822947587}

A = 81.0581
B = -62.7775

def SWI(df):

    df['SWI'] = df['fasta_sequence'].apply(
        lambda x: np.mean([weights[i] for i in x]))
    df['Probability_solubility'] = 1/(1 + np.exp(-(A*df['SWI'] + B)))

    return df

In [6]:
# run SWI calculation, add SWI columns to feature dataset
df_features = SWI(df_features.copy())
# drop SWI column
df_features = df_features.drop(columns=['SWI']) 
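
As a worked illustration of the two steps above (mean residue weight, then a logistic transform with the constants `A` and `B`), the sketch below scores a single made-up tripeptide. It uses only three of the weights from the table, and the function names are chosen here for illustration.

```python
import math

# Three of the SWI weights listed above
weights = {'A': 0.8356471476582918,
           'K': 0.9267104557513497,
           'L': 0.6554221515081433}
A, B = 81.0581, -62.7775

def swi(sequence):
    """Mean residue weight over the sequence."""
    return sum(weights[aa] for aa in sequence) / len(sequence)

def solubility_probability(score):
    """Logistic transform of the SWI score."""
    return 1 / (1 + math.exp(-(A * score + B)))

score = swi('AKL')  # hypothetical tripeptide, for illustration only
prob = solubility_probability(score)
print(round(score, 4), round(prob, 4))
```

Because `A` is large, small differences in the mean weight move the probability a lot, which is why only the probability is kept as a feature here.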

### Aggregation propensity

Sánchez de Groot N, Pallarés I, Avilés FX, Vendrell J, Ventura S. Prediction of "hot spots" of aggregation in disease-linked polypeptides. BMC Struct Biol. 2005 Sep 30;5:18. doi: https://doi.org/10.1186/1472-6807-5-18

In [7]:
propensities = {'I': 1.822,
                'F': 1.754,
                'V': 1.594,
                'L': 1.380,
                'Y': 1.159,
                'W': 1.037,
                'M': 0.910,
                'C': 0.604,
                'U': 0.604,
                'A': -0.036,
                'T': -0.159,
                'S': -0.294,
                'P': -0.334,
                'G': -0.535,
                'K': -0.931,
                'H': -1.033,
                'Q': -1.231,
                'R': -1.240,
                'N': -1.302,
                'E': -1.412,
                'D': -1.836}

def aggregation_score(df):

    df['Aggregation_propensity'] = df['fasta_sequence'].apply(
        lambda x: np.sum([propensities[i] for i in x]))

    return df

In [8]:
# run aggregation propensity calculation, add aggregation column to feature dataset
df_features = aggregation_score(df_features.copy())
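
Both the SWI and the aggregation lookup raise a `KeyError` if a sequence contains a character that is missing from the table (for example `X` for an unknown residue). A small pre-check, sketched below under the assumption that sequences are plain strings keyed by accession, lists such characters before scoring. The example sequences are hypothetical.

```python
# Keys of the propensity table above (only membership matters here)
known_residues = set('IFVLYWMCUATSPGKHQRNED')

def unknown_residues(sequence, alphabet=known_residues):
    """Return the characters in `sequence` that have no table entry."""
    return set(sequence) - alphabet

# Hypothetical sequences; 'X' is not in the table
sequences = {'P1': 'MKVLA', 'P2': 'MKXLA'}
problems = {pid: unknown_residues(seq)
            for pid, seq in sequences.items()
            if unknown_residues(seq)}
print(problems)
```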

### Simple protein analysis by Biopython

Bio.SeqUtils.ProtParam module (https://biopython.org/wiki/ProtParam)
- Aromaticity: relative frequency of Phe + Trp + Tyr (Lobry, 1994).
- Instability index: implementation of the method of Guruprasad et al., 1990. A value above 40 indicates that the protein is unstable (has a short half-life).
- Gravy: grand average of hydropathy (GRAVY) according to Kyte and Doolittle.
- Isoelectric point: pH at which the protein carries no net charge.
- Charge at pH 5: net charge of the protein at pH 5.
- Charge at pH 7: net charge of the protein at pH 7.

In [9]:
# Generate a list and replace U with C, because biopython gives an error otherwise
sequences = list(df_features['fasta_sequence'])
sequences = [sequence.replace('U', 'C') for sequence in sequences]

# create empty lists
aromaticity_list = []
instability_index_list = []
gravy_list = []
isoelectric_point_list = []
charge_at_7_list = []
charge_at_5_list = []

In [10]:
for sequence in sequences:
    analysis = ProteinAnalysis(str(sequence))
    
    #calculate different values
    aromaticity = analysis.aromaticity()
    instability_index = analysis.instability_index()
    gravy_score = analysis.gravy()
    isoelectric_point = analysis.isoelectric_point()
    charge_at_7 = analysis.charge_at_pH(7)
    charge_at_5 = analysis.charge_at_pH(5)
    
    #append the list
    aromaticity_list.append(aromaticity)
    instability_index_list.append(instability_index)
    gravy_list.append(gravy_score)
    isoelectric_point_list.append(isoelectric_point)
    charge_at_7_list.append(charge_at_7)
    charge_at_5_list.append(charge_at_5)

In [11]:
# add a column
df_features['Aromaticity'] = aromaticity_list 
df_features['Instability_index'] = instability_index_list 
df_features['Gravy'] = gravy_list 
df_features['Isoelectric_point'] = isoelectric_point_list 
df_features['Charge_at_7'] = charge_at_7_list 
df_features['Charge_at_5'] = charge_at_5_list 

### HSP annotation

In [12]:
HSP = pd.read_csv(Data_path + '/Features/UniProt/HSP.tab', sep='\t', engine='python')

HSP_list = set(HSP["Entry"])
df_features["HSP"] = np.where(df_features['id'].isin(HSP_list), 1, 0) 

### Post-translational modification

#### UniProt annotations (keywords)

In [13]:
def up_import(ptm):
    
    # import data: one UniProt accession per line
    with open(Data_path + '/Features/UniProt/' + ptm + '.list', 'r') as f:
        # strip newlines and skip empty lines
        UP_list = [line.rstrip('\n') for line in f if line.strip()]
    
    return UP_list 

In [14]:
# define UniProt keywords
UP = ['PTM', 'Acetylation', 'Citrullination' , 'Glycoprotein', 'GPI-anchor', 'Methylation',  'Myristate', 'Nitration', 
      'Palmitate', 'Prenylation', 'Phosphoprotein', 'S-nitrosylation', 'Ubl_conjugation'] # dropped Lipoprotein keyword

# create dict of all PTM files
UP_dict = {}
for i in UP:
    UP_dict[i] = up_import(i)

In [15]:
# 0/1 annotation
UP_only = ['PTM', 'Citrullination', 'GPI-anchor', 'Nitration', 'Prenylation']

print('Number of annotations in feature dataset (UniProt keywords)')

for i in UP_only:
    column_name = str(i) + '_Uniprot'
    df_features[column_name] = np.where(df_features['id'].isin(UP_dict[i]), 1, 0) 
    print(str(i) + ':', sum(df_features[column_name]))

Number of annotations in feature dataset (UniProt keywords)
PTM: 13918
Citrullination: 68
GPI-anchor: 139
Nitration: 48
Prenylation: 172


#### UniProt annotations (text mining)

In [16]:
# load PTM comment table
PTM_comments = pd.read_csv(Data_path + '/Features/UniProt/PTM_comments.tab', sep='\t', engine='python')

# merge PTM comment table with feature dataset
PTM_comments.rename(columns={'Entry':'id', 'Post-translational modification':'PTM_comments'}, inplace=True)
df_features = df_features.merge(PTM_comments, on='id', how='left')

In [17]:
print('Number of annotations in feature dataset (UniProt textmining)')

df_features['ISGylation_Uniprot'] = np.where(df_features['PTM_comments'].str.contains('ISGy', na=False), 1, 0)
print("ISGylation:", sum(df_features['ISGylation_Uniprot']))
df_features['NEDDylation_Uniprot'] = np.where(df_features['PTM_comments'].str.contains('NEDD', na=False), 1, 0)
print("NEDDylation:", sum(df_features['NEDDylation_Uniprot']))

Number of annotations in feature dataset (UniProt textmining)
ISGylation: 42
NEDDylation: 41
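
A plain substring match on `NEDD` also matches gene names such as NEDD4, which may appear in comments about ubiquitination rather than NEDDylation. The sketch below compares the substring match with a stricter pattern on made-up comment strings. The stricter pattern is an assumption for illustration and is not what the notebook uses.

```python
import re

# Hypothetical PTM comment strings, written for illustration only
comments = {
    'P1': 'Neddylated; conjugation to NEDD8 is required for activity.',
    'P2': 'Ubiquitinated by NEDD4, leading to degradation.',
    'P3': 'Phosphorylated on serine residues.',
}

# Plain substring match, as in the cell above
substring_hits = {pid for pid, text in comments.items() if 'NEDD' in text}

# Stricter, case-insensitive pattern (an assumption, not part of the notebook)
pattern = re.compile(r'NEDD8|neddylat', flags=re.IGNORECASE)
strict_hits = {pid for pid, text in comments.items() if pattern.search(text)}

print(sorted(substring_hits), sorted(strict_hits))
```

Whether the stricter pattern is appropriate depends on how the UniProt comments are worded, so spot-checking the matched comments is advisable before changing the counts.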


In [18]:
df_features.drop(columns=['PTM_comments'], inplace=True) 

#### PhosphoSitePlus annotations

In [19]:
def psp_import(ptm):
    
    # import data
    loc = Data_path + '/Features/PhosphositePlus/' + ptm + '_psp.tab'
    df = pd.read_csv(loc, sep='\t', engine='python')
    
    # filter for human entries only
    df = df[df['ORGANISM'] == 'human']
    
    return df 
    
def psp_to_list(df):    
    
    # create list of UniProt IDs
    UP_list = list(df['ACC_ID'])
    print('\t Number of entries:', len(UP_list))
    
    # filter out duplicates
    UP_list_unique = list(set(UP_list))
    print('\t Number of unique entries:', len(UP_list_unique))
    
    return UP_list_unique


def psp_to_count(df):    
    
    # count entries per UniProt ID without modifying the input dataframe
    counts = df.groupby('ACC_ID', sort=False).size().reset_index(name='COUNT')
    
    return counts[['ACC_ID', 'COUNT']]

In [20]:
psp = ['Ubiquitination', 'Phosphorylation', 'Methylation', 'Acetylation', 'Sumoylation', 'O-GlcNAc', 'O-GalNAc']
# create dict of all PTM files
psp_dict = {}

for i in psp:
    print(i)
    psp_dict[i] = psp_to_list(psp_import(i))

Ubiquitination
	 Number of entries: 97777
	 Number of unique entries: 12435
Phosphorylation
	 Number of entries: 239544
	 Number of unique entries: 19834
Methylation
	 Number of entries: 16374
	 Number of unique entries: 5691
Acetylation
	 Number of entries: 22824
	 Number of unique entries: 7253
Sumoylation
	 Number of entries: 8415
	 Number of unique entries: 2667
O-GlcNAc
	 Number of entries: 441
	 Number of unique entries: 173
O-GalNAc
	 Number of entries: 2102
	 Number of unique entries: 478


In [21]:
Glycosylation_PSP = list(set(psp_dict['O-GlcNAc'] + psp_dict['O-GalNAc']))

#### iPTMnet annotations

In [22]:
iPTM = Data_path + '/Features/iPTMnet_annotations.txt'  
iPTM = pd.read_csv(iPTM, sep='\t', engine='python')

iPTM = iPTM[iPTM['organism'] == 'Homo sapiens (Human)']

# count all glycosylation sites together
iPTM['ptm_type'] = iPTM['ptm_type'].replace(['N-GLYCOSYLATION', 'O-GLYCOSYLATION', 'C-GLYCOSYLATION', 'S-GLYCOSYLATION'], 
                    'GLYCOSYLATION')

iPTM['ptm_type'].value_counts()

PHOSPHORYLATION    139966
ACETYLATION          4841
GLYCOSYLATION        3348
S-NITROSYLATION      2376
METHYLATION          1007
UBIQUITINATION        523
MYRISTOYLATION        125
SUMOYLATION            35
Name: ptm_type, dtype: int64

In [23]:
iPTM_list = ['PHOSPHORYLATION', 'ACETYLATION', 'GLYCOSYLATION', 'S-NITROSYLATION', 'METHYLATION', 'UBIQUITINATION', 
        'MYRISTOYLATION', 'SUMOYLATION'] 

def iPTM_to_list(ptm): 
    
    iPTM_subset = iPTM[iPTM['ptm_type'] == ptm]
    print('\t Number of entries:', len(iPTM_subset))
    
    iPTM_list = list(set(list(iPTM_subset['substrate_UniProtAC'])))
    print('\t Number of unique entries:', len(iPTM_list))
    
    return iPTM_list

In [24]:
# create dict of all PTM files
iPTM_dict = {}

for i in iPTM_list:
    print(i)
    iPTM_dict[i] = iPTM_to_list(i)

PHOSPHORYLATION
	 Number of entries: 139966
	 Number of unique entries: 15228
ACETYLATION
	 Number of entries: 4841
	 Number of unique entries: 2970
GLYCOSYLATION
	 Number of entries: 3348
	 Number of unique entries: 1160
S-NITROSYLATION
	 Number of entries: 2376
	 Number of unique entries: 1213
METHYLATION
	 Number of entries: 1007
	 Number of unique entries: 471
UBIQUITINATION
	 Number of entries: 523
	 Number of unique entries: 183
MYRISTOYLATION
	 Number of entries: 125
	 Number of unique entries: 120
SUMOYLATION
	 Number of entries: 35
	 Number of unique entries: 25


#### SwissPalm (Palmitoylation database)

In [25]:
SwissPalm = pd.read_csv(Data_path + '/Features/SwissPalm_proteins.txt', sep='\t', engine='python')

In [26]:
def swisspalm_to_list(df): 
    
    # filter for human entries only
    df = df[df['organism'] == 'Homo sapiens']
    
    # create list of UniProt IDs
    UP_list = list(df['uniprot_ac'])
    
    # filter out duplicates
    UP_list_unique = list(set(UP_list))
    print('Number of unique entries:', len(UP_list_unique))
    
    return UP_list_unique

In [27]:
SwissPalm_list = swisspalm_to_list(SwissPalm)

Number of unique entries: 4587


#### Combined PTM annotation

In [28]:
# create list of combined database annotations
Acetylation_all = set(UP_dict["Acetylation"] + iPTM_dict["ACETYLATION"] + psp_dict["Acetylation"])
Glycosylation_all = set(UP_dict["Glycoprotein"] + iPTM_dict["GLYCOSYLATION"] + Glycosylation_PSP)
Methylation_all = set(UP_dict["Methylation"] + iPTM_dict["METHYLATION"] + psp_dict["Methylation"])
Myristoylation_all = set(UP_dict["Myristate"] + iPTM_dict["MYRISTOYLATION"])
Nitrosylation_all = set(UP_dict["S-nitrosylation"] + iPTM_dict["S-NITROSYLATION"])                  
Palmitoylation_all = set(UP_dict["Palmitate"] + SwissPalm_list)
Phosphorylation_all = set(UP_dict["Phosphoprotein"] + iPTM_dict["PHOSPHORYLATION"] + psp_dict["Phosphorylation"])
SUMOylation_all = set(iPTM_dict["SUMOYLATION"] + psp_dict["Sumoylation"])
Ubiquitination_all = set(iPTM_dict["UBIQUITINATION"] + psp_dict["Ubiquitination"])

df_features['Acetylation_all'] = np.where(df_features['id'].isin(Acetylation_all), 1, 0)
df_features['Glycosylation_all'] = np.where(df_features['id'].isin(Glycosylation_all), 1, 0)
df_features['Methylation_all'] = np.where(df_features['id'].isin(Methylation_all), 1, 0)
df_features['Myristoylation_all'] = np.where(df_features['id'].isin(Myristoylation_all), 1, 0)
df_features['Nitrosylation_all'] = np.where(df_features['id'].isin(Nitrosylation_all), 1, 0)
df_features['Palmitoylation_all'] = np.where(df_features['id'].isin(Palmitoylation_all), 1, 0)
df_features['Phosphorylation_all'] = np.where(df_features['id'].isin(Phosphorylation_all), 1, 0)
df_features['SUMOylation_all'] = np.where(df_features['id'].isin(SUMOylation_all), 1, 0)
df_features['Ubiquitination_all'] = np.where(df_features['id'].isin(Ubiquitination_all), 1, 0)
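
The combination step above amounts to a set union per modification followed by a membership test per protein. A minimal sketch with made-up accession lists:

```python
# Hypothetical accession lists standing in for the per-database annotations
uniprot_hits = ['P1', 'P2']
iptmnet_hits = ['P2', 'P3']
psp_hits = ['P3', 'P4', 'P4']

# Union across databases; duplicates within and between lists collapse
combined = set().union(uniprot_hits, iptmnet_hits, psp_hits)

# 0/1 flag per protein in the feature table
protein_ids = ['P1', 'P4', 'P9']
flags = [int(pid in combined) for pid in protein_ids]
print(sorted(combined), flags)
```

A protein is flagged 1 if any one of the databases annotates it, so the combined counts can exceed those of each individual source.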

In [29]:
print("Number of annotations in feature dataset")
print("Acetylation:", sum(df_features['Acetylation_all']))
print("Glycosylation:", sum(df_features['Glycosylation_all']))
print("Methylation:", sum(df_features['Methylation_all']))
print("Myristoylation:", sum(df_features['Myristoylation_all']))
print("Nitrosylation:", sum(df_features['Nitrosylation_all']))
print("Palmitoylation:", sum(df_features['Palmitoylation_all']))
print("Phosphorylation:", sum(df_features['Phosphorylation_all']))
print("Ubiquitination:", sum(df_features['Ubiquitination_all']))

Number of annotations in feature dataset
Acetylation: 7950
Glycosylation: 4892
Methylation: 5683
Myristoylation: 196
Nitrosylation: 1223
Palmitoylation: 3705
Phosphorylation: 17524
Ubiquitination: 11709


#### MusiteDeep (PTM prediction)

Wang D, Liu D, Yuchi J, He F, Jiang Y, Cai S, Li J, Xu D. MusiteDeep: a deep-learning based webserver for protein post-translational modification site prediction and visualization. Nucleic Acids Res. 2020 Jul 2;48(W1):W140-W146. doi: https://doi.org/10.1093/nar/gkaa275

In [30]:
# load PTM prediction dataset
MSP = pd.read_csv(Data_path + '/Features/PTM_MusiteDeep_75.csv', sep=',', engine='python')
MSP.rename(columns={'ID':'id'}, inplace=True)

# merge PTM predictions with feature dataset
df_features = df_features.merge(MSP, how='left', on='id')

# change any NA values to 0
df_features = df_features.fillna(0)

# convert PTM count to 0/1 annotation
df_features['PTM_count'] = np.where(df_features['PTM_count'] > 0, 1, 0)
# rename columns
df_features.rename(columns = {'PTM_count':'PTM_MSD', 'Acetyllysine_MSD':'Acetylation_MSD'}, inplace=True)
# drop MSD predictions with no annotation 
df_features = df_features.drop(columns=['Hydroxylation_MSD', 'Pyrrolidone_carboxylic_acid_MSD']) 

### PROSITE domains

- Coiled coil domain
- EGF
- RRM
- RAS profile
- WW domain

In [31]:
PROSITE_path = Data_path + '/Features/PROSITE/matched_domains/'
files_PROSITE = [f for f in os.listdir(PROSITE_path) if os.path.isfile(os.path.join(PROSITE_path, f))]

for filename in files_PROSITE:
    # column name = file name without extension
    domain = os.path.splitext(filename)[0]
    matches = pd.read_csv(PROSITE_path + filename, sep='\t', engine='python', header=None)
    domain_set = set(matches[0])
    df_features[domain] = np.where(df_features['id'].isin(domain_set), 1, 0)

### Transmembrane annotations

In [32]:
# import data (tm=transmembrane)
tm = Data_path + '/Features/UniProt/TM_proteins.tab'
tm = pd.read_csv(tm, sep='\t', engine='python')
tm_proteins = list(tm['Entry'])

# add a column 1/0
df_features['transmembrane'] = np.where(df_features['id'].isin(tm_proteins), 1, 0)
print('Number of transmembrane proteins in feature dataset:', sum(df_features['transmembrane']))

Number of transmembrane proteins in feature dataset: 6316


### Transmembrane predictions (TMHMM)

In [33]:
# import data (TMHMM transmembrane predictions)
TMHMM = Data_path + '/Features/TMHMM_tmp.csv'
TMHMM = pd.read_csv(TMHMM, sep=',')

TMHMM_list = list(TMHMM['id'])
# add a column 1/0
df_features['TMHMM'] = np.where(df_features['id'].isin(TMHMM_list), 1, 0)
print('Number of transmembrane proteins predicted by TMHMM:', sum(df_features['TMHMM']))

Number of transmembrane proteins predicted by TMHMM: 5279


## Data preprocessing 

### Normalization

In [34]:
df_normalized = df_features.copy()

features_count = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y',
                  'hydr_count', 'polar_count']

# normalize amino acid count features by length
for feature in features_count:
    df_normalized[feature] = df_normalized[feature]/df_normalized.length

In [35]:
features_exposed = ['A_exposed', 'C_exposed', 'D_exposed', 'E_exposed', 'F_exposed', 'G_exposed', 'H_exposed', 
                    'I_exposed', 'K_exposed', 'L_exposed', 'M_exposed', 'N_exposed', 'P_exposed', 'Q_exposed', 
                    'R_exposed', 'S_exposed', 'T_exposed', 'V_exposed', 'W_exposed', 'Y_exposed']

hydrophobic = ['A_exposed', 'C_exposed', 'F_exposed', 'L_exposed', 'I_exposed', 'M_exposed', 'V_exposed', 
               'W_exposed', 'Y_exposed']

polar = ['D_exposed', 'E_exposed', 'H_exposed', 'K_exposed', 'N_exposed', 'Q_exposed', 'R_exposed', 'T_exposed']

# create a sum of exposed AA column
df_normalized['Sum_AA_exposed'] = df_normalized[features_exposed].sum(axis=1)

# create a column for polar and hydrophobic exposed AA counts and normalise it
df_normalized['Polar_exposed'] = df_normalized[polar].sum(axis=1)
df_normalized['Polar_exposed'] = df_normalized['Polar_exposed']/df_normalized['Sum_AA_exposed']

df_normalized['Hydrophobic_exposed'] = df_normalized[hydrophobic].sum(axis=1)
df_normalized['Hydrophobic_exposed'] = df_normalized['Hydrophobic_exposed']/df_normalized['Sum_AA_exposed']

for feature in features_exposed:
    df_normalized[feature] = df_normalized[feature]/df_normalized['Sum_AA_exposed']

In [36]:
# handle infinity values as missing values
pd.options.mode.use_inf_as_na = True
# change any NA values to 0, caused by division by 0
df_normalized = df_normalized.fillna(0)
# drop the summed column
df_normalized = df_normalized.drop(columns=['Sum_AA_exposed'])
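
The `use_inf_as_na` option is deprecated in newer pandas versions. An alternative that does not rely on a global option is to replace infinite values explicitly before filling. The sketch below shows this on toy counts; the column values are made up, and this is one possible approach rather than the notebook's own.

```python
import numpy as np
import pandas as pd

# Toy exposed-residue counts; the second protein has no exposed residues
df = pd.DataFrame({'A_exposed': [2, 0], 'K_exposed': [6, 0]})
total = df.sum(axis=1)

# 0/0 yields NaN and x/0 yields inf; map both to 0 explicitly
proportions = (df.div(total, axis=0)
                 .replace([np.inf, -np.inf], np.nan)
                 .fillna(0))
print(proportions)
```

Note that a global `use_inf_as_na = True` also changes how later `dropna` calls treat infinite values, so an explicit replacement would have to be repeated wherever that behavior is needed.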

### Logarithmic transformation

In [37]:
df_log = df_normalized.copy()
df_log['length'] = df_log['length'].transform(np.log2)
df_log['molecular_weight'] = df_log['molecular_weight'].transform(np.log2)
df_log['tasa_netsurfp2'] = df_log['tasa_netsurfp2'].transform(np.log2)
df_log['thsa_netsurfp2'] = df_log['thsa_netsurfp2'].transform(np.log2)

In [38]:
df_log.dropna(inplace=True)
df_log[['length', 'molecular_weight', 'tasa_netsurfp2', 'thsa_netsurfp2']].describe()

| | length | molecular_weight | tasa_netsurfp2 | thsa_netsurfp2 |
|---|---|---|---|---|
| count | 20381.0 | 20381.0 | 20381.0 | 20381.0 |
| mean | 8.669155 | 15.68517 | 14.682639 | 12.842945 |
| std | 1.134623 | 1.135034 | 1.054951 | 0.961493 |
| min | 1.0 | 8.120497 | 8.356479 | 7.222158 |
| 25% | 7.960002 | 14.978191 | 13.99642 | 12.198453 |
| 50% | 8.696968 | 15.712038 | 14.620437 | 12.853003 |
| 75% | 9.388017 | 16.406327 | 15.362288 | 13.440654 |
| max | 13.824462 | 20.763749 | 19.447881 | 17.275738 |
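
The four transformations above can also be written as one operation over a list of columns. The sketch below uses made-up values for two of the columns and shows that the transform is reversible with a power of two. The name `log_columns` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy values for two of the log-transformed columns (made up)
df = pd.DataFrame({'length': [2, 1024],
                   'molecular_weight': [256.0, 65536.0]})
log_columns = ['length', 'molecular_weight']

df_log = df.copy()
# np.log2 applied to a DataFrame works element-wise and returns a DataFrame
df_log[log_columns] = np.log2(df_log[log_columns])

# Back-transform to the original scale
restored = 2 ** df_log[log_columns]
print(df_log)
```

On this scale, a value of 1.0 in `length` corresponds to a sequence of two residues.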


## Application of MS filter

According to ProteomicsDB (https://www.proteomicsdb.org/#api), the following evidence classes exist:
- 0: not observed
- 1: observed with low scoring spectra 
- 2: observed with high quality spectra

Only proteins with observed spectra are retained to limit the bias caused by mass spectrometry detectability.

In [39]:
# load evidence ProteomicsDB data
with open(Data_path + '/ProteomicsDB/ProteomicsDB_evidence_positive.txt') as f:
    ProteomicsDB_evidence_positive = json.load(f)

In [40]:
# create list of proteins with evidence on ProteomicsDB
pos = set(ProteomicsDB_evidence_positive.keys())
pos_filtered_low = set([k for k, v in ProteomicsDB_evidence_positive.items() if v > 0])
print('Number of proteins in positive set:', len(pos))
print('Number of MS-detectable proteins in positive set with evidence score of 1 or higher:', len(pos_filtered_low))

Number of proteins in positive set: 18997
Number of MS-detectable proteins in positive set with evidence score of 1 or higher: 16791


In [41]:
# filter feature dataset for MS detected proteins
df_MS_filtered = df_log[df_log['id'].isin(pos_filtered_low)]
print('Number of MS-detectable proteins in human proteome dataset:', len(df_MS_filtered))

Number of MS-detectable proteins in human proteome dataset: 16790
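
The filter above keeps evidence classes 1 and 2. The sketch below reproduces that selection on a made-up accession-to-class mapping and, for comparison only, shows a stricter filter that keeps class 2 alone. The stricter variant is an assumption for illustration and is not used in this notebook.

```python
import json

# Hypothetical evidence mapping (accession -> ProteomicsDB evidence class)
raw = '{"P1": 0, "P2": 1, "P3": 2}'
evidence = json.loads(raw)

# Classes 1 and 2: observed with low-scoring or high-quality spectra
detected = {acc for acc, level in evidence.items() if level > 0}

# Class 2 only: observed with high-quality spectra (illustrative alternative)
high_quality = {acc for acc, level in evidence.items() if level == 2}

print(sorted(detected), sorted(high_quality))
```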


## Save final feature dataset

In [42]:
df_log[:5] # unfiltered human proteome dataset

| | id | length | hydr_count | polar_count | molecular_weight | helix | turn | sheet | A | C | ... | Methylation_MSD | coiled_coil | EGF | RAS_profile | RRM | ww_domain | transmembrane | TMHMM | Polar_exposed | Hydrophobic_exposed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Q8N7X0 | 10.703038 | 0.376125 | 0.431314 | 17.745289 | 0.225555 | 0.604079 | 0.170366 | 0.056989 | 0.011398 | ... | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.597984 | 0.161254 |
| 1 | Q5T1N1 | 9.707359 | 0.29067 | 0.479665 | 16.719373 | 0.183014 | 0.777512 | 0.039474 | 0.052632 | 0.021531 | ... | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.570968 | 0.162903 |
| 2 | Q92667 | 9.818582 | 0.376523 | 0.370986 | 16.793404 | 0.079734 | 0.805094 | 0.115172 | 0.079734 | 0.018826 | ... | 0.0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0.447738 | 0.263651 |
| 3 | Q5VUY0 | 8.668885 | 0.511057 | 0.312039 | 15.7064 | 0.4914 | 0.380835 | 0.127764 | 0.058968 | 0.039312 | ... | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0.56 | 0.256 |
| 4 | P62736 | 8.558421 | 0.427056 | 0.37931 | 15.574035 | 0.445623 | 0.34748 | 0.206897 | 0.076923 | 0.018568 | ... | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.68254 | 0.111111 |


In [43]:
df_MS_filtered[:5] # filtered human proteome dataset

| | id | length | hydr_count | polar_count | molecular_weight | helix | turn | sheet | A | C | ... | Methylation_MSD | coiled_coil | EGF | RAS_profile | RRM | ww_domain | transmembrane | TMHMM | Polar_exposed | Hydrophobic_exposed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | Q92667 | 9.818582 | 0.376523 | 0.370986 | 16.793404 | 0.079734 | 0.805094 | 0.115172 | 0.079734 | 0.018826 | ... | 0.0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0.447738 | 0.263651 |
| 4 | P62736 | 8.558421 | 0.427056 | 0.37931 | 15.574035 | 0.445623 | 0.34748 | 0.206897 | 0.076923 | 0.018568 | ... | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.68254 | 0.111111 |
| 5 | Q9H553 | 8.70044 | 0.471154 | 0.358173 | 15.73572 | 0.485577 | 0.375 | 0.139423 | 0.072115 | 0.028846 | ... | 1.0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.614286 | 0.185714 |
| 6 | P0C7M7 | 9.179909 | 0.424138 | 0.37931 | 16.216178 | 0.32069 | 0.448276 | 0.231034 | 0.058621 | 0.022414 | ... | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.661111 | 0.088889 |
| 7 | P49703 | 7.651052 | 0.41791 | 0.402985 | 14.652697 | 0.328358 | 0.477612 | 0.19403 | 0.099502 | 0.004975 | ... | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.56701 | 0.195876 |


In [44]:
# save final unfiltered feature dataset
df_log.to_csv(Data_path + '/curated/features_human_proteome.csv', index=False)

# save final MS filtered feature dataset
df_MS_filtered.to_csv(Data_path + '/curated/features_human_proteome_MS_filter.csv', index=False)