Written by Arjana Begzati

In [3]:
import pandas as pd
import os
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('max_colwidth', None)

data_path = '/nobackup/users/hmbaghda/metastatic_potential/'

In [8]:
# load Table S2 from Nusinow et al., 2020, Cell 180, 387–402
data_df = pd.read_excel(os.path.join(data_path, 'raw', 'TableS2.xlsx'),
                        sheet_name='Normalized Protein Expression')
data_df.shape

(12755, 16384)

In [3]:
# remove the peptide-count columns ('*_Peptides') and the unnamed 'Column*' columns
data_df = data_df.iloc[:, [not (c.endswith('_Peptides') or c.startswith('Column')) for c in data_df.columns]]
data_df.shape

(12755, 384)
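
To make the column filter above easier to check, here is a minimal sketch on a toy table. The column names are hypothetical stand-ins for the three kinds of columns in the sheet (an annotation column, a sample column, and the two kinds that get dropped); they are not copied from the real file.

```python
import pandas as pd

# Hypothetical column names, for illustration only
toy_df = pd.DataFrame(
    columns=["Gene_Symbol", "LINE1_TenPx01", "TenPx01_Peptides", "Column12"]
)

# True for columns that should be dropped
drop_mask = (
    toy_df.columns.str.endswith("_Peptides")
    | toy_df.columns.str.startswith("Column")
)

# Keep everything else
kept_df = toy_df.loc[:, ~drop_mask]
print(list(kept_df.columns))
```

The vectorized `.str` mask selects the same columns as the list comprehension in the cell; either form works.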

In [4]:
# missing-value cutoff: 80% of the number of sample ('_TenPx') columns
p80_count = data_df.iloc[:, ['_TenPx' in c for c in data_df.columns]].shape[1]*0.8
p80_count

302.40000000000003

In [5]:
# count missing values per protein (row) across the sample columns
nan_count_per_row = data_df.iloc[:, ['_TenPx' in c for c in data_df.columns]].isna().sum(axis=1)
# remove proteins that are missing in >80% of samples
rows_to_keep = nan_count_per_row[nan_count_per_row<p80_count].index
# rows_to_keep holds index labels, so select with .loc rather than .iloc
data_df = data_df.loc[rows_to_keep, :]
data_df.shape

(10969, 384)
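
The filter above can be sketched on a toy table to show how the cutoff behaves at the boundary. All names and values here are hypothetical. With 5 samples the cutoff is exactly 4, so the strict `<` also drops proteins missing in exactly 80% of samples; in the real data the cutoff (302.4) is not an integer, so `<` and `<=` give the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: 3 proteins x 5 samples
toy_df = pd.DataFrame(
    {
        "Gene_Symbol": ["A", "B", "C"],
        "s1_TenPx01": [1.0, np.nan, np.nan],
        "s2_TenPx01": [0.5, np.nan, np.nan],
        "s3_TenPx02": [np.nan, np.nan, 0.2],
        "s4_TenPx02": [0.1, np.nan, np.nan],
        "s5_TenPx03": [0.3, 0.4, np.nan],
    }
)

sample_cols = [c for c in toy_df.columns if "_TenPx" in c]
max_missing = 0.8 * len(sample_cols)  # 80% of 5 samples

# Missing values per protein (row), counted over sample columns only
nan_per_protein = toy_df[sample_cols].isna().sum(axis=1)

# Strict "<" mirrors the notebook cell
kept_df = toy_df.loc[nan_per_protein < max_missing]
print(kept_df["Gene_Symbol"].tolist())
```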

In [52]:
data_df.to_csv(os.path.join(data_path, 'interim', 
                            'TableS2_PepNumbColsRemoved_80pSamplesMissingProtsRemoved.csv'), 
               index=False)

Missing values were imputed in Perseus v2.1.3.0 using the "Replace missing values from normal distribution" method (https://cox-labs.github.io/coxdocs/replacemissingfromgaussian.html), which fills NaNs with values sampled from the protein's distribution after shifting it down by 1.8 standard deviations and narrowing it to 0.3 times its standard deviation.
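
For readers without Perseus, the imputation described above can be approximated in NumPy. This is a rough sketch of the idea, not Perseus's implementation: the function name, the seed, the per-row (per-protein) axis, and the use of the sample standard deviation are all assumptions made here, and whether the axis matches what Perseus did depends on how the table was oriented there.

```python
import numpy as np


def impute_downshifted_normal(matrix, width=0.3, down_shift=1.8, seed=0):
    """Fill NaNs in each row with draws from N(mean - down_shift*sd, (width*sd)^2).

    Hypothetical helper: mean and sd are computed from the observed values
    of the same row. Rows with fewer than two observed values are left as-is.
    """
    rng = np.random.default_rng(seed)
    filled = np.array(matrix, dtype=float)  # copy, so the input is not modified
    for row in filled:  # each row is a view into `filled`
        missing = np.isnan(row)
        observed = row[~missing]
        if not missing.any() or observed.size < 2:
            continue
        mean = observed.mean()
        sd = observed.std(ddof=1)  # assumption: sample standard deviation
        row[missing] = rng.normal(
            loc=mean - down_shift * sd,
            scale=width * sd,
            size=missing.sum(),
        )
    return filled


# Toy data: 2 proteins x 4 samples, one missing value each
toy = np.array(
    [
        [10.0, 11.0, 12.0, np.nan],
        [5.0, np.nan, 7.0, 6.0],
    ]
)
imputed = impute_downshifted_normal(toy)
print(imputed)
```

Because the replacement distribution is centered 1.8 standard deviations below the observed mean and is much narrower, imputed values land in the low-abundance tail, which reflects the assumption that values are missing because the protein was near the detection limit.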

Summary:

Normalized protein expression data from Nusinow et al. were used. Proteins with missing values in more than 80% of samples (n=1,786 out of 12,755) were removed. The remaining missing values were then imputed in Perseus (version 2.1.3.0) by random sampling from the protein’s distribution after shifting it downward by 1.8 standard deviations and shrinking its standard deviation to 0.3 times the original.