In [7]:
suppressWarnings(require('pkgmaker', quietly = TRUE))  # provides irequire()
require('plyr', quietly = TRUE)                        # provides l_ply()
l_ply(c('dplyr',
        'data.table',
#         'jsonlite', 
#         'rjson',
#         'httr',
#         'DEqMS', 
#         'pcaMethods',
#         'RMariaDB',
#         'parmigene',
#         'matrixTests',
#         'plotly',
        'tidyr',  
        'reshape2',
        'factoextra',
        'kableExtra',
        'parallel',
        'doParallel',
        'scales',
        'StatMatch',
        'lattice',
        'utils',
        'missMDA',
        'missForest',
        'pacman',
        'hablar',
        'tibble'
       ), function(pkg) {
          # load each package quietly, installing it first if it is missing
          invisible(capture.output(irequire(pkg, quiet = TRUE, autoinstall = TRUE)))
        })

## Protein_intensity table

From /general_analysis/DataBaseUpload/NGS/Proteomics/bin/PhosphoProteomics_PreProccess.html:

Outliers: very low and very high values (beyond the 0.0001% and 99.9999% cutoffs) were clamped to those cutoffs. Such values are believed to be errors and are unlikely to be real measurements.

Possible reason: the original data contain Inf, which R may have turned into extreme finite values during calculation.
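
The referenced preprocessing script is not shown here, so the following is only a minimal sketch of what such clamping could look like, assuming the cutoffs are quantiles of the intensity values and that non-finite entries are set to missing first. The vector `x` and its values are made up for illustration.

```r
set.seed(1)

# Toy log-intensities plus one non-finite value and one extreme value (made up).
x <- c(rnorm(1000, mean = 16, sd = 2), Inf, -1e6)

# Treat non-finite entries as missing before computing the cutoffs.
x[!is.finite(x)] <- NA

# Cutoffs at the 0.0001% and 99.9999% quantiles (expressed as fractions).
bounds <- quantile(x, probs = c(0.000001, 0.999999), na.rm = TRUE)

# Clamp everything outside the cutoffs to the nearest cutoff; NA stays NA.
x_clamped <- pmin(pmax(x, bounds[[1]]), bounds[[2]])

range(x_clamped, na.rm = TRUE)
```

With cutoffs this extreme, clamping only has a noticeable effect on large tables; on a small vector like this one the cutoffs sit very close to the observed minimum and maximum.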

In [5]:
# from /general_analysis/DataBaseUpload/NGS/Proteomics/bin/PhosphoProteomics_PreProccess.html
# `Counts` are just 2 to the power of Intensity. This is for DE analysis where counts are needed
# note: without filtering, the dimension is 1330824 rows
intensity <- read.csv("../proteomics/data/Protein_intensity.csv") 
# %>% filter(Intensity > -3.0) # remove values that are too small - think if this is necessary

In [12]:
head(intensity, 2)
print(dim(intensity))

id,Protein,ProteinGroup,ProteinGroupName,ProteinGroupId,Organism,Sample,Intensity,counts,hgnc_symbol
45243,sp|A0A087WV62|TVB16_HUMAN,sp|A0A087WV62|TVB16_HUMAN,TVB16_HUMAN,A0A087WV62,,CTG-0158,15.503,,
45244,sp|A0A087WV62|TVB16_HUMAN,sp|A0A087WV62|TVB16_HUMAN,TVB16_HUMAN,A0A087WV62,,CTG-0159,14.932,,


[1] 1330824      10


In [2]:
# intensity[intensity$Sample=='CTG-0166',]

In [3]:
# # check a case where there are repeats
# # The intensity was not calculated differently between GAL3A and GAL3B
# x <- intensity[intensity$Protein=='sp|A0A0B4J2D5|GAL3B_HUMAN',]
# x[x$Sample=='CTG-0158',]

In [8]:
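int_mtx <- intensity %>% 
    select(c('ProteinGroup', 'Sample', 'Intensity'))  %>% 
    # one row per protein group, one column per sample; repeated measurements are averaged
    pivot_wider(names_from = Sample, values_from = Intensity, values_fn = mean) %>% 
#     pivot_wider(names_from = Sample, values_from = Intensity) %>% 
    column_to_rownames('ProteinGroup') %>% 
    # keep only protein groups that have an intensity in every sample
    na.omit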
head(int_mtx, 2)
print(dim(int_mtx))

Unnamed: 0,CTG-0158,CTG-0159,CTG-0160,CTG-0162,CTG-0163,CTG-0166,CTG-0167,CTG-0178,CTG-0184,CTG-0464,...,CTG-3794,CTG-3795,CTG-3796,CTG-3797,CTG-3799,CTG-3800,CTG-3801,CTG-3802,CTG-3803,CTG-3805
sp|A0AVT1|UBA6_HUMAN,16.977,16.611,16.158,15.962,15.811,17.265,16.207,15.563,16.017,15.73,...,17.734,17.635,18.581,17.966,17.589,17.744,18.102,17.237,17.769,18.153
sp|O00170|AIP_HUMAN,17.204,16.658,16.086,17.218,15.891,16.885,16.563,15.533,15.924,16.724,...,19.256,18.914,18.943,19.286,18.819,18.973,19.104,19.178,17.665,19.075


[1] 295 317
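
To make the reshaping step easier to follow, here is a small sketch on a made-up table (the names `toy`, `toy_wide`, and `toy_complete` are hypothetical). It shows how `values_fn = mean` collapses repeated measurements of the same protein group in the same sample, and how `na.omit` then drops any protein group that is missing in at least one sample.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy long-format table: P1 is measured twice in sample S1 (a repeat),
# and P2 has no measurement in S2.
toy <- data.frame(
  ProteinGroup = c("P1", "P1", "P1", "P2"),
  Sample       = c("S1", "S1", "S2", "S1"),
  Intensity    = c(10, 12, 11, 9)
)

toy_wide <- toy %>%
  pivot_wider(names_from = Sample, values_from = Intensity, values_fn = mean) %>%
  column_to_rownames("ProteinGroup")

# Only rows without any NA survive.
toy_complete <- na.omit(toy_wide)

toy_wide
toy_complete
```

In the toy table the two S1 values for P1 are averaged, and P2 is removed by `na.omit` because it has no S2 value.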


In [9]:
rm(intensity)  # release memory since using int_mtx from here on

The intensity appears to be on a log scale already: it contains negative values, which raw intensities cannot have.

## Standardize the data before imputation

*Standardizing the data makes the errors after imputation easier to interpret, because they are then expressed in units of standard deviation.*

Missing values in proteomic data can generally be classified as missing at random (MAR) or missing not at random (MNAR).
+ MAR missing values mostly result from technical limitations and stochastic fluctuations, and occur independently of abundance.
+ MNAR missing values are abundance-dependent and can be explained by the measurability of the corresponding peptides.

Missing values in proteomic data are a mixture of MAR and MNAR. Although the real proportion is difficult to determine, it is believed that MNAR plays a dominant role in producing missing values.
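
The difference between the two mechanisms can be illustrated with a small simulation. This is only a sketch on made-up data: the matrix `x`, the 10% MAR rate, the logistic missingness curve, and the `detection_limit` value are all assumptions chosen for illustration, not properties of this dataset.

```r
set.seed(1)

# Toy matrix of log-intensities: 200 proteins x 6 samples (values are made up).
x <- matrix(rnorm(200 * 6, mean = 16, sd = 2), nrow = 200)

# MAR: every entry has the same 10% chance of being dropped,
# regardless of its abundance.
x_mar <- x
x_mar[runif(length(x)) < 0.10] <- NA

# MNAR: the chance of being dropped rises as abundance falls
# (a logistic curve centred on a hypothetical detection limit).
detection_limit <- 13
p_missing <- plogis(-(x - detection_limit))
x_mnar <- x
x_mnar[runif(length(x)) < p_missing] <- NA

# Mean true abundance of the entries that went missing under each mechanism.
mean_lost_mar  <- mean(x[is.na(x_mar)])
mean_lost_mnar <- mean(x[is.na(x_mnar)])

c(MAR = mean_lost_mar, MNAR = mean_lost_mnar)
```

Under MAR the lost values have roughly the same mean as the full data, while under MNAR the lost values are concentrated at low abundance.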

In [10]:
int_z <- int_mtx %>%
    sweep(2, apply(int_mtx, 2, mean), '-')  %>% # column wise sweeping
    sweep(2, apply(int_mtx, 2, sd), '/')
head(int_z, 2)

Unnamed: 0,CTG-0158,CTG-0159,CTG-0160,CTG-0162,CTG-0163,CTG-0166,CTG-0167,CTG-0178,CTG-0184,CTG-0464,...,CTG-3794,CTG-3795,CTG-3796,CTG-3797,CTG-3799,CTG-3800,CTG-3801,CTG-3802,CTG-3803,CTG-3805
sp|A0AVT1|UBA6_HUMAN,-0.9964477,-1.182967,-1.199136,-1.5914926,-1.485159,-0.9319049,-1.173063,-1.410807,-1.534753,-1.667383,...,-0.9363064,-0.8872636,-0.5128011,-0.8438012,-0.8842406,-0.8417557,-0.6071568,-1.1700559,-0.7766287,-0.664963
sp|O00170|AIP_HUMAN,-0.8824764,-1.158639,-1.233628,-0.9642764,-1.445858,-1.1266279,-1.001664,-1.425311,-1.583753,-1.160003,...,-0.2431012,-0.3241733,-0.3522243,-0.2294483,-0.3559474,-0.2924846,-0.1695374,-0.2883464,-0.8231882,-0.2579414
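
As a sanity check on the standardization step, the sketch below applies the same two-step `sweep` to a made-up data frame (`toy` is hypothetical and stands in for `int_mtx`) and compares the result with base R's `scale()`, which centres and scales each column in one call.

```r
set.seed(1)

# Toy data frame standing in for int_mtx: 5 proteins x 3 samples (made up).
toy <- as.data.frame(matrix(
  rnorm(15, mean = 16, sd = 2), nrow = 5,
  dimnames = list(paste0("P", 1:5), paste0("S", 1:3))
))

# Same two-step, column-wise standardization as used for int_z.
toy_z <- sweep(toy, 2, apply(toy, 2, mean), "-")
toy_z <- sweep(toy_z, 2, apply(toy, 2, sd), "/")

# scale() returns a matrix with the centred and scaled columns.
toy_scaled <- scale(toy)

# Largest absolute difference between the two results.
max_diff <- max(abs(as.matrix(toy_z) - toy_scaled))
max_diff
```

Each column (sample) ends up with mean 0 and standard deviation 1, so the z-scores describe a protein's position within its own sample rather than across samples.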
