# DIRAC Analyses of LC M001 Proteomics and Related Transcriptomics — Statistical Tests for Comparisons

***by Kengo Watanabe***  

In the main Python notebook, the results of differential rank conservation (DIRAC; Eddy, J.A. et al. PLoS Comput. Biol. 2010) analyses are compared between the Longevity Consortium (LC) M001 proteomics dataset and the related transcriptomics dataset (Tyshkovskiy, A. et al. Cell Metab. 2019).  
**–> To maintain consistency with the other DIRAC analyses, the statistical tests are performed in this sub-notebook with the R kernel.**  

Input:  
* Combined module metadata: 220525_LC-M001-DIRAC-prot-vs-txn_Comparison-GOBP_ver2-2_module-metadata.tsv  
* Combined sample–mouse metadata: 220525_LC-M001-DIRAC-prot-vs-txn_Comparison-GOBP_ver2-2_sample-metadata.tsv  
* Cleaned table of DIRAC RMSs (proteomics): 220525_LC-M001-DIRAC-prot-vs-txn_Comparison-GOBP_ver2-2_proteomics-RMS.tsv  
* Cleaned table of DIRAC RCIs (proteomics): 220525_LC-M001-DIRAC-prot-vs-txn_Comparison-GOBP_ver2-2_proteomics-RCI.tsv  
* Cleaned table of DIRAC RMSs (transcriptomics): 220525_LC-M001-DIRAC-prot-vs-txn_Comparison-GOBP_ver2-2_transcriptomics-RMS.tsv  
* Cleaned table of DIRAC RCIs (transcriptomics): 220525_LC-M001-DIRAC-prot-vs-txn_Comparison-GOBP_ver2-2_transcriptomics-RCI.tsv  

Output:  
* Supplementary Data 6  

Original notebook (memo for my future tracing):  
* dalek:[JupyterLab HOME]/220523_LC-M001-DIRAC-prot-vs-txn/220525_LC-M001-DIRAC-prot-vs-txn_StatisticalTest-GOBP_ver2-2.ipynb  

In [None]:
library("tidyverse")
options(repr.plot.width=5, repr.plot.height=5)#Default=7x7

#CRAN
for (package in c("multcomp", "openxlsx")) {
    #install.packages(package)
    eval(bquote(library(.(package))))
    print(str_c(package, ": ", as.character(packageVersion(package))))
}

## 1. Prepare metadata and DIRAC results

In [None]:
#Import module metadata
fileDir <- "./ExportData/"
ipynbName <- "220525_LC-M001-DIRAC-prot-vs-txn_Comparison-GOBP_ver2-2_"
fileName <- "module-metadata.tsv"
temp <- read_delim(str_c(fileDir,ipynbName,fileName), delim="\t")
print(str_c("nrow: ",nrow(temp)))
head(temp)

module_meta <- temp

In [None]:
#Import sample-mouse metadata
fileDir <- "./ExportData/"
ipynbName <- "220525_LC-M001-DIRAC-prot-vs-txn_Comparison-GOBP_ver2-2_"
fileName <- "sample-metadata.tsv"
temp <- read_delim(str_c(fileDir,ipynbName,fileName), delim="\t")
print(str_c("nrow: ",nrow(temp)))
head(temp)

sample_meta <- temp

In [None]:
#Import DIRAC results (proteomics)
fileDir <- "./ExportData/"
ipynbName <- "220525_LC-M001-DIRAC-prot-vs-txn_Comparison-GOBP_ver2-2_"
fileName <- "proteomics-RMS.tsv"
temp1 <- read_delim(str_c(fileDir,ipynbName,fileName), delim="\t")
print(str_c("nrow: ",nrow(temp1)))
head(temp1)

fileName <- "proteomics-RCI.tsv"
temp2 <- read_delim(str_c(fileDir,ipynbName,fileName), delim="\t")
print(str_c("nrow: ",nrow(temp2)))
head(temp2)

rms_prot <- temp1
rci_prot <- temp2

In [None]:
#Import DIRAC results (transcriptomics)
fileDir <- "./ExportData/"
ipynbName <- "220525_LC-M001-DIRAC-prot-vs-txn_Comparison-GOBP_ver2-2_"
fileName <- "transcriptomics-RMS.tsv"
temp1 <- read_delim(str_c(fileDir,ipynbName,fileName), delim="\t")
print(str_c("nrow: ",nrow(temp1)))
head(temp1)

fileName <- "transcriptomics-RCI.tsv"
temp2 <- read_delim(str_c(fileDir,ipynbName,fileName), delim="\t")
print(str_c("nrow: ",nrow(temp2)))
head(temp2)

rms_txn <- temp1
rci_txn <- temp2

## 2. Rank conservation index: general pattern

> This is NOT used for the manuscript; hence, the statistical test is skipped in this version.  

## 3. Rank conservation index: inter-group module comparison

> Test the specific hypothesis: control RCI == intervention RCI (i.e., inter-group module comparison).  
> 1. Test the main effect of intervention on RMSs for each module using an ANOVA model  
> 2. Then, perform post-hoc comparisons of RMSs between control and each intervention using repeated Student's t-tests  
>  
> The statistical strategy is basically the same as the one used in each single-dataset analysis. Because RMS/RCI is not normalized (i.e., the expected mean and variance can differ between datasets due to the different numbers of mapped analytes), the dataset term and its interaction term are NOT included in the ANOVA model; instead, an ANOVA model is fitted per dataset. The p-value adjustment is performed in a conservative manner: the p-values from the ANOVA tests are adjusted across all models (= modules x datasets) with the Benjamini-Hochberg method, and those from the post-hoc tests are adjusted with the Holm-Bonferroni method across datasets only within each module (not across modules). Student's t-test (i.e., t-test with pooled variance), not Dunnett's test, is used as the post-hoc test because Dunnett's test p-values are already adjusted for the family-wise error rate (FWER), and further adjusting them with the Holm-Bonferroni method would be an excessive (incorrect) adjustment.  
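
The two-step strategy above can be sketched on simulated data. This is only a minimal illustration, not the analysis itself: the module IDs (`M1`, `M2`), the sample size of 5 per group, and the uniform random RMS values are all assumptions made for the example, and only one dataset is simulated (in the actual analysis the Holm-Bonferroni family within a module also spans both datasets).

```r
# Toy data: 2 hypothetical modules x 3 groups x 5 samples (all values simulated)
set.seed(1)
toy <- expand.grid(Rep = 1:5,
                   Intervention = c("Cont", "Acar", "Rapa"),
                   ModuleID = c("M1", "M2"),
                   stringsAsFactors = FALSE)
toy$Intervention <- factor(toy$Intervention, levels = c("Cont", "Acar", "Rapa"))
toy$RMS <- runif(nrow(toy), min = 0.5, max = 1)

# Step 1: one ANOVA per module, then Benjamini-Hochberg adjustment across all models
anova_p <- sapply(split(toy, toy$ModuleID), function(tbl) {
    summary(aov(RMS ~ Intervention, data = tbl))[[1]][["Pr(>F)"]][1]
})
anova_adj <- p.adjust(anova_p, method = "BH")

# Step 2: pooled-variance t-tests vs. control, Holm adjustment within each module
posthoc <- do.call(rbind, lapply(split(toy, toy$ModuleID), function(tbl) {
    base <- tbl$RMS[tbl$Intervention == "Cont"]
    p <- sapply(c("Acar", "Rapa"), function(g) {
        t.test(tbl$RMS[tbl$Intervention == g], base, var.equal = TRUE)$p.value
    })
    data.frame(ModuleID = tbl$ModuleID[1],
               Comparison = paste0(names(p), "-vs-Cont"),
               Pval = unname(p),
               AdjPval = unname(p.adjust(p, method = "holm")))
}))

print(anova_adj)
print(posthoc)
```

Because `p.adjust()` is called once over all ANOVA p-values but once per module for the post-hoc p-values, the size of each multiple-testing family is explicit in the code.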

### 3-1. Extract RMS under the own phenotype consensus

In [None]:
#Extract RMS whose template phenotype corresponds to the own phenotype
temp <- tibble(ModuleID=module_meta$ModuleID)
##Proteomics
temp1 <- rms_prot
phenotype_vec <- rci_prot %>%
    dplyr::select(-ModuleID, -Template) %>%
    names()
for (k in phenotype_vec) {
    temp2 <- sample_meta %>%
        dplyr::filter(Group==!!k) %>%
        .$SampleID
    temp <- temp1 %>%
        dplyr::filter(Template==!!k) %>%
        dplyr::select(ModuleID, all_of(temp2)) %>%
        dplyr::left_join(temp, ., by="ModuleID")
}
##Transcriptomics
temp1 <- rms_txn
phenotype_vec <- rci_txn %>%
    dplyr::select(-ModuleID, -Template) %>%
    names()
for (k in phenotype_vec) {
    temp2 <- sample_meta %>%
        dplyr::filter(Group==!!k) %>%
        .$SampleID
    temp <- temp1 %>%
        dplyr::filter(Template==!!k) %>%
        dplyr::select(ModuleID, all_of(temp2)) %>%
        dplyr::left_join(temp, ., by="ModuleID")
}

print(str_c("nrow: ",nrow(temp)))
head(temp)
summary(temp[, 1:10])

rms_kk <- temp

### 3-2. ANOVA test (RMS ~ Intervention), followed by repeated Student's t-tests (Intervention)

#### 3-2-1. Simultaneously perform all ANOVA tests

In [None]:
#Prepare DF
group_vec <- c("Cont", "Acar", "Rapa")
temp <- rms_kk %>%
    tidyr::gather(key=SampleID, value=RMS, -ModuleID) %>%
    dplyr::left_join(., sample_meta, by="SampleID") %>%
    dplyr::mutate(Intervention=str_replace(Intervention, "Control", "Cont")) %>%
    dplyr::mutate(Intervention=str_replace(Intervention, "Acarbose", "Acar")) %>%
    dplyr::mutate(Intervention=str_replace(Intervention, "Rapamycin", "Rapa")) %>%
    dplyr::mutate(Intervention=factor(Intervention, levels=group_vec),
                  Sex=factor(Sex, levels=c("F", "M")),
                  Age=factor(Age, levels=c("6m", "12m")))
print(nrow(temp))
head(temp)

#Simultaneously perform all tests using tidyr::nest()
temp <- temp %>%
    dplyr::group_by(ModuleID, Dataset) %>%
    tidyr::nest() %>%#New column name becomes "data"
    dplyr::mutate(ANOVA=lapply(data, function(tbl) {aov(RMS~Intervention, data=tbl)})) %>%
    dplyr::ungroup()
print(nrow(temp))
print(head(temp))#print() because Jupyter Lab tries to display list contents

model1 <- temp

In [None]:
#Check result object
summary(model1$ANOVA[[1]])

#### 3-2-2. Simultaneously perform all Student's t-tests

In [None]:
#Prepare DF
temp1 <- rms_kk %>%
    tidyr::gather(key=SampleID, value=RMS, -ModuleID) %>%
    dplyr::left_join(., sample_meta, by="SampleID") %>%
    dplyr::mutate(Intervention=str_replace(Intervention, "Control", "Cont")) %>%
    dplyr::mutate(Intervention=str_replace(Intervention, "Acarbose", "Acar")) %>%
    dplyr::mutate(Intervention=str_replace(Intervention, "Rapamycin", "Rapa"))
comparison_vec <- c("Acar-vs-Cont", "Rapa-vs-Cont")
temp <- tibble()
for (comparison in comparison_vec) {
    contrast <- str_split(comparison, "-vs-", simplify=TRUE)[1]
    baseline <- str_split(comparison, "-vs-", simplify=TRUE)[2]
    temp2 <- temp1 %>%
        dplyr::filter(Intervention==!!contrast) %>%
        dplyr::select(ModuleID, Dataset, SampleID, RMS) %>%
        dplyr::rename(SampleID_contrast=SampleID,
                      RMS_contrast=RMS) %>%
        dplyr::group_by(ModuleID, Dataset) %>%
        dplyr::mutate(Sample_i=1:n()) %>%#Just for handling; no correspondence b/w baseline and contrast
        dplyr::ungroup()
    temp3 <- temp1 %>%
        dplyr::filter(Intervention==!!baseline) %>%
        dplyr::select(ModuleID, Dataset, SampleID, RMS) %>%
        dplyr::rename(SampleID_baseline=SampleID,
                      RMS_baseline=RMS) %>%
        dplyr::group_by(ModuleID, Dataset) %>%
        dplyr::mutate(Sample_i=1:n()) %>%#Just for handling; no correspondence b/w baseline and contrast
        dplyr::ungroup()
    temp <- dplyr::full_join(temp2, temp3, by=c("ModuleID", "Dataset", "Sample_i")) %>%
        dplyr::mutate(Comparison=!!comparison) %>%
        dplyr::select(ModuleID, Dataset, Comparison, Sample_i,
                      SampleID_contrast, RMS_contrast, SampleID_baseline, RMS_baseline) %>%
        dplyr::bind_rows(temp, .)
}
print(nrow(temp))
head(temp)

#Check for NAs introduced by full_join when the sample size differs b/w baseline and contrast (t.test() drops NAs)
temp1 <- temp %>%
    dplyr::filter(is.na(RMS_contrast)|is.na(RMS_baseline)) %>%
    dplyr::group_by(ModuleID, Dataset, Comparison) %>%
    dplyr::summarize(Count=n()) %>%
    dplyr::ungroup()
print(str_c('Test with different sample size: ',nrow(temp1)))

#Simultaneously perform all tests using tidyr::nest()
temp <- temp %>%
    dplyr::group_by(ModuleID, Dataset, Comparison) %>%
    tidyr::nest() %>%#New column name becomes "data"
    dplyr::mutate(Student=lapply(data, function(tbl) {
        t.test(tbl$RMS_contrast, tbl$RMS_baseline, alternative="two.sided", mu=0,
               paired=FALSE, var.equal=TRUE, conf.level=0.95)})) %>%
    dplyr::ungroup()
print(nrow(temp))
print(head(temp))#print() because Jupyter Lab tries to display list contents

model2 <- temp

In [None]:
#Check result object
temp <- model2$Student[[1]]
summary(temp)
for (name in rownames(summary(temp))) {
    print(name)
    print(temp[[name]])
    print("")
}

#### 3-2-3. Summarize all result objects into a table

In [None]:
#Prepare variable labels
variable_vec <- rownames(summary(model1$ANOVA[[1]])[[1]]) %>%
    str_replace(., " *$", "")#Remove white spaces
variable_vec <- variable_vec[1:(length(variable_vec)-1)]#Remove Residuals

#Prepare summary table of ANOVA tests
temp1 <- model1 %>%
    dplyr::select(ModuleID, Dataset, ANOVA)
for (i in 1:length(variable_vec)) {
    label <- variable_vec[i]
    temp1 <- temp1 %>%
        dplyr::mutate("{label}_DF":=sapply(ANOVA, function(aov) {summary(aov)[[1]]$Df[i]}),
                      "{label}_SumSq":=sapply(ANOVA, function(aov) {summary(aov)[[1]]$`Sum Sq`[i]}),
                      "{label}_MeanSq":=sapply(ANOVA, function(aov) {summary(aov)[[1]]$`Mean Sq`[i]}),
                      "{label}_Fstat":=sapply(ANOVA, function(aov) {summary(aov)[[1]]$`F value`[i]}),
                      "{label}_Pval":=sapply(ANOVA, function(aov) {summary(aov)[[1]]$`Pr(>F)`[i]})) %>%
        #P-value adjustment with the Benjamini-Hochberg method
        ##Using !!as.name() in the following line because the column name is built from a string (simple {{}} and !! were not recognized)
        dplyr::mutate("{label}_AdjPval":=p.adjust(!!as.name(str_c(label,"_Pval")), method="BH"))
}
temp1 <- temp1 %>%
    dplyr::select(-ANOVA)
print(str_c("nrow: ",nrow(temp1)))
head(temp1)

#Prepare summary table of t-tests
temp2 <- model2 %>%
    dplyr::select(ModuleID, Dataset) %>%
    dplyr::distinct()
for (i in 1:length(comparison_vec)) {
    label <- comparison_vec[i]
    temp2 <- model2 %>%
        dplyr::filter(Comparison==!!label) %>%
        dplyr::mutate("{label}_DF":=sapply(Student, function(htest) {unname(htest[["parameter"]])}),
                      "{label}_Coef":=sapply(Student, function(htest) {
                          unname(htest[["estimate"]][1]-htest[["estimate"]][2])}),
                      "{label}_CoefSE":=sapply(Student, function(htest) {unname(htest[["stderr"]])}),
                      "{label}_tStat":=sapply(Student, function(htest) {unname(htest[["statistic"]])}),
                      "{label}_Pval":=sapply(Student, function(htest) {unname(htest[["p.value"]])}),
                      "{label}_AdjPval":=1.0) %>%#Insert dummy column for now
        dplyr::select(-Comparison, -data, -Student) %>%
        dplyr::left_join(temp2, ., by=c("ModuleID", "Dataset"))
}
##P-value adjustment with the Holm-Bonferroni method across datasets and comparisons within each module
temp3 <- temp2 %>%
    dplyr::select(ModuleID, Dataset, all_of(str_c(comparison_vec,"_Pval"))) %>%
    tidyr::gather(key=ColName, value=Pval, -ModuleID, -Dataset) %>%
    dplyr::group_by(ModuleID) %>%
    dplyr::mutate(AdjPval=p.adjust(Pval, method="holm")) %>%
    dplyr::ungroup() %>%
    dplyr::select(-Pval) %>%
    dplyr::mutate(ColName=str_replace(ColName, "_Pval", "_AdjPval_temp")) %>%
    tidyr::spread(key=ColName, value=AdjPval)
##Replace the dummy values with the adjusted p-values
temp2 <- dplyr::left_join(temp2, temp3, by=c("ModuleID", "Dataset"))
for (i in 1:length(comparison_vec)) {
    label <- comparison_vec[i]
    temp2 <- temp2 %>%
        dplyr::mutate("{label}_AdjPval":=!!as.name(str_c(label,"_AdjPval_temp"))) %>%
        dplyr::select(-!!as.name(str_c(label,"_AdjPval_temp")))
}
print(str_c("nrow: ",nrow(temp2)))
head(temp2)

#Merge
temp <- dplyr::left_join(temp1, temp2, by=c("ModuleID", "Dataset"))
print(str_c("nrow: ",nrow(temp)))
head(temp)

summary_tbl <- temp

In [None]:
#Add general statistics
##Calculate general statistics
sem <- function(x) {sd(x)/sqrt(length(x))}
temp <- rms_kk %>%
    tidyr::gather(key=SampleID, value=RMS, -ModuleID) %>%
    dplyr::left_join(., sample_meta, by="SampleID") %>%
    dplyr::mutate(Intervention=str_replace(Intervention, "Control", "Cont")) %>%
    dplyr::mutate(Intervention=str_replace(Intervention, "Acarbose", "Acar")) %>%
    dplyr::mutate(Intervention=str_replace(Intervention, "Rapamycin", "Rapa")) %>%
    dplyr::mutate(Intervention=factor(Intervention, levels=group_vec),
                  Sex=factor(Sex, levels=c("F", "M")),
                  Age=factor(Age, levels=c("6m", "12m"))) %>%
    dplyr::group_by(ModuleID, Dataset, Intervention) %>%
    dplyr::summarize(Count=n(), RMSmean=mean(RMS), RMSsem=sem(RMS)) %>%
    dplyr::ungroup()
temp1 <- module_meta %>%
    dplyr::select(ModuleID, ModuleName)
temp1 <- dplyr::bind_rows(temp1, temp1) %>%
    dplyr::mutate(Dataset=rep(c("Proteomics", "Transcriptomics"), each=nrow(module_meta)))
for (group in group_vec) {
    temp1 <- temp %>%
        dplyr::filter(Intervention==!!group) %>%
        dplyr::select(-Intervention) %>%
        dplyr::rename("{group}_N":=Count,
                      "{group}_RMSmean":=RMSmean,
                      "{group}_RMSsem":=RMSsem) %>%
        dplyr::left_join(temp1, ., by=c("ModuleID", "Dataset"))
}
print(str_c("nrow: ",nrow(temp1)))
head(temp1)
##Merge
temp <- dplyr::left_join(temp1, summary_tbl, by=c("ModuleID", "Dataset")) %>%
    dplyr::arrange(Intervention_Pval)
print(str_c("nrow: ",nrow(temp)))
head(temp)

summary_tbl <- temp

> (Note that the beta-coefficient estimate is equivalent to the difference in mean RMS between the two groups; e.g., Acar-vs-Cont_Coef = Acar_RMSmean - Cont_RMSmean.)  
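
The equivalence noted above can be checked on simulated values. The sketch below is illustrative only: the two groups of six normally distributed values are assumptions, not data from this analysis. It compares the difference of the two `t.test()` estimates with the group coefficient of a linear model fitted with control as the reference level.

```r
# Simulated RMS values for two hypothetical groups (not real data)
set.seed(2)
cont <- rnorm(6, mean = 0.80, sd = 0.05)
acar <- rnorm(6, mean = 0.85, sd = 0.05)

# Student's t-test (pooled variance); contrast first, baseline second
tt <- t.test(acar, cont, var.equal = TRUE)
coef_ttest <- unname(tt$estimate[1] - tt$estimate[2])

# Linear model with control as the reference level
df <- data.frame(RMS = c(cont, acar),
                 Group = factor(rep(c("Cont", "Acar"), each = 6),
                                levels = c("Cont", "Acar")))
fit <- lm(RMS ~ Group, data = df)
coef_lm <- unname(coef(fit)["GroupAcar"])
pval_lm <- summary(fit)$coefficients["GroupAcar", "Pr(>|t|)"]

# Compare the two approaches (numerical equality up to floating-point tolerance)
print(all.equal(coef_ttest, mean(acar) - mean(cont)))
print(all.equal(coef_ttest, coef_lm))
print(all.equal(tt$p.value, pval_lm))
```

The order of the arguments to `t.test()` matters for the sign: passing the contrast group first and the baseline second matches the `Acar-vs-Cont` naming used in the summary table.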

In [None]:
#Create a workbook object to save as one single .xlsx file
workbook <- createWorkbook()

#Prepare module metadata sheet
sheetName <- "ModuleMetadata"
addWorksheet(workbook, sheetName=sheetName)
writeData(workbook, sheetName, module_meta)

#Save the summary table as a new sheet per dataset
for (dataset in c("Proteomics", "Transcriptomics")) {
    temp <- summary_tbl %>%
        dplyr::filter(Dataset==!!dataset) %>%
        dplyr::select(-Dataset)
    print(str_c("nrow: ",nrow(temp)))
    print(head(temp))
    
    sheetName <- str_c("RCI-",dataset)
    addWorksheet(workbook, sheetName=sheetName)
    writeData(workbook, sheetName, temp)
}

#Save the workbook as one single .xlsx file
fileDir <- "./ExportData/"
ipynbName <- "220525_LC-M001-DIRAC-prot-vs-txn_StatisticalTest-GOBP_ver2-2_"
fileName <- "inter-group-comparison.xlsx"
saveWorkbook(workbook, file=str_c(fileDir,ipynbName,fileName), overwrite=TRUE)

# — Go back to the main Python notebook —  

# — Session information —

In [None]:
sessionInfo()