# Enrichr Results Compilation

Importing necessary packages, loading library segments, and setting up data for the benchmark dataset.

In [None]:
### Sourcing tools from GEO ###
    
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
    
BiocManager::install("GenomicRanges")
BiocManager::install("rtracklayer")
BiocManager::install("usethis")
library(GenomicRanges)
library(rtracklayer)
library(usethis)
    
### Installing package dependencies ###
    
BiocManager::install("devtools")
BiocManager::install("roxygen2")
library(devtools)
library(roxygen2)

In [4]:
ChIPSeqDataMaster <- read.table(file = "/Users/mei/Desktop/benchmarks/genomic_range/data/GSA_ChIP_Seq_Master_Table.txt", sep = '\t', header = TRUE, quote = "")

In [5]:
## The following code creates a list of samples for which we need to extract the BED files for.
    
ChIPSeqSamples<-as.character(ChIPSeqDataMaster$GSM)
    
## Initializing list for storing BED files and the consecutive GRanges objects.
Samples_in_BED = list()
    
for(i in 1:length(ChIPSeqSamples))
    {
      Samples_in_BED[[i]] <- read.table(paste0("/Users/mei/LabWork/GSAChIPSeqBenchmarkDataBase/regen/",paste0(eval(parse(text="ChIPSeqSamples[i]")),".bed")), sep = "\t", header = FALSE)
      Samples_in_BED[[i]] <- Samples_in_BED[[i]][,1:3]
      colnames(Samples_in_BED[[i]]) <- c("chrom", "start", "end")
      Samples_in_BED[[i]] <- Samples_in_BED[[i]][order(Samples_in_BED[[i]]$chrom),]
      Samples_in_BED[[i]] <- GRanges(Samples_in_BED[[i]]$chrom, IRanges(Samples_in_BED[[i]]$`start`, Samples_in_BED[[i]]$`end`))
      genome(Samples_in_BED[[i]]) <- "hg19"
    }
    
## Saving BED files as GRanges objects ##
names(Samples_in_BED) <- ChIPSeqSamples

## GO Results

The results from Enrichr were manually curated for the samples from the benchmark dataset and stored at distinct locations for KEGG and GO enrichments; the folders were named **KEGG_2016** and **GO_BP_2016**, respectively. The code has been sourced from **Enrichr_Results_Compilaton.R** and **Enrichr_Preprocessing.R** functions available at https://github.com/mora-lab/benchmarks/tree/master/genomic_range/R.

In [6]:
tryCatch( {enrichr_go_samples <- read.table("/Users/mei/LabWork/GSAChIPSeqBenchmarkDataBase/Results/enrichr/GO_BP_2018/file_names.txt")}
          ,error = function(e){ print("File not found"); break;}
          ,finally = function (f){next;})

In [7]:
class(enrichr_go_samples)

In [8]:
head(enrichr_go_samples)

V1
<fct>
GSE84618.txt
GSM1847178.txt
GSM2058015.txt
GSM2058016.txt
GSM2058017.txt
GSM2058018.txt


As you can see here, **enrichr_go_samples** is a dataframe with one column holding the samples whose corresponding results are placed in the folder. We want to extract these results. For doing so, we shall begin with pruning the file names by removing extension- *.txt*.

In [9]:
enrichr_go_samples <- as.character(enrichr_go_samples$V1)
for(i in 1:length(enrichr_go_samples)){enrichr_go_samples[i] <- substr(enrichr_go_samples[i],1,nchar(enrichr_go_samples[i])-4)}

In [10]:
head(enrichr_go_samples)

Now that we have the precise sample labels, let us define a list **enrichr_go** that shall hold the enrichr results (dataframe), for each sample. 

In [11]:
enrichr_go <- list()

for (i in 1: length(ChIPSeqSamples))
{
  for(j in 1:length(enrichr_go_samples))
  {
    if(enrichr_go_samples[j] == ChIPSeqSamples[i])
    {
      enrichr_go[[j]] <-read.table(paste0("/Users/mei/LabWork/GSAChIPSeqBenchmarkDataBase/Results/enrichr/GO_BP_2018/",paste0(eval(parse(text='ChIPSeqSamples[i]')),".txt")), sep = '\t', header = TRUE, quote = "", fill = TRUE)
    }
  }
}

In [12]:
class(enrichr_go)

So, eventually **enrichr_go** is a list of dataframes.

In [13]:
head(enrichr_go[[1]])

Term,Overlap,P.value,Adjusted.P.value,Old.P.value,Old.Adjusted.P.value,Z.score,Combined.Score,Genes
<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<fct>
regulation of skeletal muscle contraction (GO:0014819),3/7,0.003747958,0.9441197,0.0052339067,0.7605585,-3.158475,17.64496,CAV3;DMPK;KCNJ2
cell-cell junction maintenance (GO:0045217),4/11,0.0015450556,0.9441197,0.0019520713,0.7605585,-2.577299,16.68207,CAMSAP3;NLGN2;PLEKHA7;CD177
regulation of transforming growth factor beta2 production (GO:0032909),2/7,0.0443520949,0.9441197,0.0427839212,0.7605585,-4.395784,13.69548,SMAD4;CDH3
regulation of histone H3-K9 acetylation (GO:2000615),3/8,0.0057748139,0.9441197,0.0069970387,0.7605585,-2.630837,13.55999,SMAD4;CHEK1;BRCA1
regulation of Golgi organization (GO:1903358),5/15,0.0006100561,0.9441197,0.0007217959,0.7605585,-1.822244,13.48818,CAMSAP3;MAP2K1;STX18;RBSN;MAPK3
chondroitin sulfate catabolic process (GO:0030207),4/14,0.004155643,0.9441197,0.0040004439,0.7605585,-2.408644,13.20729,BCAN;HYAL1;NCAN;ARSB


We observe an element from the list, and see that it outputs quite a bit og information. Following the scope of our analysis, we shall only require the **Term** and **P.value** parameters. So, the remaining columns are filtered out.

In [14]:
enrichr_go_results <- list()
for (i in 1:length(enrichr_go))
{
  enrichr_go_results[[i]] <- enrichr_go[[i]][,c(1,3)]
}

names(enrichr_go_results) <- as.character(enrichr_go_samples)

## KEGG Results

We shall follow the exact same protocol for KEGG results as well.

In [15]:
## Same protocol for ENRICHR KEGG results too. 


enrichr_kegg <- list()

enrichr_kegg_samples <- read.table("/Users/mei/LabWork/GSAChIPSeqBenchmarkDataBase/Results/enrichr/KEGG_2016/file_names.txt")
enrichr_kegg_samples <- as.character(enrichr_kegg_samples$V1)
for(i in 1:length(enrichr_kegg_samples)){enrichr_kegg_samples[i] <- substr(enrichr_kegg_samples[i],1,nchar(enrichr_kegg_samples[i])-4)}



enrichr_kegg <- list()

for (i in 1: length(ChIPSeqSamples))
{
  for(j in 1:length(enrichr_kegg_samples))
  {
    if(enrichr_kegg_samples[j] == ChIPSeqSamples[i])
    {
      enrichr_kegg[[j]] <-read.table(paste0("/Users/mei/LabWork/GSAChIPSeqBenchmarkDataBase/Results/enrichr/KEGG_2016/",paste0(eval(parse(text='ChIPSeqSamples[i]')),".txt")), sep = '\t', header = TRUE, quote = "", fill = TRUE)
    }
  }
}


## Condensed Results

enrichr_kegg_results <- list()
for (i in 1:length(enrichr_kegg))
{
  enrichr_kegg_results[[i]] <- enrichr_kegg[[i]][,c(1,3)]
}

names(enrichr_kegg_results) <- as.character(enrichr_kegg_samples)


Now that we have the individual enrichment results from GO and KEGG ontologies, we shall proceed with combining the results for each sample to engender a consolidated set. Prior to that, let's check if the sample count is consistent for both cohorts.

In [16]:
# for consistency with Chipenrich, Broadenrich, Seq2pathway results' variable nomenclature.
enrichr_go_results_shredded <- enrichr_go_results
enrichr_kegg_results_shredded <- enrichr_kegg_results

count <- 0
for (i in 1:length(names(enrichr_go_results_shredded)))
{
  for(j in 1:length(names(enrichr_kegg_results_shredded)))
  {
    if(names(enrichr_go_results_shredded)[i] == names(enrichr_kegg_results_shredded)[j])
    {
      count <- count+1
    }
  }
}
print (count)

[1] 106


Cool ! That shows that we have results from  both KEGG and GO for all samples; number of samples in our benchmark dataset is 106.

In [18]:
## Refining Enrichr KEGG and GO results | Extracting "hsa*****" and "GO:*****" terms.
#install.packages("stringr", repos = "https://mirrors.tuna.tsinghua.edu.cn/CRAN/")
BiocManager::install("stringr")
library(stringr)
enrichr_go_terms_extracted_results_shredded <- list()
enrichr_go_terms_extracted_results_shredded <- enrichr_go_results_shredded
for (i in 1:length(enrichr_go_results_shredded))
{
  enrichr_go_terms_extracted_results_shredded[[i]]$Term <- str_extract(string = print(eval(parse(text=paste0("enrichr_go_results_shredded$",paste0(eval(parse(text="names(enrichr_go_results_shredded)[i]")),"$Term"))))), pattern = "GO:[0-9]+")
}

enrichr_kegg_terms_extracted_results_shredded <- list()
enrichr_kegg_terms_extracted_results_shredded <- enrichr_kegg_results_shredded
for (i in 1:length(enrichr_kegg_results_shredded))
{
  enrichr_kegg_terms_extracted_results_shredded[[i]]$Term <- str_extract(string = print(eval(parse(text=paste0("enrichr_kegg_results_shredded$",paste0(eval(parse(text="names(enrichr_kegg_results_shredded)[i]")),"$Term"))))), pattern = "hsa[0-9]+")
}


Bioconductor version 3.8 (BiocManager 1.30.4), R 3.5.2 (2018-12-20)
Installing package(s) 'stringr'



The downloaded binary packages are in
	/var/folders/hm/c3_fjypn62v5xh5b5ygv267m0000gn/T//RtmpxWUpqS/downloaded_packages


Update old packages: 'callr', 'checkmate', 'deSolve', 'digest', 'dplyr',
  'ellipsis', 'emmeans', 'gam', 'git2r', 'haven', 'lavaan', 'mclust', 'pillar',
  'plotmo', 'processx', 'quantreg', 'recipes', 'remotes', 'rlang', 'rngtools',
  'robust', 'sf', 'tinytex', 'vdiffr', 'xfun'
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



  [1] Oxytocin signaling pathway_Homo sapiens_hsa04921                                             
  [2] Aldosterone synthesis and secretion_Homo sapiens_hsa04925                                    
  [3] Acute myeloid leukemia_Homo sapiens_hsa05221                                                 
  [4] Oocyte meiosis_Homo sapiens_hsa04114                                                         
  [5] Gastric acid secretion_Homo sapiens_hsa04971                                                 
  [6] Long-term potentiation_Homo sapiens_hsa04720                                                 
  [7] Inflammatory mediator regulation of TRP channels_Homo sapiens_hsa04750                       
  [8] Adrenergic signaling in cardiomyocytes_Homo sapiens_hsa04261                                 
  [9] Vascular smooth muscle contraction_Homo sapiens_hsa04270                                     
 [10] Salivary secretion_Homo sapiens_hsa04970                                                     


## Combining results from both streams.

In [33]:
## Combining Enrichr KEGG and GO results

enrichr_results_shredded <- list()
length(enrichr_results_shredded) <- length (enrichr_go_terms_extracted_results_shredded)
names(enrichr_results_shredded) <- names (enrichr_go_terms_extracted_results_shredded)

for (i in 1:length(enrichr_results_shredded))
{
  if(names(enrichr_results_shredded) == names(enrichr_go_terms_extracted_results_shredded) && names(enrichr_results_shredded) == names(enrichr_kegg_terms_extracted_results_shredded))
  {
    enrichr_results_shredded[[i]] <- rbind(enrichr_go_terms_extracted_results_shredded[[i]], enrichr_kegg_terms_extracted_results_shredded[[i]], stringsAsFactors = FALSE)
  }
}

“invalid factor level, NA generated”

In [34]:
class(enrichr_results_shredded)

In [35]:
class(enrichr_results_shredded[[1]])

In [49]:
class(enrichr_results_shredded[[1]]$Term)

In [50]:
class(enrichr_results_shredded[[1]]$P.value)

There is hence an articulation between the different objects in the results. The **enrichr_results_shredded** is the list holding enrichment results from both KEGG and GO ontologies for each sample. In the downstream analysis, it will be required to compare with the actual terms in the disease pools and so the positioning is important from the perspective of ranks and prioritization calculations. To accomplish that an ordering by ascendance is warranted. 

In [51]:
## Sorting on the basis of "P.value"

for (i in 1:length(enrichr_results_shredded))
{
  enrichr_results_shredded[[i]] <- enrichr_results_shredded[[i]][with(enrichr_results_shredded[[i]], order(enrichr_results_shredded[[i]]$P.value)), ] 
}

ERROR: Error in order(...): argument 1 is not a vector
