# Towards Benchmarking Enrichment Tools for Genomic Regions

Create a file named GSAChIPSeqBenchmarkProfile.txt, a tab-separated master table with the following fields: S.No., Experimental_Method, Organism, Cell_Type_Tissue, Genome_Version, Disease_Target_Pathway, Samples (Per Experiment), Replicates, PMID, Publication_Journal, Publication_Year, GSE, GSM, TF_or_Histone_Mark, Cistrome_ID, KEGG_ID, Metadata.
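As a starting point, an empty template with the right header can be generated in R. This is only a minimal sketch: the column names are copied from the list above, while the variable names and the use of a temporary directory are assumptions made for illustration (in practice the file would live in the project directory).

```r
# Field names copied from the list above.
fields <- c("S.No.", "Experimental_Method", "Organism", "Cell_Type_Tissue",
            "Genome_Version", "Disease_Target_Pathway",
            "Samples (Per Experiment)", "Replicates", "PMID",
            "Publication_Journal", "Publication_Year", "GSE", "GSM",
            "TF_or_Histone_Mark", "Cistrome_ID", "KEGG_ID", "Metadata")

# Zero-row table with one character column per field.
profile <- data.frame(matrix(character(0), nrow = 0, ncol = length(fields)),
                      stringsAsFactors = FALSE)
names(profile) <- fields

# Assumption: written to a temporary directory here; change the path as needed.
out_file <- file.path(tempdir(), "GSAChIPSeqBenchmarkProfile.txt")
write.table(profile, file = out_file, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read the header line back to confirm the layout.
header <- strsplit(readLines(out_file, n = 1), "\t", fixed = TRUE)[[1]]
```

Rows for each selected GSM can then be appended to this table by hand or with `rbind()` before writing it out again.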

All the functions sourced in the code cells below come from the GitHub repository https://github.com/mora-lab/benchmarks.

The first step is to structure the benchmark dataset. This requires carefully choosing appropriate samples (GSMs) from the GEO DataSets repository (https://www.ncbi.nlm.nih.gov/gds/). This collection constitutes our benchmark dataset for comparing gene-set enrichment tools for genomic regions. The datasets are preprocessed as follows:

1. If the files are available as *wig*, convert them to BED format with *wig2bed* from the *BEDOPS* toolkit on the command line.
2. Keep the 3 essential fields from the BED files: *chrom*, *start*, and *end*, giving the chromosome name, start coordinate, and end coordinate respectively.
3. Remove entries from mitochondrial DNA (*chrM*, *chrMT*) and other non-standard chromosome labels, such as random contigs.
4. Convert the BED files into genomic ranges objects using the **GenomicRanges** package. This not only regularises the file format, but also reduces disk usage.
5. Compile the individual genomic ranges objects into a **genomic ranges list**.

The compiled dataset will eventually be distributed as an R package.
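Steps 2 to 5 above can be sketched roughly as follows. This is not the code in `Data_Import_Cleaning_Function.R`; the toy table, the helper name `clean_bed`, and the chromosome-label pattern are assumptions chosen for illustration. The GenomicRanges part only runs if that Bioconductor package is installed.

```r
# Toy BED-like table standing in for a file loaded with read.table();
# all values are made up.
bed <- data.frame(
  chrom = c("chr1", "chrM", "chr2", "chr1_random", "chrX", "chrMT"),
  start = c(100L, 5L, 2000L, 40L, 700L, 10L),
  end   = c(250L, 90L, 2300L, 95L, 910L, 60L),
  name  = paste0("peak", 1:6),
  stringsAsFactors = FALSE
)

# Hypothetical helper covering steps 2 and 3.
clean_bed <- function(bed) {
  # Step 2: keep only the three essential fields.
  bed <- bed[, c("chrom", "start", "end")]
  # Step 3 (assumed pattern): keep numbered chromosomes, chrX and chrY.
  # This drops chrM, chrMT and labels such as chr1_random.
  keep <- grepl("^chr([0-9]+|X|Y)$", bed$chrom)
  bed <- bed[keep, , drop = FALSE]
  rownames(bed) <- NULL
  bed
}

cleaned <- clean_bed(bed)

# Steps 4 and 5: convert to GRanges and collect into a GRangesList.
if (requireNamespace("GenomicRanges", quietly = TRUE)) {
  # BED starts are 0-based, so let GenomicRanges shift them to 1-based.
  gr <- GenomicRanges::makeGRangesFromDataFrame(
    cleaned, starts.in.df.are.0based = TRUE
  )
  # One list element per sample; "sample1" is a placeholder name.
  grl <- GenomicRanges::GRangesList(sample1 = gr)
}
```

In a real run, each GSM would contribute one element to the list, named by its accession.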

In [None]:
source("Data_Import_Cleaning_Function.R")

The next step is to load the benchmark package and its dependencies. The packages for the tools under comparison are also installed and loaded at this point. The following script does this.

In [None]:
source("Installing_Packages_Benchmark.R")

The script above installs the tools that we intend to compare, as well as the package built from our benchmark dataset.

In [None]:
source("Execute_Chipenrich_Broadenrich_Seq2pathway.R")

The script above runs the three tools in order and saves each tool's results as R data files.

In [None]:
source("Extracting_Valued_Results_Chipenrich_Broadenrich_Seq2pathway.R")

The outputs from GREAT and Enrichr must be curated manually, as these tools currently do not offer a programming interface.

In [None]:
source("Enrichr_Results_Compilation.R")

In [None]:
source("GREAT_Results_Compilation.R")

This comparison study covers 5 tools: Chipenrich, Broadenrich, Seq2pathway, Enrichr, and GREAT.

After compiling the results, we proceed with the calculation of comparison metrics: prioritization, sensitivity, specificity, and precision.  
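To make the metrics concrete, here is a minimal sketch using the standard confusion-matrix definitions. How `Prioritization.R` and `SnSpPr.R` define them is not shown here, so everything below is an assumption: the function names, the significance cutoff `alpha = 0.05`, the treatment of the known target pathways as the positives, and prioritization as the mean relative rank of the targets (lower is better). The p-values are made up.

```r
# Hypothetical helper: sensitivity, specificity and precision for one tool.
# pvalues: named numeric vector, one p-value per tested gene set.
# targets: names of the gene sets expected to be enriched.
compare_metrics <- function(pvalues, targets, alpha = 0.05) {
  all_sets    <- names(pvalues)
  targets     <- intersect(targets, all_sets)  # ignore untested targets
  significant <- all_sets[pvalues < alpha]
  tp <- sum(significant %in% targets)          # targets called significant
  fp <- length(significant) - tp               # non-targets called significant
  fn <- length(targets) - tp                   # targets that were missed
  tn <- length(all_sets) - tp - fp - fn        # non-targets correctly ignored
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    precision   = tp / (tp + fp))
}

# Hypothetical helper: mean relative rank of the target gene sets.
prioritization <- function(pvalues, targets) {
  targets <- intersect(targets, names(pvalues))
  mean(rank(pvalues)[targets] / length(pvalues))
}

# Made-up results for a single tool and dataset.
pvalues <- c(setA = 0.001, setB = 0.20, setC = 0.03, setD = 0.60, setE = 0.80)
targets <- c("setA", "setB")

metrics   <- compare_metrics(pvalues, targets)
mean_rank <- prioritization(pvalues, targets)
```

With these toy values, one of the two targets is detected and one of the two significant sets is a target, so sensitivity and precision are both 0.5.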

In [None]:
source("Prioritization.R")

In [None]:
source("SnSpPr.R")

Next is the visualization of the results.

In [None]:
source("Plotting_Comparison_Metrics.R")

To test the robustness of the tools, we introduce noise. The concept is depicted in the figure below.

![Simulation](Simulation.jpg)

In [None]:
source("Simulation_For_False_negatives.R")

In [None]:
source("Simulation_For_False_Positives.R")

With the simulation results, we rerun the entire pipeline as above and check for discrepancies in results. If the results are consistent, we preserve the ranking of the tools.
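One plausible way to introduce such noise is sketched below. It is an illustration only and may differ from what the simulation scripts do: it assumes that false negatives are provoked by dropping a fraction of the real regions and false positives by adding randomly placed regions. The helper names, the fractions, the region width, and the toy chromosome sizes are all hypothetical.

```r
set.seed(1)  # make the toy example reproducible

# Toy regions and toy chromosome sizes (not real genome coordinates).
regions <- data.frame(
  chrom = rep(c("chr1", "chr2"), each = 5),
  start = seq(1000L, by = 5000L, length.out = 10),
  end   = seq(1200L, by = 5000L, length.out = 10),
  stringsAsFactors = FALSE
)
chrom_sizes <- c(chr1 = 1000000L, chr2 = 800000L)

# Assumed noise model 1: remove a fraction of the real regions.
drop_regions <- function(regions, fraction) {
  n_drop <- round(nrow(regions) * fraction)
  if (n_drop == 0) return(regions)
  regions[-sample.int(nrow(regions), n_drop), , drop = FALSE]
}

# Assumed noise model 2: add randomly placed regions of fixed width.
add_random_regions <- function(regions, fraction, chrom_sizes, width = 200L) {
  n_add <- round(nrow(regions) * fraction)
  chrom <- sample(names(chrom_sizes), n_add, replace = TRUE)
  # 0-based start chosen so that the region fits inside its chromosome.
  start <- unname(vapply(
    chrom,
    function(ch) sample.int(chrom_sizes[[ch]] - width, 1L) - 1L,
    integer(1)
  ))
  noise <- data.frame(chrom = chrom, start = start, end = start + width,
                      stringsAsFactors = FALSE)
  rbind(regions, noise)
}

fewer_regions <- drop_regions(regions, fraction = 0.2)
more_regions  <- add_random_regions(regions, fraction = 0.3, chrom_sizes)
```

The perturbed region sets can then be passed through the same steps as the original data, and the resulting metrics compared with the noise-free run.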