ozanozisik/orsum

orsum logo

v1.7.0

orsum, which stands for "over-representation summary", is a tool to filter long lists of enriched terms resulting from one or more enrichment analyses. Filtering in orsum is based on a simple principle: a term is discarded if there is a more significant term that annotates at least the same genes. In other words, the more significant ancestor (general) term represents the less significant descendant (specific) term. The remaining term becomes the representative term for the discarded term. orsum works on hierarchical annotations, e.g. GO and REACTOME.
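The principle can be illustrated with a minimal Python sketch. This is not the orsum implementation: the function name summarize and the toy term and gene names are made up for illustration, and it handles a single ranked list only.

```python
def summarize(ranked_terms, gene_sets):
    """Group terms under representatives (illustrative sketch only).

    ranked_terms: term IDs ordered from most to least significant.
    gene_sets: dict mapping each term ID to the set of genes it annotates.
    Returns a dict mapping each representative term to the list of terms
    it represents.
    """
    representatives = {}  # insertion order keeps the most significant first
    for term in ranked_terms:
        genes = gene_sets[term]
        for rep in representatives:
            # A more significant term annotating at least the same genes
            # becomes the representative of this term.
            if genes <= gene_sets[rep]:
                representatives[rep].append(term)
                break
        else:
            # No more significant superset exists: the term is kept.
            representatives[term] = []
    return representatives


# Hypothetical toy data: term "A" annotates every gene that "B" annotates.
toy_gene_sets = {"A": {"g1", "g2", "g3"}, "B": {"g1", "g2"}, "C": {"g4"}}
print(summarize(["A", "B", "C"], toy_gene_sets))  # {'A': ['B'], 'C': []}
```

If "B" were ranked above "A", both would be kept, because "B" does not annotate all genes of "A".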

As input, orsum requires enrichment results (one or more files containing the enriched term IDs) and a GMT file containing the gene sets.

The ancestor-descendant relations among terms are inferred from the GMT file. GMT files for the annotations are available from multiple sources. For consistency, it is best to use the exact GMT file that was used in the enrichment analysis, if it is available. If you already use g:Profiler for enrichment, or a GMT file is not available from your enrichment tool, getting it from this g:Profiler data source link is an option. The zip file contains GMT files for Reactome, GO:BP, GO:MF and GO:CC, which you can use with orsum.
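To make the idea of inferring relations from a GMT file concrete, here is a small sketch. It assumes the common GMT layout (one term per line, tab-separated: term ID, description, then the genes). The functions read_gmt and is_ancestor and the TERM:1/TERM:2 entries are hypothetical and are not part of orsum.

```python
import io


def read_gmt(handle):
    """Parse GMT lines into a dict of term ID -> set of genes."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank lines and terms without genes
        term_id, _description, *genes = fields
        gene_sets[term_id] = {gene for gene in genes if gene}
    return gene_sets


def is_ancestor(general, specific, gene_sets):
    """Treat a term as an ancestor if it annotates at least the same genes."""
    return gene_sets[specific] <= gene_sets[general]


# Hypothetical two-term GMT content, used in place of a real file.
sample = io.StringIO(
    "TERM:1\tgeneral process\tG1\tG2\tG3\n"
    "TERM:2\tspecific process\tG1\tG2\n"
)
gene_sets = read_gmt(sample)
print(is_ancestor("TERM:1", "TERM:2", gene_sets))  # True
print(is_ancestor("TERM:2", "TERM:1", gene_sets))  # False
```

For a real file, the same function can be called on an open file handle, e.g. with open('hsapiens.GO:BP.name.gmt') as handle.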

The input files containing the enriched term IDs (e.g. GO:0008150 or REAC:R-HSA-1640170) must contain one ID per line and must be sorted by significance, with the most significant term at the top. There should not be a header.
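If your enrichment tool reports a table with p-values, a file in this format can be produced with a few lines of Python. The sketch below assumes the results are already available as (term ID, p-value) pairs. The function write_ranked_ids and the p-values are made up for illustration.

```python
import io


def write_ranked_ids(results, handle):
    """Write term IDs one per line, most significant first, without a header.

    results: iterable of (term_id, p_value) pairs in any order.
    handle: a writable text file object.
    """
    for term_id, _p_value in sorted(results, key=lambda item: item[1]):
        handle.write(f"{term_id}\n")


# Hypothetical enrichment results; the p-values are invented.
results = [("GO:0007049", 0.03), ("GO:0008150", 0.001)]
buffer = io.StringIO()
write_ranked_ids(results, buffer)
print(buffer.getvalue(), end="")
# GO:0008150
# GO:0007049
```

To write an actual input file, pass a handle opened for writing, e.g. open('Enrichment-GOBP.txt', 'w'), instead of the in-memory buffer.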

orsum produces multiple files as output:

  • An HTML file which presents the filtered results - the list of representative terms - with an option to click on each term to see the represented (discarded) terms.
  • Two TSV files (-Summary.tsv, -Detailed.tsv) which contain the information in the HTML file in different formats, with some extras, like term size. The -Summary.tsv file contains only representative terms, while the -Detailed.tsv file additionally contains the represented terms. The -Summary.tsv file is very useful for getting an overview when comparing multiple enrichment results.
  • A TSV file that consists of two columns, mapping representative term IDs to represented term IDs. This file is for programmatic access in case it is needed.
  • Four figures:
    • A heatmap presenting the top representative terms, colored according to the quartile of their ranks in each input enrichment result.
    • A clustered heatmap presenting the top representative terms, colored according to the quartile of their ranks in each input enrichment result. This is useful to see common and different terms between multiple inputs.
    • A barplot presenting the top representative terms and how many terms they represent.
    • A plot representing term size vs rank of representative terms.
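The heatmaps color terms by the quartile of their rank. The exact binning used by orsum is not described here, so the following is only one plausible reading: ranks are 1-based, and the ranked list of a given input is split into four equal parts. The function rank_quartile is hypothetical.

```python
def rank_quartile(rank, total):
    """Return the quartile (1 to 4) of a 1-based rank among `total` terms."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    return (4 * (rank - 1)) // total + 1


# With 8 ranked terms, each quartile holds two of them.
print([rank_quartile(rank, 8) for rank in range(1, 9)])
# [1, 1, 2, 2, 3, 3, 4, 4]
```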

To use the tool, you can either download the .py files from this repository and run orsum.py, or install it from bioconda:
conda install -c bioconda orsum


Usage:
orsum.py [-h] [-v] --gmt GMT --files FILES [FILES ...] [--fileAliases FILEALIASES [FILEALIASES ...]] [--outputFolder OUTPUTFOLDER] [--maxRepSize MAXREPSIZE] [--maxTermSize MAXTERMSIZE] [--minTermSize MINTERMSIZE] [--numberOfTermsToPlot NUMBEROFTERMSTOPLOT]

  • --gmt: Path of the GMT file. (required)
  • --files: Paths of the enrichment result files. (required)
  • --fileAliases: Aliases for input enrichment result files to be used in orsum results. (optional, by default file names are used)
  • --outputFolder: Path for the output result files. If it is not specified, results are written to the current directory. (optional, default=".")
  • --maxRepSize: The maximum size of a representative term. Terms larger than this size will not be discarded but also will not be able to represent other terms. (optional, default is a number larger than any annotation term, which means that it has no effect)
  • --maxTermSize: The maximum size of the terms to be processed. Larger terms will be discarded. (optional, default is a number larger than any annotation term, which means that it has no effect)
  • --minTermSize: The minimum size of the terms to be processed. Smaller terms will be discarded. (optional, default=10)
  • --numberOfTermsToPlot: The number of representative terms to be presented in the barplot and heatmaps. (optional, default=50)
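The three size options interact as follows, according to the descriptions above. This is a simplified sketch of the described behavior and not the code used in orsum. The function apply_size_limits and the term names T1 to T4 are hypothetical, and None stands in for "no limit".

```python
def apply_size_limits(ranked_terms, term_sizes, min_term_size=10,
                      max_term_size=None, max_rep_size=None):
    """Return (kept terms, terms that are allowed to represent others)."""
    kept = []
    may_represent = set()
    for term in ranked_terms:
        size = term_sizes[term]
        if size < min_term_size:
            continue  # smaller than --minTermSize: discarded
        if max_term_size is not None and size > max_term_size:
            continue  # larger than --maxTermSize: discarded
        kept.append(term)
        # Terms larger than --maxRepSize stay in the results but
        # cannot represent other terms.
        if max_rep_size is None or size <= max_rep_size:
            may_represent.add(term)
    return kept, may_represent


# Hypothetical term sizes (number of annotated genes).
sizes = {"T1": 5, "T2": 50, "T3": 2500, "T4": 4000}
kept, may_represent = apply_size_limits(
    ["T1", "T2", "T3", "T4"], sizes,
    min_term_size=20, max_term_size=3000, max_rep_size=2000,
)
print(kept)           # ['T2', 'T3']
print(may_represent)  # {'T2'}
```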

Example command with a single enrichment result:
orsum.py --gmt 'hsapiens.GO:BP.name.gmt' --files 'Enrichment-GOBP.txt' --outputFolder 'OutputGOBP'

Example command with multiple enrichment results and custom thresholds:
orsum.py --gmt 'hsapiens.REAC.name.gmt' --files 'Enrichment-Method1-Reac.txt' 'Enrichment-Method2-Reac.txt' 'Enrichment-Method3-Reac.txt' --fileAliases 'Method 1' 'Method 2' 'Method 3' --outputFolder 'OutputReac' --maxRepSize 2000 --maxTermSize 3000 --minTermSize 20 --numberOfTermsToPlot 20
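When orsum is part of a larger pipeline, the command can be assembled in Python and passed to subprocess. The helper build_orsum_command below is hypothetical. It only uses options listed in the usage section and assumes orsum.py is on the PATH.

```python
def build_orsum_command(gmt, files, file_aliases=None, output_folder=None):
    """Assemble the orsum.py argument list from the documented options."""
    command = ["orsum.py", "--gmt", gmt, "--files", *files]
    if file_aliases:
        command += ["--fileAliases", *file_aliases]
    if output_folder:
        command += ["--outputFolder", output_folder]
    return command


command = build_orsum_command(
    gmt="hsapiens.REAC.name.gmt",
    files=["Enrichment-Method1-Reac.txt", "Enrichment-Method2-Reac.txt"],
    file_aliases=["Method 1", "Method 2"],
    output_folder="OutputReac",
)
print(command)
# To run it: import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with aliases that contain spaces, such as 'Method 1'.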


If you use orsum, please cite our publication:

Ozisik O, Térézol M, Baudot A. orsum: a Python package for filtering and comparing enrichment analyses using a simple principle. BMC Bioinformatics. 2022 Jul 23;23(1):293. doi: 10.1186/s12859-022-04828-2.



This work is conducted in the Networks and Systems Biology for Diseases group of Anaïs Baudot (https://www.marseille-medical-genetics.org/a-baudot/).