Skip to content


Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?

Latest commit


Git stats


Failed to load latest commit information.
Latest commit message
Commit time

Fusion Detection Overlap



Fusion detection overlap merges fusion output data from Arriba, CICERO, FusionMap, FusionCatcher, JAFFA, MapSplice, SOAPfuse, STAR-Fusion, TopHat-Fusion, and DRAGEN RNA-Seq.

Getting Started

Download code

Download the overlap code from GitHub. Use git to clone the repo or navigate to the nch-igm-ensemble-fusion-detection GitHub page, download, and unzip the code.

git clone

Install dependencies

The main dependencies are R (R Setup) and a Postgres database (Postgres Database Setup).

R Setup

Install R (version 3.5 is preffered since the code is tested on this version). Visit the R website for installation instructions or use your preferred install tool.

R Packages

The following packages are required for fusion detection overlap.

Start an R session.


Run these commands in the R session.

install.packages('devtools', repos='')
install.packages('optparse', repos='')
install.packages('dplyr', repos='')
install.packages('readr', repos='')
install.packages('DBI', repos='')
install.packages('RPostgres', repos='')
install.packages('dbplyr', repos='')
install.packages('magrittr', repos='')
install.packages('purrr', repos='')

Postgres Database Setup

Overlap interacts with a database of fusions. The database is used to store previously called fusions. The overlap code queries the database for the frequency of reported fusions. Fusion calls with a high frequency are regarded as artifacts or unlikely to cause disease.

Step 1: Install Postgres

On macOS, you can use Homebrew to install Postgres.

brew install postgres

On Linux or Windows, you will need to find alternative instructions.

Step 2: (Initialize and) Start the database

On macOS, Homebrew automatically initializes the space where the databases will be written on your filesystem at /usr/local/var/postgres. Users may experience different results on different systems. You may need to initialize the database at /usr/local/var/postgres or another location or you may not need to initialize the database at all.

initdb /usr/local/var/postgres

Start the database.

pg_ctl -D /usr/local/var/postgres start
Step 3: Setup the fusions database with test data

Create the fusions database.

createdb fusions

Create the postgres user if it doesn't already exist.

createuser -s postgres

Import the schema and test data for the fusions database. You may need to modify the host.

cd nch-igm-ensemble-fusion-detection
psql -h localhost fusions < sql/fusions_example_db_dump.sql

Modify the read-only and read-write users' passwords in the create_users.sql file.

vim sql/create_users.sql  # set passwords

Import the read-only and read-write users.

psql -h localhost fusions < sql/create_users.sql
Step 4: Setup the .dbconfig.R file

R/.dbconfig.R sets the configuration variables that the overlap code uses to connect to the database. R variables dbname, host, port, user, and passwd must be set in this file so that the overlap code can connect to the database.

echo 'dbname <- "fusions"
host <- "localhost"
port <- 5432
user <- "fusion_user_rw"
passwd <- "PASSWORD_RW_HERE"' > R/.dbconfig.R

Running Overlap from using bash

In order for the overlap script to run properly, there is an expected file structure hierarchy, please see

the Example below for assemble_results or view example data in "test_data" directory

Run '' to run the necessary R scripts (described in detail below)

overlap/ -h

The result:

[USAGE]: This script kicks off a script that merges fusion detection results from many tools.

-h * [None] Print this help message.
-p * [path1,path2,...] paths  where fusion results are where the script will be kicked off.
-s * [samples_file1,samples_file2,...] Files paths that contain the lists of samples to kickoff scripts for. The number of samples files provided must be equal to the number of paths, and they must be listed in the same respective order. If not specifically added, the script will search through sample names that are listed in the samples file within the path (-p) given[Default: samples,samples,...].

Note that -s is not required, and if not added the script will automatically run on all samples listed in your samples file

Try with test data:

overlap/ -p /test_data

Results (will show you which callers it found output for):

WARNING: ignoring environment value of R_HOME
Assembling a list of files to merge...
[1] "starFusion"    "fusionMap"     "fusionCatcher" "jaffa"        
[5] "mapSplice"     "dragen"       

Explanation of and option to run R code directly:

Run assemble_results.R to merge results and generate the overlap report. See the help message for assemble_results.R by running

R/assemble_results.R -h


Rscript R/assemble_results.R -h

The result

Usage: assemble_results.R --sample s1

assemble_results.R reads a set of fusion detection result files and produces an aggregated report with the overlapping fusion events predicted in the input files. The program searches recursively from the current directory to find known files. The input files can also be specified manually.

		Name of the sample. (required)

		Base directory to search for input files. [default = .]

		Location of the output report. [default = overlap_$sample.tsv]

		Location of the output report. [default = collapsed_3callers_$sample.tsv]

		Location of the output report. [default = filtered_overlap_2callers_$sample.tsv]

		Location of the output report. [default = filtered_overlap_3callers_$sample.tsv]

		Location of the Singleton output report. [default = Singleton_KnownFusions_$sample.tsv]

		Location of the breakpoint report. [default = breakpoints_$sample.tsv]

		Path to the dragen results file. [default = search for file named like '']

		Path to the fusionCatcher results file. [default = search for file named like 'final-list_candidate-fusion-genes.txt']

		Path to the fusionMap results file. [default = search for file named like 'FusionDetection.FusionReport.Table.txt']

		Path to the jaffa results file. [default = search for file named like 'jaffa_results.csv']

		Path to the mapSplice results file. [default = search for file named like 'fusions_well_annotated.txt']

		Path to the soapFuse results file. [default = search for file named like '.final.Fusion.specific.for.genes']

		Path to the starFusion results file. [default = search for file named like 'star-fusion.fusion_predictions.abridged.tsv']

		Path to the tophatFusion results file. [default = search for file named like 'result.txt']. potential_fusion.txt must be in the same directory as result.txt for the import to work.

		Specific location of the arriba results file (fusions.tsv).

		Specific location of the cicero results file (annotated.fusion.txt).

	-h, --help
		Show this help message and exit```

### `--sample`

`--sample` is the only required argument for `assemble_results.R`.

R/assemble_results.R --samples s1

--sample is used to insert a comment into the top of the output files

# Sample: s1
# NumToolsAggregated: 5
# - starFusionCalls = 20
# - fusionCatcherCalls = 552
# - jaffaCalls = 948
# - mapSpliceCalls = 23
# - dragenCalls = 9

and name the output files

--outReport      	= overlap_$sample.tsv
--collapsedoutReport	= collapsed_3callers_$sample.tsv
--foutReport     	= filtered_overlap_2callers_$sample.tsv
--foutReport3    	= filtered_overlap_3callers_$sample.tsv
--outSingleton		= Singleton_KnownFusions_$sample.tsv
--outBreakpoints 	= breakpoints_$sample.tsv

You can override the default output file names with the --*out* parameters.


The --baseDir argument sets the location of the input data. By default, --baseDir is set to the current directory ., but you can override the default location

R/assemble_results.R --sample s1 --baseDir fusion_results

kickoff_overlap.R recursively searches all directories under the baseDir for fusion detection results files. The default search looks for

--arriba        = fusions.tsv
--cicero        = annotated.fusion.txt
--dragen        =
--fusionCatcher = final-list_candidate-fusion-genes.txt
--fusionMap     = FusionDetection.FusionReport.Table.txt
--jaffa         = jaffa_results.csv
--mapSplice     = fusions_well_annotated.txt
--soapFuse      = .final.Fusion.specific.for.genes
--starFusion    = star-fusion.fusion_predictions.abridged.tsv
--tophatFusion  = result.txt (potential_fusion.txt must be in the same directory as result.txt)

For each tool, overlap uses the first occurence of a file that matches the pattern associated with that tool. If there are no files that match for the tool, then overlap assumes that the tool was not run.

Further info: kickoff_overlap.R uses the R function list.files with the pattern parameter set to the pattern. The pattern is used as a regular expression used to match file names. The soapFuse pattern .final.Fusion.specific.for.genes starts with a period because the actual output file is named [sample].final.Fusion.specific.for.genes. Our pattern will match soapFuse output files with different samples names, like or


Given the following directory structure


run the command

cd /data/fusion_output_1
~/igm-ensemble-fusion-detection/R/assemble_results.R --sample s1


cd ~/igm-ensemble-fusion-detection
R/assemble_results.R --sample s1 --baseDir /data/fusion_output_1

and overlap will be able to find fusion output files for fusionCatcher, fusionMap, starFusion, and dragen. Overlap will be able to find the correct files even if they are under subdirectories since the search is recursive.


--starFusion override example

You can override the default fusion output file search with direct fusion output file names.

R/assemble_results.R --samples s1 --starFusion path/to/custom-star-fusion-output.tsv

This command will recursively search under the current directory . for the dragen, fusionCatcher, fusionMap, etc., default files, but it will not perform the same search for the starFusion file. It will use the exact location to the starFusion file as provided by the user path/to/custom-star-fusion-output.tsv.

The overlap code only knows how to parse the data from the output files in the default list. Make sure your custom locations point to files that hold the same data. There may be multiple output files from each fusion caller. Our overlap code is only parsing certain files for certain data to generate the overlap report.

Upload to the database

Run to upload fusion results to the database.

overlap/ -h

The result

[USAGE]: Use this script to upload fusion detection results to the fusion detection database.

-h * [None] Print this help message.
-d * [results_dir1,results_dir2,...] List of directories (comma-separated) where the fusion results are located. Directories must be full paths. Ex: /path/to/fusion_results1,/path/to/fusion_results2.
-p * [patient1,patient2,...] List of patient IDs (comma-separated). The number of patient IDs must be equal to the number of results directories and they must be listed in the same respective order. Ex: patient1,patient2. (Also referred to as the 'subject ID'.)
-s * [sample1,sample2,...] List of sample IDs (comma-separated). The number of sample IDs must be equal to the number of results directories and they must be listed in the same respective order. Ex: sample1,sample2.
-r * [reads1,reads2,...] List of read pairs numbers (comma-separated). The number of read pairs numbers must be equal to the number of results directories and they must be listed in the same respective order. Optional. Ex: 40000000,52000000.

[Example 1]: -d /path/to/fusion_results1 -p p1 -s s1
[Example 2]: -d /path/to/fusion_results1,/path/to/fusion_results2 -p p1,p2 -s s1,s2
[Example 3]: -d /path/to/fusion_results1 -p p1 -s s1 -r 42543879

The -d results directory must contain fusion detection results files. The results directory usually is the top-level directory for a sample where multiple fusion caller results are stored in sub-directories. The upload code recursively searches all directories under the results directory for fusion detection results files. See the example above for an example results directory layout. can take a single results directory as input (for results from multiple tools on a single sample) or multiple results directories as input (to make the upload process faster for multiple samples).

The -p patient ID and -s sample ID are necessary identifiers to upload the data to the database. The -r number of read pairs is optional, depending on whether you want to store that information.


The database uses the term 'analysis' to refer to a fusion caller execution on a single sample. A STAR-Fusion run on sample1 is considered an analysis, say analysis1. A JAFFA run on sample1 is considered a different analysis, say analysis2.

Analysis upload process

When uploading data to the database, the upload script uses the concept of an analysis to decide whether the data already exists in the database. If the analysis already exists in the database, then the upload script will skip the upload for that analysis. If there is already STAR-Fusion data for sample1 in the database (analysis1), then any additional attempt to upload or replace STAR-Fusion data for sample1 will fail.

  • Upside - Since an analysis will not be uploaded twice, you can re-run the upload script on the same results directory and not have to worry about overwriting data. This can be useful if, for example, you run STAR-Fusion and JAFFA on sample1 and upload the results. At a later time, you decide to run FusionMap on the same sample. When you want to add the FusionMap results to the database, you can run the upload script on the sample1 results directory and it will only add the new FusionMap analysis to the database.
  • Downside - You cannot replace analysis data using the upload script. If, for example, you run a new version of STAR-Fusion on sample1, you might want to replace the old STAR-Fusion results in the database. The upload script does not handle this scenario and you will have to perform a manual replacement.

Upload with R instead of bash

Use upload_fusion_results.R for a greater level of control over the database upload (or if you don't have bash on your system to run

R/upload_fusion_results.R -h

The result

Usage: R/upload_fusion_results.R [options]

This script acts as an importer for fusion detection results to the fusion detection database.

                Subject ID, e.g. subject1. (required)
                Sample ID, e.g. sample1. (required)
                Phenotype of the subject. (optional)
                Storage method for the sample, such as 'Frozen' or 'FFPE'. (optional)
                Number of read pairs for the sample. (optional)
                Custom tool for which the results are being entered, possible values are 'dragen', 'FusionCatcher', 'FusionMap', 'Jaffa', 'MapSplice', 'SoapFuse', 'StarFusion', and 'TophatFusion'. (optional)
                Custom location of the results file for the associated tool, e.g. 'star-fusion.fusion_predictions.abridged.tsv'. (optional)
        -h, --help
                Show this help message and exit

upload_fusion_results.R must be run from the results directory. It searches for results files under .. Alternatively, you can upload results one-by-one by providing the --tool and --results parameters.


No description, website, or topics provided.







No packages published