Skip to content

pontushojer/NAIBR

 
 

Repository files navigation

CI codecov Code style: black install with bioconda

NAIBR - Novel Adjacency Identification with Barcoded Reads

Note This is a re-factoring of the code from raphael-group/NAIBR to make the tool a bit more user-friendly and adds some additional features. This means that the resulting output might differ from what you would get from raphael-group/NAIBR. See Benchmarks for a comparison.

Contents

Overview

NAIBR (Novel Adjacency Identification with Barcoded Reads) identifies novel adjacencies (NAs) created by structural variation events such as deletions, duplications, inversions, and complex rearrangements using linked-read whole-genome sequencing data produced by 10X Genomics. Please refer to the publication for details about the method.

NAIBR takes as in put a phased BAM, such as those produced by 10X Genomic's Long Ranger pipeline, and outputs a BEDPE file containing predicted novel adjacencies and a likelihood score for each adjacency.

Install

Install using pip:

pip install git+https://github.com/pontushojer/NAIBR.git

This fork of NAIBR is also available from binconda under the name naibr-plus. Install using conda:

conda install naibr-plus

Running NAIBR

NAIBR can be run using the following command:

naibr <configfile>

A template config file can be found in example/example.config. The following parameters can be set in the config file:

  • bam_file: Input phased BAM file with BX (and HP) tagged reads (REQUIRED)
  • min_mapq: Minimum mapping quality for a read to be included in analysis (default: 40)
  • outdir: Output directory (default: . i.e. current workding directory)
  • prefix: Prefix for output files, set to empty sting to match bam_file name (default: NAIBR_SVs)
  • d: The maximum distance in basepairs between reads in a linked-read (default: 10000)
  • blacklist: BED-file with regions to be excluded from analysis
  • candidates: BEDPE-file with novel adjacencies to be scored by NAIBR. This will override automatic detection of candidate novel adjacencies
  • threads: Number of threads (default: 1)
  • min_sv: Minimum size of a structural variant to be detected (default: lmax, i.e. the 95th percentile of the paired-end read insert size distribution)
  • k: minimum number of barcode overlaps supporting a candidate NA (default = 3)

Run example

Example files are provided in the 'example' directory. Running

naibr example/example.config

will produce the files example/NAIBR_SVs.bedpe, example/NAIBR_SVs.reformat.bedpe and example/NAIBR_SVs.vcf.

Output

Scored novel adjacencies are outputted in three different file formats to the output directory

<prefix>.bedpe

BEDPE-like file in same format as outputted in raphael-group/NAIBR. Note that this file does not follow any actual BEDPE format. The file has a header with column descriptions.

Columns:

  1. Chr1: Chromosome of 1st breakpoint
  2. Break1: Position of 1st breakpoint
  3. Chr2: Chromosome of 2nd breakpoint
  4. Break2: Position of 2nd breakpoint
  5. Split molecules: Number of split molecule supporting NA
  6. Discordant reads: Number of discordant reads supporting NA
  7. Orientation: Orientation of NA. "+-" ~ DEL, "++"/"--" ~ INV, "-+" ~ "DUP"
  8. Haplotype
  9. Score: log-likelihood score for NA
  10. Pass filter: "PASS" if passes filter threshold, otherwice "FAIL"

<prefix>.reformat.bedpe

Reformatted BEDPE that follows the 10x Genomics BEDPE format suitable for IGV visualization

<prefix>.vcf

NAs translated to the VCF format. Useful for operation using e.g. truvari.

Converting BEDPE to VCF

The repository includes a utility script bedpe_to_vcf.py for converting from NAIBR's BEDPE format to a VCF. For detail on how to use it run:

bedpe_to_vcf.py --help

Benchmarks

Benchmarking this fork (labeled main_<commit>) versus raphael-group/NAIBR (labeled master). Master includes raphael-group#22 merged as this removes a lot of FP calls introduced by a code error.

HG002

Run NAIBR on 10x Genomics data for idividual HG002 (NA24385). Dataset from GIAB (source) has 51.7x coverage and was run through the Long Ranger pipeline using reference genome GRCh37.

NAIBR configs used min_mapq=40 d=10000 min_sv=1000 threads=10 k=3 min_reads=2 min_discs=2

Compared deletions (DELs) to GIAB benchmark set Tier1 (v0.6) using truvari (v3.1.0). Compared both all entries (named *_all) and only those that passed filters (labeled *_pass).

Command for *_all:

truvari bench -b HG002_SVs_Tier1_v0.6.withchr.DELs.vcf.gz -c master/NAIBR_SVs.DELs.vcf.gz -o truvari_bench/master_all --includebed HG002_SVs_Tier1_v0.6.withchr.noY.bed --pctsim 0.5 --pctsize 0.5 --reference /Users/pontus.hojer/analysis/references/hg19_longranger/genome.fa --sizemin 1000 --sizemax 20000000 --multimatch --giabreport

Command for *_pass:

truvari bench -b HG002_SVs_Tier1_v0.6.withchr.DELs.vcf.gz -c master/NAIBR_SVs.DELs.vcf.gz -o truvari_bench/master_pass --includebed HG002_SVs_Tier1_v0.6.withchr.noY.bed --pctsim 0.5 --pctsize 0.5 --reference /Users/pontus.hojer/analysis/references/hg19_longranger/genome.fa --sizemin 1000 --sizemax 20000000 --multimatch --giabreport --passonly

Results

Sample Precision Recall F1 score TP-base TP-call FP FN
main_cbbdf01_all 67.0% 59.6% 63.1% 480 478 235 325
main_cbbdf01_pass 99.3% 77.4% 87.0% 411 411 3 120
master_all 34.3% 28.6% 31.2% 230 229 439 575
master_pass 96.3% 14.9% 25.8% 79 79 3 452

This fork, i.e. main_cbbdf01_pass, had much higher F1 score of 87.0% than the master branch at 25.8%. Also the main_cbbdf01_all seems to include a lot of TP calls that are filtered out by NAIBR, possibly filtering should be re-optimized.

Development

Create a new release

Tag the commit with the new version number, e.g. v0.1.0:

git tag <version>

Push tag

git push origin <version>

Build package

python3 -m pip install --upgrade build
python3 -m build

This generates a file dist/NAIBR-<version>.tar.gz.

Create a release on GitHub and upload the dist/NAIBR-<version>.tar.gz for the new version.

Citing

Elyanow, Rebecca, Hsin-Ta Wu, and Benjamin J. Raphael. Identifying structural variants using linked-read sequencing data. Bioinformatics (2017).

@article{elyanow2017identifying,
  title={Identifying structural variants using linked-read sequencing data},
  author={Elyanow, Rebecca and Wu, Hsin-Ta and Raphael, Benjamin J},
  journal={Bioinformatics},
  year={2017}
}

About

Novel Adjacency Identification with Barcoded Reads

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Python 97.2%
  • Shell 2.8%