RohHunter documentation

RohHunter is a tool for detecting runs of homozygosity (ROHs) from a variant list in VCF format.

RohHunter uses the allele frequency of variants to calculate the probability of observing a ROH by chance.
Allele frequency information can be annotated to the variant list via Ensembl VEP.

Algorithm description

The RohHunter algorithm performs the following steps:

  1. Filter variants (markers) by quality to remove false genotype calls:
    • Depth (default: ≥20)
    • Variant Q score (default: ≥30)
  2. Determine raw stretches of homozygous markers
  3. Assign each stretch the probability of observing it as a ROH by chance
    • based on allele frequency, e.g. from 1000g and gnomAD
  4. Remove regions with a low Q score, i.e. regions that are likely to occur by chance (default: <Q30)
  5. Merge adjacent ROHs based on
    • distance in markers (default: ≤1 or ≤1% of ROH marker count)
    • distance in bases (default: ≤50% of ROH base count)
  6. Filter based on
    • Number of markers (default: ≥20)
    • Size (default: ≥20Kb)
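
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the RohHunter implementation: the `Marker` class, the function names and the chance model in `hom_by_chance` (Hardy-Weinberg proportions with an arbitrary allele frequency floor) are assumptions made for this example, and the merging of adjacent ROHs (step 5) is left out. Markers are assumed to be sorted by position on a single chromosome.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Marker:
    pos: int      # position on a single chromosome
    depth: int    # read depth
    qual: float   # variant Q score
    is_hom: bool  # homozygous genotype call?
    af: float     # population allele frequency, e.g. from gnomAD


def hom_by_chance(af: float) -> float:
    """ASSUMPTION: chance that a called variant is homozygous under
    Hardy-Weinberg proportions: af^2 / (af^2 + 2*af*(1-af)) = af / (2 - af)."""
    af = min(max(af, 0.01), 1.0)  # hypothetical floor, avoids log10(0)
    return af / (2.0 - af)


def detect_rohs(markers: List[Marker], min_depth: int = 20,
                min_qual: float = 30.0, min_roh_q: float = 30.0,
                min_markers: int = 20, min_size: int = 20_000
                ) -> List[Tuple[int, int, int, float]]:
    # Step 1: drop markers that are likely false genotype calls
    good = [m for m in markers if m.depth >= min_depth and m.qual >= min_qual]

    # Step 2: collect raw stretches of consecutive homozygous markers
    stretches, current = [], []
    for m in good:
        if m.is_hom:
            current.append(m)
        elif current:
            stretches.append(current)
            current = []
    if current:
        stretches.append(current)

    rohs = []
    for s in stretches:
        # Step 3: Phred-scaled probability of seeing the stretch by chance
        q = -10.0 * sum(math.log10(hom_by_chance(m.af)) for m in s)
        size = s[-1].pos - s[0].pos + 1
        # Steps 4 and 6: Q score, marker count and size filters
        if q >= min_roh_q and len(s) >= min_markers and size >= min_size:
            rohs.append((s[0].pos, s[-1].pos, len(s), q))
    return rohs
```

Each returned tuple holds the start position, end position, marker count and Q score of one ROH. A heterozygous marker that fails the quality filter in step 1 never reaches step 2, so it cannot split a stretch.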

Example

The following image visualizes the algorithm and shows how it copes with a genotyping error (at the start of exon 2):

algorithm

Using external allele frequency sources

Instead of using VEP annotations as the source of allele frequency information, an external database of allele frequencies can be provided via the 'af_source' parameter.
We suggest using gnomAD version 3.1 or higher as the allele frequency database.

Pre-processing of the external database

It is important to normalize the allele frequency database (and the variant list) so that most variants can be annotated with allele frequency:

  1. Split multi-allelic variants into several rows, e.g. with VcfBreakMulti.
  2. Left-align InDels, e.g. with VcfLeftNormalize.
  3. Sort variants according to position, e.g. with VcfStreamSort.

Finally, the allele frequency database has to be compressed with bgzip and indexed with tabix.
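
To illustrate what the first and third normalization steps do to the data, here is a small Python sketch. It is not a replacement for the tools named above: `split_multiallelic` and `sort_by_position` are hypothetical helper names, INFO and sample columns are copied unchanged, and left-alignment is not covered because it needs the reference sequence.

```python
from typing import Iterator, List


def split_multiallelic(line: str) -> Iterator[str]:
    """Yield one VCF data line per ALT allele.

    Simplification: INFO and sample columns are copied unchanged.
    """
    fields = line.rstrip("\n").split("\t")
    for alt in fields[4].split(","):  # column 5 of a VCF data line is ALT
        yield "\t".join(fields[:4] + [alt] + fields[5:])


def sort_by_position(lines: List[str]) -> List[str]:
    """Sort data lines by chromosome name, then by numeric position."""
    def key(line: str):
        chrom, pos = line.split("\t")[:2]
        return (chrom, int(pos))
    return sorted(lines, key=key)


record = "chr1\t200\t.\tA\tC,T\t50\tPASS\t."
for row in split_multiallelic(record):
    print(row)
```

A real normalization also has to split per-allele INFO values such as AF, which this sketch does not do. Without this splitting, a lookup for the variant A>T at position 200 would not match the combined row A>C,T, so the variant would be left without an allele frequency.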

Run-time using an external database

Using an external allele frequency database increases the run-time of the tool, since all variants have to be looked up in the database.

Our benchmarks show the following run-time increase when using the gnomAD genome database:

  • Exome (60K variants) from 4.3s (annotated) to ~100s.
  • Genome (4.8M variants) from 3.3min (annotated) to ~90min.

Thus, for genomes it is preferable to use annotated variant lists if available.
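
The benchmark figures can be turned into a slowdown factor and an approximate extra cost per variant. The short Python calculation below uses only the numbers quoted above; since the external-database timings are given as approximate values, the results are rough estimates. The function names are chosen for this example.

```python
# (number of variants, annotated run-time in s, external-database run-time in s)
benchmarks = {
    "exome": (60_000, 4.3, 100.0),
    "genome": (4_800_000, 3.3 * 60, 90.0 * 60),
}


def slowdown(name: str) -> float:
    """Factor by which the run-time grows with an external database."""
    _, annotated, external = benchmarks[name]
    return external / annotated


def extra_ms_per_variant(name: str) -> float:
    """Additional run-time per variant in milliseconds."""
    variants, annotated, external = benchmarks[name]
    return (external - annotated) / variants * 1000.0


for name in benchmarks:
    print(f"{name}: {slowdown(name):.1f}x slower, "
          f"{extra_ms_per_variant(name):.1f} ms extra per variant")
```

Under these figures, both data sets are slower by a factor of roughly 23 to 27, which corresponds to an added cost of about 1 to 2 ms per variant.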

ROHs and consanguinity

Many large ROHs in a child can be an indicator for consanguinity of the parents.

This plot shows the ROH size sum of ROHs larger than 500kb for WGS (Illumina TruSeq DNA PCR-Free): rohs_wes

This plot shows the ROH size sum of ROHs larger than 500kb for WES (Agilent SureSelect Human All Exon V7): rohs_wes

This plot shows the ROH size sum of ROHs larger than 500kb for patients with different degrees of consanguinity:

roh_sum_consanguinity

The plots show that a ROH size sum larger than 75Mb is a good indicator for consanguinity of the parents.
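
The ROH size sum used in the plots can be computed with a few lines of Python. This is a minimal sketch: the input is assumed to be a list of (chromosome, start, end) tuples with 1-based inclusive coordinates, which is a format chosen for this example and not necessarily the RohHunter output format. The function names are hypothetical as well.

```python
from typing import Iterable, Tuple

MIN_ROH_SIZE = 500_000          # only ROHs larger than 500kb are summed
CONSANGUINITY_SUM = 75_000_000  # size sum threshold of 75Mb

Roh = Tuple[str, int, int]      # (chromosome, start, end), assumed format


def roh_size_sum(rohs: Iterable[Roh], min_size: int = MIN_ROH_SIZE) -> int:
    """Sum the sizes of all ROHs larger than min_size."""
    total = 0
    for _chrom, start, end in rohs:
        size = end - start + 1  # ASSUMPTION: 1-based inclusive coordinates
        if size > min_size:
            total += size
    return total


def suggests_consanguinity(rohs: Iterable[Roh]) -> bool:
    """True if the ROH size sum exceeds the 75Mb threshold."""
    return roh_size_sum(rohs) > CONSANGUINITY_SUM


rohs = [("chr1", 1, 60_000_000), ("chr2", 1, 20_000_000), ("chr3", 1, 400_000)]
print(roh_size_sum(rohs), suggests_consanguinity(rohs))
```

In this example the 400kb ROH on chr3 is ignored, and the remaining two ROHs add up to 80Mb, which is above the threshold. The result is an indicator only and should be interpreted together with the sequencing type, since the WGS and WES plots differ.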

Help and ChangeLog

The RohHunter command-line help and changelog can be found here.

back to ngs-bits