nick297/ADAPTIVEMASKING

 
 

ADAPTIVEMASKING

We describe tools that examine NGS-mapped output to identify regions with high minor variant frequencies, as described. The rationale is to identify regions where consensus basecalling may be unreliable. The process is 'adaptive' in the sense that it models variation arising during the entire laboratory, sequencing and mapping process, so it can identify regions of variation introduced at any stage. In our case, increased variation occurred in regions of homology between Mycobacteria and other bacterial species, and was detected when laboratory processes started to use DNA extracts derived from broth cultures rather than from pure cultures. The technique described here allowed us to identify and mask the problem regions.

Obtain software and test data
The software needed, and instructions on how to obtain test data, are described here.

Overview of the process followed

  1. Mapping and VCF generation
    Our approach is designed to operate on the output from multiple mappers, with different settings, and with different kinds of input data. The objective of the project is to quantify the variation associated with the totality of laboratory process, input DNA, sequencing, and mapping. Many pipelines will have already optimised mapping tools and settings; the process we describe operates on their output.
    As input it expects VCF/BCF files, which are typically generated by samtools/bcftools mpileup commands following mapping. In particular, it expects a tag in the VCF INFO section containing the high quality base counts at each position; these counts are the input to the algorithm. For more detail on this, see here.
    Please see example code illustrating mapping and VCF generation on a test dataset.
  2. Estimating the amount of extraneous bacterial DNA present
    One of the key findings from our paper is that mapping accuracy for some regions is determined by the amount of 'non-target' bacterial DNA present, in our case from species other than Mycobacteria. We estimated this using Kraken. Newer tools producing data in a similar format can also be used. Please see example code illustrating the use of Kraken, including the KrakenReportReader class from the KrakenReportReader module.
  3. Determining the minor variant frequencies for genomic regions
    We measured the minor variant frequencies (that is, the proportion of read depth accounted for by calls other than the most common nucleotide) across genomic regions. Please see here, which describes use of the regionScan_from_genbank class, which is in the vcfScan module.
  4. Modelling minor variant frequencies in reads mapped to genomic regions
    Subsequently, we fitted Poisson models, region-by-region, estimating the relationship between the minor variant call depth detected (a proxy for the amount of mixture) and an estimate of the amount of non-target bacterial DNA present. A Python class, AdaptiveMasking, in the AdaptiveMasking module, is provided to do this. Its use is described in detail.
  5. Depicting the results of the modelling
    Methods in the AdaptiveMasking class allow depiction of model output, as described.
  6. Masking based on the results
    What to do next
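
As a rough illustration of the quantity measured in step 3, the sketch below computes the minor variant depth and frequency from per-base high quality counts. It is not the regionScan_from_genbank implementation: the function names are hypothetical, the counts are assumed to have already been parsed from the VCF INFO tag into a dictionary, and pooling depths across positions is only one possible way to summarise a region.

```python
from typing import Dict, Iterable, Tuple


def minor_variant_stats(base_counts: Dict[str, int]) -> Tuple[int, int, float]:
    """Return (total depth, minor variant depth, minor variant frequency).

    The minor variant depth is the depth not accounted for by the most
    common nucleotide at the position.
    """
    total = sum(base_counts.values())
    if total == 0:
        # No high quality reads at this position.
        return 0, 0, 0.0
    minor = total - max(base_counts.values())
    return total, minor, minor / total


def region_minor_variant_frequency(positions: Iterable[Dict[str, int]]) -> float:
    """Pool depths over all positions in a region (an assumed summary)."""
    total_depth = 0
    minor_depth = 0
    for counts in positions:
        total, minor, _ = minor_variant_stats(counts)
        total_depth += total
        minor_depth += minor
    return minor_depth / total_depth if total_depth else 0.0


# Hypothetical counts at one position: 95 reads support A, 5 support other bases.
site = {"A": 95, "C": 3, "G": 2, "T": 0}
print(minor_variant_stats(site))
```

A position where all reads agree gives a frequency of 0; mixtures of similar sequences mapped to the same region push the value upwards.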
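
The modelling in step 4 can be sketched, for a single region, as a Poisson regression with a log link. This is a minimal illustration under stated assumptions, not the AdaptiveMasking class: the function `fit_poisson` is hypothetical, the model has one covariate (an estimate of non-target bacterial DNA per sample, for example a proportion of reads) and no depth offset, and the fit uses plain Newton-Raphson.

```python
import math
from typing import Sequence, Tuple


def fit_poisson(x: Sequence[float], y: Sequence[float],
                max_iter: int = 50, tol: float = 1e-10) -> Tuple[float, float]:
    """Fit log(E[y]) = b0 + b1 * x by Newton-Raphson; return (b0, b1)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mean_y = sum(y) / len(y)
    if mean_y <= 0:
        raise ValueError("y must contain at least one positive value")
    # Start from the intercept-only model.
    b0, b1 = math.log(mean_y), 0.0
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Information matrix (negative Hessian), 2 x 2 and symmetric.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        if det <= 0:
            raise ValueError("covariate has no variation; slope is not identifiable")
        # Newton step: solve the 2 x 2 system analytically.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if max(abs(d0), abs(d1)) < tol:
            break
    return b0, b1


# Hypothetical data for one region: one value per sample.
non_target_fraction = [0.0, 0.1, 0.2, 0.4, 0.8]
minor_variant_depth = [3, 4, 6, 9, 21]
intercept, slope = fit_poisson(non_target_fraction, minor_variant_depth)
print(f"intercept={intercept:.3f} slope={slope:.3f}")
```

A clearly positive slope for a region would suggest that its minor variant depth rises with the amount of non-target DNA, which is the kind of region the process aims to flag for masking. In practice an established statistics library would be a more robust choice than this hand-written fit.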

About

Examine NGS mapped output in VCF/BCF files to identify regions of high variation
