Genome Assembly Validation via Inter-SUNK distances in nanopore reads
Setup source files in config.yaml and ont.tsv
config.yaml requires:
- a manifest of ONT data for validation "ont.tsv",
- and the kmer-length to use for validation "SUNK_len" (default: 20)
optionally, for more detailed output plots (uses more RAM, experimental), set plot_detailed: True
in config.yaml
The ONT manifest is a .tsv file "ont.tsv" with the following columns:
- sample: must be unique for each file
- hap[1,2]_ONT : location of ONT reads (gzipped or raw .fa or .fq)
- hap[1,2]_asm : location of haplotype assembly (FASTA, must also have .fai index in same directory)
- hap[1,2]_bed : BED format regions of assembly to visualize (optional)
- hap[1,2]_colotrack : BED format color track to include in visualizations (optional, no headers, columns Chrom Start End Color)
I recommend a single input file for each haplotype, and to haplotype phase with canu and parental illumina (not currently incorporated into this pipeline). Hi-C phasing of ONT is also possible: see the "hic" branch.
To run snakefile locally on the provided test cases (AMY locus of CHM13/1 pseudodiploid and HG02723), generating optional image outputs, execute:
git clone https://github.com/pdishuck/GAVISUNK
cd GAVISUNK
snakemake -R --use-conda --cores 8 --configfile .test/config.yaml --resources load=1000
.BED results are found in the results/[sample]/final_outs
directory, along with .tsv files summarizing the probability of each validation gap having been spanned by ONT reads, given the input library
Automated visualizations of validation gaps are found in the results/pngs/gaps/[sample]
directory
For the included test cases, the following output should be generated: results/gaps/AMY_HG02723/AMY_HG02723_hap1_AMY_h1_84861_524275.png
, corresponding to the region displayed in Figure 1 of the manuscript:
alias snakesub='mkdir -p log; snakemake --ri --jobname "{rulename}.{jobid}" --drmaa " -V -cwd -j y -o ./log -e ./log -l h_rt=48:00:00 -l mfree={resources.mem}G -pe serial {threads} -w n -S /bin/bash" -w 60'
snakesub --use-conda -j150 > snakelog 2>&1 &
To get snakemake conda envs to work correctly, you may need to deactivate your local conda env ($PATH issues as in snakemake/snakemake#883) GAVISUNK is written to execute from its top-level directory. Old versions of dependencies that are in your $PATH by default may interfere with GAVISUNK operation. See dependencies in workflow/envs/viz.yaml
Philip C Dishuck, Allison N Rozanski, Glennis A Logsdon, David Porubsky, Evan E Eichler, GAVISUNK: genome assembly validation via inter-SUNK distances in Oxford Nanopore reads, Bioinformatics, Volume 39, Issue 1, January 2023, btac714, https://doi.org/10.1093/bioinformatics/btac714