correct

Michael Alonge edited this page Nov 1, 2021 · 11 revisions

RagTag Version: v2.1.0

RagTag offers a correction module that uses a reference genome to identify and correct potential misassemblies in a query assembly. RagTag also provides the option to verify putative misassemblies by aligning reads (from the same genotype) to the query assembly and observing read coverage near misassembly break points. In all cases, sequence is never added or subtracted. Query sequences are only broken at points of putative misassembly.

Usage

usage: ragtag.py correct <reference.fa> <query.fa>

Homology-based misassembly correction: Correct sequences in 'query.fa' by comparing them to sequences in 'reference.fa'

positional arguments:
  <reference.fa>        reference fasta file (uncompressed or bgzipped)
  <query.fa>            query fasta file (uncompressed or bgzipped)

optional arguments:
  -h, --help            show this help message and exit

correction options:
  -f INT                minimum unique alignment length [1000]
  --remove-small        remove unique alignments shorter than -f
  -q INT                minimum mapq (NA for Nucmer alignments) [10]
  -d INT                maximum alignment merge distance [100000]
  -b INT                minimum break distance from contig ends [5000]
  -e <exclude.txt>      list of reference headers to ignore [null]
  -j <skip.txt>         list of query headers to leave uncorrected [null]
  --inter               only break misassemblies between reference sequences
  --intra               only break misassemblies within reference sequences
  --gff <features.gff>  don't break sequences within gff intervals [null]

input/output options:
  -o PATH               output directory [./ragtag_output]
  -w                    overwrite intermediate files
  -u                    add suffix to unaltered sequence headers

mapping options:
  -t INT                number of minimap2/unimap threads [1]
  --aligner PATH        whole genome aligner executable ('nucmer', 'unimap' or 'minimap2') [minimap2]
  --mm2-params STR      space delimited minimap2 whole genome alignment parameters (overrides '-t') ['-x asm5']
  --unimap-params STR   space delimited unimap parameters (overrides '-t') ['-x asm5']
  --nucmer-params STR   space delimited nucmer whole genome alignment parameters ['--maxmatch -l 100 -c 500']

validation options:
  --read-aligner PATH   read aligner executable (only 'minimap2' is allowed) [minimap2]
  -R <reads.fasta>      validation reads (uncompressed or gzipped) [null]
  -F <reads.fofn>       same as '-R', but a list of files [null]
  -T STR                read type. 'sr', 'ont' and 'corr' accepted for Illumina, nanopore and error corrected long-reads, respectively [null]
  -v INT                coverage validation window size [10000]
  --max-cov INT         break sequences at regions at or above this coverage level [AUTO]
  --min-cov INT         break sequences at regions at or below this coverage level [AUTO]   

correction options

RagTag 'correct' breaks sequences in <query.fa> where they map discordantly to <reference.fa>. Both files can be uncompressed or bgzipped. Use -e to provide a single-column file listing any reference.fa headers that should be ignored (e.g. chr0/chrUn or alt contigs). Similarly, use -j to provide a single-column file listing any query.fa headers that should not be broken. If an alignment is not entirely unique, at least -f bp of it must be unique for it to be considered for correction. By default, entirely unique alignments are considered regardless of their length; setting --remove-small changes this so that only alignments at least -f bp long are considered. -q sets the minimum Minimap2/Unimap mapq score for alignments (it does not apply to Nucmer alignments). For each query sequence, syntenic alignments within -d bp of each other are merged into longer alignments. Breaks are not made within -b bp of query sequence termini.
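To make the -d merging step concrete, here is a minimal Python sketch. It is not RagTag's implementation: it assumes the alignments for one query sequence are already known to be syntenic, are given as (start, end) query intervals sorted by start, and that "within -d bp" means the gap between the end of one alignment and the start of the next is at most d. The function name merge_within is hypothetical.

```python
def merge_within(intervals, d=100000):
    """Merge sorted (start, end) intervals whose gap is at most d bp."""
    merged = []
    for start, end in intervals:
        if merged and start - merged[-1][1] <= d:
            # Close enough to the previous interval: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            # Too far away (or first interval): start a new merged block.
            merged.append((start, end))
    return merged


# Hypothetical alignments: the first two are 20 kbp apart, the third is 310 kbp away.
alignments = [(0, 40_000), (60_000, 90_000), (400_000, 450_000)]
print(merge_within(alignments, d=100_000))
```

With the default -d of 100000, the first two intervals collapse into one and the third stays separate.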

One can also direct RagTag to only break misassemblies between (--inter, query maps to >1 reference sequence) or within (--intra, query maps discordantly to 1 reference sequence) reference sequences. If one has annotations associated with the query assembly, provide them with the --gff option to ensure that the query assembly is never broken within annotation intervals. The GFF coordinates can then be updated with respect to the new broken assembly using updategff.
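The two positional filters described above (-b and --gff) can be sketched as a single check. This is an illustration under stated assumptions rather than RagTag's own logic: positions are treated as 1-based, features are (start, end) pairs taken from GFF columns 4 and 5 for the relevant query sequence, and the function name break_allowed is hypothetical.

```python
def break_allowed(pos, seq_len, features, b=5000):
    """Return True if a break at 1-based position pos passes both filters."""
    # Too close to either end of the query sequence (-b).
    if pos < b or seq_len - pos < b:
        return False
    # Inside an annotated interval (--gff); GFF coordinates are 1-based, inclusive.
    return not any(start <= pos <= end for start, end in features)


# Hypothetical gene intervals on a 100 kbp query sequence.
genes = [(20_000, 25_000), (70_000, 82_000)]
for pos in (3_000, 22_000, 50_000):
    print(pos, break_allowed(pos, seq_len=100_000, features=genes))
```

Here the first candidate is rejected for being near the sequence end, the second for falling inside a feature, and the third is allowed.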

input/output options

By default, RagTag places all output and intermediate files in a directory named ragtag_output, but this can be changed with -o. RagTag will not overwrite intermediate files that already exist in the output directory, which avoids recomputing expensive alignment files. Users can set -w to overwrite any preexisting files.

Use the -u option to add the "_RagTag" suffix to each sequence in the output, even uncorrected query sequences that have not changed. This ensures AGP compatibility with some external programs/databases. If one wants uncorrected query sequences to retain their original header, do not use -u.

mapping options

Use -t to set the number of threads Minimap2 or Unimap uses for mapping (overridden by --mm2-params and --unimap-params). This option does not apply to Nucmer alignments. If the aligner executable is not in one's PATH, or one would like to use Nucmer or Unimap instead of Minimap2, use the --aligner option to specify the PATH of the appropriate aligner executable. The --mm2-params, --unimap-params, and --nucmer-params options allow one to specify custom alignment parameters for Minimap2, Unimap, and Nucmer, respectively.
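As an illustration of how these options combine, the following Python sketch assembles a ragtag.py correct command line without running it. The helper name correct_command, the file names, and the aligner path are placeholders; only flags listed in the usage text above are used, and the choice of parameter flag from the executable's base name is an assumption of this sketch.

```python
import os
import shlex


def correct_command(reference, query, aligner="minimap2", threads=1, params=None):
    """Build an argument list for 'ragtag.py correct' (not executed here)."""
    name = os.path.basename(aligner)
    cmd = ["ragtag.py", "correct", "--aligner", aligner]
    if name == "nucmer":
        flag = "--nucmer-params"  # -t does not apply to Nucmer alignments
    else:
        flag = "--unimap-params" if name == "unimap" else "--mm2-params"
        cmd += ["-t", str(threads)]  # overridden if custom parameters are given
    if params is not None:
        cmd += [flag, params]
    return cmd + [reference, query]


cmd = correct_command(
    "reference.fa",
    "query.fa",
    aligner="/opt/mummer/bin/nucmer",  # placeholder path
    params="--maxmatch -l 100 -c 500",
)
print(shlex.join(cmd))
```

Passing the list to subprocess.run would launch the command; shlex.join is used here only to show a copy-pasteable shell line with the parameter string quoted as a single argument.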

validation options

Use these validation options to verify putative misassemblies by querying read coverage near misassembly break points. Without validation, the module will break at any point of reference discordance as defined by the "correction options". With validation, RagTag maps reads to the query assembly and verifies putative break points if they are near regions of exceptionally low or high coverage. The reads (-R/-F) used for validation should come from the same genotype as the query assembly to ensure that coverage abnormalities don't arise from true biological variation. RagTag correction accepts only short reads, Oxford Nanopore (ONT) long reads, or error-corrected long reads such as PacBio CCS; set the read type with -T.

One can adjust the sensitivity of misassembly validation to reduce false positives. -v specifies the window around the putative misassembly break point that RagTag examines for exceptionally low or high read coverage. The larger this window size, the more likely it is to find an unrelated coverage abnormality. One can also define the low and high coverage thresholds with --min-cov and --max-cov, respectively.
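The effect of the window size can be illustrated with a small Python sketch. It is a simplification, not RagTag's validation code: it assumes per-base coverage is available as a list, that the window is centered on the break point, and that the thresholds are fixed numbers (RagTag chooses them automatically unless --min-cov and --max-cov are set). The function name is_validated and the threshold values are hypothetical.

```python
def is_validated(coverage, break_pos, window=10000, min_cov=5, max_cov=100):
    """True if any base in the window has exceptionally low or high coverage."""
    half = window // 2
    lo = max(0, break_pos - half)
    hi = min(len(coverage), break_pos + half)
    return any(c <= min_cov or c >= max_cov for c in coverage[lo:hi])


# Hypothetical 1 kbp sequence at 30x coverage with a dip near position 500.
coverage = [30] * 1000
coverage[480:520] = [2] * 40

print(is_validated(coverage, break_pos=500, window=100))  # break point at the dip
print(is_validated(coverage, break_pos=200, window=100))  # break point far from the dip
print(is_validated(coverage, break_pos=200, window=700))  # same point, larger window
```

The third call shows the trade-off described above: a larger window reaches the unrelated dip and validates a break point that the smaller window rejects.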

RagTag can only use minimap2 for read alignment. If you don't have the minimap2 executable in your PATH, you can specify the path with --read-aligner.

Output

All output is in ragtag_output, or whichever directory -o specifies.

ragtag.correct.fasta

The corrected query assembly in FASTA format.

ragtag.correct.agp

The AGP file defining the exact coordinates of query sequence breaks.

The "object" AGP field represents the original query sequences, while the "component" AGP field represents the broken query subsequences. If a query sequence was not broken, it will be represented as a single AGP line where the object and component share the same original sequence header. Some programs/databases do not accept identical component and object headers, so use the -u option to make the object header distinct from the component header (this is also reflected in the FASTA file), even though they represent the same sequence.

If --gff was used during correction, use this AGP file to update the GFF coordinates to refer to the new broken query assembly.
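As a sketch of how the AGP file might be inspected, the following Python snippet lists which original query sequences were broken, assuming the standard nine-column, tab-delimited AGP layout. The sample lines and component names are hypothetical and do not reproduce RagTag's actual naming; broken_objects is an illustrative helper, not part of RagTag.

```python
from collections import defaultdict


def broken_objects(agp_text):
    """Map each object that has more than one sequence component to its parts."""
    parts = defaultdict(list)
    for line in agp_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        obj, comp_type = fields[0], fields[4]
        if comp_type == "W":  # a sequence component rather than a gap
            # Columns 2-3 hold the object start/end (1-based, inclusive).
            parts[obj].append((int(fields[1]), int(fields[2]), fields[5]))
    return {obj: comps for obj, comps in parts.items() if len(comps) > 1}


# Hypothetical AGP content: ctg1 was broken once, ctg2 was left intact.
sample = "\n".join([
    "## agp-version 2.1",
    "ctg1\t1\t60000\t1\tW\tctg1_part1\t1\t60000\t+",
    "ctg1\t60001\t100000\t2\tW\tctg1_part2\t1\t40000\t+",
    "ctg2\t1\t30000\t1\tW\tctg2\t1\t30000\t+",
])
print(broken_objects(sample))
```

To use it on real output, read ragtag.correct.agp from the output directory and pass its contents to the function.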

Misassemblies vs. True Variation

Reference-guided misassembly signatures are sometimes caused by true biological structural variation if the reference and query assemblies represent distinct genotypes (or haplotypes). The read validation feature should help to avoid some of these misassembly false positives, and the validation sensitivity can be tuned with command line parameters. However, it is ultimately up to the discretion of the user to decide if misassembly correction is appropriate. One should validate all RagTag results with independent data (usually physical, optical, or genetic maps), when possible.