Michael Alonge edited this page Oct 13, 2022 · 11 revisions

RagTag Version: v2.1.0

RagTag 'patch' uses one genome assembly to "patch" another genome assembly. We define two types of patches, fills and joins:

  • Fills are patches that fill assembly gaps. This process is like traditional gap-filling, though it uses an assembly instead of whole-genome sequencing (WGS) reads.

  • Joins are patches that join distinct contigs. This is essentially scaffolding and gap-filling in a single step.

Usage

usage: ragtag.py patch <target.fa> <query.fa>

Homology-based assembly patching: Make continuous joins and fill gaps in 'target.fa' using sequences from 'query.fa'

positional arguments:
  <target.fa>          target fasta file (uncompressed or bgzipped)
  <query.fa>           query fasta file (uncompressed or bgzipped)

optional arguments:
  -h, --help           show this help message and exit

patching:
  -e <exclude.txt>     list of target sequences to ignore [null]
  -j <skip.txt>        list of query sequences to ignore [null]
  -f INT               minimum unique alignment length [1000]
  --remove-small       remove unique alignments shorter than '-f'
  -q INT               minimum mapq (NA for Nucmer alignments) [10]
  -d INT               maximum alignment merge distance [100000]
  -s INT               minimum merged alignment length [50000]
  -i FLOAT             maximum merged alignment distance from sequence terminus. fraction of the sequence length if < 1 [0.05]
  --fill-only          only fill existing target gaps. do not join target sequences
  --join-only          only join and patch target sequences. do not fill existing gaps

input/output options:
  -o PATH              output directory [./ragtag_output]
  -w                   overwrite intermediate files
  -u                   add suffix to unplaced sequence headers

mapping options:
  -t INT               number of minimap2/unimap threads [1]
  --aligner PATH       aligner executable ('nucmer' (recommended), 'unimap' or 'minimap2') [nucmer]
  --mm2-params STR     space delimited minimap2 parameters (overrides '-t') ['-x asm5']
  --unimap-params STR  space delimited unimap parameters (overrides '-t') ['-x asm5']
  --nucmer-params STR  space delimited nucmer parameters ['--maxmatch -l 100 -c 500']

patching options

RagTag 'patch' makes patches in <target.fa> using sequences from <query.fa>. These files can be uncompressed or bgzipped. Use -e to provide a single-column file listing any <target.fa> sequences that should be ignored during patching (e.g. chr0/chrUn or alt contigs). Similarly, use -j to provide a single-column file listing any <query.fa> sequences that should not be used for patching. If an alignment is not entirely unique, at least -f bp of the alignment must be unique for it to be considered. By default, entirely unique alignments are considered regardless of their length, but this can be disabled with --remove-small. Doing so ensures that only alignments at least -f bp in length are considered. -q sets the minimum Minimap2/Unimap mapq score for alignments; it does not apply to Nucmer alignments. For each query sequence, syntenic alignments within -d bp of each other are merged into longer alignments. After merging, alignments less than -s bp long are removed. Alignments must be within -i bp of a target sequence terminus or gap to be considered for patching; if -i is less than 1, it is interpreted as a fraction of the sequence length. With --fill-only invoked, RagTag will only fill gaps, and with --join-only invoked, RagTag will only make joins.
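The merge-then-filter step controlled by -d and -s can be illustrated with a minimal Python sketch. This is not RagTag's implementation: it assumes alignments are reduced to intervals on a single coordinate axis, whereas RagTag works with syntenic alignments between query and target. The function name `merge_and_filter` and the interval representation are hypothetical; only the default values (100000 and 50000) come from the usage text above.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end), 0-based, half-open.
Interval = Tuple[int, int]


def merge_and_filter(alns: List[Interval],
                     max_merge_dist: int = 100000,   # plays the role of -d
                     min_merged_len: int = 50000     # plays the role of -s
                     ) -> List[Interval]:
    """Merge intervals separated by at most max_merge_dist, then drop short ones."""
    merged: List[Interval] = []
    for start, end in sorted(alns):
        if merged and start - merged[-1][1] <= max_merge_dist:
            # Close enough to the previous interval: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    # Keep only merged intervals that are at least min_merged_len long.
    return [(s, e) for s, e in merged if e - s >= min_merged_len]


alns = [(0, 30000), (40000, 70000), (500000, 510000)]
print(merge_and_filter(alns))  # [(0, 70000)]
```

In this toy input, the first two intervals are 10 kbp apart and merge into one 70 kbp interval, while the third is too far away to merge and too short to survive the length filter.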

input/output options

By default, RagTag places all of the output and intermediate files in a directory named ragtag_output, but this can be changed with -o. RagTag will not overwrite intermediate files that already exist in the output directory, which saves time by not regenerating expensive alignment files. Users can set -w to overwrite any preexisting files.

Use the -u option to add the "_RagTag" suffix to each sequence in the scaffold output, even unplaced query sequences that have not changed. This ensures AGP compatibility with some external programs/databases. If one wants unplaced query sequences to retain their original header, do not use -u.

mapping options

Use -t to set the number of threads Minimap2 or Unimap uses for mapping (overridden by --mm2-params and --unimap-params). This option does not apply to Nucmer alignments. Use the --aligner option to specify the PATH of the appropriate aligner executable (Nucmer is default and recommended). The --mm2-params, --unimap-params, and --nucmer-params options allow one to specify custom alignment parameters for Minimap2, Unimap, and Nucmer, respectively.
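Because --mm2-params overrides -t, a thread count has to be passed inside the minimap2 parameter string whenever custom parameters are used. The following Python sketch assembles such a command line using only flags listed in the usage text. The helper name `build_patch_command` is hypothetical, and the sketch assumes that `ragtag.py` and `minimap2` are on the PATH; `-t` inside the parameter string is minimap2's own thread option.

```python
import shlex
from typing import List, Optional


def build_patch_command(target: str, query: str,
                        out_dir: str = "ragtag_output",
                        threads: int = 1,
                        mm2_params: Optional[str] = None) -> List[str]:
    """Build an argument list for 'ragtag.py patch' using minimap2 as the aligner."""
    cmd = ["ragtag.py", "patch", target, query,
           "-o", out_dir, "--aligner", "minimap2"]
    if mm2_params is None:
        # No custom parameters: RagTag's own -t is honored.
        cmd += ["-t", str(threads)]
    else:
        # Custom parameters override -t, so carry the thread count
        # inside the minimap2 parameter string instead.
        cmd += ["--mm2-params", f"{mm2_params} -t {threads}"]
    return cmd


cmd = build_patch_command("target.fa", "query.fa", threads=8, mm2_params="-x asm5")
print(shlex.join(cmd))  # shell-quoted form, suitable for copy and paste
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch the run.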

Output

File                       Description
ragtag.patch.agp           The final AGP file defining how ragtag.patch.fasta is built
ragtag.patch.asm.*         Assembly alignment files
ragtag.patch.comps.fasta   The split target assembly and the renamed query assembly combined into one FASTA file. This file contains all components in ragtag.patch.agp
ragtag.patch.ctg.agp       An AGP file defining how the target assembly was split at gaps
ragtag.patch.ctg.fasta     The target assembly split at gaps
ragtag.patch.err           Standard error logging for all external RagTag commands
ragtag.patch.fasta         The final FASTA file containing the patched assembly
ragtag.patch.rename.agp    An AGP file defining the new names for query sequences
ragtag.patch.rename.fasta  A FASTA file with the original query sequences, but with new names
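
As a quick way to inspect an AGP file such as ragtag.patch.agp, the Python sketch below counts sequence components and gaps per object. It relies only on the standard AGP layout (tab-delimited, object name in column 1, component type in column 5, with types N and U denoting gaps). The function name `summarize_agp` and the sample records are invented for illustration and are not actual RagTag output.

```python
from collections import Counter
from typing import Dict, Iterable


def summarize_agp(lines: Iterable[str]) -> Dict[str, Counter]:
    """Count 'sequence' and 'gap' lines for each object in AGP text."""
    summary: Dict[str, Counter] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        obj, comp_type = fields[0], fields[4]
        kind = "gap" if comp_type in ("N", "U") else "sequence"
        summary.setdefault(obj, Counter())[kind] += 1
    return summary


# Invented example records (names and coordinates are hypothetical).
sample = [
    "##agp-version 2.1",
    "scf1\t1\t1000\t1\tW\tctgA\t1\t1000\t+",
    "scf1\t1001\t1100\t2\tN\t100\tscaffold\tyes\talign_genus",
    "scf1\t1101\t1600\t3\tW\tctgB\t1\t500\t-",
]
print(dict(summarize_agp(sample)["scf1"]))  # {'sequence': 2, 'gap': 1}
```

To summarize a real run, pass an open file handle instead of the sample list, e.g. `summarize_agp(open("ragtag_output/ragtag.patch.agp"))`.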

Using complete "query" assemblies

Highly complete telomere-to-telomere (T2T) or near-T2T assemblies are becoming more common. To use these (query) assemblies to patch other (target) assemblies, one must account for the large, high-similarity repeats they often contain.

RagTag "patch" identifies potential patches by finding unique alignments between one query contig and at least two target contigs. Large repeats can cause large gaps in these unique alignments, thus disqualifying any potential patches suggested by that query sequence. We recommend two techniques that can help mitigate these false negatives.

Adjusting -i

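The -i parameter controls the maximum alignment break length, that is, the maximum distance between a merged alignment and a target sequence terminus or gap (interpreted as a fraction of the sequence length if less than 1; default 0.05). By increasing the -i parameter, RagTag can tolerate longer breaks in unique alignments. While this can improve recall, relaxing this parameter may reduce precision.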
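
To get a feel for what a given -i value means in base pairs, the Python sketch below applies the rule from the usage text: values below 1 are treated as a fraction of the sequence length, and other values as an absolute distance. The function name `terminus_distance_limit` is hypothetical, and rounding to the nearest integer is an assumption; RagTag's exact arithmetic may differ.

```python
def terminus_distance_limit(i_value: float, seq_len: int) -> int:
    """Translate a -i value into a distance in bp for a sequence of seq_len bp."""
    if i_value < 1:
        # Fractional values scale with the sequence length (rounding is assumed).
        return round(i_value * seq_len)
    # Values of 1 or more are taken as an absolute distance in bp.
    return int(i_value)


print(terminus_distance_limit(0.05, 2_000_000))  # 100000
print(terminus_distance_limit(0.25, 2_000_000))  # 500000
print(terminus_distance_limit(5000, 2_000_000))  # 5000
```

With the default of 0.05, a 2 Mbp sequence tolerates up to 100 kbp between a merged alignment and the sequence terminus; raising -i to 0.25 widens that to 500 kbp.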

Breaking the query assembly

One can break the query assembly at large repeats, excluding the repeats. This will necessarily eliminate associated long stretches of non-unique alignments, thus improving recall.
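
A minimal Python sketch of this breaking step follows. It assumes the repeat coordinates are already known from some external annotation and are given as 0-based, half-open intervals. The function name `break_at_repeats` and the `_partN` naming scheme are hypothetical and not part of RagTag.

```python
from typing import List, Tuple


def break_at_repeats(name: str, seq: str,
                     repeats: List[Tuple[int, int]],
                     min_len: int = 1) -> List[Tuple[str, str]]:
    """Split seq at the given repeat intervals and discard the repeats themselves."""
    pieces: List[str] = []
    pos = 0
    for start, end in sorted(repeats):
        if start > pos:
            pieces.append(seq[pos:start])  # sequence before this repeat
        pos = max(pos, end)                # skip the repeat (handles overlaps)
    if pos < len(seq):
        pieces.append(seq[pos:])           # trailing sequence after the last repeat
    kept = [p for p in pieces if len(p) >= min_len]
    return [(f"{name}_part{i}", p) for i, p in enumerate(kept, start=1)]


records = break_at_repeats("chr1", "AAAACCCCGGGGTTTT", [(4, 8), (12, 16)])
for header, sequence in records:
    print(f">{header}\n{sequence}")
```

In this toy example, the repeats `CCCC` and `TTTT` are removed and the remaining pieces `AAAA` and `GGGG` become separate query sequences. The broken query would then be written to a FASTA file and passed to RagTag as <query.fa>.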