KNOT: Knowledge Network Overlap exTraction is a tool for the investigation of fragmented long read assemblies.
KNOT is described in the article "Graph analysis of fragmented long-read bacterial genome assemblies", accepted in Bioinformatics (preprint version).
Given an assembly and a set of reads, KNOT outputs an information-rich contig graph in CSV format that describes the adjacencies between contigs.
You can find a demo dataset, and instructions for using it, in the demo folder of this repository.
- long reads (corrected or not) in FASTA format (FASTQ is not supported)
- contigs (in FASTA) produced by an assembler
- (optional) assembly graph (in GFA1) produced by the same assembler
KNOT outputs an Augmented Assembly Graph (AAG). The AAG is a directed graph where nodes are contigs. An edge is present if two contigs overlap, or if there exists a path between the extremities of both contigs in the original string graph of the reads.
The AAG is written to {output prefix}_AAG.csv. (Note: the GFA files also produced by KNOT are not the AAG.)
We recommend that you use the HTML report to look at the AAG first (see below on how to generate that report) but the raw CSV file can also be parsed directly.
The AAG is written as a CSV file with 8 columns:
- tig1: contig name and extremity, in the format {tig name}_{extremity}, e.g. tig00000001_begin
- read1: id of the read used to search for paths from the tig1 extremity
- tig2: name and extremity of the other contig
- read2: id of the read used to search for paths from the tig2 extremity
- nb_read: number of reads in the path between read1 and read2 (inclusive)
- nb_base: number of bases in the path between read1 and read2
- paths: ids of the reads in the path found between read1 and read2, separated by ;
- nbread_contig: number of reads assigned to each contig, in the format {tig name}:{number of reads in the path assigned to the contig}/{number of reads in the contig}, separated by ;. not_assign is used for reads that are not assigned to any contig.
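As an illustration of the composite fields described above, the following Python sketch splits them into usable values. The helper names (split_extremity, parse_paths, parse_nbread_contig) are hypothetical and not part of KNOT, and the handling of entries that do not match the {tig name}:{n}/{m} pattern is an assumption.

```python
def split_extremity(field):
    """Split '{tig name}_{extremity}' into (tig name, extremity)."""
    # rpartition splits on the last underscore, so contig names that
    # themselves contain '_' are kept whole.
    name, _, extremity = field.rpartition("_")
    return name, extremity


def parse_paths(field):
    """Return the read ids of a ';'-separated paths field as a list."""
    return [read for read in field.split(";") if read]


def parse_nbread_contig(field):
    """Parse ';'-separated '{tig name}:{assigned}/{total}' entries.

    Returns {tig name: (reads of the path assigned to it, reads in the contig)}.
    Assumption: entries that do not match this pattern are kept under
    their raw text with a value of None.
    """
    result = {}
    for entry in field.split(";"):
        if not entry:
            continue
        name, _, counts = entry.rpartition(":")
        assigned, _, total = counts.partition("/")
        if name and assigned.isdigit() and total.isdigit():
            result[name] = (int(assigned), int(total))
        else:
            result[entry] = None
    return result


print(split_extremity("tig00000001_begin"))
```

Splitting on the last underscore is a deliberate choice here: it only relies on the extremity being the final component of the field.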
This output can be used to manually investigate the result of an assembly. Short paths between contigs are likely true adjacencies. Long paths are likely repeat-induced.
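For example, candidate adjacencies can be ranked by path length so that the shortest paths are inspected first. The sketch below assumes that the CSV is comma-separated and starts with a header line naming the 8 columns listed above; please check this against your own file. The function name and the sample rows are made up for illustration.

```python
import csv
import io


def rank_adjacencies(handle):
    """Return the AAG rows sorted by increasing path length in bases."""
    rows = list(csv.DictReader(handle))
    return sorted(rows, key=lambda row: int(row["nb_base"]))


# Made-up sample in the assumed layout (header line, comma separator).
sample = io.StringIO(
    "tig1,read1,tig2,read2,nb_read,nb_base,paths,nbread_contig\n"
    "tig1_end,r1,tig2_begin,r9,4,52000,r1;r4;r7;r9,tig1:1/100;tig2:1/80\n"
    "tig1_begin,r2,tig3_end,r3,2,9000,r2;r3,tig1:1/100;tig3:1/60\n"
)

for row in rank_adjacencies(sample):
    print(row["tig1"], row["tig2"], row["nb_base"])
```

With a real file, replace the sample with `open("{output prefix}_AAG.csv", newline="")`.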
More information about the other files generated by knot is available in the output description.
Assume that:
- long reads are stored in raw_reads.fasta
- contigs are stored in contigs.fasta
- (optional) contig graph is stored in contigs.gfa
Then run KNOT as:
knot -r raw_reads.fasta -c contigs.fasta [-g contigs.gfa] -o {output prefix} -- -j {number of parallel jobs available} [any snakemake parameter you like]
knot runs a snakemake pipeline and produces {output prefix}_AAG.csv (see the output section for more details) and a directory {output prefix}_knot where intermediate files are stored.
You can use corrected long reads in place of raw_reads with the -C option.
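The invocation above can also be assembled from a script. The following Python sketch builds the argument list using only options that appear in the full command line usage (-c, -o, -r, -C, -g, and snakemake's -j after --). The function build_knot_command and the prefix my_run are hypothetical; nothing is executed unless you pass the list to subprocess.run yourself.

```python
import shlex


def build_knot_command(contigs, output_prefix, raw_reads=None,
                       corrected_reads=None, contigs_graph=None, jobs=None):
    """Build the argv list for a knot run; exactly one read file is required."""
    if (raw_reads is None) == (corrected_reads is None):
        raise ValueError("give exactly one of raw_reads or corrected_reads")
    cmd = ["knot", "-c", contigs, "-o", output_prefix]
    if raw_reads is not None:
        cmd += ["-r", raw_reads]
    else:
        cmd += ["-C", corrected_reads]
    if contigs_graph is not None:
        cmd += ["-g", contigs_graph]
    if jobs is not None:
        # Everything after "--" is forwarded to snakemake.
        cmd += ["--", "-j", str(jobs)]
    return cmd


cmd = build_knot_command("contigs.fasta", "my_run",
                         raw_reads="raw_reads.fasta", jobs=4)
print(" ".join(shlex.quote(arg) for arg in cmd))
# To launch the run: subprocess.run(cmd, check=True)
```

Building a list rather than a single string avoids quoting problems when file names contain spaces.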
Full command line usage:
usage: KNOT [-h] -c CONTIGS [-g CONTIGS_GRAPH]
(-r RAW_READS | -C CORRECT_READS) -o OUTPUT
[--search-mode {base,node}]
[--contig-min-length CONTIG_MIN_LENGTH] [--read-type {pb,ont}]
[--help-all]
optional arguments:
-h, --help show this help message and exit
-c CONTIGS, --contigs CONTIGS
fasta file than contains contigs
-g CONTIGS_GRAPH, --contigs_graph CONTIGS_GRAPH
contigs graph
-r RAW_READS, --raw-reads RAW_READS
read used for assembly
-C CORRECT_READS, --correct-reads CORRECT_READS
read used for assembly
-o OUTPUT, --output OUTPUT
output prefix
--search-mode {base,node}
what path search optimize, number of base or number of
node
--contig-min-length CONTIG_MIN_LENGTH
contig with size lower this parameter are ignored
--read-type {pb,ont} type of input read, default pb
--help-all show knot help and snakemake help
In addition, snakemake parameters can be added after --.
You can generate an HTML report, knot_report.html, from the results that knot produced previously with this command:
knot.analysis -i {output prefix given to knot previously} -c -p -o knot_report.html
If -c is present, knot.analysis runs a path classification based on path length and composition (see the manuscript for more details).
If -p is present, knot.analysis runs a Hamiltonian path search (see the manuscript for more details).
Recommended solution (1 command, 2 minutes)
If the bioconda channel is set up, you only need to run this command:
conda install knot
Alternatively, create a dedicated conda environment from the provided conda_env.yml:
wget https://raw.githubusercontent.com/natir/knot/v1.3/conda_env.yml
GIT_LFS_SKIP_SMUDGE=1 conda env create -f conda_env.yml
Activate the environment:
conda activate knot_env
Deactivate the environment:
conda deactivate
Requirements:
- python >= 3.6
- snakemake >= 5.3
- yacrd >= 0.6 (available in bioconda or cargo)
- fpa >= 0.5 (available in bioconda or cargo)
- minimap2 (available in bioconda)
Instructions:
GIT_LFS_SKIP_SMUDGE=1 pip3 install git+https://github.com/natir/knot.git
The recommended way to update this tool is to remove the conda environment and reinstall it:
conda deactivate
conda env remove -n knot_env
wget https://github.com/natir/knot/raw/master/conda_env.yml
conda env create -f conda_env.yml
pip3 install --upgrade git+https://github.com/natir/knot.git
This tool has mainly been tested on bacterial genomes, where it takes about 30 minutes to run in most cases. In principle it should also run on larger genomes, but we expect that the resulting augmented assembly graphs will need to be parsed automatically, as visualizing them will be more challenging.
Legend:
- input: #2D882D
- minimap2: #AA3939
- fpa: #AA7539
- yacrd: #27556C
- output: #5D2971
- pipeline internal tool: #FFD300
If you run knot with raw reads:
{output prefix}_AAG.csv # AAG result in the format described earlier
{output prefix}_knot # knot working directory
├── contigs.fasta # symbolic link to the contig sequences provided as input
├── contigs_filtred.fasta # contigs kept in the analysis after filtering on length
├── contigs_filtred.gfa # corresponding graph generated by fpa (from contigs_filtred.paf, no containment, no internal match)
├── contigs_filtred.paf # corresponding paf file made using minimap2
├── contigs_graph.gfa # symbolic link to the contig graph provided as input
├── ext_search.csv # reads associated with each contig extremity
├── raw_reads.fasta # symbolic link to the raw reads provided as input
├── raw_reads.paf # self-mapping of raw_reads
├── raw_reads_splited.fasta # raw reads with the uncovered regions reported by yacrd removed
├── raw_reads_splited.gfa # overlap graph generated by fpa from the raw_reads_splited self-mapping
├── raw_reads_splited.paf # self-mapping of raw_reads_splited
├── raw_reads.yacrd # yacrd output on raw_reads
└── read2asm.paf # mapping of the reads on contigs_filtred
If you run knot with corrected reads:
{output prefix}_AAG.csv # AAG result in the format described earlier
{output prefix}_knot # knot working directory
├── contigs.fasta # symbolic link to the contig sequences provided as input
├── contigs_filtred.fasta # contigs for which we have found overlaps between them
├── contigs_filtred.gfa # corresponding graph generated by fpa (from contigs_filtred.paf)
├── contigs_filtred.paf # corresponding paf file made using minimap2
├── contigs_graph.gfa # symbolic link to the contig graph provided as input
├── ext_search.csv # reads associated with each contig extremity
├── raw_reads_splited.fasta # symbolic link to the corrected reads provided as input
├── raw_reads_splited.gfa # overlap graph generated by fpa from the raw_reads_splited self-mapping
├── raw_reads_splited.paf # self-mapping of raw_reads_splited
└── read2asm.paf # mapping of the reads on contigs_filtred
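The listings above can be turned into a quick completeness check after a run. This Python sketch only encodes the file names shown in the two listings; the function names are hypothetical, and treating contigs_graph.gfa as present only when a contig graph was given with -g is an assumption.

```python
from pathlib import Path

# File names copied from the listings above.
COMMON = [
    "contigs.fasta", "contigs_filtred.fasta", "contigs_filtred.gfa",
    "contigs_filtred.paf", "ext_search.csv", "raw_reads_splited.fasta",
    "raw_reads_splited.gfa", "raw_reads_splited.paf", "read2asm.paf",
]
# Files listed only for runs on raw reads.
RAW_ONLY = ["raw_reads.fasta", "raw_reads.paf", "raw_reads.yacrd"]


def expected_outputs(prefix, corrected=False, with_graph=False):
    """List the paths a finished run is expected to contain."""
    workdir = Path(f"{prefix}_knot")
    names = list(COMMON)
    if not corrected:
        names += RAW_ONLY
    if with_graph:
        # Assumption: only present when a contig graph was provided.
        names.append("contigs_graph.gfa")
    return [Path(f"{prefix}_AAG.csv")] + [workdir / name for name in names]


def missing_outputs(prefix, corrected=False, with_graph=False):
    """Return the expected paths that do not exist on disk."""
    return [path for path in expected_outputs(prefix, corrected, with_graph)
            if not path.exists()]


for path in missing_outputs("my_run"):
    print("missing:", path)
```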
If you use knot in your research, please cite the following publication:
Pierre Marijon, Rayan Chikhi, Jean-Stéphane Varré, Graph analysis of fragmented long-read bacterial genome assemblies, Bioinformatics, btz219, https://doi.org/10.1093/bioinformatics/btz219
@article{Marijon2019,
doi = {10.1093/bioinformatics/btz219},
url = {https://doi.org/10.1093/bioinformatics/btz219},
year = {2019},
month = {mar},
publisher = {Oxford University Press ({OUP})},
author = {Pierre Marijon and Rayan Chikhi and Jean-St{\'{e}}phane Varr{\'{e}}},
editor = {John Hancock},
title = {Graph analysis of fragmented long-read bacterial genome assemblies},
journal = {Bioinformatics}
}